Skip to content

hscontrol: structured config validation and typed listener errors#3236

Open
kradalby wants to merge 15 commits into
juanfont:mainfrom
kradalby:kradalby/3227-letsencrypt-port-80-conflict
Open

hscontrol: structured config validation and typed listener errors#3236
kradalby wants to merge 15 commits into
juanfont:mainfrom
kradalby:kradalby/3227-letsencrypt-port-80-conflict

Conversation

@kradalby

@kradalby kradalby commented Apr 30, 2026

Copy link
Copy Markdown
Collaborator

listen_addr: 0.0.0.0:80 + tls_letsencrypt_hostname collide internally — main HTTP and autocert HTTP-01 both want port 80. Second bind loses with address already in use and the ACME goroutine log.Fatals. No hint that the conflict is internal.

Now rejected at config load:

Fatal config error: listen_addr and tls_letsencrypt_listen would bind the same TCP socket
  current: listen_addr: "0.0.0.0:80"
  conflicts with: tls_letsencrypt_listen: ":http"
  hint: give each listener a distinct port, or bind them to different non-wildcard hosts
  see: https://headscale.net/stable/ref/tls/

Real runtime bind failures get a typed *ListenerBindError plus a CLI hint:

binding main HTTP listener (listen_addr="0.0.0.0:443"): listen tcp 0.0.0.0:443: bind: address already in use

Hint: another process on this host is bound to the same address. Find it with: sudo ss -tlnp 'sport = :443'

errors.Is / As / Join work through the chain. Validation reports every violation in one pass. ACME listener moved into the main errgroup; orphan log.Fatal is gone.

Fixes #3227

Generated with the help of an AI assistant

@kradalby kradalby force-pushed the kradalby/3227-letsencrypt-port-80-conflict branch from f9d63c4 to adeee64 Compare April 30, 2026 11:53
@kradalby kradalby added this to the v0.29.0 milestone Apr 30, 2026
@kradalby kradalby requested a review from Copilot May 4, 2026 08:31
Comment thread CHANGELOG.md Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Headscale’s operator experience around configuration and listener binding by introducing structured, multi-error config validation and typed listener bind errors, including detection of internal listener collisions (notably listen_addr vs ACME HTTP-01).

Changes:

  • Add structured config error reporting (ConfigError + configValidator) that aggregates multiple validation failures using errors.Join.
  • Detect and reject TCP listener collisions at config load time (including ACME HTTP-01 listener conflicts) and document the common port-80 pitfall.
  • Introduce *ListenerBindError for runtime bind failures and add CLI hints for EADDRINUSE / EACCES; move ACME HTTP-01 serving into the main errgroup and shutdown path.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
hscontrol/types/listener.go Adds ListenerBindError, port parsing helpers, and listener-collision validation.
hscontrol/types/listener_test.go Unit tests for port parsing, overlap logic, wildcard detection, and ListenerBindError.
hscontrol/types/config.go Refactors validation into an aggregator, adds listener-collision validation, and converts several log.Fatal validations into ConfigErrors.
hscontrol/types/config_test.go Updates expectations for new structured validation output and adds coverage for collision detection and multi-error aggregation.
hscontrol/types/config_errors.go Introduces structured ConfigError, ErrConfig sentinel, validator aggregation, and tree-walking helpers.
hscontrol/types/config_errors_test.go Tests rendering and errors.Is/As behavior for ConfigError and joined validation errors.
hscontrol/app.go Uses typed bind errors, replaces ACME HTTP-01 goroutine+log.Fatal with errgroup-managed server + shutdown integration, and refactors TLS settings return type.
cmd/headscale/cli/serve.go Adds serve-error classification to append actionable hints for bind failures.
cmd/headscale/cli/serve_test.go Tests serve error classification and hint rendering.
docs/ref/tls.md Documents the listen_addr vs ACME HTTP-01 port-80 collision and the new startup validation behavior.
config-example.yaml Adds guidance around ACME HTTP-01 listener binding and collision avoidance.
CHANGELOG.md Documents the new config validation behavior and typed listener bind errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hscontrol/types/listener.go Outdated
Comment thread cmd/headscale/cli/serve.go Outdated
Comment thread hscontrol/types/config.go
@kradalby kradalby force-pushed the kradalby/3227-letsencrypt-port-80-conflict branch 2 times, most recently from 8e39300 to af6bed9 Compare May 4, 2026 09:23
@kradalby kradalby marked this pull request as ready for review May 4, 2026 09:25
kradalby added 9 commits May 9, 2026 14:08
Foundation for restructuring config validation feedback. Typed error
with errors.Is/As/Join hooks, structured fields rendered as a
multi-line operator-facing block (current, conflicts with, allowed,
minimum, maximum, why, hint, see), and a configValidator that joins
violations via errors.Join. ConfigErrors walks the tree to collect
every *ConfigError, and Cause keeps existing sentinel identities
reachable through errors.Is.

Updates juanfont#3227
portFromAddr resolves numeric and named ports (":http", ":https")
without touching /etc/services or the resolver. listenersOverlap
follows kernel rules: same port plus a wildcard host on either side,
or same port plus identical specific host, both count as collision;
different specific hosts on the same port do not. Set an explicit
viper default for tls_letsencrypt_listen so a minimal config still
resolves to ":http".

Updates juanfont#3227
Reject configurations where two configured TCP listeners would bind
the same kernel socket. Covers every pair of listen_addr,
grpc_listen_addr, metrics_listen_addr, and tls_letsencrypt_listen
(when HTTP-01 ACME is in use). Each violation renders as a structured
ConfigError naming both YAML keys, the values the operator wrote, and
a hint pointing at the canonical setup.

Closes the symptom in juanfont#3227: the misconfiguration is now rejected at
config-load time with a self-explaining error, instead of failing at
runtime with a misleading "address already in use" log because the
ACME challenge listener and the main HTTP listener competed for the
same port inside the same process.

Fixes juanfont#3227
Convert the errorText accumulator inside validateServerConfig to the
typed configValidator pattern. Each rule violation now renders as a
multi-line block naming the YAML key, the value the operator wrote,
and a hint pointing at the resolution. The dns.extra_records mutex
that previously crashed via log.Fatal is folded into the same
collector so an operator sees every problem in one go instead of
fixing them one startup attempt at a time.

Test wantErr assertions on validation output switch to substring
match because the rendered errors are now multi-line.

Updates juanfont#3227
derpConfig, databaseConfig, and dnsToTailcfgDNS used log.Fatal to
reject invalid combinations the moment they were observed. Lift those
checks into modular validators (validateDERPConfig,
validateDatabaseConfig, validateMagicDNSConfig) called from
validateServerConfig so each violation lands in the same configValidator
collector and renders as a structured ConfigError next to all other
config feedback. The sub-builder functions trust validation has run
and no longer crash the process.

Updates juanfont#3227
ListenerBindError wraps a TCP listener bind failure with the listener
name and the YAML key that drove its address. Preserves the underlying
*net.OpError via Unwrap so errors.Is(err, syscall.EADDRINUSE) keeps
working through any number of fmt.Errorf("%w") wraps. Lets the
top-level CLI classifier match listener failures with errors.As
instead of string matching.

Updates juanfont#3227
Replace the three "binding to TCP address" wraps with typed
ListenerBindError values that carry the listener role, the YAML key
that drove the address, and the resolved address. The wrapped error
chain still walks to syscall.EADDRINUSE / EACCES so existing callers
that match those sentinels keep working; the difference is that the
operator now sees which listener failed instead of an unattributed
"binding to TCP address" line.

Updates juanfont#3227
The HTTP-01 challenge listener was launched in an orphan goroutine that
called log.Fatal on bind failure, which bypassed the signal-handler
shutdown path and made bind errors look identical to the main HTTP
listener.

Bind eagerly via net.ListenConfig in getTLSSettings, return the listener
+ http.Server in a tlsBundle, register the Serve loop with the existing
errgroup, and call Shutdown alongside the other servers. Bind failures
now surface as ListenerBindError(Listener:"ACME HTTP-01 challenge")
through the normal error chain.

Updates juanfont#3227
A *ListenerBindError that wraps syscall.EADDRINUSE now ends with a
"sudo ss -tlnp 'sport = :PORT'" pointer, and one wrapping
syscall.EACCES with a CAP_NET_BIND_SERVICE / setcap pointer. The
underlying chain is preserved via fmt.Errorf("%w"), so errors.Is /
errors.As continue to walk to the typed bind error and the syscall
errno.

Drop the "headscale ran into an error and had to shut down" wrap,
which only restated the symptom.

Export types.PortFromAddr so the classifier can render the port
number in the hint.

Updates juanfont#3227
@kradalby kradalby force-pushed the kradalby/3227-letsencrypt-port-80-conflict branch from af6bed9 to 6b258ee Compare May 9, 2026 14:09
Comment thread config-example.yaml Outdated
#
# For production:
# listen_addr: 0.0.0.0:8080
#

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this section? The config file is already quite long and headscale checks and refuses to start if it detects that misconfiguration?

Comment thread config-example.yaml Outdated
# verification endpoint, and it will be listening on:
# :http = port 80
#
# This must not collide with listen_addr, grpc_listen_addr, or

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I think it should be removed altogether.

@nblock nblock modified the milestones: v0.29.0, v0.30.0 May 10, 2026
@kradalby kradalby force-pushed the kradalby/3227-letsencrypt-port-80-conflict branch from 66606d5 to b87e77d Compare May 11, 2026 13:15
@kradalby kradalby requested a review from nblock May 11, 2026 13:15
kradalby added 2 commits May 11, 2026 13:20
Add a callout under the HTTP-01 docs so operators see the
constraint before they hit the validation error at startup.

Updates juanfont#3227
kradalby added 4 commits May 11, 2026 13:20
deprecator.Log() called log.Fatal directly, so a single deprecated
key killed the process before any other validation rule could
report. Operators fixing one issue at a time.

Replace Log() with Apply(*configValidator). Warns still go to
log.Warn; fatals are pushed onto the validator as *ConfigError so
they merge with the rest of the config-load report. The free-form
strings collected via set.Set are gone; deprecator now stores
typed deprecation{OldKey, NewKey} records and Apply renders them
into the structured ConfigError shape.

Move the early-return PKCE check into a new validatePKCEConfig
modular validator for the same reason.

Updates juanfont#3227
Sub-builders called after validateServerConfig (prefixV4, prefixV6,
allocation strategy, dns, oidc client secret/path, isSafeServerURL)
each returned the first error. So an operator that fixed one
issue saw the next one only on the next startup.

Lift one *configValidator across the entire LoadServerConfig flow.
validateServerConfigInto(v) populates it; each sub-builder's error
is wrapped in a structured *ConfigError (with the original sentinel
kept on the Cause field, so errors.Is against errOidcMutuallyExclusive,
errServerURLSame/Suffix, ErrNoPrefixConfigured, and
ErrInvalidAllocationStrategy keeps working). v.Err() is checked once,
right before the &Config{} construction.

TestReadConfig/base-domain-in-server-url-err matched the old sentinel
wording; flipped to the new structured Reason. Added
TestLoadServerConfig_CollectsAcrossSubBuilders to lock the wiring:
four sub-builder failures from a single config produce four
*ConfigErrors in the report.

Updates juanfont#3227
Upstream tailscale.com/go.mod requires Go 1.26.3; the pinned
golang:1.26.2-alpine base refuses to fetch a newer toolchain
(GOTOOLCHAIN=local) and exits with:

    go: go.mod requires go >= 1.26.3 (running go 1.26.2; GOTOOLCHAIN=local)

This blocks every PR's integration build. Bump to 1.26.3-alpine.
Same upstream go.mod >= 1.26.3 requirement broke the DERPer
container build (TestDERPVerifyEndpoint and friends), in addition
to the tailscale-HEAD container fixed in 513f629.
@kradalby kradalby force-pushed the kradalby/3227-letsencrypt-port-80-conflict branch from b87e77d to 211d864 Compare May 11, 2026 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Fresh Debian 13 VPS; headscale can't bind to port 80. But nginx can (temporarily)!

3 participants