Performance optimizations to handle thousands of clients #3156
Problem: During bulk node registration (e.g., 500 nodes joining with unique preauth keys), SetUsers is called for every new node with the same user list. Each call triggers updateLocked(), which recompiles all ACL filter rules, an O(rules × nodes) operation. With a 14.6K-rule policy and 500 nodes, this dominated CPU at 32% (61.8s of 192.7s total samples), causing registration throughput to drop from 11/s to 3/s as the node count grew.
Fix: Add a deephash-based short-circuit to SetUsers. Before triggering updateLocked(), hash the incoming user list and compare it against the previously stored hash. If the users haven't changed (the common case during registration, where the user list is stable), return immediately without recompiling filters.
Impact: Registration of 500 nodes with unique preauth keys and a 145K-line brutal ACL policy improved from 139s (3.6 nodes/s) to 53s (9.4 nodes/s), a 2.6x speedup. The 32% CPU from compileFilterRules in the SetUsers path drops to ~0%.
How: Added a usersHash field (deephash.Sum) to PolicyManager. In SetUsers, compute deephash.Hash(&users) and compare with the stored hash before proceeding. This is safe because the hash captures the full user list state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
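The short-circuit described in this commit message can be pictured with a minimal sketch. It is not the PR's code: the `User` and `PolicyManager` types are reduced, hypothetical stand-ins, locking is omitted, a `recompiles` counter stands in for `updateLocked()`, and JSON plus SHA-256 from the standard library is swapped in for deephash so the example has no external dependencies.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
)

// User is a hypothetical stand-in for the real user type.
type User struct {
	ID   uint64
	Name string
}

// PolicyManager is a reduced sketch holding only what the example needs.
type PolicyManager struct {
	users      []User
	usersHash  [sha256.Size]byte
	hashValid  bool
	recompiles int // counts expensive recompilations, for illustration
}

// hashUsers returns a digest of the user list. The PR uses deephash;
// JSON + SHA-256 is an assumption made to stay within the standard library.
func hashUsers(users []User) ([sha256.Size]byte, error) {
	b, err := json.Marshal(users)
	if err != nil {
		return [sha256.Size]byte{}, err
	}
	return sha256.Sum256(b), nil
}

// SetUsers reports whether the user list changed and a recompilation ran.
func (pm *PolicyManager) SetUsers(users []User) (bool, error) {
	h, err := hashUsers(users)
	if err != nil {
		return false, err
	}
	if pm.hashValid && h == pm.usersHash {
		return false, nil // unchanged list: skip the expensive path
	}
	pm.users = users
	pm.usersHash = h
	pm.hashValid = true
	pm.recompiles++ // stands in for updateLocked()
	return true, nil
}
```

Calling `SetUsers` repeatedly with an identical list increments `recompiles` only once. The hash has to cover every field that can influence policy evaluation, otherwise a real change would be skipped.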
Problem: bcrypt.CompareHashAndPassword is called on every preauth key validation during node registration. Each call costs ~75ms of CPU time. When using a single reusable preauth key with many concurrent registrations (e.g., 50 parallel), bcrypt becomes a thundering herd where 50 goroutines all perform the same expensive computation simultaneously. Even with unique keys, bcrypt consumes 42% of total CPU (38.75s of 91.7s samples).
Fix: Add a sync.Map-based singleflight cache keyed by "prefix:sha256(bcryptHash)". Each cache entry uses sync.Once so that only the first goroutine performs the actual bcrypt comparison; all concurrent goroutines for the same key block and reuse the result.
Impact: For reusable keys (single key shared by many nodes), this reduces bcrypt from O(N) computations to O(1): a single bcrypt call regardless of how many nodes register concurrently. For unique keys, each key still requires one bcrypt call (cache entries are used once), but the singleflight prevents duplicate work if the same key is validated concurrently by multiple code paths.
How: Added bcryptCacheEntry struct with sync.Once + error, stored in a package-level sync.Map. In findAuthKey, LoadOrStore the cache entry and call once.Do with the bcrypt comparison. The cache key includes a SHA-256 of the stored bcrypt hash to ensure correctness if the hash is rotated.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
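A rough sketch of the sync.Map plus sync.Once pattern this message describes follows. The names `cacheEntry`, `verifyCache`, `cacheKey` and `verifyOnce` are hypothetical, and the expensive comparison is passed in as a function so the example needs no bcrypt dependency.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// cacheEntry holds the result of one expensive comparison.
type cacheEntry struct {
	once sync.Once
	err  error
}

// verifyCache maps a cache key (string) to its *cacheEntry.
var verifyCache sync.Map

// cacheKey builds "prefix:sha256(storedHash)", so a rotated hash gets a new entry.
func cacheKey(prefix string, storedHash []byte) string {
	sum := sha256.Sum256(storedHash)
	return prefix + ":" + hex.EncodeToString(sum[:])
}

// verifyOnce runs compare at most once per key. Concurrent callers for the
// same key block inside once.Do and then share the stored result.
func verifyOnce(prefix string, storedHash []byte, compare func() error) error {
	v, _ := verifyCache.LoadOrStore(cacheKey(prefix, storedHash), &cacheEntry{})
	e := v.(*cacheEntry)
	e.once.Do(func() { e.err = compare() })
	return e.err
}
```

As in the description, the key is derived from the prefix and the stored hash only. The cached result in this sketch therefore does not depend on the secret the caller presents; whether the presented secret must also enter the key is something the text above does not settle.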
…, and reachability tests
… eliminate GC overhead
Add a wake channel that signals the doWork loop to process pending changes immediately instead of waiting for the next tick interval. This ensures node additions and policy changes are delivered to connected clients without the full batch delay, preventing stale peer lists in fast registration scenarios. Also include UserProfiles in policyChangeResponse so that newly visible peers have displayable identity information in the netmap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
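The wake-channel idea can be sketched as below. `Batcher`, `Add`, `Run` and `deliverOne` are hypothetical names for a reduced model, not the mapper/batcher code itself; the point is the buffered channel of capacity 1 and the non-blocking send.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// Batcher is a hypothetical, reduced stand-in for the real batcher.
type Batcher struct {
	mu      sync.Mutex
	pending []string
	wake    chan struct{} // capacity 1: many signals coalesce into one wake-up
}

func NewBatcher() *Batcher {
	return &Batcher{wake: make(chan struct{}, 1)}
}

// Add queues a change and nudges the worker without ever blocking.
func (b *Batcher) Add(change string) {
	b.mu.Lock()
	b.pending = append(b.pending, change)
	b.mu.Unlock()
	select {
	case b.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// flush hands all pending changes to out as one batch.
func (b *Batcher) flush(out chan<- []string) {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) > 0 {
		out <- batch
	}
}

// Run flushes on every tick and also as soon as a wake signal arrives.
func (b *Batcher) Run(ctx context.Context, tick time.Duration, out chan<- []string) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			b.flush(out)
		case <-b.wake:
			b.flush(out)
		}
	}
}

// deliverOne shows a change arriving long before the hour-long tick.
func deliverOne(change string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	b := NewBatcher()
	out := make(chan []string, 1)
	go b.Run(ctx, time.Hour, out)
	b.Add(change)
	return <-out
}
```

In `deliverOne` the ticker interval is an hour, so the batch can only arrive via the wake signal. The ticker stays in place as a fallback, which keeps the existing batching behaviour for bursts.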
These performance improvements are quite impressive. Can you create a benchmark so that others and I can verify your claims with our own ACLs?
Of course. Do you want me to use my own ACLs, or do you have your own that I can use?
Just want to put in that I am not actively ignoring this, and it probably would be helpful in the future. We are mostly spending time on the planned work and trying to not get distracted. The other aspect is that we need to have sufficient test coverage to be comfortable changing some of this stuff. A lot of tests have been going in recently and we might be getting closer to something like this, but no promises.
I take no offense at you ignoring or even closing this and copying parts of it piecemeal. This is mostly AI generated, with some optimisations I thought up (and a lot of AI tokens). I highly appreciate your work on Headscale and I understand you are busy. Thank you for responding!
@miyoyo I'm not comfortable sharing the ACL publicly as I'm using it for work within a highly competitive market. If you're interested in testing out your code and benchmarks against my ACL, I'd be happy to send it to you via a private message. Shoot me an email if you're interested: samuel.batista@usercentrics.com
…group:self
When a policy uses autogroup:self (which expands differently per user), ComputeNodePeers and BuildPeerMap previously recompiled matchers from filter rules on every peer-pair check, then iterated ALL matchers to find source matches: O(N² × M), where M is the matcher count.
Add three caches that persist across ComputeNodePeers calls within a filter compilation cycle:
1. perNodeMatcherCache: caches []matcher.Match per node ID, avoiding repeated MatchesFromFilterRules allocations and GC pressure
2. perNodeSrcIdxCache: caches source-matcher indices per (srcNode, matcherOwner) pair, reducing CanAccess from O(M) to O(relevant) where relevant << M for large rule sets
3. getNodeMatchers/getPerNodeSrcIndices helper methods that lazily populate both caches
Both BuildPeerMap and ComputeNodePeers now use canAccessIndexed with per-node source indices instead of the full CanAccess scan, matching the optimization already used in the non-autogroup:self path.
All caches are invalidated on filter recompilation, policy updates, and selective autogroup:self cache invalidation (user-scoped).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
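To make the source-index idea concrete, here is a simplified sketch. It assumes a single shared matcher list keyed by source node ID, whereas the commit keys its cache by a (srcNode, matcherOwner) pair; `Match`, `Node`, `peerChecker` and the helper constructors are hypothetical types written only for this example.

```go
package main

import "net/netip"

// Match is a simplified matcher: traffic from Srcs may reach Dsts.
type Match struct {
	Srcs []netip.Prefix
	Dsts []netip.Prefix
}

// Node is a hypothetical, reduced node with an ID and its addresses.
type Node struct {
	ID  uint64
	IPs []netip.Addr
}

// prefixes parses CIDR strings and panics on invalid input (example helper).
func prefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// newNode builds a Node from address strings (example helper).
func newNode(id uint64, ips ...string) Node {
	n := Node{ID: id}
	for _, ip := range ips {
		n.IPs = append(n.IPs, netip.MustParseAddr(ip))
	}
	return n
}

func containsAny(ps []netip.Prefix, ips []netip.Addr) bool {
	for _, p := range ps {
		for _, ip := range ips {
			if p.Contains(ip) {
				return true
			}
		}
	}
	return false
}

// peerChecker caches, per source node, which matchers list it as a source.
type peerChecker struct {
	matchers []Match
	srcIdx   map[uint64][]int
}

func newPeerChecker(ms []Match) *peerChecker {
	return &peerChecker{matchers: ms, srcIdx: make(map[uint64][]int)}
}

// srcIndices scans all matchers once per node, then answers from the cache.
func (c *peerChecker) srcIndices(n Node) []int {
	if idx, ok := c.srcIdx[n.ID]; ok {
		return idx
	}
	idx := []int{}
	for i, m := range c.matchers {
		if containsAny(m.Srcs, n.IPs) {
			idx = append(idx, i)
		}
	}
	c.srcIdx[n.ID] = idx
	return idx
}

// canAccessIndexed checks only the matchers relevant to src.
func (c *peerChecker) canAccessIndexed(src, dst Node) bool {
	for _, i := range c.srcIndices(src) {
		if containsAny(c.matchers[i].Dsts, dst.IPs) {
			return true
		}
	}
	return false
}
```

The first call for a source node pays the full O(M) scan; every later peer check for that node touches only the matchers that can possibly apply. The `srcIdx` map must be dropped whenever the matchers or a node's addresses change.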
…ations
When using autogroup:self policies, compileFilterRulesForNodeLocked calls compileFilterRulesForNode which resolves every Username, Group, and Tag in the ACL rules. Previously, the resolve cache and node IP indexes were only enabled during ensureFilterCompiled (global filter compilation) and torn down immediately after via defer. Per-node compilations ran without any cache, causing O(users × rules) string matching on every call.
With 100 nodes and 5000 rules, this meant ~500K resolveUser calls doing linear scans, 39.5% of total CPU in the profile.
Fix: persist the resolve cache and node IP indexes across the filter compilation cycle for autogroup:self policies. Add ensureResolveCacheForCompilation which lazily initializes the cache on first per-node compilation, and clearResolveCache which tears it down on invalidation events (user changes, node identity changes, policy updates).
Results (100 nodes, 5000 ACL rules):
commit-12: 9.3s registration, resolveUser at 39.5% CPU
commit-13: 6.0s registration (cached matchers)
commit-14: 0.9s registration, ComputeNodePeers not in top 40
10x improvement over commit-12. At 500 nodes: 38.6s with 500/500 success, CPU profile now dominated by bcrypt (89.9%) not peer computation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pilation
MatchFromStrings calls util.ParseIPSet for every source and destination IP string in every filter rule. With 5000 rules × 950 nodes, the same IP strings (node addresses like "100.64.0.1/32") are parsed millions of times, each building an IPSetBuilder, normalizing ranges, and allocating.
Add a package-level ipSetCache that deduplicates ParseIPSet calls within a compilation cycle. The cache is reset via ResetIPSetCache() whenever filters are invalidated (node/user/policy changes).
Results (950 fake + 50 real clients, 5000 ACL rules):
commit-14: 174.9s registration, MatchFromStrings at 22.1% CPU
commit-15: 124.2s registration, MatchFromStrings at 12.8% CPU
29% faster registration, 42% reduction in matcher compilation CPU
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
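A package-level parse cache with an explicit reset might look roughly like this. The sketch substitutes the standard library's `netip.ParsePrefix` for `util.ParseIPSet`, and `parsePrefixCached`, `resetPrefixCache` and the `parseCalls` counter are hypothetical names used for illustration.

```go
package main

import (
	"net/netip"
	"sync"
)

var (
	prefixCacheMu sync.Mutex
	prefixCache   = map[string]netip.Prefix{}
	parseCalls    int // counts real parses, for illustration only
)

// parsePrefixCached parses s once per cache lifetime and reuses the result.
func parsePrefixCached(s string) (netip.Prefix, error) {
	prefixCacheMu.Lock()
	defer prefixCacheMu.Unlock()
	if p, ok := prefixCache[s]; ok {
		return p, nil
	}
	parseCalls++
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, err // errors are not cached in this sketch
	}
	prefixCache[s] = p
	return p, nil
}

// resetPrefixCache drops all entries; call it whenever filters are invalidated.
func resetPrefixCache() {
	prefixCacheMu.Lock()
	defer prefixCacheMu.Unlock()
	prefixCache = map[string]netip.Prefix{}
}
```

Parsing the same string a thousand times performs one real parse until the next reset. A single mutex keeps the sketch short; under heavy concurrent compilation a `sync.RWMutex` or `sync.Map` may be a better fit.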
When usesAutogroupSelf is true, compileFilterRulesForNode previously compiled ALL ACL rules for each node, iterating 5000 rules to find the ~2 that reference autogroup:self. The other 4998 rules produce identical output regardless of which node is being compiled.
Split the compilation: compile non-autogroup:self rules once into globalRulesForNode (cached on the Policy struct), then only compile the autogroup:self ACLs per node. The per-node rules are combined with the cached global rules before merging.
Results (950 fake + 50 real clients, 5000 ACL rules):
commit-15: 124.2s registration, compileACLWithAutogroupSelf at 39.7%
commit-16: 48.8s registration, compileACLWithAutogroupSelf gone from top
2.5x faster registration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
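The split can be illustrated with a toy compiler. Everything here is an assumption made for the example: `Rule`, `Compiler` and `rulesForNode` are hypothetical, rules are plain strings, "compilation" is a string substitution, and the sketch simply appends per-node rules after the cached global ones without modelling the real merge step.

```go
package main

import "strings"

// Rule is a toy ACL rule with a source and a destination expression.
type Rule struct {
	Src, Dst string
}

const autogroupSelf = "autogroup:self"

// usesSelf reports whether a rule's output depends on the node's user.
func usesSelf(r Rule) bool {
	return strings.Contains(r.Src, autogroupSelf) || strings.Contains(r.Dst, autogroupSelf)
}

// Compiler caches the compiled form of all node-independent rules.
type Compiler struct {
	rules          []Rule
	globalCompiled []string
	globalReady    bool
	compiles       int // counts per-rule compilations, for illustration
}

// compile stands in for the expensive per-rule work.
func (c *Compiler) compile(r Rule, user string) string {
	c.compiles++
	repl := strings.NewReplacer(autogroupSelf, "user:"+user)
	return repl.Replace(r.Src) + " -> " + repl.Replace(r.Dst)
}

// rulesForNode compiles global rules once and autogroup:self rules per call.
func (c *Compiler) rulesForNode(user string) []string {
	if !c.globalReady {
		for _, r := range c.rules {
			if !usesSelf(r) {
				c.globalCompiled = append(c.globalCompiled, c.compile(r, ""))
			}
		}
		c.globalReady = true
	}
	out := append([]string(nil), c.globalCompiled...) // copy, so callers cannot mutate the cache
	for _, r := range c.rules {
		if usesSelf(r) {
			out = append(out, c.compile(r, user))
		}
	}
	return out
}
```

With two global rules and one autogroup:self rule, the first node costs three compilations and each further node costs one. A real implementation also has to reset `globalReady` when the policy, users, or nodes change.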
… for autogroup:self
MatchesFromFilterRules was converting all ~5000 compiled filter rules into []matcher.Match for every node, even though the global rules produce identical matchers. With 950 nodes this meant building the same 5000 matchers 950 times (parsing IP strings, building IPSets, sorting, normalizing), all redundant.
Split getNodeMatchers into two phases:
1. globalMatcherCache: built once from non-autogroup:self rules (5000), shared by all nodes
2. Per-node matchers: built from only the ~2 autogroup:self ACLs
Add compileAutogroupSelfRulesForNode to filter.go for the per-node compilation, separate from the global rules path.
Results (950 fake + 50 real clients, 5000 ACL rules):
commit-16: 48.8s registration, MatchesFromFilterRules at 49%
commit-17: 3.4s registration, ComputeNodePeers gone from profile
14x faster than commit-16, 51x faster than commit-12
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
I'd rather avoid having to hold sensitive data on my end. I spent most of my remaining Claude tokens on generating a nice way to do performance and correctness comparisons between my patches, which also enabled me to discover some new patches that increase performance. My tests were done on an Ultra 7 165H with 32GB of RAM, for reference.
You can find the harness in the following repository: https://github.com/miyoyo/headscale-3156-harness
Below is an extension of the commit explanations that were in the first message in this conversation. The major difference is that the fake client used now holds a connection open, thus increasing the load, and my previous tests did not consider autogroup:self's performance.
I'm thinking it could be useful to try to condense all the commits down into a smaller patch; it's possible some of the steps taken by Claude were unnecessary or negative. I'll see what I can remove by tomorrow. I'm also still updating it to verify correctness; it looks like the little goblin took some shortcuts in the harness.
---Anything below this point is AI generated---
Highly Concurrent Test: 2000 Nodes × 15000 ACL Rules × Concurrency 2000
All 2000 fake clients connect simultaneously.
Configuration
Commits 13–17: autogroup:self at scale
These commits address a second bottleneck that only surfaces when a policy uses
13. 1 file, +105/−30
Files:
14. 1 file, +50/−5
The resolve cache added in commit 6 was only active during
Files:
15. 2 files, +28/−9
Files:
16. 3 files, +126/−3
Files:
17. 2 files, +85/−8
Files:
Total line count
+1301 / -127 = 1174 net lines of code.
Honestly, this PR is great help for me. I had already hit the max device count (8core 16G) of my server, and I was looking for a way to scale horizontally or find optimizations I could apply to allow more devices to be registered (already on Postgres). Is there any way I can help you? (background: python) |
Glad it's helped! The conclusion I keep reaching, both when trying manually and from the many background research agents I have running, is that a lot of the effort comes from the architecture of the rule parsing being node-based (what can this node reach?) rather than edge-based (given this rule, which nodes are affected?). I haven't tried that angle just yet (sorry about the timing, life gets in the way), and there seem to be many edge cases that my comparison tests keep catching (I run headscale master and do a bunch of things, record all of the server's input, output and state, then compare to my modifications). It's a surprisingly complicated problem, but it's solvable, I'm sure of it.
As for what you can do for this PR, I'm not quite sure myself. Go is a fairly straightforward language to learn, so if you want to pick one of the issues and have a go [ ;) ] at it, I'm sure the maintainers will be glad. Do try to keep it to minimal changes and write extensive tests while doing it.
Just wanted to +1 this, the PR is working fantastically on our 1400-node headscale.
I am part of a CTF organization team that used Headscale and Tailscale to provide a VPN to thousands of participants.
As it is, the current version of Headscale cannot handle it, primarily due to the O(N^2) reachability computation of each node to each node.
I started by replacing the reachability system with a category system, using permission buckets instead of reachability computation. While this was way faster, it also changed over 2000 lines of code, and ripped out a good chunk of existing code.
It has been tested in production and did not seem to have any issues with reachability in any way.
I grabbed my existing changes, pprof, a fresh copy of the repo, and got Claude to gradually, slowly, improve the performance of Headscale until it would be sufficient for our purposes, without changing too much of the existing code.
What is in this PR is the result of this gradual improvement. It is mostly AI generated, and likely contains things that you would not want copied in. See this PR as more of an idea of what could be changed, then.
Anything below this point was AI generated.
perf: registration and policy evaluation optimizations
Context
This work was done overnight with Claude (Anthropic's AI assistant) running iterative profiling, optimization, and correctness verification cycles against a headscale instance under realistic load.
The goal was to make node registration fast at scale and eliminate idle CPU waste. The benchmark used a brutal ACL policy (500+ tags, 200+ groups, 14K+ filter rules) with thousands of concurrent nodes to surface real bottlenecks.
Results
Correctness was verified at every step:
go test ./hscontrol/... passes at every commit individually
Commits (incremental, each builds and passes tests independently)
1. b7ffe48b hscontrol/policy: defer filter compilation in SetNodes
4 files, +182/-41
Moves compileFilterRules out of the SetNodes hot path. Instead of recompiling the full filter on every node addition, SetNodes marks the filter dirty and compilation happens lazily on the next Filter() or FilterForNode() call. This eliminates redundant recompilation when nodes are added in rapid succession (e.g., batch registration).
Files: policy.go (filterDirty flag, lazy ensureFilterCompiled), pm.go (interface update), state.go (caller update), policy_test.go (new tests for deferred compilation)
2. fcd7a6d7 hscontrol/state,policy: add incremental peer map computation for new node additions
6 files, +738/-20
Instead of rebuilding the entire O(N^2) peer map when a new node registers, this adds incremental peer map updates that only compute peers for newly added nodes (O(K*N) where K = new nodes). Includes RefreshPeersForNodes, which fixes a correctness bug where PutNode ran before SetNodes, causing stale policy data in peer computation. Also adds HasPolicyChange detection for SubnetRoutes and IsExitNode changes.
Files: node_store.go (incrementalSnapshot, refreshNodePeers, RefreshPeersForNodes), policy.go (SetNodes returns newNodeIDs, HasPolicyChange route detection), pm.go, state.go, policy_test.go, node_store_test.go
3. 5370bdf9 hscontrol/policy: skip O(N) node iteration in Host.Resolve for non-CGNAT prefixes
2 files, +77/-12
Host.Resolve was iterating all N nodes for every host entry to check for CGNAT overlap. Since most ACL hosts entries reference external IPs (not CGNAT 100.64.0.0/10 or ULA fd7a:115c:a1e0::/48), this adds a fast-path prefixOverlapsCGNAT check that skips the node scan entirely for non-overlapping prefixes.
Files: types.go (prefixOverlapsCGNAT, Resolve fast path), policy_test.go (TestHostResolveCGNATSkip)
4. 247dbd1b hscontrol/policy: add source matcher index cache for O(relevant) CanAccess checks
2 files, +147/-2
CanAccess was iterating all matchers for every source node. This adds getSrcMatcherIndices, which pre-computes, per source node, which matcher indices have that node's IPs in their source set. Subsequent CanAccess calls only check relevant matchers instead of all N matchers.
Files: policy.go (srcMatcherCache, getSrcMatcherIndices, canAccessIndexed), policy_test.go (TestSourceMatcherIndexCache)
5. b1dafc92 hscontrol/policy: hoist invariant checks out of canAccessIndexed inner loop
1 file, +6/-2
Moves len(m.DstIPs) and m.IPProto checks out of the per-destination-IP inner loop in canAccessIndexed, since these values don't change per iteration.
Files: policy.go
6. 703960aa hscontrol/policy: add resolve cache, lightweight filter recompilation, and reachability tests
4 files, +2616/-7
Adds a per-compilation-cycle resolve cache for Group.Resolve, Username.Resolve, and Tag.Resolve. The same group/username/tag is referenced hundreds of times across 14K+ ACL rules but resolves identically within one update cycle. The cache achieves ~94% hit rate. Also makes ensureFilterCompiled lightweight: it only recompiles filter rules and matchers, skipping redundant tagOwner/autoApprover resolution that SetNodes already performed eagerly.
Includes comprehensive reachability test suite: equivalence tests, scale tests, dynamic join/leave, tag changes, subnet route overlap, peer symmetry, connection scenarios, and IP-based ACL rule tests (direct IPs, CIDR ranges, hosts entries, mixed identity+IP rules, IPv6).
Files: policy.go (resolveCache, lightweight ensureFilterCompiled), types.go (cache integration in Resolve methods), reachability_test.go (2200+ lines of reachability tests), reachability_ip_test.go (IP-based ACL tests)
7. b11db02a hscontrol/state: replace 500ms batch timeout with 1ms micro-batch drain
1 file, +108/-29
The NodeStore batched operations with a 500ms timeout, causing unnecessary latency for tag propagation and peer map updates. This replaces it with a 1ms micro-batch drain that processes operations as fast as they arrive while still coalescing concurrent writes. Throughput improved 2.5x.
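One way to picture a micro-batch drain is the sketch below. It is an assumption about the shape of the technique, not the NodeStore code: `drainMicrobatch` is a hypothetical helper that takes the first operation already received, then keeps collecting until a short window expires or the batch is full.

```go
package main

import "time"

// drainMicrobatch starts a batch with first, then collects further
// operations from ops until window elapses, maxBatch is reached,
// or ops is closed.
func drainMicrobatch(ops <-chan int, first int, window time.Duration, maxBatch int) []int {
	batch := []int{first}
	timer := time.NewTimer(window)
	defer timer.Stop()
	for len(batch) < maxBatch {
		select {
		case op, ok := <-ops:
			if !ok {
				return batch // channel closed: process what we have
			}
			batch = append(batch, op)
		case <-timer.C:
			return batch // window over: stop waiting for stragglers
		}
	}
	return batch
}
```

A caller would block on the channel for the first operation and then pass it in with a window such as `time.Millisecond`. A lone write waits at most one window before being applied, while a burst of concurrent writes still lands in a single batch.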
Files: node_store.go (drainMicrobatch, revised processLoop)
8. d802f381 hscontrol/state: store NodeIDs instead of NodeViews in peersByNode to eliminate GC overhead
2 files, +215/-183
Changed peersByNode from map[NodeID][]NodeView to map[NodeID][]NodeID. At 5000 nodes, the peer graph has ~25M entries. NodeView contains a pointer (forces GC to scan every entry), while NodeID is a uint64 (GC skips it entirely). ListPeers materializes fresh NodeViews from nodesByID at read time. GC overhead dropped from 50% to 3%.
Files: node_store.go (Snapshot type, snapshotFromNodes, shallowSnapshot, incrementalSnapshot, refreshNodePeers, ListPeers), node_store_test.go (updated assertions for NodeID storage)
9. a23ce2d1 hscontrol/policy: add pre-built node IP indexes for O(1) Username/Tag resolution
2 files, +103/-5
Username.Resolve and Tag.Resolve scanned all N nodes to build IP sets. This adds buildNodeIPIndexes, which pre-builds nodeIPsByUser and nodeIPsByTag maps once per filter compilation cycle. Resolve methods do O(1) map lookups instead of O(N) scans.
Files: policy.go (buildNodeIPIndexes, integration into updateLocked/ensureFilterCompiled), types.go (nodeIPsByUser/nodeIPsByTag fields, Resolve fast paths)
10. c04df66c mapper/batcher: wake processing loop on new changes for prompt delivery
2 files, +19/-0
Adds a wake channel that signals the batcher's doWork loop to process pending changes immediately instead of waiting for the next tick interval. This ensures node additions and policy changes are delivered to connected clients without the full batch delay, preventing stale peer lists in fast registration scenarios. Also includes UserProfiles in policyChangeResponse so newly visible peers have displayable identity information.
Files: mapper/batcher.go (wake channel, immediate processing signal), mapper/mapper.go (WithUserProfiles in policyChangeResponse)
Test coverage
go test ./hscontrol/... passes at every commit
Code changes (excluding tests and comments)
+605 lines added, -62 removed = 543 net new lines of production code
This PR was developed with Claude (Anthropic) running overnight, performing iterative pprof profiling, optimization, and correctness verification against live headscale instances with thousands of nodes and brutal ACL policies.