Performance optimizations to handle thousands of clients by miyoyo · Pull Request #3156 · juanfont/headscale

miyoyo · 2026-03-24T09:48:08Z

have read the CONTRIBUTING.md file
raised a GitHub issue or discussed it on the projects chat beforehand
added unit tests
added integration tests
updated documentation if needed
updated CHANGELOG.md

I am part of a CTF organization team that used Headscale and Tailscale to provide a VPN to thousands of participants.
As it is, the current version of Headscale cannot handle it, primarily due to the O(N^2) reachability computation of each node to each node.

I started by replacing the reachability system with a category system, using permission buckets instead of reachability computation. While this was much faster, it also changed over 2000 lines of code and ripped out a good chunk of existing code.
It has been tested in production and did not seem to have any issues with reachability in any way.

I grabbed my existing changes, pprof, a fresh copy of the repo, and got Claude to gradually, slowly, improve the performance of Headscale until it would be sufficient for our purposes, without changing too much of the existing code.

What is in this PR is the result of this gradual improvement. It is mostly AI generated, and likely contains things that you would not want copied in. See this PR as more of an idea of what could be changed, then.

Anything below this point was AI generated.

perf: registration and policy evaluation optimizations

Context

This work was done overnight with Claude (Anthropic's AI assistant) running iterative profiling, optimization, and correctness verification cycles against a headscale instance under realistic load.

The goal was to make node registration fast at scale and eliminate idle CPU waste. The benchmark used a brutal ACL policy (500+ tags, 200+ groups, 14K+ filter rules) with thousands of concurrent nodes to surface real bottlenecks.

Results

| Metric | Before | After |
|---|---|---|
| Registration throughput (2000 nodes, 14.6K rules) | 13/s | 206/s (15.8x) |
| Registration throughput (5000 nodes, 14.6K rules) | untested | 77/s avg |
| Idle CPU (2000 nodes connected) | ~70% | 0.85% |
| Memory (5000 nodes) | - | 872MB |

Correctness was verified at every step:

All go test ./hscontrol/... pass at every commit individually
104/104 reachability tests pass (50-container live ACL test with 10 users, 8 roles, 22 rules)
149/149 mega-test assertions pass

Commits (incremental, each builds and passes tests independently)

1. `b7ffe48b` hscontrol/policy: defer filter compilation in SetNodes

4 files, +182/-41

Moves compileFilterRules out of the SetNodes hot path. Instead of recompiling the full filter on every node addition, SetNodes marks the filter dirty and compilation happens lazily on the next Filter() or FilterForNode() call. This eliminates redundant recompilation when nodes are added in rapid succession (e.g., batch registration).

Files: policy.go (filterDirty flag, lazy ensureFilterCompiled), pm.go (interface update), state.go (caller update), policy_test.go (new tests for deferred compilation)
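A minimal sketch of the dirty-flag idea described for this commit. The `lazyFilter` type and the string "rules" are hypothetical stand-ins, not the actual PolicyManager code; the point is only that writes mark state stale and the expensive step runs once on the next read.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyFilter is a hypothetical stand-in for the policy manager's filter state.
type lazyFilter struct {
	mu       sync.Mutex
	nodes    []string
	dirty    bool
	compiled []string
	compiles int // how many times the expensive step actually ran
}

// SetNodes stores the node list and marks the compiled filter as stale.
func (f *lazyFilter) SetNodes(nodes []string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.nodes = append([]string(nil), nodes...)
	f.dirty = true
}

// Filter recompiles only when SetNodes ran since the last compilation.
func (f *lazyFilter) Filter() []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.dirty {
		f.compiles++
		compiled := make([]string, 0, len(f.nodes))
		for _, n := range f.nodes {
			compiled = append(compiled, "accept:"+n) // placeholder for real rule compilation
		}
		f.compiled = compiled
		f.dirty = false
	}
	return f.compiled
}

func main() {
	f := &lazyFilter{}
	f.SetNodes([]string{"a"})
	f.SetNodes([]string{"a", "b"})
	f.SetNodes([]string{"a", "b", "c"})
	rules := f.Filter()
	f.Filter()
	fmt.Println(len(rules), f.compiles) // prints "3 1"
}
```

Three node updates followed by two reads cost one compilation in this sketch, which is the batch-registration case the commit targets.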

2. `fcd7a6d7` hscontrol/state,policy: add incremental peer map computation for new node additions

6 files, +738/-20

Instead of rebuilding the entire O(N^2) peer map when a new node registers, this adds incremental peer map updates that only compute peers for newly added nodes (O(K*N) where K = new nodes). Includes RefreshPeersForNodes which fixes a correctness bug where PutNode ran before SetNodes, causing stale policy data in peer computation. Also adds HasPolicyChange detection for SubnetRoutes and IsExitNode changes.

Files: node_store.go (incrementalSnapshot, refreshNodePeers, RefreshPeersForNodes), policy.go (SetNodes returns newNodeIDs, HasPolicyChange route detection), pm.go, state.go, policy_test.go, node_store_test.go
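The O(K*N) update can be sketched as follows. This is an illustration under assumptions: `peerMap`, `add`, and the parity-based `canAccess` are hypothetical, and the real NodeStore snapshots carry much more state.

```go
package main

import "fmt"

type NodeID uint64

// canAccess is a hypothetical policy check: nodes with the same parity may talk.
func canAccess(src, dst NodeID) bool { return src%2 == dst%2 }

// peerMap tracks known nodes and, for each, the peers it is allowed to see.
type peerMap struct {
	ids   []NodeID
	peers map[NodeID][]NodeID
}

func newPeerMap() *peerMap { return &peerMap{peers: map[NodeID][]NodeID{}} }

// add registers K new nodes by checking each against the N nodes already
// known, which is O(K*N) policy checks rather than a full O(N^2) rebuild.
func (p *peerMap) add(newIDs ...NodeID) {
	for _, id := range newIDs {
		for _, other := range p.ids {
			if canAccess(id, other) || canAccess(other, id) {
				p.peers[id] = append(p.peers[id], other)
				p.peers[other] = append(p.peers[other], id)
			}
		}
		p.ids = append(p.ids, id)
	}
}

func main() {
	p := newPeerMap()
	p.add(1, 2, 3) // initial nodes
	p.add(4, 5)    // later registrations only touch their own rows and columns
	fmt.Println(p.peers[5], p.peers[4]) // prints "[1 3] [2]"
}
```

Existing pairs are never re-evaluated in this sketch, so it is only valid while the policy itself is unchanged; a policy change would still need a full rebuild.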

3. `5370bdf9` hscontrol/policy: skip O(N) node iteration in Host.Resolve for non-CGNAT prefixes

2 files, +77/-12

Host.Resolve was iterating all N nodes for every host entry to check for CGNAT overlap. Since most ACL hosts entries reference external IPs (not CGNAT 100.64.0.0/10 or ULA fd7a:115c:a1e0::/48), this adds a fast-path prefixOverlapsCGNAT check that skips the node scan entirely for non-overlapping prefixes.

Files: types.go (prefixOverlapsCGNAT, Resolve fast path), policy_test.go (TestHostResolveCGNATSkip)
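A sketch of what such a fast-path check could look like with the standard `net/netip` package. The function names here are hypothetical and the body is an assumption, not the PR's prefixOverlapsCGNAT; only the two ranges come from the description above.

```go
package main

import (
	"fmt"
	"net/netip"
)

var (
	cgnatRange = netip.MustParsePrefix("100.64.0.0/10")
	ulaRange   = netip.MustParsePrefix("fd7a:115c:a1e0::/48")
)

// overlapsTailnetRange reports whether p shares any address with the ranges
// that node IPs are allocated from. Only then is a node scan worthwhile.
func overlapsTailnetRange(p netip.Prefix) bool {
	return p.Overlaps(cgnatRange) || p.Overlaps(ulaRange)
}

// checkPrefix is a small string-based helper for the demo below.
func checkPrefix(s string) bool {
	return overlapsTailnetRange(netip.MustParsePrefix(s))
}

func main() {
	for _, s := range []string{
		"192.168.1.0/24",        // external: skip the node scan
		"100.64.0.5/32",         // inside the CGNAT range: scan nodes
		"fd7a:115c:a1e0::1/128", // inside the ULA range: scan nodes
		"0.0.0.0/0",             // broader prefix still overlaps: scan nodes
	} {
		fmt.Println(s, checkPrefix(s))
	}
}
```

Using an overlap test rather than a containment test matters for broad prefixes such as 0.0.0.0/0, which contain node addresses without being contained in the CGNAT range.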

4. `247dbd1b` hscontrol/policy: add source matcher index cache for O(relevant) CanAccess checks

2 files, +147/-2

CanAccess was iterating all matchers for every source node. This adds getSrcMatcherIndices which pre-computes, per source node, which matcher indices have that node's IPs in their source set. Subsequent CanAccess calls only check relevant matchers instead of all N matchers.

Files: policy.go (srcMatcherCache, getSrcMatcherIndices, canAccessIndexed), policy_test.go (TestSourceMatcherIndexCache)
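The per-source index can be illustrated like this. Everything below (`match`, `srcIndex`, `indicesFor`) is a hypothetical simplification: real matchers use IP sets and also check ports and protocols.

```go
package main

import (
	"fmt"
	"net/netip"
)

// match is a simplified matcher: a set of source and destination prefixes.
type match struct {
	srcs []netip.Prefix
	dsts []netip.Prefix
}

func addr(s string) netip.Addr { return netip.MustParseAddr(s) }

func prefixes(ss ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(ss))
	for _, s := range ss {
		out = append(out, netip.MustParsePrefix(s))
	}
	return out
}

func containsAddr(ps []netip.Prefix, ip netip.Addr) bool {
	for _, p := range ps {
		if p.Contains(ip) {
			return true
		}
	}
	return false
}

// srcIndex remembers, per source address, which matchers list it as a source.
type srcIndex struct {
	matchers []match
	cache    map[netip.Addr][]int
	scans    int // number of full scans over all matchers
}

func newSrcIndex(ms []match) *srcIndex {
	return &srcIndex{matchers: ms, cache: map[netip.Addr][]int{}}
}

func (s *srcIndex) indicesFor(src netip.Addr) []int {
	if idx, ok := s.cache[src]; ok {
		return idx
	}
	s.scans++
	var idx []int
	for i, m := range s.matchers {
		if containsAddr(m.srcs, src) {
			idx = append(idx, i)
		}
	}
	s.cache[src] = idx
	return idx
}

// canAccess checks only the matchers relevant to src.
func (s *srcIndex) canAccess(src, dst netip.Addr) bool {
	for _, i := range s.indicesFor(src) {
		if containsAddr(s.matchers[i].dsts, dst) {
			return true
		}
	}
	return false
}

func main() {
	s := newSrcIndex([]match{
		{srcs: prefixes("100.64.0.1/32"), dsts: prefixes("100.64.0.2/32")},
		{srcs: prefixes("100.64.0.3/32"), dsts: prefixes("100.64.0.0/10")},
	})
	fmt.Println(s.canAccess(addr("100.64.0.1"), addr("100.64.0.2"))) // prints "true"
	fmt.Println(s.canAccess(addr("100.64.0.1"), addr("100.64.0.9"))) // prints "false"
	fmt.Println(s.scans)                                             // prints "1"
}
```

The cache must be discarded whenever the matcher list changes, since the stored values are positions in that list.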

5. `b1dafc92` hscontrol/policy: hoist invariant checks out of canAccessIndexed inner loop

1 file, +6/-2

Moves len(m.DstIPs) and m.IPProto checks out of the per-destination-IP inner loop in canAccessIndexed, since these values don't change per iteration.

Files: policy.go

6. `703960aa` hscontrol/policy: add resolve cache, lightweight filter recompilation, and reachability tests

4 files, +2616/-7

Adds a per-compilation-cycle resolve cache for Group.Resolve, Username.Resolve, and Tag.Resolve. The same group/username/tag is referenced hundreds of times across 14K+ ACL rules but resolves identically within one update cycle. The cache achieves ~94% hit rate. Also makes ensureFilterCompiled lightweight — it only recompiles filter rules and matchers, skipping redundant tagOwner/autoApprover resolution that SetNodes already performed eagerly.

Includes comprehensive reachability test suite: equivalence tests, scale tests, dynamic join/leave, tag changes, subnet route overlap, peer symmetry, connection scenarios, and IP-based ACL rule tests (direct IPs, CIDR ranges, hosts entries, mixed identity+IP rules, IPv6).

Files: policy.go (resolveCache, lightweight ensureFilterCompiled), types.go (cache integration in Resolve methods), reachability_test.go (2200+ lines of reachability tests), reachability_ip_test.go (IP-based ACL tests)

7. `b11db02a` hscontrol/state: replace 500ms batch timeout with 1ms micro-batch drain

1 file, +108/-29

The NodeStore batched operations with a 500ms timeout, causing unnecessary latency for tag propagation and peer map updates. This replaces it with a 1ms micro-batch drain that processes operations as fast as they arrive while still coalescing concurrent writes. Throughput improved 2.5x.

Files: node_store.go (drainMicrobatch, revised processLoop)
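A sketch of a micro-batch drain, assuming a channel of pending operations. `drainBatch` is a hypothetical name and the body is an assumption rather than the PR's drainMicrobatch: it blocks for the first operation, then keeps collecting until a short window elapses.

```go
package main

import (
	"fmt"
	"time"
)

// drainBatch blocks for the first operation, then collects whatever else
// arrives until the window elapses or the channel is closed.
func drainBatch(ops <-chan int, window time.Duration) []int {
	first, ok := <-ops
	if !ok {
		return nil
	}
	batch := []int{first}
	timer := time.NewTimer(window)
	defer timer.Stop()
	for {
		select {
		case op, ok := <-ops:
			if !ok {
				return batch
			}
			batch = append(batch, op)
		case <-timer.C:
			return batch
		}
	}
}

func main() {
	ops := make(chan int, 8)
	for i := 1; i <= 3; i++ {
		ops <- i
	}
	close(ops)
	// Operations that are already queued are coalesced into one batch.
	fmt.Println(drainBatch(ops, time.Millisecond))
}
```

A lone operation waits at most one window before it is applied, while a burst of concurrent writes still lands in a single batch.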

8. `d802f381` hscontrol/state: store NodeIDs instead of NodeViews in peersByNode to eliminate GC overhead

2 files, +215/-183

Changed peersByNode from map[NodeID][]NodeView to map[NodeID][]NodeID. At 5000 nodes, the peer graph has ~25M entries. NodeView contains a pointer (forces GC to scan every entry), while NodeID is a uint64 (GC skips it entirely). ListPeers materializes fresh NodeViews from nodesByID at read time. GC overhead dropped from 50% to 3%.

Files: node_store.go (Snapshot type, snapshotFromNodes, shallowSnapshot, incrementalSnapshot, refreshNodePeers, ListPeers), node_store_test.go (updated assertions for NodeID storage)
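The storage change can be sketched as below. The `snapshot` type is a hypothetical reduction of the real one. The premise, as I understand the Go runtime, is that a slice whose element type contains no pointers does not need to be scanned by the garbage collector, whereas a slice of pointer-carrying views does; at 5000 nodes a full mesh is 5000 × 5000 = 25M entries.

```go
package main

import "fmt"

type NodeID uint64

type Node struct {
	ID       NodeID
	Hostname string
}

// snapshot keeps peer relationships as plain IDs and resolves them on read.
type snapshot struct {
	nodesByID   map[NodeID]*Node
	peersByNode map[NodeID][]NodeID // pointer-free elements
}

// ListPeers materializes node pointers from IDs at read time.
func (s *snapshot) ListPeers(id NodeID) []*Node {
	ids := s.peersByNode[id]
	out := make([]*Node, 0, len(ids))
	for _, pid := range ids {
		if n, ok := s.nodesByID[pid]; ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	s := &snapshot{
		nodesByID: map[NodeID]*Node{
			1: {ID: 1, Hostname: "a"},
			2: {ID: 2, Hostname: "b"},
			3: {ID: 3, Hostname: "c"},
		},
		peersByNode: map[NodeID][]NodeID{1: {2, 3}},
	}
	for _, p := range s.ListPeers(1) {
		fmt.Println(p.ID, p.Hostname)
	}
}
```

The trade-off is an extra map lookup per peer on every read in exchange for a peer graph the collector can skip.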

9. `a23ce2d1` hscontrol/policy: add pre-built node IP indexes for O(1) Username/Tag resolution

2 files, +103/-5

Username.Resolve and Tag.Resolve scanned all N nodes to build IP sets. This adds buildNodeIPIndexes which pre-builds nodeIPsByUser and nodeIPsByTag maps once per filter compilation cycle. Resolve methods do O(1) map lookups instead of O(N) scans.

Files: policy.go (buildNodeIPIndexes, integration into updateLocked/ensureFilterCompiled), types.go (nodeIPsByUser/nodeIPsByTag fields, Resolve fast paths)
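A minimal sketch of building such indexes in one pass. The `node` and `ipIndex` types and the demo data are hypothetical; the real indexes would be keyed by headscale's own user and tag types.

```go
package main

import (
	"fmt"
	"net/netip"
)

type node struct {
	user string
	tags []string
	ip   netip.Addr
}

// ipIndex maps users and tags to the IPs of their nodes.
type ipIndex struct {
	byUser map[string][]netip.Addr
	byTag  map[string][]netip.Addr
}

// buildIPIndex walks the node list once; later lookups are map reads.
func buildIPIndex(nodes []node) ipIndex {
	idx := ipIndex{
		byUser: map[string][]netip.Addr{},
		byTag:  map[string][]netip.Addr{},
	}
	for _, n := range nodes {
		idx.byUser[n.user] = append(idx.byUser[n.user], n.ip)
		for _, t := range n.tags {
			idx.byTag[t] = append(idx.byTag[t], n.ip)
		}
	}
	return idx
}

// demoNodes returns made-up data for the example.
func demoNodes() []node {
	return []node{
		{user: "alice", ip: netip.MustParseAddr("100.64.0.1")},
		{user: "alice", tags: []string{"tag:web"}, ip: netip.MustParseAddr("100.64.0.2")},
		{user: "bob", tags: []string{"tag:db"}, ip: netip.MustParseAddr("100.64.0.3")},
	}
}

func main() {
	idx := buildIPIndex(demoNodes())
	fmt.Println(idx.byUser["alice"]) // prints "[100.64.0.1 100.64.0.2]"
	fmt.Println(idx.byTag["tag:db"]) // prints "[100.64.0.3]"
}
```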

10. `c04df66c` mapper/batcher: wake processing loop on new changes for prompt delivery

2 files, +19/-0

Adds a wake channel that signals the batcher's doWork loop to process pending changes immediately instead of waiting for the next tick interval. This ensures node additions and policy changes are delivered to connected clients without the full batch delay, preventing stale peer lists in fast registration scenarios. Also includes UserProfiles in policyChangeResponse so newly visible peers have displayable identity information.

Files: mapper/batcher.go (wake channel, immediate processing signal), mapper/mapper.go (WithUserProfiles in policyChangeResponse)
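The wake-channel pattern can be sketched as follows. The `batcher` type here is a hypothetical miniature, not the code in mapper/batcher.go: a buffered channel of size one lets writers nudge the loop without blocking, and the ticker remains as a fallback.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type batcher struct {
	mu      sync.Mutex
	pending []string
	wake    chan struct{}
}

func newBatcher() *batcher {
	return &batcher{wake: make(chan struct{}, 1)}
}

// add queues a change and signals the loop; the send never blocks.
func (b *batcher) add(change string) {
	b.mu.Lock()
	b.pending = append(b.pending, change)
	b.mu.Unlock()
	select {
	case b.wake <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// run delivers pending changes on every tick or as soon as it is woken.
func (b *batcher) run(tick time.Duration, deliver func([]string), stop <-chan struct{}) {
	t := time.NewTicker(tick)
	defer t.Stop()
	for {
		select {
		case <-t.C:
		case <-b.wake:
		case <-stop:
			return
		}
		b.mu.Lock()
		batch := b.pending
		b.pending = nil
		b.mu.Unlock()
		if len(batch) > 0 {
			deliver(batch)
		}
	}
}

func main() {
	b := newBatcher()
	out := make(chan []string, 1)
	stop := make(chan struct{})
	go b.run(time.Hour, func(batch []string) { out <- batch }, stop)
	b.add("node-added")
	fmt.Println(<-out) // delivered right away despite the one-hour tick
	close(stop)
}
```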

Test coverage

Unit tests: All go test ./hscontrol/... pass at every commit
Reachability tests: 12 test functions covering equivalence, scale, symmetry, dynamic join/leave, tag changes, subnet routes, IP-based ACLs
IP-based ACL tests: 16 scenarios covering direct IPs, CIDR ranges, IPv6, hosts entries, mixed identity+IP rules, subnet route overlap, dynamic node changes
Live container test: 104/104 pass (50 containers, 10 users, 8 roles, 22 ACL rules)
Mega test: 149/149 pass (100 containers, comprehensive ACL scenarios)

Code changes (excluding tests and comments)

+605 lines added, -62 removed = 543 net new lines of production code

This PR was developed with Claude (Anthropic) running overnight, performing iterative pprof profiling, optimization, and correctness verification against live headscale instances with thousands of nodes and brutal ACL policies.

Problem: During bulk node registration (e.g., 500 nodes joining with unique preauth keys), SetUsers is called for every new node with the same user list. Each call triggers updateLocked(), which recompiles all ACL filter rules, an O(rules × nodes) operation. With a 14.6K-rule policy and 500 nodes, this dominated CPU at 32% (61.8s of 192.7s total samples), causing registration throughput to drop from 11/s to 3/s as the node count grew.

Fix: Add a deephash-based short-circuit to SetUsers. Before triggering updateLocked(), hash the incoming user list and compare it against the previously stored hash. If the users haven't changed (which is the common case during registration, since the user list is stable), return immediately without recompiling filters.

Impact: Registration of 500 nodes with unique preauth keys and a 145K-line brutal ACL policy improved from 139s (3.6 nodes/s) to 53s (9.4 nodes/s), a 2.6x speedup. The 32% CPU from compileFilterRules in the SetUsers path drops to ~0%.

How: Added a usersHash field (deephash.Sum) to PolicyManager. In SetUsers, compute deephash.Hash(&users) and compare with the stored hash before proceeding. This is safe because the hash captures the full user list state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
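The short-circuit can be sketched with the standard library alone. Note the swap: this sketch hashes a JSON encoding with SHA-256 instead of using deephash, and `policyManager`, `User`, and the `recompiles` counter are hypothetical stand-ins.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

type User struct {
	ID   uint64
	Name string
}

type policyManager struct {
	usersHash  [sha256.Size]byte
	hashed     bool
	recompiles int // stand-in for calls to the expensive updateLocked()
}

// SetUsers reports whether the user list changed and a recompile happened.
func (pm *policyManager) SetUsers(users []User) (bool, error) {
	raw, err := json.Marshal(users)
	if err != nil {
		return false, err
	}
	h := sha256.Sum256(raw)
	if pm.hashed && h == pm.usersHash {
		return false, nil // unchanged: skip recompilation
	}
	pm.usersHash, pm.hashed = h, true
	pm.recompiles++
	return true, nil
}

func main() {
	pm := &policyManager{}
	users := []User{{ID: 1, Name: "alice"}, {ID: 2, Name: "bob"}}
	for i := 0; i < 500; i++ {
		pm.SetUsers(users)
	}
	fmt.Println(pm.recompiles) // prints "1"
}
```

The hash in this sketch is order-sensitive, so the same users in a different order count as a change; that is conservative (an extra recompile) rather than incorrect.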

Problem: bcrypt.CompareHashAndPassword is called on every preauth key validation during node registration. Each call costs ~75ms of CPU time. When using a single reusable preauth key with many concurrent registrations (e.g., 50 parallel), bcrypt becomes a thundering herd where 50 goroutines all perform the same expensive computation simultaneously. Even with unique keys, bcrypt consumes 42% of total CPU (38.75s of 91.7s samples).

Fix: Add a sync.Map-based singleflight cache keyed by "prefix:sha256(bcryptHash)". Each cache entry uses sync.Once so that only the first goroutine performs the actual bcrypt comparison; all concurrent goroutines for the same key block and reuse the result.

Impact: For reusable keys (single key shared by many nodes), this reduces bcrypt from O(N) computations to O(1): a single bcrypt call regardless of how many nodes register concurrently. For unique keys, each key still requires one bcrypt call (cache entries are used once), but the singleflight prevents duplicate work if the same key is validated concurrently by multiple code paths.

How: Added bcryptCacheEntry struct with sync.Once + error, stored in a package-level sync.Map. In findAuthKey, LoadOrStore the cache entry and call once.Do with the bcrypt comparison. The cache key includes a SHA-256 of the stored bcrypt hash to ensure correctness if the hash is rotated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
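A sketch of the sync.Map plus sync.Once pattern. Two things differ from the commit and are assumptions of this sketch: `slowCompare` is a stand-in for bcrypt.CompareHashAndPassword so the example needs only the standard library, and the cache key also covers the presented secret so that a cached result is never reused for a different secret.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

type verifyEntry struct {
	once sync.Once
	err  error
}

var (
	verifyCache sync.Map     // cache key -> *verifyEntry
	slowCalls   atomic.Int64 // counts invocations of the expensive comparison
)

// slowCompare is a stand-in for an expensive check such as bcrypt.
func slowCompare(storedHash, secret string) error {
	slowCalls.Add(1)
	if storedHash != "hash-of-"+secret {
		return errors.New("secret does not match stored hash")
	}
	return nil
}

// verifyOnce runs the expensive comparison at most once per cache key;
// concurrent callers for the same key block in Do and share the result.
func verifyOnce(prefix, storedHash, secret string) error {
	sum := sha256.Sum256([]byte(storedHash + "\x00" + secret))
	key := prefix + ":" + hex.EncodeToString(sum[:])
	v, _ := verifyCache.LoadOrStore(key, &verifyEntry{})
	e := v.(*verifyEntry)
	e.once.Do(func() { e.err = slowCompare(storedHash, secret) })
	return e.err
}

// verifyConcurrently runs n simultaneous verifications and counts failures.
func verifyConcurrently(n int, prefix, storedHash, secret string) int64 {
	var wg sync.WaitGroup
	var failures atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if verifyOnce(prefix, storedHash, secret) != nil {
				failures.Add(1)
			}
		}()
	}
	wg.Wait()
	return failures.Load()
}

func main() {
	failed := verifyConcurrently(50, "hskey", "hash-of-s3cret", "s3cret")
	fmt.Println(failed, slowCalls.Load()) // prints "0 1"
}
```

Entries in this sketch are never evicted, so a long-running server would also need some bound or expiry on the map.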

Add a wake channel that signals the doWork loop to process pending changes immediately instead of waiting for the next tick interval. This ensures node additions and policy changes are delivered to connected clients without the full batch delay, preventing stale peer lists in fast registration scenarios. Also include UserProfiles in policyChangeResponse so that newly visible peers have displayable identity information in the netmap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sbatista-uc · 2026-04-08T14:53:42Z

These performance improvements are quite impressive. Can you create a benchmark so that others and I can verify your claims with our own ACLs?

miyoyo · 2026-04-08T14:55:42Z

Of course. Do you want me to use my own ACLs, or do you have your own that I can use?

kradalby · 2026-04-08T15:00:25Z

Just want to put in that I am not actively ignoring this, and it probably would be helpful in the future.

We are mostly spending time on the planned work and trying not to get distracted. The other aspect is that we need to have sufficient test coverage to be comfortable changing some of this stuff. A lot of tests have been going in recently and we might be getting closer to something like this, but no promises.

miyoyo · 2026-04-08T16:35:23Z

I take no offense if you ignore or even close this and copy parts of it piecemeal. This is mostly AI generated, with some optimisations I thought up (and a lot of AI tokens). I highly appreciate your work on Headscale and I understand you are busy. Thank you for responding!

sbatista-uc · 2026-04-08T19:42:21Z

@miyoyo I'm not comfortable sharing the ACL publicly as I'm using it for work within a highly competitive market. If you're interested in testing out your code and benchmarks against my ACL, I'd be happy to send it to you via a private message. Shoot me an email if you're interested: samuel.batista@usercentrics.com

…group:self

When a policy uses autogroup:self (which expands differently per user), ComputeNodePeers and BuildPeerMap previously recompiled matchers from filter rules on every peer-pair check, then iterated ALL matchers to find source matches: O(N² × M), where M is the matcher count.

Add three caches that persist across ComputeNodePeers calls within a filter compilation cycle:

1. perNodeMatcherCache: caches []matcher.Match per node ID, avoiding repeated MatchesFromFilterRules allocations and GC pressure
2. perNodeSrcIdxCache: caches source-matcher indices per (srcNode, matcherOwner) pair, reducing CanAccess from O(M) to O(relevant) where relevant << M for large rule sets
3. getNodeMatchers/getPerNodeSrcIndices helper methods that lazily populate both caches

Both BuildPeerMap and ComputeNodePeers now use canAccessIndexed with per-node source indices instead of the full CanAccess scan, matching the optimization already used in the non-autogroup:self path.

All caches are invalidated on filter recompilation, policy updates, and selective autogroup:self cache invalidation (user-scoped).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ations

When using autogroup:self policies, compileFilterRulesForNodeLocked calls compileFilterRulesForNode which resolves every Username, Group, and Tag in the ACL rules. Previously, the resolve cache and node IP indexes were only enabled during ensureFilterCompiled (global filter compilation) and torn down immediately after via defer. Per-node compilations ran without any cache, causing O(users × rules) string matching on every call. With 100 nodes and 5000 rules, this meant ~500K resolveUser calls doing linear scans, 39.5% of total CPU in the profile.

Fix: persist the resolve cache and node IP indexes across the filter compilation cycle for autogroup:self policies. Add ensureResolveCacheForCompilation which lazily initializes the cache on first per-node compilation, and clearResolveCache which tears it down on invalidation events (user changes, node identity changes, policy updates).

Results (100 nodes, 5000 ACL rules):
commit-12: 9.3s registration, resolveUser at 39.5% CPU
commit-13: 6.0s registration (cached matchers)
commit-14: 0.9s registration, ComputeNodePeers not in top 40

10x improvement over commit-12. At 500 nodes: 38.6s with 500/500 success, CPU profile now dominated by bcrypt (89.9%) not peer computation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pilation

MatchFromStrings calls util.ParseIPSet for every source and destination IP string in every filter rule. With 5000 rules × 950 nodes, the same IP strings (node addresses like "100.64.0.1/32") are parsed millions of times, each building an IPSetBuilder, normalizing ranges, and allocating.

Add a package-level ipSetCache that deduplicates ParseIPSet calls within a compilation cycle. The cache is reset via ResetIPSetCache() whenever filters are invalidated (node/user/policy changes).

Results (950 fake + 50 real clients, 5000 ACL rules):
commit-14: 174.9s registration, MatchFromStrings at 22.1% CPU
commit-15: 124.2s registration, MatchFromStrings at 12.8% CPU

29% faster registration, 42% reduction in matcher compilation CPU

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When usesAutogroupSelf is true, compileFilterRulesForNode previously compiled ALL ACL rules for each node, iterating 5000 rules to find the ~2 that reference autogroup:self. The other 4998 rules produce identical output regardless of which node is being compiled.

Split the compilation: compile non-autogroup:self rules once into globalRulesForNode (cached on the Policy struct), then only compile the autogroup:self ACLs per node. The per-node rules are combined with the cached global rules before merging.

Results (950 fake + 50 real clients, 5000 ACL rules):
commit-15: 124.2s registration, compileACLWithAutogroupSelf at 39.7%
commit-16: 48.8s registration, compileACLWithAutogroupSelf gone from top

2.5x faster registration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… for autogroup:self

MatchesFromFilterRules was converting all ~5000 compiled filter rules into []matcher.Match for every node, even though the global rules produce identical matchers. With 950 nodes this meant building the same 5000 matchers 950 times (parsing IP strings, building IPSets, sorting, normalizing), all redundant.

Split getNodeMatchers into two phases:

1. globalMatcherCache: built once from non-autogroup:self rules (5000), shared by all nodes
2. Per-node matchers: built from only the ~2 autogroup:self ACLs

Add compileAutogroupSelfRulesForNode to filter.go for the per-node compilation, separate from the global rules path.

Results (950 fake + 50 real clients, 5000 ACL rules):
commit-16: 48.8s registration, MatchesFromFilterRules at 49%
commit-17: 3.4s registration, ComputeNodePeers gone from profile

14x faster than commit-16, 51x faster than commit-12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

miyoyo · 2026-04-09T16:55:48Z

> @miyoyo I'm not comfortable sharing the ACL publicly as I'm using it for work within a highly competitive market. If you're interested in testing out your code and benchmarks against my ACL, I'd be happy to send it to you via a private message. Shoot me an email if you're interested: samuel.batista@usercentrics.com

I'd rather avoid having to hold sensitive data on my end. I spent most of my remaining Claude tokens generating a nice way to do performance and correctness comparisons between my patches; this also enabled me to discover some new patches that increase performance.

For reference, my tests were done on an Ultra 7 165H with 32GB of RAM.

You can find the harness in the following repository: https://github.com/miyoyo/headscale-3156-harness
You can use hsbench genpolicy -policy your_policy.hujson to benchmark both versions.

Below is an extension of the commit explanations that were in the first message in this conversation. The major difference is that the fake client used now holds a connection open, thus increasing the load, and my previous tests did not consider autogroup:self's performance.

I'm thinking it could be useful to try to condense all the commits down into a smaller patch; it's possible some of the steps taken by Claude were unnecessary or negative. I'll see what I can remove by tomorrow.

I'm also still updating it to verify correctness; it looks like the little goblin took some shortcuts in the harness.

---Anything below this point is AI generated---

Highly Concurrent Test: 2000 Nodes × 15000 ACL Rules × Concurrency 2000

All 2000 fake clients connect simultaneously.

| | headscale-base (main) | commit-17 (latest) |
|---|---|---|
| Nodes registered | 118 / 2000 (5.9%) | 2000 / 2000 (100%) |
| Registration time | 300.5s (timed out) | 11.3s |
| Map snapshots captured | 0 / 5 | 5 / 5 |
| Throughput | ~0.4 nodes/sec | 177 nodes/sec |
| Real clients connected | not attempted | 50 / 50 |
| Lifecycle tests passing | not reached | 7 / 7 |
| Peers visible | unknown | 2007 |
| Container status | alive but unresponsive | healthy |
| Relative speed | 1× | ~443× |

Configuration

ACL rules: 15001 (generated, including autogroup:self)
Users: 666
Tags: 10 (exit-node, subnet-router, server, db, api, web, monitoring, ssh-target, vpn-gateway, client)
Groups: 200
Concurrency: 2000 (all nodes register simultaneously)
Real tailscale clients: 50 (commit-17 only, not reached on the base)
Platform: WSL2 AlmaLinux 10, podman, host networking

Commits 13–17: autogroup:self at scale

These commits address a second bottleneck that only surfaces when a policy uses autogroup:self, a rule type that expands differently per user and forces per-node filter compilation. The April 9 profiling session found that with 950 nodes and 5000 ACL rules, the global optimizations from commits 1–10 had no effect on this path: registration still took 174 seconds.

13. bcef709 hscontrol/policy: cache per-node matchers and source indices for autogroup:self

1 file, +105/−30

ComputeNodePeers and BuildPeerMap called MatchesFromFilterRules on the per-node compiled filter on every peer-pair check — rebuilding the full []matcher.Match slice from scratch for each of the N² pairs. Two caches fix this: perNodeMatcherCache stores the compiled matchers per node ID so they're built once per node per cycle, and perNodeSrcIdxCache indexes source-matcher positions per (srcNode, matcherOwner) pair, extending the canAccessIndexed optimization (commit 4) to the autogroup:self path. Both caches are invalidated on policy updates, user changes, and selective autogroup:self invalidation.

Files: policy.go

14. 3a1c9ef hscontrol/policy: persist resolve cache across per-node filter compilations

1 file, +50/−5

The resolve cache added in commit 6 was only active during ensureFilterCompiled — a defer tore it down immediately after global compilation. Per-node compilations for autogroup:self ran without any cache, hitting O(users × rules) string matching on every compileFilterRulesForNode call. With 100 nodes and 5000 rules this meant ~500K resolveUser calls doing linear scans, accounting for 39.5% of total CPU. ensureResolveCacheForCompilation lazily initialises the cache on the first per-node call and clearResolveCache tears it down on invalidation events, keeping it alive across the full registration cycle. Result: 9.3s → 0.9s for 100 nodes/5000 rules (10x).

Files: policy.go

15. 7df1e8f hscontrol/policy/matcher: cache ParseIPSet results across matcher compilation

2 files, +28/−9

MatchFromStrings calls util.ParseIPSet for every IP string in every filter rule. With 5000 rules × 950 nodes, the same node addresses like 100.64.0.1/32 were parsed millions of times — each call allocating an IPSetBuilder, normalising ranges, and building a new set. A package-level ipSetCache deduplicates ParseIPSet calls within a compilation cycle, reset via ResetIPSetCache() on filter invalidation. With 950 fake + 50 real clients: 174.9s → 124.2s (29% faster), MatchFromStrings CPU 22.1% → 12.8%.

Files: matcher/matcher.go, policy.go
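A sketch of a parse cache with an explicit reset, in the spirit of this commit. It substitutes `netip.ParsePrefix` for util.ParseIPSet so it stays within the standard library; `prefixCache` and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// prefixCache deduplicates parsing of identical prefix strings.
type prefixCache struct {
	mu      sync.Mutex
	entries map[string]netip.Prefix
	misses  int // number of strings that actually had to be parsed
}

func (c *prefixCache) parse(s string) (netip.Prefix, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.entries[s]; ok {
		return p, nil
	}
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, err // errors are not cached
	}
	c.misses++
	if c.entries == nil {
		c.entries = make(map[string]netip.Prefix)
	}
	c.entries[s] = p
	return p, nil
}

// reset drops all entries; call it whenever filters are invalidated.
func (c *prefixCache) reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = nil
}

func main() {
	var c prefixCache
	for i := 0; i < 1000; i++ {
		c.parse("100.64.0.1/32")
	}
	fmt.Println(c.misses) // prints "1"
	c.reset()
	c.parse("100.64.0.1/32")
	fmt.Println(c.misses) // prints "2"
}
```

A package-level cache like the one the commit describes is shared mutable state, which is why the sketch guards it with a mutex and ties its lifetime to an explicit reset.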

16. 1e92e0a hscontrol/policy: split per-node filter compilation for autogroup:self

3 files, +126/−3

compileFilterRulesForNode iterated all ACL rules for every node to find the ~2 that reference autogroup:self, compiling the other 4998 identically each time. compileNonAutogroupSelfRules now compiles the invariant rules once and caches the result on the Policy struct as globalRulesForNode. Per-node compilation only iterates the autogroup:self ACLs. compileFilterRulesForNode combines the cached global rules with the per-node result before merging. Result: 124.2s → 48.8s (2.5x), compileACLWithAutogroupSelf disappears from the CPU profile.

Files: filter.go (aclUsesAutogroupSelf, compileNonAutogroupSelfRules, split logic), policy.go, types.go (globalRulesForNode field)
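The split can be sketched as follows. This is a toy model under assumptions: rules are string pairs, "compilation" is string formatting, and `compiler`, `aclRule`, and `forUser` are hypothetical names rather than the functions listed above.

```go
package main

import (
	"fmt"
	"strings"
)

const selfToken = "autogroup:self"

type aclRule struct{ src, dst string }

type compiler struct {
	rules          []aclRule
	built          bool
	global         []string  // compiled once: rules that do not mention autogroup:self
	selfRules      []aclRule // compiled per user
	globalCompiles int
}

func usesSelf(r aclRule) bool { return r.src == selfToken || r.dst == selfToken }

// prepare partitions the rules and compiles the user-independent part once.
func (c *compiler) prepare() {
	if c.built {
		return
	}
	c.globalCompiles++
	for _, r := range c.rules {
		if usesSelf(r) {
			c.selfRules = append(c.selfRules, r)
		} else {
			c.global = append(c.global, r.src+" -> "+r.dst)
		}
	}
	c.built = true
}

// forUser combines the cached global rules with the few per-user rules.
func (c *compiler) forUser(user string) []string {
	c.prepare()
	out := append([]string(nil), c.global...) // copy so callers cannot mutate the cache
	for _, r := range c.selfRules {
		out = append(out, strings.ReplaceAll(r.src+" -> "+r.dst, selfToken, user))
	}
	return out
}

func main() {
	c := &compiler{rules: []aclRule{
		{"group:dev", "tag:web"},
		{"tag:web", "tag:db"},
		{selfToken, selfToken},
	}}
	fmt.Println(c.forUser("alice"))
	fmt.Println(c.forUser("bob"))
	fmt.Println(c.globalCompiles) // prints "1"
}
```

In this toy model the per-user rules are appended after the global ones; a real implementation would also need to preserve any ordering that rule merging depends on.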

17. 8741944 hscontrol/policy: cache global matchers, only build per-node matchers for autogroup:self

2 files, +85/−8

getNodeMatchers was calling MatchesFromFilterRules on the full per-node compiled rules for every node — converting all ~5000 rules into []matcher.Match including the global rules that are identical across all nodes. With 950 nodes this built the same 5000 matchers 950 times. globalMatcherCache holds the []matcher.Match built once from the non-autogroup:self rules; per-node compilation only runs MatchesFromFilterRules on the ~2 autogroup:self rules and appends them to the shared cache. Result: 48.8s → 3.4s (14x), ComputeNodePeers leaves the CPU profile entirely. Across the full autogroup:self sequence: 51x improvement over commit 12's baseline.

Files: filter.go (compileAutogroupSelfRulesForNode), policy.go (globalMatcherCache, revised getNodeMatchers)

Total line count

+1301 / -127 = 1174 net lines of code.

Cediddi · 2026-04-19T14:58:59Z

Honestly, this PR is great help for me. I had already hit the max device count (8core 16G) of my server, and I was looking for a way to scale horizontally or find optimizations I could apply to allow more devices to be registered (already on Postgres).

Is there any way I can help you? (background: python)

miyoyo · 2026-04-20T05:27:42Z

> Honestly, this PR is great help for me. I had already hit the max device count (8core 16G) of my server, and I was looking for a way to scale horizontally or find optimizations I could apply to allow more devices to be registered (already on Postgres).
>
> Is there any way I can help you? (background: python)

Glad it's helped!
I've spent some time (unfortunately not enough) trying to minimize the changes I make to the latest master in order to make this PR smaller, but I doubt there is an easy way to do it without too much impact.

The conclusion that I end up with when trying manually, and that any of the many background research agents I have running reach, is that a lot of the effort comes from the architecture of the rule parsing being node-based (what can this node reach?) rather than edge-based (given this rule, which nodes are affected?).

I haven't tried that angle just yet (sorry about the timing, life gets in the way), and there seem to be many edge cases that my comparison tests keep catching (I run headscale master and do a bunch of things, record all of the server's input, output and state, then compare to my modifications). It's a surprisingly complicated problem, but it's solvable, I'm sure of it.

As for what you can do for this PR, I'm not quite sure myself. Go is a fairly straightforward language to learn, so if you want to pick one of the issues and have a go [ ;) ] at it, I'm sure the maintainers will be glad. Do try to keep it to minimal changes and write extensive tests while doing it.

reflog · 2026-05-01T13:54:49Z

Just wanted to +1 this: the PR is working fantastically on our 1400-node headscale.
We've ported the changes to 0.28 and it resolved the lock contention on map update completely.

miyoyo and others added 12 commits March 23, 2026 22:48

0a2d302 hscontrol/policy: defer filter compilation in SetNodes to eliminate hot-path cost
4d1f5a6 hscontrol/state,policy: add incremental peer map computation for new node additions
587e343 hscontrol/policy: skip O(N) node iteration in Host.Resolve for non-CGNAT prefixes
6380ae7 hscontrol/policy: add source matcher index cache for O(relevant) CanAccess checks
3109eef hscontrol/policy: hoist invariant checks out of canAccessIndexed inner loop
48b80b4 hscontrol/policy: add resolve cache, lightweight filter recompilation, and reachability tests
e0424f2 hscontrol/state: replace 500ms batch timeout with 1ms micro-batch drain
b5478c4 hscontrol/state: store NodeIDs instead of NodeViews in peersByNode to eliminate GC overhead
1c52ca7 hscontrol/policy: add pre-built node IP indexes for O(1) Username/Tag resolution

miyoyo requested review from juanfont and kradalby as code owners March 24, 2026 09:48

hidden and others added 5 commits April 9, 2026 11:12

maxpetrusenkoagent mentioned this pull request Jun 14, 2026

[Bug] Up function takes one second longer than before, after updating headscale from 0.25.1 to 0.28 #3165

Open

4 tasks

miyoyo wants to merge 17 commits into juanfont:main from miyoyo:perf/registration-optimizations

6 participants
