
Performance optimizations to handle thousands of clients#3156

Open
miyoyo wants to merge 17 commits into
juanfont:mainfrom
miyoyo:perf/registration-optimizations

Conversation

@miyoyo

@miyoyo miyoyo commented Mar 24, 2026

  • have read the CONTRIBUTING.md file
  • raised a GitHub issue or discussed it on the projects chat beforehand
  • added unit tests
  • added integration tests
  • updated documentation if needed
  • updated CHANGELOG.md

I am part of a CTF organization team that used Headscale and Tailscale to provide a VPN to thousands of participants.
As it is, the current version of Headscale cannot handle it, primarily due to the O(N^2) reachability computation of each node to each node.

I started by replacing the reachability system with a category system, using permission buckets instead of reachability computation. While this was much faster, it also changed over 2000 lines of code and ripped out a good chunk of the existing code.
It has been tested in production and did not seem to have any issues with reachability in any way.

I grabbed my existing changes, pprof, a fresh copy of the repo, and got Claude to gradually, slowly, improve the performance of Headscale until it would be sufficient for our purposes, without changing too much of the existing code.

What is in this PR is the result of this gradual improvement. It is mostly AI generated, and likely contains things that you would not want copied in. See this PR as more of an idea of what could be changed, then.

Anything below this point was AI generated.


perf: registration and policy evaluation optimizations

Context

This work was done overnight with Claude (Anthropic's AI assistant) running iterative profiling, optimization, and correctness verification cycles against a headscale instance under realistic load.

The goal was to make node registration fast at scale and eliminate idle CPU waste. The benchmark used a brutal ACL policy (500+ tags, 200+ groups, 14K+ filter rules) with thousands of concurrent nodes to surface real bottlenecks.

Results

| Metric | Before | After |
|---|---|---|
| Registration throughput (2000 nodes, 14.6K rules) | 13/s | 206/s (15.8x) |
| Registration throughput (5000 nodes, 14.6K rules) | untested | 77/s avg |
| Idle CPU (2000 nodes connected) | ~70% | 0.85% |
| Memory (5000 nodes) | - | 872MB |

Correctness was verified at every step:

  • All go test ./hscontrol/... pass at every commit individually
  • 104/104 reachability tests pass (50-container live ACL test with 10 users, 8 roles, 22 rules)
  • 149/149 mega-test assertions pass

Commits (incremental, each builds and passes tests independently)

1. b7ffe48b hscontrol/policy: defer filter compilation in SetNodes

4 files, +182/-41

Moves compileFilterRules out of the SetNodes hot path. Instead of recompiling the full filter on every node addition, SetNodes marks the filter dirty and compilation happens lazily on the next Filter() or FilterForNode() call. This eliminates redundant recompilation when nodes are added in rapid succession (e.g., batch registration).

Files: policy.go (filterDirty flag, lazy ensureFilterCompiled), pm.go (interface update), state.go (caller update), policy_test.go (new tests for deferred compilation)
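
The dirty-flag pattern described for this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `lazyFilter`, its fields, and the placeholder "compilation" are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyFilter is a hypothetical stand-in for the policy manager's filter state.
type lazyFilter struct {
	mu       sync.Mutex
	nodes    []string
	dirty    bool
	compiled []string
	compiles int // how many times the expensive step actually ran
}

// SetNodes records the node list and marks the filter stale; it never compiles.
func (f *lazyFilter) SetNodes(nodes []string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.nodes = append([]string(nil), nodes...)
	f.dirty = true
}

// Filter compiles on demand, so a burst of SetNodes calls costs one compilation.
func (f *lazyFilter) Filter() []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.dirty {
		compiled := make([]string, 0, len(f.nodes))
		for _, n := range f.nodes {
			compiled = append(compiled, "accept:"+n) // placeholder for real rule compilation
		}
		f.compiled = compiled
		f.compiles++
		f.dirty = false
	}
	return f.compiled
}

func main() {
	f := &lazyFilter{}
	nodes := []string{}
	for i := 0; i < 100; i++ { // simulate a batch registration
		nodes = append(nodes, fmt.Sprintf("node-%d", i))
		f.SetNodes(nodes)
	}
	rules := f.Filter()
	fmt.Println(len(rules), "rules after", f.compiles, "compilation(s)") // 100 rules after 1 compilation(s)
}
```

The trade-off is that the first reader after a burst pays the full compilation cost, so latency moves from the write path to the read path.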

2. fcd7a6d7 hscontrol/state,policy: add incremental peer map computation for new node additions

6 files, +738/-20

Instead of rebuilding the entire O(N^2) peer map when a new node registers, this adds incremental peer map updates that only compute peers for newly added nodes (O(K*N) where K = new nodes). Includes RefreshPeersForNodes which fixes a correctness bug where PutNode ran before SetNodes, causing stale policy data in peer computation. Also adds HasPolicyChange detection for SubnetRoutes and IsExitNode changes.

Files: node_store.go (incrementalSnapshot, refreshNodePeers, RefreshPeersForNodes), policy.go (SetNodes returns newNodeIDs, HasPolicyChange route detection), pm.go, state.go, policy_test.go, node_store_test.go
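
As a rough sketch of the incremental idea, the function below links one new node against the existing nodes only. The `peerMap` type, the `addNode` helper, and the rule that two nodes become peers when either can reach the other are assumptions made for illustration; they are not taken from the PR.

```go
package main

import "fmt"

// peerMap maps a node ID to the IDs of the peers it should see.
type peerMap map[int][]int

// addNode links one new node against the existing nodes only, so each
// registration costs O(N) policy checks instead of an O(N^2) rebuild.
func addNode(peers peerMap, existing []int, newID int, canAccess func(src, dst int) bool) {
	for _, other := range existing {
		// Assumption: visibility in either direction makes the two nodes peers.
		if canAccess(newID, other) || canAccess(other, newID) {
			peers[newID] = append(peers[newID], other)
			peers[other] = append(peers[other], newID)
		}
	}
}

func main() {
	// Hypothetical policy: nodes reach each other when their IDs share parity.
	sameParity := func(src, dst int) bool { return src%2 == dst%2 }

	peers := peerMap{}
	existing := []int{}
	for id := 1; id <= 6; id++ {
		addNode(peers, existing, id, sameParity)
		existing = append(existing, id)
	}
	fmt.Println(peers[6]) // [2 4]
}
```

This only stays correct while the policy inputs for existing nodes are unchanged; anything that alters them (tags, routes, the policy itself) still needs a wider refresh.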

3. 5370bdf9 hscontrol/policy: skip O(N) node iteration in Host.Resolve for non-CGNAT prefixes

2 files, +77/-12

Host.Resolve was iterating all N nodes for every host entry to check for CGNAT overlap. Since most ACL hosts entries reference external IPs (not CGNAT 100.64.0.0/10 or ULA fd7a:115c:a1e0::/48), this adds a fast-path prefixOverlapsCGNAT check that skips the node scan entirely for non-overlapping prefixes.

Files: types.go (prefixOverlapsCGNAT, Resolve fast path), policy_test.go (TestHostResolveCGNATSkip)
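
A fast-path check of this kind can be written with the standard `net/netip` package. The function name `prefixOverlapsCGNAT` comes from the commit description, but the body below is an assumption about how such a check might look, and `hostNeedsNodeScan` is a hypothetical wrapper added for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

var (
	cgnatRange = netip.MustParsePrefix("100.64.0.0/10")
	ulaRange   = netip.MustParsePrefix("fd7a:115c:a1e0::/48")
)

// prefixOverlapsCGNAT reports whether p shares any address with the ranges
// that node IPs are allocated from. Prefixes of different address families
// never overlap, so an IPv4 prefix is only compared against the CGNAT range.
func prefixOverlapsCGNAT(p netip.Prefix) bool {
	return p.Overlaps(cgnatRange) || p.Overlaps(ulaRange)
}

// hostNeedsNodeScan parses a hosts entry and reports whether the slow
// per-node scan would still be required for it.
func hostNeedsNodeScan(cidr string) (bool, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	return prefixOverlapsCGNAT(p), nil
}

func main() {
	for _, cidr := range []string{"10.0.0.0/8", "100.64.0.7/32", "fd7a:115c:a1e0::1/128", "0.0.0.0/0"} {
		scan, err := hostNeedsNodeScan(cidr)
		fmt.Println(cidr, scan, err)
	}
}
```

Note that a wide prefix such as `0.0.0.0/0` overlaps the CGNAT range and therefore still takes the slow path.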

4. 247dbd1b hscontrol/policy: add source matcher index cache for O(relevant) CanAccess checks

2 files, +147/-2

CanAccess was iterating all matchers for every source node. This adds getSrcMatcherIndices which pre-computes, per source node, which matcher indices have that node's IPs in their source set. Subsequent CanAccess calls only check relevant matchers instead of all N matchers.

Files: policy.go (srcMatcherCache, getSrcMatcherIndices, canAccessIndexed), policy_test.go (TestSourceMatcherIndexCache)
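
The indexing idea can be illustrated with a simplified matcher type. Everything in this sketch (`match`, `srcIndex`, `indicesFor`, the string node keys) is hypothetical and much simpler than headscale's real matcher; it only shows how caching the relevant matcher positions per source node shrinks later checks.

```go
package main

import (
	"fmt"
	"net/netip"
)

// match is a simplified stand-in for a compiled ACL matcher.
type match struct {
	srcs []netip.Prefix
	dsts []netip.Prefix
}

func prefixes(ss ...string) []netip.Prefix {
	out := make([]netip.Prefix, len(ss))
	for i, s := range ss {
		out[i] = netip.MustParsePrefix(s)
	}
	return out
}

func addrs(ss ...string) []netip.Addr {
	out := make([]netip.Addr, len(ss))
	for i, s := range ss {
		out[i] = netip.MustParseAddr(s)
	}
	return out
}

func containsAny(ps []netip.Prefix, ips []netip.Addr) bool {
	for _, p := range ps {
		for _, ip := range ips {
			if p.Contains(ip) {
				return true
			}
		}
	}
	return false
}

// srcIndex caches, per source node, the positions of the matchers whose
// source set contains one of that node's IPs.
type srcIndex struct {
	matchers []match
	cache    map[string][]int
}

func (s *srcIndex) indicesFor(node string, ips []netip.Addr) []int {
	if idx, ok := s.cache[node]; ok {
		return idx
	}
	idx := []int{}
	for i, m := range s.matchers {
		if containsAny(m.srcs, ips) {
			idx = append(idx, i)
		}
	}
	s.cache[node] = idx
	return idx
}

// canAccess only visits the matchers that can apply to this source node.
func (s *srcIndex) canAccess(src string, srcIPs, dstIPs []netip.Addr) bool {
	for _, i := range s.indicesFor(src, srcIPs) {
		if containsAny(s.matchers[i].dsts, dstIPs) {
			return true
		}
	}
	return false
}

func main() {
	idx := &srcIndex{
		matchers: []match{
			{srcs: prefixes("100.64.0.1/32"), dsts: prefixes("100.64.0.2/32")},
			{srcs: prefixes("100.64.0.3/32"), dsts: prefixes("100.64.0.4/32")},
		},
		cache: map[string][]int{},
	}
	fmt.Println(idx.indicesFor("node1", addrs("100.64.0.1")))                     // [0]
	fmt.Println(idx.canAccess("node1", addrs("100.64.0.1"), addrs("100.64.0.2"))) // true
	fmt.Println(idx.canAccess("node1", addrs("100.64.0.1"), addrs("100.64.0.4"))) // false
}
```

The cache is only valid for one set of matchers and one set of node IPs, so it has to be dropped whenever either changes.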

5. b1dafc92 hscontrol/policy: hoist invariant checks out of canAccessIndexed inner loop

1 file, +6/-2

Moves len(m.DstIPs) and m.IPProto checks out of the per-destination-IP inner loop in canAccessIndexed, since these values don't change per iteration.

Files: policy.go

6. 703960aa hscontrol/policy: add resolve cache, lightweight filter recompilation, and reachability tests

4 files, +2616/-7

Adds a per-compilation-cycle resolve cache for Group.Resolve, Username.Resolve, and Tag.Resolve. The same group/username/tag is referenced hundreds of times across 14K+ ACL rules but resolves identically within one update cycle. The cache achieves ~94% hit rate. Also makes ensureFilterCompiled lightweight — it only recompiles filter rules and matchers, skipping redundant tagOwner/autoApprover resolution that SetNodes already performed eagerly.

Includes comprehensive reachability test suite: equivalence tests, scale tests, dynamic join/leave, tag changes, subnet route overlap, peer symmetry, connection scenarios, and IP-based ACL rule tests (direct IPs, CIDR ranges, hosts entries, mixed identity+IP rules, IPv6).

Files: policy.go (resolveCache, lightweight ensureFilterCompiled), types.go (cache integration in Resolve methods), reachability_test.go (2200+ lines of reachability tests), reachability_ip_test.go (IP-based ACL tests)

7. b11db02a hscontrol/state: replace 500ms batch timeout with 1ms micro-batch drain

1 file, +108/-29

The NodeStore batched operations with a 500ms timeout, causing unnecessary latency for tag propagation and peer map updates. This replaces it with a 1ms micro-batch drain that processes operations as fast as they arrive while still coalescing concurrent writes. Throughput improved 2.5x.

Files: node_store.go (drainMicrobatch, revised processLoop)
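
One way to express a micro-batch drain is shown below. The name `drainMicrobatch` appears in the commit description, but this body is an assumption: it blocks for the first operation, then collects whatever else arrives within a short window of it, up to a hypothetical `maxBatch` cap.

```go
package main

import (
	"fmt"
	"time"
)

// microbatchWindow mirrors the 1ms window mentioned in the commit.
const microbatchWindow = time.Millisecond

// drainMicrobatch waits for one operation, then keeps collecting operations
// until the window (measured from the first operation) expires, the channel
// is closed, or maxBatch operations have been gathered.
func drainMicrobatch(ops <-chan int, window time.Duration, maxBatch int) []int {
	first, ok := <-ops
	if !ok {
		return nil // channel closed and empty
	}
	batch := []int{first}

	timer := time.NewTimer(window)
	defer timer.Stop()

	for len(batch) < maxBatch {
		select {
		case op, ok := <-ops:
			if !ok {
				return batch
			}
			batch = append(batch, op)
		case <-timer.C:
			return batch
		}
	}
	return batch
}

func main() {
	ops := make(chan int, 16)
	for i := 0; i < 5; i++ {
		ops <- i // concurrent writers would normally produce these
	}
	batch := drainMicrobatch(ops, microbatchWindow, 64)
	fmt.Println("coalesced", len(batch), "operations into one batch")
}
```

An idle store blocks on the first receive and uses no CPU, while a lone write waits at most one window before being applied.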

8. d802f381 hscontrol/state: store NodeIDs instead of NodeViews in peersByNode to eliminate GC overhead

2 files, +215/-183

Changed peersByNode from map[NodeID][]NodeView to map[NodeID][]NodeID. At 5000 nodes, the peer graph has ~25M entries. NodeView contains a pointer (forces GC to scan every entry), while NodeID is a uint64 (GC skips it entirely). ListPeers materializes fresh NodeViews from nodesByID at read time. GC overhead dropped from 50% to 3%.

Files: node_store.go (Snapshot type, snapshotFromNodes, shallowSnapshot, incrementalSnapshot, refreshNodePeers, ListPeers), node_store_test.go (updated assertions for NodeID storage)
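
The storage change can be pictured with the sketch below, which keeps only integer IDs in the peer graph and resolves them to node data at read time. The `Node` and `snapshot` types are simplified, hypothetical versions of the real ones.

```go
package main

import "fmt"

type NodeID uint64

// Node is a simplified stand-in for the data behind a NodeView.
type Node struct {
	ID       NodeID
	Hostname string
}

// snapshot keeps the peer graph as plain integers; the node data itself
// lives in exactly one place, nodesByID.
type snapshot struct {
	nodesByID   map[NodeID]*Node
	peersByNode map[NodeID][]NodeID // pointer-free slices, as the commit describes
}

// ListPeers materializes the peer list at read time from the ID slice.
func (s *snapshot) ListPeers(id NodeID) []*Node {
	ids := s.peersByNode[id]
	peers := make([]*Node, 0, len(ids))
	for _, pid := range ids {
		if n, ok := s.nodesByID[pid]; ok {
			peers = append(peers, n)
		}
	}
	return peers
}

func main() {
	s := &snapshot{
		nodesByID: map[NodeID]*Node{
			1: {ID: 1, Hostname: "alpha"},
			2: {ID: 2, Hostname: "beta"},
			3: {ID: 3, Hostname: "gamma"},
		},
		peersByNode: map[NodeID][]NodeID{1: {2, 3}},
	}
	for _, p := range s.ListPeers(1) {
		fmt.Println(p.Hostname)
	}
}
```

Reads become slightly more expensive because of the extra map lookup per peer, in exchange for the cheaper garbage-collection behavior reported above.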

9. a23ce2d1 hscontrol/policy: add pre-built node IP indexes for O(1) Username/Tag resolution

2 files, +103/-5

Username.Resolve and Tag.Resolve scanned all N nodes to build IP sets. This adds buildNodeIPIndexes which pre-builds nodeIPsByUser and nodeIPsByTag maps once per filter compilation cycle. Resolve methods do O(1) map lookups instead of O(N) scans.

Files: policy.go (buildNodeIPIndexes, integration into updateLocked/ensureFilterCompiled), types.go (nodeIPsByUser/nodeIPsByTag fields, Resolve fast paths)
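
A pre-built index of this kind might look like the sketch below. `buildNodeIPIndexes` is the name used in the commit, but the `node` and `ipIndexes` types here are hypothetical, IPs are kept as strings for brevity, and the real ownership rules for tagged nodes are deliberately left out.

```go
package main

import "fmt"

// node is a simplified stand-in for a registered node.
type node struct {
	user string
	tags []string
	ips  []string
}

// ipIndexes groups node IPs by user and by tag.
type ipIndexes struct {
	byUser map[string][]string
	byTag  map[string][]string
}

// buildNodeIPIndexes makes one O(N) pass over the nodes so that later
// per-rule lookups are plain map reads instead of repeated scans.
func buildNodeIPIndexes(nodes []node) ipIndexes {
	idx := ipIndexes{byUser: map[string][]string{}, byTag: map[string][]string{}}
	for _, n := range nodes {
		idx.byUser[n.user] = append(idx.byUser[n.user], n.ips...)
		for _, t := range n.tags {
			idx.byTag[t] = append(idx.byTag[t], n.ips...)
		}
	}
	return idx
}

func main() {
	idx := buildNodeIPIndexes([]node{
		{user: "alice", ips: []string{"100.64.0.1"}},
		{user: "alice", tags: []string{"tag:web"}, ips: []string{"100.64.0.2"}},
		{user: "bob", tags: []string{"tag:web", "tag:db"}, ips: []string{"100.64.0.3"}},
	})
	fmt.Println(idx.byUser["alice"]) // [100.64.0.1 100.64.0.2]
	fmt.Println(idx.byTag["tag:web"]) // [100.64.0.2 100.64.0.3]
}
```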

10. c04df66c mapper/batcher: wake processing loop on new changes for prompt delivery

2 files, +19/-0

Adds a wake channel that signals the batcher's doWork loop to process pending changes immediately instead of waiting for the next tick interval. This ensures node additions and policy changes are delivered to connected clients without the full batch delay, preventing stale peer lists in fast registration scenarios. Also includes UserProfiles in policyChangeResponse so newly visible peers have displayable identity information.

Files: mapper/batcher.go (wake channel, immediate processing signal), mapper/mapper.go (WithUserProfiles in policyChangeResponse)
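
The wake-channel pattern is a common Go idiom and can be sketched as below. The `batcher` type and its methods are hypothetical and far simpler than headscale's mapper; the point is the capacity-1 channel with a non-blocking send, which coalesces any number of wake-ups into one.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type batcher struct {
	mu      sync.Mutex
	pending []string
	wake    chan struct{} // capacity 1: at most one wake-up is ever queued
}

func newBatcher() *batcher { return &batcher{wake: make(chan struct{}, 1)} }

// add queues a change and nudges the loop without ever blocking the caller.
func (b *batcher) add(change string) {
	b.mu.Lock()
	b.pending = append(b.pending, change)
	b.mu.Unlock()
	select {
	case b.wake <- struct{}{}:
	default: // a wake-up is already pending; the loop will see this change too
	}
}

// flush returns and clears the pending changes.
func (b *batcher) flush() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.pending
	b.pending = nil
	return out
}

// run delivers batches on every tick and, additionally, as soon as it is woken.
func (b *batcher) run(tick <-chan time.Time, done <-chan struct{}, deliver func([]string)) {
	for {
		select {
		case <-done:
			return
		case <-tick:
		case <-b.wake:
		}
		if batch := b.flush(); len(batch) > 0 {
			deliver(batch)
		}
	}
}

func main() {
	b := newBatcher()
	done := make(chan struct{})
	delivered := make(chan []string, 1)
	ticker := time.NewTicker(time.Hour) // deliberately slow tick
	defer ticker.Stop()
	go b.run(ticker.C, done, func(batch []string) { delivered <- batch })

	b.add("node 42 registered")
	fmt.Println(<-delivered) // delivered via the wake signal, not the hourly tick
	close(done)
}
```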

Test coverage

  • Unit tests: All go test ./hscontrol/... pass at every commit
  • Reachability tests: 12 test functions covering equivalence, scale, symmetry, dynamic join/leave, tag changes, subnet routes, IP-based ACLs
  • IP-based ACL tests: 16 scenarios covering direct IPs, CIDR ranges, IPv6, hosts entries, mixed identity+IP rules, subnet route overlap, dynamic node changes
  • Live container test: 104/104 pass (50 containers, 10 users, 8 roles, 22 ACL rules)
  • Mega test: 149/149 pass (100 containers, comprehensive ACL scenarios)

Code changes (excluding tests and comments)

+605 lines added, -62 removed = 543 net new lines of production code


This PR was developed with Claude (Anthropic) running overnight, performing iterative pprof profiling, optimization, and correctness verification against live headscale instances with thousands of nodes and brutal ACL policies.

miyoyo and others added 12 commits March 23, 2026 22:48
Problem: During bulk node registration (e.g., 500 nodes joining with
unique preauth keys), SetUsers is called for every new node with the
same user list. Each call triggers updateLocked(), which recompiles
all ACL filter rules — an O(rules × nodes) operation. With a 14.6K-rule
policy and 500 nodes, this dominated CPU at 32% (61.8s of 192.7s total
samples), causing registration throughput to drop from 11/s to 3/s as
the node count grew.

Fix: Add a deephash-based short-circuit to SetUsers. Before triggering
updateLocked(), hash the incoming user list and compare it against the
previously stored hash. If the users haven't changed (which is the
common case during registration — the user list is stable), return
immediately without recompiling filters.

Impact: Registration of 500 nodes with unique preauth keys and a
145K-line brutal ACL policy improved from 139s (3.6 nodes/s) to
53s (9.4 nodes/s) — a 2.6x speedup. The 32% CPU from
compileFilterRules in the SetUsers path drops to ~0%.

How: Added a usersHash field (deephash.Sum) to PolicyManager. In
SetUsers, compute deephash.Hash(&users) and compare with the stored
hash before proceeding. This is safe because the hash captures the
full user list state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
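
The short-circuit described in this commit message can be illustrated with the standard library alone. The commit uses Tailscale's deephash; the sketch below swaps in SHA-256 over a JSON encoding as a stand-in, and the `user` and `policyManager` types are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

type user struct {
	ID   uint64
	Name string
}

// policyManager is a hypothetical stand-in that only tracks the user hash.
type policyManager struct {
	usersHash  [sha256.Size]byte
	hasHash    bool
	recompiles int // how often the expensive path ran
}

// hashUsers is a stdlib substitute for deephash: SHA-256 of the JSON form.
func hashUsers(users []user) ([sha256.Size]byte, error) {
	b, err := json.Marshal(users)
	if err != nil {
		return [sha256.Size]byte{}, err
	}
	return sha256.Sum256(b), nil
}

// SetUsers skips recompilation when the user list hashes to the stored value.
func (pm *policyManager) SetUsers(users []user) (changed bool, err error) {
	h, err := hashUsers(users)
	if err != nil {
		return false, err
	}
	if pm.hasHash && h == pm.usersHash {
		return false, nil // unchanged: nothing to recompile
	}
	pm.usersHash, pm.hasHash = h, true
	pm.recompiles++ // placeholder for the real filter recompilation
	return true, nil
}

func main() {
	pm := &policyManager{}
	users := []user{{ID: 1, Name: "alice"}, {ID: 2, Name: "bob"}}
	for i := 0; i < 500; i++ { // every registration passes the same list
		if _, err := pm.SetUsers(users); err != nil {
			panic(err)
		}
	}
	fmt.Println("recompilations:", pm.recompiles) // recompilations: 1
}
```

This stand-in hash is order-sensitive, so a reordered but otherwise identical list would trigger a recompile; that is a wasted compilation, not a correctness problem.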
Problem: bcrypt.CompareHashAndPassword is called on every preauth key
validation during node registration. Each call costs ~75ms of CPU time.
When using a single reusable preauth key with many concurrent
registrations (e.g., 50 parallel), bcrypt becomes a thundering herd
where 50 goroutines all perform the same expensive computation
simultaneously. Even with unique keys, bcrypt consumes 42% of total
CPU (38.75s of 91.7s samples).

Fix: Add a sync.Map-based singleflight cache keyed by
"prefix:sha256(bcryptHash)". Each cache entry uses sync.Once so that
only the first goroutine performs the actual bcrypt comparison; all
concurrent goroutines for the same key block and reuse the result.

Impact: For reusable keys (single key shared by many nodes), this
reduces bcrypt from O(N) computations to O(1) — a single bcrypt call
regardless of how many nodes register concurrently. For unique keys,
each key still requires one bcrypt call (cache entries are used once),
but the singleflight prevents duplicate work if the same key is
validated concurrently by multiple code paths.

How: Added bcryptCacheEntry struct with sync.Once + error, stored in a
package-level sync.Map. In findAuthKey, LoadOrStore the cache entry and
call once.Do with the bcrypt comparison. The cache key includes a SHA-256
of the stored bcrypt hash to ensure correctness if the hash is rotated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
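
The `sync.Map` plus `sync.Once` pattern from this commit message can be sketched without bcrypt by passing the expensive comparison in as a function. All names below are hypothetical. One deliberate difference from the description: the commit text gives the key as prefix plus a hash of the stored bcrypt hash, while this sketch also mixes in a hash of the presented secret. That is an assumption on my part, made so that a cached verdict for one secret cannot be reused for a different secret that shares the same prefix.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"sync/atomic"
)

// verifyEntry holds the result of one expensive comparison.
type verifyEntry struct {
	once sync.Once
	err  error
}

// verifyCache deduplicates concurrent and repeated comparisons per key.
type verifyCache struct {
	entries sync.Map // string -> *verifyEntry
}

// cacheKey binds the verdict to the key prefix, the presented secret, and
// the stored hash, so rotating the stored hash yields a fresh entry.
func cacheKey(prefix, presented, storedHash string) string {
	sum := sha256.Sum256([]byte(presented + "\x00" + storedHash))
	return prefix + ":" + hex.EncodeToString(sum[:])
}

// verify runs compare at most once per key; other callers block in once.Do
// until the first finishes and then reuse its result.
func (c *verifyCache) verify(prefix, presented, storedHash string, compare func() error) error {
	v, _ := c.entries.LoadOrStore(cacheKey(prefix, presented, storedHash), &verifyEntry{})
	e := v.(*verifyEntry)
	e.once.Do(func() { e.err = compare() })
	return e.err
}

func main() {
	var calls atomic.Int64
	slowCompare := func() error { // stands in for the bcrypt comparison
		calls.Add(1)
		return nil
	}

	cache := &verifyCache{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = cache.verify("hskey", "presented-secret", "stored-hash", slowCompare)
		}()
	}
	wg.Wait()
	fmt.Println("expensive comparisons:", calls.Load()) // expensive comparisons: 1
}
```

Entries in this sketch never expire, so both successes and failures stay cached for the life of the process. A production version would need eviction, and invalidation when a key is revoked or expires.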
Add a wake channel that signals the doWork loop to process pending
changes immediately instead of waiting for the next tick interval.
This ensures node additions and policy changes are delivered to
connected clients without the full batch delay, preventing stale
peer lists in fast registration scenarios.

Also include UserProfiles in policyChangeResponse so that newly
visible peers have displayable identity information in the netmap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sbatista-uc


These performance improvements are quite impressive, can you create a benchmark so myself and others can verify your claims with our own ACLs?

@miyoyo

miyoyo commented Apr 8, 2026

Author

Of course. Do you want me to use my own ACLs, or do you have your own that I can use?

@kradalby

kradalby commented Apr 8, 2026

Collaborator

Just want to put in that I am not actively ignoring this, and it probably would be helpful in the future.

We are mostly spending time on the planned work and trying to not get distracted. The other aspect is that we need to have sufficient test coverage to be comfortable changing some of this stuff. A lot of tests are going in recently and we might be getting closer to something like this, but no promises.

@miyoyo

miyoyo commented Apr 8, 2026

Author

I take no offense in you ignoring or even closing this and copying parts of it piecemeal, this is mostly AI generated with some optimisations I thought up (and a lot of AI tokens), I highly appreciate your work with Headscale and I understand you are busy, thank you for responding!

@sbatista-uc


@miyoyo I'm not comfortable sharing the ACL publicly as I'm using it for work within a highly competitive market. If you're interested in testing out your code and benchmarks against my ACL, I'd be happy to send it to you via a private message. Shoot me an email if you're interested: samuel.batista@usercentrics.com

hidden and others added 5 commits April 9, 2026 11:12
…group:self

When a policy uses autogroup:self (which expands differently per user),
ComputeNodePeers and BuildPeerMap previously recompiled matchers from
filter rules on every peer-pair check, then iterated ALL matchers to
find source matches — O(N² × M) where M is the matcher count.

Add three caches that persist across ComputeNodePeers calls within a
filter compilation cycle:

1. perNodeMatcherCache: caches []matcher.Match per node ID, avoiding
   repeated MatchesFromFilterRules allocations and GC pressure

2. perNodeSrcIdxCache: caches source-matcher indices per (srcNode,
   matcherOwner) pair, reducing CanAccess from O(M) to O(relevant)
   where relevant << M for large rule sets

3. getNodeMatchers/getPerNodeSrcIndices helper methods that lazily
   populate both caches

Both BuildPeerMap and ComputeNodePeers now use canAccessIndexed with
per-node source indices instead of the full CanAccess scan, matching
the optimization already used in the non-autogroup:self path.

All caches are invalidated on filter recompilation, policy updates,
and selective autogroup:self cache invalidation (user-scoped).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ations

When using autogroup:self policies, compileFilterRulesForNodeLocked calls
compileFilterRulesForNode which resolves every Username, Group, and Tag
in the ACL rules. Previously, the resolve cache and node IP indexes were
only enabled during ensureFilterCompiled (global filter compilation) and
torn down immediately after via defer. Per-node compilations ran without
any cache, causing O(users × rules) string matching on every call.

With 100 nodes and 5000 rules, this meant ~500K resolveUser calls doing
linear scans — 39.5% of total CPU in the profile.

Fix: persist the resolve cache and node IP indexes across the filter
compilation cycle for autogroup:self policies. Add ensureResolveCacheForCompilation
which lazily initializes the cache on first per-node compilation, and
clearResolveCache which tears it down on invalidation events (user changes,
node identity changes, policy updates).

Results (100 nodes, 5000 ACL rules):
  commit-12: 9.3s registration, resolveUser at 39.5% CPU
  commit-13: 6.0s registration (cached matchers)
  commit-14: 0.9s registration, ComputeNodePeers not in top 40

10x improvement over commit-12. At 500 nodes: 38.6s with 500/500 success,
CPU profile now dominated by bcrypt (89.9%) not peer computation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pilation

MatchFromStrings calls util.ParseIPSet for every source and destination
IP string in every filter rule. With 5000 rules × 950 nodes, the same
IP strings (node addresses like "100.64.0.1/32") are parsed millions of
times, each building an IPSetBuilder, normalizing ranges, and allocating.

Add a package-level ipSetCache that deduplicates ParseIPSet calls within
a compilation cycle. The cache is reset via ResetIPSetCache() whenever
filters are invalidated (node/user/policy changes).

Results (950 fake + 50 real clients, 5000 ACL rules):
  commit-14: 174.9s registration, MatchFromStrings at 22.1% CPU
  commit-15: 124.2s registration, MatchFromStrings at 12.8% CPU
  29% faster registration, 42% reduction in matcher compilation CPU

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When usesAutogroupSelf is true, compileFilterRulesForNode previously
compiled ALL ACL rules for each node — iterating 5000 rules to find
the ~2 that reference autogroup:self. The other 4998 rules produce
identical output regardless of which node is being compiled.

Split the compilation: compile non-autogroup:self rules once into
globalRulesForNode (cached on the Policy struct), then only compile
the autogroup:self ACLs per node. The per-node rules are combined
with the cached global rules before merging.

Results (950 fake + 50 real clients, 5000 ACL rules):
  commit-15: 124.2s registration, compileACLWithAutogroupSelf at 39.7%
  commit-16:  48.8s registration, compileACLWithAutogroupSelf gone from top
  2.5x faster registration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… for autogroup:self

MatchesFromFilterRules was converting all ~5000 compiled filter rules
into []matcher.Match for every node, even though the global rules
produce identical matchers. With 950 nodes this meant building the same
5000 matchers 950 times — parsing IP strings, building IPSets, sorting,
normalizing — all redundant.

Split getNodeMatchers into two phases:
1. globalMatcherCache: built once from non-autogroup:self rules (5000),
   shared by all nodes
2. Per-node matchers: built from only the ~2 autogroup:self ACLs

Add compileAutogroupSelfRulesForNode to filter.go for the per-node
compilation, separate from the global rules path.

Results (950 fake + 50 real clients, 5000 ACL rules):
  commit-16:  48.8s registration, MatchesFromFilterRules at 49%
  commit-17:   3.4s registration, ComputeNodePeers gone from profile
  14x faster than commit-16, 51x faster than commit-12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@miyoyo

miyoyo commented Apr 9, 2026

Author

> @miyoyo I'm not comfortable sharing the ACL publicly as I'm using it for work within a highly competitive market. If you're interested in testing out your code and benchmarks against my ACL, I'd be happy to send it to you via a private message. Shoot me an email if you're interested: samuel.batista@usercentrics.com

I'd rather avoid holding sensitive data on my end. I spent most of my remaining Claude tokens generating a nice way to do performance and correctness comparisons between my patches, which also let me discover some new patches that increase performance.

My tests were done on an Ultra 7 165H with 32GB of RAM, for reference.

You can find the harness in the following repository: https://github.com/miyoyo/headscale-3156-harness
You can use hsbench genpolicy -policy your_policy.hujson to benchmark both versions.

Below is an extension of the commit explanations that were in the first message in this conversation. The major difference is that the fake client used now holds a connection open, thus increasing the load, and my previous tests did not consider autogroup:self's performance.

I'm thinking it could be useful to try to condense all the commits down into a smaller patch, it's possible some of the steps taken by claude were unnecessary or negative. I'll see what I can remove by tomorrow.

I'm also still updating it to verify correctness, looks like the little goblin took some shortcuts in the harness.

---Anything below this point is AI generated---

Highly Concurrent Test: 2000 Nodes × 15000 ACL Rules × Concurrency 2000

All 2000 fake clients connect simultaneously.

| | headscale-base (main) | commit-17 (latest) |
|---|---|---|
| Nodes registered | 118 / 2000 (5.9%) | 2000 / 2000 (100%) |
| Registration time | 300.5s (timed out) | 11.3s |
| Map snapshots captured | 0 / 5 | 5 / 5 |
| Throughput | ~0.4 nodes/sec | 177 nodes/sec |
| Real clients connected | not attempted | 50 / 50 |
| Lifecycle tests passing | not reached | 7 / 7 |
| Peers visible | unknown | 2007 |
| Container status | alive but unresponsive | healthy |
| Relative speed | | ~443× |

Configuration

  • ACL rules: 15001 (generated, including autogroup:self)
  • Users: 666
  • Tags: 10 (exit-node, subnet-router, server, db, api, web, monitoring, ssh-target, vpn-gateway, client)
  • Groups: 200
  • Concurrency: 2000 (all nodes register simultaneously)
  • Real tailscale clients: 50 (commit-17 only, not reached on the base)
  • Platform: WSL2 AlmaLinux 10, podman, host networking

Commits 13–17: autogroup:self at scale

These commits address a second bottleneck that only surfaces when a policy uses autogroup:self, a rule type that expands differently per user and so forces per-node filter compilation. The April 9 profiling session found that with 950 nodes and 5000 ACL rules, the global optimizations from commits 1–10 had no effect on this path: registration still took 174 seconds.


13. bcef709 hscontrol/policy: cache per-node matchers and source indices for autogroup:self

1 file, +105/−30

ComputeNodePeers and BuildPeerMap called MatchesFromFilterRules on the per-node compiled filter on every peer-pair check — rebuilding the full []matcher.Match slice from scratch for each of the N² pairs. Two caches fix this: perNodeMatcherCache stores the compiled matchers per node ID so they're built once per node per cycle, and perNodeSrcIdxCache indexes source-matcher positions per (srcNode, matcherOwner) pair, extending the canAccessIndexed optimization (commit 4) to the autogroup:self path. Both caches are invalidated on policy updates, user changes, and selective autogroup:self invalidation.

Files: policy.go


14. 3a1c9ef hscontrol/policy: persist resolve cache across per-node filter compilations

1 file, +50/−5

The resolve cache added in commit 6 was only active during ensureFilterCompiled — a defer tore it down immediately after global compilation. Per-node compilations for autogroup:self ran without any cache, hitting O(users × rules) string matching on every compileFilterRulesForNode call. With 100 nodes and 5000 rules this meant ~500K resolveUser calls doing linear scans, accounting for 39.5% of total CPU. ensureResolveCacheForCompilation lazily initialises the cache on the first per-node call and clearResolveCache tears it down on invalidation events, keeping it alive across the full registration cycle. Result: 9.3s → 0.9s for 100 nodes/5000 rules (10x).

Files: policy.go


15. 7df1e8f hscontrol/policy/matcher: cache ParseIPSet results across matcher compilation

2 files, +28/−9

MatchFromStrings calls util.ParseIPSet for every IP string in every filter rule. With 5000 rules × 950 nodes, the same node addresses like 100.64.0.1/32 were parsed millions of times — each call allocating an IPSetBuilder, normalising ranges, and building a new set. A package-level ipSetCache deduplicates ParseIPSet calls within a compilation cycle, reset via ResetIPSetCache() on filter invalidation. With 950 fake + 50 real clients: 174.9s → 124.2s (29% faster), MatchFromStrings CPU 22.1% → 12.8%.

Files: matcher/matcher.go, policy.go
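
A per-cycle parse cache along these lines can be sketched with `net/netip`. The `prefixCache` type and its methods are hypothetical, and `netip.ParsePrefix` stands in for headscale's `util.ParseIPSet`, which does more work per call.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// prefixCache memoizes parsed prefixes for the duration of one compilation
// cycle; reset must be called whenever the inputs may have changed.
type prefixCache struct {
	mu      sync.Mutex
	entries map[string]netip.Prefix
	misses  int // number of real parses performed
}

func (c *prefixCache) parse(s string) (netip.Prefix, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.entries[s]; ok {
		return p, nil
	}
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, err // failures are not cached in this sketch
	}
	if c.entries == nil {
		c.entries = map[string]netip.Prefix{}
	}
	c.entries[s] = p
	c.misses++
	return p, nil
}

// reset drops every cached entry, for example on filter invalidation.
func (c *prefixCache) reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = nil
}

func main() {
	cache := &prefixCache{}
	for rule := 0; rule < 5000; rule++ { // the same address appears in many rules
		if _, err := cache.parse("100.64.0.1/32"); err != nil {
			panic(err)
		}
	}
	fmt.Println("real parses:", cache.misses) // real parses: 1
}
```

A package-level cache is shared mutable state, so it needs a lock (as here) and a reliable reset on every invalidation path; a missed reset would serve stale results.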


16. 1e92e0a hscontrol/policy: split per-node filter compilation for autogroup:self

3 files, +126/−3

compileFilterRulesForNode iterated all ACL rules for every node to find the ~2 that reference autogroup:self, compiling the other 4998 identically each time. compileNonAutogroupSelfRules now compiles the invariant rules once and caches the result on the Policy struct as globalRulesForNode. Per-node compilation only iterates the autogroup:self ACLs. compileFilterRulesForNode combines the cached global rules with the per-node result before merging. Result: 124.2s → 48.8s (2.5x), compileACLWithAutogroupSelf disappears from the CPU profile.

Files: filter.go (aclUsesAutogroupSelf, compileNonAutogroupSelfRules, split logic), policy.go, types.go (globalRulesForNode field)
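
The split between invariant and per-node rules can be sketched as follows. The `rule` and `policy` types, the string "compilation", and the counter are hypothetical; the rule counts in `main` mirror the ones quoted in the commit description.

```go
package main

import "fmt"

type rule struct {
	name     string
	usesSelf bool // true when the rule references autogroup:self
}

type policy struct {
	globalRules []rule
	selfRules   []rule
	globalCache []string // compiled once, shared by every node
	compiled    int      // individual rule compilations performed
}

// newPolicy partitions the rules once so per-node work never scans them all.
func newPolicy(rules []rule) *policy {
	p := &policy{}
	for _, r := range rules {
		if r.usesSelf {
			p.selfRules = append(p.selfRules, r)
		} else {
			p.globalRules = append(p.globalRules, r)
		}
	}
	return p
}

// compile is a placeholder for real filter-rule compilation.
func (p *policy) compile(r rule, node string) string {
	p.compiled++
	if r.usesSelf {
		return r.name + " for " + node
	}
	return r.name
}

// rulesForNode reuses the cached global rules and compiles only the
// autogroup:self rules for the given node.
func (p *policy) rulesForNode(node string) []string {
	if p.globalCache == nil {
		p.globalCache = make([]string, 0, len(p.globalRules))
		for _, r := range p.globalRules {
			p.globalCache = append(p.globalCache, p.compile(r, ""))
		}
	}
	out := make([]string, 0, len(p.globalCache)+len(p.selfRules))
	out = append(out, p.globalCache...)
	for _, r := range p.selfRules {
		out = append(out, p.compile(r, node))
	}
	return out
}

func main() {
	rules := []rule{{name: "autogroup:self", usesSelf: true}}
	for i := 0; i < 4998; i++ {
		rules = append(rules, rule{name: fmt.Sprintf("rule-%d", i)})
	}
	p := newPolicy(rules)
	for n := 0; n < 950; n++ {
		p.rulesForNode(fmt.Sprintf("node-%d", n))
	}
	fmt.Println("rule compilations:", p.compiled) // rule compilations: 5948
}
```

Without the split, the same workload would perform 4999 × 950 compilations. The sketch is not safe for concurrent use and omits cache invalidation, both of which a real implementation needs.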


17. 8741944 hscontrol/policy: cache global matchers, only build per-node matchers for autogroup:self

2 files, +85/−8

getNodeMatchers was calling MatchesFromFilterRules on the full per-node compiled rules for every node — converting all ~5000 rules into []matcher.Match including the global rules that are identical across all nodes. With 950 nodes this built the same 5000 matchers 950 times. globalMatcherCache holds the []matcher.Match built once from the non-autogroup:self rules; per-node compilation only runs MatchesFromFilterRules on the ~2 autogroup:self rules and appends them to the shared cache. Result: 48.8s → 3.4s (14x), ComputeNodePeers leaves the CPU profile entirely. Across the full autogroup:self sequence: 51x improvement over commit 12's baseline.

Files: filter.go (compileAutogroupSelfRulesForNode), policy.go (globalMatcherCache, revised getNodeMatchers)

Total line count

+1301 / -127 = 1174 net lines of code.

@Cediddi

Cediddi commented Apr 19, 2026


Honestly, this PR is great help for me. I had already hit the max device count (8core 16G) of my server, and I was looking for a way to scale horizontally or find optimizations I could apply to allow more devices to be registered (already on Postgres).

Is there any way I can help you? (background: python)

@miyoyo

miyoyo commented Apr 20, 2026

Author

> Honestly, this PR is great help for me. I had already hit the max device count (8core 16G) of my server, and I was looking for a way to scale horizontally or find optimizations I could apply to allow more devices to be registered (already on Postgres).
>
> Is there any way I can help you? (background: python)

Glad it's helped!
I've spent some time (unfortunately not enough) trying to minimize the changes I make to the latest master so that this PR is not as big, but I doubt there is an easy way to do it without too much impact.

The conclusion I keep reaching, both when trying manually and from the many background research agents I have running, is that a lot of the effort comes from the architecture of the rule parsing being node-based (what can this node reach?) rather than edge-based (given this rule, which nodes are affected?).

I haven't tried that angle just yet (sorry about the timing, life gets in the way), and there seem to be many edge cases that my comparison tests keep catching (I run headscale master and do a bunch of things, record all of the server's input, output and state, then compare to my modifications). It's a surprisingly complicated problem, but it's solvable, I'm sure of it.

As for what you can do for this PR, I'm not quite sure myself. Go is a fairly straightforward language to learn, so if you want to pick one of the issues and have a go [ ;) ] at it, I'm sure the maintainers will be glad. Do try to keep it to minimal changes and write extensive tests while doing it.

@reflog

reflog commented May 1, 2026


Just wanted to +1 this, the PR is working fantastically on our 1400-node headscale.
We've ported the changes to 0.28 and it resolved the lock contention on map update completely.

