
rotation of outbound peers#3037

Draft
pompon0 wants to merge 44 commits into main from gprusak-stablehash

Conversation

pompon0 (Contributor) commented Mar 6, 2026

To make the p2p network uniformly random, nodes need to be able to change their peers periodically (otherwise the network would stay centralized around the bootstrap peers). This PR implements an algorithm based on stable hashing, in which each node assigns a random priority to every node ID and connects to the peers with the highest priority.
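As a rough sketch of the idea (all names here are hypothetical, not the PR's actual API): each node ranks candidate peer IDs by a stable hash of the ID mixed with a node-local seed, keeps connections to the top-ranked candidates, and replaces its lowest-priority connection when a higher-priority candidate appears.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// priority derives a stable pseudo-random rank for a peer ID by hashing it
// together with a node-local seed, so each node ranks peers differently but
// consistently. sha256 is a stand-in for whatever hash the PR actually uses.
func priority(seed []byte, id string) uint64 {
	h := sha256.New()
	h.Write(seed)
	h.Write([]byte(id))
	return binary.LittleEndian.Uint64(h.Sum(nil)[:8])
}

// maybeReplace decides whether candidate should displace the current
// lowest-priority connection, keeping the degree bounded while letting the
// topology drift toward a random graph.
func maybeReplace(seed []byte, conns map[string]bool, candidate string, maxConns int) (evict string, ok bool) {
	if len(conns) < maxConns {
		return "", true // room left: just connect
	}
	lowest, lowestP, first := "", uint64(0), true
	for id := range conns {
		if p := priority(seed, id); first || p < lowestP {
			lowest, lowestP, first = id, p, false
		}
	}
	if priority(seed, candidate) > lowestP {
		return lowest, true // candidate outranks our worst connection
	}
	return "", false
}

func main() {
	seed := []byte("node-local-seed")
	conns := map[string]bool{"peerA": true, "peerB": true}
	evict, ok := maybeReplace(seed, conns, "peerC", 2)
	fmt.Println(evict, ok) // whether peerC outranks the worst of A/B depends on the hash
}
```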

This PR also separates the inbound and outbound connection pools to simplify the logic. In particular, it is possible for 2 concurrent connections to exist between 2 peers (one inbound + one outbound). Avoiding duplicate connections is best-effort: nodes try not to dial peers they are already connected to, but if 2 peers dial each other at the same time, they let both connections be.

Basic requirements:

  • peermanager maintains a pex table: addresses of peers of our peers. This bounds the size of the network view that our node needs to maintain (even though the network is unauthenticated) and provides enough exposure to select new peers. Peers periodically report their connections, so peermanager keeps only fresh addresses
  • on startup a spike of dials is expected - the node will try to connect to the peers it was connected to before the restart
  • during stable operation the node will dial peers at a low rate, like 1/s or 0.1/s, and candidates should be selected from a fresh set of addresses - i.e. we cannot snapshot a list of currently available addresses and try to dial them all (that would take hours)
  • despite the low dial rate, the node should attempt to round-robin over the ever-changing set of peer candidates - i.e. it should not get stuck dialing the same bad address over and over
  • in the stable-hash-based approach, each peer ID obtains a priority for dialing - it should be taken into account
  • the implementation should support replacing low-priority peers with higher-priority peers (to support convergence to a random graph). Connection churn should be low, though, so that connection efficiency is not affected. My initial guesstimate is that we should allow replacing a connection every ~1 min.
  • we need to support a pex table with ~100 * 100 = 10k addresses in total (100 connections per peer is a safe estimate with the current implementation). Whether we can afford to simply rank all the addresses on every dial attempt is borderline IMO.
  • the addresses inserted into the pex table should be made available for dialing ASAP, without any active polling, if possible

github-actions bot commented Mar 6, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ✅ passed | ✅ passed | ✅ passed | Mar 11, 2026, 5:50 PM

Comment on lines +28 to +37
for id := range s.last {
	if _, ok := GetAny(conns, id); !ok {
		delete(s.last, id)
		update = PeerUpdate{
			NodeID: id,
			Status: PeerStatusDown,
		}
		return true
	}
}

Check warning (Code scanning / CodeQL): Iteration over map

Iteration over map may be a possible source of non-determinism.
codecov bot commented Mar 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.24%. Comparing base (31c1592) to head (b1c0b8b).
⚠️ Report is 39 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3037      +/-   ##
==========================================
- Coverage   58.30%   58.24%   -0.07%     
==========================================
  Files        2108     2108              
  Lines      173683   173318     -365     
==========================================
- Hits       101268   100949     -319     
+ Misses      63397    63366      -31     
+ Partials     9018     9003      -15     
Flag | Coverage Δ
sei-db | 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines | Coverage Δ
sei-tendermint/config/config.go | 69.88% <ø> (-0.35%) ⬇️
sei-tendermint/internal/p2p/channel.go | 77.14% <ø> (-0.64%) ⬇️
sei-tendermint/internal/p2p/peermanager.go | 87.50% <100.00%> (+2.92%) ⬆️
sei-tendermint/internal/p2p/peermanager_pool.go | 100.00% <100.00%> (ø)
sei-tendermint/internal/p2p/pex/reactor.go | 90.10% <ø> (+0.97%) ⬆️
sei-tendermint/internal/p2p/router.go | 85.47% <ø> (+0.67%) ⬆️
sei-tendermint/internal/p2p/routeroptions.go | 83.78% <ø> (-2.58%) ⬇️
sei-tendermint/internal/p2p/testonly.go | 80.47% <ø> (+0.56%) ⬆️
sei-tendermint/internal/p2p/transport.go | 88.46% <ø> (+2.09%) ⬆️
sei-tendermint/libs/utils/option.go | 93.54% <ø> (+2.63%) ⬆️
... and 3 more

return o.value
}
panic("value missing")
panic(msg)

Check warning (Code scanning / CodeQL): Panic in BeginBlock or EndBlock consensus methods

Possible panics in BeginBlock- or EndBlock-related consensus methods could cause a chain halt.
type connSet[C peerConn] = im.Map[connID, C]

func GetAny[C peerConn](cs connSet[C], id types.NodeID) (C, bool) {
	if c, ok := cs.Get(connID{id, true}); ok {
		return c, true
	}
Contributor:

So we will always prefer the outbound connection over the inbound one? Would we expect to have two connections between most stable peers?

Contributor (Author):

Users of GetAny do not distinguish inbound from outbound, so the fact that we allow both to coexist in some corner cases is an implementation detail.

Contributor (Author):

In the current implementation we do NOT expect to have 2 connections between stable peers. We try to avoid it best-effort (via a dial rule which disallows dialing if an inbound connection already exists), but to keep the internal peermanager logic relatively simple, we allow 2 connections to coexist in case of a race condition during dialing.
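The dial rule described here can be sketched as follows (a simplification with hypothetical names; the real code keys connections by `connID` into an immutable map):

```go
package main

import "fmt"

// connID identifies a connection by peer and direction, mirroring the
// (NodeID, outbound) pair used in the PR's connSet.
type connID struct {
	nodeID   string
	outbound bool
}

// shouldDial applies the best-effort duplicate-avoidance rule: skip dialing a
// peer we already have any connection to, in either direction. A race where
// two peers dial each other simultaneously can still create two connections;
// that case is tolerated rather than prevented.
func shouldDial(conns map[connID]struct{}, id string) bool {
	_, in := conns[connID{id, false}]
	_, out := conns[connID{id, true}]
	return !in && !out
}

func main() {
	conns := map[connID]struct{}{
		{"peerA", false}: {}, // existing inbound connection to peerA
	}
	fmt.Println(shouldDial(conns, "peerA")) // false
	fmt.Println(shouldDial(conns, "peerB")) // true
}
```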

for inner, ctrl := range m.inner.Lock() {
	// Start with the pool which has NOT dialed previously (for fairness).
	pools := utils.Slice(inner.persistent, inner.regular)
	if pools[0] == inner.lastDialPool {
Contributor:

I don't understand the design choice here. I assume persistent peers are configured "friends", so they should always be dialed ASAP, and since the user configured them, it's unlikely there are many invalid entries anyway. Am I missing anything?

Contributor (Author):

Persistent peers are very likely to have bad addresses: people add persistent peers but don't keep the list healthy. Also, in case any of the persistent peers goes down, we don't want it to block dialing regular peers.
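The fairness rule from the quoted snippet can be sketched like this (hypothetical names; the real code swaps actual pool objects, not strings): alternate which pool gets the first shot at the next dial, so a pool full of dead addresses cannot starve the other.

```go
package main

import "fmt"

// dialer remembers which pool dialed last, mirroring inner.lastDialPool.
type dialer struct {
	lastDialPool string
}

// nextPoolOrder returns the pools in the order they should be tried for the
// next dial: whichever pool dialed last goes second.
func (d *dialer) nextPoolOrder() []string {
	pools := []string{"persistent", "regular"}
	if pools[0] == d.lastDialPool {
		pools[0], pools[1] = pools[1], pools[0]
	}
	d.lastDialPool = pools[0]
	return pools
}

func main() {
	d := &dialer{}
	fmt.Println(d.nextPoolOrder()) // [persistent regular]
	fmt.Println(d.nextPoolOrder()) // [regular persistent]
	fmt.Println(d.nextPoolOrder()) // [persistent regular]
}
```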

for inner, ctrl := range p.inner.Lock() {
inner.DialFailed(addr)
// DialFailed notifies the peer manager that dialing addresses of id has failed.
func (m *peerManager[C]) DialFailed(id types.NodeID) error {
Contributor:

Would we mark the node failed on temporary failure as well? Should we have more complicated model to distinguish between temporary failure and long term failure (like NAT addresses)?

Contributor (Author):

we could, what do you propose?

Contributor:

Not in this PR, but in the future we could give a rating to how flaky some address is and pass the information along to peers maybe.

if len(addrs) == 0 {
return nil
}
func (m *peerManager[C]) PushPex(sender types.NodeID, addrs []NodeAddress) error {
Contributor:

How much verification does PushPex give us?

Assuming we don't have many malicious peers who send bad addresses, do we trust incoming address blindly?

Contributor (Author):

PushPex keeps a separate capacity for each sender (to avoid an unbounded flood of bad addresses). Other than that, we trust them blindly. I was thinking about gossiping signed, versioned addresses (so that gossiping fake addresses of honest nodes is not feasible), but I'm not sure atm how much value it would bring or whether it would be computationally feasible.

r.peerManager.DialFailed(id)
return err
}
dialAddrRaw := hConn.conn.RemoteAddr()
Contributor:

Do we save IP:port instead of hostname now? What if the hostname later resolves to a different IP:port?

Contributor (Author):

If it resolves to different IP:port, then the TCP connection will terminate and we will stop advertising the IP:port, since we are no longer connected to the peer. I would like to eventually disallow gossiping dns addresses, in favor of resolving them locally first, but for backward compatibility and to limit the scope of changes, we do the resolution only when dialing the peer (i.e. SelfAddress stays unresolved for now).

	inner utils.Watch[*peerManagerInner[C]]
	conns utils.AtomicRecv[connSet[C]]
}

func (p *peerManager[C]) LogState() {
for inner := range p.inner.Lock() {
p.logger.Info("p2p connections",
Contributor:

Do we have metrics showing roughly how many peers are oscillating? Could it be that a few bad peers contributed to the most of the disconnections? Are those permanent bad addresses like NAT addresses or just having network issues?

Contributor (Author):

It is very likely that the preconfigured bad persistent addrs are responsible for most of the dial failures. We have metrics counting TCP dials/accepts, but they are not attributed to specific node IDs. Do you have a specific metric in mind that you would like to have here? NodeID/IP have too large a cardinality to use as a prometheus metric label, but perhaps we can collect it in some smarter way. We also have the net_info http endpoint that we could expose such stats through in a dashboard. Or we can include them in log error messages (depending on the verbosity needed).

pompon0 (Contributor, Author) commented Mar 9, 2026:

NAT is not a problem, because we do not support gossiping unidirectional NAT addresses - we only gossip outbound dial addresses (i.e. addresses that we successfully dialed), or preconfigured ExternalAddresses (self-declared static addresses). About private network addresses (including private ips, kubernetes internal dns, etc.) - I don't know the scale of the problem.

Comment on lines +66 to +70
for _, e := range t.bySender {
if !yield(e) {
return
}
}

Check warning (Code scanning / CodeQL): Iteration over map

Iteration over map may be a possible source of non-determinism.
Comment on lines +142 to +146
for old, priority := range p.out {
if id, ok := low.Get(); !ok || priority < id.priority {
low = utils.Some(pNodeID{priority, old})
}
}

Check warning (Code scanning / CodeQL): Iteration over map

Iteration over map may be a possible source of non-determinism.
ID types.NodeID
Channels ChannelIDSet
DialAddr utils.Option[NodeAddress]
SelfAddr utils.Option[NodeAddress]
Contributor:

nit: DialAddr here means the outbound address, while SelfAddr is the self-declared address (so it's used for inbound connections)?

The names are a bit confusing.

regular: newPoolManager(&poolConfig{
MaxIn: options.maxInbound(),
MaxOut: options.maxOutbound(),
FixedAddrs: bootstrapAddrs,
Contributor:

Is it possible that someone puts all bootstrap addresses in the persistent list, so FixedAddrs ends up empty?

// It makes the global topology converge to an uniformly random graph
// of a bounded degree.
priority := func(id types.NodeID) uint64 {
return binary.LittleEndian.Uint64(h.Sum([]byte(id)[:8]))
Contributor:

I think this is wrong. h.Sum(b) appends the hash digest to b instead of hashing b, and Uint64 just takes the first 8 bytes of its argument, so we end up using the first 8 bytes of id, not the hash.

}
return nil
})
return nil
Contributor:

Should the return nil be here? We don't want to exit the loop, right?

// We cannot keep it unbounded to avoid OOM.
const maxHistory = 10000
if len(p.dialHistory) >= maxHistory {
p.dialHistory = map[types.NodeID]struct{}{}
Contributor:

Should we do a total reset here or should we just clear part of the history?
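One way to clear only part of the history, along the lines this question suggests (a sketch with hypothetical names, not what the PR does - it currently resets the whole map):

```go
package main

import "fmt"

// evictHalf drops roughly half of the entries instead of resetting the whole
// history, so recent dial-avoidance information is not lost all at once.
// Which half is dropped is arbitrary (Go map iteration order), which is
// acceptable for a best-effort dedup set.
func evictHalf(history map[string]struct{}) {
	target := len(history) / 2
	for id := range history {
		if len(history) <= target {
			break
		}
		delete(history, id)
	}
}

func main() {
	h := map[string]struct{}{}
	for i := 0; i < 10; i++ {
		h[fmt.Sprintf("peer%d", i)] = struct{}{}
	}
	evictHalf(h)
	fmt.Println(len(h)) // 5
}
```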

@@ -157,77 +133,6 @@ func TestReactorErrorsOnReceivingTooManyPeers(t *testing.T) {
testNet.listenForPeerDown(t, 1, 0)
}

func TestReactorSmallPeerStoreInALargeNetwork(t *testing.T) {
Contributor:

Are these tests replaced by new tests somewhere?

}

// Connected adds conn to the connections pool.
// Connected peer won't be available for dialing until disconnect (we don't need duplicate connections).
// May close and drop a duplicate connection already present in the pool.
// Returns an error if the connection should be rejected.
func (m *peerManager[C]) Connected(conn C) error {
id := conn.Info().connID()
if id.NodeID == m.selfID {
Contributor:

when would this ever happen?

pool := inner.poolByID(id.NodeID)
toDisconnect, err := pool.Connect(id)
if err != nil {
conn.Close()
Contributor:

Hmm, we are closing a connection just established, when would this be necessary?

}
return ids
}

// DEPRECATED, currently returns an address iff we are connected to id.
Contributor:

nit: this one is still used somewhere right?

selfID := privKey.Public().NodeID()
peerManager := newPeerManager[*ConnV2](logger, selfID, options)
peerDB, err := newPeerDB(db, options.maxPeers())
peerDB, err := newPeerDB(db, min(options.maxOutbound(), 100))
Contributor:

nit: why 100?
