Skip to content

fix: add thread-safety lock to shared ClassVar state#71

Closed
dadofsambonzuki wants to merge 1 commit into
lnbits:mainfrom
dadofsambonzuki:fix-thread-safety
Closed

fix: add thread-safety lock to shared ClassVar state#71
dadofsambonzuki wants to merge 1 commit into
lnbits:mainfrom
dadofsambonzuki:fix-thread-safety

Conversation

@dadofsambonzuki

Copy link
Copy Markdown

Problem

NostrRouter uses three ClassVar collections to shuttle events from relays to WebSocket clients:

  • received_subscription_events
  • received_subscription_eosenotices
  • received_subscription_notices

These are written by a daemon thread (the subscribe_events background task via callbacks in tasks.py) and read/deleted by asyncio tasks (the _nostr_to_client loop in router.py). No synchronization protects these shared data structures.

This race condition causes non-deterministic loss of events and EOSE messages. Downstream consumers like the NWC Service Provider wait indefinitely for EOSE that never arrives, causing "all promises rejected" errors in client wallets.

Fix

  • Added threading.Lock as a ClassVar on NostrRouter
  • Wrapped all reads/writes to the three shared collections with with NostrRouter.lock:
  • Only the actual data mutation is locked (not the await send_text calls), minimizing lock contention

The lock is held only for the duration of the list/dict mutation (pop, del, read), not across the await that sends the event to the WebSocket client. This keeps critical sections minimal while ensuring thread safety.

Testing

  • Verified the provider processes requests reliably without EOSE loss
  • No behavioral change when there is no contention (single client)
  • Python syntax verified

@dadofsambonzuki

Copy link
Copy Markdown
Author

Note: We also had to disable uvicorn's default WebSocket ping timeout (ws_ping_interval=None, ws_ping_timeout=None) in lnbits/server.py to prevent the NWC provider ↔ local nostrclient relay WebSocket from being killed every 30–90s. This happens because uvicorn's default 20s ping/pong timeout can fire under event-loop load (e.g. many concurrent payment status checks from old LNDHub records after switching backends).

This change is in core LNbits (lnbits/lnbits/server.py), not in this extension. Should it be proposed as an upstream PR to lnbits/lnbits, or is there a better way to handle it (e.g. making ws_ping_interval configurable via env var)?

@blackcoffeexbt

blackcoffeexbt commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@motorina0 Are you happy with the architectural changes? This PR comment notes that a change is required in core too - #71 (comment)

I've not tested the changes yet, waiting on your OK/NOK before testing.

@dadofsambonzuki

Copy link
Copy Markdown
Author

Have received no NWC client errors in last 12hrs running this.

@dadofsambonzuki

Copy link
Copy Markdown
Author

Still running clean after 36hrs and heavy zap session last night 😂

@arcbtc

arcbtc commented Jun 29, 2026

Copy link
Copy Markdown
Member

Nice pr, changes are pretty minimal. Could you pr core as well and reference here?
Usually extension PRs that also impact core are discouraged and very hard to get merged, but this core change looks like it might be a good change anyway.
This pr will be merged first #70, so there might be a little conflict, but should be easy enough to fix.
For now tests need fixing

@motorina0 motorina0 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
not tested

@motorina0

Copy link
Copy Markdown
Collaborator

linting failed

NostrRouter uses ClassVars (received_subscription_events,
received_subscription_eosenotices, received_subscription_notices)
that are written by a daemon thread (via subscribe_events callbacks)
and read/deleted by asyncio tasks (via _handle_subscriptions).
Without synchronization this causes lost events, dropped EOSE
messages, and data corruption.

Add a threading.Lock and wrap all shared state access with it.
Only the actual data mutation is locked, not the await send_text
calls.
@dadofsambonzuki

dadofsambonzuki commented Jun 30, 2026

Copy link
Copy Markdown
Author

Linting fixed and WS Ping interval PR opened in Core. #4028

However, PR #70 (use_sdk_more) fundamentally changes the threading model — it removes the daemon thread in tasks.py and runs nostr_client.subscribe() as an async task in the event loop directly. This eliminates the thread-level race condition entirely (callbacks and router tasks run in the same event loop, so cooperative multitasking prevents true concurrency).

PR #71's threading.Lock fix would conflict with PR #70:

If PR #70 is merged, PR #71 should be closed as superseded.

Core PR still legit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants