[scd] Time-bound subscription locks and add lock diagnostics #1384
ashishjsharda wants to merge 1 commit into interuss:master from
Conversation
barroco left a comment
Thank you very much @ashishjsharda for this contribution.
Please see comments inline.
> * Lock query duration is greater than or equal to `500ms`
> * Number of matched/locked rows is greater than or equal to `1000`
How were those values selected?
Great question. They were chosen as pragmatic initial guardrails to keep logs low-noise while still surfacing clear outliers:

* `500ms` was selected as a "this is unexpectedly slow for lock acquisition" threshold relative to our normal successful lock path.
* `1000` matched rows was selected as a "potentially broad lock scope" signal (large enough to avoid routine noise, small enough to catch problematic cases).

These are intentionally heuristic and meant for operational triage, not hard SLO limits. I can add a short note in the doc clarifying that these are initial defaults and may be tuned based on production observations.
Ok, thanks. Looking at the graphs on the original issue, those values may generate log lines for every request.
While F3548-21 has no requirement on DSS response time for Operational Intents in SCD, the standard seems to assume that 5 seconds is a reasonable upper bound for DSS responses in general. Therefore, 4 seconds may be more appropriate based on the experience of the deployed DSS pool referred to in the issue.
Regarding the matched rows, it may possibly be removed; see my comments below.
> The log message is `SCD lock query diagnostics` and includes:
nit: I would suggest a log message stating more explicitly the reason for the log line, such as `Expensive SCD lock detected`.
Good suggestion, agreed. I'll update the diagnostic log message to make the reason explicit (for example, `Expensive SCD lock detected`) while keeping the same structured fields (`duration`, `matched_rows`, `cell_count`, etc.) so existing analysis remains straightforward.
> const (
>     // These thresholds keep lock diagnostics low-noise in production while still surfacing likely bottlenecks.
>     lockQuerySlowThreshold = 500 * time.Millisecond
>     lockQueryLargeRowCount = 1000
nit: It may be useful to have those values configurable so they can be adjusted per use case. This can be done in a follow-up PR.
Agreed. For this PR I kept constants to minimize scope and risk, but making these thresholds configurable is a sensible follow-up. I can open a follow-up issue/PR to expose them via config (with the current values as defaults) so deployments can tune per workload profile.
> return stacktrace.Propagate(err, "Error in query: %s", query)
> }
>
> matchedRows := int(result.RowsAffected())
According to the documentation, it seems to always return 0. Do you know if the package pgxpool is implemented differently? In this case, there is only one row in this table, so we can remove that part of the check.
> return stacktrace.Propagate(err, "Error in query: %s", query)
> }
>
> matchedRows := int(result.RowsAffected())
Similarly, this seems to always return 0. Is `RowsAffected` of pgxpool behaving differently than the SQL definition?
If it is not the case, I suggest removing the check on row count. Counting with a query would be too expensive and may make the performance problem worse.
Thanks for flagging this. I double-checked the pgx/pgconn behavior: `Exec` returns a `CommandTag`, and for PostgreSQL `SELECT` the tag includes a row count (`SELECT n`), so `RowsAffected()` is not universally 0 for this query type.
That said, your broader point about usefulness/noise is valid. To keep diagnostics robust and simple, I'm fine removing `matched_rows` from the warning criteria and relying on duration-based alerting.
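For context on the behavior discussed here: PostgreSQL command tags carry the row count as their final space-separated token (e.g. `SELECT 5`, `INSERT 0 1`), which is what `RowsAffected()` parses. A self-contained sketch of that parsing (an illustration of the mechanism, not pgx's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rowsAffected mimics how a PostgreSQL command tag such as "SELECT 5" or
// "INSERT 0 1" carries a row count in its final token. Tags without a
// trailing integer (e.g. "BEGIN") yield 0.
func rowsAffected(commandTag string) int64 {
	parts := strings.Fields(commandTag)
	if len(parts) == 0 {
		return 0
	}
	n, err := strconv.ParseInt(parts[len(parts)-1], 10, 64)
	if err != nil {
		return 0 // no count in this tag
	}
	return n
}

func main() {
	fmt.Println(rowsAffected("SELECT 5"))   // 5
	fmt.Println(rowsAffected("INSERT 0 1")) // 1
	fmt.Println(rowsAffected("BEGIN"))      // 0
}
```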
> WHERE
>     cells && $1
>     OR
>     (
@the-glu can you please evaluate the impact of this change with the load tests ?
Agreed. I haven’t posted comparative load-test numbers yet. I’ll run the SCD loadtest scenarios (overlapping and non-overlapping) on this branch and share before/after results focused on p95/p99 latency and warning volume.
This seems to improve the situation with the default tests.
However, this is highly dependent on the flight durations. The load tests use flights of 10s; switching to 90s shows no improvement (but that is expected).
I noticed however that the lock may be wrong (but that was already the case for cells): `notifyVolume` can be different from `validParams`, explaining some transaction restarts.
This may increase those transaction restarts, as the lock is now more specific (and may be more 'wrong'), but only in the case of updates.
Thanks for the clarification @ashishjsharda. While looking at it again, I am not sure about the usage of `RowsAffected`. Can you please confirm the behaviour is correct?
Problem

Issue #1311 reports high latency and timeouts for SCD operations under overlapping traffic patterns. Lock acquisition in `LockSubscriptionsOnCells` becomes expensive when many subscriptions share S2 cells, regardless of temporal relevance.

Solution

Narrow the lock scope to subscriptions that overlap both the relevant S2 cells and the operation time window, while preserving correctness by always locking explicitly referenced subscription IDs. Add low-noise diagnostics to improve production observability of costly lock queries.
Changes

1. Time-bounded lock scope

Updated `LockSubscriptionsOnCells` to accept `startTime` and `endTime`, locking rows matching:

* `cells && $1`
* `COALESCE(starts_at <= endTime, true)`
* `COALESCE(ends_at >= startTime, true)`
* `OR id = ANY(explicitSubscriptionIDs)`

2. Call-site updates

Updated OIR mutation paths to pass the relevant time window:

* `old.StartTime` / `old.EndTime`
* `validParams.uExtent.StartTime` / `validParams.uExtent.EndTime`

3. Lock diagnostics (thresholded, low-noise)

Warnings emitted when:

* `duration >= 500ms`, or
* `matched_rows >= 1000`

Fields logged:

* `global_lock`
* `duration`
* `matched_rows`
* `cell_count`
* `explicit_subscription_id_count`

Failure warnings:

* `SCD global lock query failed`
* `SCD subscription lock query failed`

4. Documentation

Extended `docs/operations/performances.md` with diagnostics behavior, fields, and thresholds.

Validation

* `go test ./pkg/scd/store/datastore` — passed
* `go test ./pkg/scd/...` — passed locally

Risk / Compatibility

`LockSubscriptionsOnCells` interface signature updated; all in-repo call sites patched.

Follow-up

Closes #1311