
[scd] Time-bound subscription locks and add lock diagnostics #1384

Open
ashishjsharda wants to merge 1 commit into interuss:master from ashishjsharda:fix/scd-lock-latency-1311

Conversation

@ashishjsharda

Problem

Issue #1311 reports high latency and timeouts for SCD operations under overlapping traffic patterns. Lock acquisition in LockSubscriptionsOnCells becomes expensive when many subscriptions share S2 cells, regardless of temporal relevance.

Solution

Narrow lock scope to subscriptions that overlap both the relevant S2 cells and the operation time window, while preserving correctness by always locking explicitly referenced subscription IDs. Add low-noise diagnostics to improve production observability of costly lock queries.

Changes

1. Time-bounded lock scope

Updated LockSubscriptionsOnCells to accept startTime and endTime, locking rows matching:

  • (cells && $1 AND COALESCE(starts_at <= endTime, true) AND COALESCE(ends_at >= startTime, true)), OR
  • id = ANY(explicitSubscriptionIDs)

2. Call-site updates

Updated OIR mutation paths to pass the relevant time window:

  • Delete OIR: uses old.StartTime / old.EndTime
  • Upsert OIR: uses validParams.uExtent.StartTime / validParams.uExtent.EndTime

3. Lock diagnostics (thresholded, low-noise)

Warnings emitted when:

  • duration >= 500ms, or
  • matched_rows >= 1000

Fields logged:

  • global_lock
  • duration
  • matched_rows
  • cell_count
  • explicit_subscription_id_count

Failure warnings:

  • SCD global lock query failed
  • SCD subscription lock query failed

4. Documentation

Extended docs/operations/performances.md with diagnostics behavior, fields, and thresholds.

Validation

  • go test ./pkg/scd/store/datastore — passed
  • go test ./pkg/scd/... — passed locally

Risk / Compatibility

  • LockSubscriptionsOnCells interface signature updated; all in-repo call sites patched.
  • Cell-based lock scope intentionally narrowed by time overlap; explicit subscription ID locking preserved.
  • Diagnostics are passive and thresholded, minimizing steady-state log noise.

Follow-up

Closes #1311

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 7, 2026

CLA Not Signed

Contributor

@barroco barroco left a comment


Thank you very much @ashishjsharda for this contribution.

Please see comments inline.

Comment on lines +37 to +38
* Lock query duration is greater than or equal to `500ms`
* Number of matched/locked rows is greater than or equal to `1000`
Contributor


How were those values selected?

Author


Great question. They were chosen as pragmatic initial guardrails to keep logs low-noise while still surfacing clear outliers:

  • 500ms was selected as a “this is unexpectedly slow for lock acquisition” threshold relative to our normal successful lock path.
  • 1000 matched rows was selected as a “potentially broad lock scope” signal (large enough to avoid routine noise, small enough to catch problematic cases).

These are intentionally heuristic and meant for operational triage, not hard SLO limits. I can add a short note in the doc clarifying that these are initial defaults and may be tuned based on production observations.

Contributor


Ok, thanks. Those values may generate log lines for every request, judging by the graphs in the original issue.
While F3548-21 imposes no requirement on DSS response time for Operational Intents in SCD, the standard seems to assume that 5 seconds is a reasonable upper bound for DSS responses in general. Therefore, 4 seconds may be more appropriate, based on the experience of the deployed DSS pool referred to in the issue.

Regarding the matched rows, it may possibly be removed, see my comments below.

* Lock query duration is greater than or equal to `500ms`
* Number of matched/locked rows is greater than or equal to `1000`

The log message is `SCD lock query diagnostics` and includes:
Contributor


nit: I would suggest a log message stating the reason for the log line more explicitly, such as `Expensive SCD lock detected`.

Author


Good suggestion, agreed. I’ll update the diagnostic log message to make the reason explicit (for example Expensive SCD lock detected) while keeping the same structured fields (duration, matched_rows, cell_count, etc.) so existing analysis remains straightforward.

const (
	// These thresholds keep lock diagnostics low-noise in production while still surfacing likely bottlenecks.
	lockQuerySlowThreshold = 500 * time.Millisecond
	lockQueryLargeRowCount = 1000
)
Contributor


nit: It may be useful to have those values configurable so they can be adjusted per use case. This can be done in a follow-up PR.

Author


Agreed. For this PR I kept constants to minimize scope and risk, but making these thresholds configurable is a sensible follow-up. I can open a follow-up issue/PR to expose them via config (with the current values as defaults) so deployments can tune per workload profile.


return stacktrace.Propagate(err, "Error in query: %s", query)
}

matchedRows := int(result.RowsAffected())
Contributor


According to the documentation, it seems to always return 0. Do you know if the pgxpool package is implemented differently? In this case, there is only one row in this table, so we can remove that part of the check.

return stacktrace.Propagate(err, "Error in query: %s", query)
}

matchedRows := int(result.RowsAffected())
Contributor


Similarly, this seems to always return 0. Is RowsAffected in pgxpool behaving differently from the database/sql definition?
If it is not, I suggest removing the check on row count. Counting with a query would be too expensive and might make the performance problem worse.

Author


Thanks for flagging this. I double-checked the pgx/pgconn behavior: Exec returns a CommandTag, and for PostgreSQL, SELECT tags include a row count (`SELECT <count>`), so RowsAffected() is not universally 0 for this query type.

That said, your broader point about usefulness/noise is valid. To keep diagnostics robust and simple, I’m fine removing matched_rows from the warning criteria and relying on duration-based alerting.
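The command-tag behavior under discussion can be illustrated with a small reimplementation of the parsing. This mimics how pgx's CommandTag.RowsAffected reads the trailing number from a PostgreSQL command tag, but it is not pgx's actual code:

```go
package main

import "fmt"

// rowsAffectedFromTag extracts the trailing count from a PostgreSQL
// command tag string. Any non-digit resets the accumulator, so only
// the final run of digits contributes (matching tags like "SELECT 42"
// or "INSERT 0 5"). Illustrative sketch, not the pgx implementation.
func rowsAffectedFromTag(tag string) int64 {
	var n int64
	for i := 0; i < len(tag); i++ {
		if c := tag[i]; c >= '0' && c <= '9' {
			n = n*10 + int64(c-'0')
		} else {
			n = 0 // reset so only the trailing number counts
		}
	}
	return n
}

func main() {
	fmt.Println(rowsAffectedFromTag("SELECT 42"))  // 42: SELECT tags carry a count
	fmt.Println(rowsAffectedFromTag("UPDATE 7"))   // 7
	fmt.Println(rowsAffectedFromTag("INSERT 0 5")) // 5: INSERT tags are "INSERT oid rows"
}
```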

WHERE
cells && $1
OR
(
Contributor


@the-glu can you please evaluate the impact of this change with the load tests ?

Author


Agreed. I haven’t posted comparative load-test numbers yet. I’ll run the SCD loadtest scenarios (overlapping and non-overlapping) on this branch and share before/after results focused on p95/p99 latency and warning volume.

Contributor


This seems to improve the situation with default tests:

[load-test results graph]

However, this is highly dependent on the flight durations. Load tests use flights of 10s; switching to 90s shows no improvement (but that is expected).

I noticed however that the lock may be wrong (but that was already the case for cells): notifyVolume can differ from validParams, explaining some transaction restarts.

This may increase those transaction restarts as the lock is now more specific (and may be more 'wrong'), but only in the case of updates.

@barroco
Contributor

barroco commented Mar 10, 2026

Thanks for the clarification @ashishjsharda. While looking at it again, I am not sure about the usage of RowsAffected. Can you please confirm the behaviour is correct?



Development

Successfully merging this pull request may close these issues.

High Latency for SCD DSS Requests
