[scd] Time-bound subscription locks and add lock diagnostics #1384
ashishjsharda wants to merge 1 commit into interuss:master from
Conversation
barroco left a comment
Thank you very much @ashishjsharda for this contribution.
Please see comments inline.
> * Lock query duration is greater than or equal to `500ms`
> * Number of matched/locked rows is greater than or equal to `1000`
How were those values selected?
Great question. They were chosen as pragmatic initial guardrails to keep logs low-noise while still surfacing clear outliers:

* `500ms` was selected as a "this is unexpectedly slow for lock acquisition" threshold relative to our normal successful lock path.
* `1000` matched rows was selected as a "potentially broad lock scope" signal (large enough to avoid routine noise, small enough to catch problematic cases).

These are intentionally heuristic and meant for operational triage, not hard SLO limits. I can add a short note in the doc clarifying that these are initial defaults and may be tuned based on production observations.
Ok, thanks. Looking at the graphs on the original issue, those values may generate log lines for every request.
While F3548-21 has no requirement on DSS response time for Operational Intents in SCD, the standard seems to assume that 5 seconds is a reasonable upper bound for DSS responses in general. Therefore, 4 seconds may be more appropriate based on the experience of the deployed DSS pool referred to in the issue.
Regarding the matched rows, it may possibly be removed; see my comments below.
> The log message is `SCD lock query diagnostics` and includes:
nit: I would suggest a log message stating more explicitly the reason for the log line, such as `Expensive SCD lock detected`.
Good suggestion, agreed. I'll update the diagnostic log message to make the reason explicit (for example, `Expensive SCD lock detected`) while keeping the same structured fields (`duration`, `matched_rows`, `cell_count`, etc.) so existing analysis remains straightforward.
> const (
>     // These thresholds keep lock diagnostics low-noise in production while still surfacing likely bottlenecks.
>     lockQuerySlowThreshold = 500 * time.Millisecond
>     lockQueryLargeRowCount = 1000
nit: It may be useful to have those values configurable so they can be adjusted per use case. This can be done in a follow-up PR.
Agreed. For this PR I kept constants to minimize scope and risk, but making these thresholds configurable is a sensible follow-up. I can open a follow-up issue/PR to expose them via config (with the current values as defaults) so deployments can tune per workload profile.
> return stacktrace.Propagate(err, "Error in query: %s", query)
> }
>
> matchedRows := int(result.RowsAffected())
According to the documentation, it seems to always return 0. Do you know if the package pgxpool is implemented differently? In this case, there is only one row in this table, so we can remove that part of the check.
> return stacktrace.Propagate(err, "Error in query: %s", query)
> }
>
> matchedRows := int(result.RowsAffected())
Similarly, this seems to always return 0. Is `RowsAffected` of pgxpool behaving differently than the SQL definition?
If it is not the case, I suggest removing the check on row count. Counting with a query would be too expensive and may make the performance problem worse.
Thanks for flagging this. I double-checked the pgx/pgconn behavior: `Exec` returns a `CommandTag`, and for PostgreSQL `SELECT` the tag includes a row count (`SELECT n`), so `RowsAffected()` is not universally 0 for this query type.
That said, your broader point about usefulness/noise is valid. To keep diagnostics robust and simple, I'm fine removing `matched_rows` from the warning criteria and relying on duration-based alerting.
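For context on the behavior discussed here: PostgreSQL command tags carry the row count as their final space-separated token (e.g. `SELECT 5`, `INSERT 0 1`), which is what `RowsAffected()` parses. A self-contained sketch of that parsing (an illustration of the mechanism, not pgx's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rowsAffected mimics how a PostgreSQL command tag such as "SELECT 5" or
// "INSERT 0 1" carries a row count in its final token. Tags without a
// trailing integer (e.g. "BEGIN") yield 0.
func rowsAffected(commandTag string) int64 {
	parts := strings.Fields(commandTag)
	if len(parts) == 0 {
		return 0
	}
	n, err := strconv.ParseInt(parts[len(parts)-1], 10, 64)
	if err != nil {
		return 0 // no count in this tag
	}
	return n
}

func main() {
	fmt.Println(rowsAffected("SELECT 5"))   // 5
	fmt.Println(rowsAffected("INSERT 0 1")) // 1
	fmt.Println(rowsAffected("BEGIN"))      // 0
}
```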
> WHERE
>     cells && $1
>     OR
>     (
@the-glu can you please evaluate the impact of this change with the load tests ?
Agreed. I haven’t posted comparative load-test numbers yet. I’ll run the SCD loadtest scenarios (overlapping and non-overlapping) on this branch and share before/after results focused on p95/p99 latency and warning volume.
This seems to improve the situation with the default tests.
However, this is highly dependent on the flight durations. The load tests use flights of 10s; switching to 90s shows no improvement (but that is expected).
I noticed however that the lock may be wrong (but that was already the case for cells): `notifyVolume` can be different from `validParams`, explaining some transaction restarts.
This may increase those transaction restarts, as the lock is now more specific (and may be more 'wrong'), but only in the case of updates.
Thanks for the clarification @ashishjsharda. While looking at it again, I am not sure about the usage of `RowsAffected`. Can you please confirm the behaviour is correct?
Problem

Issue #1311 reports high latency and timeouts for SCD operations under overlapping traffic patterns. Lock acquisition in `LockSubscriptionsOnCells` becomes expensive when many subscriptions share S2 cells, regardless of temporal relevance.

Solution

Narrow the lock scope to subscriptions that overlap both the relevant S2 cells and the operation time window, while preserving correctness by always locking explicitly referenced subscription IDs. Add low-noise diagnostics to improve production observability of costly lock queries.
Changes

1. Time-bounded lock scope

Updated `LockSubscriptionsOnCells` to accept `startTime` and `endTime`, locking rows matching:

* `cells && $1`
* `COALESCE(starts_at <= endTime, true)`
* `COALESCE(ends_at >= startTime, true)`
* `OR id = ANY(explicitSubscriptionIDs)`

2. Call-site updates

Updated OIR mutation paths to pass the relevant time window:

* `old.StartTime` / `old.EndTime`
* `validParams.uExtent.StartTime` / `validParams.uExtent.EndTime`

3. Lock diagnostics (thresholded, low-noise)

Warnings emitted when:

* `duration >= 500ms`, or
* `matched_rows >= 1000`

Fields logged:

* `global_lock`
* `duration`
* `matched_rows`
* `cell_count`
* `explicit_subscription_id_count`

Failure warnings:

* `SCD global lock query failed`
* `SCD subscription lock query failed`

4. Documentation

Extended `docs/operations/performances.md` with diagnostics behavior, fields, and thresholds.

Validation

* `go test ./pkg/scd/store/datastore` — passed
* `go test ./pkg/scd/...` — passed locally

Risk / Compatibility

`LockSubscriptionsOnCells` interface signature updated; all in-repo call sites patched.

Follow-up

Closes #1311