Skip to content

Conversation

@xzhangxian1008
Copy link
Contributor

@xzhangxian1008 xzhangxian1008 commented Jan 14, 2026

What problem does this PR solve?

Issue Number: close #10636

Problem Summary:

What is changed and how it works?

In `waitForBlockAvailableForTest`, CTEReader will wait the wake of `cv_for_test`. However, we need to check if there are any available blocks before entering the wait or it will wait forever when there is no more block to be pushed.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the issue that TestCTE.Concurrent may hang forever

Summary by CodeRabbit

  • Bug Fixes

    • Improved synchronization across CTE operations to avoid deadlocks and reduce contention, enhancing stability and performance.
    • Revised wait/relock behavior to ensure safe locking order during concurrent access.
  • Tests

    • Simplified test synchronization primitives and made test-access locking more consistent to improve test reliability.

✏️ Tip: You can customize this high-level summary in your review settings.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Jan 14, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Jan 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zanmato1984 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 14, 2026
@coderabbitai
Copy link

coderabbitai bot commented Jan 14, 2026

📝 Walkthrough

Walkthrough

Introduces a shared read–write lock (rw_lock) in CTE, removes per-partition test mutex from CTEPartition, and changes CTEReader to use the CTE rw_lock with an ordered unlock/re-lock sequence around condition-variable waits to avoid deadlocks.

Changes

Cohort / File(s) Summary
CTE core sync
dbms/src/Operators/CTE.h
Added private std::shared_mutex rw_lock (guards: next_cte_reader_id, is_eof, is_cancelled, get_resp, resp, err_msg, sink_exit_num, source_exit_num, registered_sink_num) and public accessor std::shared_mutex & getRWLockForTest().
Partition struct update
dbms/src/Operators/CTEPartition.h
Removed test-only mu_for_test (std::unique_ptr<std::mutex>) from CTEPartition (#ifndef NDEBUG block); mu and cv_for_test remain.
Reader lock flow
dbms/src/Operators/CTEReader.cpp
Replaced per-partition test-mutex usage with shared rw_lock + partition.mu; use checkBlockAvailableNoLock() when appropriate; release rw_lock before waiting on cv_for_test, then re-acquire locks in order (rw_lock then partition mutex) after wake to avoid deadlock.
sequenceDiagram
    participant Reader as CTEReader
    participant CTE as CTE
    participant Part as CTEPartition

    Reader->>CTE: acquire shared rw_lock (shared)
    Reader->>Part: lock partition mutex (mu)
    alt no block available
        Reader->>CTE: unlock shared rw_lock
        Reader->>Part: wait on cv_for_test (waiting releases mu)
        Part-->>Reader: notify cv_for_test
        Reader->>CTE: acquire shared rw_lock (shared)
        Reader->>Part: re-lock partition mutex (mu)
    end
    Reader->>CTE: proceed with block processing
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm

Poem

🐰 I hopped through locks both big and small,
Swapped tangled mutexes for one shared wall,
I waited, I nudged, then re-locked in line,
Readers now wake in a tidy design,
🥕 Hooray — no more thread-tangled fall!

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix hang for TestCTE.Concurrent test' clearly summarizes the main change - fixing a test hang issue in TestCTE.Concurrent, which aligns with the changeset's objective.
Description check ✅ Passed The PR description includes the issue number (#10636), a problem summary, detailed explanation of the code changes, and a filled checklist with unit test marked as included. The release note is also provided.
Linked Issues check ✅ Passed The PR addresses issue #10636 by fixing the hang in TestCTE.Concurrent through improved synchronization logic in waitForBlockAvailableForTest that checks for available blocks before waiting.
Out of Scope Changes check ✅ Passed All changes are within scope: introducing a shared rw_lock in CTE, removing mu_for_test from CTEPartition, and updating CTEReader synchronization logic to fix the hang - all directly address the hang issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xzhangxian1008
Copy link
Contributor Author

/cc @windtalker @gengliqi

@ti-chi-bot ti-chi-bot bot requested review from gengliqi and windtalker January 15, 2026 01:43
@xzhangxian1008
Copy link
Contributor Author

/run-check-issue-triage-complete


CTEPartition & getPartitionForTest(size_t partition_idx) { return this->partitions[partition_idx]; }

std::shared_mutex * getRWLockForTest() { return &(this->rw_lock); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a reference instead of a pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a reference instead of a pointer?

done

@xzhangxian1008
Copy link
Contributor Author

/cc @windtalker

@xzhangxian1008
Copy link
Contributor Author

/cc @windtalker @gengliqi

// For example: in `CTE::tryGetBlockAt`, we will lock rw_lock first then lock partition.mu.
// If locking partition.mu first here, `CTE::tryGetBlockAt` may have locked rw_lock. Then
// each of them needs to lock the other lock, but the other lock has been locked now.
rw_lock.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks very weird to me to manually release and acquire lock after wait, why we need rw_lock here? Is it possible that we only hold rw_lock before while loop?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Operators/CTEReader.cpp (1)

48-62: Potential missed wakeup between unlocking rw_lock and wait()

Between Line 60 and Line 61, is_eof can flip + notify while this thread hasn’t started waiting yet (notifier doesn’t take partition.mu), which can still deadlock the test. Consider guarding the predicate with the same mutex or using a timed wait to avoid indefinite sleep.

🛠️ Minimal mitigation using a bounded wait
-        rw_lock.unlock();
-        partition.cv_for_test->wait(lock);
+        rw_lock.unlock();
+        partition.cv_for_test->wait_for(lock, std::chrono::milliseconds(10));
🧹 Nitpick comments (1)
dbms/src/Operators/CTE.h (1)

188-193: Limit test-only lock accessor to debug builds

getRWLockForTest() is public and available in release builds. Consider guarding it with #ifndef NDEBUG to prevent production misuse.

♻️ Proposed change
-#ifndef NDEBUG
-    CTEPartition & getPartitionForTest(size_t partition_idx) { return this->partitions[partition_idx]; }
-#endif
-
-    std::shared_mutex & getRWLockForTest() { return this->rw_lock; }
+#ifndef NDEBUG
+    CTEPartition & getPartitionForTest(size_t partition_idx) { return this->partitions[partition_idx]; }
+    std::shared_mutex & getRWLockForTest() { return this->rw_lock; }
+#endif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fail TSAN test: TestCTE.Concurrent hang forever

3 participants