gc: priority scheduling with dual watermarks and cross-scan quota #436

Open. xiaoxichen wants to merge 1 commit into eBay:stable/v4.x from xiaoxichen:gc-sort
Conversation

@xiaoxichen

Collaborator
  • Sort eligible chunks by garbage ratio (desc) before submission so the most garbage-heavy chunks are always GC'd first
  • Add gc_garbage_rate_threshold_low (default 30%) as a low watermark; chunks between the two watermarks consume at most half the quota
  • Track m_pending_normal_gc_task_count in pdev_gc_actor to reflect tasks queued or running in m_gc_executor across scan cycles; previous code only capped submissions per scan, allowing unbounded queue growth
  • scan_chunks_for_gc now skips a pdev entirely when already at quota, and derives low_tier_cap proportionally from remaining_capacity
  • Add ADR docs/adr/gc-priority-scheduling.md
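
The selection rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `select_for_gc`, the `ChunkGCInfo` field names, the use of `>=` against the low watermark, and the choice of `remaining / 2` (rounded down) for the low-tier cap are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the PR's ChunkGCInfo; field names are assumptions.
struct ChunkGCInfo {
    uint16_t chunk_id;
    uint32_t garbage_pct; // garbage ratio in percent, 0..100
};

// Pick the chunks to submit for one pdev in one scan.
//   high_pct / low_pct : the two watermarks
//   max_tasks          : per-pdev quota of queued + running normal GC tasks
//   pending            : tasks still queued or running from earlier scans
std::vector< ChunkGCInfo > select_for_gc(std::vector< ChunkGCInfo > candidates, uint32_t high_pct,
                                         uint32_t low_pct, uint32_t max_tasks, uint32_t pending) {
    assert(low_pct <= high_pct);
    std::vector< ChunkGCInfo > picked;
    if (pending >= max_tasks) { return picked; } // already at quota: skip this pdev

    const uint32_t remaining = max_tasks - pending;
    const uint32_t low_tier_cap = remaining / 2; // assumed: half of what is left, rounded down

    // Chunks below the low watermark are never eligible.
    candidates.erase(std::remove_if(candidates.begin(), candidates.end(),
                                    [low_pct](const ChunkGCInfo& c) { return c.garbage_pct < low_pct; }),
                     candidates.end());

    // Most garbage-heavy chunks first.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const ChunkGCInfo& a, const ChunkGCInfo& b) { return a.garbage_pct > b.garbage_pct; });

    uint32_t low_used = 0;
    for (const auto& c : candidates) {
        if (picked.size() >= remaining) { break; }
        if (c.garbage_pct < high_pct) {
            // Sorted descending, so every later chunk is low-tier as well.
            if (low_used >= low_tier_cap) { break; }
            ++low_used;
        }
        picked.push_back(c);
    }
    return picked;
}
```

With watermarks of 80 and 30, a quota of 5 and nothing pending, two chunks above 80% are always taken, and at most two chunks in the 30% to 80% band are added, so one slot stays free for a high-tier chunk that may show up before the next scan.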

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
@codecov-commenter

Codecov Report

❌ Patch coverage is 71.79487% with 11 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib/homestore_backend/gc_manager.cpp | 70.27% | 8 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #436   +/-   ##
==============================================
  Coverage               ?   52.77%           
==============================================
  Files                  ?       36           
  Lines                  ?     5434           
  Branches               ?      683           
==============================================
  Hits                   ?     2868           
  Misses                 ?     2269           
  Partials               ?      297           

const auto final_state =
priority == static_cast< uint8_t >(task_priority::normal) ? ChunkState::AVAILABLE : ChunkState::INUSE;

if (priority == static_cast< uint8_t >(task_priority::normal)) {

Collaborator

This should not go here; it belongs in the destructor of gc_task_guard.

on_gc_task_completed is also called from handle_recoverd_gc_task, where we should not decrement any counter.

Please refer to where decr_pg_pending_gc_task is called.
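
To illustrate the suggestion, here is a minimal RAII sketch in which the counter is released by a guard's destructor rather than by the completion callback. `pending_task_guard` is a hypothetical class written for this example; the real gc_task_guard may be shaped differently, and having the guard also increment the counter in its constructor is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical guard: owns one unit of a pending-task counter for its lifetime.
class pending_task_guard {
public:
    // Assumption: the guard takes the slot itself when it is created.
    explicit pending_task_guard(std::atomic< uint32_t >& counter) : m_counter{&counter} {
        m_counter->fetch_add(1, std::memory_order_relaxed);
    }

    // Moving transfers ownership of the slot; the source no longer decrements.
    pending_task_guard(pending_task_guard&& other) noexcept :
            m_counter{std::exchange(other.m_counter, nullptr)} {}

    pending_task_guard(const pending_task_guard&) = delete;
    pending_task_guard& operator=(const pending_task_guard&) = delete;
    pending_task_guard& operator=(pending_task_guard&&) = delete;

    ~pending_task_guard() {
        if (m_counter == nullptr) { return; } // moved-from guard owns nothing
        const auto prev = m_counter->fetch_sub(1, std::memory_order_relaxed);
        assert(prev > 0); // the counter must never underflow
        (void)prev;
    }

private:
    std::atomic< uint32_t >* m_counter;
};
```

A path that never created a guard, such as a recovered task in this sketch, cannot decrement the counter by accident, which is the property the comment above asks for.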

};
std::vector< ChunkGCInfo > eligible;
for (const auto& chunk_id : chunks) {
if (is_eligible_for_gc(chunk_id)) {

Collaborator

Will is_eligible_for_gc be used again? If not, we can remove it.

Collaborator Author

Good call. Should we move can_chunks_in_pg_be_gc into is_eligible_for_gc and have it return an int (the garbage_ratio_ptc)?

eligible.push_back({chunk_id, ratio_pct});
}

// Sort eligible chunks by garbage ratio descending so the most garbage-heavy chunks are GC'd first.

Collaborator

This is a top-K problem.
I suggest using a heap (priority_queue) with a capacity of remaining_capacity, so that we hold at most remaining_capacity ChunkGCInfo entries in memory. Each new ChunkGCInfo then only needs to be compared with the top of the heap.
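
For reference, the bounded-heap approach could look roughly like this. It is a sketch only: `top_k_by_garbage`, `LowestRatioOnTop`, and the `ChunkGCInfo` fields are names invented for the example, and it ignores the separate low-tier cap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for the PR's ChunkGCInfo; field names are assumptions.
struct ChunkGCInfo {
    uint16_t chunk_id;
    uint32_t garbage_pct;
};

// A "greater than" comparator turns std::priority_queue into a min-heap,
// so top() is the kept entry with the lowest garbage ratio.
struct LowestRatioOnTop {
    bool operator()(const ChunkGCInfo& a, const ChunkGCInfo& b) const { return a.garbage_pct > b.garbage_pct; }
};

// Return the k chunks with the highest garbage ratio, in descending order,
// while never holding more than k entries in the heap.
std::vector< ChunkGCInfo > top_k_by_garbage(const std::vector< ChunkGCInfo >& candidates, std::size_t k) {
    std::vector< ChunkGCInfo > out;
    if (k == 0) { return out; }

    std::priority_queue< ChunkGCInfo, std::vector< ChunkGCInfo >, LowestRatioOnTop > heap;
    for (const auto& c : candidates) {
        if (heap.size() < k) {
            heap.push(c);
        } else if (c.garbage_pct > heap.top().garbage_pct) {
            heap.pop(); // evict the weakest kept entry
            heap.push(c);
        }
    }

    out.reserve(heap.size());
    while (!heap.empty()) {
        out.push_back(heap.top());
        heap.pop();
    }
    std::reverse(out.begin(), out.end()); // popped ascending, return descending
    assert(out.size() <= k);
    return out;
}
```

This runs in O(n log k) time with O(k) extra memory, compared with O(n log n) time and O(n) memory for a full sort of the eligible list.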

Collaborator Author

IMO it isn't worth the extra lines of code, considering the maximum number of chunks is 32K.


  for (const auto& [pdev_id, chunks] : m_chunk_selector->get_pdev_chunks()) {
-     auto max_task_num = 2 * (reserved_chunk_num_per_pdev - reserved_chunk_num_per_pdev_for_egc);
+     const uint32_t max_task_num = 2 * (reserved_chunk_num_per_pdev - reserved_chunk_num_per_pdev_for_egc);

Collaborator

Here is a fact:

The higher the garbage ratio of a chunk, the fewer blobs are copied when that chunk is GC'd, and so the less IO pressure it puts on the hardware.

So, assuming we limit the IO pressure of GC over a certain period, we can submit more GC tasks if the candidate chunks have higher garbage ratios. I don't have a clear idea yet, but can max_task_num be made a more flexible value?
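
One possible way to make this concrete, purely as a sketch of the idea and not something proposed in the PR, is to charge each task its live percentage (100 minus the garbage ratio) against a copy budget. `tasks_within_copy_budget`, `copy_budget_pct`, and `hard_cap` are hypothetical names, and using the live percentage as a proxy for IO cost is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Count how many tasks fit in a copy budget.
//   garbage_pcts_desc : garbage ratios (percent) of candidates, sorted descending
//   copy_budget_pct   : total live data we allow to be copied, in "percent of a chunk" units
//   hard_cap          : upper bound that still applies, e.g. one derived from reserved chunks
uint32_t tasks_within_copy_budget(const std::vector< uint32_t >& garbage_pcts_desc, uint32_t copy_budget_pct,
                                  uint32_t hard_cap) {
    uint32_t spent = 0;
    uint32_t admitted = 0;
    for (const uint32_t pct : garbage_pcts_desc) {
        assert(pct <= 100);
        const uint32_t live_pct = 100 - pct; // share of the chunk that must be copied
        if (admitted >= hard_cap || spent + live_pct > copy_budget_pct) { break; }
        spent += live_pct;
        ++admitted;
    }
    return admitted;
}
```

Under a budget of 60, candidates at 95%, 90%, 90% and 80% garbage cost 5 + 10 + 10 + 20 = 45 and all four are admitted, while candidates at 50% and 45% admit only the first, since the second would bring the total to 105.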

Collaborator Author

The maximum concurrency is limited by reserved_chunk_num_per_pdev. This is more about how frequently we scan and submit GC tasks. Ultimately, if this were a busy loop, we would always keep the queue at 2 * reserved_chunks.

However, I think the tuning target for GC (most of the time) is not maximum GC speed; it is enough to keep up with data rotation (or maybe 2x the data rotation rate).
