Skip to content

SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434

Merged
Hooper9973 merged 2 commits into
eBay:dev/refactor_create_seal_shardfrom
Hooper9973:refactor_shard
Jun 18, 2026
Merged

SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434
Hooper9973 merged 2 commits into
eBay:dev/refactor_create_seal_shardfrom
Hooper9973:refactor_shard

Conversation

@Hooper9973

Copy link
Copy Markdown
Contributor
  • Switch create_shard and seal_shard to log-only; remove data channel path
  • Stop persisting shard header/footer on disk for create/seal flows
  • Add production GC issue reproductions:
    • issue1: concurrent seal_shard, create_shard, and GC
    • issue2: concurrent put_blob and seal_shard can place blob into sealed shard

Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/resync_shard_data.fbs Outdated
Comment thread src/lib/homestore_backend/hs_blob_manager.cpp Outdated
Comment thread src/lib/homestore_backend/heap_chunk_selector.cpp Outdated
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 55.12821% with 35 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dev/refactor_create_seal_shard@e1c23e1). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/homestore_backend/hs_shard_manager.cpp 48.83% 18 Missing and 4 partials ⚠️
src/lib/homestore_backend/heap_chunk_selector.cpp 45.45% 6 Missing ⚠️
src/lib/homestore_backend/hs_blob_manager.cpp 71.42% 2 Missing and 2 partials ⚠️
...ib/homestore_backend/replication_state_machine.cpp 40.00% 3 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@                        Coverage Diff                        @@
##             dev/refactor_create_seal_shard     #434   +/-   ##
=================================================================
  Coverage                                  ?   53.52%           
=================================================================
  Files                                     ?       36           
  Lines                                     ?     5338           
  Branches                                  ?      671           
=================================================================
  Hits                                      ?     2857           
  Misses                                    ?     2188           
  Partials                                  ?      293           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Hooper9973

Copy link
Copy Markdown
Contributor Author

Run the Reproduced UT on the stablev4.x branch, the result show the UT reproduced the scenario.

$ ./homestore_test_gc -csv error --executor immediate --config_path ./ --chunks_per_pg 1 --override_config hs_backend_config.enable_gc=true --override_config hs_backend_config.gc_garbage_rate_threshold=0 --override_config hs_backend_config.gc_scan_interval_sec=5 --gtest_filter=HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC

Port 4000 is not in use.
Port 4001 is not in use.
Port 4002 is not in use.
Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC

Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC

Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC
/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1050: Failure
Expected equality of these values:
  p2.value()
    Which is: 87
  live_pchunk
    Which is: 90
shard2 on replica 2 is routed to pchunk 87 but the live vchunk->pchunk mapping is 90 (stale shard/blob route; see PG 39 / PG 3409)


/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1080: Failure
Expected equality of these values:
  p2.value()
    Which is: 87
  live_pchunk_after
    Which is: 90
after putting blobs into shard2 on replica 2, shard2's recorded pchunk 87 diverged from the live mapping 90

[  FAILED  ] HomeObjectFixture.StalePChunkRouteAfterGC (16069 ms)
[       OK ] HomeObjectFixture.StalePChunkRouteAfterGC (17070 ms)
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[       OK ] HomeObjectFixture.StalePChunkRouteAfterGC (15083 ms)
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC

/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1223: Failure
Value of: blob_rejected
  Actual: false
Expected: true
[StaleBlobRouteAfterSealAndGC-fix] late _put_blob should have been rejected by sealed_lsn guard!

[  FAILED  ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[       OK ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[----------] 2 tests from HomeObjectFixture (25338 ms total)

[----------] 2 tests from HomeObjectFixture (27326 ms total)

[----------] Global test environment tear-down
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (25338 ms total)
[==========] 2 tests from 1 test suite ran. (27326 ms total)
[  PASSED  ] 1 test.
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC

 1 FAILED TEST
[       OK ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[----------] 2 tests from HomeObjectFixture (26325 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (26325 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] HomeObjectFixture.StalePChunkRouteAfterGC

@xiaoxichen xiaoxichen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

The compatibility of shard_info_superblk/shard_info need to be handle carefully, on metablk as well as the raft log entries (new code can receive/read back log entries from old code)

@JacksonYao287 JacksonYao287 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look tomorrow

Comment thread src/lib/homestore_backend/hs_blob_manager.cpp Outdated
Comment thread src/lib/homestore_backend/heap_chunk_selector.cpp Outdated
@Hooper9973 Hooper9973 force-pushed the refactor_shard branch 2 times, most recently from 722c9f3 to 6d28dd8 Compare June 17, 2026 03:19
Comment thread .github/workflows/conan_build.yml Outdated
Comment thread src/lib/homestore_backend/replication_state_machine.cpp Outdated
Comment thread src/lib/homestore_backend/replication_state_machine.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated

// since we do nothing in pre_commit, so we do nothing in rollback
if (msg_type == ReplicationMessageType::CREATE_SHARD_MSG) {
SLOGD(tid, shard_id, "rollback create shard message, type={}, lsn= {}", msg_header->msg_type, lsn);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JacksonYao287 Please check whether this logic can be deleted (other error like not leader can also release chunk)

 // we have already added release_chunk logic to thenValue of hoemobject#create_shard in originator, so here
            // we just need to release_chunk for non-originater case since it will bring a bug if a chunk is released
            // for two times. for exampele, as a originator:

            // t1 : chunk1 is released in the rollback of create_shard, the chunk state is marked as available
            // t2 : chunk1 is select by a new create shard (shard1), the chunk state is marked as inuse
            // t3 : chunk1 is released in thenValue of create_shard, the chunk state is marked as available
            // t4 : chunk1 is select by a new create shard (shard2), the chunk state is marked as inuse
            // now, shard1 and shard2 hold the same chunk.
            bool res = release_chunk_based_on_create_shard_message(header);
            if (!res) {
                RELEASE_ASSERT(false,
                               "shardID=0x{:x}, pg={}, shard=0x{:x}, failed to release chunk based on create shard msg",
                               msg_header->shard_id, (msg_header->shard_id >> homeobject::shard_width),
                               (msg_header->shard_id & homeobject::shard_mask));
            }

Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/heap_chunk_selector.cpp
Comment thread src/lib/homestore_backend/hs_blob_manager.cpp
…O, add prod repro issue

  - Switch create_shard and seal_shard to log-only; remove data channel path
  - Stop persisting shard header/footer on disk for create/seal flows
  - Add production GC issue reproductions:
    - issue1: concurrent seal_shard, create_shard, and GC
    - issue2: concurrent put_blob and seal_shard can place blob into sealed shard

ASSERT_EQ(hs_pg->pg_sb_->total_occupied_blk_count, total_blob_occupied_blk_count);
ASSERT_EQ(hs_pg->durable_entities().total_occupied_blk_count, total_blob_occupied_blk_count);
// ASSERT_EQ(hs_pg->pg_sb_->total_occupied_blk_count, total_blob_occupied_blk_count);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we have commented out "total_blob_occupied_blk_count += 2; /header and footer/", why comment out these ASSERT_EQ

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since GC still writes the shard header/footer, I’d like to temporarily comment out these blk_count
validations. I will remove the legacy GC shard header/footer code before this dev branch is merged into
stablev4.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after commentting out "total_blob_occupied_blk_count += 2; /header and footer/", I think all the assert should pass. pls enable them in the next pr

Comment on lines -394 to -395
if (ctx) {
ctx->promise_.setValue(folly::makeUnexpected(ShardError(ShardErrorCode::RETRY_REQUEST)));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need also ctx->promise_.setValue to notify the proposer if create_shard is rolled back?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

const auto exVchunk = chunk_selector()->pick_most_available_blk_chunk(new_shard_id, pg_owner);

if (exVchunk == nullptr) {
SLOGW(tid, new_shard_id, "no availble chunk left to create shard for pg={}", pg_owner);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo : availble

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Hooper9973 Hooper9973 merged commit 46efb1e into eBay:dev/refactor_create_seal_shard Jun 18, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants