SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434
Conversation
Hooper9973
commented
Jun 11, 2026
- Switch create_shard and seal_shard to log-only; remove data channel path
- Stop persisting shard header/footer on disk for create/seal flows
- Add production GC issue reproductions:
- issue1: concurrent seal_shard, create_shard, and GC
- issue2: concurrent put_blob and seal_shard can place blob into sealed shard
0087f47 to
dbd7adc
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev/refactor_create_seal_shard #434 +/- ##
=================================================================
Coverage ? 53.52%
=================================================================
Files ? 36
Lines ? 5338
Branches ? 671
=================================================================
Hits ? 2857
Misses ? 2188
Partials ? 293 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
dbd7adc to
efea260
Compare
|
Run the Reproduced UT on the stablev4.x branch, the result show the UT reproduced the scenario. |
efea260 to
78e64d7
Compare
xiaoxichen
left a comment
There was a problem hiding this comment.
LGTM.
The compatibility of shard_info_superblk/shard_info need to be handle carefully, on metablk as well as the raft log entries (new code can receive/read back log entries from old code)
JacksonYao287
left a comment
There was a problem hiding this comment.
I will take a look tomorrow
722c9f3 to
6d28dd8
Compare
|
|
||
| // since we do nothing in pre_commit, so we do nothing in rollback | ||
| if (msg_type == ReplicationMessageType::CREATE_SHARD_MSG) { | ||
| SLOGD(tid, shard_id, "rollback create shard message, type={}, lsn= {}", msg_header->msg_type, lsn); |
There was a problem hiding this comment.
@JacksonYao287 Please check whether this logic can be deleted (other error like not leader can also release chunk)
// we have already added release_chunk logic to thenValue of hoemobject#create_shard in originator, so here
// we just need to release_chunk for non-originater case since it will bring a bug if a chunk is released
// for two times. for exampele, as a originator:
// t1 : chunk1 is released in the rollback of create_shard, the chunk state is marked as available
// t2 : chunk1 is select by a new create shard (shard1), the chunk state is marked as inuse
// t3 : chunk1 is released in thenValue of create_shard, the chunk state is marked as available
// t4 : chunk1 is select by a new create shard (shard2), the chunk state is marked as inuse
// now, shard1 and shard2 hold the same chunk.
bool res = release_chunk_based_on_create_shard_message(header);
if (!res) {
RELEASE_ASSERT(false,
"shardID=0x{:x}, pg={}, shard=0x{:x}, failed to release chunk based on create shard msg",
msg_header->shard_id, (msg_header->shard_id >> homeobject::shard_width),
(msg_header->shard_id & homeobject::shard_mask));
}
6d28dd8 to
128996c
Compare
…O, add prod repro issue
- Switch create_shard and seal_shard to log-only; remove data channel path
- Stop persisting shard header/footer on disk for create/seal flows
- Add production GC issue reproductions:
- issue1: concurrent seal_shard, create_shard, and GC
- issue2: concurrent put_blob and seal_shard can place blob into sealed shard
128996c to
b9a773d
Compare
|
|
||
| ASSERT_EQ(hs_pg->pg_sb_->total_occupied_blk_count, total_blob_occupied_blk_count); | ||
| ASSERT_EQ(hs_pg->durable_entities().total_occupied_blk_count, total_blob_occupied_blk_count); | ||
| // ASSERT_EQ(hs_pg->pg_sb_->total_occupied_blk_count, total_blob_occupied_blk_count); |
There was a problem hiding this comment.
since we have commented out "total_blob_occupied_blk_count += 2; /header and footer/", why comment out these ASSERT_EQ
There was a problem hiding this comment.
Since GC still writes the shard header/footer, I’d like to temporarily comment out these blk_count
validations. I will remove the legacy GC shard header/footer code before this dev branch is merged into
stablev4.
There was a problem hiding this comment.
after commentting out "total_blob_occupied_blk_count += 2; /header and footer/", I think all the assert should pass. pls enable them in the next pr
| if (ctx) { | ||
| ctx->promise_.setValue(folly::makeUnexpected(ShardError(ShardErrorCode::RETRY_REQUEST))); |
There was a problem hiding this comment.
we need also ctx->promise_.setValue to notify the proposer if create_shard is rolled back?
| const auto exVchunk = chunk_selector()->pick_most_available_blk_chunk(new_shard_id, pg_owner); | ||
|
|
||
| if (exVchunk == nullptr) { | ||
| SLOGW(tid, new_shard_id, "no availble chunk left to create shard for pg={}", pg_owner); |
46efb1e
into
eBay:dev/refactor_create_seal_shard