[CORE-13707] redpanda: the great config bootstrap fix#30396
[CORE-13707] redpanda: the great config bootstrap fix#30396WillemKauf wants to merge 19 commits into
redpanda: the great config bootstrap fix#30396Conversation
4a97a52 to
4755fa7
Compare
There was a problem hiding this comment.
Pull request overview
This PR reworks Redpanda’s startup/bootstrap sequencing to prevent early reads of stale cluster configuration during node join/restart. It introduces a “config readiness” gate for config::shard_local_cfg(), splits kvstore recovery from writer startup, and adds an RPC path to fetch a leader-authoritative controller snapshot for restarting nodes.
Changes:
- Add a readiness check to
config::shard_local_cfg()plus an*_unsafe()escape hatch for early bootstrap. - Split
kvstorerecovery into an explicit phase, and restructure application bootstrap to recover/apply snapshots before marking config “ready”. - Add a
fetch_controller_snapshotRPC and controller snapshot plumbing to support leader-authoritative bootstrap for restarting nodes.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/storage/storage_resources.h | Extend ctor signature and add API to update append chunk size during bootstrap. |
| src/v/storage/storage_resources.cc | Plumb append chunk size into storage_resources initialization and add updater method. |
| src/v/storage/kvstore.h | Make recover() a public entry point and add start(recover_t) mode switch. |
| src/v/storage/kvstore.cc | Implement split recovery/start flow; adjust read APIs to require recovery rather than full start. |
| src/v/storage/api.h | Construct kvstore earlier and add recover_kvstore() bootstrap hook before start(). |
| src/v/redpanda/BUILD | Add Bazel deps needed by updated application/bootstrap compilation units. |
| src/v/redpanda/application.h | Reshape bootstrap APIs into async helpers for applying snapshots/config and discovery. |
| src/v/redpanda/application.cc | Move cluster-config hydration earlier and ensure feature-table metrics are set up during metrics init. |
| src/v/redpanda/application_start.cc | Use stored cluster_discovery instance and remove passing discovery by reference into runtime start. |
| src/v/redpanda/application_config.cc | Use shard_local_cfg_unsafe() for bootstrap-time kvstore config extraction. |
| src/v/redpanda/application_bootstrap.cc | Rework bootstrap sequence: recover kvstore, apply snapshots, discovery/join, then mark config ready. |
| src/v/features/feature_table.h | Add setup_metrics() API. |
| src/v/features/feature_table.cc | Defer probe/metrics setup from ctor to explicit setup_metrics(). |
| src/v/config/node_config.cc | Use shard_local_cfg_unsafe() during node-config YAML loading/ignores. |
| src/v/config/configuration.h | Add readiness flag + document shard_local_cfg() vs shard_local_cfg_unsafe(). |
| src/v/config/configuration.cc | Implement shard_local_cfg_unsafe() and guard shard_local_cfg() with readiness vassert. |
| src/v/cluster/types.h | Add serde RPC request/reply types for fetching a controller snapshot. |
| src/v/cluster/service.h | Add controller service RPC endpoint for fetch_controller_snapshot. |
| src/v/cluster/service.cc | Implement RPC handler to delegate snapshot fetch logic to members manager. |
| src/v/cluster/members_manager.h | Declare handler to produce/forward leader-authoritative controller snapshot replies. |
| src/v/cluster/members_manager.cc | Implement snapshot fetch path (forward to leader if needed; leader builds partial join snapshot). |
| src/v/cluster/controller.json | Register new fetch_controller_snapshot RPC in controller protocol schema. |
| src/v/cluster/controller.cc | Adjust controller start wiring after config_manager ctor signature change. |
| src/v/cluster/controller_stm.h | Generalize join snapshot construction via templated backend selection. |
| src/v/cluster/controller_stm.cc | Reorder snapshot application so config manager applies earlier; remove old join snapshot impl. |
| src/v/cluster/config_manager.h | Remove cluster recovery table dependency from config manager constructor/state. |
| src/v/cluster/config_manager.cc | Switch more bootstrap-time reads to shard_local_cfg_unsafe() and adjust replay/bootstrap flow. |
| src/v/cluster/cluster_discovery.h | Update discovery ctor signature and add controller snapshot fetch API (with stale-comment mismatch). |
| src/v/cluster/cluster_discovery.cc | Implement new controller snapshot fetch RPC path and decouple discovery from storage::api. |
Comments suppressed due to low confidence (1)
src/v/cluster/config_manager.cc:193
config_manager::apply_local()still stages updates toneeds_restartproperties viaset_pending_value(...), butconfig_manager::start()no longer promotes pending values after STM replay. This means nodes may never applyneeds_restartconfig updates even after a restart (pending values remain invisible indefinitely). Reintroduce a promotion point after replay (as before), or otherwise ensure pending values are promoted at least once during startup after replay completes.
ss::future<> config_manager::start() {
if (_seen_version == config_version_unset) {
vlog(clusterlog.trace, "Starting config_manager... (initial)");
// Background: wait till we have a leader. If this node is leader
72cd608 to
0140ac7
Compare
This comment was marked as outdated.
This comment was marked as outdated.
3ec0835 to
2058bd9
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
625236c to
d9c2e30
Compare
This comment was marked as outdated.
This comment was marked as outdated.
d9c2e30 to
7ac0443
Compare
redpanda: the great config bootstrap fix
This comment was marked as outdated.
This comment was marked as outdated.
7ac0443 to
07a3683
Compare
redpanda: the great config bootstrap fixredpanda: the great config bootstrap fix
697cfdc to
e4e5cfe
Compare
This singleton was a bad idea. Instead, encapsulate it within
`storage_resources`, which is also accessible everywhere `chunk_cache`
is currently used.
This commit:
* Removes the singleton accessor for `chunk_cache`
* Adds the `_chunk_cache` object storage to `storage_resources`
* Reworks the `start()` logic for `chunk_cache`
* `api::start()` -> `storage_resources::start()` -> `chunk_cache::start()`,
but with an bool argument `prealloc`. Meant for tests, where we previously
wouldn't preallocate the chunk cache upfront when calling `api::start()`.
* Public `chunk_cache` APIs `add()` and `get()` now called `assert_started()`.
* This ensures we allocate chunks in production, while allowing tests to
continue to not preallocate.
* Ensures tests which require the `chunk_cache` properly start/stop their
encapsulated `storage_resources`.
This also accomplishes the goal of deferring construction of `memory_groups()`
before `storage::api::start()` is called during the bootstrapping/start-up
process in `redpanda`.
Otherwise we reach into the `ntp_config` (with 0 overrides) and try to probe the cluster default for `log_cleanup_policy`, which is incorrect in the `kvstore` context.
We are going to be using `rpc_server_bootstrap_memory` as a placeholder value for `cfg.max_service_memory_per_core` during the bootstrap process as a way of avoiding a call to `memory_groups()` early on in the bootstrap process. In order to correctly size the memory allotment for the server post bootstrap, we need a way of adjusting the number of units used for the semaphore `_memory` in `net::server`. Use an `adjustable_semaphore` for this purpose and add a new function `set_memory_capacity()` to adjust the semaphore's capacity. This will be called after the bootstrapping process is complete and `memory_groups` is considered usable/trustworthy.
Will be used in the future as a static member function without a `members_manager` instance.
We are going to want control over what backends we need to include in a `controller_stm` snapshot in a future commit. Move this function to the header and use a variadic template for this purpose.
Pull config_manager::apply_snapshot out of the ss::when_all parallel block and run it serially after feature_backend and before members_manager. Downstream backends (members, topics, plugin, recovery, security, quota, migrations, cluster_link) may consult shard_local_cfg during their own apply_snapshot work; sequencing config_manager first guarantees they see fresh values.
Removes the promote_all_pending() call from config_manager::start(). Pending values that accumulated during the post-bootstrap controller log replay (i.e. cluster_config_delta_cmd records applied above the hydrated snapshot offset) now stay pending instead of being promoted when config_manager::start runs. This is the strict reading of needs_restart=yes semantics: a property change visible to a node only after it restarts. shard_local_cfg() is already hydrated from the controller snapshot before bootstrap proceeds (see application::bootstrap_cluster_config_view), so any delta replayed afterward is a change the cluster made *after* the hydration point and should not retroactively affect the running node's active values for needs_restart properties.
A bit of code extrapolated out. Currently used for gating `join_node_request`s, will be used in a future commit for handling `fetch_controller_snapshot` requests as well.
A new RPC to fetch a controller snapshot, containing the latest in-memory state of the controller leader's `config_manager` and `feature_backend`. This RPC will be called unconditionally during bootstrap of a `redpanda` broker, whether that node is simply restarting or is joining for the first time. The reason for this being that a local node who is down for a long time will have only stale configurations and features in its local cache/snapshots/log, and the configuration and feature managers are two global pieces of state that should be brought to a consistent view early on in a `redpanda` node's lifetime.
e4e5cfe to
aa327c3
Compare
This comment was marked as outdated.
This comment was marked as outdated.
An extra layer of security against using potentially stale configs out of the `shard_local_cfg()`. Defaults to `usable_before_ready::no` to be strict. Properties which are used in the bootstrapping process before a proper view of the cluster configuration can be established and which have no impact on the behavior of the system if read stale can be safely marked as `usable_before_ready::yes`.
Now that `chunk_cache` is no longer a process wide singleton, fixture tests that allocate several `application`s to simulate multi-broker tests allocate `chunk_cache` N times instead of 1, leading to potential memory issues. Add `test_cfg` to `redpanda`, encapsulating the existing `cloud_topics::test_cfg` and adding a new flag `chunk_cache_prealloc`, which is `true` by default for production use.
aa327c3 to
083ed0f
Compare
| .handle_exception_type([](const ss::gate_closed_exception&) { | ||
| lg.trace("Shutdown requested during recovery"); |
There was a problem hiding this comment.
any particular reason for removing this?
|
|
||
| /* | ||
| * Recovery | ||
| * Recovery (recover() itself is a public entry point declared above) |
There was a problem hiding this comment.
nitpick: shouldn't this comment be attached to recover()?
| _kvstore = std::make_unique<kvstore>( | ||
| _kv_conf_cb(), ss::this_shard_id(), _resources, _feature_table); | ||
| return _kvstore->start().then([this] { | ||
| if (!_kvstore) { |
There was a problem hiding this comment.
q: why do we need this? just for start/stopping storage in tests?
| co_await _storage.invoke_on_all( | ||
| [](storage::api& a) { return a.start(); }); |
There was a problem hiding this comment.
nitpick: what's the difference?
EDIT: ah, default param
| * unaffected; callers that signal()/consume() directly through the | ||
| * underlying handle bypass capacity bookkeeping. |
There was a problem hiding this comment.
this feels a little sketchy, but i guess for our use case the whole point is that we only update the capacity at one point post-bootstrap?
| ss::future<> promote_all_pending() { | ||
| co_await ss::smp::invoke_on_all( | ||
| [] { config::shard_local_cfg().promote_pending(); }); | ||
| } |
There was a problem hiding this comment.
iiuc the idea here is that, post the great config bootstrap fix, by the time we apply the local cache it is fully up-to-date with the a valid config snapshot from the controller leader, right?
nitpick: update commit message. only mentions start right now
nitpick: should this commit go on the other side of the actual bootstrap fix? bit easier to see it's correct that way, and then we'll have a valid bisection point just in case.
| // Snapshot application is a wholesale state replacement — promote | ||
| // all pending values immediately so needs_restart properties take | ||
| // effect (e.g. during cluster recovery). | ||
| co_await promote_all_pending(); |
There was a problem hiding this comment.
checking my understanding: we no longer need this because we'll be applying a snapshot earlier in the boostrap process and ultimately forcing pending values to active in apply_local?
oleiman
left a comment
There was a problem hiding this comment.
made it up to config store changes. looking pretty good!
The Great Config Bootstrap Fix
BLUF
Re-arrangement of the bootstrap process in order to allow restarting nodes to see an up to date view of the cluster wide state before initializing systems that depend on it.
BLUF Continued
Two important cluster wide states include the cluster configuration and feature table views, which we now attempt to fetch via a new fetch_controller_snapshot RPC to the controller leader, similar to a node joining the cluster for the first time.
This is mostly why the config cache exists, as well as why we snapshot member tables and feature tables into the kvstore- to allow a view of these properties BEFORE bootstrapping. However, these local snapshots can always be stale, and it is important to bring the node up to date with the controller leader before the system is started.
This prevents bugs like:
These are just two examples of the undefined behaviors that can happen when we bootstrap before establishing a cluster wide view of these configurations and/or features.
We can also no longer promote cluster configs during cluster recovery or during controller log replay after the bootstrapping process - that would be invalid behavior.
The core problem: where does a node's config come from at start-up?
Before — error prone
flowchart TD A[start-up] --> B[load local config cache / bootstrap yaml<br/>load local feature-table snapshot] B --> C{joining or restarting?} C -->|joiner| D[register_with_cluster<br/>→ gets a controller_join_snapshot<br/>config + features are fresh] C -->|restarter| E[use LOCAL cache + snapshot only<br/>⚠️ may be stale / many versions behind] D --> F[wire up + start ALL runtime services<br/>sized from this config] E --> F F --> G[controller->start → raft0 replay] G --> H[the restarter promotes cluster configs read and appears to possess the up-to-date cluster configuration + features, but services already started on stale values] style E fill:#ffd9d9 style H fill:#ffd9d9The hazards:
stale local config and an old feature-table snapshot.
stale reads are silent.
After — establish a consistent view first
flowchart TD A[start-up] --> RO[recover kvstore from local segments and read the node UUID, cluster UUID, feature snapshot] RO --> NR[mark cluster config as NOT ready to catch any potential misuses of stale configs] NR --> HY[preload local config cache / bootstrap yaml] HY --> RPC[start internal RPC bootstrap server] RPC --> EV[establish_cluster_view: call fetch_controller_snapshot from the controller leader and apply fresh config_manager + feature_backend state] EV --> RDY[mark cluster config READY] RDY --> INIT[initialize: build memory_groups,<br/>re-set RPC server memory capacity] INIT --> RUN[wire up + start runtime services<br/>now sized from a consistent, fresh config] RUN --> R0[controller->start → raft0 replay<br/>now just tops up from an already-consistent base] style EV fill:#d9ffd9 style RDY fill:#d9ffd9Key change: both joiners and restarters obtain an authoritative
config_manager+feature_backendview from the controller leader beforethe runtime is built — so everything downstream is sized and gated from fresh
values. The window between "not ready" and "ready" is guarded by
usable_before_ready(see the last commit).Commit walkthrough
Commits 1→3: read-only
kvstoreaccess for bootstrapstorage: exposekvstore::recover()— surfacekvstore::recover()(andan accessor on
storage::api) so the kvstore's persisted state can be readfor read-only purposes early in bootstrap, without a full storage start.
storage: move around constructor logic inapi— make thekvstoreavailable right after
storage::api's constructor, so the read-only accessorabove is usable as early as possible.
cluster: removestorage::apidependency injection incluster_discovery—
cluster_discoveryonly needednode_uuid/cluster_uuid(constant onceset). Pass those two values directly instead of injecting the whole
storage::api, decoupling discovery from the storage lifecycle.Commits 4→5: a chunk cache that doesn't force
memory_groups()earlystorage: abstract outchunk_cache::preallocate()— split the poolpre-allocation out of
start()into a dedicatedpreallocate()step, sostarting the cache and warming its pool become independent.
storage: removechunk_cachesingleton — the process-widethread_localsingleton was a mistake. Encapsulate the cache insidestorage_resources(reachable everywhere the cache was used). Lifecycle is nowapi::start()→storage_resources::start()→chunk_cache::start(prealloc),where
prealloclets tests skip the up-front pool fill;add()/get()assert_started()to catch misuse. Crucially, this defers construction ofmemory_groups()untilstorage::api::start(), instead of at firstsingleton access during early bootstrap.
Commit 6 — well-contained fix
kvstore: setis_compaction_enabled=falseflag — during kvstore segmentrecovery we were reaching into an override-less
ntp_configand probing thecluster default
log_cleanup_policy, which is meaningless in the kvstorecontext. Pass
is_compaction_enabled=falseexplicitly instead.Commits 7→9: size the bootstrap RPC server without
memory_groups()config: addrpc_server_bootstrap_memory— a fixed placeholder formax_service_memory_per_coreused while bootstrapping, so the bootstrap RPCserver doesn't have to consult
memory_groups().utils: addadjustable_semaphore::underlying()— expose the underlyingsemaphore, needed to resize capacity later.
net: useadjustable_semaphoreinserver— backnet::server's_memorylimit with anadjustable_semaphoreand addset_memory_capacity().The server starts sized from
rpc_server_bootstrap_memory; once bootstrap iscomplete and
memory_groups()is trustworthy, its capacity is adjusted to thereal allotment.
Commits 10→15 — preparation for the snapshot fetch
cluster: makemembers_manager::read_members_from_kvstore()static —so it can be called without a
members_managerinstance during bootstrap.cluster: template-izemaybe_make_join_snapshot— move it to theheader as a variadic template, so a caller can choose exactly which
controller-STM backends to include in a snapshot (needed for the fetch RPC).
cluster: applyconfig_managerbefore parallel snapshot batch — pullconfig_manager::apply_snapshotout of thess::when_allparallel block andrun it first, serially. Downstream backends (members, topics, security,
quota, cluster_link, …) may read
shard_local_cfg()during their ownapply_snapshot, so config must be fresh before they run.cluster: stop promoting pending values at start — droppromote_all_pending()fromconfig_manager::start(). Sinceshard_local_cfg()is now hydrated from the controller snapshot beforebootstrap proceeds, any
cluster_config_delta_cmdreplayed afterward is achange the cluster made after hydration; for
needs_restartpropertiesit must stay pending (the strict reading of needs-restart semantics) rather
than retroactively mutating the running node.
cluster: factor outis_request_logical_version_compatible— smallextraction currently gating
join_node_request; reused next for the fetchRPC.
cluster: addfetch_controller_snapshotRPC — the heart of the newapproach. A new RPC that returns the controller leader's latest in-memory
config_manager+feature_backendstate. It is called unconditionallyduring bootstrap — whether the node is restarting or joining — because a node
that was down a long time has only stale local config/features, and these two
pieces of global state must be brought to a consistent view early in the
node's lifetime.
Commit 16 —
redpanda: the great config bootstrap fixThis is the payload that rewires
application::run()to use everything above.The new start-up order is:
What changed conceptually (before → after):
fetch_controller_snapshotmemory_groups()(from stale config)rpc_server_bootstrap_memoryplaceholder, adjusted post-bootstrapmemory_groups()initialize()after the view is consistentshard_local_cfg()readsusable_before_ready(next commit)sequenceDiagram participant N as Booting node participant K as local kvstore participant L as controller leader N->>K: recover() (read-only): node/cluster UUID, feature snapshot N->>N: mark_config_ready(false) N->>N: hydrate local config cache N->>N: start bootstrap RPC server (rpc_server_bootstrap_memory) N->>L: fetch_controller_snapshot (config_manager + feature_backend) L-->>N: authoritative snapshot N->>N: apply snapshot → mark_config_ready(true) N->>N: initialize(): build memory_groups, resize RPC memory N->>N: start runtime services on a consistent config N->>N: controller->start → raft0 replay (tops up from consistent base)Commit 17 —
config: addusable_before_readyflagshard_local_cfg()is markednot-ready until
establish_cluster_view()completes; reading any property'sactive value while not-ready trips an assertion unless the property opted
in via
usable_before_ready::yes. Default is::no(strict). Properties thatare genuinely needed during bootstrap and whose stale value is harmless are
flagged after audit. See the appendix for the full list and rationale.
Commit 18 —
redpanda: addtest_cfgandchunk_cache_preallocchunk_cacheis no longer a process wide singleton, fixturetests that allocate several
applications to simulate multi-brokertests allocate
chunk_cacheN times per shard instead of 1, leading to potentialmemory issues. Add
test_cfgtoredpanda, encapsulating the existingcloud_topics::test_cfgand adding a new flagchunk_cache_prealloc,which is
trueby default for production use.Commit 19 —
rptest: addMultiNodeBootstrapTestbootstrap path (restart and join), asserting that a node converges to the
cluster's configuration/feature view at start-up rather than lagging until
raft0 replay.
Appendix: properties flagged
usable_before_readyThese are read during bootstrap, before
shard_local_cfg()is ready. Each isaudited as safe to read with a potentially-stale local value (the consumer is
re-synced/reconstructed after the view is established, or the value has no
lasting effect).
start disabled, memory capacity re-applied afterward):
rpc_server_bootstrap_memory,rpc_server_listen_backlog,rpc_server_tcp_recv_buf,rpc_server_tcp_send_buf,rpc_server_compress_replies.disable_metrics,disable_public_metrics,aggregate_metrics.tls_min_version,tls_enable_renegotiation,tls_v1_2_cipher_suites,tls_v1_3_cipher_suites.wire_up_bootstrap_services):raft_smp_max_non_local_requests,pp_sr_smp_max_non_local_requests,topic_partitions_per_shard.kvstore_max_segment_size,kvstore_flush_interval,append_chunk_size,storage_read_buffer_size,storage_read_readahead_count,storage_target_replay_bytes,storage_max_concurrent_replay.join_retry_timeout_ms.What this PR doesn't fix:
The potential race/log invariant violation caused by reading a config cache that is more up to date than the controller snapshot (e.g. we have read-ahead in the log's history and then replay other entries out of order). It may be the case that we want to eliminate the config cache read entirely in favour of the controller snapshot.
flowchart LR subgraph correct["✅ Correct order (log is source of truth)"] direction LR c5["offset 5<br/>snapshot: X=old"] --> c7["offset 7<br/>reads X ⇒ old"] --> c10["offset 10<br/>writes X=new"] end subgraph broken["❌ Replay reading kvstore first"] direction LR b0["kvstore preloaded<br/>X=new (from @10)"] --> b5["offset 5<br/>snapshot"] --> b7["offset 7<br/>reads X ⇒ new"] --> b10["offset 10<br/>writes X=new"] b7 -. "future state leaks<br/>into the past" .-> b0 endBackports Required
Release Notes
Bug Fixes
redpandanodes during the bootstrapping process.