RATIS-2388 (Further) Enhancing content for concept in ratis-docs #1338
base: master
szetszwo
left a comment
@jcshepherd , thanks a lot for contributing this! I have reviewed down to the Snapshot section. Please see the comments so far. Will continue reviewing this.
|
|
||
| Raft's safety guarantees depend on majority agreement within each group. The leader replicates | ||
| each operation to the followers in its group, and operations are committed when at least | ||
| (N/2 + 1) peers in that group acknowledge them. This means a group of 3 peers can tolerate 1 |
Let's use $\lfloor N/2 + 1 \rfloor$.
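As a quick sanity check on the formula, here is a minimal standalone sketch in Python. It is not Ratis code, and the function names `majority` and `tolerated_failures` are made up for illustration. For an integer $N$, $\lfloor N/2 + 1 \rfloor$ equals `N // 2 + 1`.

```python
def majority(n: int) -> int:
    """Smallest number of peers that forms a majority of n peers: floor(n/2) + 1."""
    if n < 1:
        raise ValueError("a group needs at least one peer")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of peers that can fail while a majority is still available."""
    return n - majority(n)


# Print the quorum size and failure tolerance for small group sizes.
for n in range(1, 8):
    print(f"N={n}: majority={majority(n)}, tolerates={tolerated_failures(n)}")
```

The output shows that an even $N$ tolerates the same number of failures as $N-1$, which is the reason odd group sizes are usually preferred.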
| A single cluster can host multiple independent Raft groups, each with its own leader election, | ||
| consistency and state replication. Groups typically consist of an odd number of peers (3, 5, or | ||
| 7 are common) to ensure clear majority decisions. |
Let's merge the first sentence into the preceding paragraph and move the second sentence to the next section, which talks about majority.
diff --git a/ratis-docs/src/site/markdown/concept/index-v2.md b/ratis-docs/src/site/markdown/concept/index-v2.md
index d51146863..bf1463c43 100644
--- a/ratis-docs/src/site/markdown/concept/index-v2.md
+++ b/ratis-docs/src/site/markdown/concept/index-v2.md
@@ -71,18 +71,20 @@ group is a logical consensus domain that runs across a specific subset of peers
-At any given time, one peer in a group acts as the "leader" while the others are "followers" or
+One of the peers in a group acts as the "leader" while the others are "followers" or
"listeners". The leader handles all write requests and replicates operations to other peers in
the group. Both leaders and followers can service read requests, with different consistency
-guarantees.
+guarantees. A single cluster can host multiple independent Raft groups,
+each with its own leader election, consistency and state replication.
-A single cluster can host multiple independent Raft groups, each with its own leader election,
-consistency and state replication. Groups typically consist of an odd number of peers (3, 5, or
-7 are common) to ensure clear majority decisions.
### Majority-Based Decision-Making
Raft's safety guarantees depend on majority agreement within each group. The leader replicates
each operation to the followers in its group, and operations are committed when at least
-(N/2 + 1) peers in that group acknowledge them. This means a group of 3 peers can tolerate 1
+$\lfloor N/2 + 1 \rfloor$ peers in that group acknowledge them.
+This means a group of 3 peers can tolerate 1
failure, a group of five peers can tolerate 2 failures, and so on.
+Since a group of $N$ peers for an even $N$ can tolerate the same number of failures as
+a group of $(N-1)$ peers, groups typically consist of an odd number of peers (3, 5, or
+7 are common) to ensure clear majority decisions.
This majority requirement affects both availability and performance. A group remains available as
long as a majority of its peers are reachable and functioning. However, every transaction must
|
||
| ### Servers, Clusters, and Groups | ||
|
|
||
| A Raft server (also known as a "peer") is a single running instance of your application with |
Let's add also "member", i.e.
... (also known as a "peer" or a "member")
|
|
||
| A Raft cluster is a physical collection of servers that can participate in consensus. A Raft | ||
| group is a logical consensus domain that runs across a specific subset of peers in the cluster. | ||
| At any given time, one peer in a group acts as the "leader" while the others are "followers" or |
Since a group could temporarily have no leader or more than one leader (old leader not yet timed out), let's remove "At any given time, ", i.e.
One of the peers in a group acts as the "leader" ...
| The state machine is not a finite state machine with states and transitions. Instead, it's a | ||
| deterministic computation engine that processes a sequence of operations and maintains some |
Let's remove the first sentence since an application can implement its Raft state machine as a finite state machine.
A state machine is a deterministic computation engine ...
| after the snapshot, bringing the peer up to the current state of the group. | ||
|
|
||
| During normal operation, the state machine continuously processes transactions as they're | ||
| committed by the Raft group, responds to leadership changes, and handles read-only queries. For |
We could remove "responds to leadership changes" since the state machine does not need to do anything.
If the state machine implements LeaderEventApi (or FollowerEventApi) then it is able to respond to leadership changes. I will de-emphasize it though. If you feel it's not a complication that is worth introducing in this doc, then I'll remove it: let me know either way.
... the state machine implements LeaderEventApi (or FollowerEventApi) ...
Yes, you are right that a state machine can implement those APIs. I mean that supporting those APIs is optional.
| ### Designing Your State Machine | ||
|
|
||
| When designing your state machine, ensure your operations are deterministic and can be | ||
| efficiently serialized for replication. Operations must be idempotent, as Raft may occasionally |
Idempotence is not a requirement of a state machine: e.g., x = x+1 is a valid non-idempotent operation. Each transaction is applied exactly once.
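To illustrate the difference between "deterministic" and "idempotent", here is a small Python sketch. `CounterStateMachine` and its operations are hypothetical and do not use the Ratis `StateMachine` interface. It assumes only what the comment states: every committed entry is applied exactly once, in log order.

```python
class CounterStateMachine:
    """Toy state machine (hypothetical, not the Ratis StateMachine interface)."""

    def __init__(self) -> None:
        self.x = 0
        self.last_applied = 0  # index of the last applied log entry

    def apply(self, index: int, op: str) -> None:
        # Each committed entry is applied exactly once, in log order.
        if index != self.last_applied + 1:
            raise ValueError(f"expected index {self.last_applied + 1}, got {index}")
        if op == "increment":    # x = x + 1: deterministic, but not idempotent
            self.x += 1
        elif op == "reset":      # x = 0: deterministic and idempotent
            self.x = 0
        else:
            raise ValueError(f"unknown operation: {op}")
        self.last_applied = index


log = ["increment", "increment", "reset", "increment"]

# Two replicas apply the same committed log and must end in the same state.
replica_a, replica_b = CounterStateMachine(), CounterStateMachine()
for replica in (replica_a, replica_b):
    for index, op in enumerate(log, start=1):
        replica.apply(index, op)

print(replica_a.x, replica_b.x)
```

Both replicas converge because the operations are deterministic and applied once each. Applying `increment` twice for the same entry would break that, which is why exactly-once application matters here, not idempotence.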
| ### Read Consistency Options | ||
|
|
||
| **Linearizable reads** provide the strongest consistency by going through the Raft protocol to | ||
| ensure you're reading the most up-to-date committed data. Use the client's `sendReadOnly` method, |
`sendReadOnly` may use a linearizable read or a leader read, depending on the `raft.server.read.option` configuration.
| data if the leader has been partitioned from the majority. | ||
|
|
||
| **Follower reads** provide eventual consistency by serving reads directly from followers using | ||
| their local state machine. Call `sendReadOnly(message, serverId)` with a specific follower's |
When linearizable read is enabled, follower read using sendReadOnly(message, serverId) is also linearizable.
Ah, this is more nuanced than I thought. I've attempted to rewrite the section on "Read Consistency Options". Let me know what you think.
| the snapshot data is loaded replacing any existing state, and the state machine resumes normal | ||
| operation by replaying any log entries that occurred after the snapshot. | ||
|
|
||
| Your state machine's `initialize` method is responsible for loading snapshots during startup by |
The method is reinitialize.
|
@szetszwo - Thanks very much for taking the time to review: I know it's a longish document. I've addressed your feedback. I struggled a bit with reworking the section on read consistency. If you have any further thoughts on how to make that section clearer or improve its flow, please let me know. |
|
@jcshepherd , thanks for the update!
How about breaking it into several md files? We can merge the files separately. It would also be easier to read.
Sure, will continue reviewing it tomorrow. |
|
Good idea about breaking it apart: thanks. I've broken it into five sections/files with navigation between and within sections. You've reviewed the content in sections 1, 2 and 4 so far. The discussion on read consistency is now in section 2. Sections 3 and 5 probably haven't been looked at. |
I have the following suggestion for the read section:

#### Read Consistency Options
Ratis provides several read patterns with different consistency and performance characteristics.
Read requests query the state machine of a server directly without going through the Raft consensus protocol.
The `sendReadOnly()` API sends the request to the leader.
(When a non-leader server receives such a request, it throws a `NotLeaderException`
and the client then retries with other servers.)
In contrast, the `sendReadOnly(message, serverId)` API sends the request to a particular server,
which may be a leader or a follower.
The server's `raft.server.read.option` configuration affects read behavior:
* **DEFAULT (default setting)**: `sendReadOnly()` performs leader reads for efficiency.
  It provides strong consistency under normal conditions.
  * Split-brain Problem: If an old leader has been partitioned from the majority
    and a new leader has been elected, reading from the old leader can return stale data
    since the old leader does not have the new transactions committed by the new leader.
* **LINEARIZABLE**: both `sendReadOnly()` and `sendReadOnly(message, serverId)`
  use the ReadIndex protocol to provide linearizable consistency, ensuring you always read the most
  up-to-date committed data and won't read stale data as described in the "Split-brain Problem" above.
  * Non-linearizable API: Clients may use `sendReadOnlyNonLinearizable()` to read from the leader's state machine
    directly without a linearizable guarantee.
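
The split-brain case can be illustrated with a toy simulation. This is a Python sketch under stated assumptions, not Ratis code: `Server`, `leader_read`, and `linearizable_read` are hypothetical names, and the "linearizable" path models only the leadership-confirmation step of a ReadIndex-style read (the real protocol also waits for the applied index to reach the read index).

```python
class Server:
    """Toy server with a local key-value state machine (hypothetical, not a Ratis class)."""

    def __init__(self, name, is_leader=False):
        self.name = name
        self.is_leader = is_leader  # what this server currently believes about itself
        self.state = {}


def leader_read(server, key):
    """Leader read: answer from local state without contacting any other peer."""
    if not server.is_leader:
        raise RuntimeError(f"{server.name} is not the leader")
    return server.state.get(key)


def linearizable_read(server, key, reachable_peers, group_size):
    """Simplified ReadIndex-style read: confirm leadership with a majority first."""
    if not server.is_leader:
        raise RuntimeError(f"{server.name} is not the leader")
    acks = 1 + len(reachable_peers)  # the leader counts itself
    if acks < group_size // 2 + 1:
        raise RuntimeError(f"{server.name} cannot confirm leadership with a majority")
    return server.state.get(key)


# Group of 3. s1 was the leader, then got partitioned away from s2 and s3.
s1, s2, s3 = Server("s1", is_leader=True), Server("s2"), Server("s3")
for s in (s1, s2, s3):
    s.state["k"] = "v1"

# s2 is elected on the majority side and commits a new value together with s3.
s2.is_leader = True
for s in (s2, s3):
    s.state["k"] = "v2"

stale = leader_read(s1, "k")                 # old leader still answers from local state
fresh = linearizable_read(s2, "k", [s3], 3)  # new leader can reach a majority

try:
    linearizable_read(s1, "k", [], 3)        # old leader cannot reach anyone
    rejected = False
except RuntimeError:
    rejected = True

print(stale, fresh, rejected)
```

In this sketch the plain leader read on the partitioned old leader returns the stale value, while the majority check rejects the read instead of serving stale data.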
Other than the `sendReadOnly(..)` methods mentioned above,
we have the following read APIs:
**Stale reads with minimum index** let you specify a minimum log index that the peer must have applied
before serving the read. Call `sendStaleRead()`: if the peer hasn't caught up to your minimum index,
it will throw a `StaleReadException`.
**Asynchronous reads**:
Ratis supports both blocking reads and asynchronous reads.
All the blocking read methods are supported by asynchronous reads
-- the blocking reads return a reply directly while asynchronous reads return a future of the reply.
Asynchronous reads additionally support the following APIs:
* **Read-after-write consistency** ensures reads reflect the latest successful write by the same client.
Since write requests go through the Raft consensus protocol but read requests do not,
a read request $R$ may be completed before a write request $W$
even if $R$ is sent (asynchronously) after $W$.
Therefore, the reply returned by $R$ may not include the result updated by $W$.
Use `sendReadAfterWrite()` when you need to read your own writes immediately.
* **Unordered reads** let the asynchronous read requests complete in any order.
The fastest reply is completed first.
Use `sendReadOnlyUnordered()` when you don't care about the ordering. |
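
The minimum-index rule for stale reads described above can be sketched as follows. This is a toy Python model, not the Ratis client API: `Peer`, `apply`, and `stale_read` are hypothetical, and `StaleReadException` is defined locally as a stand-in for the exception named in the text so that the sketch runs on its own.

```python
class StaleReadException(Exception):
    """Local stand-in for the exception named in the text."""


class Peer:
    """Toy peer that tracks the index of the last log entry it has applied."""

    def __init__(self):
        self.applied_index = 0
        self.state = {}

    def apply(self, index, key, value):
        # Apply one committed log entry to the local state machine.
        self.state[key] = value
        self.applied_index = index

    def stale_read(self, key, min_index):
        # Serve the read only if this peer has applied at least min_index.
        if self.applied_index < min_index:
            raise StaleReadException(
                f"applied index {self.applied_index} < required minimum {min_index}")
        return self.state.get(key)


follower = Peer()
follower.apply(1, "k", "v1")
follower.apply(2, "k", "v2")

value = follower.stale_read("k", min_index=2)  # the follower has caught up to index 2

try:
    follower.stale_read("k", min_index=3)      # index 3 has not been applied yet
    too_stale = False
except StaleReadException:
    too_stale = True

print(value, too_stale)
```

A client that remembers the log index of its last write can pass it as the minimum index, so that a lagging peer fails the read instead of returning older data.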
|
Thank you again. :-) I restructured the read consistency section along the lines you suggested and also added a mention and forward reference to the discussion about blocking and async APIs. Thanks! |
szetszwo
left a comment
@jcshepherd , Thanks for the update! Please see the comments inlined.
I think we have gone through all the docs. We probably could merge it after the comments are addressed.
BTW, have you used any AI generative tool to create the doc?
| Leadership in Ratis is both simpler and more complex than it might initially appear. Ratis | ||
| handles all the mechanics of leader election and failover automatically, but your application | ||
| needs to handle leadership changes robustly. |
Why "your application needs to handle leadership changes robustly"? Only applications that want to know who the leader is need to handle leadership changes. If an application does not care about who the leader is, then it does not need to do anything.
The first sentence is a kind of personal comment. Let's remove it.
I suggest rewriting this paragraph as below:
Ratis handles all the mechanics of leader election and failover automatically.
If your application does not care about who is the leader, then it does not need to do anything.
Otherwise, your application can optionally observe leadership change and react accordingly;
see [State Machine Leadership Events](#State-Machine-Leadership-Events).
Thanks: I think that's a better phrasing. Before I read your full comment just now, I was going to raise the scenario of a client that strongly prefers follower reads and needs some kind of application logic to decide what to do if a follower is elected leader, but I agree the phrasing as written overstates the need.
| #### Handling Network Partitions | ||
|
|
||
| When a network partition occurs, the Raft group may split into multiple subgroups that cannot | ||
| communicate with each other. Raft's majority-based approach ensures that only one subgroup (the |
only one -> at most one
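The "at most one" wording can be checked with a small Python sketch. The helper `majority_subgroup` is hypothetical and not part of Ratis. It assumes a partition is described simply by the sizes of the subgroups that can still communicate internally.

```python
def majority_subgroup(group_size, subgroup_sizes):
    """Return the index of the subgroup holding a majority of the group, or None.

    At most one subgroup can qualify, because two disjoint subgroups of
    floor(group_size / 2) + 1 servers each would exceed group_size.
    """
    if sum(subgroup_sizes) > group_size:
        raise ValueError("subgroups contain more servers than the group has")
    quorum = group_size // 2 + 1
    for i, size in enumerate(subgroup_sizes):
        if size >= quorum:
            return i
    return None


print(majority_subgroup(5, [3, 2]))     # the first subgroup has the majority
print(majority_subgroup(4, [2, 2]))     # an even split: no subgroup has a majority
print(majority_subgroup(5, [2, 2, 1]))  # a three-way split: no subgroup has a majority
```

The second and third cases show why "only one" is too strong: a partition can leave no subgroup with a majority, in which case no subgroup can commit writes until the partition heals.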
| decisions. However, it also means that your application may become unavailable for writes if a | ||
| majority of servers are unreachable. |
"if a majority of servers are unreachable" -> "if no subgroups have a majority of servers."
These are both true, correct? E.g., if the client is in DC A, DC A is network partitioned from DC B and DC B contains the majority, then the client in DC A will be unable to write until the partition heals. How about "However, it also means that your application may become unable to write if either the majority subgroup is unreachable or no subgroup has the majority of servers." ?
If a client can connect to the leader (not necessarily a majority of servers), it can write. Also, the fact that clients in DC A cannot write does not mean "the application may become unavailable", since other clients in DC B can write.
For simplicity, let's just talk about servers in this section.
True, true. Went with your suggestion.
| #### What is a Raft Group in Ratis? | ||
|
|
||
| In Ratis terminology, a "Raft Group" is a collection of servers that participate in a single | ||
| Raft cluster. Each group has a unique RaftGroupId (typically a UUID) that distinguishes it from |
RaftGroupId must be a UUID. So, let's remove "typically".
Yes, definitely: I've been using Amazon Q CLI to analyze the Ratis and Ozone codebases, as well as public documentation, and generate the initial document drafts, as well as helping with organization. I've personally reviewed every line of documentation and made corrections/adjustments where I felt they were needed, so I own any errors. Reviewing the rest of your feedback now: thanks. |
|
Thanks. With the exception of my question above (#1338 (comment)), I've incorporated your feedback. I also moved my working copy of index.md to index.md and squashed the commits thus far. |
…anding content, breaking document into 5 sections/files for readability and future expansion, with navigation between and within sections.
|
"Split-brain" behavior revised as suggested, and squashed all commits to this point. |
What changes were proposed in this pull request?
This is a rewrite of concepts/index.md in ratis-docs, which goes into substantially more detail about Ratis/Raft architecture and how applications integrate w/Ratis.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2388
How was this patch tested?
Documentation only.