Skip to content

Conversation

@jcshepherd
Copy link

@jcshepherd jcshepherd commented Jan 23, 2026

What changes were proposed in this pull request?

This is a rewrite of concepts/index.md in ratis-docs, which goes into substantially more detail about Ratis/Raft architecture and how applications integrate w/Ratis.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2388

How was this patch tested?

Documentation only.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcshepherd , thanks a lot for contributing this! I have reviewed down to the Snapshot section. Please see the comments so far. Will continue reviewing this.


Raft's safety guarantees depend on majority agreement within each group. The leader replicates
each operation to the followers in its group, and operations are committed when at least
(N/2 + 1) peers in that group acknowledge them. This means a group of 3 peers can tolerate 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use $\lfloor N/2 + 1 \rfloor$.

Comment on lines 76 to 78
A single cluster can host multiple independent Raft groups, each with its own leader election,
consistency and state replication. Groups typically consist of an odd number of peers (3, 5, or
7 are common) to ensure clear majority decisions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's combine the first sentence to the paragraph and move the second sentence to the next section, which talks about majority.

diff --git a/ratis-docs/src/site/markdown/concept/index-v2.md b/ratis-docs/src/site/markdown/concept/index-v2.md
index d51146863..bf1463c43 100644
--- a/ratis-docs/src/site/markdown/concept/index-v2.md
+++ b/ratis-docs/src/site/markdown/concept/index-v2.md
@@ -71,18 +71,20 @@ group is a logical consensus domain that runs across a specific subset of peers
-At any given time, one peer in a group acts as the "leader" while the others are "followers" or
+One of the peers in a group acts as the "leader" while the others are "followers" or
 "listeners". The leader handles all write requests and replicates operations to other peers in
 the group. Both leaders and followers can service read requests, with different consistency
-guarantees.
+guarantees. A single cluster can host multiple independent Raft groups, 
+each with its own leader election, consistency and state replication.
 
-A single cluster can host multiple independent Raft groups, each with its own leader election,
-consistency and state replication. Groups typically consist of an odd number of peers (3, 5, or
-7 are common) to ensure clear majority decisions.
 
 ### Majority-Based Decision-Making
 
 Raft's safety guarantees depend on majority agreement within each group. The leader replicates
 each operation to the followers in its group, and operations are committed when at least
-(N/2 + 1) peers in that group acknowledge them. This means a group of 3 peers can tolerate 1
+$\lfloor N/2 + 1 \rfloor$ peers in that group acknowledge them. 
+This means a group of 3 peers can tolerate 1
 failure, a group of five peers can tolerate 2 failures, and so on.
+Since a group of $N$ peers for an even $N$ can tolerate the same number of failures as
+a group of $(N-1)$ peers, groups typically consist of an odd number of peers (3, 5, or
+7 are common) to ensure clear majority decisions.
 
 This majority requirement affects both availability and performance. A group remains available as
 long as a majority of its peers are reachable and functioning. However, every transaction must


### Servers, Clusters, and Groups

A Raft server (also known as a "peer") is a single running instance of your application with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add also "member", i.e.

... (also known as a "peer" or a "member")


A Raft cluster is a physical collection of servers that can participate in consensus. A Raft
group is a logical consensus domain that runs across a specific subset of peers in the cluster.
At any given time, one peer in a group acts as the "leader" while the others are "followers" or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a group could temporarily have no leader or more than one leaders (old leader not yet timed out), let remove "At any given time, ", i.e.

One of the peers in a group acts as the "leader" ...

Comment on lines 127 to 128
The state machine is not a finite state machine with states and transitions. Instead, it's a
deterministic computation engine that processes a sequence of operations and maintains some
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the first sentence since an application can implement its Raft state machine as a finite state machine.

A state machine is a deterministic computation engine ...

after the snapshot, bringing the peer up to the current state of the group.

During normal operation, the state machine continuously processes transactions as they're
committed by the Raft group, responds to leadership changes, and handles read-only queries. For
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove "responds to leadership changes" since the state machine does not need to do anything.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the state machine implements LeaderEventApi (or FollowerEventApi) then it is able to respond to leadership changes. I will de-emphasize it though. If you feel it's not a complication that is worth introducing in this doc, then I'll remove it: let me know either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... the state machine implements LeaderEventApi (or FollowerEventApi) ...

Yes, you are right that a state machine can implement those APIs. I mean that supporting those APIs is optional.

### Designing Your State Machine

When designing your state machine, ensure your operations are deterministic and can be
efficiently serialized for replication. Operations must be idempotent, as Raft may occasionally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idempotent is not a requirement of state machine -- e.g. x = x+1 is a valid non-idempotent operation. Each transaction is applied exactly one time.

### Read Consistency Options

**Linearizable reads** provide the strongest consistency by going through the Raft protocol to
ensure you're reading the most up-to-date committed data. Use the client's `sendReadOnly` method,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendReadOnly may use Linearizable read or Leader read, depending on the conf raft.server.read.option.

data if the leader has been partitioned from the majority.

**Follower reads** provide eventual consistency by serving reads directly from followers using
their local state machine. Call `sendReadOnly(message, serverId)` with a specific follower's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When linearizable read is enabled, follower read using sendReadOnly(message, serverId) is also linearizable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is more nuanced than I thought. I've attempted to rewrite the section on "Read Consistency Options". Let me know what you think.

the snapshot data is loaded replacing any existing state, and the state machine resumes normal
operation by replaying any log entries that occurred after the snapshot.

Your state machine's `initialize` method is responsible for loading snapshots during startup by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is reinitialize.

@jcshepherd
Copy link
Author

@szetszwo - Thanks very much for taking the time to review: I know it's a longish document. I've addressed your feedback. I struggled a bit with reworking the section on read consistency. If you have any further thoughts on how to make that section clearer or improve its flow, please let me know.

@szetszwo
Copy link
Contributor

@jcshepherd , thanks for the update!

... I know it's a longish document. ...

How about breaking it to several md files? We can merge the files separately. It would also be easier to read.

... I struggled a bit with reworking the section on read consistency. If you have any further thoughts on how to make that section clearer or improve its flow, please let me know.

Sure, will continue reviewing it tomorrow.

@jcshepherd
Copy link
Author

Good idea about breaking it apart: thanks. I've broken it into five sections/files with navigation between and within sections. You've reviewed the content in sections 1, 2 and 4 so far. The discussion on read consistency is now in section 2. Sections 3 and 5 probably haven't been looked at.
It feels like a largish number of sections, but it seems to flow pretty well and offers lots of room for future expansion. Let me know what you think.

@szetszwo
Copy link
Contributor

... I struggled a bit with reworking the section on read consistency. If you have any further thoughts on how to make that section clearer or improve its flow, please let me know.

I have the following suggestion for the read section:

#### Read Consistency Options

Ratis provides several read patterns with different consistency and performance characteristics.
Read requests query the state machine of a server directly without going through the Raft consensus protocol. 
The `sendReadOnly()` API sends the request to the leader.
(When a non-leader server receives such request, it throws a `NotLeaderException`
and then the client will retry other servers.)
In contrast, the `sendReadOnly(message, serverId)` API sends the request to a particular server,
which may be a leader or a follower.

The server's `raft.server.read.option` configuration affects read behavior:

* **DEFAULT (default setting)**: `sendReadOnly()` performs leader reads for efficiency.
  It provides strong consistency under normal conditions.
  *  Split-brain Problem: In case that an old leader has been partitioned from the majority
  and a new leader has been elected, reading from the old leader can return stale data
  since the old leader does not have the new transactions committed by the new leader.

* **LINEARIZABLE**: both `sendReadOnly()` and `sendReadOnly(message, serverId)`
  use the ReadIndex protocol to provide linearizable consistency, ensuring you always read the most
  up-to-date committed data and won't read stale data as described in the "Split-brain Problem" above.  
  * Non-linearizable API: Clients may use `sendReadOnlyNonLinearizable()` to read from leader's state machine
    directly without a linearizable guarantee.

Other than the `sendReadOnly(..)` methods mentioned above,
we have the following read APIs:

**Stale reads with minimum index** let you specify a minimum log index that the peer must have applied
before serving the read.  Call `sendStaleRead()`: if the peer hasn't caught up to your minimum index,
it will throw a `StaleReadException`.

**Asynchronous reads**:
Ratis supports both blocking reads and asynchronous reads.
All the blocking read methods are supported by asynchronous reads
-- the blocking reads return a reply directly while asynchronous reads return a future of the reply.
Asynchronous reads additionally support the following APIs

* **Read-after-write consistency** ensures reads reflect the latest successful write by the same client.
Since write requests go through the Raft consensus protocol but read requests do not,
a read request $R$ may be completed before a write request $W$
even if $R$ is sent (asynchronously) after $W$.
Therefore, the reply returned by $R$ may not include the result updated by $W$.
Use `sendReadAfterWrite()` when you need to read your own writes immediately.

* **Unorder reads** let the asynchronous read requests complete in any order.
The fastest reply is completed first.
Use `sendReadOnlyUnordered()` when you don't care about the ordering.

@jcshepherd
Copy link
Author

Thank you again. :-) I restructured the read consistency section along the lines you suggested and also added a mention and forward reference to the discussion about blocking and async APIs. Thanks!

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcshepherd , Thanks for the update! Please see the comments inlined.

I think we have gone through all the docs. We probably could merge it after the comments are addressed.

BTW, have you used any AI generative tool to create the doc?

Comment on lines 82 to 84
Leadership in Ratis is both simpler and more complex than it might initially appear. Ratis
handles all the mechanics of leader election and failover automatically, but your application
needs to handle leadership changes robustly.
Copy link
Contributor

@szetszwo szetszwo Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "your application needs to handle leadership changes robustly" ? Only the applications want to know who is the leader need to handle leadership changes. If an application does not care about who is the leader, then it does not need to do anything.

The first sentence is a kind of a personal comment. Let's remove it.

I suggest to rewrite this paragraph as below:

Ratis handles all the mechanics of leader election and failover automatically.
If your application does not care about who is the leader, then it does not need to do anything.
Otherwise, your application can optionally observe leadership change and react accordingly;
see [State Machine Leadership Events](#State-Machine-Leadership-Events).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks: I think that's a better phrasing. Before I read your full comment just now I was going to raise the scenario of client that strongly prefers follower reads needed some kind of application logic to decide what to do if a follower is elected leader, but agree the phrasing as written over-states the need.

#### Handling Network Partitions

When a network partition occurs, the Raft group may split into multiple subgroups that cannot
communicate with each other. Raft's majority-based approach ensures that only one subgroup (the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one -> at most one

Comment on lines 126 to 129
decisions. However, it also means that your application may become unavailable for writes if a
majority of servers are unreachable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if a majority of servers are unreachable" -> "if no subgroups have a majority of servers."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are both true, correct? E.g., if the client is in DC A, DC A is network partitioned from DC B and DC B contains the majority, then the client in DC A will be unable to write until the partition heals. How about "However, it also means that your application may become unable to write if either the majority subgroup is unreachable or no subgroup has the majority of servers." ?

Copy link
Contributor

@szetszwo szetszwo Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a client can connect to the leader (not necessarily a majority of server), it can write. Also, the clients in DC A cannot write does not mean "the application may become unavailable" since the other clients in DC B can write.

For simplicity, let's just talk about servers in this section.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, true. Went with your suggestion.

#### What is a Raft Group in Ratis?

In Ratis terminology, a "Raft Group" is a collection of servers that participate in a single
Raft cluster. Each group has a unique RaftGroupId (typically a UUID) that distinguishes it from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RaftGroupId must be a UUID. So, let's remove "typically".

@jcshepherd
Copy link
Author

BTW, have you used any AI generative tool to create the doc?

Yes, definitely: I've been using Amazon Q CLI to analyze the Ratis and Ozone codebases, as well as public documentation, and generate the initial document drafts, as well as helping with organization. I've personally reviewed every line of documentation and made corrections/adjustments where I felt they were needed, so I own any errors.

Reviewing the rest of your feedback now: thanks.

@jcshepherd
Copy link
Author

Thanks. With the exception of my question above (#1338 (comment)), I've incorporated your feedback. I also moved my working copy of index.md to index.md and squashed the commits thus far.

…anding content, breaking document into 5 sections/files for readability and future expansion, with navigation between and within sections.
@jcshepherd
Copy link
Author

"Split-brain" behavior revised as suggested, and squashed all commits to this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants