feat(rl): add CISPO advantage estimator (MiniMax-M1) by EazyReal · Pull Request #1331 · radixark/miles

EazyReal · 2026-06-12T08:02:32Z

Problem

PPO/GSPO-style clipped surrogates zero the gradient of any token whose importance ratio is clipped. The MiniMax-M1 report (arXiv:2506.13585) shows these clipped tokens are often exactly the low-probability, high-leverage reasoning fork tokens ("However", "Recheck", ...), and proposes CISPO: keep every token in the gradient, clip only the importance-sampling weight, and stop-gradient it.

Behavior

Before: --advantage-estimator cispo is rejected by argparse; no estimator preserves gradients for ratio-clipped tokens.

After: --advantage-estimator cispo trains with the CISPO surrogate, per token:

L = -sg(clip(r, 1-eps_clip, 1+eps_clip_high)) * A * log pi_theta

where r = pi_theta/pi_old and A is the GRPO-style group-normalized advantage (unclipped). Every token always contributes gradient; the IS weight is clamped then detached. The paper's "no lower clipping" configuration is reachable with --eps-clip 1.0.

Why this shape

Estimator-level integration, no parallel path. CISPO's advantage is GRPO's, so cispo joins the existing grpo gates for group reward normalization and get_grpo_returns rather than duplicating that plumbing. The new math is a single function, compute_cispo_loss, returning the same (per_token_loss, per_token_clipfrac) contract as compute_policy_loss, so all reduction/metric/TIS/OPSM handling in policy_loss_function is reused unchanged.
NaN guard on log_probs. policy_loss_function keeps the invariant that every per-token tensor entering the masked reducers is sanitized (ppo_kl, advantages, kl, ... all get torch.where(active_tokens, nan_to_num(x), 0)). CISPO is the first consumer of raw current-policy log_probs in the loss (PPO/GSPO only consume the already-sanitized ppo_kl), so the cispo branch sanitizes them the same way — otherwise an inf at a loss-masked token becomes 0 * inf = NaN in the masked reduction and poisons the loss. A regression test injects a -inf target logit at a masked token through the full policy_loss_function path and asserts the loss and grads stay finite.
Metrics note: cispo's pg_clipfrac counts tokens whose raw IS weight left the clip band, while PPO's counts tokens where clipping changed the loss — the two are not numerically comparable across estimators (commented at the definition).

Tests

Unit tests pin the closed-form loss value, the clamp bounds/clipfrac, the defining invariant that gradient flows only through log_probs (IS weight detached, ppo_kl receives no grad), and the masked-token NaN guard above. No test_loss_snapshot config was added for cispo because the snapshot artifacts live in the external artifacts repo and an external PR cannot regenerate them — happy to add the config alongside a regenerated snapshot if you'd like.

gemini-code-assist

Code Review

This pull request introduces the CISPO (Clipped Importance Sampling Policy Optimization) advantage estimator from MiniMax-M1. It adds the cispo option to the CLI arguments, implements the compute_cispo_loss surrogate function, integrates it into the policy loss pipeline, and includes comprehensive unit tests. The reviewer's feedback highlights that the documentation in cli-reference.md still lists on_policy_distillation as a valid option for --advantage-estimator, which is outdated and should be removed to match the current CLI arguments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T08:09:21Z

 | Flag | Default | What |
 |---|---|---|
-| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |
+| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |


The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n

Good catch — verified against miles/utils/arguments.py: on_policy_distillation is not in the argparse choices (OPD is enabled via --use-opd and applied orthogonally on top of any estimator, gated at loss.py's if args.use_opd:), so this doc entry was stale and would produce an argparse error. Since this PR already edits these exact rows, removed it in passing in ca14f03 (both occurrences, lines 71 and 230).

gemini-code-assist · 2026-06-12T08:09:21Z

 | Flag | Type | Default | Notes |
 |---|---|---|---|
-| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |
+| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |


The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n

Same as the line-71 occurrence — verified stale (on_policy_distillation is not an argparse choice; OPD is enabled via --use-opd orthogonally to the estimator) and removed in ca14f03.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lawrence-harmonic · 2026-06-15T15:20:46Z

CISPO is REINFORCE + TIS. Therefore it can be more general to implement REINFORCE, then the user can use the existing custom TIS system.

EazyReal requested review from Shi-Dong, Zhichenzzz, fzyzcjy, guapisolo, jybsuper, maocheng23 and yueming-yuan as code owners June 12, 2026 08:02

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

feat(rl): add CISPO advantage estimator (MiniMax-M1)

ca14f03

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal force-pushed the upstream-pr/cispo-advantage-estimator branch from 8923e38 to ca14f03 Compare June 12, 2026 08:46

yueming-yuan mentioned this pull request Jun 12, 2026

[sglang-miles] weight checker: ULP-based quant error tolerance via allow_quant_error sgl-project/sglang#28082

Open

This was referenced Jun 15, 2026

feat(rl): add REINFORCE advantage estimator #1343

Open

feat(rl): composable current-policy importance-sampling correction (TIS hook) #1344

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rl): add CISPO advantage estimator (MiniMax-M1)#1331

feat(rl): add CISPO advantage estimator (MiniMax-M1)#1331
EazyReal wants to merge 1 commit into
radixark:mainfrom
EazyReal:upstream-pr/cispo-advantage-estimator

EazyReal commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

EazyReal Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

EazyReal Jun 12, 2026

Uh oh!

lawrence-harmonic commented Jun 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EazyReal commented Jun 12, 2026

Problem

Behavior

Why this shape

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

EazyReal Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

EazyReal Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

lawrence-harmonic commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lawrence-harmonic commented Jun 15, 2026 •

edited

Loading