Skip to content

feat(rl): add CISPO advantage estimator (MiniMax-M1)#1331

Open
EazyReal wants to merge 1 commit into
radixark:mainfrom
EazyReal:upstream-pr/cispo-advantage-estimator
Open

feat(rl): add CISPO advantage estimator (MiniMax-M1)#1331
EazyReal wants to merge 1 commit into
radixark:mainfrom
EazyReal:upstream-pr/cispo-advantage-estimator

Conversation

@EazyReal

Copy link
Copy Markdown

Problem

PPO/GSPO-style clipped surrogates zero the gradient of any token whose importance ratio is clipped. The MiniMax-M1 report (arXiv:2506.13585) shows these clipped tokens are often exactly the low-probability, high-leverage reasoning fork tokens ("However", "Recheck", ...), and proposes CISPO: keep every token in the gradient, clip only the importance-sampling weight, and stop-gradient it.

Behavior

Before: --advantage-estimator cispo is rejected by argparse; no estimator preserves gradients for ratio-clipped tokens.

After: --advantage-estimator cispo trains with the CISPO surrogate, per token:

L = -sg(clip(r, 1-eps_clip, 1+eps_clip_high)) * A * log pi_theta

where r = pi_theta/pi_old and A is the GRPO-style group-normalized advantage (unclipped). Every token always contributes gradient; the IS weight is clamped then detached. The paper's "no lower clipping" configuration is reachable with --eps-clip 1.0.

Why this shape

  • Estimator-level integration, no parallel path. CISPO's advantage is GRPO's, so cispo joins the existing grpo gates for group reward normalization and get_grpo_returns rather than duplicating that plumbing. The new math is a single function, compute_cispo_loss, returning the same (per_token_loss, per_token_clipfrac) contract as compute_policy_loss, so all reduction/metric/TIS/OPSM handling in policy_loss_function is reused unchanged.
  • NaN guard on log_probs. policy_loss_function keeps the invariant that every per-token tensor entering the masked reducers is sanitized (ppo_kl, advantages, kl, ... all get torch.where(active_tokens, nan_to_num(x), 0)). CISPO is the first consumer of raw current-policy log_probs in the loss (PPO/GSPO only consume the already-sanitized ppo_kl), so the cispo branch sanitizes them the same way — otherwise an inf at a loss-masked token becomes 0 * inf = NaN in the masked reduction and poisons the loss. A regression test injects a -inf target logit at a masked token through the full policy_loss_function path and asserts the loss and grads stay finite.
  • Metrics note: cispo's pg_clipfrac counts tokens whose raw IS weight left the clip band, while PPO's counts tokens where clipping changed the loss — the two are not numerically comparable across estimators (commented at the definition).

Tests

Unit tests pin the closed-form loss value, the clamp bounds/clipfrac, the defining invariant that gradient flows only through log_probs (IS weight detached, ppo_kl receives no grad), and the masked-token NaN guard above. No test_loss_snapshot config was added for cispo because the snapshot artifacts live in the external artifacts repo and an external PR cannot regenerate them — happy to add the config alongside a regenerated snapshot if you'd like.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the CISPO (Clipped Importance Sampling Policy Optimization) advantage estimator from MiniMax-M1. It adds the cispo option to the CLI arguments, implements the compute_cispo_loss surrogate function, integrates it into the policy loss pipeline, and includes comprehensive unit tests. The reviewer's feedback highlights that the documentation in cli-reference.md still lists on_policy_distillation as a valid option for --advantage-estimator, which is outdated and should be removed to match the current CLI arguments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread docs/user-guide/cli-reference.md Outdated
| Flag | Default | What |
|---|---|---|
| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |
| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — verified against miles/utils/arguments.py: on_policy_distillation is not in the argparse choices (OPD is enabled via --use-opd and applied orthogonally on top of any estimator, gated at loss.py's if args.use_opd:), so this doc entry was stale and would produce an argparse error. Since this PR already edits these exact rows, removed it in passing in ca14f03 (both occurrences, lines 71 and 230).

Comment thread docs/user-guide/cli-reference.md Outdated
| Flag | Type | Default | Notes |
|---|---|---|---|
| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |
| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the line-71 occurrence — verified stale (on_policy_distillation is not an argparse choice; OPD is enabled via --use-opd orthogonally to the estimator) and removed in ca14f03.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lawrence-harmonic

lawrence-harmonic commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

CISPO is REINFORCE + TIS. Therefore it can be more general to implement REINFORCE, then the user can use the existing custom TIS system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants