feat(rl): add CISPO advantage estimator (MiniMax-M1)#1331
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the CISPO (Clipped Importance Sampling Policy Optimization) advantage estimator from MiniMax-M1. It adds the cispo option to the CLI arguments, implements the compute_cispo_loss surrogate function, integrates it into the policy loss pipeline, and includes comprehensive unit tests. The reviewer's feedback highlights that the documentation in cli-reference.md still lists on_policy_distillation as a valid option for --advantage-estimator, which is outdated and should be removed to match the current CLI arguments.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| | Flag | Default | What | | ||
| |---|---|---| | ||
| | `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` | | ||
| | `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` | |
There was a problem hiding this comment.
The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n
There was a problem hiding this comment.
Good catch — verified against miles/utils/arguments.py: on_policy_distillation is not in the argparse choices (OPD is enabled via --use-opd and applied orthogonally on top of any estimator, gated at loss.py's if args.use_opd:), so this doc entry was stale and would produce an argparse error. Since this PR already edits these exact rows, removed it in passing in ca14f03 (both occurrences, lines 71 and 230).
| | Flag | Type | Default | Notes | | ||
| |---|---|---|---| | ||
| | `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` | | ||
| | `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline`, `on_policy_distillation` | |
There was a problem hiding this comment.
The choice on_policy_distillation is listed here as a valid option for --advantage-estimator. However, according to miles/utils/arguments.py, on-policy distillation (OPD) has been refactored to be orthogonal to the advantage estimator (enabled via --opd-kl-coef > 0) and is no longer a valid choice in the argparse choices list. We should remove it from the documentation to avoid user confusion and argparse errors.\n\nSuggested change:\nmarkdown\n| `--advantage-estimator` | enum | `grpo` | `grpo`, `gspo`, `cispo`, `ppo`, `reinforce_plus_plus`, `reinforce_plus_plus_baseline` |\n
There was a problem hiding this comment.
Same as the line-71 occurrence — verified stale (on_policy_distillation is not an argparse choice; OPD is enabled via --use-opd orthogonally to the estimator) and removed in ca14f03.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
8923e38 to
ca14f03
Compare
|
CISPO is REINFORCE + TIS. Therefore it can be more general to implement REINFORCE, then the user can use the existing custom TIS system. |
Problem
PPO/GSPO-style clipped surrogates zero the gradient of any token whose importance ratio is clipped. The MiniMax-M1 report (arXiv:2506.13585) shows these clipped tokens are often exactly the low-probability, high-leverage reasoning fork tokens ("However", "Recheck", ...), and proposes CISPO: keep every token in the gradient, clip only the importance-sampling weight, and stop-gradient it.
Behavior
Before:
--advantage-estimator cispois rejected by argparse; no estimator preserves gradients for ratio-clipped tokens.After:
--advantage-estimator cispotrains with the CISPO surrogate, per token:where
r = pi_theta/pi_oldandAis the GRPO-style group-normalized advantage (unclipped). Every token always contributes gradient; the IS weight is clamped then detached. The paper's "no lower clipping" configuration is reachable with--eps-clip 1.0.Why this shape
cispojoins the existinggrpogates for group reward normalization andget_grpo_returnsrather than duplicating that plumbing. The new math is a single function,compute_cispo_loss, returning the same(per_token_loss, per_token_clipfrac)contract ascompute_policy_loss, so all reduction/metric/TIS/OPSM handling inpolicy_loss_functionis reused unchanged.log_probs.policy_loss_functionkeeps the invariant that every per-token tensor entering the masked reducers is sanitized (ppo_kl,advantages,kl, ... all gettorch.where(active_tokens, nan_to_num(x), 0)). CISPO is the first consumer of raw current-policylog_probsin the loss (PPO/GSPO only consume the already-sanitizedppo_kl), so the cispo branch sanitizes them the same way — otherwise an inf at a loss-masked token becomes0 * inf = NaNin the masked reduction and poisons the loss. A regression test injects a-inftarget logit at a masked token through the fullpolicy_loss_functionpath and asserts the loss and grads stay finite.pg_clipfraccounts tokens whose raw IS weight left the clip band, while PPO's counts tokens where clipping changed the loss — the two are not numerically comparable across estimators (commented at the definition).Tests
Unit tests pin the closed-form loss value, the clamp bounds/clipfrac, the defining invariant that gradient flows only through
log_probs(IS weight detached,ppo_klreceives no grad), and the masked-token NaN guard above. Notest_loss_snapshotconfig was added for cispo because the snapshot artifacts live in the external artifacts repo and an external PR cannot regenerate them — happy to add the config alongside a regenerated snapshot if you'd like.