fix: force intervention frames to Advantage: positive (pi*0.6 spec)#14
Open
jiabinq wants to merge 1 commit into OpenDriveLab:main from
Conversation
Human expert intervention frames in DAgger episodes were being labeled purely by advantage percentile, causing ~70% of expert corrections to be labeled "Advantage: negative". This contradicts the pi*0.6 specification, which requires intervention frames to always be forced to positive.

The fix adds intervention forcing to both the assign_task_index (non-staged) and assign_task_index_staged (staged) code paths. When an "intervention" column is present and a frame has intervention=1, the frame is forced to the highest advantage bin regardless of the estimator's output.

Also adds a --task-text CLI arg for configurable task descriptions in tasks.jsonl (previously hardcoded to "fold the cloth"). Includes 5 tests covering binary/n_slices modes for both staged and non-staged paths, plus a backward-compatibility test for when no intervention column exists.
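The forcing described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function name and signature are hypothetical (the real code paths are assign_task_index and assign_task_index_staged), and percentile-based binning is assumed.

```python
import numpy as np

def label_advantage(advantages, intervention=None, n_slices=2):
    """Bin frames by advantage percentile, then force intervention frames
    into the most-positive bin.

    Hypothetical sketch: bins are equal-mass percentile slices, and bin
    index n_slices - 1 is the most positive ("Advantage: positive" in the
    binary case). Any frame with intervention == 1 is moved to that top
    bin regardless of its estimated advantage. If no intervention column
    is provided, labels pass through unchanged (backward-compatible path).
    """
    advantages = np.asarray(advantages, dtype=float)
    # Interior percentile edges for n_slices equal-mass bins
    edges = np.percentile(advantages, np.linspace(0, 100, n_slices + 1)[1:-1])
    bins = np.digitize(advantages, edges)  # values in 0 .. n_slices - 1
    if intervention is not None:
        # Force every intervention frame to the highest advantage bin
        bins = np.where(np.asarray(intervention) == 1, n_slices - 1, bins)
    return bins
```

With binary labeling (n_slices=2), a frame whose advantage falls below the median would normally land in bin 0 ("negative"); if it is an intervention frame, it is forced to bin 1 instead.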
Summary

- Force intervention frames (intervention=1) to the highest advantage bin in both the staged and non-staged labeling paths, matching the pi*0.6 specification
- Add a --task-text CLI arg to replace the hardcoded "fold the cloth" task description in tasks.jsonl

Problem

discretize_advantage.py labels frames purely by advantage percentile. Human expert corrections in DAgger episodes get no special treatment: ~70% end up labeled "Advantage: negative" because the advantage estimator assigns low values at intervention moments (the robot was failing right before the human took over). At inference with "Advantage: positive", the model then avoids reproducing those corrective actions, resulting in weak recovery behavior. Pi*0.6 specifies that intervention frames must always be forced to positive. Evo-RL implements this correctly (force_intervention_positive=True).

Test plan

- 5 tests (test_discretize_advantage.py) covering binary and n_slices modes for both the staged and non-staged paths
- Backward compatibility when the intervention column is absent
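The backward-compatibility case in the test plan could be exercised along these lines. The helper and test names here are illustrative only and assume the forcing semantics described above, not the PR's actual test code.

```python
def force_intervention_bins(bins, intervention, top_bin):
    """Force frames with intervention == 1 into the top advantage bin.

    When no intervention column exists (intervention is None), the labels
    pass through unchanged -- the backward-compatible path the PR tests.
    """
    if intervention is None:
        return list(bins)
    return [top_bin if flag == 1 else b for b, flag in zip(bins, intervention)]


def test_no_intervention_column_is_passthrough():
    # Datasets without an intervention column keep their percentile labels
    bins = [0, 1, 0, 1]
    assert force_intervention_bins(bins, None, top_bin=1) == bins


def test_intervention_frames_forced_positive():
    # Frames flagged intervention=1 land in the top bin regardless of label
    bins = [0, 0, 1]
    assert force_intervention_bins(bins, [1, 0, 0], top_bin=1) == [1, 0, 1]
```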