Add CLI-Gym Harbor taskset #1665
Approvability

Verdict: Needs human review. This PR introduces a substantial new feature (712 lines) that adds CLI-Gym taskset functionality with dataset materialization, Docker handling, and scoring logic. New capabilities of this scope require human review, and there is also an unresolved review comment about potentially incorrect scoring of successful test runs.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit f6a160d.

Summary
- Adds `CLIGymTaskSet` and `make_cli_gym_taskset` for materializing `PrimeIntellect/CLI-Gym` rows into Harbor task directories
- […] `solution/solve.sh` […] `TaskSet.validate()` paths instead of falling through to generic Harbor missing-solution errors
- […] `max_test_timeout_sec`
- […] `dataset_path` loading for prebuilt Harbor-format task dirs while normalizing Prime HF image refs to pullable GAR refs
- […] `verifiers.experimental.composable.tasksets`

Validation
- `uv run ruff check verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py`
- `uv run python -m py_compile verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py`
- […] `_apply_gold_patch()` and `validate_instance()` raise the CLI-Gym-specific unsupported-gold-solution message
- […] `SandboxSpec.timeout_minutes`, and missing timeouts preserve default behavior
- […] `corrupt-python-exceptiongroup-source-by-semantic-mutation`
- `research-environments` `harbor-debug`: reward `0.0`, saved under `/home/ubuntu/git/cligym-validation/20260613T004749Z/harbor-debug-noop-900/evals/harbor-debug--openai--gpt-4.1-mini/f935e518`
- `rlm_harness` composition: reward `1.0` on `python-environment-library-shadowing-with-partial-module-removal`, saved under `/home/ubuntu/git/cligym-validation/20260613T004749Z/rlm-gpt55-shadowing-extract180/evals/rlm-cligym--openai--gpt-5.5/a8753335`

Note
Medium Risk
New sandbox materialization and scoring paths affect training/eval pipelines; misuse as an RL reward env is possible because tests run in-task, though docs and explicit runtime errors mitigate gold-validation misuse.
Overview
Adds `CLIGymTaskSet` and `make_cli_gym_taskset`, wiring CLI-Gym (default Hugging Face repo `PrimeIntellect/CLI-Gym`) into the composable Harbor taskset stack by subclassing `HarborDatasetTaskSet`.

When no local `dataset_path` is given, rows are materialized into Harbor-style task directories (instructions, Docker assets, `task.toml`, test scripts) with fingerprinted caching via `.materialized.json`, stale-dir pruning, and optional Prime GAR image ref normalization. Runtime behavior sets sandbox TTL from agent + test timeouts (plus buffer), scopes `test_timeout` from metadata, and uses a custom `tests/test.sh` wrapper that writes `reward.txt` without failing the verifier step.

Gold/oracle paths are explicitly blocked: `_apply_gold_patch` and `validate_instance` raise CLI-Gym-specific **RuntimeError**s, reflecting missing `solution/solve.sh` and visible in-sandbox tests (documented as SFT/trajectory use, not hardened RL). Exports are added on the harbor and tasksets package `__init__` modules.

Reviewed by Cursor Bugbot for commit 6304cf3. Bugbot is set up for automated code reviews on this repo.
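For readers who want a concrete picture of the fingerprinted caching mentioned in the overview, here is a minimal sketch. It is not the code from `cli_gym.py`: the function names, the manifest layout (a single `fingerprint` key), and the choice to hash a canonical JSON dump of the row are all assumptions made for illustration. Only the manifest file name `.materialized.json` comes from the summary above.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = ".materialized.json"  # file name taken from the PR summary


def row_fingerprint(row: dict) -> str:
    """SHA-256 over a canonical JSON dump of the row (assumed scheme)."""
    payload = json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_materialization(task_dir: Path, row: dict) -> bool:
    """Return True when the task dir has no manifest or its fingerprint is stale."""
    manifest = task_dir / MANIFEST_NAME
    if not manifest.is_file():
        return True
    try:
        recorded = json.loads(manifest.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return True  # treat a corrupt manifest as stale
    if not isinstance(recorded, dict):
        return True
    return recorded.get("fingerprint") != row_fingerprint(row)


def record_materialized(task_dir: Path, row: dict) -> None:
    """Write the manifest after the task files have been (re)generated."""
    task_dir.mkdir(parents=True, exist_ok=True)
    manifest = task_dir / MANIFEST_NAME
    manifest.write_text(
        json.dumps({"fingerprint": row_fingerprint(row)}), encoding="utf-8"
    )
```

Under these assumptions, a materializer would call `needs_materialization` per row, regenerate the directory only when it returns `True`, and then call `record_materialized`. Sorting keys before hashing keeps the fingerprint stable across dict orderings.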
Note
Add CLI-Gym Harbor taskset backed by local or materialized HF datasets
- Adds `CLIGymTaskSet` in cli_gym.py, a new `HarborDatasetTaskSet` subclass that loads tasks from a local directory tree or materializes them from a Hugging Face dataset on first use.
- Materialization writes `instruction.md`, `task.yaml`, a Dockerfile, `docker-compose.yaml`, `task.toml`, and a `tests/test.sh` wrapper; subsequent runs skip unchanged tasks via SHA-256 fingerprints and a manifest file.
- The `tests/test.sh` wrapper parses pytest output to write a `reward.txt` signal (1.0 on pass, 0.0 on failure) and always exits 0 so the shell step does not fail.
- […] `/testbed`.
- Gold/oracle paths raise `RuntimeError`, as CLI-Gym tasks have no oracle solution path.
- Exports `CLIGymTaskSet` and `make_cli_gym_taskset` from the `harbor` and `tasksets` package namespaces.

Macroscope summarized 6304cf3.
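The reward mapping described for the `tests/test.sh` wrapper (1.0 on pass, 0.0 on failure, always exit 0) can be sketched in Python as below. The actual wrapper is a shell script whose parsing rules are not shown on this page, so this is an illustration under assumptions: the helper names are hypothetical, and the rule "the last line reporting pytest counts decides the reward" is one plausible reading, not the PR's confirmed logic.

```python
import re
from pathlib import Path

# Matches count fragments of a pytest summary line, e.g. "1 failed, 2 passed".
_COUNT = re.compile(r"(\d+) (passed|failed|errors?)\b")


def reward_from_pytest_output(output: str) -> float:
    """Return 1.0 only if the last summary-like line has passes and no failures/errors."""
    for line in reversed(output.splitlines()):
        counts: dict[str, int] = {}
        for number, kind in _COUNT.findall(line):
            key = "error" if kind.startswith("error") else kind
            counts[key] = int(number)
        if counts:
            bad = counts.get("failed", 0) + counts.get("error", 0)
            return 1.0 if counts.get("passed", 0) > 0 and bad == 0 else 0.0
    return 0.0  # no summary found: treat as failure (assumption)


def write_reward(output: str, reward_path: Path) -> int:
    """Write the reward file and return 0 so the calling step never fails."""
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{reward_from_pytest_output(output)}\n", encoding="utf-8")
    return 0
```

Defaulting to 0.0 when no summary line is found is a conservative choice here; a wrapper that instead defaulted to success, or matched the wrong line, would score runs incorrectly, which is the kind of issue the unresolved review comment refers to.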