
Add CLI-Gym Harbor taskset #1665

Merged
rasdani merged 5 commits into main from cli-gym-harbor-taskset on Jun 15, 2026


rasdani commented Jun 13, 2026

Contributor

Summary

  • add CLIGymTaskSet and make_cli_gym_taskset for materializing PrimeIntellect/CLI-Gym rows into Harbor task directories
  • document that official CLI-Gym does not provide gold/oracle solutions and that materialized tasks do not include solution/solve.sh
  • document that CLI-Gym scoring tests are visible inside the single task sandbox, so this taskset should be treated as a carefully filtered SFT/trajectory-generation source rather than a hardened RL reward environment
  • raise CLI-Gym-specific errors for gold-patch and TaskSet.validate() paths instead of falling through to generic Harbor missing-solution errors
  • avoid reusing empty or partial HF materialization caches by requiring a completed full-materialization marker
  • skip malformed HF rows during bulk materialization while preserving strict errors for explicitly requested tasks
  • set CLI-Gym sandbox TTL from agent timeout plus test timeout plus a small buffer, while keeping the test command timeout scoped to max_test_timeout_sec
  • make the derived Harbor cache manifest-backed with per-task row fingerprints and prune stale full-cache task dirs when the HF split changes
  • preserve generic dataset_path loading for prebuilt Harbor-format task dirs while normalizing Prime HF image refs to pullable GAR refs
  • export the taskset from verifiers.experimental.composable.tasksets
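
The split between bulk and explicitly requested materialization described above can be sketched roughly as follows. This is a minimal illustration, not the code in cli_gym.py: `MalformedRowError`, `parse_row`, `select_tasks`, and the `task_id` / `instruction` row fields are all hypothetical names chosen for the example.

```python
from typing import Iterable, Optional


class MalformedRowError(ValueError):
    """Hypothetical error for a row that cannot be turned into a task."""


def parse_row(row: dict) -> dict:
    # Assumed validity rule: a usable row has a task id and an instruction.
    if not row.get("task_id") or not row.get("instruction"):
        raise MalformedRowError(f"malformed row: {row.get('task_id')!r}")
    return {"task_id": row["task_id"], "instruction": row["instruction"]}


def select_tasks(
    rows: Iterable[dict], requested: Optional[Iterable[str]] = None
) -> list:
    """Parse rows, skipping bad ones in bulk mode and raising in requested mode."""
    wanted = set(requested) if requested is not None else None
    tasks = []
    for row in rows:
        if wanted is not None and row.get("task_id") not in wanted:
            continue  # not one of the explicitly requested tasks
        try:
            tasks.append(parse_row(row))
        except MalformedRowError:
            if wanted is not None:
                raise  # the caller asked for this task by name: fail loudly
            continue  # bulk materialization: drop the bad row and carry on
    return tasks
```

Under these assumptions, a bulk run tolerates a bad row, while asking for that same row by name surfaces the error to the caller.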

Validation

  • uv run ruff check verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py
  • uv run python -m py_compile verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py
  • runtime check that _apply_gold_patch() and validate_instance() raise the CLI-Gym-specific unsupported-gold-solution message
  • materialization regression check covering empty-cache rematerialization, completed-cache reuse, bulk malformed-row skip, and requested malformed-row error
  • sandbox lifetime regression check: 3000s agent + 3000s test produces a 101-minute SandboxSpec.timeout_minutes, and missing timeouts preserve default behavior
  • manifest cache regression check covering stale-dir pruning, unchanged-cache reuse, changed-row rematerialization, bulk malformed-row skip, requested partial materialization, and requested malformed-row error
  • commit hooks: ruff check, ruff format, Semgrep v1 policy, generated AGENTS/CLAUDE check
  • push hooks: ruff check, ruff format, Semgrep v1 policy, generated AGENTS/CLAUDE check, ty
  • earlier import/materialization check for two CLI-Gym rows, including a malformed YAML row: corrupt-python-exceptiongroup-source-by-semantic-mutation
  • earlier Harbor debug no-op smoke via research-environments harbor-debug: reward 0.0, saved under /home/ubuntu/git/cligym-validation/20260613T004749Z/harbor-debug-noop-900/evals/harbor-debug--openai--gpt-4.1-mini/f935e518
  • earlier GPT-5.5 reward hunt using existing rlm_harness composition: reward 1.0 on python-environment-library-shadowing-with-partial-module-removal, saved under /home/ubuntu/git/cligym-validation/20260613T004749Z/rlm-gpt55-shadowing-extract180/evals/rlm-cligym--openai--gpt-5.5/a8753335
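
The sandbox lifetime check above (3000s agent + 3000s test giving 101 minutes) is consistent with a calculation like the one below. This is a sketch under stated assumptions: the 60-second buffer, the round-up to whole minutes, and the name `sandbox_timeout_minutes` are guesses made for illustration, and returning None stands in for "leave the default sandbox timeout untouched".

```python
import math
from typing import Optional

# Assumed buffer; the PR only says "a small buffer".
BUFFER_SEC = 60


def sandbox_timeout_minutes(
    agent_timeout_sec: Optional[float],
    test_timeout_sec: Optional[float],
    buffer_sec: float = BUFFER_SEC,
) -> Optional[int]:
    """Return a sandbox TTL in whole minutes, or None to keep the default."""
    if agent_timeout_sec is None or test_timeout_sec is None:
        # Assumption: if either timeout is missing, do not override the default.
        return None
    total_sec = agent_timeout_sec + test_timeout_sec + buffer_sec
    # Round up so the sandbox never expires before the test step can finish.
    return math.ceil(total_sec / 60)
```

With these assumed values, `sandbox_timeout_minutes(3000, 3000)` gives 101, matching the figure in the regression check. The test command itself would still be limited separately by max_test_timeout_sec.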

Note

Medium Risk
New sandbox materialization and scoring paths affect training/eval pipelines; misuse as an RL reward env is possible because tests run in-task, though docs and explicit runtime errors mitigate gold-validation misuse.

Overview
Adds CLIGymTaskSet and make_cli_gym_taskset, wiring CLI-Gym (default Hugging Face repo PrimeIntellect/CLI-Gym) into the composable Harbor taskset stack by subclassing HarborDatasetTaskSet.

When no local dataset_path is given, rows are materialized into Harbor-style task directories (instructions, Docker assets, task.toml, test scripts) with fingerprinted caching via .materialized.json, stale-dir pruning, and optional Prime GAR image ref normalization. Runtime behavior sets sandbox TTL from agent + test timeouts (plus buffer), scopes test_timeout from metadata, and uses a custom tests/test.sh wrapper that writes reward.txt without failing the verifier step.
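
One way to picture the fingerprinted cache is the sketch below. It is not the PR's implementation: `row_fingerprint`, `sync_cache`, the flat id-to-fingerprint manifest layout, and writing only instruction.md per task are simplifying assumptions. Only the .materialized.json file name and the use of SHA-256 come from the PR description.

```python
import hashlib
import json
import shutil
from pathlib import Path

MANIFEST_NAME = ".materialized.json"


def row_fingerprint(row: dict) -> str:
    # Canonical JSON so equal rows hash identically regardless of key order.
    payload = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sync_cache(cache_dir: Path, rows: dict) -> list:
    """Bring cache_dir in line with rows; return the task ids that were rewritten."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = cache_dir / MANIFEST_NAME
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new, rewritten = {}, []
    for task_id, row in rows.items():
        fingerprint = row_fingerprint(row)
        task_dir = cache_dir / task_id
        if old.get(task_id) != fingerprint or not task_dir.is_dir():
            # Changed or missing task: rebuild its directory from scratch.
            shutil.rmtree(task_dir, ignore_errors=True)
            task_dir.mkdir(parents=True)
            (task_dir / "instruction.md").write_text(row.get("instruction", ""))
            rewritten.append(task_id)
        new[task_id] = fingerprint
    # Prune directories for tasks that are no longer in the split.
    for stale_id in set(old) - set(new):
        shutil.rmtree(cache_dir / stale_id, ignore_errors=True)
    # Written last, so an interrupted run leaves no manifest claiming completion.
    manifest_path.write_text(json.dumps(new, indent=2, sort_keys=True))
    return rewritten
```

Writing the manifest only after all task directories exist is what lets a later run tell a completed cache from an empty or partial one in this sketch.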

Gold/oracle paths are explicitly blocked: _apply_gold_patch and validate_instance raise CLI-Gym-specific RuntimeError exceptions, reflecting the missing solution/solve.sh and the visible in-sandbox tests (documented as an SFT/trajectory-generation source, not a hardened RL environment). Exports are added to the harbor and tasksets package __init__ modules.

Reviewed by Cursor Bugbot for commit 6304cf3.

Note

Add CLI-Gym Harbor taskset backed by local or materialized HF datasets

  • Introduces CLIGymTaskSet in cli_gym.py, a new HarborDatasetTaskSet subclass that loads tasks from a local directory tree or materializes them from a Hugging Face dataset on first use.
  • Materialization converts HF dataset rows into per-task directories containing instruction.md, task.yaml, a Dockerfile, docker-compose.yaml, task.toml, and a tests/test.sh wrapper; subsequent runs skip unchanged tasks via SHA-256 fingerprints and a manifest file.
  • The tests/test.sh wrapper parses pytest output to write a reward.txt signal (1.0 on pass, 0.0 on failure) and always exits 0 so the shell step does not fail.
  • Sandbox containers are provisioned with the task's Docker image and a combined agent+test lifetime timeout; workdir is fixed to /testbed.
  • Gold patch application and instance validation both raise RuntimeError, as CLI-Gym tasks have no oracle solution path.
  • Exports CLIGymTaskSet and make_cli_gym_taskset from the harbor and tasksets package namespaces.
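
The reward rule in the wrapper can be illustrated in Python, although the actual wrapper is a shell script whose parsing logic is not shown in this thread. The function name `reward_from_pytest_output` and the rule itself (read the last `=== ... in N.NNs ===` summary line, require at least one passed test and no failed or errored ones) are assumptions made for this sketch; it also assumes pytest runs at default verbosity, since `-q` drops the `=` decoration.

```python
import re


def reward_from_pytest_output(output: str) -> str:
    """Map pytest terminal output to the text that would go into reward.txt."""
    summary = ""
    for line in output.splitlines():
        stripped = line.strip()
        # Keep the last line that looks like pytest's final summary banner.
        if stripped.startswith("=") and stripped.endswith("=") and " in " in stripped:
            summary = stripped
    counts = {
        label: int(number)
        for number, label in re.findall(r"(\d+) (passed|failed|errors?)", summary)
    }
    failures = (
        counts.get("failed", 0) + counts.get("error", 0) + counts.get("errors", 0)
    )
    # Assumed rule: reward 1.0 only when something passed and nothing failed.
    return "1.0" if counts.get("passed", 0) > 0 and failures == 0 else "0.0"
```

A wrapper built this way would write the returned string to reward.txt and then exit 0 regardless of the result, so the verifier step itself never fails.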

Macroscope summarized 6304cf3.


macroscopeapp Bot commented Jun 13, 2026


Approvability

Verdict: Needs human review

This PR introduces a substantial new feature (712 lines) adding CLI-Gym taskset functionality with dataset materialization, Docker handling, and scoring logic. New capabilities of this scope require human review, and there is also an unresolved review comment about potential incorrect scoring of successful test runs.


cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit f6a160d.

rasdani requested a review from samsja on June 15, 2026, 23:53
rasdani merged commit 5ed2809 into main on Jun 15, 2026
15 checks passed