Add CLI-Gym Harbor taskset by rasdani · Pull Request #1665 · PrimeIntellect-ai/verifiers

rasdani · 2026-06-13T01:07:48Z

Summary

add CLIGymTaskSet and make_cli_gym_taskset for materializing PrimeIntellect/CLI-Gym rows into Harbor task directories
document that official CLI-Gym does not provide gold/oracle solutions and that materialized tasks do not include solution/solve.sh
document that CLI-Gym scoring tests are visible inside the single task sandbox, so this taskset should be treated as a carefully filtered SFT/trajectory-generation source rather than a hardened RL reward environment
raise CLI-Gym-specific errors for gold-patch and TaskSet.validate() paths instead of falling through to generic Harbor missing-solution errors
avoid reusing empty or partial HF materialization caches by requiring a completed full-materialization marker
skip malformed HF rows during bulk materialization while preserving strict errors for explicitly requested tasks
set CLI-Gym sandbox TTL from agent timeout plus test timeout plus a small buffer, while keeping the test command timeout scoped to max_test_timeout_sec
make the derived Harbor cache manifest-backed with per-task row fingerprints and prune stale full-cache task dirs when the HF split changes
preserve generic dataset_path loading for prebuilt Harbor-format task dirs while normalizing Prime HF image refs to pullable GAR refs
export the taskset from verifiers.envs.experimental.composable.tasksets
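
The bulk-versus-requested handling of malformed rows described in the list above can be sketched as follows. This is a minimal illustration, not the code in cli_gym.py: `REQUIRED_FIELDS`, `parse_row`, and `select_rows` are hypothetical names, and the real row schema and error types may differ.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical schema: the real CLI-Gym rows carry more fields than this.
REQUIRED_FIELDS = ("task_id", "instruction")


def parse_row(row: dict) -> dict:
    """Validate one dataset row and raise ValueError if it is malformed."""
    missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
    if missing:
        raise ValueError(f"row {row.get('task_id', '<unknown>')!r} is missing {missing}")
    return row


def select_rows(rows: Iterable[dict], requested: set[str] | None = None) -> list[dict]:
    """Skip malformed rows in bulk mode; re-raise for explicitly requested tasks."""
    selected = []
    for row in rows:
        if requested is not None and row.get("task_id") not in requested:
            continue
        try:
            selected.append(parse_row(row))
        except ValueError:
            if requested is not None:
                raise  # the caller asked for this task by name, so fail loudly
            # Bulk materialization: drop the bad row and keep going.
    return selected
```

Calling `select_rows(rows)` tolerates bad rows, while `select_rows(rows, {"some-task"})` surfaces the error for that task.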

Validation

uv run ruff check verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py
uv run python -m py_compile verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py
runtime check that _apply_gold_patch() and validate_instance() raise the CLI-Gym-specific unsupported-gold-solution message
materialization regression check covering empty-cache rematerialization, completed-cache reuse, bulk malformed-row skip, and requested malformed-row error
sandbox lifetime regression check: 3000s agent + 3000s test produces a 101-minute SandboxSpec.timeout_minutes, and missing timeouts preserve default behavior
manifest cache regression check covering stale-dir pruning, unchanged-cache reuse, changed-row rematerialization, bulk malformed-row skip, requested partial materialization, and requested malformed-row error
commit hooks: ruff check, ruff format, Semgrep v1 policy, generated AGENTS/CLAUDE check
push hooks: ruff check, ruff format, Semgrep v1 policy, generated AGENTS/CLAUDE check, ty
earlier import/materialization check for two CLI-Gym rows, including a malformed YAML row: corrupt-python-exceptiongroup-source-by-semantic-mutation
earlier Harbor debug no-op smoke via research-environments harbor-debug: reward 0.0, saved under /home/ubuntu/git/cligym-validation/20260613T004749Z/harbor-debug-noop-900/evals/harbor-debug--openai--gpt-4.1-mini/f935e518
earlier GPT-5.5 reward hunt using existing rlm_harness composition: reward 1.0 on python-environment-library-shadowing-with-partial-module-removal, saved under /home/ubuntu/git/cligym-validation/20260613T004749Z/rlm-gpt55-shadowing-extract180/evals/rlm-cligym--openai--gpt-5.5/a8753335
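
The sandbox lifetime check above (3000s agent + 3000s test giving 101 minutes) is consistent with the arithmetic sketched below. The 60-second buffer and the function name `sandbox_timeout_minutes` are assumptions: the PR only says "a small buffer", and how the real code treats a single missing timeout is not stated.

```python
from __future__ import annotations

import math


def sandbox_timeout_minutes(
    agent_timeout_sec: float | None,
    test_timeout_sec: float | None,
    buffer_sec: float = 60.0,  # assumed value; the PR only says "a small buffer"
) -> int | None:
    """Return a sandbox lifetime in whole minutes, or None to keep the default."""
    if agent_timeout_sec is None or test_timeout_sec is None:
        return None  # assumption: any missing timeout leaves the default untouched
    total_sec = agent_timeout_sec + test_timeout_sec + buffer_sec
    return math.ceil(total_sec / 60)


print(sandbox_timeout_minutes(3000, 3000))  # 101
```

Rounding up ensures the sandbox never expires before the agent and test phases have both had their full budget.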

Note

Medium Risk
New sandbox materialization and scoring paths affect training/eval pipelines; misuse as an RL reward env is possible because tests run in-task, though docs and explicit runtime errors mitigate gold-validation misuse.

Overview
Adds CLIGymTaskSet and make_cli_gym_taskset, wiring CLI-Gym (default Hugging Face repo PrimeIntellect/CLI-Gym) into the composable Harbor taskset stack by subclassing HarborDatasetTaskSet.

When no local dataset_path is given, rows are materialized into Harbor-style task directories (instructions, Docker assets, task.toml, test scripts) with fingerprinted caching via .materialized.json, stale-dir pruning, and optional Prime GAR image ref normalization. Runtime behavior sets sandbox TTL from agent + test timeouts (plus buffer), scopes test_timeout from metadata, and uses a custom tests/test.sh wrapper that writes reward.txt without failing the verifier step.
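
A manifest-backed cache of this kind might look like the sketch below. Only the `.materialized.json` filename comes from the PR; `row_fingerprint`, `sync_cache`, the manifest layout (task id mapped to SHA-256 digest), and the `write_task` callback are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json
import shutil
from pathlib import Path
from typing import Callable

MANIFEST_NAME = ".materialized.json"


def row_fingerprint(row: dict) -> str:
    """Hash a row's canonical JSON form so any field change alters the digest."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def sync_cache(
    cache_dir: Path,
    rows: dict[str, dict],
    write_task: Callable[[Path, dict], object],
) -> list[str]:
    """Rebuild changed tasks, prune stale dirs, and return the rebuilt task ids."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = cache_dir / MANIFEST_NAME
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {task_id: row_fingerprint(row) for task_id, row in rows.items()}

    rebuilt = []
    for task_id, fingerprint in new.items():
        task_dir = cache_dir / task_id
        if old.get(task_id) == fingerprint and task_dir.is_dir():
            continue  # unchanged row with an existing directory: reuse it
        shutil.rmtree(task_dir, ignore_errors=True)
        task_dir.mkdir(parents=True)
        write_task(task_dir, rows[task_id])
        rebuilt.append(task_id)

    # Task dirs recorded earlier but absent from the current split are stale.
    for task_id in set(old) - set(new):
        shutil.rmtree(cache_dir / task_id, ignore_errors=True)

    # Written last, so an interrupted run never looks like a completed one.
    manifest_path.write_text(json.dumps(new, indent=2, sort_keys=True))
    return rebuilt
```

Writing the manifest only after all task directories exist is one way to get the "completed marker" property: an empty or partial cache has no matching manifest entry and is rebuilt on the next run.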

Gold/oracle paths are explicitly blocked: _apply_gold_patch and validate_instance raise CLI-Gym-specific RuntimeError exceptions, because materialized tasks have no solution/solve.sh and the scoring tests are visible inside the sandbox. The taskset is therefore documented as a source for SFT and trajectory generation, not as a hardened RL reward environment. Exports are added to the harbor and tasksets package __init__ modules.
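
The blocking pattern can be illustrated with a stand-in base class. Everything here other than the two method names and the use of RuntimeError is hypothetical: the real HarborDatasetTaskSet signatures (which may be async), the generic error it raises, and the exact message text are not shown in this PR page.

```python
class HarborTaskSetStub:
    """Stand-in for the real base class; its behavior here is an assumption."""

    def _apply_gold_patch(self, *args, **kwargs):
        # Generic failure a caller would otherwise see for a missing solution.
        raise FileNotFoundError("solution/solve.sh not found")

    def validate_instance(self, *args, **kwargs):
        return self._apply_gold_patch(*args, **kwargs)


# Hypothetical wording; the real message lives in cli_gym.py.
NO_GOLD_MESSAGE = (
    "CLI-Gym does not provide gold/oracle solutions; "
    "gold-patch application and validation are unsupported for this taskset."
)


class CLIGymLikeTaskSet(HarborTaskSetStub):
    """Overrides both gold paths so callers get a taskset-specific error."""

    def _apply_gold_patch(self, *args, **kwargs):
        raise RuntimeError(NO_GOLD_MESSAGE)

    def validate_instance(self, *args, **kwargs):
        raise RuntimeError(NO_GOLD_MESSAGE)
```

Overriding both entry points, rather than only the inner one, keeps the error message specific regardless of which path a caller takes.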

Reviewed by Cursor Bugbot for commit 6304cf3.

Note

Add CLI-Gym Harbor taskset backed by local or materialized HF datasets

Introduces CLIGymTaskSet in cli_gym.py, a new HarborDatasetTaskSet subclass that loads tasks from a local directory tree or materializes them from a Hugging Face dataset on first use.
Materialization converts HF dataset rows into per-task directories containing instruction.md, task.yaml, a Dockerfile, docker-compose.yaml, task.toml, and a tests/test.sh wrapper; subsequent runs skip unchanged tasks via SHA-256 fingerprints and a manifest file.
The tests/test.sh wrapper parses pytest output to write a reward.txt signal (1.0 on pass, 0.0 on failure) and always exits 0 so the shell step does not fail.
Sandbox containers are provisioned with the task's Docker image and a combined agent+test lifetime timeout; workdir is fixed to /testbed.
Gold patch application and instance validation both raise RuntimeError, as CLI-Gym tasks have no oracle solution path.
Exports CLIGymTaskSet and make_cli_gym_taskset from the harbor and tasksets package namespaces.
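
The reward rule from the summary above (1.0 on pass, 0.0 on failure) can be sketched in Python, although the actual wrapper is a shell script in tests/test.sh whose parsing logic is not shown here. The functions `reward_from_pytest_output` and `write_reward` are hypothetical, and treating the last non-empty line as the pytest summary is an assumption.

```python
import re
from pathlib import Path


def reward_from_pytest_output(output: str) -> float:
    """Return 1.0 only if the summary line reports passes and no failures or errors."""
    lines = [line for line in output.splitlines() if line.strip()]
    summary = lines[-1] if lines else ""  # assumed to be pytest's final summary line
    passed = re.search(r"\b\d+ passed\b", summary)
    bad = re.search(r"\b\d+ (failed|errors?)\b", summary)
    return 1.0 if passed and not bad else 0.0


def write_reward(output: str, reward_path: Path) -> float:
    """Persist the reward; the caller is expected to exit 0 either way."""
    reward = reward_from_pytest_output(output)
    reward_path.write_text(f"{reward}\n")
    return reward
```

Because the step always exits 0, the reward file is the only success signal, so an unparseable or empty pytest output is deliberately scored 0.0 in this sketch.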

Macroscope summarized 6304cf3.

macroscopeapp · 2026-06-13T01:10:23Z

Approvability

Verdict: Needs human review

This PR introduces a substantial new feature (712 lines) adding CLI-Gym taskset functionality with dataset materialization, Docker handling, and scoring logic. New capabilities of this scope require human review, and there is also an unresolved review comment about potential incorrect scoring of successful test runs.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit f6a160d.

Add CLI-Gym Harbor taskset

aff35cd

cursor Bot reviewed Jun 13, 2026

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py Outdated

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py

macroscopeapp Bot reviewed Jun 13, 2026

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py Outdated

Document CLI-Gym taskset limitations

1bebc00

cursor Bot reviewed Jun 15, 2026

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py

Harden CLI-Gym materialization

f6a160d

cursor Bot reviewed Jun 15, 2026

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py

Comment thread verifiers/envs/experimental/composable/tasksets/harbor/cli_gym.py Outdated

rasdani added 2 commits June 15, 2026 23:35

Extend CLI-Gym sandbox lifetime

af2e461

Prune stale CLI-Gym cache entries

6304cf3

rasdani requested a review from samsja June 15, 2026 23:53

hallerite approved these changes Jun 15, 2026

rasdani merged commit 5ed2809 into main Jun 15, 2026
15 checks passed

rasdani merged 5 commits into main from cli-gym-harbor-taskset

2 participants
