Add mini-swe-agent harness via ACP shim by bingran-you · Pull Request #576 · benchflow-ai/benchflow

bingran-you · 2026-05-28T20:27:24Z

Summary

Integrates mini-swe-agent as a first-class benchflow agent (mini-swe, aliases mini / minisweagent / mini-swe-agent), following the same integration contract as the other supported agents.

mini-swe is a deliberately minimal, single-bash-tool harness for apples-to-apples model comparison. A new in-process ACP shim runs its DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive mode key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system/instance templates, >10k output truncation, malformed-tool-call retry.

Files

src/benchflow/agents/mini_swe_acp_shim.py (new) — import-safe ACP shim (no top-level side effects; stdout isolation + minisweagent import happen in main()).
src/benchflow/agents/registry.py — one additive AgentConfig + aliases; installs into an isolated /opt/benchflow/mini-swe-venv.
tests/integration/configs/mini-swe.yaml + tests/integration/run.sh — agent integration matrix.
tests/conformance/run_conformance.py — mini-swe smoke model + env keys.
tests/test_mini_swe_routing.py, tests/test_mini_swe_submit.py — routing + submit-lifecycle tests.
src/benchflow/sandbox/docker.py — drive-by fix for a pre-existing ruff UP041 error (introduced in #575, "openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable") that failed ruff check src tests for every PR off main. Behavior-preserving.

Provider wiring

The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode/harvey-lab) — no env.py changes; the usage proxy is honored via the injected litellm api_base, so token usage is captured the same way as other agents.

_litellm_prefix reconstructs the litellm provider prefix from BENCHFLOW_PROVIDER_PROTOCOL. mini-swe drives litellm.completion (chat-completions / anthropic-messages, not the OpenAI Responses API); openai-responses only comes from aws-bedrock, whose proxy also exposes an anthropic-messages surface, so Anthropic models route there. This makes Azure (openai-completions) and Bedrock (Claude via the proxy's /v1/messages) both work.
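A minimal sketch of the routing policy described above, for illustration only. The function name `litellm_prefix`, the Claude-detection heuristic, and the empty-string fallback are assumptions, not the shim's actual code; the protocol-to-prefix pairs are the ones named in this PR.

```python
def litellm_prefix(protocol: str, model: str) -> str:
    """Hypothetical stand-in for the shim's _litellm_prefix helper."""
    # Assumption: a model counts as Anthropic if its id mentions anthropic/claude.
    name = model.lower()
    is_claude = "anthropic" in name or "claude" in name

    if protocol == "anthropic-messages":
        return "anthropic"
    if protocol == "openai-completions":
        return "openai"
    if protocol == "openai-responses":
        # Per the PR, this protocol only comes from aws-bedrock, whose proxy
        # also exposes /v1/messages, so Claude models go through "anthropic".
        # Non-Claude models fall back to "openai" (the best-effort route).
        return "anthropic" if is_claude else "openai"
    # Assumption: an empty or unknown protocol yields no prefix.
    return ""


azure = litellm_prefix("openai-completions", "azure-foundry-openai/gpt-5.5")
bedrock = litellm_prefix("openai-responses", "aws-bedrock/us.anthropic.claude-opus-4-7")
```

Under these assumptions `azure` resolves to the `openai` prefix and `bedrock` to `anthropic`, matching the two end-to-end configurations in the test plan.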

Review follow-ups (thermo-nuclear, two rounds)

Per-action ACP lifecycle: execute_actions drives the env loop itself — each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (anything after submit in a multi-tool-call turn) emit nothing rather than being falsely marked completed. Fixes both the original dangling-submit bug and the multi-action pollution case.
Infra-error classification: unexpected exceptions in session/prompt return a JSON-RPC error (not a successful end_turn), so auth/provider/protocol/runtime failures are classified as agent/infra errors instead of masquerading as task failures (matches openclaw). The agent's own task failures still return normally with an exit_status.
Import-safe shim: stdout redirect + minisweagent import live in main()/a factory, so the routing policy is importable/unit-testable without the sandbox runtime and importing the module never clobbers stdout.
Parity coverage: mini-swe added to the integration matrix (configs/ + run.sh) and the conformance smoke map.
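The per-action lifecycle from the first follow-up can be sketched roughly as below. `Submitted`, `FakeEnv`, the event dictionaries, and the `echo COMPLETE_TASK` prefix check are hypothetical stand-ins chosen for illustration; they are not mini-swe's or the shim's real types.

```python
class Submitted(Exception):
    """Hypothetical stand-in for the exception raised when the agent submits."""

    def __init__(self, submission: str):
        super().__init__(submission)
        self.submission = submission


class FakeEnv:
    """Toy environment: the submit command raises, anything else 'runs'."""

    def execute(self, action: dict) -> dict:
        if action["command"].startswith("echo COMPLETE_TASK"):
            raise Submitted("final answer")
        return {"output": f"ran {action['command']}", "returncode": 0}


def execute_actions(env, actions, emit):
    """Emit start -> result around each env.execute, one action at a time."""
    outputs = []
    for index, action in enumerate(actions):
        call_id = f"call-{index}"
        emit({"id": call_id, "status": "in_progress"})
        try:
            result = env.execute(action)
        except Submitted as done:
            # Close the submit action with the submission, then propagate.
            emit({"id": call_id, "status": "completed", "content": done.submission})
            raise
        emit({"id": call_id, "status": "completed", "content": result["output"]})
        outputs.append(result)
    return outputs


events = []
turn = [{"command": "ls"}, {"command": "echo COMPLETE_TASK"}, {"command": "pwd"}]
try:
    execute_actions(FakeEnv(), turn, events.append)
except Submitted:
    pass
```

In this toy run the third action never executes, so no event is emitted for it, while both earlier calls end in `completed`.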

Test plan

ruff check src tests, ruff format --check src tests, ty check — clean
Full unit suite: 2504 passed, 3 skipped, 0 failed
test_mini_swe_routing.py (Azure/Bedrock/anthropic/empty routing); test_mini_swe_submit.py (multi-action submit lifecycle — executed action → real output completed, submit → submission completed, post-submit action → not emitted)
Real e2e on SkillsBench weighted-gdp-calc, Azure azure-foundry-openai/gpt-5.5: healthy trajectory, 11 tool calls all completed (0 dangling), agent iterates on outputs, usage extracted (provider_response)
Real e2e on SkillsBench weighted-gdp-calc, Bedrock aws-bedrock/us.anthropic.claude-opus-4-7: healthy trajectory, 9 tool calls all completed (0 dangling), usage extracted
Real e2e on SkillsBench pdf-excel-diff: reward 1.0 on both Azure gpt-5.5 and Bedrock opus-4.7
Regression — existing claude-agent-acp through the same pipeline unaffected

Note: reward 0.0 on weighted-gdp-calc reflects model/task difficulty, not the integration — trajectory, ACP lifecycle, and token-usage extraction are all healthy.

Integrates SWE-agent's mini-swe-agent as a benchflow agent. A new in-process ACP shim runs mini-swe's DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system instructions, >10k output truncation, and malformed-tool-call retry. The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode), so no env.py wiring is needed; the usage proxy is honored automatically via the injected litellm api_base. Registered as `mini-swe` with aliases mini / minisweagent / mini-swe-agent.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

litellm.completion (what mini-swe drives) speaks chat-completions and anthropic-messages but NOT the OpenAI Responses API. Replace the flat protocol->prefix dict with a policy helper: anthropic-messages -> anthropic, openai-completions -> openai, and openai-responses (only ever aws-bedrock, whose proxy also exposes /v1/messages) -> anthropic for Claude models. Verified end-to-end: Azure gpt-5.5 and Bedrock us.anthropic.claude-opus-4-7 both solve hello-world and the skillsbench pdf-excel-diff task (reward 1.0, token usage extracted via the usage proxy).

`except (TimeoutError, asyncio.TimeoutError)` in docker teardown (added in benchflow-ai#575) is redundant — asyncio.TimeoutError is an alias of builtin TimeoutError on Python 3.11+. ruff UP041 flags it, failing `ruff check src tests` for every PR off main. Collapse to `except TimeoutError`. Behavior-preserving.
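A small check of the aliasing this commit relies on, as a sketch. The helper `wait_briefly` is made up for illustration; the version guard is there because the alias only holds on Python 3.11+.

```python
import asyncio
import sys


async def wait_briefly() -> str:
    """Let asyncio.wait_for time out and catch it with the builtin name only."""
    try:
        await asyncio.wait_for(asyncio.sleep(1), timeout=0.01)
    except TimeoutError:
        return "timed out"
    return "finished"


if sys.version_info >= (3, 11):
    # On 3.11+ both names refer to the same class, so listing both is redundant.
    same_class = asyncio.TimeoutError is TimeoutError
    outcome = asyncio.run(wait_briefly())
```

On older interpreters the two names are distinct classes, so the collapsed `except TimeoutError` is only behavior-preserving when the project targets 3.11 or later.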

Review follow-ups for the mini-swe ACP shim:
- Fix dangling submit tool_call: the `echo COMPLETE_TASK...` command makes env.execute raise Submitted before the parent emits observations, leaving its ACP tool_call stuck in_progress. _ACPAgent.execute_actions now catches Submitted, emits a completed tool_call_update, then re-raises. Verified on Azure gpt-5.5 and Bedrock opus-4.7: every tool call ends `completed`.
- Make the shim import-safe: stdout redirection and the (banner-printing) minisweagent import move into main()/a factory, so the pure routing policy is importable and unit-testable without the sandbox runtime.
- Integration parity: add tests/integration/configs/mini-swe.yaml and register mini-swe in run.sh ALL_AGENTS, matching the other 8 supported agents.
- Add tests/test_mini_swe_routing.py covering Azure/Bedrock/anthropic/empty protocol routing in _litellm_prefix.

…rmance map

Second-round review follow-ups:
- Model the ACP tool-call lifecycle per action. execute_actions now drives the env loop itself (mirroring DefaultAgent) instead of delegating then patching up after Submitted. Each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (e.g. anything after submit in a multi-tool-call turn) emit nothing instead of being falsely marked completed. tool_call start moves out of query() into the execution loop.
- Return a JSON-RPC error (not a successful end_turn) for unexpected exceptions in session/prompt, so BenchFlow classifies auth/provider/protocol/runtime failures as agent/infra errors rather than masking them as task failures (matches the openclaw shim; the agent's own task failures still return normally with an exit_status).
- Add mini-swe to tests/conformance/run_conformance.py AGENT_MODELS + ENV_KEYS (gemini smoke model + keys) so the conformance run uses the right model and credential check instead of the unknown-agent fallback.
- Add tests/test_mini_swe_submit.py (gated on minisweagent, like the docker smoke test) proving the multi-action submit lifecycle.

Verified: full unit suite green; Azure gpt-5.5 and Bedrock opus-4.7 e2e on SkillsBench weighted-gdp-calc: every tool call completed, 0 dangling, agent iterates on outputs, usage extracted via provider_response.
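The error-classification change could look roughly like the sketch below. `handle_prompt`, `run_agent`, and the placement of `exit_status` inside the result are hypothetical; the PR does not state which error code the shim returns, so `-32603` (JSON-RPC 2.0 "Internal error") is an assumption.

```python
def handle_prompt(request_id, run_agent):
    """Return a JSON-RPC result for normal runs and an error for crashes."""
    try:
        # Assumption: run_agent() returns the agent's exit_status, including
        # for ordinary task failures, and raises only on unexpected errors.
        exit_status = run_agent()
    except Exception as exc:
        # Unexpected failure (auth, provider, protocol, runtime): report it as
        # a JSON-RPC error so it is not mistaken for a completed turn.
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32603, "message": f"{type(exc).__name__}: {exc}"},
        }
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"stopReason": "end_turn", "exit_status": exit_status},
    }


def failing_run():
    raise RuntimeError("auth failed")


ok_response = handle_prompt(1, lambda: "Submitted")
error_response = handle_prompt(2, failing_run)
```

The point of the split is that a harness reading `error_response` can classify the run as an agent/infra error, while `ok_response` still carries the agent's own outcome.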

bingran-you · 2026-06-02T23:09:34Z

Thermo-nuclear code quality review

What this PR is trying to solve

PR #576 adds mini-swe-agent as a first-class BenchFlow ACP agent. It registers a new mini-swe agent, installs mini-swe into an isolated venv, launches a Python ACP shim, routes BenchFlow provider env into LiteLLM, emits ACP tool-call lifecycle events, and adds routing/submit tests plus integration/conformance config.

The shape is reasonable, but I would not merge yet. There are experiment-health and maintainability issues that can make trajectories incomplete or the harness non-reproducible.

Blocking findings

ACP tool trajectories do not record the observation the agent actually saw.

In src/benchflow/agents/mini_swe_acp_shim.py:217-220, the shim emits ACP tool results using only output.get("output", ""). That drops returncode, exception_info, and mini-swe's own observation rendering. Then src/benchflow/agents/mini_swe_acp_shim.py:233-237 separately feeds the model the formatted mini-swe observation.

That means the BenchFlow ACP/tool trajectory is not the same evidence the agent used. A failed command is still marked completed, but the ACP trajectory may not show its return code. Long outputs are truncated to 2k chars here, while mini-swe's config exposes up to 10k head/tail to the model. For SkillsBench/experiment review, this is a trajectory-health blocker.

Handoff: emit the same rendered observation mini-swe gives the model, or emit structured JSON text containing at least returncode, output/output_head/output_tail, elided_chars, and exception_info. Add non-skipped regressions for a nonzero command and a >10k output command, asserting the ACP tool_call_update includes return code and truncation metadata.
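One possible shape for the structured payload suggested in this handoff, as a sketch. The function name `render_observation` and the even head/tail split are assumptions; the field names are the ones listed in the handoff, and the 10k threshold is the truncation limit mentioned in the PR description.

```python
import json


def render_observation(output: dict, max_chars: int = 10_000) -> str:
    """Serialize a command result so the trajectory keeps what the agent saw."""
    text = output.get("output", "")
    payload = {
        "returncode": output.get("returncode"),
        "exception_info": output.get("exception_info"),
    }
    if len(text) <= max_chars:
        payload["output"] = text
        payload["elided_chars"] = 0
    else:
        # Assumption: keep equal-sized head and tail and count what was cut.
        half = max_chars // 2
        payload["output_head"] = text[:half]
        payload["output_tail"] = text[-half:]
        payload["elided_chars"] = len(text) - 2 * half
    return json.dumps(payload)


failed = json.loads(render_observation({"output": "boom", "returncode": 2}))
long_run = json.loads(render_observation({"output": "x" * 12_000, "returncode": 0}))
```

A regression test along the lines the handoff asks for could then assert on `returncode` for the nonzero command and on `elided_chars` for the long one.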

The harness depends on an unpinned upstream package, while the critical contract test is skipped in normal CI.

src/benchflow/agents/registry.py:609-610 installs mini-swe-agent with no version or commit pin. Meanwhile tests/test_mini_swe_submit.py:16 skips the submit/lifecycle regression unless minisweagent happens to be installed locally.

This is too fragile for a benchmark harness. The shim hand-copies part of DefaultAgent.execute_actions; if upstream changes action shape, exception class, or observation contract, BenchFlow can silently start producing unhealthy trajectories. Standard CI would not catch it because the boundary test is skipped.

Handoff: pin mini-swe-agent to an exact version or source commit, and make the submit lifecycle test run in CI either by adding the package to dev dependencies or by testing the shim boundary with small fake DefaultAgent/Submitted modules. Also update regression test docstrings to name PR #576 (Add mini-swe-agent harness via ACP shim) or the guarding commit per AGENTS.md.

The openai-responses non-Claude route is encoded as “best-effort” but appears unsupported.

src/benchflow/agents/mini_swe_acp_shim.py:75-76 maps openai-responses non-Anthropic models to LiteLLM's openai provider, and tests/test_mini_swe_routing.py:27-28 locks this in for aws-bedrock/openai.gpt-oss-20b-1:0.

But mini-swe uses litellm.completion, and the PR description says it cannot speak OpenAI Responses. BenchFlow's Bedrock proxy exposes /v1/messages and /v1/responses, not a chat-completions surface (src/benchflow/providers/bedrock_proxy.py:257-261). This path is likely to become an infra error, not a healthy model failure.

Handoff: for openai-responses + non-Anthropic models, fail fast with an explicit unsupported-provider error, or add/select an actual chat-completions endpoint before claiming support. The routing test should assert the chosen supported behavior, not “best effort”.
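The fail-fast option in this handoff might be sketched as follows. `UnsupportedProviderError`, `require_supported_route`, and the Claude-detection heuristic are hypothetical names and logic introduced for illustration only.

```python
class UnsupportedProviderError(RuntimeError):
    """Hypothetical error for provider/model pairs the shim cannot drive."""


def require_supported_route(protocol: str, model: str) -> None:
    """Raise before the run starts if no usable endpoint exists for the pair."""
    # Assumption: a model counts as Anthropic if its id mentions anthropic/claude.
    name = model.lower()
    is_claude = "anthropic" in name or "claude" in name
    if protocol == "openai-responses" and not is_claude:
        raise UnsupportedProviderError(
            f"{model}: protocol {protocol!r} has no chat-completions or "
            "anthropic-messages surface for non-Anthropic models"
        )


require_supported_route("openai-responses", "aws-bedrock/us.anthropic.claude-opus-4-7")

try:
    require_supported_route("openai-responses", "aws-bedrock/openai.gpt-oss-20b-1:0")
    rejected = False
except UnsupportedProviderError:
    rejected = True
```

A routing test written against this variant would assert the explicit rejection rather than a best-effort prefix.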

Residual risks

The Bedrock E2E claim should state the sandbox backend. On Daytona/Modal, host-side Bedrock proxy routing is unreachable for agents without direct Bedrock support, and currently only OpenHands has direct support. Add a Docker-vs-Daytona canary or document mini-swe Bedrock as Docker-only until direct routing exists.
src/benchflow/agents/registry.py is now close to the 1k-line review threshold. The next agent/harness change should extract agent-specific setup out of the central registry.

bingran-you added 7 commits May 24, 2026 13:11

Merge branch 'benchflow-ai:main' into main

c4dffff

Merge branch 'benchflow-ai:main' into main

6621b9a

Merge branch 'benchflow-ai:main' into main

5a20884

Merge branch 'benchflow-ai:main' into main

13a5746

Merge branch 'benchflow-ai:main' into main

be68a0b

Merge branch 'benchflow-ai:main' into main

961085e

devin-ai-integration Bot reviewed May 28, 2026

bingran-you added 5 commits May 28, 2026 13:44

Apply ruff format to mini-swe shim

53a64d8

bingran-you closed this Jun 3, 2026

bingran-you wants to merge 12 commits into benchflow-ai:main from bingran-you:bry/blissful-allen-516e63
