Add mini-swe-agent harness via ACP shim #576
Integrates SWE-agent's mini-swe-agent as a benchflow agent. A new in-process ACP shim runs mini-swe's DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system instructions, >10k output truncation, and malformed-tool-call retry. The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode), so no env.py wiring is needed; the usage proxy is honored automatically via the injected litellm api_base. Registered as `mini-swe` with aliases mini / minisweagent / mini-swe-agent.
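Loading the bundled config "verbatim (minus the interactive `mode` key)" can be pictured with the minimal sketch below. It assumes the key lives under an `agent` section of the parsed YAML. The helper name `strip_interactive_mode` and the sample keys are hypothetical, not taken from the shim.

```python
from copy import deepcopy
from typing import Any, Dict


def strip_interactive_mode(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the parsed config without the interactive `mode` key.

    Assumption: `mode` sits under an `agent` section; the real mini.yaml
    layout may differ.
    """
    cleaned = deepcopy(config)  # never mutate the caller's dict
    agent_section = cleaned.get("agent")
    if isinstance(agent_section, dict):
        agent_section.pop("mode", None)  # no error if the key is absent
    return cleaned


# Stand-in for the parsed YAML; every key except `mode` is illustrative.
example = {"agent": {"mode": "confirm", "step_limit": 0}, "model": {}}
cleaned = strip_interactive_mode(example)
```

Everything else in the config passes through unchanged, which is what keeps the upstream guardrails intact.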
litellm.completion (what mini-swe drives) speaks chat-completions and anthropic-messages but NOT the OpenAI Responses API. Replace the flat protocol->prefix dict with a policy helper: anthropic-messages -> anthropic, openai-completions -> openai, and openai-responses (only ever aws-bedrock, whose proxy also exposes /v1/messages) -> anthropic for Claude models. Verified end-to-end: Azure gpt-5.5 and Bedrock us.anthropic.claude-opus-4-7 both solve hello-world and the skillsbench pdf-excel-diff task (reward 1.0, token usage extracted via the usage proxy).
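The routing policy described above could look roughly like the sketch below. Only the three protocol mappings come from the commit message. The function name, the two-argument signature, the substring test for Claude models, and the `None` fallback are assumptions made for illustration.

```python
from typing import Optional


def litellm_prefix_sketch(protocol: str, model: str) -> Optional[str]:
    """Map a provider protocol to a litellm provider prefix (illustrative)."""
    if protocol == "anthropic-messages":
        return "anthropic"
    if protocol == "openai-completions":
        return "openai"
    if protocol == "openai-responses":
        # litellm.completion cannot speak the Responses API; per the commit
        # message, Claude models go to the proxy's anthropic-messages surface.
        # Assumption: a Claude model is detected by name.
        if "claude" in model.lower():
            return "anthropic"
    # Assumption: unknown or empty protocols yield no prefix.
    return None
```

A policy function, unlike a flat dict, can branch on the model as well as the protocol, which the `openai-responses` case needs.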
`except (TimeoutError, asyncio.TimeoutError)` in docker teardown (added in benchflow-ai#575) is redundant — asyncio.TimeoutError is an alias of builtin TimeoutError on Python 3.11+. ruff UP041 flags it, failing `ruff check src tests` for every PR off main. Collapse to `except TimeoutError`. Behavior-preserving.
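The aliasing can be checked directly. The snippet below is a standalone illustration, not the docker teardown code, and `stop_with_deadline` is a hypothetical helper name.

```python
import asyncio
import sys


async def stop_with_deadline(seconds: float) -> str:
    """Wait on a slow step and report whether the deadline was hit."""
    try:
        await asyncio.wait_for(asyncio.sleep(1.0), timeout=seconds)
    except TimeoutError:  # on Python 3.11+ this is also asyncio.TimeoutError
        return "timed out"
    return "finished"


# True on Python 3.11+, where the asyncio name is an alias of the builtin.
ALIASED = asyncio.TimeoutError is TimeoutError
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: aliased={ALIASED}")
```

On interpreters older than 3.11 the two classes are distinct, so the collapsed `except TimeoutError` is only behavior-preserving when the project targets 3.11 or newer.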
Review follow-ups for the mini-swe ACP shim:
- Fix dangling submit tool_call: the `echo COMPLETE_TASK...` command makes env.execute raise Submitted before the parent emits observations, leaving its ACP tool_call stuck in_progress. _ACPAgent.execute_actions now catches Submitted, emits a completed tool_call_update, then re-raises. Verified on Azure gpt-5.5 and Bedrock opus-4.7: every tool call ends `completed`.
- Make the shim import-safe: stdout redirection and the (banner-printing) minisweagent import move into main()/a factory, so the pure routing policy is importable and unit-testable without the sandbox runtime.
- Integration parity: add tests/integration/configs/mini-swe.yaml and register mini-swe in run.sh ALL_AGENTS, matching the other 8 supported agents.
- Add tests/test_mini_swe_routing.py covering Azure/Bedrock/anthropic/empty protocol routing in _litellm_prefix.
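The import-safety point can be sketched as follows: pure helpers live at module level, while stdout handling and the noisy import happen only inside `main()`. All names here are hypothetical stand-ins. The real shim's redirection mechanism is not shown in this PR text, so `contextlib.redirect_stdout` is an assumption.

```python
import contextlib
import io
import sys


def pure_policy(protocol: str) -> str:
    """Side-effect-free helper that tests can import directly."""
    return {"anthropic-messages": "anthropic"}.get(protocol, "")


def load_noisy_dependency() -> str:
    """Hypothetical stand-in for an import that prints a banner."""
    print("*** banner ***")
    return "agent-factory"


def main() -> str:
    # Side effects live here, not at import time: stray prints are diverted
    # to stderr so stdout stays reserved for protocol traffic.
    with contextlib.redirect_stdout(sys.stderr):
        return load_noisy_dependency()


# Demonstration: anything main() lets through to stdout would land here.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    factory_name = main()
```

Because nothing runs at import time, a unit test can call `pure_policy` without the sandbox runtime being installed.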
…rmance map
Second-round review follow-ups:
- Model the ACP tool-call lifecycle per action. execute_actions now drives the env loop itself (mirroring DefaultAgent) instead of delegating then patching up after Submitted. Each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (e.g. anything after submit in a multi-tool-call turn) emit nothing instead of being falsely marked completed. tool_call start moves out of query() into the execution loop.
- Return a JSON-RPC error (not a successful end_turn) for unexpected exceptions in session/prompt, so BenchFlow classifies auth/provider/protocol/runtime failures as agent/infra errors rather than masking them as task failures (matches the openclaw shim; the agent's own task failures still return normally with an exit_status).
- Add mini-swe to tests/conformance/run_conformance.py AGENT_MODELS + ENV_KEYS (gemini smoke model + keys) so the conformance run uses the right model and credential check instead of the unknown-agent fallback.
- Add tests/test_mini_swe_submit.py (gated on minisweagent, like the docker smoke test) proving the multi-action submit lifecycle.
Verified: full unit suite green; Azure gpt-5.5 and Bedrock opus-4.7 e2e on SkillsBench weighted-gdp-calc: every tool call completed, 0 dangling, agent iterates on outputs, usage extracted via provider_response.
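The per-action lifecycle above can be illustrated with a small self-contained sketch. `Submitted`, `FakeEnv`, the `"submit"` marker, and the event dictionaries are hypothetical stand-ins rather than the shim's or mini-swe-agent's real types. Only the ordering of events follows the commit message.

```python
from typing import Callable, Dict, List


class Submitted(Exception):
    """Stand-in for the exception raised when the agent submits."""


class FakeEnv:
    """Hypothetical environment: the literal command 'submit' ends the task."""

    def execute(self, command: str) -> str:
        if command == "submit":
            raise Submitted("final answer")
        return f"ran: {command}"


def execute_actions(env: FakeEnv, actions: List[str],
                    emit: Callable[[Dict[str, str]], None]) -> List[str]:
    outputs = []
    for index, command in enumerate(actions):
        call_id = f"call-{index}"
        # Start is emitted only when the action is actually about to run.
        emit({"id": call_id, "status": "in_progress"})
        try:
            output = env.execute(command)
        except Submitted as submitted:
            # Close the submit call with the submission, then propagate.
            emit({"id": call_id, "status": "completed", "content": str(submitted)})
            raise
        emit({"id": call_id, "status": "completed", "content": output})
        outputs.append(output)
    return outputs


events: List[Dict[str, str]] = []
try:
    execute_actions(FakeEnv(), ["ls", "submit", "cat notes.txt"], events.append)
except Submitted:
    pass  # the agent loop would end the turn here
```

In this run the third action never executes, so no event carries the id `call-2`, and every call that started also ends `completed`.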
Thermo-nuclear code quality review
What this PR is trying to solve
PR #576 adds … The shape is reasonable, but I would not merge yet. There are experiment-health and maintainability issues that can make trajectories incomplete or the harness non-reproducible.
Blocking findings
Residual risks
Summary
Integrates mini-swe-agent as a first-class benchflow agent (`mini-swe`, aliases `mini` / `minisweagent` / `mini-swe-agent`), following the same integration contract as the other supported agents.
mini-swe is a deliberately minimal, single-bash-tool harness for apples-to-apples model comparison. A new in-process ACP shim runs its `DefaultAgent` loop and loads its bundled `mini.yaml` verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single `bash` tool, shared system/instance templates, >10k output truncation, malformed-tool-call retry.
Files
- `src/benchflow/agents/mini_swe_acp_shim.py` (new): import-safe ACP shim (no top-level side effects; stdout isolation + minisweagent import happen in `main()`).
- `src/benchflow/agents/registry.py`: one additive `AgentConfig` + aliases; installs into an isolated `/opt/benchflow/mini-swe-venv`.
- `tests/integration/configs/mini-swe.yaml` + `tests/integration/run.sh`: agent integration matrix.
- `tests/conformance/run_conformance.py`: mini-swe smoke model + env keys.
- `tests/test_mini_swe_routing.py`, `tests/test_mini_swe_submit.py`: routing + submit-lifecycle tests.
- `src/benchflow/sandbox/docker.py`: drive-by fix of a pre-existing `ruff UP041` error (added in "openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable" #575) that failed `ruff check src tests` for every PR off main. Behavior-preserving.
Provider wiring
The shim reads `BENCHFLOW_PROVIDER_*` directly (like `openclaw` / `pi` / `opencode` / `harvey-lab`), so there are no `env.py` changes; the usage proxy is honored via the injected litellm `api_base`, so token usage is captured the same way as other agents.
`_litellm_prefix` reconstructs the litellm provider prefix from `BENCHFLOW_PROVIDER_PROTOCOL`. mini-swe drives `litellm.completion` (chat-completions / anthropic-messages, not the OpenAI Responses API); `openai-responses` only comes from aws-bedrock, whose proxy also exposes an anthropic-messages surface, so Anthropic models route there. This makes Azure (`openai-completions`) and Bedrock (Claude via the proxy's `/v1/messages`) both work.
Review follow-ups (thermo-nuclear, two rounds)
- `execute_actions` drives the env loop itself: each action emits start→result around its own `env.execute`; the submit action is closed with the submission; actions that never run (anything after submit in a multi-tool-call turn) emit nothing rather than being falsely marked completed. Fixes both the original dangling-submit bug and the multi-action pollution case.
- Unexpected exceptions in `session/prompt` return a JSON-RPC error (not a successful `end_turn`), so auth/provider/protocol/runtime failures are classified as agent/infra errors instead of masquerading as task failures (matches openclaw). The agent's own task failures still return normally with an `exit_status`.
- Stdout redirection and the (banner-printing) minisweagent import move into `main()`/a factory, so the routing policy is importable/unit-testable without the sandbox runtime and importing the module never clobbers stdout.
- mini-swe is added to the integration matrix (`configs/` + `run.sh`) and the conformance smoke map.
Test plan
- `ruff check src tests`, `ruff format --check src tests`, `ty check`: clean
- `test_mini_swe_routing.py` (Azure/Bedrock/anthropic/empty routing); `test_mini_swe_submit.py` (multi-action submit lifecycle: executed action → real output completed, submit → submission completed, post-submit action → not emitted)
- `weighted-gdp-calc`, Azure `azure-foundry-openai/gpt-5.5`: healthy trajectory, 11 tool calls all `completed` (0 dangling), agent iterates on outputs, usage extracted (`provider_response`)
- `weighted-gdp-calc`, Bedrock `aws-bedrock/us.anthropic.claude-opus-4-7`: healthy trajectory, 9 tool calls all `completed` (0 dangling), usage extracted
- `pdf-excel-diff`: reward 1.0 on both Azure gpt-5.5 and Bedrock opus-4.7
- `claude-agent-acp` through the same pipeline unaffected
Note: `reward 0.0` on `weighted-gdp-calc` reflects model/task difficulty, not the integration; trajectory, ACP lifecycle, and token-usage extraction are all healthy.