1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

### Added

- **Daytona usage telemetry by default** — Daytona runs now start a sandbox-local provider usage proxy so token/cost telemetry works without an external tunnel; use `--usage-tracking off` to bypass proxying when needed.
- **Azure AI Foundry providers** — new `azure-foundry-openai/` and `azure-foundry-anthropic/` prefixes routing through Foundry's unified resource. Export `AZURE_API_KEY` plus `AZURE_API_ENDPOINT` (e.g. `https://<resource>.openai.azure.com/`); benchflow derives the resource name from the endpoint host, builds the per-surface base URL, and maps the key onto the agent-native auth env automatically. Missing/unrecognized endpoints and unsupported agent/provider protocol pairings fail fast with clear errors instead of falling through to the wrong endpoint.
- **Azure Foundry auth guidance** — agent discovery output and docs now call out that provider-prefixed models can use provider-specific credentials instead of the agent's native/default API key.
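
The endpoint handling described in the Azure AI Foundry entry could look roughly like the sketch below. It is an illustration only: `derive_foundry_resource` is a hypothetical name, and the single accepted host suffix is an assumption taken from the example URL in the changelog, not from benchflow's source.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Assumption: only the host suffix shown in the changelog example is recognized.
_ASSUMED_SUFFIX = ".openai.azure.com"


def derive_foundry_resource(endpoint: str | None) -> str:
    """Return the resource name from an Azure endpoint host, or fail fast."""
    if not endpoint:
        raise ValueError("AZURE_API_ENDPOINT is not set")
    host = (urlparse(endpoint).hostname or "").lower()
    if not host.endswith(_ASSUMED_SUFFIX):
        raise ValueError(f"unrecognized Azure endpoint host: {host!r}")
    resource = host[: -len(_ASSUMED_SUFFIX)]
    if not resource or "." in resource:
        raise ValueError(f"cannot derive resource name from host: {host!r}")
    return resource
```

Raising on a missing or unrecognized host, rather than guessing, matches the "fail fast with clear errors" behavior the entry describes.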

22 changes: 6 additions & 16 deletions docs/reference/cli.md
@@ -45,17 +45,15 @@ bench eval create \
--concurrency 64 \
--sandbox-setup-timeout 300

-# From remote repo with required token usage telemetry through an external tunnel
+# From remote repo with required token usage telemetry
bench eval create \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--usage-tracking required \
-  --usage-proxy-url https://your-tunnel.example.com \
-  --usage-proxy-port 18081 \
-  --concurrency 1 \
+  --concurrency 16 \
--sandbox-setup-timeout 300

# From local directory
@@ -98,9 +96,6 @@ bench eval create \
| `--model` | Agent default | Model ID |
| `--sandbox` | `docker` | Sandbox: docker, daytona, or modal |
| `--usage-tracking` | `auto` | Token usage telemetry policy: `auto`, `required`, or `off` |
-| `--usage-proxy-url` | — | Externally reachable usage-proxy base URL for remote sandboxes such as Daytona |
-| `--usage-proxy-bind-host` | auto | Local interface for the usage proxy; external proxy mode defaults to `127.0.0.1` |
-| `--usage-proxy-port` | random | Fixed local port for externally tunneled usage tracking |
| `--environment-manifest` | — | Path to an Environment-plane manifest (`environment.toml`); applied to every rollout in the batch |
| `--concurrency` | `4` | Max concurrent tasks (batch mode only) |
| `--agent-idle-timeout` | (built-in default) | Abort ACP prompts after this many idle seconds; `0` disables idle detection |
@@ -120,15 +115,10 @@ When mounting skills, the recommended docs default is
[Architecture: skill loading](../architecture.md#skill-loading) for how
`--skills-dir` is registered with each agent and how the nudge modes differ.

-For official Daytona batch runs that must report provider token/cost telemetry,
-use `--usage-tracking required` with a tunnel or ingress URL pointing at the
-fixed `--usage-proxy-port`. The fixed-port tunnel mode supports one rollout per
-BenchFlow process; use `--concurrency 1`, or run multiple jobs with separate
-ports/tunnels. This limit applies only to metered external-tunnel mode; Daytona
-batch runs that do not require usage telemetry can still use higher concurrency.
-Without an external URL, Daytona runs continue in `auto` mode and record
-`usage_source=unavailable` because the remote sandbox cannot reach a host-bound
-proxy.
+Daytona batch runs collect provider token/cost telemetry by default with a
+sandbox-local proxy. Use `--usage-tracking required` when missing telemetry
+should fail the rollout, or `--usage-tracking off` for recovery runs that should
+leave provider traffic untouched.

`--source-env` is for external hosted environment hubs. The first supported
runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity
18 changes: 6 additions & 12 deletions docs/v05-e2e-testing-guide.md
@@ -25,18 +25,12 @@ All commands below assume you are in the repo root.
> The examples below use `weighted-gdp-calc` (fast, ~5 tool calls) as the
> default lightweight task. Swap in any task name from `$TASKS/`.

-> **Usage telemetry caveat (Daytona / Modal):** Remote sandboxes run the agent
-> on a host that cannot reach BenchFlow's host-bound usage proxy. Default
-> `--usage-tracking auto` therefore records `agent_result.usage_source ==
-> "unavailable"` unless you configure an external tunnel/ingress with
-> `--usage-proxy-url` and `--usage-proxy-port`. Official batch runs that need
-> token/cost telemetry should use `--usage-tracking required` so the run fails
-> before the agent starts if the external endpoint is missing or unhealthy. The
-> fixed-port tunnel mode supports one rollout per BenchFlow process; use
-> `--concurrency 1`, or run multiple jobs with separate ports/tunnels. This
-> constraint is specific to metered external-tunnel mode; Daytona batches that do
-> not require usage telemetry can still run with higher concurrency. Local
-> sandboxes (e.g. `--sandbox docker`) populate usage telemetry without a tunnel.
+> **Usage telemetry:** Docker uses a host-side provider proxy; Daytona uses a
+> sandbox-local provider proxy because the agent runs on a remote host. Default
+> `--usage-tracking auto` records provider token/cost telemetry when the proxy can
+> be started. Use `--usage-tracking required` when missing telemetry should fail
+> the rollout, or `--usage-tracking off` for recovery runs that should leave
+> provider traffic untouched.
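
The three `--usage-tracking` policies can be read as a small decision table. The sketch below is a hedged illustration of that reading, not BenchFlow's implementation: `resolve_usage_source` and the labels `"proxy"` and `"disabled"` are hypothetical, while `"unavailable"` is the value the earlier docs used for missing telemetry.

```python
from __future__ import annotations

from typing import Literal

UsagePolicy = Literal["auto", "required", "off"]


def resolve_usage_source(policy: UsagePolicy, proxy_started: bool) -> str:
    """Map a policy and the proxy start-up outcome to a telemetry label."""
    if policy == "off":
        # Provider traffic is left untouched; no proxy is involved.
        return "disabled"
    if proxy_started:
        # Both auto and required record telemetry when the proxy is up.
        return "proxy"
    if policy == "required":
        # Missing telemetry fails the rollout instead of being recorded as absent.
        raise RuntimeError("usage telemetry required but the proxy did not start")
    # auto: the rollout continues without telemetry.
    return "unavailable"
```

Under this reading, `auto` and `required` differ only when the proxy cannot be started.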

---

68 changes: 68 additions & 0 deletions src/benchflow/agents/codex_config.py
@@ -0,0 +1,68 @@
"""Helpers for writing Codex ACP provider configuration."""

from __future__ import annotations

import json
from typing import Any

CODEX_CONFIG_ENV = "CODEX_CONFIG"
CODEX_MODEL_PROVIDER_ENV = "MODEL_PROVIDER"

_CODEX_PROVIDER_ID_PREFIX = "benchflow-"


def codex_provider_id(provider_name: str | None) -> str:
    safe_name = "".join(
        char if char.isalnum() or char in {"-", "_"} else "-"
        for char in (provider_name or "provider").lower()
    ).strip("-")
    return f"{_CODEX_PROVIDER_ID_PREFIX}{safe_name or 'provider'}"


def apply_codex_provider_config(
    agent_env: dict[str, str],
    *,
    base_url: str,
    model: str | None,
    provider_name: str,
    strict: bool = False,
) -> None:
    """Create or update Codex's model provider entry in ``agent_env``."""
    raw_config = agent_env.get(CODEX_CONFIG_ENV)
    if not raw_config:
        config: dict[str, Any] = {}
    else:
        try:
            config = json.loads(raw_config)
        except json.JSONDecodeError as exc:
            if strict:
                raise ValueError(f"{CODEX_CONFIG_ENV} must be valid JSON") from exc
            return
        if not isinstance(config, dict):
            if strict:
                raise ValueError(f"{CODEX_CONFIG_ENV} must decode to a JSON object")
            return

    provider_id = (
        agent_env.get(CODEX_MODEL_PROVIDER_ENV)
        or config.get("model_provider")
        or codex_provider_id(provider_name)
    )
    providers = config.get("model_providers")
    providers = {} if not isinstance(providers, dict) else dict(providers)
    provider = providers.get(provider_id)
    provider = dict(provider) if isinstance(provider, dict) else {}
    provider.setdefault("name", provider_name)
    provider["base_url"] = base_url
    provider.setdefault("env_key", "OPENAI_API_KEY")
    provider.setdefault("wire_api", "responses")
    provider.setdefault("supports_websockets", False)

    providers[provider_id] = provider
    config["model_providers"] = providers
    config["model_provider"] = provider_id
    if model:
        config["model"] = model

    agent_env[CODEX_MODEL_PROVIDER_ENV] = str(provider_id)
    agent_env[CODEX_CONFIG_ENV] = json.dumps(config, separators=(",", ":"))
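
To make the merge rule in `apply_codex_provider_config` easier to review, the sketch below distills only the per-provider step. `merge_provider` is a hypothetical helper written for this illustration and is not part of the diff; the field names and default values are copied from the code above, and the example URLs are placeholders.

```python
from __future__ import annotations

from typing import Any


def merge_provider(
    existing: dict[str, Any] | None, *, name: str, base_url: str
) -> dict[str, Any]:
    """Fill missing fields with defaults; always overwrite base_url."""
    # Copy so the caller's entry is never mutated.
    provider = dict(existing) if isinstance(existing, dict) else {}
    provider.setdefault("name", name)
    provider["base_url"] = base_url  # the only field that is always replaced
    provider.setdefault("env_key", "OPENAI_API_KEY")
    provider.setdefault("wire_api", "responses")
    provider.setdefault("supports_websockets", False)
    return provider


# A user-supplied entry that already pins wire_api and an older URL.
user_entry = {"wire_api": "chat", "base_url": "https://old.example/v1"}
merged = merge_provider(user_entry, name="azure", base_url="https://proxy.example/v1")
```

Here `merged` keeps the user's `wire_api` of `"chat"` but carries the new `base_url`, and `user_entry` itself is left unchanged because the helper works on a copy.
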
10 changes: 10 additions & 0 deletions src/benchflow/agents/credentials.py
@@ -102,6 +102,16 @@ async def write_credential_files(
await write_gemini_vertex_settings(env, agent, model, cred_home)

    # Agent credential files (e.g. codex auth.json)
    if (
        agent == "codex-acp"
        and "OPENAI_API_KEY" not in agent_env
        and agent_env.get("CODEX_AUTH_JSON")
    ):
        path = f"{cred_home}/.codex/auth.json"
        await upload_credential(env, path, agent_env["CODEX_AUTH_JSON"], owner=owner)
        logger.info("Agent credential file written: %s", path)
        return

    if agent_cfg and agent_cfg.credential_files:
        for cf in agent_cfg.credential_files:
            value = agent_env.get(cf.env_source)
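
The guard added to `write_credential_files` can be isolated as a pure predicate, which makes its precedence easier to check. This is a sketch for review purposes: `should_write_codex_auth_json` is a hypothetical name, and only the three conditions are taken from the hunk.

```python
from __future__ import annotations


def should_write_codex_auth_json(agent: str, agent_env: dict[str, str]) -> bool:
    """Mirror the hunk's condition for uploading ~/.codex/auth.json."""
    return (
        agent == "codex-acp"
        # Key presence is tested, not truthiness: an empty OPENAI_API_KEY
        # still suppresses the auth.json upload.
        and "OPENAI_API_KEY" not in agent_env
        # CODEX_AUTH_JSON must be set and non-empty.
        and bool(agent_env.get("CODEX_AUTH_JSON"))
    )
```

As written, any `OPENAI_API_KEY` entry in `agent_env` takes precedence over `CODEX_AUTH_JSON`, including an empty one.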