benchflow-ai · bingran-you · May 31, 2026 · May 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,7 @@
 
 ### Added
 
+- **Daytona usage telemetry by default** — Daytona runs now start a sandbox-local provider usage proxy so token/cost telemetry works without an external tunnel; use `--usage-tracking off` to bypass proxying when needed.
 - **Azure AI Foundry providers** — new `azure-foundry-openai/` and `azure-foundry-anthropic/` prefixes routing through Foundry's unified resource. Export `AZURE_API_KEY` plus `AZURE_API_ENDPOINT` (e.g. `https://<resource>.openai.azure.com/`); benchflow derives the resource name from the endpoint host, builds the per-surface base URL, and maps the key onto the agent-native auth env automatically. Missing/unrecognized endpoints and unsupported agent/provider protocol pairings fail fast with clear errors instead of falling through to the wrong endpoint.
 - **Azure Foundry auth guidance** — agent discovery output and docs now call out that provider-prefixed models can use provider-specific credentials instead of the agent's native/default API key.
 

diff --git a/docs/reference/cli.md b/docs/reference/cli.md
@@ -45,17 +45,15 @@ bench eval create \
   --concurrency 64 \
   --sandbox-setup-timeout 300
 
-# From remote repo with required token usage telemetry through an external tunnel
+# From remote repo with required token usage telemetry
 bench eval create \
   --source-repo benchflow-ai/skillsbench \
   --source-path tasks \
   --agent gemini \
   --model gemini-3.1-flash-lite-preview \
   --sandbox daytona \
   --usage-tracking required \
-  --usage-proxy-url https://your-tunnel.example.com \
-  --usage-proxy-port 18081 \
-  --concurrency 1 \
+  --concurrency 16 \
   --sandbox-setup-timeout 300
 
 # From local directory
@@ -98,9 +96,6 @@ bench eval create \
 | `--model` | Agent default | Model ID |
 | `--sandbox` | `docker` | Sandbox: docker, daytona, or modal |
 | `--usage-tracking` | `auto` | Token usage telemetry policy: `auto`, `required`, or `off` |
-| `--usage-proxy-url` | — | Externally reachable usage-proxy base URL for remote sandboxes such as Daytona |
-| `--usage-proxy-bind-host` | auto | Local interface for the usage proxy; external proxy mode defaults to `127.0.0.1` |
-| `--usage-proxy-port` | random | Fixed local port for externally tunneled usage tracking |
 | `--environment-manifest` | — | Path to an Environment-plane manifest (`environment.toml`); applied to every rollout in the batch |
 | `--concurrency` | `4` | Max concurrent tasks (batch mode only) |
 | `--agent-idle-timeout` | (built-in default) | Abort ACP prompts after this many idle seconds; `0` disables idle detection |
@@ -120,15 +115,10 @@ When mounting skills, the recommended docs default is
 [Architecture: skill loading](../architecture.md#skill-loading) for how
 `--skills-dir` is registered with each agent and how the nudge modes differ.
 
-For official Daytona batch runs that must report provider token/cost telemetry,
-use `--usage-tracking required` with a tunnel or ingress URL pointing at the
-fixed `--usage-proxy-port`. The fixed-port tunnel mode supports one rollout per
-BenchFlow process; use `--concurrency 1`, or run multiple jobs with separate
-ports/tunnels. This limit applies only to metered external-tunnel mode; Daytona
-batch runs that do not require usage telemetry can still use higher concurrency.
-Without an external URL, Daytona runs continue in `auto` mode and record
-`usage_source=unavailable` because the remote sandbox cannot reach a host-bound
-proxy.
+Daytona batch runs collect provider token/cost telemetry by default with a
+sandbox-local proxy. Use `--usage-tracking required` when missing telemetry
+should fail the rollout, or `--usage-tracking off` for recovery runs that should
+leave provider traffic untouched.
 
 `--source-env` is for external hosted environment hubs. The first supported
 runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity

diff --git a/docs/v05-e2e-testing-guide.md b/docs/v05-e2e-testing-guide.md
@@ -25,18 +25,12 @@ All commands below assume you are in the repo root.
 > The examples below use `weighted-gdp-calc` (fast, ~5 tool calls) as the
 > default lightweight task. Swap in any task name from `$TASKS/`.
 
-> **Usage telemetry caveat (Daytona / Modal):** Remote sandboxes run the agent
-> on a host that cannot reach BenchFlow's host-bound usage proxy. Default
-> `--usage-tracking auto` therefore records `agent_result.usage_source ==
-> "unavailable"` unless you configure an external tunnel/ingress with
-> `--usage-proxy-url` and `--usage-proxy-port`. Official batch runs that need
-> token/cost telemetry should use `--usage-tracking required` so the run fails
-> before the agent starts if the external endpoint is missing or unhealthy. The
-> fixed-port tunnel mode supports one rollout per BenchFlow process; use
-> `--concurrency 1`, or run multiple jobs with separate ports/tunnels. This
-> constraint is specific to metered external-tunnel mode; Daytona batches that do
-> not require usage telemetry can still run with higher concurrency. Local
-> sandboxes (e.g. `--sandbox docker`) populate usage telemetry without a tunnel.
+> **Usage telemetry:** Docker uses a host-side provider proxy; Daytona uses a
+> sandbox-local provider proxy because the agent runs on a remote host. Default
+> `--usage-tracking auto` records provider token/cost telemetry when the proxy can
+> be started. Use `--usage-tracking required` when missing telemetry should fail
+> the rollout, or `--usage-tracking off` for recovery runs that should leave
+> provider traffic untouched.
 
 ---
 

diff --git a/src/benchflow/agents/codex_config.py b/src/benchflow/agents/codex_config.py
@@ -0,0 +1,68 @@
+"""Helpers for writing Codex ACP provider configuration."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+CODEX_CONFIG_ENV = "CODEX_CONFIG"
+CODEX_MODEL_PROVIDER_ENV = "MODEL_PROVIDER"
+
+_CODEX_PROVIDER_ID_PREFIX = "benchflow-"
+
+
+def codex_provider_id(provider_name: str | None) -> str:
+    safe_name = "".join(
+        char if char.isalnum() or char in {"-", "_"} else "-"
+        for char in (provider_name or "provider").lower()
+    ).strip("-")
+    return f"{_CODEX_PROVIDER_ID_PREFIX}{safe_name or 'provider'}"
+
+
+def apply_codex_provider_config(
+    agent_env: dict[str, str],
+    *,
+    base_url: str,
+    model: str | None,
+    provider_name: str,
+    strict: bool = False,
+) -> None:
+    """Create or update Codex's model provider entry in ``agent_env``."""
+    raw_config = agent_env.get(CODEX_CONFIG_ENV)
+    if not raw_config:
+        config: dict[str, Any] = {}
+    else:
+        try:
+            config = json.loads(raw_config)
+        except json.JSONDecodeError as exc:
+            if strict:
+                raise ValueError(f"{CODEX_CONFIG_ENV} must be valid JSON") from exc
+            return
+    if not isinstance(config, dict):
+        if strict:
+            raise ValueError(f"{CODEX_CONFIG_ENV} must decode to a JSON object")
+        return
+
+    provider_id = (
+        agent_env.get(CODEX_MODEL_PROVIDER_ENV)
+        or config.get("model_provider")
+        or codex_provider_id(provider_name)
+    )
+    providers = config.get("model_providers")
+    providers = {} if not isinstance(providers, dict) else dict(providers)
+    provider = providers.get(provider_id)
+    provider = dict(provider) if isinstance(provider, dict) else {}
+    provider.setdefault("name", provider_name)
+    provider["base_url"] = base_url
+    provider.setdefault("env_key", "OPENAI_API_KEY")
+    provider.setdefault("wire_api", "responses")
+    provider.setdefault("supports_websockets", False)
+
+    providers[provider_id] = provider
+    config["model_providers"] = providers
+    config["model_provider"] = provider_id
+    if model:
+        config["model"] = model
+
+    agent_env[CODEX_MODEL_PROVIDER_ENV] = str(provider_id)
+    agent_env[CODEX_CONFIG_ENV] = json.dumps(config, separators=(",", ":"))
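
Reviewer note: the merge semantics of `apply_codex_provider_config` above are easy to misread, so here is a minimal, self-contained sketch of them. It is a condensed restatement, not the module itself: `merge_provider` is a hypothetical name, strict mode and the type checks are omitted, the fallback provider id skips the character sanitizing done by `codex_provider_id`, and the proxy URL and model name are made-up placeholders. The point it illustrates is that `base_url` is always overwritten, while the other provider keys are filled in only when absent.

```python
import json


def merge_provider(agent_env, *, base_url, provider_name, model=None):
    """Hypothetical condensed version of the merge step shown in the diff."""
    config = json.loads(agent_env.get("CODEX_CONFIG") or "{}")
    # Precedence: explicit env var, then existing config, then a derived id.
    provider_id = (
        agent_env.get("MODEL_PROVIDER")
        or config.get("model_provider")
        or f"benchflow-{provider_name.lower()}"  # simplified: no sanitizing
    )
    providers = dict(config.get("model_providers") or {})
    provider = dict(providers.get(provider_id) or {})
    provider.setdefault("name", provider_name)  # existing name is kept
    provider["base_url"] = base_url  # always replaced
    provider.setdefault("env_key", "OPENAI_API_KEY")
    provider.setdefault("wire_api", "responses")
    provider.setdefault("supports_websockets", False)
    providers[provider_id] = provider
    config["model_providers"] = providers
    config["model_provider"] = provider_id
    if model:
        config["model"] = model
    agent_env["MODEL_PROVIDER"] = provider_id
    agent_env["CODEX_CONFIG"] = json.dumps(config, separators=(",", ":"))


# A pre-existing user config with its own provider id and wire_api.
env = {
    "CODEX_CONFIG": json.dumps(
        {
            "model_provider": "custom",
            "model_providers": {
                "custom": {"wire_api": "chat", "base_url": "https://old.example.com"}
            },
        }
    )
}
merge_provider(
    env,
    base_url="http://127.0.0.1:9999/v1",  # placeholder proxy address
    provider_name="openai",
    model="hypothetical-model",
)
result = json.loads(env["CODEX_CONFIG"])
print(result["model_providers"]["custom"])
```

In this sketch the user's `wire_api` of `chat` survives, the provider id stays `custom`, and only `base_url` changes to the proxy address. That matches the `setdefault` versus direct-assignment split in the function under review.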
diff --git a/src/benchflow/agents/credentials.py b/src/benchflow/agents/credentials.py
@@ -102,6 +102,16 @@ async def write_credential_files(
     await write_gemini_vertex_settings(env, agent, model, cred_home)
 
     # Agent credential files (e.g. codex auth.json)
+    if (
+        agent == "codex-acp"
+        and "OPENAI_API_KEY" not in agent_env
+        and agent_env.get("CODEX_AUTH_JSON")
+    ):
+        path = f"{cred_home}/.codex/auth.json"
+        await upload_credential(env, path, agent_env["CODEX_AUTH_JSON"], owner=owner)
+        logger.info("Agent credential file written: %s", path)
+        return
+
     if agent_cfg and agent_cfg.credential_files:
         for cf in agent_cfg.credential_files:
             value = agent_env.get(cf.env_source)