
harness: interactive proxy robustness + analytics fixes#26

Open
Th0rgal wants to merge 62 commits into main from harness/interactive-robustness-fixes

Conversation


@Th0rgal Th0rgal commented Apr 22, 2026

Summary

Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.

Changes

Pre-build + self-heal (biggest impact)

  • prebuild_task_modules() runs lake build once per task up front for all Benchmark.* modules needed by the case, so inspect_lean_goals / try_tactic_at_hole / run_lean_check never race an incomplete build.
  • New environment_error failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke lake build for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.

Tool feedback

  • write_editable_proof now returns immediate warnings for: placeholder detected, hidden Proofs import, unfilled ?_ hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
  • HOLE_PATTERN tightened to (?<!\w)\?_(?!\w) so identifiers like ?_foo don't falsely trigger.
  • HIDDEN_PROOF_IMPORT_PATTERN now covers import/open/export of Benchmark.Cases.*.Proofs paths.
  • Interactive-mode TOOLS.md is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.
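The tightened hole pattern can be checked in isolation (a minimal sketch using the regex quoted above; the test strings are illustrative):

```python
import re

# The tightened pattern from the PR: `?_` counts as a hole only when it
# is not glued to word characters on either side.
HOLE_PATTERN = re.compile(r"(?<!\w)\?_(?!\w)")

print(bool(HOLE_PATTERN.search("exact ?_")))     # a real hole -> True
print(bool(HOLE_PATTERN.search("apply ?_foo")))  # identifier -> False
```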

HTTP resilience

  • send_chat_completion rewritten with retry/backoff + jitter (up to 6 attempts), honours Retry-After, and falls back through extra_body.fallback_models on persistent failure.
  • Silent length retry: on finish_reason: length, bump max_completion_tokens (up to 12000, 3 silent bumps) before counting the attempt as a failure.

Defaults

  • agents/interactive.json: max_completion_tokens 2000 → 4096, max_tool_calls 24 → 40.

Analytics

  • build_run_analysis now has a fallback path for interactive attempts (which don't populate trace) that derives stable_digest from candidate_file_contents, so distinct_candidate_count / candidate_change_count are correct.
  • Light temperature schedule on repeated non-environment failure classes.

Schema

  • schemas/agent-run.schema.json: add prebuild_reports property.

Verification

Ran the 3 quick tasks on both gpt and builtin/smart: 6/6 pass; analytics report distinct=1, change_count=1 (previously 0).

| task | gpt tool_calls | smart tool_calls |
| --- | --- | --- |
| ethereum/deposit_contract_minimal/deposit_count | 2 | 4 |
| damn_vulnerable_defi/side_entrance/deposit_sets_sender_credit | 2 | 8 |
| damn_vulnerable_defi/side_entrance/deposit_sets_pool_balance | 2 | 2 |

Test plan

  • Syntax check (python -m py_compile) on all edited modules
  • 3 quick tasks × 2 models all pass with correct analytics
  • Full benchmark sweep with gpt (in flight)
  • Address any bugbot / codex findings on this PR

🤖 Generated with Claude Code


Note

Medium Risk
Medium risk because it refactors interactive execution flow (tool semantics, retry/backoff, temperature/token budgeting) and Lean check behavior, which can affect benchmark determinism and provider interactions.

Overview
Makes the interactive agent loop more robust and informative: write_editable_proof now runs Lean automatically and returns structured warnings/hints; tool prompts are synthesized in interactive mode to match the actual callable function tools and discourage nonexistent shell commands.

Hardens Lean verification by prebuilding task modules upfront (prebuild_task_modules + prebuild_reports in results), classifying/auto-healing environment_error cases (missing .olean), filtering noisy Lean warnings, capping details size, and improving hole/tactic handling (?_ matching, term-position tactic wrapping, cached run_lean_check, cached search_public_defs, richer try_tactic_at_hole failures).

Improves provider resilience and run analytics: chat completions gain retry/backoff with Retry-After, optional fallback model chain via extra_body.fallback_models, smarter length-stop token bumping (length_retry_token_cap), a temperature escalation schedule on repeated failure classes, and corrected candidate-change/distinct-candidate counting for interactive runs (plus schema/default updates).

Reviewed by Cursor Bugbot for commit 0ff58e6.

Major improvements to the interactive agent runtime to reduce spurious
failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools
  don't race the Lake build (eliminates olean-not-found flake).
- Add environment_error failure class with one-shot self-heal: when Lean
  can't find a dependency olean, auto-run lake build and retry before
  reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole,
  theorem_statement_mismatch, hidden_proof_import, etc.) so the model
  sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs
  instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back
  through extra_body.fallback_models, and silently grow
  max_completion_tokens (up to 12000) on length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 ->
  4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden hidden
  import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive
  attempts by falling back to stable_digest of candidate_file_contents
  when trace is missing.
- Add light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with
correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 503a0654fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread harness/interactive_runtime.py Outdated
Comment on lines +645 to +646
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)
P1 Badge Cache only successful lake builds

prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.
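The fix this comment asks for can be sketched as follows (names are illustrative; `run_build` stands in for invoking `lake build` and returns the process exit code):

```python
from typing import Callable

_LAKE_BUILD_CACHE: set[str] = set()

def build_once(module_name: str, run_build: Callable[[str], int]) -> bool:
    # Cache the module only AFTER a successful build (exit code 0), so a
    # transient failure can be retried later in the same process instead
    # of being reported as "cached" forever.
    if module_name in _LAKE_BUILD_CACHE:
        return True
    if run_build(module_name) == 0:
        _LAKE_BUILD_CACHE.add(module_name)
        return True
    return False
```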


Comment thread harness/default_agent.py Outdated
Comment on lines +1016 to +1020
if candidate_hash not in distinct_candidate_hashes:
distinct_candidate_hashes.add(candidate_hash)
if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
if candidate_text != previous_candidate:
candidate_change_count += 1
P2 Badge Count candidate changes on every transition

In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.
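The corrected counting might look like this (a sketch; the function name is hypothetical and `stable_digest` is modelled here as plain SHA-256):

```python
import hashlib

def candidate_stats(candidates: list[str]) -> tuple[int, int]:
    # Distinct count uses a digest set; change count increments on EVERY
    # transition, including reverts (A -> B -> A is two changes).
    digests: set[str] = set()
    change_count = 0
    previous: str | None = None
    for text in candidates:
        digests.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
        if previous is not None and text != previous:
            change_count += 1
        previous = text
    return len(digests), change_count
```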


- P1: only cache *successful* lake builds so failed builds can be
  retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked
  modules as attempted before invoking lake, which prevented the
  self-heal path from ever recovering from a transient build failure in
  batch/suite runs that share a process.

- P2: count every candidate transition in the interactive analytics
  fallback, including reverts (A -> B -> A now counts 2 changes, not 1).
  Previously the increment was gated by "hash not yet seen", which
  undercounted non-monotonic candidate sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/interactive_runtime.py Outdated
Bugbot (Medium): `_attempt_lake_build` returned False when the module
was already cached as built, so after `prebuild_task_modules` populated
the cache, every subsequent environment_error auto-heal attempt
short-circuited without invoking lake and without retrying the check —
while the model was told "The harness is attempting to rebuild it".

The auto-heal path only fires when the runtime has observed a missing
.olean at check time, which means the cached 'success' entry is stale.
Always invoke lake build in this path and refresh the cache with the
latest result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/default_agent.py
Comment thread harness/interactive_runtime.py
- Low: merge the two back-to-back `if config.mode == "interactive":`
  blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as
  `inspect_lean_goals`, so the two tools agree on what counts as a
  hole (previously the tightened lookbehind in HOLE_PATTERN was not
  mirrored in `try_tactic_at_hole`, which could let it replace a match
  that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 950d7e3145

Comment thread harness/default_agent.py
Comment on lines +1066 to +1069
try:
return max(0.0, float(value))
except ValueError:
return None
P2 Badge Support date-form Retry-After values

_parse_retry_after only accepts numeric delta-seconds, but Retry-After is also commonly sent as an HTTP-date on 429/503 responses. In that case this function returns None, so _backoff_delay uses a short exponential retry instead of the server-requested wait, which can repeatedly hit rate limits and prolong failures for overloaded providers. Please parse HTTP-date values as well before computing retry delays.


Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a
provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the
function returned None and `_backoff_delay` fell back to a short
exponential retry instead of honouring the server-requested wait. That
can repeatedly hit rate limits and prolong failures.

Now parse both forms; a date in the past clamps to 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 3ddda642d4

Comment thread harness/default_agent.py Outdated
Comment on lines +1095 to +1096
if retry_after is not None:
return min(retry_after, 60.0)
P2 Badge Respect full Retry-After delay from provider

_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.


Comment thread harness/interactive_runtime.py Outdated
Comment on lines +707 to +708
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),
P2 Badge Narrow environment-error matching to real infra failures

The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.


Comment thread harness/default_agent.py
Codex P2 #1: narrow `environment_error` classification to real infra
failures. Previously any missing .olean matched, but that string is
also produced when the model imports a non-existent module. In the
latter case we want the normal stagnation/temperature logic to kick in
so the model can correct itself. Now only classify as environment_error
when the missing module is under `Benchmark.*` (our dependency tree
which should have been pre-built) or when lean itself is missing. The
generic "failed to load environment" substring no longer
short-circuits.

Codex P2 #2: raise `_backoff_delay` Retry-After ceiling from 60s to
600s. Providers routinely request several-minute waits on 429/503; the
old 60s clamp caused retries to fire while the rate limit was still in
force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: df1004c256

Comment thread harness/interactive_runtime.py Outdated
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")
P1 Badge Handle the "of module" form in the missing-olean regex

Lean commonly reports missing artifacts as object file '...olean' of module <Name> does not exist, but _MISSING_OLEAN_RE only matches the shorter '...olean' does not exist form. Fresh evidence in this revision is that the new regex at this line requires does not exist immediately after the path, so _missing_olean_module returns None for the standard message shape; that prevents classify_failure from returning environment_error and skips the new auto-heal lake build retry path in run_lean_check.


Bugbot (Medium): the temperature schedule condition was checked at the
top of every loop iteration, so once two consecutive same-class
failures triggered it, temperature was bumped on every subsequent turn
-- including pure search/write turns -- until the 0.7 cap was reached.
Track the history length we have already acted on and only bump once
per new failure entry.
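The "act once per new entry" scheme described here can be sketched as (names and the +0.1 step are illustrative; the 0.7 cap and two-consecutive-same-class trigger come from the PR description):

```python
def maybe_bump_temperature(
    temperature: float, history: list[str], acted_on: int
) -> tuple[float, int]:
    # Bump at most once per new failure entry: `acted_on` remembers how
    # much of the history we have already reacted to, so pure
    # search/write turns no longer re-trigger the escalation.
    if len(history) > acted_on:
        if len(history) >= 2 and history[-1] == history[-2]:
            temperature = min(round(temperature + 0.1, 2), 0.7)
        acted_on = len(history)
    return temperature, acted_on
```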

Codex (P1): Lean reports missing artifacts in two forms depending on
context:

  object file '...olean' does not exist
  object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the
more common "of module" diagnostic `_missing_olean_module` returned
None, `classify_failure` did not return `environment_error`, and the
auto-heal retry path was skipped. Accept arbitrary text between the
path and the "does not exist" tail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 894a285f92

Comment thread harness/interactive_runtime.py Outdated
Comment on lines +753 to +755
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
return "environment_error"
P2 Badge Verify missing Benchmark module exists before env classification

Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.


Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar`
where no such module exists would still produce an "object file ...olean
does not exist" diagnostic, and under the previous check that matched
the `Benchmark.*` prefix it would be misclassified as infrastructure
failure -- skipping stagnation/temperature correction instead of letting
the model discover and fix the bogus import.

Require the corresponding `.lean` source file to actually exist in the
tree before treating a missing .olean as environment_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 42efe75153

Comment thread harness/default_agent.py Outdated
Comment on lines +1195 to +1197
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
break
P2 Badge Retry alternate models on all transient HTTP statuses

send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.


Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
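Routing both paths on one shared set, as the commit describes, is a one-liner (the status codes are those listed in the commit message; 0 as "transport error" follows the review comment above):

```python
# The transient set shared by the retry loop and the fallback gate.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback_model(status: int) -> bool:
    # Status 0 denotes a transport-level failure with no HTTP response;
    # everything else falls through only for transient statuses.
    return status == 0 or status in RETRY_STATUS_CODES
```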
Comment thread harness/interactive_runtime.py
Comment thread harness/default_agent.py Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f8d43bbfae

Comment thread harness/default_agent.py Outdated
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
token_budget = min(int(token_budget * 1.5), 12000)
P1 Badge Cap length-retry token bumps at model-safe limit

The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.


Comment thread harness/default_agent.py
Comment thread harness/default_agent.py Outdated
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap length-retry token bump at config.max_completion_tokens.
   The previous hard-coded cap of 12000 could exceed the provider's per-
   response output-token limit (e.g. models with a 4096 hard cap), turning
   a recoverable truncation into a non-transient HTTP 400 hard failure.

2. (bugbot Low) Also reset token_budget when recovering from a length
   streak, not only when consecutive_length_stops >= 5. Previously a
   1-4 stop streak followed by recovery left the elevated budget in
   place for the rest of the run.

3. (bugbot Low) Skip environment_error entries when appending to
   failure_class_history. They are infra noise that would break the
   sliding window same-class check (e.g. [type_error, env_error,
   type_error]) even though the filter at the trigger site rejects
   them -- consistent with the PR's stated intent that env_error be
   invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 07c3880b8d

Comment thread harness/default_agent.py Outdated
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)
P2 Badge Make length-stop retries raise max_tokens

execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.


Comment thread harness/default_agent.py Outdated
claude added 2 commits April 22, 2026 12:18
Python 3.12 surfaces socket read timeouts during SSL as bare
TimeoutError, which slipped past the urllib.error.URLError handler
and killed the task with a traceback. Catch it alongside URLError and
apply the same backoff retry policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous commit capped the silent length-retry bump at
config.max_completion_tokens, which -- as bugbot and codex both
pointed out -- turned the feature into a no-op because token_budget
starts at exactly that value.

Add a config.extra_body.length_retry_token_cap knob (stripped from
the provider payload in send_chat_completion). Default stays at
max_completion_tokens so models with a hard per-response cap don't
get surprised by HTTP 400; interactive.json opts into a 12000
ceiling, which is the original intended behavior for gpt-class
models that accept larger single-response budgets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
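The resulting bump rule can be sketched as (function name illustrative; the 1.5x factor, 3-bump limit, and cap semantics follow the commits above, where `cap` is `length_retry_token_cap`, defaulting to `max_completion_tokens`):

```python
def bump_token_budget(token_budget: int, consecutive_length_stops: int, cap: int) -> int:
    # Silent length-retry: grow the budget by 1.5x per length stop, up
    # to the configured cap, for at most 3 silent bumps. With the
    # default cap equal to max_completion_tokens this is a no-op, which
    # keeps hard-capped providers safe from HTTP 400.
    if 1 <= consecutive_length_stops <= 3:
        return min(int(token_budget * 1.5), cap)
    return token_budget
```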

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c5f118c37b

Comment thread harness/default_agent.py
Comment on lines +1193 to +1196
fallback_models = [
str(item)
for item in (config.extra_body.get("fallback_models") or [])
if isinstance(item, str) and item.strip()
P2 Badge Validate fallback_models type before iterating

fallback_models is read from untyped extra_body and iterated directly, so a common single-string config (for example "fallback_models": "gpt-4o-mini") is treated as characters and produces model ids like "g", "p", "t" during failover. In transient-outage scenarios this makes the fallback chain fail even though a valid fallback was configured. Please require a list/tuple of strings (or normalize a single string to a one-element list) before building models_to_try.


Comment thread harness/default_agent.py Outdated
Comment on lines +1794 to +1796
length_retry_token_cap = int(
config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)
P2 Badge Handle invalid length_retry_token_cap without crashing

Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.

Useful? React with 👍 / 👎.

Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated
  character-by-character, producing fake model ids like "g", "p", "t"
  during failover. Normalize a bare string to a one-element list first.

- length_retry_token_cap=null / "12k" / nested object would crash the
  int() cast before the first model turn, aborting the task. Fall back
  to max_completion_tokens silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
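Both defensive parses described above might be sketched as (helper names are illustrative):

```python
def normalize_fallback_models(value: object) -> list[str]:
    # A bare string is shorthand for a one-element list; anything that
    # is not a list/tuple of strings is ignored rather than iterated
    # character-by-character.
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, (list, tuple)):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]

def parse_token_cap(value: object, default: int) -> int:
    # Defensive int() parse for operator-edited extra_body values:
    # null / "12k" / nested objects fall back instead of crashing.
    try:
        return int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default
```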
Comment thread harness/default_agent.py Outdated
Comment thread harness/default_agent.py Outdated
claude added 4 commits April 22, 2026 20:56
`_LEAN_BLOCK_HEADER_RE` and `_FP_LINE_COL_RE` hardcoded the source
prefix to `CandidateCheck.lean:`, which only fires for the stub file
`evaluate_candidate` writes when `check_goals=False` (i.e. the
`run_lean_check` / `write_editable_proof` code paths). The
`inspect_lean_goals` path runs Lean against the actual editable file
— paths like `Benchmark/Generated/Foo/Bar.lean:` — and so fell
through both filters untouched.

Corpus analysis of 83 interactive runs found:
  - 32/88 (36%) of inspect_lean_goals outputs still carried full
    `linter.unusedSimpArgs` warning blocks (verified on the largest
    leaker: 9.8 KB of noise inside a 24.5 KB output — a 40% reduction
    with the fix applied).
  - Stagnation detection's error-text fingerprint left the file path
    in place for inspect_lean_goals outputs, so the same underlying
    error surfaced via two different Lean invocations produced two
    different fingerprints and escaped the no-progress loop detector.

Broadening both regexes to `\S+\.lean:LINE:COL:` lets a single
strip + fingerprint pipeline cover every tool that reports Lean
diagnostics, regardless of which source file Lean named.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
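A sketch of the broadened prefix matching described above, assuming the commit's `\S+\.lean:LINE:COL:` shape; the production regexes (`_LEAN_BLOCK_HEADER_RE`, `_FP_LINE_COL_RE`) live in harness/default_agent.py and may differ in detail.

```python
import re

# Matches any Lean diagnostic location prefix, not just CandidateCheck.lean:
LEAN_LOC_RE = re.compile(r"\S+\.lean:(\d+):(\d+):")


def fingerprint(line):
    """Drop the file/line/col prefix so the same underlying error reported
    against the stub file and against the editable file fingerprints
    identically for the stagnation detector."""
    return LEAN_LOC_RE.sub("<loc>:", line)
```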
The `placeholder_detected` write-warning told the agent to replace
sorry/admit/axiom "before run_lean_check". That instruction is stale
on two counts:

  * commit b2bd74f (2026-04-22 18:29) folded run_lean_check into
    write_editable_proof, so by the time this warning is returned the
    Lean check has already run. There is no separate "before" moment.
  * commit f5ceebe (2026-04-22 20:01) removed every prompt-side hint
    telling the agent to call run_lean_check after write_editable_proof,
    aligning the tool surface with the folded architecture.

The current wording therefore gestures at a workflow that no longer
exists, undercutting the "run_lean_check is not needed after
write_editable_proof" description already in the tool surface.

Corpus analysis (83 interactive runs, 372 write_editable_proof calls)
found this warning fired 13 times, always alongside a Lean verdict
for the same proof — so the agent was seeing a "do X before Y" hint
at the exact moment it was also reading Y's output, a self-contradiction
that adds no new signal. Rewording to "Lean rejects these — replace
with a real tactic or a `?_` hole." keeps the actionable advice while
dropping the stale sequencing language and pointing the agent at the
legitimate `?_` probe workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`_extract_theorem_signature` used the non-greedy pattern `.*?(?::=)`
to find the proof marker, but that stops at the first `:=` token,
which in 74 of 88 (84%) benchmark task files lives inside a `let
x := ...` binding within the statement. The result is a truncated
signature that makes the theorem_statement_mismatch check a
false-negative for any edit past that let-binding.

All 88 task files use `:= by` as the actual proof start, so
anchoring on `:=\s*by\b` cleanly captures the full signature
without regressing the 14 let-free cases.
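The anchoring change can be sketched as below. `extract_signature` is a hypothetical stand-in for `_extract_theorem_signature`; note (as the reviewers point out further down) that a term-mode body such as `:= trivial` now yields `None`.

```python
import re


def extract_signature(source, short_name):
    """Capture everything between the theorem name and `:= by`, so a
    `let x := ...` binding inside the statement no longer truncates
    the signature at the first `:=`."""
    pattern = re.compile(
        rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
        re.DOTALL,
    )
    m = pattern.search(source)
    return m.group("signature").strip() if m else None
```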
…ublic_defs

Corpus analysis of 83 interactive runs found 20/29 failed tasks (69%)
hit "unknown identifier 'simp'" / '<tactic>' errors at least once. The
corpus-wide count of unknown-identifier incidents is dominated by
tactic names — 17 `simp`, 10 `simpa`, 6 `omega`, 4 `exact`,
4 `native_decide`, 3 `intro`, 3 `simp_all`, 2 `by_cases` — i.e.
~60% of cases where Lean emits `unknown identifier` are tactic-in-term-
position mistakes (`exact simp [...]`, `:= by_cases h`, etc.), not
missing definitions.

The existing `unknown_identifier` hint (both the per-failure hint
and the stagnation-escalation hint) sent the model to
`search_public_defs` to look for the name, which cannot help:
these are language keywords, not definitions. The model would burn
tool calls searching and loop.

Detect the tactic-name case and emit a position-specific hint instead
("wrap with `by`, or drop the `exact`/`refine` prefix"), and
skip the misleading "Use search_public_defs" hint in that branch.
Non-tactic identifiers (missing lemmas, typos like `Nat.div_mul_le`)
still get the original advice.
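The routing described above might look like this. The tactic set is a subset drawn from the corpus counts in the commit message; the real list and hint wording in the harness may differ.

```python
TACTIC_NAMES = {
    "simp", "simpa", "omega", "exact", "native_decide",
    "intro", "simp_all", "by_cases", "decide",
}


def unknown_identifier_hint(name):
    """Position-specific hint for tactic-in-term-position mistakes;
    everything else keeps the original search advice."""
    if name in TACTIC_NAMES:
        return (f"`{name}` is a tactic, not a definition — wrap the term in "
                "`by`, or drop the `exact`/`refine` prefix.")
    return "Use search_public_defs to look for the definition."
```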

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19bb7323c0


Comment on lines +811 to 812
rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
re.DOTALL,

P2 Badge Accept theorem bodies not written with := by

_extract_theorem_signature now only matches declarations where the theorem body starts with := by, so a candidate that keeps the exact same theorem statement but uses a valid term-style body (for example := <term>) produces None and is treated as a theorem_statement_mismatch in preflight_candidate. That turns semantically valid proofs into hard preflight failures and blocks the interactive loop from evaluating them with Lean.


claude added 5 commits April 22, 2026 21:35
…ic_defs

Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/
classified unknown-identifier incidents across failed tasks:
  51 tactic-in-term-position (addressed by 19bb732)
  34 "other" — dominated by camelCase locals: prevOwner (17), owner (6), ...
  12 snake_case lemma guesses
  12 qualified-name misses
   4 var_like already matching a narrower pattern

38 of these (the 34 "other" + 4 var_like) are local-variable-shaped names
(camelCase, no dot, no underscore, leading lowercase) affecting 6/29
failed tasks. The previous hint sent the agent to search_public_defs,
which cannot resolve local binders and produced budget-wasting loops.

New branch detects var-like names and instead points at inspect_lean_goals
plus re-reading the theorem signature for the real parameter names,
explicitly warning that search_public_defs is wrong for this shape.

Tactic-in-term-position path is preserved (checked first), and the
fallback "Check imports" hint still fires for snake/qualified misses
that match neither category.
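The "local-variable-shaped" check can be sketched as below; the exact pattern in the harness is an assumption here (leading lowercase, camelCase allowed, no dot or underscore). Since tactic names like `simp` also match this shape, the tactic branch must run first, as the commit notes.

```python
import re

# camelCase or plain lowercase, no namespace dot, no snake_case underscore
VAR_LIKE_RE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def is_var_like(name):
    """True for names shaped like local binders (prevOwner, owner, hKey)."""
    return bool(VAR_LIKE_RE.match(name))
```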
…haped

Corpus analysis of the 29 failed runs under
results/agent_runs/custom/interactive-proxy/ shows 5 tasks (17%)
stagnating on guesses that only exist in Mathlib:

  ethereum.../full_deposit_preserves_partial_gap: sub_eq_sub_right,
                                                  add_sub_add_right_eq_sub
  openzeppelin.../preview_deposit_rounds_down:    Nat.cast_mk, Nat.div_def,
                                                  Nat.div_mul_le, Nat.le_div_mul
  paladin.../weth_claimed_plus_allocated:         add_assoc, add_left_comm,
                                                  sub_eq_add_neg, add_comm
  safe.../in_list_reachable:                      not_eq
  zama.../transfer_conservation:                  lt_of_add_lt_add_right,
                                                  Nat.not_ge.mp

This workspace has no Mathlib dependency — only core Lean 4, Batteries,
and the task's own Benchmark.* modules. The prior fallback hint
("Check imports. Standard names: Nat.lt_of_not_ge, Nat.not_le_of_lt")
was misleading because the agent kept searching for imports that cannot
be added. Worse, it cited a `Nat.*` name as a "standard" example while
the agent was already being burned by the Nat.* namespace.

Detect Mathlib-shape names (arithmetic-prefix `add_*`/`sub_*`/`le_*`/...,
common exact names like `add_assoc`/`add_comm`, and any `Nat.*`-qualified
guess) and emit a hint that: (a) names the absent dependency explicitly,
(b) redirects to `omega`/`ring`/`simp arith` tactics, (c) reminds the
agent that search_public_defs takes a KEYWORD, not a guessed lemma name.

Precedence order preserved: tactic-in-term (19bb732) ranks highest,
then var_like (b24ef3f), then decide_True (legacy), then Mathlib,
then generic search_public_defs. The misleading "Check imports..."
trailer is suppressed when the Mathlib branch fires.
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
6 of 16 `type_mismatch` incidents (38% of this class), spanning 5 of 29
failed tasks (17%), hit the same recurring shape:

  has type        ¬x = 0 : Prop
  but is expected ¬x.val = 0 : Prop

The `.val` projection on Uint256 / Address / Nat is missing on one side.
Affected failed tasks:
  openzeppelin.../preview_deposit_rounds_down
  openzeppelin.../positive_deposit_mints_positive_shares_under_rate_bound
  lido.../ceildiv_sandwich
  safe.../remove_owner_in_list_reachable
  safe.../add_owner_is_owner_correctness

The agent repeatedly tried `exact h` against the asymmetric goal instead
of letting simp bridge the coercion. The prior hint
("Unfold definitions to align types") did not name the structural issue,
so the agent cycled on `exact`-family tactics until the tool budget ran
out.

New hint fires only when the "has type … but is expected to have type …"
substrings differ by `.val` on exactly one side, and points the agent at:
  - `by simpa using h` / `by simp_all` (simp bridges the projection)
  - `by omega` once `.val` is exposed on both sides for arithmetic goals
  - the underlying injectivity lemma via search_public_defs for negated-
    equality mismatches

Precedence is preserved: the existing `decide` branch runs first (when
both match), and the generic "Unfold definitions" trailer is still
appended as a fallback.
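A sketch of the one-sided `.val` detector, assuming the two halves of Lean's "has type … but is expected to have type …" pair have already been extracted; the production parsing is more involved.

```python
import re


def val_asymmetry(has_type, expected_type):
    """True when the two sides of a type mismatch differ only by a
    `.val` projection on one side (e.g. ¬x = 0 vs ¬x.val = 0)."""
    def strip_val(s):
        return re.sub(r"\.val\b", "", s)

    return has_type != expected_type and strip_val(has_type) == strip_val(expected_type)
```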
Commits b24ef3f and 31ff5e8 taught `_build_check_hints` to distinguish
`unknown identifier '<var>'` (local binder) and `unknown identifier
'<mathlib-name>'` (unavailable dependency) from generic definition
misses. However `_build_escalation_hint` — which fires once the agent
has repeated the same failure_class multiple times — still emitted only
"Stop guessing identifier names. Use the search_public_defs tool" for
every non-tactic case.

That meant an agent stagnating on `prevOwner` / `add_comm` / `Nat.div_mul_le`
kept being told to search_public_defs even though:
  * local binders will never be found by search_public_defs, and
  * Mathlib lemmas do not exist in this workspace at all.

Corpus evidence (same 83-run dataset):
  6/29 failed tasks surface a var-like unknown identifier at escalation time
  5/29 failed tasks surface a Mathlib-shaped unknown identifier at escalation

Route the escalation hint the same way as the check hint:
  tactic   -> wrap in `by` (existing)
  var_like -> inspect_lean_goals to see real binders
  mathlib  -> omega / ring / simp arith; search_public_defs takes a keyword
  else     -> original generic "stop guessing" fallback

No behavior change when the details string is empty (falls through to
the generic fallback as before).
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
21 parse-error events across 14 of 29 failed tasks (48%) surface one of
  - "unexpected token '…'"         (11 incidents; `using`/`at`/`:`/`;`/`|`/…)
  - "unexpected identifier"         (5 incidents)
  - "expected '{' or indented tactic sequence"  (5 incidents)

Most of these incidents (19/21) coexist with a classifiable semantic error
and so get routed to `unknown_identifier`, `unsolved_goals`, etc. But 2
failed-task incidents collapse to `failure_class=other` because the parse
error is the ONLY signal the checker emits:
  lido.../locked_funds_solvency    unexpected token ';'
  zama.../burn_decreases_supply    unexpected token '|' (first failure
                                    turn — lost the whole budget starting
                                    from a malformed signature)

Extend `classify_failure` with a `parse_error` bucket and a dedicated hint
that: (a) names the likely causes (tactic-in-term, missing `by`, stray
`;`/`|`/`using`, bullet-indentation mismatch), (b) tells the agent to
re-read the editable file via read_public_file to see character positions,
(c) advises a clean `:= by <tactics>` rewrite over token-by-token patching.

Placement: `parse_error` comes LAST in classify_failure before "other", so
it never steals classification from more specific errors. The existing
cross-cutting "Syntax error. Ensure the theorem body uses `:= by`…"
catch-all still fires as a secondary hint.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 0c6c7adf96

Comment thread harness/default_agent.py
Comment on lines +2011 to 2012
if tool_name in ("run_lean_check", "write_editable_proof") and result.get("failure_mode") == "lean_check_failed":
saw_lean_failure = True

P2 Badge Track preflight tool failures in temperature history

The temperature schedule only records tool failures when failure_mode == "lean_check_failed", but write_editable_proof now commonly returns preflight failures (for example placeholder_detected or hidden_case_import_detected) with different failure modes. Those repeated failures never get added to failure_class_history, so the repeated-class bump at the top of execute_interactive_agent_task never activates and interactive runs can stay stuck at deterministic temperature until attempt exhaustion.


claude added 8 commits April 22, 2026 21:52
…dentifier

The interactive-mode user prompt previously instructed:

  "For unknown_identifier errors: use search_public_defs to find correct
   names."

Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/
shows that this blanket advice misleads the agent on 24 of 29 failed tasks
(83%). Those tasks hit at least one `unknown identifier '…'` where
search_public_defs cannot possibly help because the missing name is one of:

  - a Lean TACTIC written in term position (simp, simpa, omega,
    native_decide, by_cases, …) — search_public_defs searches
    *definitions*, not syntactic keywords. Fix is `by <tactic>`.
  - a LOCAL BINDER (prevOwner, owner, hKey, …) that is simply
    out-of-scope at the use site — binders are not public defs.
  - a MATHLIB LEMMA (add_comm, add_sub_*, Nat.div_mul_*, …). Mathlib
    is not a dependency in this workspace.

The downstream repair_hints (commits 19bb732, b24ef3f, 31ff5e8) already
route each shape correctly, but the initial instruction primes the agent
to call search_public_defs first and overwrites any subsequent hint.

Rewrite the bullet to:
  - tell the agent to read repair_hints before searching,
  - enumerate the three "wrong-shape" cases and their remedies, and
  - reserve search_public_defs for genuine project-defined names.

Passed runs hit this scenario only 6/54 times (11%), so the clearer
instruction should not distract correct agents while sharply reducing
the misled population in failed runs.
Lean phrases `cases h` on a non-inductive major premise (implication
`A → B`, function type, unreduced equality, Bool-valued `==`) as
`tactic 'cases' failed, major premise type is not an inductive type`
— distinct from the `constructor failed, not an inductive datatype`
shape already handled by `constructor_failed`.

Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/:
the pattern occurs 22 times in one failed task
(safe__owner_manager_reach__setup_owners_acyclicity.json), where the
agent repeatedly retried `cases h` on an `hkey = SENTINEL → False`
implication because the generic \"other\" bucket gave no actionable
hint — a 22-cycle stagnation loop with no escape.

Split into its own `cases_failed` class covering both `cases` and
`induction` variants, with a hint that names the four actual remedies:
apply the implication with an argument, `by_cases` for decidable Props,
`absurd`/`.elim` for contradiction, or `Bool.ne_iff`/`beq_iff_eq` to
rewrite Bool equalities before case-splitting.
When Lean reports unsolved goals with a `case <label>` marker, the
agent has ALREADY case-split successfully and only one branch is open.
The previous hint set unconditionally appended "Try restructuring:
`by_cases h : condition …`", which told the agent to undo its own
working split and start over.

Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/:
59 of 127 unsolved_goals incidents (46%) across 22 tasks carry a
`case <label>` marker — `case pos`, `case neg`, `case left`,
`case right`, etc. 6 of the 8 unsolved_goals final failures had one.

Now: when a case label is present, (a) name the open branch(es) back
to the agent so it stops re-inspecting, (b) point at the
branch-specific hypotheses that simp_all / omega / absurd can close
the branch with, and (c) SUPPRESS the "restructure with by_cases"
hint so the agent doesn't undo its progress. The fresh-split hint
still fires when no case label is present.
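The case-label scan described above can be sketched as follows, assuming labels appear at line starts in the unsolved-goals output (`case pos`, `case neg`, …):

```python
import re

CASE_LABEL_RE = re.compile(r"^case\s+(\w+)", re.MULTILINE)


def open_branches(lean_output):
    """Names of still-open branches after a successful case split;
    empty means no split happened and the fresh-split hint may fire."""
    return CASE_LABEL_RE.findall(lean_output)
```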
The escalation hint for unsolved_goals / simp_no_progress / rfl_failed /
unfold_failed was unconditionally advising the stuck agent to "Start
with `unfold <spec_name>` … then `by_cases` on each conditional
branch". When the agent is stuck with a `case <label>` already in the
goal output, that template tells it to TEAR DOWN a working case-split
and start over — pure regression.

Mirror the case-label detection from _build_check_hints (commit 77cb481)
into _build_escalation_hint. When `case <label>` markers are present,
escalate with branch-closing advice instead of a fresh-split template:
(1) inspect_lean_goals at a `?_` in the branch to read `h✝`, (2)
`simp_all` with the contract-simp-set to discharge contradictory
hypotheses, (3) `exact absurd` when hypotheses literally disagree,
(4) `omega` after `simp only` exposes `.val` forms.

Corpus evidence from results/agent_runs/custom/interactive-proxy/: 59
of 127 unsolved_goals incidents (46%) across 22 tasks carry a case
label; 6 of 8 unsolved_goals final-failures did. Those are exactly the
runs most likely to hit the stagnation threshold and trigger this
escalation.
The public impl/spec file set is static for a session, so the same
(query, limit) pair always returns the same matches. Cache the first
response and short-circuit repeat calls with a `cached: true` marker
plus a note telling the agent to try a different query or a different
tool.

Corpus evidence from results/agent_runs/custom/interactive-proxy/:
- Failed runs averaged 41.9 search_public_defs calls vs 1.5 on passing
  runs (28x more).
- Of extracted queries across failed runs, 94% were byte-identical
  re-queries: e.g. `"removeOwner_ownerListInvariant"` submitted 26
  times in one run, `"isChain"` 25 times, `"ceilDiv"` 24 times,
  `"acyclic"` 25 times. The agent was stuck in a tight re-query loop
  with no signal that the library wouldn't suddenly start matching.

Mirrors the existing run_lean_check cache pattern (see _last_eval_cache
in __init__) — same mechanism: deepcopy the first result, stamp it
with `cached: true` + a note on the next identical call, and stay
behaviourally equivalent on distinct queries.
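The memoization pattern the commit describes, sketched with a stand-in `search_impl`; the real implementation hangs the cache off the agent session next to `_last_eval_cache`.

```python
import copy


class SearchCache:
    """Memoize (query, limit) pairs; repeat calls get a deep copy
    stamped with `cached: true` plus a redirect note."""

    def __init__(self, search_impl):
        self._search = search_impl
        self._cache = {}

    def search(self, query, limit):
        key = (query, limit)
        if key in self._cache:
            result = copy.deepcopy(self._cache[key])
            result["cached"] = True
            result["note"] = ("Identical query already answered — try a "
                              "different query or a different tool.")
            return result
        result = self._search(query, limit)
        self._cache[key] = copy.deepcopy(result)
        return result
```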
Corpus analysis of 29 failed interactive runs found 38 `omega_failed`
incidents carrying 96 opaque-coercion occurrences in their
counterexample `where:` sections (mul: 45, add: 34, sub: 17). When the
goal mentions `↑(mul a b).val`, `↑(add a b).val`, or `↑(sub a b).val`,
omega treats the whole thing as an opaque Nat and cannot see the
underlying linear arithmetic — even though the operation reduces to
`a.val * b.val` / `a.val + b.val` / `a.val - b.val` whenever the
spec's no-overflow hypothesis holds.

The canonical conversion lemmas already exist in Verity:

  Uint256.mul_eq_of_lt (h : a.val * b.val < modulus) :
    (a * b).val = a.val * b.val
  Uint256.add_eq_of_lt (h : a.val + b.val < modulus) :
    (a + b).val = a.val + b.val
  Uint256.sub_eq_of_le (h : b.val ≤ a.val) :
    (a - b).val = a.val - b.val

…but ZERO proofs in the 83-run corpus — failed OR passed — used them.
Agents searched for related terms ("val_mul", "Uint256 mul add sub ge
theorem lemma val", "div_mul_le") and never landed on the right names,
so the existing generic "introduce helper lemmas that bound the
product" nonlinear hint did not close the loop.

Detect `↑(mul|add|sub …)` in the omega_failed details and emit a
targeted hint that names the specific lemma(s) applicable to the ops
present, spells out the hypothesis shape (`a.val + b.val < modulus`,
typically the spec's `hNoOverflow`), and shows the `rw` call that
turns the goal into a plain Nat arithmetic fact omega can close.

Coverage: 11 / 17 current-classifier omega_failed incidents across 5
distinct failed tasks (lido vaulthub_locked ×3, openzeppelin
erc4626_virtual_offset_deposit ×2) now receive the targeted hint
instead of the generic "out of omega's reach" advice.

Strictly additive: the pre-existing division/modulus and variable-
product nonlinear hints still fire when their own patterns match.
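The detection side can be sketched as below. The lemma names are the Verity conversion lemmas quoted in the commit; the `↑(op …)` regex is an assumed approximation of the production pattern.

```python
import re

# Opaque coercion around an unreduced Uint256 op in the goal text
OPAQUE_OP_RE = re.compile(r"↑\((mul|add|sub)\b")

LEMMA_FOR_OP = {
    "mul": "Uint256.mul_eq_of_lt",
    "add": "Uint256.add_eq_of_lt",
    "sub": "Uint256.sub_eq_of_le",
}


def coercion_lemma_hints(details):
    """Conversion lemmas applicable to the ops present in the details."""
    ops = sorted(set(OPAQUE_OP_RE.findall(details)))
    return [LEMMA_FOR_OP[op] for op in ops]
```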
…gue 'reduce to equalities'

Corpus analysis of 29 failed interactive runs found 3 distinct tasks
(damn_vulnerable_defi side_entrance exploit_trace_drains_pool, kleros
sortition_trees root_minus_left_equals_right_subtree, safe
owner_manager_reach add_owner_owner_list_invariant) surfacing
`expected type must not contain free variables` with 19 total
occurrences across attempts. Lean's own error text tells the user
"Use the '+revert' option to automatically cleanup and revert free
variables" — yet the prior hint ("Reduce to concrete equalities
before decide/native_decide") never mentioned `revert` at all and
pointed agents away from the exact remedy Lean recommends.

The trigger across all three tasks is the same shape: `decide` /
`native_decide` / `cases <x>` / `induction <x>` run on a goal that
still mentions local hypotheses (`hLow`, `hHigh`, `nodeIndex`,
`hBorrow`) or pattern-bound names (`val✝`, `isLt✝`) that the tactic
can't close over. Replace the weak hint with a targeted two-option
remedy: (a) `revert` every free name in the goal (plus the Lean 4
`decide +revert` / `native_decide +revert` shortcut Lean itself
suggests) and (b) swap `decide`/`native_decide` for `omega` /
`simp_all` / `exact`, which consult the local context directly.
Corpus analysis of 83 agent runs at results/agent_runs/custom/
interactive-proxy (29 failed, 54 passed) found 11 failed runs (38% of
failures) ending with an unsolved_goals error whose goal still carried
the unfolded monadic trace — `ContractResult.success`/`.revert`,
`Contract.run`, or a wrapper like `Core.Address.ofNat ((match ...))`
around a nested match over `getMappingAddr`/storage reads. Cross-family:
safe/owner_manager_reach (6), zama/erc7984 (2), paladin_votes (1),
kleros/sortition_trees (1). Zero of 54 passed runs trip the same marker,
so the signal is a clean failure predictor.

In every case the agent kept piling more helpers (`ContractResult.
success`, `.revert`, `.snd`, `Contract.run`, …) onto its `simp` list
without closing the goal, because the leftover `if <cond>` arms in the
trace test PROPOSITIONAL equality while the spec's hypotheses are in
BEq form (`(x != zeroAddress) = true`). The pre-existing generic
if/match hint never mentioned bridging BEq→Prop nor `split_ifs` on the
unreduced arms, so agents escalated with "add one more definition to
simp" forever.

Add a trace-specific hint that (a) tells the agent to stop adding to
the simp list, (b) proposes `split_ifs`/`split` to force case analysis
on each remaining if-branch, and (c) shows the one-line
`have hNZ : x ≠ zeroAddress := by simpa using hBEq` preconversion so
that a subsequent `simp_all` can close the whole trace in one step.
The hint fires alongside the existing case-label hint when both are
present (the case-label direction remains "don't re-split"; this hint
adds "here is how to close the remaining trace inside that branch").

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 6 total unresolved issues (including 5 from previous reviews).

Reviewed by Cursor Bugbot for commit 1a5b969.

  short_name = self.paths.theorem_name.rsplit(".", 1)[-1]
  pattern = re.compile(
-     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?)(?::=)",
+     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",

Signature check falsely triggers on term-mode proofs

Medium Severity

_extract_theorem_signature now requires := by instead of just :=. If the original editable file uses := by (common), expected_theorem_signature is set. But if the model submits a valid term-mode proof (:= fun ... => or := Eq.mpr ...), the extraction returns None, and None != expected_signature fires a false theorem_statement_mismatch warning — misleading the model into thinking it altered the theorem statement when only the proof style changed.

Additional Locations (1)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1a5b9691e4

Comment thread harness/default_agent.py
Comment on lines +1218 to +1221
fallback_models = [
    str(item)
    for item in raw_fallback
    if isinstance(item, str) and item.strip()

P2 Badge Trim fallback model ids before sending requests

The fallback-model normalization keeps the original string even when it only passed the item.strip() guard, so leading/trailing whitespace is preserved in outbound model IDs. A config such as "fallback_models": [" gpt-4o-mini"] will be treated as a distinct (invalid) model name and fail failover even though the intended model is available. Store item.strip() in fallback_models to avoid avoidable fallback failures.


Comment thread harness/default_agent.py
Comment on lines +1238 to +1239
if exc.status not in RETRY_STATUS_CODES and exc.status != 0:
    break

P2 Badge Continue fallback chain after non-transient model errors

The fallback loop stops on the first non-transient status, which means one bad fallback entry (for example, a 404 model-not-found on the first fallback) prevents trying any later configured models. This makes the fallback chain brittle in real runs where operators often keep multiple backups. Instead of breaking immediately on every non-transient error, continue to subsequent fallback models for model-specific failures.


claude added 9 commits April 22, 2026 23:10
Corpus analysis of 29 failed interactive runs:

- 3 final-eval failures ended at `synthesis_failed` with a raw
  `(X).run s).snd` monadic trace still in the goal at the hole
  (`safe/setupOwners_ownerListInvariant`,
   `safe/removeOwner_isOwnerCorrectness`,
   `zama/transfer_sufficient`). The agents had written `exact ?_`
  without ever unfolding `Contract.run`, so the existing generic
  synthesis hint — which says "use `?_` with inspect_lean_goals" —
  just loops them back to the same un-reducible shape.

- 1 additional final-eval failure
  (`safe/swap_owner_ownerListInvariant`) classifies as
  `unsolved_goals` and carries the same `.run X).snd` shape, but
  the existing unsolved_goals monadic-trace detection only matches
  the literal `Contract.run` / `ContractResult.*` strings, so it
  missed this case.

Both branches now detect the raw-goal pattern via a regex over
goal lines. False-positive rate is clean: 0 of 54 passed runs
match on final details, 0 of 28 passed-run intermediate check
outputs match the new `synthesis_failed + goal-trace` signal
(vs 15 hits across failed runs). The new hint tells the agent
the hole is unreachable until the contract function is unfolded
and gives the concrete `simp [X, Contract.run, Verity.bind, ...]`
incantation, rather than recommending another round of
`inspect_lean_goals` which would reveal the same un-reduced goal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus analysis of 83 interactive-proxy runs:

- try_tactic_at_hole succeeded 0 of 76 times across the entire corpus
  (passed runs: 0/6 successes on 4 runs that used it; failed runs:
  0/70 successes across 25 runs).
- Passed runs never exceeded a 2-streak of failures — they called the
  tool 1-2 times, got nothing, and moved on.
- Failed runs hit a ≥3-streak in 14 of 29 tasks (48%), with 4 runs
  stacking 5-7 consecutive speculative tactics on the same hole.

The tool keeps re-running Lean each call and returns the usual
class-based repair hints, but there is no signal telling the agent
that the tool itself is the wrong instrument for the remaining
work. Adding a per-session counter and injecting a pivot hint at
streak ≥ 3 costs one extra `int` of state and zero extra Lean
invocations. The hint explicitly names the shapes of goal that
single-tactic closure cannot discharge (BEq↔Prop bridging, residual
if/match case analysis, monadic-trace unfolding, multi-step
arithmetic) and steers the model to write_editable_proof +
inspect_lean_goals instead.

Threshold of 3 is chosen to have zero false positives on passed
runs (max observed streak = 2).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
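The "one extra `int` of state" mechanism reads roughly as below; the hint text is paraphrased from the commit, and the threshold of 3 matches the corpus analysis (max observed passing-run streak = 2).

```python
PIVOT_THRESHOLD = 3
PIVOT_HINT = ("try_tactic_at_hole keeps failing on this hole — the remaining "
              "work likely needs multiple steps. Switch to write_editable_proof "
              "plus inspect_lean_goals.")


class TacticStreak:
    """Per-session counter of consecutive try_tactic_at_hole failures."""

    def __init__(self):
        self.streak = 0

    def record(self, succeeded):
        """Return a pivot hint once the failure streak reaches the threshold."""
        self.streak = 0 if succeeded else self.streak + 1
        if self.streak >= PIVOT_THRESHOLD:
            return PIVOT_HINT
        return None
```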
The agent occasionally writes `(private) lemma foo ... := by ...` for
helper lemmas, which is valid Mathlib syntax but not Lean 4 core. Without
Mathlib, Lean reports `unexpected identifier; expected 'abbrev', ..., or
'theorem'` at the `lemma` token — a one-token rename to `theorem` is
all that's needed. Corpus analysis of 83 interactive runs: 3 of 29 failed
tasks (10%) wrote `lemma` helpers at some point and never recovered,
because the generic parse-error hint talks about tactic-in-term-position,
missing `by`, stray tokens, and branch indentation — none of which point
at keyword choice. Detect the exact "expected 'abbrev' ... 'theorem'"
keyword-list fingerprint (unique to the top-level-command parse shape)
and surface a targeted rename hint before the generic one.
…lasses

Corpus analysis of 83 interactive-proxy run logs shows the Uint256/Address
.val coercion asymmetry pattern appears in 14 of 29 failed tasks (48%), but
the existing hint only fired when failure_class == "type_mismatch". In 12
of those 14 cases, Lean reports a cascading unsolved_goals or
unknown_identifier from the same simp call before the mismatch is surfaced,
so the hint was suppressed.

Lift the regex-based asymmetry detector out of the per-class branch and
into the cross-class pattern section so it fires whenever the
"has type ... but is expected to have type ..." pair disagrees on a .val
projection, regardless of which error Lean listed first. Dedup against
existing hints via substring check.
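The lifted detector can be sketched as below; the regex shape and hint text are assumptions about Lean's "has type ... but is expected to have type ..." diagnostics, not the harness's exact code.

```python
import re

# Cross-class .val asymmetry detector: fires whenever exactly one side of
# the type pair carries a .val projection, regardless of failure class.
_TYPE_PAIR_RE = re.compile(
    r"has type\s+(?P<got>.+?)\s+but is expected to have type\s+(?P<want>.+?)(?:\n\n|$)",
    re.DOTALL,
)

def val_asymmetry_hint(details: str, hints: list):
    m = _TYPE_PAIR_RE.search(details)
    if not m:
        return
    got_has_val = ".val" in m.group("got")
    want_has_val = ".val" in m.group("want")
    if got_has_val == want_has_val:
        return  # symmetric: not the coercion-asymmetry shape
    hint = ("The two sides disagree on a `.val` projection "
            "(Uint256/Address coercion asymmetry). Rewrite with the "
            "relevant `.val` lemmas or `show` the goal with the "
            "projection made explicit.")
    # Dedup against existing hints via substring check.
    if not any(hint in h or h in hint for h in hints):
        hints.append(hint)
```

Because it scans `details` directly, it still fires when a cascading unsolved_goals or unknown_identifier error is listed first.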

Corpus analysis of 83 interactive-proxy runs: 12 of 29 failed tasks (41%)
end with `exact ?_` still in the submitted proof. Every one of those runs
re-submitted a `?_`-containing proof 2–9 times after the first rejection.

The "UNFILLED HOLE IN SUBMITTED PROOF" warning already existed, but was
inserted into `hints` BEFORE `_filter_seen_hints`, so dedup suppressed it
on the 2nd–Nth resubmission. The agent then got no feedback tying its
specific, detectable mistake (still-unfilled hole in the exact proof text
just submitted) to the failure class, and kept re-submitting `?_`.

Move the hole-warning insertion AFTER the dedup filter so it fires on
every submission that still contains `?_`. The warning is keyed to the
concrete proof-text state, not to the abstract hint corpus, so it is
correctly exempt from dedup — it reports something specifically about the
submission the agent just sent, not generic class advice.
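The ordering fix can be illustrated with this minimal sketch, where `build_hints` and the `seen` set stand in for the harness's real `_filter_seen_hints` pipeline; only `HOLE_PATTERN` matches the tightened pattern described in the PR summary.

```python
import re

# (?<!\w)\?_(?!\w): matches a bare ?_ hole but not identifiers like ?_foo.
HOLE_PATTERN = re.compile(r"(?<!\w)\?_(?!\w)")

def build_hints(proof_text: str, class_hints: list, seen: set) -> list:
    # Generic class-based hints go through dedup first ...
    hints = [h for h in class_hints if h not in seen]
    seen.update(hints)
    # ... and the state-conditional hole warning is inserted AFTER the
    # filter, so it repeats on every submission that still contains ?_.
    if HOLE_PATTERN.search(proof_text):
        hints.insert(0, "UNFILLED HOLE IN SUBMITTED PROOF: `?_` is still present.")
    return hints
```

On a second submission with the same hole, the generic hint is suppressed by `seen` but the warning fires again, which is exactly the repeat behavior the fix requires.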
Corpus of 29 failed interactive runs: 7 terminate on
`don't know how to synthesize placeholder`. The existing conditional
hint inside `synthesis_failed` only fires when the goal still contains
a `(X).run _).snd` monadic trace (~3 of 7). The remaining ~4 — goals
with `s.storage` arithmetic, conditional branches, or list-predicate
witnesses (ethereum/full_deposit_preserves_partial_gap,
lido/shares_conversion_monotone, safe/setup_owners_acyclicity,
safe/setup_owners_owner_list_invariant) — only received the generic
"try show <goal type>" hint, which the agent had already tried. The
agent then loops on inspect_lean_goals until the tool budget is gone.

Add an `else` branch that emits shape-aware guidance: `omega` for
arithmetic goals, `split_ifs` for conditionals, explicit list literals
for list invariants, and explicit witnesses for And/Exists. Both
branches remain mutually exclusive so passing-run cases are unaffected.

Verified via direct calls: non-monadic synthesize-placeholder text
now produces the new hint; monadic-trace text still produces the
original run-snd hint only (not both).
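A sketch of the two mutually exclusive branches follows; the goal-shape probes are heuristic assumptions standing in for the harness's real classification, and the hint wording is illustrative.

```python
def synthesis_failed_hint(goal: str) -> str:
    if ").run" in goal and ".snd" in goal:
        # Existing branch: the monadic trace is still present.
        return ("Reduce the `(... .run _).snd` trace with simp "
                "before filling the hole.")
    # New else branch: shape-aware guidance for non-monadic goals.
    tips = []
    if any(op in goal for op in ("+", "-", "*", "<=", "<", ">=", ">")):
        tips.append("arithmetic goal: try `omega`")
    if "if " in goal or "match " in goal:
        tips.append("conditional branches: try `split_ifs` / `split`")
    if "List" in goal:
        tips.append("list invariant: supply the explicit list literal")
    if "∧" in goal or "∃" in goal:
        tips.append("And/Exists: provide explicit witnesses")
    if not tips:
        tips.append("state the target with `show <goal type>` "
                    "and build it term by term")
    return "Cannot synthesize placeholder. " + "; ".join(tips) + "."
```

The `if`/`else` split guarantees a monadic-trace goal never receives both hints, matching the verification note above.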

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Corpus of 29 failed interactive runs: 8 tasks (28%) hit a type_mismatch
where the "is expected to have type" region contains raw
`ContractResult.revert` / `ContractResult.success` / nested
`match match if ...` machinery — i.e. the goal still carries the full
Contract.run trace while the hypothesis has already been simplified
to the concrete storage-map shape. All 8 are `safe/owner_manager_reach`
tasks (add_owner, remove_owner, swap_owner, setup_owners × correctness,
invariant, reachability).

The existing `type_mismatch` hint was only
  "Unfold definitions to align types. Check spec matches impl."
which is not actionable for this shape — the agent doesn't know which
definitions to feed simp, so it loops on `exact h` / `rw` without
reducing the goal. The cross-class `.val` asymmetry detector (shipped
in f5cd8a9) also doesn't cover this shape because the two sides are
not a pure `.val` projection diff; one side has a whole monadic trace.

Add a targeted detector inside `type_mismatch` that matches
"is expected to have type … ContractResult.revert/success/match match"
within 800 chars and emits a directive hint naming the exact reducers
(`Contract.run`, `ContractResult.snd/revert/success`, `Verity.bind`,
`Bind.bind`, `Verity.pure`, `Pure.pure`) plus `split_ifs` on sentinel
guards. Verified negative: a `.val`-only asymmetry still fires the
original `.val` cross-class hint and NOT this new one.
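An illustrative version of the targeted detector is below; the 800-char window and reducer list come from the commit description, while the regex itself is an assumption about the diagnostic text.

```python
import re

# Fires when the expected type still carries Contract.run machinery:
# raw ContractResult.revert/success or nested `match match` within
# 800 chars of "is expected to have type".
_CONTRACT_TRACE_RE = re.compile(
    r"is expected to have type.{0,800}?"
    r"(ContractResult\.(?:revert|success)|match\s+match)",
    re.DOTALL,
)

REDUCERS = ("Contract.run", "ContractResult.snd", "ContractResult.revert",
            "ContractResult.success", "Verity.bind", "Bind.bind",
            "Verity.pure", "Pure.pure")

def contract_trace_hint(details: str):
    if not _CONTRACT_TRACE_RE.search(details):
        return None
    return ("The goal still carries the full Contract.run trace. Feed simp "
            "the reducers [" + ", ".join(REDUCERS) + "] and use split_ifs "
            "on sentinel guards.")
```

A `.val`-only asymmetry contains none of these markers, so it falls through to the cross-class `.val` hint as the negative verification requires.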

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus of 29 failed interactive runs: 19 tasks (66%) emit at least one
`unknown identifier '<tactic>'` diagnostic. Token-count breakdown:
  173 × 'simp'            100 × 'simpa'      52 × 'omega'
   43 × 'native_decide'    24 × 'simp_all'   10 × 'exact'
One task alone (safe/swap_owner_is_owner_correctness) has 52 repeat
occurrences of `unknown identifier 'simp'` in a single run — meaning
the agent kept submitting proofs with `exact simp [...]` 52 times.

The existing tactic-in-term-position hint inside `_build_check_hints`
(~line 1466) is wired correctly — it classifies the failure and emits
an actionable "wrap in `by`" hint — but `_filter_seen_hints` suppresses
it after the first emission. On the 2nd–Nth repeat, the agent gets no
feedback tying its specific, detectable mistake to each rejection.

This is the same failure mode as the UNFILLED HOLE warning (fixed in
4e7f31c): a state-conditional critical warning must repeat as long as
the state persists. Mirror that fix — re-detect the tactic-in-term
case against the current `details` and `hints.insert(0, ...)` post
dedup, so the warning names the specific tactic (`simp`, `omega`, …)
on every rejection. The hint is keyed to the concrete error-text state,
not the generic hint corpus, so it is not a "noise" dedup candidate.

Verified: given synthetic details with three `unknown identifier` lines
(simp, simp, omega), `_UNKNOWN_IDENT_RE` + `_LEAN_TACTIC_NAMES` filter
yields ['simp','simp','omega'] and the hint names the first ('simp').
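The verified behavior above can be reproduced with this sketch, where `_UNKNOWN_IDENT_RE` and `_LEAN_TACTIC_NAMES` are stand-ins for the harness's real module-level definitions (the actual tactic list may differ).

```python
import re

_UNKNOWN_IDENT_RE = re.compile(r"unknown identifier '([^']+)'")
_LEAN_TACTIC_NAMES = {"simp", "simpa", "omega", "native_decide",
                      "simp_all", "exact", "decide", "ring", "norm_num"}

def tactic_in_term_hits(details: str) -> list:
    """All unknown-identifier names that are actually tactic names."""
    return [name for name in _UNKNOWN_IDENT_RE.findall(details)
            if name in _LEAN_TACTIC_NAMES]

def post_dedup_tactic_warning(details: str, hints: list):
    """Re-detect against the CURRENT details and insert after dedup, so
    the warning repeats on every rejection while the state persists."""
    hits = tactic_in_term_hits(details)
    if hits:
        hints.insert(0, f"`{hits[0]}` is a tactic, not a term: wrap it "
                        f"in `by` (e.g. `by {hits[0]}`), not "
                        f"`exact {hits[0]} [...]`.")
```

Because the check is keyed to the error text of the submission at hand, it bypasses `_filter_seen_hints`-style suppression by construction.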

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus of 29 failed interactive runs: 6 tasks (21%) emit
`unknown identifier '<camelCase>'` for names that are clearly
local-binder-shaped (no dots, lowercase first char, no underscores).
Per-task repetition is extreme:
  safe/swap_owner_is_owner_correctness  110× (91 prevOwner, 19 oldOwner)
  safe/remove_owner_owner_list_invariant  56× owner
  safe/swap_owner_in_list_reachable       41× (hKey, hKeyNew, hKeyPrev, hKeyOld, newOwner)
  safe/add_owner_owner_list_invariant     34× owner
  safe/in_list_reachable                  21× s
  safe/setup_owners_owner_list_invariant  15× (owner1, owner2, owner3)

The existing local-variable hint in `_build_check_hints` (~line 1475)
is actionable — tells the agent to call inspect_lean_goals and re-check
the signature — but `_filter_seen_hints` suppresses it after first
emission. Same failure mode as UNFILLED HOLE (4e7f31c) and tactic-in-
term (30923f9): state-conditional critical warning that must repeat
as long as the state persists.

Mirror those fixes — re-detect the local-variable case against the
current `details` and `hints.insert(0, ...)` post-dedup, so the
warning names the specific out-of-scope binder on every rejection.

Suppress the binder hint when a tactic hit is also present: Lean reports
both with the same `unknown identifier` diagnostic, and the tactic
mistake is almost always the upstream cause of the binder being
unreachable, so firing both would just fill the agent's context with
duplicate advice for the same line.

Verified: pure tactic case → tactic hint only; pure var case → var
hint; mixed case → tactic only (var suppressed); Mathlib name (has
dot) → neither fires (handled by the existing mathlib branch).
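A hypothetical mirror of the binder re-detection, with the suppression logic, might look like this; the binder-shape test (no dots, lowercase first char, no underscores) is taken directly from the corpus description above, and the names are illustrative.

```python
import re

_UNKNOWN_IDENT_RE = re.compile(r"unknown identifier '([^']+)'")
_TACTICS = {"simp", "simpa", "omega", "native_decide", "simp_all", "exact"}

def _is_binder_shaped(name: str) -> bool:
    # Local-binder-shaped: no dots, no underscores, lowercase first char.
    return ("." not in name and "_" not in name
            and name[:1].islower() and name.isalnum())

def binder_warning(details: str, hints: list):
    names = _UNKNOWN_IDENT_RE.findall(details)
    tactic_hits = [n for n in names if n in _TACTICS]
    binder_hits = [n for n in names
                   if n not in _TACTICS and _is_binder_shaped(n)]
    if tactic_hits:
        # Tactic mistake is the upstream cause; suppress the binder hint.
        hints.insert(0, f"`{tactic_hits[0]}` is a tactic, not a term: "
                        "wrap it in `by`.")
    elif binder_hits:
        hints.insert(0, f"`{binder_hits[0]}` is not in scope here. Call "
                        "inspect_lean_goals and re-check the theorem "
                        "signature.")
```

Dotted names like `List.mem` match neither branch and fall through to the existing Mathlib-name handling.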

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>