harness: interactive proxy robustness + analytics fixes #26
Conversation
Major improvements to the interactive agent runtime to reduce spurious failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools don't race the Lake build (eliminates the olean-not-found flake).
- Add an environment_error failure class with one-shot self-heal: when Lean can't find a dependency olean, auto-run `lake build` and retry before reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole, theorem_statement_mismatch, hidden_proof_import, etc.) so the model sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back through extra_body.fallback_models, and silently grow max_completion_tokens (up to 12000) on a length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 -> 4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden the hidden-import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive attempts by falling back to a stable_digest of candidate_file_contents when the trace is missing.
- Add a light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 503a0654fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)
```
Cache only successful lake builds
prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.
```python
if candidate_hash not in distinct_candidate_hashes:
    distinct_candidate_hashes.add(candidate_hash)
    if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
        if candidate_text != previous_candidate:
            candidate_change_count += 1
```
Count candidate changes on every transition
In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.
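The suggested counting rule, distinct hashes for `distinct_candidate_count` but every transition for `candidate_change_count`, can be sketched like this (the function name and the use of `hash` as a stand-in for stable_digest are illustrative):

```python
def candidate_stats(candidates):
    """Return (distinct_count, change_count) over an attempt sequence."""
    distinct_digests = set()
    change_count = 0
    previous = None
    for text in candidates:
        distinct_digests.add(hash(text))  # stand-in for stable_digest
        # Count every transition, including reverts such as A -> B -> A.
        if previous is not None and text != previous:
            change_count += 1
        previous = text
    return len(distinct_digests), change_count
```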
- P1: only cache *successful* lake builds so failed builds can be retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked modules as attempted before invoking lake, which prevented the self-heal path from ever recovering from a transient build failure in batch/suite runs that share a process.
- P2: count every candidate transition in the interactive analytics fallback, including reverts (A -> B -> A now counts 2 changes, not 1). Previously the increment was gated by "hash not yet seen", which undercounted non-monotonic candidate sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bugbot (Medium): `_attempt_lake_build` returned False when the module was already cached as built, so after `prebuild_task_modules` populated the cache, every subsequent environment_error auto-heal attempt short-circuited without invoking lake and without retrying the check, while the model was told "The harness is attempting to rebuild it". The auto-heal path only fires when the runtime has observed a missing .olean at check time, which means the cached 'success' entry is stale. Always invoke lake build in this path and refresh the cache with the latest result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Low: merge the two back-to-back `if config.mode == "interactive":` blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as `inspect_lean_goals`, so the two tools agree on what counts as a hole (previously the tightened lookbehind in HOLE_PATTERN was not mirrored in `try_tactic_at_hole`, which could let it replace a match that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 950d7e3145
```python
try:
    return max(0.0, float(value))
except ValueError:
    return None
```
Support date-form Retry-After values
_parse_retry_after only accepts numeric delta-seconds, but Retry-After is also commonly sent as an HTTP-date on 429/503 responses. In that case this function returns None, so _backoff_delay uses a short exponential retry instead of the server-requested wait, which can repeatedly hit rate limits and prolong failures for overloaded providers. Please parse HTTP-date values as well before computing retry delays.
Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the function returned None and `_backoff_delay` fell back to a short exponential retry instead of honouring the server-requested wait. That can repeatedly hit rate limits and prolong failures. Now parse both forms; a date in the past clamps to 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
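A minimal sketch of a parser handling both forms, assuming nothing about the harness's actual `_parse_retry_after` beyond what the commit describes (the stdlib `email.utils` covers the HTTP-date form):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse Retry-After as delta-seconds or an HTTP-date; None if neither."""
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when is None:
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past clamps to 0, matching the fix description.
    return max(0.0, (when - now).total_seconds())
```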
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ddda642d4
```python
if retry_after is not None:
    return min(retry_after, 60.0)
```
Respect full Retry-After delay from provider
_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.
```python
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),
```
Narrow environment-error matching to real infra failures
The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.
Codex P2 #1: narrow `environment_error` classification to real infra failures. Previously any missing .olean matched, but that string is also produced when the model imports a non-existent module. In the latter case we want the normal stagnation/temperature logic to kick in so the model can correct itself. Now only classify as environment_error when the missing module is under `Benchmark.*` (our dependency tree, which should have been pre-built) or when lean itself is missing. The generic "failed to load environment" substring no longer short-circuits.

Codex P2 #2: raise the `_backoff_delay` Retry-After ceiling from 60s to 600s. Providers routinely request several-minute waits on 429/503; the old 60s clamp caused retries to fire while the rate limit was still in force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
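A sketch of the resulting delay policy with the raised ceiling; the exponential base, jitter shape, and injectable `rng` are illustrative, only the 600 s Retry-After cap comes from the commit:

```python
import random

def backoff_delay(attempt, retry_after=None, rng=None):
    """Exponential backoff with jitter; honour Retry-After up to 600 s."""
    if retry_after is not None:
        # Providers may legitimately request multi-minute waits on 429/503.
        return min(retry_after, 600.0)
    rng = rng or random.random
    base = min(2.0 ** attempt, 30.0)
    return base + rng() * base * 0.25
```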
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df1004c256
```python
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")
```
Handle 'of module' form in missing-olean regex
Lean commonly reports missing artifacts as object file '...olean' of module <Name> does not exist, but _MISSING_OLEAN_RE only matches the shorter '...olean' does not exist form. Fresh evidence in this revision is that the new regex at this line requires does not exist immediately after the path, so _missing_olean_module returns None for the standard message shape; that prevents classify_failure from returning environment_error and skips the new auto-heal lake build retry path in run_lean_check.
Bugbot (Medium): the temperature schedule condition was checked at the top of every loop iteration, so once two consecutive same-class failures triggered it, temperature was bumped on every subsequent turn, including pure search/write turns, until the 0.7 cap was reached. Track the history length we have already acted on and only bump once per new failure entry.

Codex (P1): Lean reports missing artifacts in two forms depending on context:

    object file '...olean' does not exist
    object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the more common "of module" diagnostic `_missing_olean_module` returned None, `classify_failure` did not return `environment_error`, and the auto-heal retry path was skipped. Accept arbitrary text between the path and the "does not exist" tail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
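The broadened match can be sketched as follows; the helper name is hypothetical, only the two diagnostic shapes come from the commit message:

```python
import re

# Accept both diagnostic shapes Lean emits for a missing build artifact
# (the second, "of module", form is the common one).
_MISSING_OLEAN_RE = re.compile(
    r"object file ['\"]([^'\"]+\.olean)['\"]?.*?does not exist"
)

def missing_olean_path(details):
    """Return the missing artifact path, or None when no diagnostic matches."""
    match = _MISSING_OLEAN_RE.search(details or "")
    return match.group(1) if match else None
```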
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 894a285f92
```python
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
    return "environment_error"
```
Verify missing Benchmark module exists before env classification
Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.
Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar` where no such module exists would still produce an "object file ...olean does not exist" diagnostic, and under the previous check that matched the `Benchmark.*` prefix it would be misclassified as infrastructure failure, skipping stagnation/temperature correction instead of letting the model discover and fix the bogus import. Require the corresponding `.lean` source file to actually exist in the tree before treating a missing .olean as environment_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
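A sketch of the described check, with hypothetical names; the point is that a `Benchmark.*` module only counts as infrastructure when its `.lean` source is actually present in the tree:

```python
from pathlib import Path

def classify_missing_olean(module_name, root):
    """environment_error only when the .lean source actually exists in-tree."""
    if not module_name.startswith("Benchmark."):
        return "model_error"
    source = Path(root) / (module_name.replace(".", "/") + ".lean")
    # A missing source means the model invented the import; a present source
    # means the pre-built artifact is genuinely absent (infrastructure).
    return "environment_error" if source.is_file() else "model_error"
```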
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42efe75153
```python
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
    break
```
Retry alternate models on all transient HTTP statuses
send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.
Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
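The fix amounts to routing both decisions through one shared set; a minimal sketch (names assumed from the commit message):

```python
# One transient-status set shared by the retry loop and the
# fallback-model gate, so the two can never disagree.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback(status):
    # Status 0 stands in for transport-level failures without an HTTP status.
    return status == 0 or status in RETRY_STATUS_CODES
```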
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8d43bbfae
```python
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), 12000)
```
Cap length-retry token bumps at model-safe limit
The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap the length-retry token bump at config.max_completion_tokens. The previous hard-coded cap of 12000 could exceed the provider's per-response output-token limit (e.g. models with a 4096 hard cap), turning a recoverable truncation into a non-transient HTTP 400 hard failure.
2. (bugbot Low) Also reset token_budget when recovering from a length streak, not only when consecutive_length_stops >= 5. Previously a 1-4 stop streak followed by recovery left the elevated budget in place for the rest of the run.
3. (bugbot Low) Skip environment_error entries when appending to failure_class_history. They are infra noise that would break the sliding-window same-class check (e.g. [type_error, env_error, type_error]) even though the filter at the trigger site rejects them, consistent with the PR's stated intent that env_error be invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07c3880b8d
```python
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)
```
Make length-stop retries raise max_tokens
execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.
Python 3.12 surfaces socket read timeouts during SSL as bare TimeoutError, which slipped past the urllib.error.URLError handler and killed the task with a traceback. Catch it alongside URLError and apply the same backoff retry policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
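A sketch of the widened exception handling; the wrapper shape and backoff are illustrative, the key point is catching `TimeoutError` alongside `urllib.error.URLError`:

```python
import time
import urllib.error

def post_with_retry(do_post, retries=3, sleep=time.sleep):
    """Retry the transport call on URLError and on the bare TimeoutError
    that Python 3.12 raises for socket read timeouts during SSL.
    (Hypothetical wrapper; the real harness applies jittered backoff.)"""
    last_error = None
    for attempt in range(retries):
        try:
            return do_post()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc
            sleep(2.0 ** attempt)  # simple exponential backoff
    raise last_error
```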
The previous commit capped the silent length-retry bump at config.max_completion_tokens, which, as bugbot and codex both pointed out, turned the feature into a no-op because token_budget starts at exactly that value. Add a config.extra_body.length_retry_token_cap knob (stripped from the provider payload in send_chat_completion). The default stays at max_completion_tokens so models with a hard per-response cap don't get surprised by HTTP 400; interactive.json opts into a 12000 ceiling, which is the original intended behavior for gpt-class models that accept larger single-response budgets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5f118c37b
```python
fallback_models = [
    str(item)
    for item in (config.extra_body.get("fallback_models") or [])
    if isinstance(item, str) and item.strip()
]
```
Validate fallback_models type before iterating
fallback_models is read from untyped extra_body and iterated directly, so a common single-string config (for example "fallback_models": "gpt-4o-mini") is treated as characters and produces model ids like "g", "p", "t" during failover. In transient-outage scenarios this makes the fallback chain fail even though a valid fallback was configured. Please require a list/tuple of strings (or normalize a single string to a one-element list) before building models_to_try.
```python
length_retry_token_cap = int(
    config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)
```
Handle invalid length_retry_token_cap without crashing
Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.
Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated character-by-character, producing fake model ids like "g", "p", "t" during failover. Normalize a bare string to a one-element list first.
- length_retry_token_cap=null / "12k" / nested object would crash the int() cast before the first model turn, aborting the task. Fall back to max_completion_tokens silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
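Both normalizations can be sketched as small helpers (hypothetical names; the real code lives inline in `send_chat_completion` and the interactive loop):

```python
def normalize_fallback_models(raw):
    """Accept a list of model ids or a bare-string shorthand."""
    if isinstance(raw, str):
        raw = [raw]
    if not isinstance(raw, (list, tuple)):
        return []
    return [item.strip() for item in raw if isinstance(item, str) and item.strip()]

def parse_token_cap(extra_body, default):
    """Defensively parse length_retry_token_cap; fall back to default."""
    value = (extra_body or {}).get("length_retry_token_cap", default)
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```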
`_LEAN_BLOCK_HEADER_RE` and `_FP_LINE_COL_RE` hardcoded the source
prefix to `CandidateCheck.lean:`, which only fires for the stub file
`evaluate_candidate` writes when `check_goals=False` (i.e. the
`run_lean_check` / `write_editable_proof` code paths). The
`inspect_lean_goals` path runs Lean against the actual editable file
— paths like `Benchmark/Generated/Foo/Bar.lean:` — and so fell
through both filters untouched.
Corpus analysis of 83 interactive runs found:
- 32/88 (36%) of inspect_lean_goals outputs still carried full
`linter.unusedSimpArgs` warning blocks (verified on the largest
leaker: 9.8 KB of noise inside a 24.5 KB output — a 40% reduction
with the fix applied).
- Stagnation detection's error-text fingerprint left the file path
in place for inspect_lean_goals outputs, so the same underlying
error surfaced via two different Lean invocations produced two
different fingerprints and escaped the no-progress loop detector.
Broadening both regexes to `\S+\.lean:LINE:COL:` lets a single
strip + fingerprint pipeline cover every tool that reports Lean
diagnostics, regardless of which source file Lean named.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
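The broadened header regex and its use in fingerprinting can be sketched like this; `fingerprint` is a hypothetical stand-in for the real strip/fingerprint pipeline:

```python
import re

# Match a Lean diagnostic header for any source file, not just the stub
# CandidateCheck.lean written by evaluate_candidate.
LEAN_BLOCK_HEADER_RE = re.compile(
    r"^(?P<path>\S+\.lean):(?P<line>\d+):(?P<col>\d+):", re.MULTILINE
)

def fingerprint(diagnostic):
    """Drop the path/line/col prefix so the same error surfaced via
    different Lean invocations collapses to one fingerprint."""
    return LEAN_BLOCK_HEADER_RE.sub("<loc>:", diagnostic)
```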
The `placeholder_detected` write-warning told the agent to replace sorry/admit/axiom "before run_lean_check". That instruction is stale on two counts: * commit b2bd74f (2026-04-22 18:29) folded run_lean_check into write_editable_proof, so by the time this warning is returned the Lean check has already run. There is no separate "before" moment. * commit f5ceebe (2026-04-22 20:01) removed every prompt-side hint telling the agent to call run_lean_check after write_editable_proof, aligning the tool surface with the folded architecture. The current wording therefore gestures at a workflow that no longer exists, undercutting the "run_lean_check is not needed after write_editable_proof" description already in the tool surface. Corpus analysis (83 interactive runs, 372 write_editable_proof calls) found this warning fired 13 times, always alongside a Lean verdict for the same proof — so the agent was seeing a "do X before Y" hint at the exact moment it was also reading Y's output, a self-contradiction that adds no new signal. Rewording to "Lean rejects these — replace with a real tactic or a `?_` hole." keeps the actionable advice while dropping the stale sequencing language and pointing the agent at the legitimate `?_` probe workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`_extract_theorem_signature` used the non-greedy pattern `.*?(?::=)` to find the proof marker, but that stops at the first `:=` token, which in 74 of 88 (84%) benchmark task files lives inside a `let x := ...` binding within the statement. The result is a truncated signature that makes the theorem_statement_mismatch check a false negative for any edit past that let-binding. All 88 task files use `:= by` as the actual proof start, so anchoring on `:=\s*by\b` cleanly captures the full signature without regressing the 14 let-free cases.
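A sketch of the anchored extraction; only the regex is taken from the change, the wrapper function is illustrative:

```python
import re

def extract_theorem_signature(source, short_name):
    """Capture the full statement up to `:= by`, surviving `let x := ...`
    bindings inside the statement."""
    pattern = re.compile(
        rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
        re.DOTALL,
    )
    match = pattern.search(source)
    return match.group("signature").strip() if match else None
```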
…ublic_defs
Corpus analysis of 83 interactive runs found 20/29 failed tasks (69%)
hit "unknown identifier 'simp'" / '<tactic>' errors at least once. The
corpus-wide count of unknown-identifier incidents is dominated by
tactic names — 17 `simp`, 10 `simpa`, 6 `omega`, 4 `exact`,
4 `native_decide`, 3 `intro`, 3 `simp_all`, 2 `by_cases` — i.e.
~60% of cases where Lean emits `unknown identifier` are tactic-in-term-
position mistakes (`exact simp [...]`, `:= by_cases h`, etc.), not
missing definitions.
The existing `unknown_identifier` hint (both the per-failure hint
and the stagnation-escalation hint) sent the model to
`search_public_defs` to look for the name, which cannot help:
these are language keywords, not definitions. The model would burn
tool calls searching and loop.
Detect the tactic-name case and emit a position-specific hint instead
("wrap with `by`, or drop the `exact`/`refine` prefix"), and
skip the misleading "Use search_public_defs" hint in that branch.
Non-tactic identifiers (missing lemmas, typos like `Nat.div_mul_le`)
still get the original advice.
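The routing can be sketched as a shape check before the generic advice; the tactic list and hint wording are illustrative:

```python
# Tactic names the corpus shows being misused in term position.
TACTIC_NAMES = {"simp", "simpa", "omega", "exact", "native_decide",
                "intro", "simp_all", "by_cases"}

def unknown_identifier_hint(name):
    """Position-specific hint for tactic names; generic lookup otherwise."""
    if name in TACTIC_NAMES:
        return (f"`{name}` is a tactic, not a term. Wrap the expression "
                "with `by`, or drop the `exact`/`refine` prefix.")
    return f"Use the search_public_defs tool to look up `{name}`."
```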
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19bb7323c0
```python
rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
re.DOTALL,
```
Accept theorem bodies not written with `:= by`
_extract_theorem_signature now only matches declarations where the theorem body starts with := by, so a candidate that keeps the exact same theorem statement but uses a valid term-style body (for example := <term>) produces None and is treated as a theorem_statement_mismatch in preflight_candidate. That turns semantically valid proofs into hard preflight failures and blocks the interactive loop from evaluating them with Lean.
…ic_defs

Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/ classified unknown-identifier incidents across failed tasks:

    51 tactic-in-term-position (addressed by 19bb732)
    34 "other" — dominated by camelCase locals: prevOwner (17), owner (6), ...
    12 snake_case lemma guesses
    12 qualified-name misses
     4 var_like already matching a narrower pattern

38 of these (the 34 "other" + 4 var_like) are local-variable-shaped names (camelCase, no dot, no underscore, leading lowercase) affecting 6/29 failed tasks. The previous hint sent the agent to search_public_defs, which cannot resolve local binders and produced budget-wasting loops.

The new branch detects var-like names and instead points at inspect_lean_goals plus re-reading the theorem signature for the real parameter names, explicitly warning that search_public_defs is wrong for this shape. The tactic-in-term-position path is preserved (checked first), and the fallback "Check imports" hint still fires for snake/qualified misses that match neither category.
…haped
Corpus analysis of the 29 failed runs under
results/agent_runs/custom/interactive-proxy/ shows 5 tasks (17%)
stagnating on guesses that only exist in Mathlib:
ethereum.../full_deposit_preserves_partial_gap: sub_eq_sub_right,
add_sub_add_right_eq_sub
openzeppelin.../preview_deposit_rounds_down: Nat.cast_mk, Nat.div_def,
Nat.div_mul_le, Nat.le_div_mul
paladin.../weth_claimed_plus_allocated: add_assoc, add_left_comm,
sub_eq_add_neg, add_comm
safe.../in_list_reachable: not_eq
zama.../transfer_conservation: lt_of_add_lt_add_right,
Nat.not_ge.mp
This workspace has no Mathlib dependency — only core Lean 4, Batteries,
and the task's own Benchmark.* modules. The prior fallback hint
("Check imports. Standard names: Nat.lt_of_not_ge, Nat.not_le_of_lt")
was misleading because the agent kept searching for imports that cannot
be added. Worse, it cited a `Nat.*` name as a "standard" example while
the agent was already being burned by the Nat.* namespace.
Detect Mathlib-shape names (arithmetic-prefix `add_*`/`sub_*`/`le_*`/...,
common exact names like `add_assoc`/`add_comm`, and any `Nat.*`-qualified
guess) and emit a hint that: (a) names the absent dependency explicitly,
(b) redirects to `omega`/`ring`/`simp arith` tactics, (c) reminds the
agent that search_public_defs takes a KEYWORD, not a guessed lemma name.
Precedence order preserved: tactic-in-term (19bb732) ranks highest,
then var_like (b24ef3f), then decide_True (legacy), then Mathlib,
then generic search_public_defs. The misleading "Check imports..."
trailer is suppressed when the Mathlib branch fires.
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
6 of 16 `type_mismatch` incidents (38% of this class), spanning 5 of 29
failed tasks (17%), hit the same recurring shape:
has type ¬x = 0 : Prop
but is expected ¬x.val = 0 : Prop
The `.val` projection on Uint256 / Address / Nat is missing on one side.
Affected failed tasks:
openzeppelin.../preview_deposit_rounds_down
openzeppelin.../positive_deposit_mints_positive_shares_under_rate_bound
lido.../ceildiv_sandwich
safe.../remove_owner_in_list_reachable
safe.../add_owner_is_owner_correctness
The agent repeatedly tried `exact h` against the asymmetric goal instead
of letting simp bridge the coercion. The prior hint
("Unfold definitions to align types") did not name the structural issue,
so the agent cycled on `exact`-family tactics until the tool budget ran
out.
New hint fires only when the "has type … but is expected to have type …"
substrings differ by `.val` on exactly one side, and points the agent at:
- `by simpa using h` / `by simp_all` (simp bridges the projection)
- `by omega` once `.val` is exposed on both sides for arithmetic goals
- the underlying injectivity lemma via search_public_defs for negated-
equality mismatches
Precedence is preserved: the existing `decide` branch runs first (when
both match), and the generic "Unfold definitions" trailer is still
appended as a fallback.
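The one-sided `.val` detection can be sketched as a string comparison after stripping the projection (illustrative; the real hint builder extracts the "has type … but is expected" pair first):

```python
def val_projection_mismatch(has_type, expected_type):
    """True when the two type strings differ only by a `.val` projection
    on exactly one side."""
    if has_type == expected_type:
        return False
    one_sided = (".val" in has_type) != (".val" in expected_type)
    # After removing the projection, the two sides must agree exactly.
    return one_sided and has_type.replace(".val", "") == expected_type.replace(".val", "")
```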
Commits b24ef3f and 31ff5e8 taught `_build_check_hints` to distinguish `unknown identifier '<var>'` (local binder) and `unknown identifier '<mathlib-name>'` (unavailable dependency) from generic definition misses. However `_build_escalation_hint`, which fires once the agent has repeated the same failure_class multiple times, still emitted only "Stop guessing identifier names. Use the search_public_defs tool" for every non-tactic case. That meant an agent stagnating on `prevOwner` / `add_comm` / `Nat.div_mul_le` kept being told to search_public_defs even though:

* local binders will never be found by search_public_defs, and
* Mathlib lemmas do not exist in this workspace at all.

Corpus evidence (same 83-run dataset):

    6/29 failed tasks surface a var-like unknown identifier at escalation time
    5/29 failed tasks surface a Mathlib-shaped unknown identifier at escalation

Route the escalation hint the same way as the check hint:

    tactic   -> wrap in `by` (existing)
    var_like -> inspect_lean_goals to see real binders
    mathlib  -> omega / ring / simp arith; search_public_defs takes a keyword
    else     -> original generic "stop guessing" fallback

No behavior change when the details string is empty (falls through to the generic fallback as before).
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
21 parse-error events across 14 of 29 failed tasks (48%) surface one of
- "unexpected token '…'" (11 incidents; `using`/`at`/`:`/`;`/`|`/…)
- "unexpected identifier" (5 incidents)
- "expected '{' or indented tactic sequence" (5 incidents)
Most of these incidents (19/21) coexist with a classifiable semantic error
and so get routed to `unknown_identifier`, `unsolved_goals`, etc. But 2
failed-task incidents collapse to `failure_class=other` because the parse
error is the ONLY signal the checker emits:
lido.../locked_funds_solvency unexpected token ';'
zama.../burn_decreases_supply unexpected token '|' (first failure
turn — lost the whole budget starting
from a malformed signature)
Extend `classify_failure` with a `parse_error` bucket and a dedicated hint
that: (a) names the likely causes (tactic-in-term, missing `by`, stray
`;`/`|`/`using`, bullet-indentation mismatch), (b) tells the agent to
re-read the editable file via read_public_file to see character positions,
(c) advises a clean `:= by <tactics>` rewrite over token-by-token patching.
Placement: `parse_error` comes LAST in classify_failure before "other", so
it never steals classification from more specific errors. The existing
cross-cutting "Syntax error. Ensure the theorem body uses `:= by`…"
catch-all still fires as a secondary hint.
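The "last before other" placement can be sketched as an ordered chain of checks. This is a hypothetical outline, not the real `classify_failure` — the specific classes in the middle are elided:

```python
def classify_failure(details: str) -> str:
    # Specific semantic classes claim the failure first...
    if "unknown identifier" in details:
        return "unknown_identifier"
    if "unsolved goals" in details:
        return "unsolved_goals"
    # ... other specific classes elided ...
    # parse_error only claims what nothing more specific did,
    # so it never steals classification from semantic errors.
    if ("unexpected token" in details
            or "unexpected identifier" in details
            or "expected '{' or indented tactic sequence" in details):
        return "parse_error"
    return "other"
```

Because the chain is ordered, the 19/21 incidents that coexist with a semantic error still route to their semantic class; only pure parse failures land in the new bucket.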
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c6c7adf96
if tool_name in ("run_lean_check", "write_editable_proof") and result.get("failure_mode") == "lean_check_failed":
    saw_lean_failure = True
Track preflight tool failures in temperature history
The temperature schedule only records tool failures when failure_mode == "lean_check_failed", but write_editable_proof now commonly returns preflight failures (for example placeholder_detected or hidden_case_import_detected) with different failure modes. Those repeated failures never get added to failure_class_history, so the repeated-class bump at the top of execute_interactive_agent_task never activates and interactive runs can stay stuck at deterministic temperature until attempt exhaustion.
…dentifier
The interactive-mode user prompt previously instructed:
"For unknown_identifier errors: use search_public_defs to find correct
names."
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/
shows that this blanket advice misleads the agent on 24 of 29 failed tasks
(83%). Those tasks hit at least one `unknown identifier '…'` where
search_public_defs cannot possibly help because the missing name is one of:
- a Lean TACTIC written in term position (simp, simpa, omega,
native_decide, by_cases, …) — search_public_defs searches
*definitions*, not syntactic keywords. Fix is `by <tactic>`.
- a LOCAL BINDER (prevOwner, owner, hKey, …) that is simply
out-of-scope at the use site — binders are not public defs.
- a MATHLIB LEMMA (add_comm, add_sub_*, Nat.div_mul_*, …). Mathlib
is not a dependency in this workspace.
The downstream repair_hints (commits 19bb732, b24ef3f, 31ff5e8) already
route each shape correctly, but the initial instruction primes the agent
to call search_public_defs first and overwrites any subsequent hint.
Rewrite the bullet to:
- tell the agent to read repair_hints before searching,
- enumerate the three "wrong-shape" cases and their remedies, and
- reserve search_public_defs for genuine project-defined names.
Passed runs hit this scenario only 6/54 times (11%), so the clearer
instruction should not distract correct agents while sharply reducing
the misled population in failed runs.
Lean phrases `cases h` on a non-inductive major premise (implication `A → B`, function type, unreduced equality, Bool-valued `==`) as `tactic 'cases' failed, major premise type is not an inductive type` — distinct from the `constructor failed, not an inductive datatype` shape already handled by `constructor_failed`.
Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/: the pattern occurs 22 times in one failed task (safe__owner_manager_reach__setup_owners_acyclicity.json), where the agent repeatedly retried `cases h` on an `hkey = SENTINEL → False` implication because the generic "other" bucket gave no actionable hint — a 22-cycle stagnation loop with no escape.
Split into its own `cases_failed` class covering both `cases` and `induction` variants, with a hint that names the four actual remedies: apply the implication with an argument, `by_cases` for decidable Props, `absurd`/`.elim` for contradiction, or `Bool.ne_iff`/`beq_iff_eq` to rewrite Bool equalities before case-splitting.
When Lean reports unsolved goals with a `case <label>` marker, the agent has ALREADY case-split successfully and only one branch is open. The previous hint set unconditionally appended "Try restructuring: `by_cases h : condition …`", which told the agent to undo its own working split and start over.
Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/: 59 of 127 unsolved_goals incidents (46%) across 22 tasks carry a `case <label>` marker — `case pos`, `case neg`, `case left`, `case right`, etc. 6 of the 8 unsolved_goals final failures had one.
Now: when a case label is present, (a) name the open branch(es) back to the agent so it stops re-inspecting, (b) point at the branch-specific hypotheses that simp_all / omega / absurd can close the branch with, and (c) SUPPRESS the "restructure with by_cases" hint so the agent doesn't undo its progress. The fresh-split hint still fires when no case label is present.
The escalation hint for unsolved_goals / simp_no_progress / rfl_failed / unfold_failed was unconditionally advising the stuck agent to "Start with `unfold <spec_name>` … then `by_cases` on each conditional branch". When the agent is stuck with a `case <label>` already in the goal output, that template tells it to TEAR DOWN a working case-split and start over — pure regression.
Mirror the case-label detection from _build_check_hints (commit 77cb481) into _build_escalation_hint. When `case <label>` markers are present, escalate with branch-closing advice instead of a fresh-split template: (1) inspect_lean_goals at a `?_` in the branch to read `h✝`, (2) `simp_all` with the contract-simp-set to discharge contradictory hypotheses, (3) `exact absurd` when hypotheses literally disagree, (4) `omega` after `simp only` exposes `.val` forms.
Corpus evidence from results/agent_runs/custom/interactive-proxy/: 59 of 127 unsolved_goals incidents (46%) across 22 tasks carry a case label; 6 of 8 unsolved_goals final-failures did. Those are exactly the runs most likely to hit the stagnation threshold and trigger this escalation.
The public impl/spec file set is static for a session, so the same (query, limit) pair always returns the same matches. Cache the first response and short-circuit repeat calls with a `cached: true` marker plus a note telling the agent to try a different query or a different tool.
Corpus evidence from results/agent_runs/custom/interactive-proxy/:
- Failed runs averaged 41.9 search_public_defs calls vs 1.5 on passing runs (28x more).
- Of extracted queries across failed runs, 94% were byte-identical re-queries: e.g. `"removeOwner_ownerListInvariant"` submitted 26 times in one run, `"isChain"` 25 times, `"ceilDiv"` 24 times, `"acyclic"` 25 times. The agent was stuck in a tight re-query loop with no signal that the library wouldn't suddenly start matching.
Mirrors the existing run_lean_check cache pattern (see _last_eval_cache in __init__) — same mechanism: deepcopy the first result, stamp it with `cached: true` + a note on the next identical call, and stay behaviourally-equivalent on distinct queries.
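The cache mechanism described above can be sketched as a small wrapper. `SearchCache` and `search_fn` are hypothetical names standing in for the real tool plumbing:

```python
import copy

class SearchCache:
    """Memoize search results keyed on (query, limit); the file set is static."""

    def __init__(self, search_fn):
        self._search_fn = search_fn
        self._cache = {}

    def search(self, query: str, limit: int = 10) -> dict:
        key = (query, limit)
        if key in self._cache:
            # Repeat call: return a deepcopy stamped with the cached marker.
            result = copy.deepcopy(self._cache[key])
            result["cached"] = True
            result["note"] = ("Identical query already answered; the file set is "
                              "static — try a different query or a different tool.")
            return result
        result = self._search_fn(query, limit)
        self._cache[key] = copy.deepcopy(result)
        return result
```

The deepcopy on both store and retrieval keeps callers from mutating the cached payload, mirroring the run_lean_check cache's behaviour; distinct queries pass straight through unchanged.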
Corpus analysis of 29 failed interactive runs found 38 `omega_failed`
incidents carrying 96 opaque-coercion occurrences in their
counterexample `where:` sections (mul: 45, add: 34, sub: 17). When the
goal mentions `↑(mul a b).val`, `↑(add a b).val`, or `↑(sub a b).val`,
omega treats the whole thing as an opaque Nat and cannot see the
underlying linear arithmetic — even though the operation reduces to
`a.val * b.val` / `a.val + b.val` / `a.val - b.val` whenever the
spec's no-overflow hypothesis holds.
The canonical conversion lemmas already exist in Verity:
Uint256.mul_eq_of_lt (h : a.val * b.val < modulus) :
(a * b).val = a.val * b.val
Uint256.add_eq_of_lt (h : a.val + b.val < modulus) :
(a + b).val = a.val + b.val
Uint256.sub_eq_of_le (h : b.val ≤ a.val) :
(a - b).val = a.val - b.val
…but ZERO proofs in the 83-run corpus — failed OR passed — used them.
Agents searched for related terms ("val_mul", "Uint256 mul add sub ge
theorem lemma val", "div_mul_le") and never landed on the right names,
so the existing generic "introduce helper lemmas that bound the
product" nonlinear hint did not close the loop.
Detect `↑(mul|add|sub …)` in the omega_failed details and emit a
targeted hint that names the specific lemma(s) applicable to the ops
present, spells out the hypothesis shape (`a.val + b.val < modulus`,
typically the spec's `hNoOverflow`), and shows the `rw` call that
turns the goal into a plain Nat arithmetic fact omega can close.
Coverage: 11 / 17 current-classifier omega_failed incidents across 5
distinct failed tasks (lido vaulthub_locked ×3, openzeppelin
erc4626_virtual_offset_deposit ×2) now receive the targeted hint
instead of the generic "out of omega's reach" advice.
Strictly additive: the pre-existing division/modulus and variable-
product nonlinear hints still fire when their own patterns match.
…gue 'reduce to equalities'
Corpus analysis of 29 failed interactive runs found 3 distinct tasks
(damn_vulnerable_defi side_entrance exploit_trace_drains_pool, kleros
sortition_trees root_minus_left_equals_right_subtree, safe
owner_manager_reach add_owner_owner_list_invariant) surfacing
`expected type must not contain free variables` with 19 total
occurrences across attempts. Lean's own error text tells the user
"Use the '+revert' option to automatically cleanup and revert free
variables" — yet the prior hint ("Reduce to concrete equalities
before decide/native_decide") never mentioned `revert` at all and
pointed agents away from the exact remedy Lean recommends.
The trigger across all three tasks is the same shape: `decide` /
`native_decide` / `cases <x>` / `induction <x>` run on a goal that
still mentions local hypotheses (`hLow`, `hHigh`, `nodeIndex`,
`hBorrow`) or pattern-bound names (`val✝`, `isLt✝`) that the tactic
can't close over. Replace the weak hint with a targeted two-option
remedy: (a) `revert` every free name in the goal (plus the Lean 4
`decide +revert` / `native_decide +revert` shortcut Lean itself
suggests) and (b) swap `decide`/`native_decide` for `omega` /
`simp_all` / `exact`, which consult the local context directly.
Corpus analysis of 83 agent runs at results/agent_runs/custom/interactive-proxy (29 failed, 54 passed) found 11 failed runs (38% of failures) ending with an unsolved_goals error whose goal still carried the unfolded monadic trace — `ContractResult.success`/`.revert`, `Contract.run`, or a wrapper like `Core.Address.ofNat ((match ...))` around a nested match over `getMappingAddr`/storage reads. Cross-family: safe/owner_manager_reach (6), zama/erc7984 (2), paladin_votes (1), kleros/sortition_trees (1). Zero of 54 passed runs trip the same marker, so the signal is a clean failure predictor.
In every case the agent kept piling more helpers (`ContractResult.success`, `.revert`, `.snd`, `Contract.run`, …) onto its `simp` list without closing the goal, because the leftover `if <cond>` arms in the trace test PROPOSITIONAL equality while the spec's hypotheses are in BEq form (`(x != zeroAddress) = true`). The pre-existing generic if/match hint never mentioned bridging BEq→Prop nor `split_ifs` on the unreduced arms, so agents escalated with "add one more definition to simp" forever.
Add a trace-specific hint that (a) tells the agent to stop adding to the simp list, (b) proposes `split_ifs`/`split` to force case analysis on each remaining if-branch, and (c) shows the one-line `have hNZ : x ≠ zeroAddress := by simpa using hBEq` preconversion so that a subsequent `simp_all` can close the whole trace in one step. The hint fires alongside the existing case-label hint when both are present (the case-label direction remains "don't re-split"; this hint adds "here is how to close the remaining trace inside that branch").
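The BEq→Prop preconversion from the hint can be sketched in Lean, with hypothetical names (`Address`, `x`, `zeroAddress`, `hBEq`) standing in for the trace's own:

```lean
-- Sketch: bridge the BEq-form hypothesis to the propositional form
-- that simp_all and the if-branches actually need.
example (x zeroAddress : Address) (hBEq : (x != zeroAddress) = true) :
    x ≠ zeroAddress := by
  have hNZ : x ≠ zeroAddress := by simpa using hBEq
  exact hNZ
```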
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 6 total unresolved issues (including 5 from previous reviews).
Reviewed by Cursor Bugbot for commit 1a5b969.
  short_name = self.paths.theorem_name.rsplit(".", 1)[-1]
  pattern = re.compile(
-     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?)(?::=)",
+     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
Signature check falsely triggers on term-mode proofs
Medium Severity
_extract_theorem_signature now requires := by instead of just :=. If the original editable file uses := by (common), expected_theorem_signature is set. But if the model submits a valid term-mode proof (:= fun ... => or := Eq.mpr ...), the extraction returns None, and None != expected_signature fires a false theorem_statement_mismatch warning — misleading the model into thinking it altered the theorem statement when only the proof style changed.
Reviewed by Cursor Bugbot for commit 1a5b969.
💡 Codex Review
Reviewed commit: 1a5b9691e4
fallback_models = [
    str(item)
    for item in raw_fallback
    if isinstance(item, str) and item.strip()
Trim fallback model ids before sending requests
The fallback-model normalization keeps the original string even when it only passed the item.strip() guard, so leading/trailing whitespace is preserved in outbound model IDs. A config such as "fallback_models": [" gpt-4o-mini"] will be treated as a distinct (invalid) model name and fail failover even though the intended model is available. Store item.strip() in fallback_models to avoid avoidable fallback failures.
if exc.status not in RETRY_STATUS_CODES and exc.status != 0:
    break
Continue fallback chain after non-transient model errors
The fallback loop stops on the first non-transient status, which means one bad fallback entry (for example, a 404 model-not-found on the first fallback) prevents trying any later configured models. This makes the fallback chain brittle in real runs where operators often keep multiple backups. Instead of breaking immediately on every non-transient error, continue to subsequent fallback models for model-specific failures.
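The reviewer's suggestion — keep walking the fallback chain on model-specific failures instead of breaking on the first one — could be sketched as follows. All names here are hypothetical (`try_model`, `MODEL_SPECIFIC_STATUSES`), not the harness's actual code:

```python
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}
MODEL_SPECIFIC_STATUSES = {404}  # e.g. model-not-found; assumption of this sketch

def choose_model(primary, fallbacks, try_model):
    """Walk primary + fallbacks; skip past model-specific failures."""
    last_status = None
    for model in [primary, *fallbacks]:
        status, body = try_model(model)
        if status == 200:
            return model, body
        last_status = status
        if status in RETRY_STATUS_CODES or status in MODEL_SPECIFIC_STATUSES:
            continue  # transient or bad model entry: try the next model
        break         # genuinely fatal (auth, bad request): stop the chain
    raise RuntimeError(f"all models failed; last status {last_status}")
```

With this shape, one mistyped fallback entry no longer blocks later backups, while truly fatal errors (auth failures, malformed requests) still abort immediately.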
Corpus analysis of 29 failed interactive runs:
- 3 final-eval failures ended at `synthesis_failed` with a raw `(X).run s).snd` monadic trace still in the goal at the hole (`safe/setupOwners_ownerListInvariant`, `safe/removeOwner_isOwnerCorrectness`, `zama/transfer_sufficient`). The agents had written `exact ?_` without ever unfolding `Contract.run`, so the existing generic synthesis hint — which says "use `?_` with inspect_lean_goals" — just loops them back to the same un-reducible shape.
- 1 additional final-eval failure (`safe/swap_owner_ownerListInvariant`) classifies as `unsolved_goals` and carries the same `.run X).snd` shape, but the existing unsolved_goals monadic-trace detection only matches the literal `Contract.run` / `ContractResult.*` strings, so it missed this case.
Both branches now detect the raw-goal pattern via a regex over goal lines. False-positive rate is clean: 0 of 54 passed runs match on final details, 0 of 28 passed-run intermediate check outputs match the new `synthesis_failed + goal-trace` signal (vs 15 hits across failed runs).
The new hint tells the agent the hole is unreachable until the contract function is unfolded and gives the concrete `simp [X, Contract.run, Verity.bind, ...]` incantation, rather than recommending another round of `inspect_lean_goals` which would reveal the same un-reduced goal.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus analysis of 83 interactive-proxy runs:
- try_tactic_at_hole succeeded 0 of 76 times across the entire corpus (passed runs: 0/6 successes on 4 runs that used it; failed runs: 0/70 successes across 25 runs).
- Passed runs never exceeded a 2-streak of failures — they called the tool 1-2 times, got nothing, and moved on.
- Failed runs hit a ≥3-streak in 14 of 29 tasks (48%), with 4 runs stacking 5-7 consecutive speculative tactics on the same hole.
The tool keeps re-running Lean each call and returns the usual class-based repair hints, but there is no signal telling the agent that the tool itself is the wrong instrument for the remaining work. Adding a per-session counter and injecting a pivot hint at streak ≥ 3 costs one extra `int` of state and zero extra Lean invocations.
The hint explicitly names the shapes of goal that single-tactic closure cannot discharge (BEq↔Prop bridging, residual if/match case analysis, monadic-trace unfolding, multi-step arithmetic) and steers the model to write_editable_proof + inspect_lean_goals instead. Threshold of 3 is chosen to have zero false positives on passed runs (max observed streak = 2).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
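The per-session streak counter described above can be sketched as a tiny stateful tracker. `TacticStreakTracker` is an illustrative name, not the harness's real class:

```python
class TacticStreakTracker:
    """Count consecutive try_tactic_at_hole failures; emit a pivot hint at 3."""

    PIVOT_THRESHOLD = 3  # zero false positives on passed runs (max observed streak = 2)

    def __init__(self):
        self._streak = 0

    def record(self, succeeded):
        """Record one call's outcome; return a pivot hint string or None."""
        if succeeded:
            self._streak = 0
            return None
        self._streak += 1
        if self._streak >= self.PIVOT_THRESHOLD:
            return ("try_tactic_at_hole has now failed {} times in a row; "
                    "single-tactic closure is unlikely to discharge this goal. "
                    "Pivot to write_editable_proof + inspect_lean_goals."
                    ).format(self._streak)
        return None
```

A success resets the streak, so agents that use the tool sparingly (the passing-run pattern) never see the hint.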
The agent occasionally writes `(private) lemma foo ... := by ...` for helper lemmas, which is valid Mathlib syntax but not Lean 4 core. Without Mathlib, Lean reports `unexpected identifier; expected 'abbrev', ..., or 'theorem'` at the `lemma` token — a one-token rename to `theorem` is all that's needed.
Corpus analysis of 83 interactive runs: 3 of 29 failed tasks (10%) wrote `lemma` helpers at some point and never recovered, because the generic parse-error hint talks about tactic-in-term-position, missing `by`, stray tokens, and branch indentation — none of which point at keyword choice.
Detect the exact "expected 'abbrev' ... 'theorem'" keyword-list fingerprint (unique to the top-level-command parse shape) and surface a targeted rename hint before the generic one.
…lasses
Corpus analysis of 83 interactive-proxy run logs shows the Uint256/Address .val coercion asymmetry pattern appears in 14 of 29 failed tasks (48%), but the existing hint only fired when failure_class == "type_mismatch". In 12 of those 14 cases, Lean reports a cascading unsolved_goals or unknown_identifier from the same simp call before the mismatch is surfaced, so the hint was suppressed.
Lift the regex-based asymmetry detector out of the per-class branch and into the cross-class pattern section so it fires whenever the "has type ... but is expected to have type ..." pair disagrees on a .val projection, regardless of which error Lean listed first. Dedup against existing hints via substring check.
…sions
Corpus analysis of 83 interactive-proxy runs: 12 of 29 failed tasks (41%) end with `exact ?_` still in the submitted proof. Every one of those runs re-submitted a `?_`-containing proof 2–9 times after the first rejection.
The "UNFILLED HOLE IN SUBMITTED PROOF" warning already existed, but was inserted into `hints` BEFORE `_filter_seen_hints`, so dedup suppressed it on the 2nd–Nth resubmission. The agent then got no feedback tying its specific, detectable mistake (still-unfilled hole in the exact proof text just submitted) to the failure class, and kept re-submitting `?_`.
Move the hole-warning insertion AFTER the dedup filter so it fires on every submission that still contains `?_`. The warning is keyed to the concrete proof-text state, not to the abstract hint corpus, so it is correctly exempt from dedup — it reports something specifically about the submission the agent just sent, not generic class advice.
Corpus of 29 failed interactive runs: 7 terminate on `don't know how to synthesize placeholder`. The existing conditional hint inside `synthesis_failed` only fires when the goal still contains a `(X).run _).snd` monadic trace (~3 of 7). The remaining ~4 — goals with `s.storage` arithmetic, conditional branches, or list-predicate witnesses (ethereum/full_deposit_preserves_partial_gap, lido/shares_conversion_monotone, safe/setup_owners_acyclicity, safe/setup_owners_owner_list_invariant) — only received the generic "try show <goal type>" hint, which the agent had already tried. The agent then loops on inspect_lean_goals until the tool budget is gone.
Add an `else` branch that emits shape-aware guidance: by-omega for arithmetic goals, split_ifs for conditionals, explicit list literals for list invariants, and explicit witnesses for And/Exists. Both branches remain mutually exclusive so passing-run cases are unaffected.
Verified via direct calls: non-monadic synthesize-placeholder text now produces the new hint; monadic-trace text still produces the original run-snd hint only (not both).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mismatch
Corpus of 29 failed interactive runs: 8 tasks (28%) hit a type_mismatch where the "is expected to have type" region contains raw `ContractResult.revert` / `ContractResult.success` / nested `match match if ...` machinery — i.e. the goal still carries the full Contract.run trace while the hypothesis has already been simplified to the concrete storage-map shape. All 8 are `safe/owner_manager_reach` tasks (add_owner, remove_owner, swap_owner, setup_owners × correctness, invariant, reachability).
The existing `type_mismatch` hint was only "Unfold definitions to align types. Check spec matches impl." which is not actionable for this shape — the agent doesn't know which definitions to feed simp, so it loops on `exact h` / `rw` without reducing the goal. The cross-class `.val` asymmetry detector (shipped in f5cd8a9) also doesn't cover this shape because the two sides are not a pure `.val` projection diff; one side has a whole monadic trace.
Add a targeted detector inside `type_mismatch` that matches "is expected to have type … ContractResult.revert/success/match match" within 800 chars and emits a directive hint naming the exact reducers (`Contract.run`, `ContractResult.snd/revert/success`, `Verity.bind`, `Bind.bind`, `Verity.pure`, `Pure.pure`) plus `split_ifs` on sentinel guards.
Verified negative: a `.val`-only asymmetry still fires the original `.val` cross-class hint and NOT this new one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus of 29 failed interactive runs: 19 tasks (66%) emit at least one `unknown identifier '<tactic>'` diagnostic. Token-count breakdown:
173 × 'simp'
100 × 'simpa'
52 × 'omega'
43 × 'native_decide'
24 × 'simp_all'
10 × 'exact'
One task alone (safe/swap_owner_is_owner_correctness) has 52 repeat occurrences of `unknown identifier 'simp'` in a single run — meaning the agent kept submitting proofs with `exact simp [...]` 52 times.
The existing tactic-in-term-position hint inside `_build_check_hints` (~line 1466) is wired correctly — it classifies the failure and emits an actionable "wrap in `by`" hint — but `_filter_seen_hints` suppresses it after the first emission. On the 2nd–Nth repeat, the agent gets no feedback tying its specific, detectable mistake to each rejection. This is the same failure mode as the UNFILLED HOLE warning (fixed in 4e7f31c): a state-conditional critical warning must repeat as long as the state persists.
Mirror that fix — re-detect the tactic-in-term case against the current `details` and `hints.insert(0, ...)` post dedup, so the warning names the specific tactic (`simp`, `omega`, …) on every rejection. The hint is keyed to the concrete error-text state, not the generic hint corpus, so it is not a "noise" dedup candidate.
Verified: given synthetic details with three `unknown identifier` lines (simp, simp, omega), `_UNKNOWN_IDENT_RE` + `_LEAN_TACTIC_NAMES` filter yields ['simp','simp','omega'] and the hint names the first ('simp').
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
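The post-dedup re-detection can be sketched as follows. The names mirror the commit message (`_UNKNOWN_IDENT_RE`, `_LEAN_TACTIC_NAMES`), but the function itself is an illustrative stand-in for the real insertion site:

```python
import re

_UNKNOWN_IDENT_RE = re.compile(r"unknown identifier '(?P<name>[^']+)'")
_LEAN_TACTIC_NAMES = {"simp", "simpa", "omega", "native_decide", "simp_all", "exact"}

def reinsert_tactic_warning(details, hints):
    """Re-detect tactic-in-term against the current details, AFTER dedup ran."""
    tactic_hits = [m.group("name") for m in _UNKNOWN_IDENT_RE.finditer(details)
                   if m.group("name") in _LEAN_TACTIC_NAMES]
    if tactic_hits:
        # Keyed to the current error text, so it repeats on every rejection
        # even after the generic hint corpus has been deduplicated.
        hints.insert(0, "'{0}' is a TACTIC, not a term — wrap it in "
                        "`by {0} ...`.".format(tactic_hits[0]))
    return hints
```

Because the check runs against the fresh `details` each time, the warning persists exactly as long as the mistake does, which is the behaviour the dedup filter was accidentally suppressing.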
Corpus of 29 failed interactive runs: 6 tasks (21%) emit `unknown identifier '<camelCase>'` for names that are clearly local-binder-shaped (no dots, lowercase first char, no underscores). Per-task repetition is extreme:
safe/swap_owner_is_owner_correctness 110× (91 prevOwner, 19 oldOwner)
safe/remove_owner_owner_list_invariant 56× owner
safe/swap_owner_in_list_reachable 41× (hKey, hKeyNew, hKeyPrev, hKeyOld, newOwner)
safe/add_owner_owner_list_invariant 34× owner
safe/in_list_reachable 21× s
safe/setup_owners_owner_list_invariant 15× (owner1, owner2, owner3)
The existing local-variable hint in `_build_check_hints` (~line 1475) is actionable — tells the agent to call inspect_lean_goals and re-check the signature — but `_filter_seen_hints` suppresses it after first emission. Same failure mode as UNFILLED HOLE (4e7f31c) and tactic-in-term (30923f9): state-conditional critical warning that must repeat as long as the state persists.
Mirror those fixes — re-detect the local-variable case against the current `details` and `hints.insert(0, ...)` post-dedup, so the warning names the specific out-of-scope binder on every rejection. Suppress when a tactic-hit is also present: Lean reports both the same way and the tactic mistake is almost always the upstream cause of the binder being unreachable — firing both would just fill the agent's context with duplicate advice for the same line.
Verified: pure tactic case → tactic hint only; pure var case → var hint; mixed case → tactic only (var suppressed); Mathlib name (has dot) → neither fires (handled by the existing mathlib branch).


Summary
Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.
Changes
Pre-build + self-heal (biggest impact)
- `prebuild_task_modules()` runs `lake build` once per task up front for all `Benchmark.*` modules needed by the case, so `inspect_lean_goals` / `try_tactic_at_hole` / `run_lean_check` never race an incomplete build.
- `environment_error` failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke `lake build` for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.

Tool feedback
- `write_editable_proof` now returns immediate warnings for: placeholder detected, hidden `Proofs` import, unfilled `?_` hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
- `HOLE_PATTERN` tightened to `(?<!\w)\?_(?!\w)` so identifiers like `?_foo` don't falsely trigger.
- `HIDDEN_PROOF_IMPORT_PATTERN` now covers `import`/`open`/`export` of `Benchmark.Cases.*.Proofs` paths.
- `TOOLS.md` is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.

HTTP resilience
- `send_chat_completion` rewritten with retry/backoff + jitter (up to 6 attempts), honours `Retry-After`, and falls back through `extra_body.fallback_models` on persistent failure.
- On `finish_reason: length`, bump `max_completion_tokens` (up to 12000, 3 silent bumps) before counting the attempt as a failure.

Defaults
- `agents/interactive.json`: `max_completion_tokens` 2000 → 4096, `max_tool_calls` 24 → 40.

Analytics
- `build_run_analysis` now has a fallback path for interactive attempts (which don't populate `trace`) that derives `stable_digest` from `candidate_file_contents`, so `distinct_candidate_count` / `candidate_change_count` are correct.

Schema
- `schemas/agent-run.schema.json`: add `prebuild_reports` property.

Verification
Ran the 3 quick tasks on both `gpt` and `builtin/smart`: 6/6 pass, analytics report `distinct=1 change_count=1` (previously 0).

Test plan
- `python -m py_compile` on all edited modules
- `gpt` (in flight)

🤖 Generated with Claude Code
Note
Medium Risk
Medium risk because it refactors interactive execution flow (tool semantics, retry/backoff, temperature/token budgeting) and Lean check behavior, which can affect benchmark determinism and provider interactions.
Overview
Makes the interactive agent loop more robust and informative:
- `write_editable_proof` now runs Lean automatically and returns structured warnings/hints; tool prompts are synthesized in interactive mode to match the actual callable function tools and discourage nonexistent shell commands.
- Hardens Lean verification by prebuilding task modules upfront (`prebuild_task_modules` + `prebuild_reports` in results), classifying/auto-healing `environment_error` cases (missing `.olean`), filtering noisy Lean warnings, capping details size, and improving hole/tactic handling (`?_` matching, term-position tactic wrapping, cached `run_lean_check`, cached `search_public_defs`, richer `try_tactic_at_hole` failures).
- Improves provider resilience and run analytics: chat completions gain retry/backoff with `Retry-After`, optional fallback model chain via `extra_body.fallback_models`, smarter length-stop token bumping (`length_retry_token_cap`), a temperature escalation schedule on repeated failure classes, and corrected candidate-change/distinct-candidate counting for interactive runs (plus schema/default updates).

Reviewed by Cursor Bugbot for commit 0ff58e6.
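The retry/backoff-with-jitter and `Retry-After` behaviour summarized above can be sketched as follows. `post_fn` is a hypothetical stand-in for the HTTP call — this is not the real `send_chat_completion`:

```python
import random
import time

RETRY_STATUS_CODES = {429, 500, 502, 503, 504}

def send_with_retries(post_fn, payload, max_attempts=6):
    """Retry transient failures with exponential backoff + jitter."""
    delay = 1.0
    for attempt in range(max_attempts):
        status, headers, body = post_fn(payload)
        if status == 200:
            return body
        if status not in RETRY_STATUS_CODES:
            raise RuntimeError(f"non-retryable status {status}")
        # Honour the server's Retry-After when present, else back off with jitter.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            time.sleep(float(retry_after))
        else:
            time.sleep(delay + random.uniform(0, 0.5))
        delay *= 2
    raise RuntimeError("exhausted retries")
```

The jitter term spreads concurrent retries apart, and deferring to `Retry-After` keeps the client cooperative with rate-limiting proxies.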