
harness: interactive proxy robustness + analytics fixes#26

Open
Th0rgal wants to merge 62 commits into main from harness/interactive-robustness-fixes

Conversation


@Th0rgal Th0rgal commented Apr 22, 2026

Summary

Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.

Changes

Pre-build + self-heal (biggest impact)

  • prebuild_task_modules() runs lake build once per task up front for all Benchmark.* modules needed by the case, so inspect_lean_goals / try_tactic_at_hole / run_lean_check never race an incomplete build.
  • New environment_error failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke lake build for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.

Tool feedback

  • write_editable_proof now returns immediate warnings for: placeholder detected, hidden Proofs import, unfilled ?_ hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
  • HOLE_PATTERN tightened to (?<!\w)\?_(?!\w) so identifiers like ?_foo don't falsely trigger.
  • HIDDEN_PROOF_IMPORT_PATTERN now covers import/open/export of Benchmark.Cases.*.Proofs paths.
  • Interactive-mode TOOLS.md is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.
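The tightened hole pattern can be checked in isolation (a minimal sketch using the regex quoted above; the test strings are illustrative):

```python
import re

# The tightened pattern from the PR: `?_` counts as a hole only when it
# is not glued to word characters on either side.
HOLE_PATTERN = re.compile(r"(?<!\w)\?_(?!\w)")

print(bool(HOLE_PATTERN.search("exact ?_")))     # a real hole -> True
print(bool(HOLE_PATTERN.search("apply ?_foo")))  # identifier -> False
```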

HTTP resilience

  • send_chat_completion rewritten with retry/backoff + jitter (up to 6 attempts), honours Retry-After, and falls back through extra_body.fallback_models on persistent failure.
  • Silent length retry: on finish_reason: length, bump max_completion_tokens (up to 12000, 3 silent bumps) before counting the attempt as a failure.

Defaults

  • agents/interactive.json: max_completion_tokens 2000 → 4096, max_tool_calls 24 → 40.

Analytics

  • build_run_analysis now has a fallback path for interactive attempts (which don't populate trace) that derives stable_digest from candidate_file_contents, so distinct_candidate_count / candidate_change_count are correct.
  • Light temperature schedule on repeated non-environment failure classes.

Schema

  • schemas/agent-run.schema.json: add prebuild_reports property.

Verification

Ran the 3 quick tasks on both gpt and builtin/smart: 6/6 pass; analytics report distinct=1, change_count=1 (previously 0).

| task | gpt tool_calls | smart tool_calls |
| --- | --- | --- |
| ethereum/deposit_contract_minimal/deposit_count | 2 | 4 |
| damn_vulnerable_defi/side_entrance/deposit_sets_sender_credit | 2 | 8 |
| damn_vulnerable_defi/side_entrance/deposit_sets_pool_balance | 2 | 2 |

Test plan

  • Syntax check (python -m py_compile) on all edited modules
  • 3 quick tasks × 2 models all pass with correct analytics
  • Full benchmark sweep with gpt (in flight)
  • Address any bugbot / codex findings on this PR

🤖 Generated with Claude Code


Note

Medium Risk
Medium risk because it refactors interactive execution flow (tool semantics, retry/backoff, temperature/token budgeting) and Lean check behavior, which can affect benchmark determinism and provider interactions.

Overview
Makes the interactive agent loop more robust and informative: write_editable_proof now runs Lean automatically and returns structured warnings/hints; tool prompts are synthesized in interactive mode to match the actual callable function tools and discourage nonexistent shell commands.

Hardens Lean verification by prebuilding task modules upfront (prebuild_task_modules + prebuild_reports in results), classifying/auto-healing environment_error cases (missing .olean), filtering noisy Lean warnings, capping details size, and improving hole/tactic handling (?_ matching, term-position tactic wrapping, cached run_lean_check, cached search_public_defs, richer try_tactic_at_hole failures).

Improves provider resilience and run analytics: chat completions gain retry/backoff with Retry-After, optional fallback model chain via extra_body.fallback_models, smarter length-stop token bumping (length_retry_token_cap), a temperature escalation schedule on repeated failure classes, and corrected candidate-change/distinct-candidate counting for interactive runs (plus schema/default updates).

Reviewed by Cursor Bugbot for commit 0ff58e6.

Major improvements to the interactive agent runtime to reduce spurious
failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools
  don't race the Lake build (eliminates olean-not-found flake).
- Add environment_error failure class with one-shot self-heal: when Lean
  can't find a dependency olean, auto-run lake build and retry before
  reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole,
  theorem_statement_mismatch, hidden_proof_import, etc.) so the model
  sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs
  instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back
  through extra_body.fallback_models, and silently grow
  max_completion_tokens (up to 12000) on length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 ->
  4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden hidden
  import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive
  attempts by falling back to stable_digest of candidate_file_contents
  when trace is missing.
- Add light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with
correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 503a0654fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread harness/interactive_runtime.py Outdated
Comment on lines +645 to +646
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)
P1 Badge Cache only successful lake builds

prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.
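The fix this comment asks for can be sketched as follows (names are illustrative; `run_build` stands in for invoking `lake build` and returns the process exit code):

```python
from typing import Callable

_LAKE_BUILD_CACHE: set[str] = set()

def build_once(module_name: str, run_build: Callable[[str], int]) -> bool:
    # Cache the module only AFTER a successful build (exit code 0), so a
    # transient failure can be retried later in the same process instead
    # of being reported as "cached" forever.
    if module_name in _LAKE_BUILD_CACHE:
        return True
    if run_build(module_name) == 0:
        _LAKE_BUILD_CACHE.add(module_name)
        return True
    return False
```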


Comment thread harness/default_agent.py Outdated
Comment on lines +1016 to +1020
if candidate_hash not in distinct_candidate_hashes:
distinct_candidate_hashes.add(candidate_hash)
if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
if candidate_text != previous_candidate:
candidate_change_count += 1
P2 Badge Count candidate changes on every transition

In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.
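The corrected counting might look like this (a sketch; the function name is hypothetical and `stable_digest` is modelled here as plain SHA-256):

```python
import hashlib

def candidate_stats(candidates: list[str]) -> tuple[int, int]:
    # Distinct count uses a digest set; change count increments on EVERY
    # transition, including reverts (A -> B -> A is two changes).
    digests: set[str] = set()
    change_count = 0
    previous: str | None = None
    for text in candidates:
        digests.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
        if previous is not None and text != previous:
            change_count += 1
        previous = text
    return len(digests), change_count
```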


- P1: only cache *successful* lake builds so failed builds can be
  retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked
  modules as attempted before invoking lake, which prevented the
  self-heal path from ever recovering from a transient build failure in
  batch/suite runs that share a process.

- P2: count every candidate transition in the interactive analytics
  fallback, including reverts (A -> B -> A now counts 2 changes, not 1).
  Previously the increment was gated by "hash not yet seen", which
  undercounted non-monotonic candidate sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/interactive_runtime.py Outdated
Bugbot (Medium): `_attempt_lake_build` returned False when the module
was already cached as built, so after `prebuild_task_modules` populated
the cache, every subsequent environment_error auto-heal attempt
short-circuited without invoking lake and without retrying the check —
while the model was told "The harness is attempting to rebuild it".

The auto-heal path only fires when the runtime has observed a missing
.olean at check time, which means the cached 'success' entry is stale.
Always invoke lake build in this path and refresh the cache with the
latest result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/default_agent.py
Comment thread harness/interactive_runtime.py
- Low: merge the two back-to-back `if config.mode == "interactive":`
  blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as
  `inspect_lean_goals`, so the two tools agree on what counts as a
  hole (previously the tightened lookbehind in HOLE_PATTERN was not
  mirrored in `try_tactic_at_hole`, which could let it replace a match
  that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 950d7e3145

Comment thread harness/default_agent.py
Comment on lines +1066 to +1069
try:
return max(0.0, float(value))
except ValueError:
return None
P2 Badge Support date-form Retry-After values

_parse_retry_after only accepts numeric delta-seconds, but Retry-After is also commonly sent as an HTTP-date on 429/503 responses. In that case this function returns None, so _backoff_delay uses a short exponential retry instead of the server-requested wait, which can repeatedly hit rate limits and prolong failures for overloaded providers. Please parse HTTP-date values as well before computing retry delays.


Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a
provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the
function returned None and `_backoff_delay` fell back to a short
exponential retry instead of honouring the server-requested wait. That
can repeatedly hit rate limits and prolong failures.

Now parse both forms; a date in the past clamps to 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 3ddda642d4

Comment thread harness/default_agent.py Outdated
Comment on lines +1095 to +1096
if retry_after is not None:
return min(retry_after, 60.0)
P2 Badge Respect full Retry-After delay from provider

_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.


Comment thread harness/interactive_runtime.py Outdated
Comment on lines +707 to +708
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),
P2 Badge Narrow environment-error matching to real infra failures

The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.


Comment thread harness/default_agent.py
Codex P2 #1: narrow `environment_error` classification to real infra
failures. Previously any missing .olean matched, but that string is
also produced when the model imports a non-existent module. In the
latter case we want the normal stagnation/temperature logic to kick in
so the model can correct itself. Now only classify as environment_error
when the missing module is under `Benchmark.*` (our dependency tree
which should have been pre-built) or when lean itself is missing. The
generic "failed to load environment" substring no longer
short-circuits.

Codex P2 #2: raise `_backoff_delay` Retry-After ceiling from 60s to
600s. Providers routinely request several-minute waits on 429/503; the
old 60s clamp caused retries to fire while the rate limit was still in
force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: df1004c256

Comment thread harness/interactive_runtime.py Outdated
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")
P1 Badge Handle the "of module" form in the missing-olean regex

Lean commonly reports missing artifacts as object file '...olean' of module <Name> does not exist, but _MISSING_OLEAN_RE only matches the shorter '...olean' does not exist form. Fresh evidence in this revision is that the new regex at this line requires does not exist immediately after the path, so _missing_olean_module returns None for the standard message shape; that prevents classify_failure from returning environment_error and skips the new auto-heal lake build retry path in run_lean_check.


Bugbot (Medium): the temperature schedule condition was checked at the
top of every loop iteration, so once two consecutive same-class
failures triggered it, temperature was bumped on every subsequent turn
-- including pure search/write turns -- until the 0.7 cap was reached.
Track the history length we have already acted on and only bump once
per new failure entry.
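The "act once per new entry" scheme described here can be sketched as (names and the +0.1 step are illustrative; the 0.7 cap and two-consecutive-same-class trigger come from the PR description):

```python
def maybe_bump_temperature(
    temperature: float, history: list[str], acted_on: int
) -> tuple[float, int]:
    # Bump at most once per new failure entry: `acted_on` remembers how
    # much of the history we have already reacted to, so pure
    # search/write turns no longer re-trigger the escalation.
    if len(history) > acted_on:
        if len(history) >= 2 and history[-1] == history[-2]:
            temperature = min(round(temperature + 0.1, 2), 0.7)
        acted_on = len(history)
    return temperature, acted_on
```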

Codex (P1): Lean reports missing artifacts in two forms depending on
context:

  object file '...olean' does not exist
  object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the
more common "of module" diagnostic `_missing_olean_module` returned
None, `classify_failure` did not return `environment_error`, and the
auto-heal retry path was skipped. Accept arbitrary text between the
path and the "does not exist" tail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 894a285f92

Comment thread harness/interactive_runtime.py Outdated
Comment on lines +753 to +755
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
return "environment_error"
P2 Badge Verify missing Benchmark module exists before env classification

Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.


Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar`
where no such module exists would still produce an "object file ...olean
does not exist" diagnostic, and under the previous check that matched
the `Benchmark.*` prefix it would be misclassified as infrastructure
failure -- skipping stagnation/temperature correction instead of letting
the model discover and fix the bogus import.

Require the corresponding `.lean` source file to actually exist in the
tree before treating a missing .olean as environment_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 42efe75153

Comment thread harness/default_agent.py Outdated
Comment on lines +1195 to +1197
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
break
P2 Badge Retry alternate models on all transient HTTP statuses

send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.


Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
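Routing both paths on one shared set, as the commit describes, is a one-liner (the status codes are those listed in the commit message; 0 as "transport error" follows the review comment above):

```python
# The transient set shared by the retry loop and the fallback gate.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback_model(status: int) -> bool:
    # Status 0 denotes a transport-level failure with no HTTP response;
    # everything else falls through only for transient statuses.
    return status == 0 or status in RETRY_STATUS_CODES
```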
Comment thread harness/interactive_runtime.py
Comment thread harness/default_agent.py Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f8d43bbfae

Comment thread harness/default_agent.py Outdated
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
token_budget = min(int(token_budget * 1.5), 12000)
P1 Badge Cap length-retry token bumps at model-safe limit

The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.


Comment thread harness/default_agent.py
Comment thread harness/default_agent.py Outdated
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap length-retry token bump at config.max_completion_tokens.
   The previous hard-coded cap of 12000 could exceed the provider's per-
   response output-token limit (e.g. models with a 4096 hard cap), turning
   a recoverable truncation into a non-transient HTTP 400 hard failure.

2. (bugbot Low) Also reset token_budget when recovering from a length
   streak, not only when consecutive_length_stops >= 5. Previously a
   1-4 stop streak followed by recovery left the elevated budget in
   place for the rest of the run.

3. (bugbot Low) Skip environment_error entries when appending to
   failure_class_history. They are infra noise that would break the
   sliding window same-class check (e.g. [type_error, env_error,
   type_error]) even though the filter at the trigger site rejects
   them -- consistent with the PR's stated intent that env_error be
   invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 07c3880b8d

Comment thread harness/default_agent.py Outdated
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)
P2 Badge Make length-stop retries raise max_tokens

execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.


Comment thread harness/default_agent.py Outdated
claude added 2 commits April 22, 2026 12:18
Python 3.12 surfaces socket read timeouts during SSL as bare
TimeoutError, which slipped past the urllib.error.URLError handler
and killed the task with a traceback. Catch it alongside URLError and
apply the same backoff retry policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous commit capped the silent length-retry bump at
config.max_completion_tokens, which -- as bugbot and codex both
pointed out -- turned the feature into a no-op because token_budget
starts at exactly that value.

Add a config.extra_body.length_retry_token_cap knob (stripped from
the provider payload in send_chat_completion). Default stays at
max_completion_tokens so models with a hard per-response cap don't
get surprised by HTTP 400; interactive.json opts into a 12000
ceiling, which is the original intended behavior for gpt-class
models that accept larger single-response budgets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
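The resulting bump rule can be sketched as (function name illustrative; the 1.5x factor, 3-bump limit, and cap semantics follow the commits above, where `cap` is `length_retry_token_cap`, defaulting to `max_completion_tokens`):

```python
def bump_token_budget(token_budget: int, consecutive_length_stops: int, cap: int) -> int:
    # Silent length-retry: grow the budget by 1.5x per length stop, up
    # to the configured cap, for at most 3 silent bumps. With the
    # default cap equal to max_completion_tokens this is a no-op, which
    # keeps hard-capped providers safe from HTTP 400.
    if 1 <= consecutive_length_stops <= 3:
        return min(int(token_budget * 1.5), cap)
    return token_budget
```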

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c5f118c37b

Comment thread harness/default_agent.py
Comment on lines +1193 to +1196
fallback_models = [
str(item)
for item in (config.extra_body.get("fallback_models") or [])
if isinstance(item, str) and item.strip()
P2 Badge Validate fallback_models type before iterating

fallback_models is read from untyped extra_body and iterated directly, so a common single-string config (for example "fallback_models": "gpt-4o-mini") is treated as characters and produces model ids like "g", "p", "t" during failover. In transient-outage scenarios this makes the fallback chain fail even though a valid fallback was configured. Please require a list/tuple of strings (or normalize a single string to a one-element list) before building models_to_try.


Comment thread harness/default_agent.py Outdated
Comment on lines +1794 to +1796
length_retry_token_cap = int(
config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)
P2 Badge Handle invalid length_retry_token_cap without crashing

Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.

Useful? React with 👍 / 👎.

Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated
  character-by-character, producing fake model ids like "g", "p", "t"
  during failover. Normalize a bare string to a one-element list first.

- length_retry_token_cap=null / "12k" / nested object would crash the
  int() cast before the first model turn, aborting the task. Fall back
  to max_completion_tokens silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
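Both defensive parses described above might be sketched as (helper names are illustrative):

```python
def normalize_fallback_models(value: object) -> list[str]:
    # A bare string is shorthand for a one-element list; anything that
    # is not a list/tuple of strings is ignored rather than iterated
    # character-by-character.
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, (list, tuple)):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]

def parse_token_cap(value: object, default: int) -> int:
    # Defensive int() parse for operator-edited extra_body values:
    # null / "12k" / nested objects fall back instead of crashing.
    try:
        return int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default
```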
Comment thread harness/default_agent.py Outdated
Comment thread harness/default_agent.py Outdated
claude added 4 commits April 22, 2026 20:56
`_LEAN_BLOCK_HEADER_RE` and `_FP_LINE_COL_RE` hardcoded the source
prefix to `CandidateCheck.lean:`, which only fires for the stub file
`evaluate_candidate` writes when `check_goals=False` (i.e. the
`run_lean_check` / `write_editable_proof` code paths). The
`inspect_lean_goals` path runs Lean against the actual editable file
— paths like `Benchmark/Generated/Foo/Bar.lean:` — and so fell
through both filters untouched.

Corpus analysis of 83 interactive runs found:
  - 32/88 (36%) of inspect_lean_goals outputs still carried full
    `linter.unusedSimpArgs` warning blocks (verified on the largest
    leaker: 9.8 KB of noise inside a 24.5 KB output — a 40% reduction
    with the fix applied).
  - Stagnation detection's error-text fingerprint left the file path
    in place for inspect_lean_goals outputs, so the same underlying
    error surfaced via two different Lean invocations produced two
    different fingerprints and escaped the no-progress loop detector.

Broadening both regexes to `\S+\.lean:LINE:COL:` lets a single
strip + fingerprint pipeline cover every tool that reports Lean
diagnostics, regardless of which source file Lean named.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
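A sketch of the broadened prefix matching described above, assuming the commit's `\S+\.lean:LINE:COL:` shape; the production regexes (`_LEAN_BLOCK_HEADER_RE`, `_FP_LINE_COL_RE`) live in harness/default_agent.py and may differ in detail.

```python
import re

# Matches any Lean diagnostic location prefix, not just CandidateCheck.lean:
LEAN_LOC_RE = re.compile(r"\S+\.lean:(\d+):(\d+):")


def fingerprint(line):
    """Drop the file/line/col prefix so the same underlying error reported
    against the stub file and against the editable file fingerprints
    identically for the stagnation detector."""
    return LEAN_LOC_RE.sub("<loc>:", line)
```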
The `placeholder_detected` write-warning told the agent to replace
sorry/admit/axiom "before run_lean_check". That instruction is stale
on two counts:

  * commit b2bd74f (2026-04-22 18:29) folded run_lean_check into
    write_editable_proof, so by the time this warning is returned the
    Lean check has already run. There is no separate "before" moment.
  * commit f5ceebe (2026-04-22 20:01) removed every prompt-side hint
    telling the agent to call run_lean_check after write_editable_proof,
    aligning the tool surface with the folded architecture.

The current wording therefore gestures at a workflow that no longer
exists, undercutting the "run_lean_check is not needed after
write_editable_proof" description already in the tool surface.

Corpus analysis (83 interactive runs, 372 write_editable_proof calls)
found this warning fired 13 times, always alongside a Lean verdict
for the same proof — so the agent was seeing a "do X before Y" hint
at the exact moment it was also reading Y's output, a self-contradiction
that adds no new signal. Rewording to "Lean rejects these — replace
with a real tactic or a `?_` hole." keeps the actionable advice while
dropping the stale sequencing language and pointing the agent at the
legitimate `?_` probe workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`_extract_theorem_signature` used the non-greedy pattern `.*?(?::=)`
to find the proof marker, but that stops at the first `:=` token,
which in 74 of 88 (84%) benchmark task files lives inside a `let
x := ...` binding within the statement. The result is a truncated
signature that makes the theorem_statement_mismatch check a
false-negative for any edit past that let-binding.

All 88 task files use `:= by` as the actual proof start, so
anchoring on `:=\s*by\b` cleanly captures the full signature
without regressing the 14 let-free cases.
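The anchoring change can be sketched as below. `extract_signature` is a hypothetical stand-in for `_extract_theorem_signature`; note (as the reviewers point out further down) that a term-mode body such as `:= trivial` now yields `None`.

```python
import re


def extract_signature(source, short_name):
    """Capture everything between the theorem name and `:= by`, so a
    `let x := ...` binding inside the statement no longer truncates
    the signature at the first `:=`."""
    pattern = re.compile(
        rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
        re.DOTALL,
    )
    m = pattern.search(source)
    return m.group("signature").strip() if m else None
```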
…ublic_defs

Corpus analysis of 83 interactive runs found 20/29 failed tasks (69%)
hit "unknown identifier 'simp'" / '<tactic>' errors at least once. The
corpus-wide count of unknown-identifier incidents is dominated by
tactic names — 17 `simp`, 10 `simpa`, 6 `omega`, 4 `exact`,
4 `native_decide`, 3 `intro`, 3 `simp_all`, 2 `by_cases` — i.e.
~60% of cases where Lean emits `unknown identifier` are tactic-in-term-
position mistakes (`exact simp [...]`, `:= by_cases h`, etc.), not
missing definitions.

The existing `unknown_identifier` hint (both the per-failure hint
and the stagnation-escalation hint) sent the model to
`search_public_defs` to look for the name, which cannot help:
these are language keywords, not definitions. The model would burn
tool calls searching and loop.

Detect the tactic-name case and emit a position-specific hint instead
("wrap with `by`, or drop the `exact`/`refine` prefix"), and
skip the misleading "Use search_public_defs" hint in that branch.
Non-tactic identifiers (missing lemmas, typos like `Nat.div_mul_le`)
still get the original advice.
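The routing described above might look like this. The tactic set is a subset drawn from the corpus counts in the commit message; the real list and hint wording in the harness may differ.

```python
TACTIC_NAMES = {
    "simp", "simpa", "omega", "exact", "native_decide",
    "intro", "simp_all", "by_cases", "decide",
}


def unknown_identifier_hint(name):
    """Position-specific hint for tactic-in-term-position mistakes;
    everything else keeps the original search advice."""
    if name in TACTIC_NAMES:
        return (f"`{name}` is a tactic, not a definition — wrap the term in "
                "`by`, or drop the `exact`/`refine` prefix.")
    return "Use search_public_defs to look for the definition."
```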

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19bb7323c0


Comment on lines +811 to 812
rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",
re.DOTALL,

P2 Badge Accept theorem bodies not written with := by

_extract_theorem_signature now only matches declarations where the theorem body starts with := by, so a candidate that keeps the exact same theorem statement but uses a valid term-style body (for example := <term>) produces None and is treated as a theorem_statement_mismatch in preflight_candidate. That turns semantically valid proofs into hard preflight failures and blocks the interactive loop from evaluating them with Lean.


claude added 5 commits April 22, 2026 21:35
…ic_defs

Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/
classified unknown-identifier incidents across failed tasks:
  51 tactic-in-term-position (addressed by 19bb732)
  34 "other" — dominated by camelCase locals: prevOwner (17), owner (6), ...
  12 snake_case lemma guesses
  12 qualified-name misses
   4 var_like already matching a narrower pattern

38 of these (the 34 "other" + 4 var_like) are local-variable-shaped names
(camelCase, no dot, no underscore, leading lowercase) affecting 6/29
failed tasks. The previous hint sent the agent to search_public_defs,
which cannot resolve local binders and produced budget-wasting loops.

New branch detects var-like names and instead points at inspect_lean_goals
plus re-reading the theorem signature for the real parameter names,
explicitly warning that search_public_defs is wrong for this shape.

Tactic-in-term-position path is preserved (checked first), and the
fallback "Check imports" hint still fires for snake/qualified misses
that match neither category.
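The "local-variable-shaped" check can be sketched as below; the exact pattern in the harness is an assumption here (leading lowercase, camelCase allowed, no dot or underscore). Since tactic names like `simp` also match this shape, the tactic branch must run first, as the commit notes.

```python
import re

# camelCase or plain lowercase, no namespace dot, no snake_case underscore
VAR_LIKE_RE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def is_var_like(name):
    """True for names shaped like local binders (prevOwner, owner, hKey)."""
    return bool(VAR_LIKE_RE.match(name))
```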
…haped

Corpus analysis of the 29 failed runs under
results/agent_runs/custom/interactive-proxy/ shows 5 tasks (17%)
stagnating on guesses that only exist in Mathlib:

  ethereum.../full_deposit_preserves_partial_gap: sub_eq_sub_right,
                                                  add_sub_add_right_eq_sub
  openzeppelin.../preview_deposit_rounds_down:    Nat.cast_mk, Nat.div_def,
                                                  Nat.div_mul_le, Nat.le_div_mul
  paladin.../weth_claimed_plus_allocated:         add_assoc, add_left_comm,
                                                  sub_eq_add_neg, add_comm
  safe.../in_list_reachable:                      not_eq
  zama.../transfer_conservation:                  lt_of_add_lt_add_right,
                                                  Nat.not_ge.mp

This workspace has no Mathlib dependency — only core Lean 4, Batteries,
and the task's own Benchmark.* modules. The prior fallback hint
("Check imports. Standard names: Nat.lt_of_not_ge, Nat.not_le_of_lt")
was misleading because the agent kept searching for imports that cannot
be added. Worse, it cited a `Nat.*` name as a "standard" example while
the agent was already being burned by the Nat.* namespace.

Detect Mathlib-shape names (arithmetic-prefix `add_*`/`sub_*`/`le_*`/...,
common exact names like `add_assoc`/`add_comm`, and any `Nat.*`-qualified
guess) and emit a hint that: (a) names the absent dependency explicitly,
(b) redirects to `omega`/`ring`/`simp arith` tactics, (c) reminds the
agent that search_public_defs takes a KEYWORD, not a guessed lemma name.

Precedence order preserved: tactic-in-term (19bb732) ranks highest,
then var_like (b24ef3f), then decide_True (legacy), then Mathlib,
then generic search_public_defs. The misleading "Check imports..."
trailer is suppressed when the Mathlib branch fires.
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
6 of 16 `type_mismatch` incidents (38% of this class), spanning 5 of 29
failed tasks (17%), hit the same recurring shape:

  has type        ¬x = 0 : Prop
  but is expected ¬x.val = 0 : Prop

The `.val` projection on Uint256 / Address / Nat is missing on one side.
Affected failed tasks:
  openzeppelin.../preview_deposit_rounds_down
  openzeppelin.../positive_deposit_mints_positive_shares_under_rate_bound
  lido.../ceildiv_sandwich
  safe.../remove_owner_in_list_reachable
  safe.../add_owner_is_owner_correctness

The agent repeatedly tried `exact h` against the asymmetric goal instead
of letting simp bridge the coercion. The prior hint
("Unfold definitions to align types") did not name the structural issue,
so the agent cycled on `exact`-family tactics until the tool budget ran
out.

New hint fires only when the "has type … but is expected to have type …"
substrings differ by `.val` on exactly one side, and points the agent at:
  - `by simpa using h` / `by simp_all` (simp bridges the projection)
  - `by omega` once `.val` is exposed on both sides for arithmetic goals
  - the underlying injectivity lemma via search_public_defs for negated-
    equality mismatches

Precedence is preserved: the existing `decide` branch runs first (when
both match), and the generic "Unfold definitions" trailer is still
appended as a fallback.
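A sketch of the one-sided `.val` detector, assuming the two halves of Lean's "has type … but is expected to have type …" pair have already been extracted; the production parsing is more involved.

```python
import re


def val_asymmetry(has_type, expected_type):
    """True when the two sides of a type mismatch differ only by a
    `.val` projection on one side (e.g. ¬x = 0 vs ¬x.val = 0)."""
    def strip_val(s):
        return re.sub(r"\.val\b", "", s)

    return has_type != expected_type and strip_val(has_type) == strip_val(expected_type)
```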
Commits b24ef3f and 31ff5e8 taught `_build_check_hints` to distinguish
`unknown identifier '<var>'` (local binder) and `unknown identifier
'<mathlib-name>'` (unavailable dependency) from generic definition
misses. However `_build_escalation_hint` — which fires once the agent
has repeated the same failure_class multiple times — still emitted only
"Stop guessing identifier names. Use the search_public_defs tool" for
every non-tactic case.

That meant an agent stagnating on `prevOwner` / `add_comm` / `Nat.div_mul_le`
kept being told to search_public_defs even though:
  * local binders will never be found by search_public_defs, and
  * Mathlib lemmas do not exist in this workspace at all.

Corpus evidence (same 83-run dataset):
  6/29 failed tasks surface a var-like unknown identifier at escalation time
  5/29 failed tasks surface a Mathlib-shaped unknown identifier at escalation

Route the escalation hint the same way as the check hint:
  tactic   -> wrap in `by` (existing)
  var_like -> inspect_lean_goals to see real binders
  mathlib  -> omega / ring / simp arith; search_public_defs takes a keyword
  else     -> original generic "stop guessing" fallback

No behavior change when the details string is empty (falls through to
the generic fallback as before).
Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/:
21 parse-error events across 14 of 29 failed tasks (48%) surface one of
  - "unexpected token '…'"         (11 incidents; `using`/`at`/`:`/`;`/`|`/…)
  - "unexpected identifier"         (5 incidents)
  - "expected '{' or indented tactic sequence"  (5 incidents)

Most of these incidents (19/21) coexist with a classifiable semantic error
and so get routed to `unknown_identifier`, `unsolved_goals`, etc. But 2
failed-task incidents collapse to `failure_class=other` because the parse
error is the ONLY signal the checker emits:
  lido.../locked_funds_solvency    unexpected token ';'
  zama.../burn_decreases_supply    unexpected token '|' (first failure
                                    turn — lost the whole budget starting
                                    from a malformed signature)

Extend `classify_failure` with a `parse_error` bucket and a dedicated hint
that: (a) names the likely causes (tactic-in-term, missing `by`, stray
`;`/`|`/`using`, bullet-indentation mismatch), (b) tells the agent to
re-read the editable file via read_public_file to see character positions,
(c) advises a clean `:= by <tactics>` rewrite over token-by-token patching.

Placement: `parse_error` comes LAST in classify_failure before "other", so
it never steals classification from more specific errors. The existing
cross-cutting "Syntax error. Ensure the theorem body uses `:= by`…"
catch-all still fires as a secondary hint.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 0c6c7adf96

Comment thread harness/default_agent.py
Comment on lines +2011 to 2012
if tool_name in ("run_lean_check", "write_editable_proof") and result.get("failure_mode") == "lean_check_failed":
saw_lean_failure = True

P2 Badge Track preflight tool failures in temperature history

The temperature schedule only records tool failures when failure_mode == "lean_check_failed", but write_editable_proof now commonly returns preflight failures (for example placeholder_detected or hidden_case_import_detected) with different failure modes. Those repeated failures never get added to failure_class_history, so the repeated-class bump at the top of execute_interactive_agent_task never activates and interactive runs can stay stuck at deterministic temperature until attempt exhaustion.


claude added 8 commits April 22, 2026 21:52
…dentifier

The interactive-mode user prompt previously instructed:

  "For unknown_identifier errors: use search_public_defs to find correct
   names."

Corpus analysis of 83 runs under results/agent_runs/custom/interactive-proxy/
shows that this blanket advice misleads the agent on 24 of 29 failed tasks
(83%). Those tasks hit at least one `unknown identifier '…'` where
search_public_defs cannot possibly help because the missing name is one of:

  - a Lean TACTIC written in term position (simp, simpa, omega,
    native_decide, by_cases, …) — search_public_defs searches
    *definitions*, not syntactic keywords. Fix is `by <tactic>`.
  - a LOCAL BINDER (prevOwner, owner, hKey, …) that is simply
    out-of-scope at the use site — binders are not public defs.
  - a MATHLIB LEMMA (add_comm, add_sub_*, Nat.div_mul_*, …). Mathlib
    is not a dependency in this workspace.

The downstream repair_hints (commits 19bb732, b24ef3f, 31ff5e8) already
route each shape correctly, but the initial instruction primes the agent
to call search_public_defs first and overwrites any subsequent hint.

Rewrite the bullet to:
  - tell the agent to read repair_hints before searching,
  - enumerate the three "wrong-shape" cases and their remedies, and
  - reserve search_public_defs for genuine project-defined names.

Passed runs hit this scenario only 6/54 times (11%), so the clearer
instruction should not distract correct agents while sharply reducing
the misled population in failed runs.
Lean phrases `cases h` on a non-inductive major premise (implication
`A → B`, function type, unreduced equality, Bool-valued `==`) as
`tactic 'cases' failed, major premise type is not an inductive type`
— distinct from the `constructor failed, not an inductive datatype`
shape already handled by `constructor_failed`.

Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/:
the pattern occurs 22 times in one failed task
(safe__owner_manager_reach__setup_owners_acyclicity.json), where the
agent repeatedly retried `cases h` on an `hkey = SENTINEL → False`
implication because the generic \"other\" bucket gave no actionable
hint — a 22-cycle stagnation loop with no escape.

Split into its own `cases_failed` class covering both `cases` and
`induction` variants, with a hint that names the four actual remedies:
apply the implication with an argument, `by_cases` for decidable Props,
`absurd`/`.elim` for contradiction, or `Bool.ne_iff`/`beq_iff_eq` to
rewrite Bool equalities before case-splitting.
When Lean reports unsolved goals with a `case <label>` marker, the
agent has ALREADY case-split successfully and only one branch is open.
The previous hint set unconditionally appended "Try restructuring:
`by_cases h : condition …`", which told the agent to undo its own
working split and start over.

Corpus analysis of 83 runs at results/agent_runs/custom/interactive-proxy/:
59 of 127 unsolved_goals incidents (46%) across 22 tasks carry a
`case <label>` marker — `case pos`, `case neg`, `case left`,
`case right`, etc. 6 of the 8 unsolved_goals final failures had one.

Now: when a case label is present, (a) name the open branch(es) back
to the agent so it stops re-inspecting, (b) point at the
branch-specific hypotheses that simp_all / omega / absurd can close
the branch with, and (c) SUPPRESS the "restructure with by_cases"
hint so the agent doesn't undo its progress. The fresh-split hint
still fires when no case label is present.
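The case-label scan described above can be sketched as follows, assuming labels appear at line starts in the unsolved-goals output (`case pos`, `case neg`, …):

```python
import re

CASE_LABEL_RE = re.compile(r"^case\s+(\w+)", re.MULTILINE)


def open_branches(lean_output):
    """Names of still-open branches after a successful case split;
    empty means no split happened and the fresh-split hint may fire."""
    return CASE_LABEL_RE.findall(lean_output)
```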
The escalation hint for unsolved_goals / simp_no_progress / rfl_failed /
unfold_failed was unconditionally advising the stuck agent to "Start
with `unfold <spec_name>` … then `by_cases` on each conditional
branch". When the agent is stuck with a `case <label>` already in the
goal output, that template tells it to TEAR DOWN a working case-split
and start over — pure regression.

Mirror the case-label detection from _build_check_hints (commit 77cb481)
into _build_escalation_hint. When `case <label>` markers are present,
escalate with branch-closing advice instead of a fresh-split template:
(1) inspect_lean_goals at a `?_` in the branch to read `h✝`, (2)
`simp_all` with the contract-simp-set to discharge contradictory
hypotheses, (3) `exact absurd` when hypotheses literally disagree,
(4) `omega` after `simp only` exposes `.val` forms.

Corpus evidence from results/agent_runs/custom/interactive-proxy/: 59
of 127 unsolved_goals incidents (46%) across 22 tasks carry a case
label; 6 of 8 unsolved_goals final-failures did. Those are exactly the
runs most likely to hit the stagnation threshold and trigger this
escalation.
The public impl/spec file set is static for a session, so the same
(query, limit) pair always returns the same matches. Cache the first
response and short-circuit repeat calls with a `cached: true` marker
plus a note telling the agent to try a different query or a different
tool.

Corpus evidence from results/agent_runs/custom/interactive-proxy/:
- Failed runs averaged 41.9 search_public_defs calls vs 1.5 on passing
  runs (28x more).
- Of extracted queries across failed runs, 94% were byte-identical
  re-queries: e.g. `"removeOwner_ownerListInvariant"` submitted 26
  times in one run, `"isChain"` 25 times, `"ceilDiv"` 24 times,
  `"acyclic"` 25 times. The agent was stuck in a tight re-query loop
  with no signal that the library wouldn't suddenly start matching.

Mirrors the existing run_lean_check cache pattern (see _last_eval_cache
in __init__) — same mechanism: deepcopy the first result, stamp it
with `cached: true` + a note on the next identical call, and stay
behaviourally equivalent on distinct queries.
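The memoization pattern the commit describes, sketched with a stand-in `search_impl`; the real implementation hangs the cache off the agent session next to `_last_eval_cache`.

```python
import copy


class SearchCache:
    """Memoize (query, limit) pairs; repeat calls get a deep copy
    stamped with `cached: true` plus a redirect note."""

    def __init__(self, search_impl):
        self._search = search_impl
        self._cache = {}

    def search(self, query, limit):
        key = (query, limit)
        if key in self._cache:
            result = copy.deepcopy(self._cache[key])
            result["cached"] = True
            result["note"] = ("Identical query already answered — try a "
                              "different query or a different tool.")
            return result
        result = self._search(query, limit)
        self._cache[key] = copy.deepcopy(result)
        return result
```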
Corpus analysis of 29 failed interactive runs found 38 `omega_failed`
incidents carrying 96 opaque-coercion occurrences in their
counterexample `where:` sections (mul: 45, add: 34, sub: 17). When the
goal mentions `↑(mul a b).val`, `↑(add a b).val`, or `↑(sub a b).val`,
omega treats the whole thing as an opaque Nat and cannot see the
underlying linear arithmetic — even though the operation reduces to
`a.val * b.val` / `a.val + b.val` / `a.val - b.val` whenever the
spec's no-overflow hypothesis holds.

The canonical conversion lemmas already exist in Verity:

  Uint256.mul_eq_of_lt (h : a.val * b.val < modulus) :
    (a * b).val = a.val * b.val
  Uint256.add_eq_of_lt (h : a.val + b.val < modulus) :
    (a + b).val = a.val + b.val
  Uint256.sub_eq_of_le (h : b.val ≤ a.val) :
    (a - b).val = a.val - b.val

…but ZERO proofs in the 83-run corpus — failed OR passed — used them.
Agents searched for related terms ("val_mul", "Uint256 mul add sub ge
theorem lemma val", "div_mul_le") and never landed on the right names,
so the existing generic "introduce helper lemmas that bound the
product" nonlinear hint did not close the loop.

Detect `↑(mul|add|sub …)` in the omega_failed details and emit a
targeted hint that names the specific lemma(s) applicable to the ops
present, spells out the hypothesis shape (`a.val + b.val < modulus`,
typically the spec's `hNoOverflow`), and shows the `rw` call that
turns the goal into a plain Nat arithmetic fact omega can close.

Coverage: 11 / 17 current-classifier omega_failed incidents across 5
distinct failed tasks (lido vaulthub_locked ×3, openzeppelin
erc4626_virtual_offset_deposit ×2) now receive the targeted hint
instead of the generic "out of omega's reach" advice.

Strictly additive: the pre-existing division/modulus and variable-
product nonlinear hints still fire when their own patterns match.
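The detection side can be sketched as below. The lemma names are the Verity conversion lemmas quoted in the commit; the `↑(op …)` regex is an assumed approximation of the production pattern.

```python
import re

# Opaque coercion around an unreduced Uint256 op in the goal text
OPAQUE_OP_RE = re.compile(r"↑\((mul|add|sub)\b")

LEMMA_FOR_OP = {
    "mul": "Uint256.mul_eq_of_lt",
    "add": "Uint256.add_eq_of_lt",
    "sub": "Uint256.sub_eq_of_le",
}


def coercion_lemma_hints(details):
    """Conversion lemmas applicable to the ops present in the details."""
    ops = sorted(set(OPAQUE_OP_RE.findall(details)))
    return [LEMMA_FOR_OP[op] for op in ops]
```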
…gue 'reduce to equalities'

Corpus analysis of 29 failed interactive runs found 3 distinct tasks
(damn_vulnerable_defi side_entrance exploit_trace_drains_pool, kleros
sortition_trees root_minus_left_equals_right_subtree, safe
owner_manager_reach add_owner_owner_list_invariant) surfacing
`expected type must not contain free variables` with 19 total
occurrences across attempts. Lean's own error text tells the user
"Use the '+revert' option to automatically cleanup and revert free
variables" — yet the prior hint ("Reduce to concrete equalities
before decide/native_decide") never mentioned `revert` at all and
pointed agents away from the exact remedy Lean recommends.

The trigger across all three tasks is the same shape: `decide` /
`native_decide` / `cases <x>` / `induction <x>` run on a goal that
still mentions local hypotheses (`hLow`, `hHigh`, `nodeIndex`,
`hBorrow`) or pattern-bound names (`val✝`, `isLt✝`) that the tactic
can't close over. Replace the weak hint with a targeted two-option
remedy: (a) `revert` every free name in the goal (plus the Lean 4
`decide +revert` / `native_decide +revert` shortcut Lean itself
suggests) and (b) swap `decide`/`native_decide` for `omega` /
`simp_all` / `exact`, which consult the local context directly.
Corpus analysis of 83 agent runs at results/agent_runs/custom/
interactive-proxy (29 failed, 54 passed) found 11 failed runs (38% of
failures) ending with an unsolved_goals error whose goal still carried
the unfolded monadic trace — `ContractResult.success`/`.revert`,
`Contract.run`, or a wrapper like `Core.Address.ofNat ((match ...))`
around a nested match over `getMappingAddr`/storage reads. Cross-family:
safe/owner_manager_reach (6), zama/erc7984 (2), paladin_votes (1),
kleros/sortition_trees (1). Zero of 54 passed runs trip the same marker,
so the signal is a clean failure predictor.

In every case the agent kept piling more helpers (`ContractResult.
success`, `.revert`, `.snd`, `Contract.run`, …) onto its `simp` list
without closing the goal, because the leftover `if <cond>` arms in the
trace test PROPOSITIONAL equality while the spec's hypotheses are in
BEq form (`(x != zeroAddress) = true`). The pre-existing generic
if/match hint never mentioned bridging BEq→Prop nor `split_ifs` on the
unreduced arms, so agents escalated with "add one more definition to
simp" forever.

Add a trace-specific hint that (a) tells the agent to stop adding to
the simp list, (b) proposes `split_ifs`/`split` to force case analysis
on each remaining if-branch, and (c) shows the one-line
`have hNZ : x ≠ zeroAddress := by simpa using hBEq` preconversion so
that a subsequent `simp_all` can close the whole trace in one step.
The hint fires alongside the existing case-label hint when both are
present (the case-label direction remains "don't re-split"; this hint
adds "here is how to close the remaining trace inside that branch").

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 6 total unresolved issues (including 5 from previous reviews).

Reviewed by Cursor Bugbot for commit 1a5b969.

  short_name = self.paths.theorem_name.rsplit(".", 1)[-1]
  pattern = re.compile(
-     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?)(?::=)",
+     rf"theorem\s+{re.escape(short_name)}\b(?P<signature>.*?):=\s*by\b",

Signature check falsely triggers on term-mode proofs

Medium Severity

_extract_theorem_signature now requires := by instead of just :=. If the original editable file uses := by (common), expected_theorem_signature is set. But if the model submits a valid term-mode proof (:= fun ... => or := Eq.mpr ...), the extraction returns None, and None != expected_signature fires a false theorem_statement_mismatch warning — misleading the model into thinking it altered the theorem statement when only the proof style changed.

Additional Locations (1)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 1a5b9691e4

Comment thread harness/default_agent.py
Comment on lines +1218 to +1221
fallback_models = [
    str(item)
    for item in raw_fallback
    if isinstance(item, str) and item.strip()

P2 Badge Trim fallback model ids before sending requests

The fallback-model normalization keeps the original string even when it only passed the item.strip() guard, so leading/trailing whitespace is preserved in outbound model IDs. A config such as "fallback_models": [" gpt-4o-mini"] will be treated as a distinct (invalid) model name and fail failover even though the intended model is available. Store item.strip() in fallback_models to avoid avoidable fallback failures.


Comment thread harness/default_agent.py
Comment on lines +1238 to +1239
if exc.status not in RETRY_STATUS_CODES and exc.status != 0:
    break

P2 Badge Continue fallback chain after non-transient model errors

The fallback loop stops on the first non-transient status, which means one bad fallback entry (for example, a 404 model-not-found on the first fallback) prevents trying any later configured models. This makes the fallback chain brittle in real runs where operators often keep multiple backups. Instead of breaking immediately on every non-transient error, continue to subsequent fallback models for model-specific failures.


claude added 9 commits April 22, 2026 23:10
Corpus analysis of 29 failed interactive runs:

- 3 final-eval failures ended at `synthesis_failed` with a raw
  `(X).run s).snd` monadic trace still in the goal at the hole
  (`safe/setupOwners_ownerListInvariant`,
   `safe/removeOwner_isOwnerCorrectness`,
   `zama/transfer_sufficient`). The agents had written `exact ?_`
  without ever unfolding `Contract.run`, so the existing generic
  synthesis hint — which says "use `?_` with inspect_lean_goals" —
  just loops them back to the same un-reducible shape.

- 1 additional final-eval failure
  (`safe/swap_owner_ownerListInvariant`) classifies as
  `unsolved_goals` and carries the same `.run X).snd` shape, but
  the existing unsolved_goals monadic-trace detection only matches
  the literal `Contract.run` / `ContractResult.*` strings, so it
  missed this case.

Both branches now detect the raw-goal pattern via a regex over
goal lines. False-positive rate is clean: 0 of 54 passed runs
match on final details, 0 of 28 passed-run intermediate check
outputs match the new `synthesis_failed + goal-trace` signal
(vs 15 hits across failed runs). The new hint tells the agent
the hole is unreachable until the contract function is unfolded
and gives the concrete `simp [X, Contract.run, Verity.bind, ...]`
incantation, rather than recommending another round of
`inspect_lean_goals` which would reveal the same un-reduced goal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus analysis of 83 interactive-proxy runs:

- try_tactic_at_hole succeeded 0 of 76 times across the entire corpus
  (passed runs: 0/6 successes on 4 runs that used it; failed runs:
  0/70 successes across 25 runs).
- Passed runs never exceeded a 2-streak of failures — they called the
  tool 1-2 times, got nothing, and moved on.
- Failed runs hit a ≥3-streak in 14 of 29 tasks (48%), with 4 runs
  stacking 5-7 consecutive speculative tactics on the same hole.

The tool keeps re-running Lean each call and returns the usual
class-based repair hints, but there is no signal telling the agent
that the tool itself is the wrong instrument for the remaining
work. Adding a per-session counter and injecting a pivot hint at
streak ≥ 3 costs one extra `int` of state and zero extra Lean
invocations. The hint explicitly names the shapes of goal that
single-tactic closure cannot discharge (BEq↔Prop bridging, residual
if/match case analysis, monadic-trace unfolding, multi-step
arithmetic) and steers the model to write_editable_proof +
inspect_lean_goals instead.

Threshold of 3 is chosen to have zero false positives on passed
runs (max observed streak = 2).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
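The "one extra `int` of state" mechanism reads roughly as below; the hint text is paraphrased from the commit, and the threshold of 3 matches the corpus analysis (max observed passing-run streak = 2).

```python
PIVOT_THRESHOLD = 3
PIVOT_HINT = ("try_tactic_at_hole keeps failing on this hole — the remaining "
              "work likely needs multiple steps. Switch to write_editable_proof "
              "plus inspect_lean_goals.")


class TacticStreak:
    """Per-session counter of consecutive try_tactic_at_hole failures."""

    def __init__(self):
        self.streak = 0

    def record(self, succeeded):
        """Return a pivot hint once the failure streak reaches the threshold."""
        self.streak = 0 if succeeded else self.streak + 1
        if self.streak >= PIVOT_THRESHOLD:
            return PIVOT_HINT
        return None
```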
The agent occasionally writes `(private) lemma foo ... := by ...` for
helper lemmas, which is valid Mathlib syntax but not Lean 4 core. Without
Mathlib, Lean reports `unexpected identifier; expected 'abbrev', ..., or
'theorem'` at the `lemma` token — a one-token rename to `theorem` is
all that's needed. Corpus analysis of 83 interactive runs: 3 of 29 failed
tasks (10%) wrote `lemma` helpers at some point and never recovered,
because the generic parse-error hint talks about tactic-in-term-position,
missing `by`, stray tokens, and branch indentation — none of which point
at keyword choice. Detect the exact "expected 'abbrev' ... 'theorem'"
keyword-list fingerprint (unique to the top-level-command parse shape)
and surface a targeted rename hint before the generic one.
…lasses

Corpus analysis of 83 interactive-proxy run logs shows the Uint256/Address
.val coercion asymmetry pattern appears in 14 of 29 failed tasks (48%), but
the existing hint only fired when failure_class == "type_mismatch". In 12
of those 14 cases, Lean reports a cascading unsolved_goals or
unknown_identifier from the same simp call before the mismatch is surfaced,
so the hint was suppressed.

Lift the regex-based asymmetry detector out of the per-class branch and
into the cross-class pattern section so it fires whenever the
"has type ... but is expected to have type ..." pair disagrees on a .val
projection, regardless of which error Lean listed first. Dedup against
existing hints via substring check.
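The lifted detector can be sketched as below; the regex shape and hint text are assumptions about Lean's "has type ... but is expected to have type ..." diagnostics, not the harness's exact code.

```python
import re

# Cross-class .val asymmetry detector: fires whenever exactly one side of
# the type pair carries a .val projection, regardless of failure class.
_TYPE_PAIR_RE = re.compile(
    r"has type\s+(?P<got>.+?)\s+but is expected to have type\s+(?P<want>.+?)(?:\n\n|$)",
    re.DOTALL,
)

def val_asymmetry_hint(details: str, hints: list):
    m = _TYPE_PAIR_RE.search(details)
    if not m:
        return
    got_has_val = ".val" in m.group("got")
    want_has_val = ".val" in m.group("want")
    if got_has_val == want_has_val:
        return  # symmetric: not the coercion-asymmetry shape
    hint = ("The two sides disagree on a `.val` projection "
            "(Uint256/Address coercion asymmetry). Rewrite with the "
            "relevant `.val` lemmas or `show` the goal with the "
            "projection made explicit.")
    # Dedup against existing hints via substring check.
    if not any(hint in h or h in hint for h in hints):
        hints.append(hint)
```

Because it scans `details` directly, it still fires when a cascading unsolved_goals or unknown_identifier error is listed first.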

Corpus analysis of 83 interactive-proxy runs: 12 of 29 failed tasks (41%)
end with `exact ?_` still in the submitted proof. Every one of those runs
re-submitted a `?_`-containing proof 2–9 times after the first rejection.

The "UNFILLED HOLE IN SUBMITTED PROOF" warning already existed, but was
inserted into `hints` BEFORE `_filter_seen_hints`, so dedup suppressed it
on the 2nd–Nth resubmission. The agent then got no feedback tying its
specific, detectable mistake (still-unfilled hole in the exact proof text
just submitted) to the failure class, and kept re-submitting `?_`.

Move the hole-warning insertion AFTER the dedup filter so it fires on
every submission that still contains `?_`. The warning is keyed to the
concrete proof-text state, not to the abstract hint corpus, so it is
correctly exempt from dedup — it reports something specifically about the
submission the agent just sent, not generic class advice.
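The ordering fix can be illustrated with this minimal sketch, where `build_hints` and the `seen` set stand in for the harness's real `_filter_seen_hints` pipeline; only `HOLE_PATTERN` matches the tightened pattern described in the PR summary.

```python
import re

# (?<!\w)\?_(?!\w): matches a bare ?_ hole but not identifiers like ?_foo.
HOLE_PATTERN = re.compile(r"(?<!\w)\?_(?!\w)")

def build_hints(proof_text: str, class_hints: list, seen: set) -> list:
    # Generic class-based hints go through dedup first ...
    hints = [h for h in class_hints if h not in seen]
    seen.update(hints)
    # ... and the state-conditional hole warning is inserted AFTER the
    # filter, so it repeats on every submission that still contains ?_.
    if HOLE_PATTERN.search(proof_text):
        hints.insert(0, "UNFILLED HOLE IN SUBMITTED PROOF: `?_` is still present.")
    return hints
```

On a second submission with the same hole, the generic hint is suppressed by `seen` but the warning fires again, which is exactly the repeat behavior the fix requires.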
Corpus of 29 failed interactive runs: 7 terminate on
`don't know how to synthesize placeholder`. The existing conditional
hint inside `synthesis_failed` only fires when the goal still contains
a `(X).run _).snd` monadic trace (~3 of 7). The remaining ~4 — goals
with `s.storage` arithmetic, conditional branches, or list-predicate
witnesses (ethereum/full_deposit_preserves_partial_gap,
lido/shares_conversion_monotone, safe/setup_owners_acyclicity,
safe/setup_owners_owner_list_invariant) — only received the generic
"try show <goal type>" hint, which the agent had already tried. The
agent then loops on inspect_lean_goals until the tool budget is gone.

Add an `else` branch that emits shape-aware guidance: `omega` for
arithmetic goals, `split_ifs` for conditionals, explicit list literals
for list invariants, and explicit witnesses for And/Exists. Both
branches remain mutually exclusive so passing-run cases are unaffected.

Verified via direct calls: non-monadic synthesize-placeholder text
now produces the new hint; monadic-trace text still produces the
original run-snd hint only (not both).
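A sketch of the two mutually exclusive branches follows; the goal-shape probes are heuristic assumptions standing in for the harness's real classification, and the hint wording is illustrative.

```python
def synthesis_failed_hint(goal: str) -> str:
    if ").run" in goal and ".snd" in goal:
        # Existing branch: the monadic trace is still present.
        return ("Reduce the `(... .run _).snd` trace with simp "
                "before filling the hole.")
    # New else branch: shape-aware guidance for non-monadic goals.
    tips = []
    if any(op in goal for op in ("+", "-", "*", "<=", "<", ">=", ">")):
        tips.append("arithmetic goal: try `omega`")
    if "if " in goal or "match " in goal:
        tips.append("conditional branches: try `split_ifs` / `split`")
    if "List" in goal:
        tips.append("list invariant: supply the explicit list literal")
    if "∧" in goal or "∃" in goal:
        tips.append("And/Exists: provide explicit witnesses")
    if not tips:
        tips.append("state the target with `show <goal type>` "
                    "and build it term by term")
    return "Cannot synthesize placeholder. " + "; ".join(tips) + "."
```

The `if`/`else` split guarantees a monadic-trace goal never receives both hints, matching the verification note above.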

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Corpus of 29 failed interactive runs: 8 tasks (28%) hit a type_mismatch
where the "is expected to have type" region contains raw
`ContractResult.revert` / `ContractResult.success` / nested
`match match if ...` machinery — i.e. the goal still carries the full
Contract.run trace while the hypothesis has already been simplified
to the concrete storage-map shape. All 8 are `safe/owner_manager_reach`
tasks (add_owner, remove_owner, swap_owner, setup_owners × correctness,
invariant, reachability).

The existing `type_mismatch` hint was only
  "Unfold definitions to align types. Check spec matches impl."
which is not actionable for this shape — the agent doesn't know which
definitions to feed simp, so it loops on `exact h` / `rw` without
reducing the goal. The cross-class `.val` asymmetry detector (shipped
in f5cd8a9) also doesn't cover this shape because the two sides are
not a pure `.val` projection diff; one side has a whole monadic trace.

Add a targeted detector inside `type_mismatch` that matches
"is expected to have type … ContractResult.revert/success/match match"
within 800 chars and emits a directive hint naming the exact reducers
(`Contract.run`, `ContractResult.snd/revert/success`, `Verity.bind`,
`Bind.bind`, `Verity.pure`, `Pure.pure`) plus `split_ifs` on sentinel
guards. Verified negative: a `.val`-only asymmetry still fires the
original `.val` cross-class hint and NOT this new one.
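An illustrative version of the targeted detector is below; the 800-char window and reducer list come from the commit description, while the regex itself is an assumption about the diagnostic text.

```python
import re

# Fires when the expected type still carries Contract.run machinery:
# raw ContractResult.revert/success or nested `match match` within
# 800 chars of "is expected to have type".
_CONTRACT_TRACE_RE = re.compile(
    r"is expected to have type.{0,800}?"
    r"(ContractResult\.(?:revert|success)|match\s+match)",
    re.DOTALL,
)

REDUCERS = ("Contract.run", "ContractResult.snd", "ContractResult.revert",
            "ContractResult.success", "Verity.bind", "Bind.bind",
            "Verity.pure", "Pure.pure")

def contract_trace_hint(details: str):
    if not _CONTRACT_TRACE_RE.search(details):
        return None
    return ("The goal still carries the full Contract.run trace. Feed simp "
            "the reducers [" + ", ".join(REDUCERS) + "] and use split_ifs "
            "on sentinel guards.")
```

A `.val`-only asymmetry contains none of these markers, so it falls through to the cross-class `.val` hint as the negative verification requires.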

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus of 29 failed interactive runs: 19 tasks (66%) emit at least one
`unknown identifier '<tactic>'` diagnostic. Token-count breakdown:
  173 × 'simp'            100 × 'simpa'      52 × 'omega'
   43 × 'native_decide'    24 × 'simp_all'   10 × 'exact'
One task alone (safe/swap_owner_is_owner_correctness) has 52 repeat
occurrences of `unknown identifier 'simp'` in a single run — meaning
the agent kept submitting proofs with `exact simp [...]` 52 times.

The existing tactic-in-term-position hint inside `_build_check_hints`
(~line 1466) is wired correctly — it classifies the failure and emits
an actionable "wrap in `by`" hint — but `_filter_seen_hints` suppresses
it after the first emission. On the 2nd–Nth repeat, the agent gets no
feedback tying its specific, detectable mistake to each rejection.

This is the same failure mode as the UNFILLED HOLE warning (fixed in
4e7f31c): a state-conditional critical warning must repeat as long as
the state persists. Mirror that fix — re-detect the tactic-in-term
case against the current `details` and `hints.insert(0, ...)` post
dedup, so the warning names the specific tactic (`simp`, `omega`, …)
on every rejection. The hint is keyed to the concrete error-text state,
not the generic hint corpus, so it is not a "noise" dedup candidate.

Verified: given synthetic details with three `unknown identifier` lines
(simp, simp, omega), `_UNKNOWN_IDENT_RE` + `_LEAN_TACTIC_NAMES` filter
yields ['simp','simp','omega'] and the hint names the first ('simp').
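The verified behavior above can be reproduced with this sketch, where `_UNKNOWN_IDENT_RE` and `_LEAN_TACTIC_NAMES` are stand-ins for the harness's real module-level definitions (the actual tactic list may differ).

```python
import re

_UNKNOWN_IDENT_RE = re.compile(r"unknown identifier '([^']+)'")
_LEAN_TACTIC_NAMES = {"simp", "simpa", "omega", "native_decide",
                      "simp_all", "exact", "decide", "ring", "norm_num"}

def tactic_in_term_hits(details: str) -> list:
    """All unknown-identifier names that are actually tactic names."""
    return [name for name in _UNKNOWN_IDENT_RE.findall(details)
            if name in _LEAN_TACTIC_NAMES]

def post_dedup_tactic_warning(details: str, hints: list):
    """Re-detect against the CURRENT details and insert after dedup, so
    the warning repeats on every rejection while the state persists."""
    hits = tactic_in_term_hits(details)
    if hits:
        hints.insert(0, f"`{hits[0]}` is a tactic, not a term: wrap it "
                        f"in `by` (e.g. `by {hits[0]}`), not "
                        f"`exact {hits[0]} [...]`.")
```

Because the check is keyed to the error text of the submission at hand, it bypasses `_filter_seen_hints`-style suppression by construction.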

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corpus of 29 failed interactive runs: 6 tasks (21%) emit
`unknown identifier '<camelCase>'` for names that are clearly
local-binder-shaped (no dots, lowercase first char, no underscores).
Per-task repetition is extreme:
  safe/swap_owner_is_owner_correctness  110× (91 prevOwner, 19 oldOwner)
  safe/remove_owner_owner_list_invariant  56× owner
  safe/swap_owner_in_list_reachable       41× (hKey, hKeyNew, hKeyPrev, hKeyOld, newOwner)
  safe/add_owner_owner_list_invariant     34× owner
  safe/in_list_reachable                  21× s
  safe/setup_owners_owner_list_invariant  15× (owner1, owner2, owner3)

The existing local-variable hint in `_build_check_hints` (~line 1475)
is actionable — tells the agent to call inspect_lean_goals and re-check
the signature — but `_filter_seen_hints` suppresses it after first
emission. Same failure mode as UNFILLED HOLE (4e7f31c) and tactic-in-
term (30923f9): state-conditional critical warning that must repeat
as long as the state persists.

Mirror those fixes — re-detect the local-variable case against the
current `details` and `hints.insert(0, ...)` post-dedup, so the
warning names the specific out-of-scope binder on every rejection.

Suppress the binder hint when a tactic hit is also present: Lean reports
both with the same `unknown identifier` diagnostic, and the tactic
mistake is almost always the upstream cause of the binder being
unreachable, so firing both would just fill the agent's context with
duplicate advice for the same line.

Verified: pure tactic case → tactic hint only; pure var case → var
hint; mixed case → tactic only (var suppressed); Mathlib name (has
dot) → neither fires (handled by the existing mathlib branch).
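A hypothetical mirror of the binder re-detection, with the suppression logic, might look like this; the binder-shape test (no dots, lowercase first char, no underscores) is taken directly from the corpus description above, and the names are illustrative.

```python
import re

_UNKNOWN_IDENT_RE = re.compile(r"unknown identifier '([^']+)'")
_TACTICS = {"simp", "simpa", "omega", "native_decide", "simp_all", "exact"}

def _is_binder_shaped(name: str) -> bool:
    # Local-binder-shaped: no dots, no underscores, lowercase first char.
    return ("." not in name and "_" not in name
            and name[:1].islower() and name.isalnum())

def binder_warning(details: str, hints: list):
    names = _UNKNOWN_IDENT_RE.findall(details)
    tactic_hits = [n for n in names if n in _TACTICS]
    binder_hits = [n for n in names
                   if n not in _TACTICS and _is_binder_shaped(n)]
    if tactic_hits:
        # Tactic mistake is the upstream cause; suppress the binder hint.
        hints.insert(0, f"`{tactic_hits[0]}` is a tactic, not a term: "
                        "wrap it in `by`.")
    elif binder_hits:
        hints.insert(0, f"`{binder_hits[0]}` is not in scope here. Call "
                        "inspect_lean_goals and re-check the theorem "
                        "signature.")
```

Dotted names like `List.mem` match neither branch and fall through to the existing Mathlib-name handling.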

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>