feat(agent)!: hook system v2 — composable middleware #2012

Merged: gold-silver-copper merged 11 commits into main from feat/hook-system-v2 on Jul 5, 2026

@gold-silver-copper gold-silver-copper commented Jul 4, 2026

Hook system v2 — composable middleware

Evolves Rig's agent hook system into a composable middleware layer that supports serious production use cases — RAG/context injection, guardrails, request shaping, telemetry, tool policy, multi-turn orchestration — before the vector-store/RAG rework, and without touching vector stores. Major breaking change; no deprecated aliases (renamed once, correctly).

Update — round 3 (tool-execution semantics)

A third round redesigned tool-execution streaming for the cleanest, most correct semantics (breaking, per the mandate to prefer a small breaking change over surprising behavior):

  • Split model-emitted tool call from execution start. MultiTurnStreamItem::StreamAssistantItem(StreamedAssistantContent::ToolCall) now reports the tool call the model emitted — surfaced when the model turn is committed, whether or not Rig executes it. A new MultiTurnStreamItem::ToolExecutionStart { tool_call, internal_call_id } marks that Rig has started executing a tool, emitted only after the tool passed its ToolCall hook checks and its body actually runs (never for a dropped, hook-skipped, or invalid-recovery call). run_single_tool now returns ToolCallOutcome { content, executed } so the driver can tell a real run from a Flow::Skip. All three events (model call → execution start → result) correlate via internal_call_id. This is LangGraph's distinction between model-emitted tool calls and tool-execution lifecycle events.
  • Atomic per-batch commit/surface. drive_tool_calls collects tool outcomes instead of streaming them one-by-one. Successful ToolExecutionStart + ToolResult items are surfaced (in call order) and committed to history only after the whole batch settles successfully. On the first hook termination / fail-closed error the batch fails fast — stop new, drop not-yet-started concurrent siblings, drain in-flight, surface the deterministic lowest call-index error, surface no successful items, commit nothing (no orphan execution-start events, no partial history). Sequential and concurrent share the atomic path; run() and stream() return the same terminal reason. This matches OpenAI Agents' bounded concurrent execution that commits/surfaces only after the batch settles. (Previously the concurrent path streamed each result as its tool completed, in completion order.)
  • Local tool-choice validation. After per-turn patches and active_tools filtering, allowed_tool_names_for_choice validates the effective request before the provider call: ToolChoice::Required with no advertised tool (no executable tool and no synthetic output tool) and ToolChoice::Specific naming a tool not in the effective advertised set are local request errors with no provider round-trip. When a per-turn active_tools allow-list caused the incompatibility, the error says so and suggests setting a compatible tool_choice in the same RequestPatch. Structured-output Tool mode with no real tools still works when the synthetic output tool satisfies the choice. This is Pydantic AI's explicit local validation for impossible tool_choice/tool-set combinations.
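The atomic per-batch rule above can be illustrated with a minimal, self-contained sketch. It does not use Rig's types: `Outcome`, `Item`, and `settle_batch` are hypothetical stand-ins, and the real driver also handles dropping and draining concurrent siblings, which is omitted here. The sketch shows only the settle step: outcomes are put back into call order, the lowest call-index error wins, and successful items are surfaced only when the whole batch succeeded.

```rust
/// Hypothetical stand-in for the outcome of one tool call in a batch.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The tool ran and produced this content.
    Ok(String),
    /// A hook terminated the run, or a fail-closed error occurred.
    Failed(String),
}

/// Hypothetical stand-in for the items a driver would surface.
#[derive(Debug, PartialEq)]
enum Item {
    ExecutionStart { call_index: usize },
    ToolResult { call_index: usize, content: String },
}

/// Settle a batch atomically. On success every item is returned in call
/// order; on failure the lowest call-index error is returned and no
/// successful item is surfaced.
fn settle_batch(mut outcomes: Vec<(usize, Outcome)>) -> Result<Vec<Item>, String> {
    // Outcomes may have been collected in completion order; restore call order.
    outcomes.sort_by_key(|(index, _)| *index);

    let mut items = Vec::new();
    for (index, outcome) in outcomes {
        match outcome {
            // After sorting, the first failure seen has the lowest call index.
            // Returning here drops the items collected so far.
            Outcome::Failed(reason) => return Err(reason),
            Outcome::Ok(content) => {
                items.push(Item::ExecutionStart { call_index: index });
                items.push(Item::ToolResult { call_index: index, content });
            }
        }
    }
    Ok(items)
}

fn main() {
    let settled = settle_batch(vec![
        (1, Outcome::Ok("second".to_string())),
        (0, Outcome::Ok("first".to_string())),
    ]);
    println!("{settled:?}");

    let failed = settle_batch(vec![
        (0, Outcome::Ok("first".to_string())),
        (2, Outcome::Failed("late terminate".to_string())),
        (1, Outcome::Failed("early terminate".to_string())),
    ]);
    println!("{failed:?}");
}
```

Collecting first and surfacing afterwards is what removes orphan execution-start events: nothing is emitted until the batch is known to be good.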

Round-3 tests: split taxonomy + atomic ordering on both surfaces (stream_emits_model_tool_calls_then_atomic_execution_items, ..._results_in_call_order_after_batch_settles_...); concurrent_termination_surfaces_no_execution_items (no orphan start, no successful result, side effect ran but suppressed); Fix-3 unit tests + no-provider-call integration tests (required_with_empty_active_tools_..., specific_naming_filtered_out_tool_...); the anthropic streaming-tools cassette updated to assert call-order surfacing.

A 4-dimension adversarial review of round-3 confirmed 7 findings (no logic bugs in the atomic batch, split, or validation — the machinery was sound): ToolExecutionStart now carries the effective (hook-rewritten) tool call rather than the model's original — matching its doc and preventing a RewriteArgs redaction from leaking the original args (run_single_tool returns ToolExecution::Executed(effective_call) | Skipped); a Flow::Skip result is now surfaced as a ToolResult (no ToolExecutionStart) rather than committed-but-not-streamed; the active_tools error hint is shown only when the filter actually dropped a satisfying tool (not for a plain typo); and several doc-scoping/wording fixes (atomicity is not scoped to concurrency > 1; no completion-order claims). Regression tests added for each. Full rig-core lib 1138 pass; anthropic/openai/gemini cassette suites (322) pass.
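As an illustration of the local tool-choice validation and the narrowed active_tools hint, here is a minimal sketch under stated assumptions. The `ToolChoice` enum is a reduced stand-in, and `validate_tool_choice` with its `advertised` and `pre_filter` parameters is hypothetical; it is not the signature of `allowed_tool_names_for_choice`. It shows the two local errors and that the hint appears only when the filter actually dropped a tool that would have satisfied the choice.

```rust
/// Reduced, hypothetical stand-in for the tool-choice setting.
#[derive(Debug, PartialEq)]
enum ToolChoice {
    Auto,
    Required,
    Specific(String),
}

/// Hint appended only when the per-turn filter caused the incompatibility.
const HINT: &str =
    "; the per-turn active_tools filter caused this, set a compatible tool_choice in the same patch";

/// Validate the effective request before any provider call.
/// `advertised` is the tool set after filtering, `pre_filter` the set before.
fn validate_tool_choice(
    choice: &ToolChoice,
    advertised: &[&str],
    pre_filter: &[&str],
) -> Result<(), String> {
    match choice {
        ToolChoice::Auto => Ok(()),
        ToolChoice::Required if !advertised.is_empty() => Ok(()),
        ToolChoice::Required => {
            let mut error = String::from("tool_choice is Required but no tool is advertised");
            // Blame the filter only if there was something to filter out.
            if !pre_filter.is_empty() {
                error.push_str(HINT);
            }
            Err(error)
        }
        ToolChoice::Specific(name) if advertised.contains(&name.as_str()) => Ok(()),
        ToolChoice::Specific(name) => {
            let mut error = format!("tool_choice names `{name}`, which is not advertised");
            // A plain typo never existed before filtering, so it gets no hint.
            if pre_filter.contains(&name.as_str()) {
                error.push_str(HINT);
            }
            Err(error)
        }
    }
}

fn main() {
    let filtered_out = validate_tool_choice(
        &ToolChoice::Specific("calc".to_string()),
        &["search"],
        &["search", "calc"],
    );
    println!("{filtered_out:?}");

    let typo = validate_tool_choice(
        &ToolChoice::Specific("serach".to_string()),
        &["search"],
        &["search"],
    );
    println!("{typo:?}");
}
```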

Update — round 2 (correctness hardening)

A second adversarial review round, informed by how Pydantic AI / OpenAI Agents / LangGraph handle these cases, tightened four behaviors:

  • Flow::Terminate from a tool hook is now turn-wide fail-fast (was run-all-then-decide). Sequential execution (tool_concurrency == 1, the default) surfaces the terminate immediately and never starts the remaining sibling tools. Concurrent execution (> 1) drops not-yet-started siblings — a shared terminating flag makes them skip — while already-in-flight siblings are drained (so no detached task is left and the lowest call-index terminate reason still wins deterministically; a dropped sibling always has a higher index than the terminator that dropped it). No post-termination successful result is surfaced or committed. This avoids the Semantic-Kernel fail-open where every dispatched tool runs to completion before termination is honored, matching Pydantic AI / OpenAI Agents' cancel-or-drain-on-failure.
  • Tracing/redaction ordering guarantee. gen_ai.tool.call.result is recorded on the span only after the ToolResult hook runs — the redacted replacement on RewriteResult, the raw output on Keep, and nothing on Terminate — so a redaction guardrail never leaks the raw output to the trace or logs. The first ToolResult hook still observes the tool's actual output. (OpenAI Agents applies tool-output guardrails before tracing / tool-end / model-visible output for the same reason.)
  • Canonical ModelTurnFinished.content guarantee. On the streaming surface, ModelTurnFinished.content now carries the canonical committed content from StreamedTurn::finish (reasoning → text → tool-call ordering), matching what is recorded into run history — not the raw stream.choice aggregate. The raw choice is retained for the raw/final stream item. This mirrors Vercel AI SDK / LangGraph separating raw stream events from normalized final state; the blocking surface already used the committed resp.choice.
  • Scratchpad concurrency semantics documented. Removed the "never contends" claim. Scratchpad/HookContext now document that at tool_concurrency > 1 the ToolCall/ToolResult hooks for different tools share one HookContext and can run concurrently; Scratchpad::update is race-free per operation but imposes no deterministic ordering across concurrent tool hooks — store commutative/idempotent state or key per-tool state by the call id / internal call id (as LangGraph/OpenAI Agents/Pydantic AI treat run context as shared runtime state, not an ordered log under parallel tool execution).
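The tracing/redaction ordering guarantee reduces to a small decision made after the ToolResult hook has run. The sketch below is an assumption-laden illustration: `ResultFlow` and `span_value` are hypothetical names, and the real code records onto a tracing span rather than returning a value.

```rust
/// Hypothetical stand-in for the decision a ToolResult hook returns.
enum ResultFlow {
    Keep,
    RewriteResult(String),
    Terminate,
}

/// Decide what may be recorded as the tool result on the span.
/// This runs only after the hook has decided, so the raw output is
/// never recorded ahead of a redaction.
fn span_value(raw: &str, flow: &ResultFlow) -> Option<String> {
    match flow {
        // No hook changed the result: the raw output is recorded.
        ResultFlow::Keep => Some(raw.to_string()),
        // A guardrail replaced the result: only the replacement is recorded.
        ResultFlow::RewriteResult(redacted) => Some(redacted.clone()),
        // The run is terminating: nothing is recorded.
        ResultFlow::Terminate => None,
    }
}

fn main() {
    let raw = "card=4111-1111-1111-1111";
    let redacted = ResultFlow::RewriteResult("card=[REDACTED]".to_string());
    println!("{:?}", span_value(raw, &ResultFlow::Keep));
    println!("{:?}", span_value(raw, &redacted));
    println!("{:?}", span_value(raw, &ResultFlow::Terminate));
}
```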

Round-2 tests: sequential fail-fast (tool B side effect must not run) and concurrent drop-not-yet-started + drain-in-flight (both surfaces, Notify-synchronized for determinism, 5/5 non-flaky); redaction non-leak via a span-field capturing subscriber; canonical ModelTurnFinished ordering. The redaction and canonical tests were verified to fail without their fix. Also: git diff --check clean, () observes nothing, CHANGELOG signature includes ctx, streaming transcript-untouched assertion added.


Shortcomings addressed → what fixed them

| Shortcoming (before) | Fix |
| --- | --- |
| Hooks short-circuit on the first non-Continue, so a RAG hook's request patch skips a later tool-policy/telemetry hook | Accumulating dispatch: on CompletionCall, every hook runs and their patches merge |
| RequestOverride is narrow, replacement-only, no per-hook merge semantics | RequestPatch with documented per-field merge rules |
| Hooks can't inject context/RAG documents | RequestPatch::extra_context: Vec<Document> |
| No run-scoped identity/state; multi-turn orchestration needs ad-hoc interior mutability | HookContext (run id, turn, streaming flag, agent name, shared Scratchpad) |
| Rewrites single-shot (only the first RewriteArgs/RewriteResult wins) | Chained rewrites across the stack |
| Streaming tool-only / reasoning-only turns fire no lifecycle event | StepEvent::ModelTurnFinished (once per accepted turn, both surfaces) |

New dispatch contract

A HookStack combines hooks in registration order, and how their Flow results combine depends on the event:

  • CompletionCall — accumulate & merge. Every hook runs; each Flow::PatchRequest is merged in registration order into one effective patch. A mergeable patch does not short-circuit later hooks. Flow::Terminate stops the stack; any unsupported flow fails closed (accumulated patch discarded).
  • ToolCall / ToolResult — chain. Every hook runs; a RewriteArgs/RewriteResult is threaded into the next hook's event so rewrites compose (redact → truncate). Skip/Terminate are terminal mid-chain. The first result hook still observes the tool's actual output.
  • Every other event — first non-Continue wins (observe-only / recovery: CompletionResponse, ModelTurnFinished, InvalidToolCall, streamed deltas).

Nesting composes: a HookStack pushed as a hook returns its own net flow (a merged patch / threaded rewrite / terminal action), which the outer stack folds in again. Register observe-only hooks before steering hooks (a Terminate short-circuits the stack).
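A minimal sketch of the first two dispatch rules, assuming a reduced `Flow` enum and plain function pointers in place of Rig's hook trait. All names here (`CompletionHook`, `ToolCallHook`, `dispatch_completion`, `dispatch_tool_call`, and the example hooks) are hypothetical, and fail-closed handling is modeled as an `Err`, which is a simplification.

```rust
/// Reduced, hypothetical flow type; the real `Flow` has more variants.
#[derive(Debug, Clone, PartialEq)]
enum Flow {
    Continue,
    /// Stand-in patch: a list of extra context documents.
    PatchRequest(Vec<String>),
    RewriteArgs(String),
    Terminate(String),
}

type CompletionHook = fn() -> Flow;
type ToolCallHook = fn(&str) -> Flow;

/// CompletionCall: every hook runs; patches merge in registration order.
fn dispatch_completion(hooks: &[CompletionHook]) -> Result<Flow, String> {
    let mut merged: Option<Vec<String>> = None;
    for hook in hooks {
        match hook() {
            Flow::Continue => {}
            // A mergeable patch does not stop later hooks.
            Flow::PatchRequest(docs) => merged.get_or_insert_with(Vec::new).extend(docs),
            Flow::Terminate(reason) => return Ok(Flow::Terminate(reason)),
            // Unsupported here: fail closed and discard the accumulated patch.
            other => return Err(format!("unsupported flow on CompletionCall: {other:?}")),
        }
    }
    Ok(merged.map_or(Flow::Continue, Flow::PatchRequest))
}

/// ToolCall: each rewrite is threaded into the next hook's input.
fn dispatch_tool_call(hooks: &[ToolCallHook], args: &str) -> Result<Flow, String> {
    let mut current = args.to_string();
    let mut rewritten = false;
    for hook in hooks {
        match hook(&current) {
            Flow::Continue => {}
            Flow::RewriteArgs(next) => {
                current = next;
                rewritten = true;
            }
            // Terminal mid-chain: later hooks do not run.
            Flow::Terminate(reason) => return Ok(Flow::Terminate(reason)),
            other => return Err(format!("unsupported flow on ToolCall: {other:?}")),
        }
    }
    Ok(if rewritten { Flow::RewriteArgs(current) } else { Flow::Continue })
}

// Example hooks, registered in the order shown in `main`.
fn rag_hook() -> Flow { Flow::PatchRequest(vec!["doc from RAG".to_string()]) }
fn telemetry_hook() -> Flow { Flow::Continue }
fn policy_hook() -> Flow { Flow::PatchRequest(vec!["policy note".to_string()]) }
fn redact(args: &str) -> Flow { Flow::RewriteArgs(args.replace("secret", "[REDACTED]")) }
fn truncate(args: &str) -> Flow { Flow::RewriteArgs(args.chars().take(12).collect()) }

fn main() {
    let completion: [CompletionHook; 3] = [rag_hook, telemetry_hook, policy_hook];
    println!("{:?}", dispatch_completion(&completion));

    let tool: [ToolCallHook; 2] = [redact, truncate];
    println!("{:?}", dispatch_tool_call(&tool, "secret payload data"));
}
```

In this sketch the telemetry hook still runs even though the RAG hook before it returned a patch, which is the behavior the accumulate rule is meant to guarantee.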

RequestPatch merge semantics (per field)

| Field | hook ⊕ hook (registration order) | patch → baseline |
| --- | --- | --- |
| extra_context | append (earlier hooks' docs first) | append after static + dynamic context |
| additional_params | shallow-merge top-level keys, later hook wins | shallow-merge onto baseline params |
| preamble | last writer wins (warns on conflict) | replaces |
| temperature, max_tokens, tool_choice | last writer wins (warns on conflict) | replaces |
| active_tools | set intersection (warns when empty) | narrows the advertised set |
| history | last writer wins (warns on conflict) | replaces the messages sent this turn |

Patches are per-turn and non-sticky: CompletionCall re-fires each turn and re-resolves from the agent baseline.
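The hook-to-hook merge rules can be sketched for three of the fields. This is an illustration only: `Patch` is a reduced, hypothetical struct (the real RequestPatch has more fields and uses documents rather than strings), and `eprintln!` stands in for the `tracing::warn!` the PR describes.

```rust
use std::collections::BTreeSet;

/// Reduced, hypothetical patch carrying three of the documented fields.
#[derive(Debug, Default, Clone, PartialEq)]
struct Patch {
    extra_context: Vec<String>,
    temperature: Option<f64>,
    active_tools: Option<BTreeSet<String>>,
}

/// Build a tool allow-list from names (helper for the example).
fn tools(names: &[&str]) -> BTreeSet<String> {
    names.iter().map(|name| name.to_string()).collect()
}

/// Merge two hooks' patches; `earlier` was registered first.
fn merge(earlier: Patch, later: Patch) -> Patch {
    // extra_context: append, earlier hooks' documents first.
    let mut extra_context = earlier.extra_context;
    extra_context.extend(later.extra_context);

    // temperature: last writer wins, with a warning on conflict.
    if let (Some(a), Some(b)) = (earlier.temperature, later.temperature) {
        if a != b {
            eprintln!("warning: conflicting temperature ({a} vs {b}), later hook wins");
        }
    }
    let temperature = later.temperature.or(earlier.temperature);

    // active_tools: set intersection, so two narrowing hooks stay narrowing.
    let active_tools = match (earlier.active_tools, later.active_tools) {
        (Some(a), Some(b)) => {
            let both: BTreeSet<String> = a.intersection(&b).cloned().collect();
            if both.is_empty() {
                eprintln!("warning: active_tools intersection is empty");
            }
            Some(both)
        }
        (a, b) => a.or(b),
    };

    Patch { extra_context, temperature, active_tools }
}

fn main() {
    let rag = Patch {
        extra_context: vec!["retrieved doc".to_string()],
        temperature: Some(0.2),
        active_tools: Some(tools(&["search", "calc"])),
    };
    let policy = Patch {
        extra_context: vec!["policy note".to_string()],
        temperature: Some(0.7),
        active_tools: Some(tools(&["calc", "mail"])),
    };
    println!("{:?}", merge(rag, policy));
}
```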

extra_context — document ordering

Document order in the completion request is static → dynamic (vector-store) → hook extras, hook extras in registration order. The RAG query text is unchanged. Per-turn and non-sticky; works identically on run() and stream().

history view

RequestPatch::history replaces the prior messages sent to the provider for the turn (context-window compaction / summarization). The persisted transcript and run state are untouched, and RAG's query text still derives from the original history — only what is sent changes.
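The document ordering and the history view can be sketched together as one per-turn request assembly. `TurnRequest` and `build_turn_request` are hypothetical names and plain strings stand in for documents and messages. The sketch shows that hook extras land after static and dynamic context, and that a history patch changes only what is sent while the transcript is left alone.

```rust
/// Reduced, hypothetical view of what is sent to the provider for one turn.
#[derive(Debug, PartialEq)]
struct TurnRequest {
    documents: Vec<String>,
    messages: Vec<String>,
}

/// Assemble the request for one turn. The transcript is only read, never
/// modified, so a history patch cannot alter the persisted conversation.
fn build_turn_request(
    static_docs: &[&str],
    dynamic_docs: &[&str],
    hook_extras: &[&str],
    transcript: &[&str],
    history_patch: Option<&[&str]>,
) -> TurnRequest {
    // Order: static, then dynamic (vector store), then hook extras.
    let documents = static_docs
        .iter()
        .chain(dynamic_docs)
        .chain(hook_extras)
        .map(|doc| doc.to_string())
        .collect();

    // A history patch replaces the messages sent this turn only.
    let messages = history_patch
        .unwrap_or(transcript)
        .iter()
        .map(|message| message.to_string())
        .collect();

    TurnRequest { documents, messages }
}

fn main() {
    let transcript = ["user: hi", "assistant: hello", "user: summarize"];
    let compacted: &[&str] = &["summary of earlier turns", "user: summarize"];

    let request = build_turn_request(
        &["static"],
        &["dynamic"],
        &["hook-a", "hook-b"],
        &transcript,
        Some(compacted),
    );
    println!("{request:?}");
    println!("transcript still has {} messages", transcript.len());
}
```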

HookContext

Passed by reference (&HookContext) to every on_event: run_id(), turn() (advanced per turn by the driver), is_streaming(), agent_name(), and a shared Scratchpad (interior-mutable type-map: insert/get/update/remove/contains) so cooperating hooks share run-scoped state without their own Arc<Mutex<…>>. It is a driver construct: nothing from it reaches the sans-IO AgentRun.
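To make the type-map idea concrete, here is a minimal sketch of a shared scratchpad behind a shared context reference. It is not Rig's implementation: the struct layouts, the `ToolCallCount` key type, and `count_tool_calls` are assumptions, and only `get` and `update` are shown. The threads model tool hooks running concurrently; each update is race-free, and because increments commute, the lack of ordering between them does not matter.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical type-map: at most one value per Rust type.
#[derive(Default)]
struct Scratchpad {
    slots: Mutex<HashMap<TypeId, Box<dyn Any + Send>>>,
}

impl Scratchpad {
    /// Return a clone of the stored value of type `T`, if any.
    fn get<T: Any + Send + Clone>(&self) -> Option<T> {
        let slots = self.slots.lock().unwrap();
        let value = slots
            .get(&TypeId::of::<T>())
            .and_then(|slot| slot.downcast_ref::<T>())
            .cloned();
        value
    }

    /// Mutate the value of type `T` under the lock, creating a default first.
    fn update<T: Any + Send + Default>(&self, f: impl FnOnce(&mut T)) {
        let mut slots = self.slots.lock().unwrap();
        let slot = slots
            .entry(TypeId::of::<T>())
            .or_insert_with(|| Box::new(T::default()) as Box<dyn Any + Send>);
        if let Some(value) = slot.downcast_mut::<T>() {
            f(value);
        }
    }
}

/// Hypothetical run-scoped context, shared by reference and never cloned.
struct HookContext {
    run_id: String,
    turn: AtomicUsize,
    scratchpad: Scratchpad,
}

/// Key type for the scratchpad: a commutative counter.
#[derive(Default, Clone, Debug, PartialEq)]
struct ToolCallCount(usize);

/// Simulate `calls` concurrent tool hooks that each bump the counter.
fn count_tool_calls(ctx: &HookContext, calls: usize) -> usize {
    thread::scope(|scope| {
        for _ in 0..calls {
            scope.spawn(|| ctx.scratchpad.update(|count: &mut ToolCallCount| count.0 += 1));
        }
    });
    ctx.scratchpad.get::<ToolCallCount>().map_or(0, |count| count.0)
}

fn main() {
    let ctx = HookContext {
        run_id: "run-1".to_string(),
        turn: AtomicUsize::new(0),
        scratchpad: Scratchpad::default(),
    };
    // The driver, not the hooks, advances the turn.
    ctx.turn.fetch_add(1, Ordering::Relaxed);
    let total = count_tool_calls(&ctx, 8);
    println!(
        "run {} turn {} counted {} tool calls",
        ctx.run_id,
        ctx.turn.load(Ordering::Relaxed),
        total
    );
}
```

State that depends on hook ordering would not be safe in this shape; keying per-tool state by a call id is the alternative the PR text recommends.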

Streaming/non-streaming

run() and stream() share one drive loop (drive_agent), so every new semantic (patch accumulation, extra_context, history, chained rewrites) lands once and behaves identically — verified on both surfaces in tests. ModelTurnFinished fires exactly once per accepted turn on both surfaces, including a streamed tool-only turn (which fires no StreamResponseFinish); it is suppressed for turns recovered via invalid-tool-call retry/repair/skip, and fires after the medium-specific raw event when one fires. CompletionResponse (blocking) and StreamResponseFinish (streaming, text turns only) are retained for the raw provider payload.

Decision-point resolutions

  • Blind merge on CompletionCall (hooks see the agent baseline, not earlier hooks' patches): simpler, keeps StepEvent: Copy, sufficient because patches are declarative with documented conflict rules.
  • active_tools intersects rather than last-writer-wins: it is an allow-list guardrail, so two narrowing hooks must compose as narrowing; widening is a config error.
  • Scalars / preamble / history: last-writer-wins with a tracing::warn! — silent conflict is the Semantic Kernel wart; a warning keeps composition debuggable. Additive guidance belongs in extra_context, not preamble concatenation.
  • ModelTurnFinished fires after the raw event: consistent "raw before normalized" ordering across surfaces.
  • HookContext.turn is an AtomicUsize (built once per run, advanced per turn): gives a coherent turn to non-CompletionCall events; Sync for &HookContext across awaits.

Inspiration references

Studied under /Users/kisaczka/Desktop/code/many_rigs/inspirations/:

  • Semantic Kernel (docs/decisions/0043-filters-exception-handling.md, KernelFunctionFromPrompt.InvokeStreamingCoreAsync): its ADR admits forgetting await next() silently disables downstream filters, and it shipped a real streaming/non-streaming short-circuit asymmetry. → kept Rig's typed Flow + fail-closed model; enforced both-surface parity with tests.
  • LangGraph (chat_agent_executor.py llm_input_messages vs messages; pregel/main.py invoke = stream): the ephemeral per-call view vs committed edit → RequestPatch::history; one drive loop → parity guarantee.
  • Vercel AI SDK (generate-text.ts prepareStep, util/notify.ts): per-field replace-vs-merge documented explicitly; observers return void and are error-isolated; naming churn (onFinish → onEnd) → renamed once with no aliases.
  • OpenAI Agents (tool_guardrails.py): rich enum outcomes over booleans; run-scoped RunContextWrapper → HookContext.
  • LangChain (middleware/types.py, factory.py): first-registered-is-outermost onion, immutable request + override; classic callbacks' untyped-kwargs pain → typed StepEvent.
  • pydantic-ai (capabilities/abstract.py): a run-scoped RunContext on every hook is table stakes; but its ~30-method mega-trait is a tax Rig avoids by keeping the single on_event + match-arm model.

Tests

  • cargo test -p rig-core --lib agent: 226 passed (220 existing + 6 new hook-v2 integration tests), 0 failed.
  • New hook.rs unit tests: 16 (accumulation, terminate short-circuit, nested-stack composition, per-field merge rules incl. empty intersection, scratchpad).
  • New runner.rs integration tests (both surfaces): extra_context after static context, append order, per-turn non-sticky, history override (transcript untouched), ModelTurnFinished once per accepted turn incl. tool-only, chained RewriteArgs+RewriteResult.
  • Provider cassette hook tests (replay): anthropic request_override/tool_call_rewrite_args/tool_result_rewrite (6), openai request_hook/permission_control (3), gemini agent_run_streamed (6) — all pass on both surfaces.
  • cargo fmt --check, cargo clippy -p rig-core --lib --tests --all-features (0 warnings), cargo doc -p rig-core --no-deps --all-features (0 warnings). All 5 hook examples + agent_with_durable_approval cargo check clean.

A 5-dimension adversarial review (API/design, merge-semantics edge cases, streaming parity, regressions, deep dispatch correctness) surfaced no correctness/parity/regression issues; the only confirmed findings were doc-only — the four public add_hook rustdocs and the foundational CHANGELOG bullet still carried the pre-v2 "first non-Continue short-circuits the rest" blanket claim (false for the new accumulate/chain cases) — now corrected to state the event-dependent composition and point to the module docs.

Known limitations & follow-ups (not in this PR)

  • RunStarted/RunFinished { outcome } observe-only lifecycle events (so telemetry sees hook-initiated terminate) — deferred.
  • An observer registration mode (error-isolated, never-skipped) if ordering guidance proves insufficient — deferred; the accumulation change already prevents ordinary patches from skipping later hooks.
  • Transcript-visible tool-call repair vs execution-only arg rewrite — documented need.
  • Tool-definition rewriting / per-turn tool injection via the patch — RequestPatch is #[non_exhaustive], so a future tools field is non-breaking.
  • The vector-store/RAG migration itself: dynamic_context reimplemented as a bundled hook using extra_context.

Non-goals (unchanged)

No vector-store crate/VectorStoreIndex removal; dynamic_context untouched; no RAG example migration; no generic Retriever trait; no second observer trait or declarative ordering engine.

Evolve the agent hook system into a composable middleware layer ahead of
the vector-store/RAG rework, without touching vector stores.

- HookContext: run-scoped context (run_id, turn, is_streaming, agent_name,
  shared Scratchpad) passed to every on_event; breaks the trait signature.
- Mergeable request patches: RequestOverride -> RequestPatch,
  Flow::OverrideRequest -> Flow::PatchRequest. CompletionCall patches from
  every hook accumulate and merge in registration order (no more first-patch
  short-circuit); documented per-field merge rules.
- RequestPatch::extra_context: per-turn passive-RAG document injection,
  appended after static + dynamic context.
- RequestPatch::history: per-turn replacement of the messages sent to the
  provider; transcript untouched, RAG keys off the original history.
- Chained tool rewrites: RewriteArgs/RewriteResult compose across a HookStack.
- StepEvent::ModelTurnFinished: normalized once-per-turn event on both
  surfaces (covers streamed tool-only turns).

All merge/patch/context/history/event semantics land once in the shared
drive loop, so run() and stream() behave identically. Fail-closed handling,
invalid-tool-call recovery, and run-all-then-decide tool execution unchanged.
…ook v2

Review found the pre-v2 'first non-Continue short-circuits the rest' blanket
claim still on all four public add_hook rustdocs and the foundational CHANGELOG
bullet — false for CompletionCall (accumulate) and ToolCall/ToolResult (chain).
State the event-dependent composition and point to the hook module docs.
… ModelTurnFinished

Address the second adversarial review round:

- Flow::Terminate from a tool hook is now turn-wide fail-fast. Sequential
  execution stops before starting remaining siblings; concurrent execution
  drops not-yet-started siblings (a shared flag makes them skip) while draining
  already-in-flight ones, so no new side effect runs after a termination and the
  lowest call-index terminate reason still wins deterministically. Avoids the
  Semantic-Kernel run-all-then-decide fail-open.
- Tool-result redaction: gen_ai.tool.call.result is recorded only AFTER the
  ToolResult hook runs (replacement on RewriteResult, raw on Keep, nothing on
  Terminate), so a redaction hook never leaks the raw output to the trace/logs.
- Streaming ModelTurnFinished.content now carries the canonical committed
  StreamedTurn::choice (reasoning->text->tool ordering), not the raw
  stream.choice aggregate; raw is kept for final stream behavior.
- Scratchpad/HookContext docs: drop "never contends"; document that at
  tool_concurrency > 1 tool hooks for different tools share the context and run
  concurrently, update is race-free per op but imposes no ordering; recommend
  commutative/idempotent state or keying by call id.
- Docs/test hygiene: CHANGELOG old signature includes ctx; anthropic
  request_override wording; () observes nothing; streaming transcript-untouched
  assertion.

Tests: sequential + concurrent fail-fast (both surfaces), redaction
non-leak (span capture), canonical ModelTurnFinished ordering — all verified to
fail without their fix. Full agent suite 229 passed.
…; fix () observes doc

Round-2 review follow-ups (both low-severity accuracy fixes):
- The concurrent drop test's narrative was wrong: a synchronous terminator meant
  no sibling ran at all, so it never exercised the in-flight-drain vs
  beyond-window-drop distinction it described. Rewrite with a Notify so tc1 is
  genuinely in flight (drains, called contains 1) while tc2 beyond the window is
  dropped (called never contains 2) and tc0's body never runs. Renamed
  accordingly; 8/8 non-flaky.
- Correct the () hook observes() doc: observes gates only the two streaming delta
  events, not 'every event' (non-delta events dispatch unconditionally).
…al tool-choice validation

Round-3 redesign for correctness and cleaner semantics (breaking):

1. Split model-emitted tool call from execution start. StreamAssistantItem
   (StreamedAssistantContent::ToolCall) now reports the tool call the MODEL
   emitted (at turn commit, whether or not it runs). A new
   MultiTurnStreamItem::ToolExecutionStart marks that Rig actually started
   executing a tool (after hook checks) — never for a dropped/hook-skipped/
   invalid-recovery call. run_single_tool now reports `executed` to distinguish
   a real run from a Flow::Skip.

2. Atomic per-batch commit/surface. drive_tool_calls collects tool outcomes
   instead of streaming them one-by-one; execution-start + result items are
   surfaced (in call order) and committed only after the whole batch settles OK.
   On the first termination/fail-closed error the batch fails fast (stop new,
   drop not-yet-started concurrent siblings, drain in-flight, lowest call-index
   error) with no successful items surfaced and no history commit — no orphan
   execution-start events. Sequential and concurrent share the atomic path;
   results now surface in call order (was completion order for concurrent).

3. Local tool-choice validation. allowed_tool_names_for_choice now validates
   the effective request before the provider call: Required with no advertised
   tool (executable or output tool) and Specific naming a non-advertised tool
   are local errors, with an active_tools-aware hint suggesting a compatible
   tool_choice in the same RequestPatch. Structured-output Tool mode with no
   real tools still works via the synthetic output tool.

Tests: split taxonomy + atomic ordering (both surfaces), no-orphan-execution-start
on concurrent termination, no-successful-result/no-commit on termination, Fix-3
unit + no-provider-call integration tests. Full lib 1135 pass; anthropic/openai/
gemini cassette suites (322) pass.
…esults, precise active_tools hint

Address the round-3 adversarial review (7 confirmed findings):

- ToolExecutionStart now carries the EFFECTIVE (hook-rewritten) tool call, not
  the model's original — matching its doc and preventing a RewriteArgs redaction
  from leaking the original args. run_single_tool returns
  ToolExecution::Executed(effective_call) | Skipped.
- A Flow::Skip tool result is now surfaced as a ToolResult (no ToolExecutionStart)
  instead of being committed-but-not-streamed — the stream matches history again.
- The active_tools error hint is shown only when the filter actually dropped a
  tool that would have satisfied the choice, not for a plain typo (threads the
  pre-filter tool set into allowed_tool_names_for_choice).
- Docs: ToolExecutionStart/StreamUserItem and the tool_concurrency method docs no
  longer scope atomicity to concurrency>1 or claim completion-order surfacing;
  ToolExecutionStart timing reworded to the atomic-settle model.

Tests: ToolExecutionStart-carries-rewritten-args, hook-skip-surfaces-result-without-
execution-start, specific-typo-not-blamed-on-active_tools. Full lib 1138 pass;
anthropic/openai/gemini cassette suites (322) pass.
…::large_enum_variant)

The Executed(ToolCall) variants dwarfed the empty Skipped/Preresolved variants;
box the ToolCall so the enums stay small. Surfaced under the rig package's
feature set (cargo clippy -p rig --all-features).
- Correct AgentHook::on_event default doc: the default observes() returns
  true (not "observes nothing"); () skips only the high-frequency delta
  dispatch and still receives every other event, returning Continue.
- Add unit_hook_observes_no_event_kind guarding the () observes override.
- Rename first_terminate_short_circuits_on_observe_only_events (it dispatched
  a chained ToolCall) to ..._on_chained_tool_call; add a real observe-only
  (_ => arm) terminate test via TextDelta.
- Assert ModelTurnFinished is suppressed on recovered turns on both the
  blocking and streaming surfaces (its own guard, distinct from the
  response-finish guards), with cross-driver parity.
- Add runner_add_hook_appends_to_agent_default_hooks proving runner/prompt
  add_hook appends to (not replaces) the agent's default hooks.
- Fix broken intra-doc link ToolCallOutcome::executed -> ::execution.
…ut-tool warning

Deep-review follow-ups on hook system v2:

- Fix stale AgentRunner::tool_concurrency streaming doc. It still described the
  pre-atomic-batch behavior ("emits each ToolResult as its tool finishes ...
  completion order"); the driver now surfaces ToolExecutionStart + ToolResult
  stream items in call order only after the whole batch settles. This matches the
  implementation and the sibling StreamingPromptRequest::tool_concurrency doc that
  links to it.
- Fix Scratchpad doc that referenced "clones of a run's HookContext" — HookContext
  is not Clone (holds an AtomicUsize; it is shared by &-reference). Attribute the
  sharing to the shared &HookContext / Scratchpad clones instead.
- completion.rs: the committed Tool-mode stall warning fired falsely when
  tool_choice = Specific names the synthetic output tool. allowed_tool_names_for_choice
  accepts and advertises such a choice (the output tool is callable), but
  tool_choice_permits_output_tool treats every Specific as forbidding. Add a
  name-aware output_tool_callable helper used only at the warning site (resolve_output_mode
  keeps the coarser check, since the output-tool name is not known there yet) and a
  unit test.
…ool-mode finalization, fix stale changelog

Deep-review follow-ups on structured-output Tool-mode interactions:

- ModelTurnFinished.content: doc said "as recorded into the run", but on a
  Tool-mode output-tool finalization turn the run persists the turn as assistant
  text (structured output) with the tool call dropped, while both drivers fire
  the event with the model-emitted content (including the output-tool call).
  The content is the model's committed turn output (consistent across surfaces,
  fired at turn commit before finalization). Correct the field doc to state this
  with an explicit Tool-mode caveat, and add a both-surfaces regression test.
- StreamAssistantItem: the contract promised a complete ToolCall item for every
  model tool call whether or not Rig executes it, but an output-tool call
  finalizes the run directly (bypassing drive_tool_calls), so its complete item
  is never emitted (only deltas + FinalResponse); invalid-recovery calls are the
  same shape. Narrow the doc to document both carve-outs, and add a streaming
  regression test.
- CHANGELOG: two Unreleased lines still described streaming ToolResult items in
  completion order, contradicting the atomic call-order-after-settle behavior.
  Update them to match.
A comparative study of LangChain, LangGraph, OpenAI Agents, pydantic-ai,
Semantic Kernel, and the Vercel AI SDK surfaced three scoped, pure-doc
improvements to the hook system (no API or behavior change):

- Flow::Skip / Flow::skip: document that `reason` is delivered to the model
  verbatim as the tool result, so it doubles as a prompt — tell the model the
  tool did not run and not to retry, or it may re-emit the identical call
  (mirrors LangChain's HITL reject/respond feedback).
- Flow::retry: add a worked example building corrective feedback from
  InvalidToolCallContext::available_tools (closes an asymmetric doc gap — the
  rewrite_args/rewrite_result constructors already had examples; mirrors
  LangGraph's INVALID_TOOL_NAME_ERROR_TEMPLATE).
- hook module docs: add a "Why a returned Flow, not a next()-style middleware"
  rationale, citing the silent-skip footgun that a next()/middleware model
  carries (Semantic Kernel ADR 0043) and which Rig's fail-closed returned-Flow
  design prevents.
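
The first two doc points can be illustrated with plain string builders. This is a sketch under assumptions, not the worked example added to the Flow::retry docs: `skip_reason` and `corrective_feedback` are hypothetical helpers, and the slice of tool names stands in for whatever InvalidToolCallContext::available_tools provides.

```rust
/// Build a skip reason that doubles as a prompt: it tells the model the
/// tool did not run and that the call should not be repeated.
fn skip_reason(tool: &str, why: &str) -> String {
    format!("The tool `{tool}` was not executed: {why}. Do not retry this call.")
}

/// Build corrective feedback for an invalid tool name from the list of
/// tools that are actually available.
fn corrective_feedback(requested: &str, available_tools: &[&str]) -> String {
    format!(
        "`{requested}` is not a valid tool. Available tools: {}. Call one of these instead.",
        available_tools.join(", ")
    )
}

fn main() {
    println!("{}", skip_reason("delete_file", "blocked by policy"));
    println!("{}", corrective_feedback("serach", &["search", "calc"]));
}
```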
@gold-silver-copper gold-silver-copper added this pull request to the merge queue Jul 5, 2026
Merged via the queue into main with commit e43089a Jul 5, 2026
6 checks passed
pull Bot pushed a commit to sternelee/rig that referenced this pull request Jul 5, 2026
)

* test(gemini): live cassette hook-system stress suite

Adds a Gemini cassette-backed stress suite
(tests/providers/gemini/cassette/hook_stress.rs) that exercises the merged hook
system (v2, 0xPlaygrounds#2012) across long, realistic multi-turn workflows recorded against
real Gemini and replayed deterministically:

- HookContext identity (stable run_id, advancing turn, is_streaming, agent_name)
  and a shared Scratchpad threaded across two cooperating hooks and many turns.
- RequestPatch: extra_context injection + active_tools narrowing + temperature,
  proven by downstream effects (the injected fact reaches the answer; the
  filtered-out tool never executes).
- Chained tool lifecycle: RewriteArgs -> observe -> RewriteResult redaction, with
  paired positive/negative assertions (the marker reaches the model; the raw
  result does not; the transcript keeps the model's original args).
- Streaming lifecycle ordering (tool call -> execution start -> tool result ->
  final response) and is_streaming parity vs the blocking surface.
- Per-turn atomic call/result pairing and the Skip zero-execution invariant.

Assertions follow tools_support's loose-assertion convention (exact equality only
for rig-synthesized values), so the cassettes survive re-recording; hooks are
deterministic so replay requests stay byte-identical. Cassettes are auto-scrubbed
and safety-checked by the recorder (key -> [REDACTED]; ids/signatures
placeholdered) and were reviewed manually.

No hook-system bugs surfaced: every scenario passed on the first live recording,
confirming the documented behavior. Inspired by how LangChain, LangGraph, OpenAI
Agents, pydantic-ai, Semantic Kernel, and the Vercel AI SDK test their
hook/middleware/guardrail systems (ordered breadcrumb logs; proving mutations via
downstream effects; paired positive+negative redaction; zero-downstream skip
invariants).

* test(gemini): expand hook-system stress suite (+24 live cassette tests)

Broadens the Gemini hook-system stress suite from 6 to 30 recorded workflows
across four themed cassette files, driven by a shared deterministic-fixtures
module (hook_stress_support.rs):

- hook_stress_context (8): HookContext identity (stable run_id, advancing turn,
  is_streaming, agent_name incl. the unset case); a shared Scratchpad written by
  one hook and read by a second, growing across turns; internal_call_id
  correlation of ToolCall/ToolResult; two observe-only hooks both firing;
  add_hook appending across builder + request; CompletionCall patch accumulation;
  and active_tools set-intersection across two narrowing hooks.
- hook_stress_patch (4): preamble override; tool_choice=Required (first turn
  only); per-turn history replacement injecting a prior fact; multi-field patch
  (preamble + extra_context).
- hook_stress_tools (6): single-key and chained RewriteArgs; chained
  RewriteResult (redact -> wrap) and truncation; Terminate from a ToolResult
  (post-execution); and model-driven recovery from a tool error.
- hook_stress_streaming (6): TextDelta / StreamResponseFinish / ModelTurnFinished
  on streaming; ModelTurnFinished on tool turns; RewriteResult redaction reaching
  the FinalResponse; active_tools narrowing and Skip on the streaming driver; and
  blocking-vs-streaming answer parity (two cassettes).

All recorded live against Gemini and replaying deterministically; every patch /
rewrite / skip effect is proven by a downstream-observable change, and
model-shaped values use loose assertions (exact only for rig-synthesized ones).
Cassettes are auto-scrubbed + safety-checked by the recorder and were reviewed.

A footgun surfaced along the way: forcing tool_choice=Required on *every* turn
loops until max_turns (each turn re-forces a tool call). Captured with a
first-turn-only patch fixture (FirstTurnPatch) so the intended pattern is shown.
No hook-system bugs found; every scenario confirms the documented behavior.
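
The tool_choice=Required footgun can be reproduced with a toy driver. Everything here is an assumption made for illustration: the turn counter is taken to be zero-based, `first_turn_patch` only mimics the idea behind the FirstTurnPatch fixture, and the toy model answers as soon as it is not forced to call a tool.

```rust
/// Reduced, hypothetical stand-in for the tool-choice setting.
#[derive(Debug, PartialEq)]
enum ToolChoice {
    Required,
}

/// Intended pattern: force a tool call on the first turn only.
fn first_turn_patch(turn: usize) -> Option<ToolChoice> {
    (turn == 0).then_some(ToolChoice::Required)
}

/// The footgun: force a tool call on every turn.
fn every_turn_patch(_turn: usize) -> Option<ToolChoice> {
    Some(ToolChoice::Required)
}

/// Toy driver: a forced turn always yields a tool call, an unforced turn
/// yields the final answer. Returns the turn that produced the answer, or
/// `None` when `max_turns` is exhausted.
fn run(max_turns: usize, patch: fn(usize) -> Option<ToolChoice>) -> Option<usize> {
    (0..max_turns).find(|&turn| patch(turn).is_none())
}

fn main() {
    println!("first-turn-only: {:?}", run(5, first_turn_patch));
    println!("every turn:      {:?}", run(5, every_turn_patch));
}
```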