Two production failure modes observed 2026-05-03 leave a Lark user visibly stuck even though the cluster is healthy. Both are healed locally, with no NyxID changes.

(1) /llm /route /model says "binding 已失效, 请发送 /init" ("binding expired, please send /init") while /init says "已绑定, 请先 /unbind" ("already bound, please /unbind first"). After PR #558 re-DCR'd the cluster's OAuth client to add the `proxy` scope, the user's existing NyxID binding (issued for the previous client_id) is rejected at broker token exchange. ModelChannelSlashCommandHandler caught the rejection and pointed the user at /init — which then refused because the local readmodel still holds the (now-dead) binding_id, looping the user forever. Self-heal: when the broker throws BindingRevokedException / BindingNotFoundException / BindingScopeMismatchException, dispatch RevokeBindingCommand to the local ExternalIdentityBindingGAgent (reason="auto_self_heal_*") so the readmodel flips to revoked, and tell the user the binding was cleared and /init will work now. This mirrors the dispatch shape UnbindChannelSlashCommandHandler uses for explicit /unbind, but skips the NyxID-side revoke (NyxID is the party that just told us the binding is gone).

(2) The bot replies as "..." forever when the LLM call fails. The streaming sink fires the first chunk via channel-relay/reply, consuming the single-use reply token and creating a placeholder message. If the LLM then fails (e.g. upstream 429), pre-fix the runtime fell through to RunLlmReplyAsync, which issued a fresh /reply against the dead token and got `401 Reply token already used` — leaving the user staring at "..." with no error explanation. Self-heal: ConversationGAgent.TryCompleteStreamedReplyAsync now takes the Failed branch when streaming has already committed the placeholder. It edits the existing message in place via RunStreamChunkAsync (channel-relay/reply/update — no reply token needed) with the classified failure text, then persists ConversationTurnCompletedEvent so the runtime envelope retry loop does not refire and consume the dead token again.
If the edit also fails (rare: Lark may refuse stale-message edits), persist the last flushed partial as terminal — the same defence-in-depth pattern the existing PR #374 fix uses for the Completed path.

Tests:
- 4 new ModelSlashCommandHandlerTests pinning the binding self-heal for each rejection shape, plus the degraded path when IActorRuntime is missing.
- 2 new ConversationGAgentDedupTests pinning that the streaming-Failed branch edits the placeholder and falls through to "persist last flushed partial" when the edit also fails.

Verification:
- dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests --no-build (851/851)
- dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests --no-build (36/36 in dedup suite)
- dotnet test test/Aevatar.Foundation.Core.Tests (230/230)
- bash tools/ci/test_stability_guards.sh (passed)
- bash tools/ci/query_projection_priming_guard.sh (passed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
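The binding self-heal described above reduces to a narrow catch-and-revoke around the broker call. A minimal sketch, assuming hypothetical stand-ins for the project's internal plumbing (the exception names and RevokeBindingCommand come from the commit; `IBindingRevoker` and the message strings are illustrative):

```csharp
// Hypothetical stand-ins; the real handler dispatches through project plumbing.
public abstract class BrokerBindingException : Exception { }
public sealed class BindingRevokedException : BrokerBindingException { }
public sealed class BindingNotFoundException : BrokerBindingException { }
public sealed class BindingScopeMismatchException : BrokerBindingException { }

public interface IBindingRevoker
{
    // Dispatches RevokeBindingCommand to the local ExternalIdentityBindingGAgent.
    Task RevokeLocalAsync(string actorId, string reason, CancellationToken ct);
}

public sealed class ModelSelfHealSketch
{
    private readonly IBindingRevoker _revoker;
    public ModelSelfHealSketch(IBindingRevoker revoker) => _revoker = revoker;

    public async Task<string> ListModelsAsync(
        Func<Task<string>> brokerCall, string actorId, CancellationToken ct)
    {
        try
        {
            return await brokerCall();
        }
        catch (BrokerBindingException ex)
        {
            // NyxID just told us the binding is gone, so only the local
            // readmodel is revoked; no NyxID-side revoke is attempted.
            var reason = $"auto_self_heal_{ex.GetType().Name}";
            await _revoker.RevokeLocalAsync(actorId, reason, ct);
            return "Stale binding cleared locally; /init will work now.";
        }
    }
}
```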
… path

Both reviewers caught the same bug: pre-fix, the self-heal returned the "本地已自动清理" ("cleaned up locally") message even when IActorRuntime was unregistered or when CreateAsync / HandleEventAsync threw — the local readmodel was NOT actually cleaned in those paths, so the user follows the message to /init, /init still sees the stale binding and refuses, recreating the exact loop this PR exists to break.

Split the self-heal API into cleanedMessage + degradedMessage. The cleaned message is returned ONLY when the local revoke envelope was actually dispatched to the binding actor; otherwise the degraded message points the user at /unbind explicitly. Also add a single retry on the dispatch path (mirroring UnbindChannelSlashCommandHandler's PR #521 v4-pro review fix) so a one-off Orleans hiccup doesn't downgrade an otherwise self-healable binding to manual /unbind guidance.

Tests:
- List_DegradesToUnbindGuidance_WhenSelfHealActorRuntimeMissing now asserts the reply contains "/unbind" and DOES NOT contain "已自动清理" — pinning the no-runtime degraded path.
- List_DegradesToUnbindGuidance_WhenSelfHealDispatchKeepsThrowing is a new test exercising ThrowingActorRuntime (every CreateAsync throws) and asserts AttemptCount == 2 (retry-once contract) plus the degraded reply shape.

Verification:
- dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests --no-build (852/852)
- bash tools/ci/test_stability_guards.sh (passed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
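The cleanedMessage / degradedMessage split plus the retry-once contract might look roughly like this (a sketch; the method shape and message strings are assumptions, not the PR's exact API):

```csharp
public sealed record SelfHealOutcome(bool Cleaned, string Message);

public static class SelfHealDispatchSketch
{
    // Returns the "cleaned" message ONLY when the revoke envelope was actually
    // dispatched; retries once so a one-off runtime hiccup does not downgrade
    // an otherwise self-healable binding to manual /unbind guidance.
    public static async Task<SelfHealOutcome> TryDispatchAsync(Func<Task>? dispatch)
    {
        if (dispatch is null) // e.g. IActorRuntime not registered in this host
            return new(false, "Could not clean up automatically; please run /unbind.");

        for (var attempt = 1; attempt <= 2; attempt++)
        {
            try
            {
                await dispatch();
                return new(true, "Stale binding cleared; /init will work now.");
            }
            catch (Exception) when (attempt < 2)
            {
                // swallow once and retry
            }
            catch (Exception)
            {
                break; // second failure: fall through to degraded guidance
            }
        }
        return new(false, "Could not clean up automatically; please run /unbind.");
    }
}
```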
…ly-self-heal Self-heal binding rejections + streaming-failed reply-token loop
DeepSeek v4-pro with thinking mode rejects requests when reasoning_content from prior assistant turns is not echoed back. This change:
- Adds ReasoningContent to ChatMessage, LLMResponse, and LLMStreamChunk
- Propagates reasoning content through ChatRuntime streaming rounds
- Appends reasoning_content to conversation history for multi-turn
- Implements ExtractReasoningContent in MEAILLMProvider
- Wires reasoning content into non-streaming ConvertResponse

Closes #563
Channel conversation LLM runs inside Orleans actors with no HTTP auth context. The scope resolver returns null, causing ActorBackedUserMemoryStore to throw InvalidOperationException on every turn and log a warning.
- Add TryResolveScopeId/TryResolveWriteActorId that return null instead of throwing
- ReadProjectedStateAsync returns null when no scope is available
- GetAsync returns UserMemoryDocument.Empty, BuildPromptSectionAsync returns empty string — no warning logged
- Write operations (Save/Add/Remove) still throw since they only run from Studio API with auth context

Closes #564
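The no-throw degradation above follows the standard Try-pattern: reads degrade to an empty document, writes still throw. A minimal sketch under assumed member shapes (only the type names come from the commit):

```csharp
public sealed record UserMemoryDocument
{
    public static readonly UserMemoryDocument Empty = new();
}

public sealed class UserMemoryStoreSketch
{
    private readonly Func<string?> _tryResolveScopeId;

    public UserMemoryStoreSketch(Func<string?> tryResolveScopeId)
        => _tryResolveScopeId = tryResolveScopeId;

    // Read path: degrade silently when there is no auth scope (e.g. the LLM
    // turn runs inside an Orleans actor with no HTTP context).
    public UserMemoryDocument Get()
        => _tryResolveScopeId() is null ? UserMemoryDocument.Empty : Load();

    public string BuildPromptSection()
        => _tryResolveScopeId() is null ? string.Empty : "memory section";

    // Write path: still throws, because writes only run from the Studio API
    // where an auth context is guaranteed.
    public void Save()
    {
        _ = _tryResolveScopeId()
            ?? throw new InvalidOperationException("No scope available for write.");
    }

    private UserMemoryDocument Load() => UserMemoryDocument.Empty;
}
```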
…ths, history persistence
…, remove dead try-catch
Gracefully handle missing scope in UserMemoryStore (fixes #564)
…ing-content Propagate reasoning_content through LLM pipeline (fixes #563)
    "/model encountered NyxID-side binding rejection ({Reason}) but IProjectionReadinessPort is not registered; cannot verify local readmodel cleanup. actor={ActorId}, subject={Platform}:{Tenant}:{User}",
    reason,
    actorId,
    context.Subject.Platform, context.Subject.Tenant, context.Subject.ExternalUserId);
[Consensus: 3 models] severity=blocker, category=arch
The new TryDispatchLocalBindingRevokeAsync runs EnsureProjectionForActorAsync + WaitForBindingStateAsync inside the slash-command request path. Three reviewers flag this as a CLAUDE.md violation: query-time projection priming is forbidden — the application/query layer must not activate a projection scope and synchronously wait for read-model materialization on the request hot path. IProjectionReadinessPort is a write-side completion port, not a turn/query dependency. Refactor self-heal so the revocation is fire-and-forget (publish a continuation event, return early to the user with a 'binding cleanup in progress' message), or perform the projection wait off the request path. CLAUDE.md ref: 「禁止 query-time replay/priming」 ("query-time replay/priming is forbidden") / 「QueryPort/QueryService/ApplicationService 不得在请求路径...同步补投影」 ("QueryPort/QueryService/ApplicationService must not synchronously backfill projections on the request path...").
Per-model verbatim
- codex: This makes the `/model` read/list path activate a projection scope and then synchronously wait for read-model cleanup. That violates the CLAUDE.md 禁止 query-time replay/priming rule and the ADR note that `IProjectionReadinessPort` is a write-side completion port, not a turn/query dependency. Move this repair behind an explicit write-side command/port that owns projection activation, and keep `/model` honest if the cleanup cannot already be observed.
- v4-pro: `TryDispatchLocalBindingRevokeAsync` calls `EnsureProjectionForActorAsync` + `WaitForBindingStateAsync` synchronously in the slash-command handler path. This is projection priming in the request path: activating a projection and waiting for the readmodel to refresh within the request execution. Violates CLAUDE.md's ban on query-time priming: if a query first needs to "ensure the projection exists / refresh the read model", that action must happen in an explicit activation, lease, binder, or background materialization flow; synchronously backfilling the projection inside a query method is forbidden. Move projection activation/binder setup before the slash-command path (e.g. upstream in the conversation pipeline when the inbound message arrives), or make the self-heal fire-and-forget (dispatch the revoke without synchronously waiting on the readmodel). The degraded-message fallback already handles cases where the readmodel isn't cleaned within the request.
- kimi: `TryDispatchLocalBindingRevokeAsync` calls `_bindingProjectionPort.EnsureProjectionForActorAsync(actorId, ct)` synchronously inside the slash command handler request path. This is projection priming in the application layer, which CLAUDE.md forbids (same rule quoted above). While the OAuth endpoint does similar priming, endpoints are Host layer; a per-user-message slash handler is not. Consider activating binding projection scopes eagerly at startup (e.g. a hosted service that primes all known binding actors) instead of per-request.
var actor = await _actorRuntime
    .CreateAsync<ExternalIdentityBindingGAgent>(actorId, ct)
    .ConfigureAwait(false);
var envelope = new EventEnvelope
[Consensus: 4 models] severity=blocker, category=arch
actor.HandleEventAsync(envelope, ct) dispatches RevokeBindingCommand directly through IActorRuntime, bypassing IActorDispatchPort and the actor's inbox / serialization guarantees. Four reviewers flag this as a CLAUDE.md violation of Runtime 与 Dispatch 分责 (runtime vs. dispatch separation of duties): IActorRuntime may resolve lifecycle/topology, but command delivery must go through the runtime-neutral publisher (publish/send). Inline HandleEventAsync reduces to direct method invocation on the target actor — exactly the pattern CLAUDE.md forbids: 「禁止在基类、业务层或中间适配层绕过标准 publisher 直操底层传输对象」 ("base classes, the business layer, and intermediate adapters must not bypass the standard publisher to operate directly on the underlying transport object"). Replace with the runtime-neutral dispatch port (IActorDispatchPort.PublishAsync(...) or the equivalent) and let the actor's inbox process the revoke command.
Per-model verbatim
- codex: Calling `actor.HandleEventAsync` directly bypasses `IActorDispatchPort` and manually inlines delivery into the target actor. This violates CLAUDE.md's Runtime 与 Dispatch 分责 rule (delivery semantics must be runtime-neutral): `IActorRuntime` should only resolve lifecycle/topology, while command delivery should go through `IActorDispatchPort.DispatchAsync` with `EnvelopeRouteSemantics.CreateDirect(...)` or the existing command-dispatch helper.
- v4-pro: `actor.HandleEventAsync(envelope, ct)` dispatches the `RevokeBindingCommand` directly to the actor via inline method call, bypassing the actor's inbox queue and serialization guarantees. Violates CLAUDE.md: delivery semantics must be runtime-neutral — publish/send uniformly means "enters the target actor's inbox to await processing" and must not degrade to inline dispatch because the target is self or because the underlying runtime differs. The comment at L126 says this mirrors UnbindChannelSlashCommandHandler — but that handler also uses the same pattern and should be fixed too. Use IActorDispatchPort (or an equivalent standard dispatch mechanism) instead of CreateAsync + HandleEventAsync.
- kimi: The handler directly calls `actor.HandleEventAsync(envelope, ct)` via `IActorRuntime`, bypassing the standard `IActorDispatchPort`. CLAUDE.md requires runtime-neutral dispatch: bypassing the standard publisher to operate directly on the underlying transport object is forbidden. While `UnbindChannelSlashCommandHandler` uses the same pattern, the new self-heal path amplifies the violation by coupling it with projection priming and retry logic. Refactor to use `IActorDispatchPort` or a dedicated binding command port so the slash handler does not depend on runtime internals.
- mimo-v2.5-pro: The slash-command handler now directly orchestrates actor dispatch via `IActorRuntime`, `EventEnvelope`, and `RevokeBindingCommand`. Per CLAUDE.md, slash handlers are channel-routed application services — they should not construct and dispatch actor envelopes. Extract `SelfHealRevokedBindingAsync` / `TryDispatchLocalBindingRevokeAsync` into a dedicated `IBindingSelfHealService` (application layer) and inject that instead. This also simplifies testing: the current test file has to mock `IActorRuntime`, `ExternalIdentityBindingProjectionPort`, AND `IProjectionReadinessPort` just to cover one catch branch.
private readonly ILogger<ModelChannelSlashCommandHandler> _logger;

public ModelChannelSlashCommandHandler(
    ILogger<ModelChannelSlashCommandHandler> logger,
[Consensus: 2 models] severity=major, category=di
Constructor injects the concrete ExternalIdentityBindingProjectionPort (and accepts nullable IActorRuntime + IProjectionReadinessPort). Two issues: (a) DI: the CLAUDE.md 依赖反转 (dependency inversion) rule requires depending on abstractions — define IExternalIdentityBindingProjectionPort and inject the interface (the current registration in IdentityServiceCollectionExtensions.cs is services.TryAddSingleton<ExternalIdentityBindingProjectionPort>() against the concrete type). (b) silent degradation: nullable self-heal collaborators mean that if any registration is missing in production DI, the self-heal — the user-facing fix this PR delivers — silently disappears with no error. Either make these required (non-nullable), or add a startup validation that logs a warning when self-heal will be inactive.
Per-model verbatim
- glm-5.1: `IActorRuntime`, `ExternalIdentityBindingProjectionPort`, and `IProjectionReadinessPort` are all nullable and degrade gracefully, but the self-heal feature is the primary user-facing fix in this PR. If any of these aren't registered in the production DI container, the handler silently falls back to the degraded message and the binding loop is unbroken. Verify registrations exist in the host composition root — if they're not yet wired, this should be a blocker.
- kimi: Constructor injects the concrete `ExternalIdentityBindingProjectionPort` instead of an abstraction. CLAUDE.md: "Dependency inversion: upper layers depend on abstractions; cross-layer reverse dependencies and direct coupling to concrete implementations are forbidden." The port is registered as a singleton concrete class (IdentityServiceCollectionExtensions.cs:86, `services.TryAddSingleton<ExternalIdentityBindingProjectionPort>()`). Define `IExternalIdentityBindingProjectionPort`, register against it, and inject the interface.
public async Task<bool> RemoveEntryAsync(string id, CancellationToken ct = default)
{
    var actor = await EnsureWriteActorAsync(ct);
[Consensus: 3 models] severity=major, category=concurrency
RemoveEntryAsync was reordered so EnsureWriteActorAsync runs before ReadProjectedStateAsync. Three reviewers flag this side-effect change: previously, calling RemoveEntryAsync("missing") was a no-op that returned false without materializing any actor; now it always activates the user-memory actor first (state hydration, reminder registration, storage I/O), even when the entry does not exist (returns false at the existence check). This violates the CLAUDE.md 默认路径须定义资源语义 rule (default paths must define resource semantics) — a no-op delete now implicitly creates a resource, which has cost (cluster activation, cache pressure) and changes the resource lifecycle. Move actor resolution after the existence check; only resolve the scope/actor when there is real work to do.
Per-model verbatim
- codex: `RemoveEntryAsync("missing")` now creates or materializes the user-memory actor before checking whether the entry exists. That turns a no-op delete into implicit resource creation, violating CLAUDE.md 默认路径须定义资源语义; keep the projected read first, and if the no-scope case must throw, resolve the scope ID explicitly without calling `EnsureWriteActorAsync` until a matching entry is found.
- kimi: `RemoveEntryAsync` was reordered to call `EnsureWriteActorAsync` before `ReadProjectedStateAsync`. This causes unnecessary actor activation when the entry does not exist. Actor creation has side effects (state hydration, reminder registration, storage reads). Move actor resolution back after the existence check, or document why eager activation is required.
- mimo-v2.5-pro: The reorder of `EnsureWriteActorAsync` before `ReadProjectedStateAsync` in `RemoveEntryAsync` changes the side-effect ordering. Previously the actor was only created when the entry existed. Now every `RemoveEntryAsync` call creates the actor even when the entry is missing (returns `false` at line 133). The actor creation is idempotent, but it's an unnecessary round-trip in the common "entry not found" case. Consider moving `EnsureWriteActorAsync` back after the existence check, or documenting that the actor must exist before the read (if there's a race-condition fix reason for the reorder).
}

public async Task<string> BuildPromptSectionAsync(int maxChars = 2000, CancellationToken ct = default)
{
[codex] severity=major, category=bug
Removing the previous try/catch means a transient projection/read-model failure in BuildPromptSectionAsync will now fail the whole chat/prompt construction path instead of degrading by omitting memory. Restore the best-effort behavior here: catch non-cancellation exceptions from GetAsync, log the warning, and return string.Empty.
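The best-effort behavior codex asks to restore is a narrow catch that omits memory instead of failing the turn. A minimal sketch, assuming delegate stand-ins for the real GetAsync call and logger:

```csharp
public static class PromptSectionSketch
{
    public static async Task<string> BuildPromptSectionAsync(
        Func<Task<string>> getMemoryAsync,
        Action<Exception> logWarning)
    {
        try
        {
            return await getMemoryAsync();
        }
        catch (OperationCanceledException)
        {
            throw; // cancellation must still propagate to the caller
        }
        catch (Exception ex)
        {
            // Transient projection/read-model failure: degrade by omitting
            // the memory section rather than failing prompt construction.
            logWarning(ex);
            return string.Empty;
        }
    }
}
```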
runtime.Actors.Should().NotContainKey("user-memory-user-1",
    "no actor should be created when entry is missing");
var actor = runtime.Actors["user-memory-user-1"];
actor.ReceivedEnvelopes.Should().BeEmpty("no remove command should be dispatched when entry is missing");
[v4-pro] severity=major, category=test
RemoveEntryAsync_MissingEntry_ReturnsFalse changed from asserting runtime.Actors.Should().NotContainKey(...) (actor NOT created) to asserting runtime.Actors["user-memory-user-1"] EXISTS with empty envelopes. But the new RemoveEntryAsync reads state first and returns false before calling EnsureWriteActorAsync when the entry is missing — the actor should not be created at all. Either the test is wrong or the test infrastructure (EmptyReader, FakeActorRuntime) triggers an unexpected actor creation. Verify the test passes against the actual code change.
AttachOpenAIRawRepresentationForReasoning(meaiMsg, msg);
result.Add(meaiMsg);
}
[Consensus: 3 models] severity=major, category=arch
AttachOpenAIRawRepresentationForReasoning injects reasoning_content into the OpenAI SDK's serialized JSON via rawMessage.Patch.Set("$.reasoning_content"u8, ...) under #pragma warning disable SCME0001. Three reviewers flag the same fragility: Patch is an SDK-internal experimental surface (SCME0001), reasoning_content is not a documented field on AssistantChatMessage, and the JSONPath relies on the SDK's current serialization shape. Any SDK update that changes the patch model, the serializer, or the message structure will silently break or drop reasoning content propagation — there is no compile-time guarantee. Mitigations: (a) pin the OpenAI SDK version, (b) add an integration test that asserts the serialized JSON contains reasoning_content, and (c) file a feature request upstream for first-class reasoning content support so the workaround can be retired.
Per-model verbatim
- glm-5.1: `rawMessage.Patch.Set("$.reasoning_content"u8, ...)` relies on the OpenAI SDK's internal `Patch` / SCME0001 surface to inject `reasoning_content` into the serialized JSON. This is fragile — any SDK update that changes the patch model or serialization pipeline will silently drop reasoning content or throw at runtime. Consider isolating this behind a version-checked helper (or a feature flag) so a future SDK bump doesn't cause a silent regression.
- v4-pro: `AttachOpenAIRawRepresentationForReasoning` injects `reasoning_content` into the OpenAI SDK's `AssistantChatMessage` via `rawMessage.Patch.Set("$.reasoning_content"u8, ...)`. This hardcoded JSON path depends on the SDK's internal serialization structure (the `AssistantChatMessage` `ModelReaderWriter.Write` output), which could change between versions and silently break reasoning propagation. Consider a custom `AssistantChatMessage` subclass or an explicit `BinaryContent` wrapper if the SDK doesn't natively support reasoning_content.
- mimo-v2.5-pro: `AttachOpenAIRawRepresentationForReasoning` uses `rawMessage.Patch.Set("$.reasoning_content"u8, ...)` with `#pragma warning disable SCME0001`. This is a fragile internal API — the OpenAI SDK doesn't document `reasoning_content` as a supported field on `AssistantChatMessage`, and the `Patch` property is an implementation detail. If the SDK changes wire format or removes the Patch escape hatch, this will silently break reasoning round-tripping. Add a fallback: if the patch throws (or if `RawRepresentation` doesn't survive a serialize-then-deserialize cycle), log a warning and degrade gracefully rather than crashing the entire chat call.
CancellationToken ct)
{
    try
    {
[v4-pro] severity=minor, category=perf
MergeProxyRouteCandidatesAsync calls _nyxClient.DiscoverProxyServicesAsync synchronously within GetServicesAsync. If the /api/v1/proxy/services endpoint is slow or unreachable (and the catch swallows the exception with a warning), the catalog load time doubles on every /model or /route invocation. Consider making the proxy-service discovery happen on a background refresh, or caching with a TTL, rather than per-request.
NyxIdLlmServicesResult result,
string bearerToken,
CancellationToken ct)
{
[mimo-v2.5-pro] severity=minor, category=design
The proxy endpoint path /api/v1/proxy/services?per_page=100 is hardcoded here AND in NyxIdLlmServiceCatalogClient.cs:63. Two independent HTTP clients call the same endpoint. Extract the path into a shared constant (or better, into NyxIdLlmProviderSource / a config key) so the two call-sites stay in sync if the path changes.
/// Registers the Ornn skill tool system. The ornn_search_skills tool is always registered with the
/// LLM; remote skills are fetched on demand via IRemoteSkillFetcher and integrated into the unified
/// use_skill tool. All Ornn API calls go through NyxID's proxy route, so callers must register
/// NyxIdApiClient first (typically via AddNyxIdTools).
/// </summary>
[kimi] severity=minor, category=di
AddOrnnSkills registers OrnnSkillClient which now requires NyxIdApiClient, but the extension does not register or assert that dependency. Hosts that call AddOrnnSkills without also calling AddNyxIdTools will hit a runtime DI resolution failure. Add an explicit services.TryAddSingleton<NyxIdApiClient>() guard here, or document that callers must register NyxID tools first and add a runtime guard that throws a clear InvalidOperationException if the client is missing.
configure?.Invoke(options);
services.TryAddSingleton(options);
services.TryAddSingleton<OrnnSkillClient>();
services.TryAddSingleton<IRemoteSkillFetcher, OrnnRemoteSkillFetcher>();
[v4-pro] severity=minor, category=di
DI contract change: OrnnSkillClient now REQUIRES NyxIdApiClient (was optional HttpClient). services.TryAddSingleton<OrnnSkillClient>() at line 23 will fail at resolution time if NyxIdApiClient is not already registered. The comment at line 12-13 documents the requirement but there's no runtime guard. If EnableOrnnSkills=true but AddNyxIdTools() was not called (unlikely in production, possible in test harnesses), the DI container throws InvalidOperationException. Consider adding a services.AddOrnnSkills guard that checks whether NyxIdApiClient is already registered.
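The suggested registration-time guard can exploit the fact that IServiceCollection is an inspectable list of ServiceDescriptors. AddOrnnSkills and NyxIdApiClient are the PR's names; the guard itself (and the by-name type check, used here because the real type isn't referenced) is an assumption:

```csharp
using System.Linq;
using Microsoft.Extensions.DependencyInjection;

public static class OrnnSkillsGuardSketch
{
    public static IServiceCollection AddOrnnSkillsGuarded(this IServiceCollection services)
    {
        // Fail fast with an actionable message instead of a generic DI
        // resolution error at first OrnnSkillClient activation.
        var hasNyxIdClient = services.Any(d => d.ServiceType.Name == "NyxIdApiClient");
        if (!hasNyxIdClient)
        {
            throw new InvalidOperationException(
                "AddOrnnSkills requires NyxIdApiClient; call AddNyxIdTools() first.");
        }
        return services;
    }
}
```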
@@ -401,6 +414,18 @@ private async Task DispatchOutputChunksAsync(
/// </summary>
[mimo] severity=nit, category=other
The comment references "PR #569 review, codex P1 on SkillRunnerGAgent.cs:351" — the line number will drift as the file evolves. Prefer referencing the commit hash or a stable anchor rather than a line number.
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<IsPackable>false</IsPackable>
<IsTestProject>true</IsTestProject>
[v4-pro] severity=nit, category=test
New test project lacks Aevatar.AI.ToolProviders.NyxId project reference. OrnnSkillClientTests.cs instantiates NyxIdApiClient directly (line: new Aevatar.AI.ToolProviders.NyxId.NyxIdApiClient(...)) via the using statement in OrnnSkillClientTests.cs:4, but the .csproj doesn't have a <ProjectReference> to Aevatar.AI.ToolProviders.NyxId. If the project builds, it's because Aevatar.AI.ToolProviders.Ornn transitively references it (line 11: <ProjectReference Include="..\Aevatar.AI.ToolProviders.NyxId\..." />). This works but is fragile — make the reference explicit.
Codecov flagged 0% coverage on LarkCardKitClient (81 lines) and
ILarkCardKitClient (17 lines). Mirror the LarkNyxClient
HTTP-recording pattern from LarkCoverageTests so the four CardKit
methods exercise the real transport: a fake HttpMessageHandler
captures the request, the test asserts URL + method + body shape.
Coverage:
- CreateCardAsync: POST /cards, body embeds DataJson as a nested
JSON object (load-bearing — Lark rejects double-encoded payloads).
- StreamElementContentAsync: PUT to .../elements/{element_id}/
content, sequence + uuid wired through.
- StreamElementContentAsync omits `uuid` when the IdempotencyKey is
blank (Lark rejects empty uuids on some endpoints).
- StreamElementContentAsync runs ids through Uri.EscapeDataString
(verified via AbsoluteUri so percent-encoded chars survive — the
default Uri.ToString() decodes %20 back to a literal space).
- SetCardSettingsAsync: PATCH on settings, settings JSON inline.
- UpdateCardAsync: PUT card-as-object, sequence carried, no uuid.
- ParseJsonObject rejects blank/malformed JSON at the boundary
rather than letting it surface as a 400 from Lark.
- AddLarkTools registers ILarkCardKitClient as a singleton.
54/54 Aevatar.AI.ToolProviders.Lark.Tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
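The HTTP-recording pattern referenced above — a fake HttpMessageHandler that captures the outgoing request so URL, method, and body shape can be asserted against the real transport — can be reproduced in a few lines. This is a generic sketch, not the LarkCoverageTests source:

```csharp
using System.Net;

// Captures every outgoing request so tests can assert on the real transport.
public sealed class RecordingHandler : HttpMessageHandler
{
    public readonly List<(HttpMethod Method, string Url, string? Body)> Requests = new();

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        var body = request.Content is null
            ? null
            : await request.Content.ReadAsStringAsync(ct);
        // AbsoluteUri keeps percent-encoding intact; Uri.ToString() would
        // decode %20 back to a literal space and hide an escaping bug.
        Requests.Add((request.Method, request.RequestUri!.AbsoluteUri, body));
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"code\":0}"),
        };
    }
}
```

Usage: wrap it in `new HttpClient(handler) { BaseAddress = ... }`, call the client under test, then assert on `handler.Requests[0]` — e.g. that an element id run through `Uri.EscapeDataString("id with space")` still appears as `id%20with%20space` in the recorded AbsoluteUri.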
Address review batch on PR #562 (10 inline comments). All are in files I have recent ownership of and require no architectural shifts:
- #16 (blocker, security): ssh_exec is now opt-in via NyxIdToolOptions.EnableSshExecTool. Hosts that haven't wired the approval middleware no longer see the tool by default. The mainnet host opts in (the Lark bot needs it).
- #21 (major, bug): code_execute keeps the modern /execute + {language, script} contract, but on a NyxID-proxy upstream 404 it retries the legacy /run + {language, code} contract so deployments still pinned to old chrono-sandbox-service builds keep working.
- #22 (major, bug): SkillRegistry.IsFresh now exempts SkillSource != Remote from TTL — local skills are baked in at registration and don't need expiring; the prior behavior dropped them from use_skill after the first 5 min.
- #18 (major, bug): TurnRunner.TryResolveSenderBindingAsync narrows the catch to transient infra errors (Http/Timeout/IO/JSON) and surfaces non-transient ones (logic, NRE, serialization) at Error level so ops can distinguish "sender unbound" from "binding store broken".
- #19 (major, bug): ConversationReplyGenerator narrows the sender-route-fallback catch to transient errors via IsRetryableSenderRouteFailure. Programmer errors no longer cost an LLM round on retry.
- #29 + #30 (minor): the inbox runtime gives metadata enrichment its own 15s budget separate from the LLM run, surfacing errorCode=llm_reply_metadata_timeout when the scope/UserConfig lookup is slow. ResolveFallbackTimeout treats ResponseTimeoutSeconds <= 0 as "no timeout" rather than silently snapping back to 120s.
- #12 (minor): ConversationGAgent's stream-chunk and final-stream-chunk edits now run under a 10s CTS; the failure path already uses one. A hung relay can no longer pin the actor turn forever.
- #27 (minor, security): ConstantTimeEquals docstring tightened — removed the "for future callers" line and added a SCOPE comment that this helper is rebuild-admin-only and shouldn't be promoted to internal/public without replacing its length leak with a length-padding scheme.
- #23 (major, bug): CLI ornn skills slug default → ornn-api (matches the registered slug; bare "ornn" is the SPA frontend, which returns HTML).

Build clean (NyxId / Skills / NyxidChat / Mainnet hosts), 30 AI tests + 15 inbox runtime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- #15 (major, di): AddOrnnSkills now self-registers NyxIdApiClient and NyxIdToolOptions via TryAdd safety nets so workflow hosts that enable Ornn without AddNyxIdTools get a clean 404 path on a missing BaseUrl rather than a confusing DI resolution failure.
- #3 (major, di): Introduce IExternalIdentityBindingProjectionPort and inject the interface in IdentityOAuthEndpoints. The concrete class remains for runtime composition; the registration also publishes the interface as a forwarder so existing tests that hold the concrete type keep working.
- #14 + #17 (major, concurrency): the TurnStreamingReplySink TOCTOU concern was based on a misread — drainSignal is captured inside the same lock acquisition as _dispatchInProgress=false in the cap branch, and the throttle gate's nextIsFinal=false invariant makes _drainTcs guaranteed null on that path. Document the invariants so future readers don't re-flag this.
- #20 (major, arch): Replace the singleton Dictionary cache in NyxIdLlmServiceCatalogClient with IMemoryCache. Per the CLAUDE.md "中间层状态约束" (middle-layer state constraints) rule, per-caller state lives in a host-owned cache, not a service field. The AbsoluteExpiration policy is preserved (30s).
- #9 (minor): /api/v1/proxy/services?per_page=100 was already extracted into NyxIdLlmCatalogRoutes.ProxyServicesPath — both call sites already use the constant.
- #10 (minor): LooksLikeLlmRouteCandidate already has negative-signal filtering plus a two-distinct-weak-signals threshold. No change.
- #11 (minor): ExternalIdentityBindingProjector already logs a Warning with DocumentId/EventId/Version on the empty-binding delete path.

Build clean (Ornn / Identity / NyxidChat / Channel.Runtime), updated catalog client test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- #13 (major, arch): /api/oauth/aevatar-client/rebuild now dispatches ProvisionAevatarOAuthClientCommand via IActorDispatchPort.DispatchAsync, matching /unbind. The inline actor.HandleEventAsync was a known violation of the CLAUDE.md "投递语义必须 runtime-neutral" (delivery semantics must be runtime-neutral) rule; aligning the two endpoints removes the inconsistency that any future inbox middleware would silently bypass on rebuild.
- #24 (minor, design): the callback endpoint accepts ?format=json on the URL to opt back into the {status:"bound", already_bound, display_name} envelope that programmatic CLI/SDK consumers used pre-HTML-render. The default stays HTML for browser callbacks.
- #26 (minor, arch): /rebuild now sits behind a RebuildAuthEndpointFilter that enforces the admin-token check before model binding and per-request DI activation kick in. The filter + the inline check in the handler are redundant by design (defense in depth) — the filter rejects unauthenticated posts before deserialization runs, and the handler still validates so hand-rolled tests/integration scenarios cannot bypass it.
- #28 (minor, design): document the readmodel-deletion contract in the ExternalIdentityBindingProjector header — an empty BindingId deletes the document instead of upserting an inactive record; downstream audit consumers must read the committed-event log directly.
- #1 + #2 (blocker, arch): no change needed. Earlier commits in this PR already moved the /model self-heal to IActorDispatchPort.DispatchAsync and removed the EnsureProjectionForActorAsync call from the slash-command request path. Verified by reading the current handler.
- #25 (minor, test): documented in the rebuild handler comment — concurrent /rebuild calls would race on the same actor, but this is operator-grade break-glass and de-duping concurrent rebuilds is out of scope.

Build clean (Identity), 34 OAuth-path tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- #7 (major, arch): wrap MEAILLMProvider's Patch.Set("$.reasoning_content") in try/catch so a future OpenAI SDK contract break (Patch surface change, serializer rewrite, field rename) degrades to "no reasoning replay" instead of crashing the chat call. The OpenAI package is already pinned to 2.9.1 in Directory.Packages.props, and the existing AIComponentCoverageTests already pin the serialized JSON shape, so any drift fails the build the moment Patch stops landing in the payload.
- Test fix for #22 (skill registry TTL): the old TryGet_BeyondTtl_ReturnsFalseSoCallerCanRefetch test relied on TTL expiring SkillSource.Local entries — that's the bug #22 flagged. Updated the stale-entry tests to use SkillSource.Remote (via remoteId), which is the realistic stale scenario. Added a new TryGet_LocalSkillBeyondTtl_StillFresh test pinning the new behaviour.

538 AI tests + 897 ChannelRuntime tests + 16 Ornn tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov flagged 1 missing + 2 partial branches after the previous coverage commit (4c537ad). Add three targeted tests:

* CreateCardAsync_RejectsLiteralNullJson: ParseJsonObject's `?? throw` branch only fires when JsonNode.Parse returns null without throwing — and the only input that does that is the literal JSON `"null"`. Without the throw, the request would serialize as `"data": null` and Lark would reject it as a missing field; the boundary check turns the silent null into a clear ArgumentException at the caller's call site.
* SetCardSettingsAsync_OmitsUuid_WhenIdempotencyKeyIsBlank: covers the false branch of `if (!string.IsNullOrWhiteSpace(...))` for the settings endpoint. Mirrors the existing StreamElementContent variant.
* UpdateCardAsync_PassesIdempotencyKey_WhenProvided: covers the true branch of the same conditional for UpdateCardAsync, plus asserts the idempotency key is .Trim()'d before emission so accidental whitespace does not defeat Lark's dedup.

13/13 LarkCardKitClient tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
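The literal-null boundary generalizes beyond the C# client: most JSON parsers return a null value (rather than throwing) for the input `null`. A Python sketch of the guard (parse_json_object is a hypothetical name, not the repo's API):

```python
import json

def parse_json_object(raw: str) -> dict:
    """Parse `raw` and reject the one input that parses to null.

    json.loads("null") returns None without raising -- the Python
    analogue of JsonNode.Parse returning null in C#. Without the guard,
    a literal "null" would flow downstream, serialize as `"data": null`,
    and surface only as a confusing remote-side rejection.
    """
    parsed = json.loads(raw)   # raises ValueError on malformed JSON
    if parsed is None:         # only the literal input "null" lands here
        raise ValueError("payload must be a JSON object, got literal null")
    if not isinstance(parsed, dict):
        raise ValueError(f"payload must be a JSON object, got {type(parsed).__name__}")
    return parsed
```

The point of the boundary check is to move the failure from the remote API's vague error to a precise exception at the caller's call site.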
Multi-model fix-check verdict (3/5 partial, 2/5 unresolved on
discussion_r3201406794) flagged the residual structural risk: the
runtime-only `card_mode` boolean still lived on
LlmReplyStreamChunkEvent — a domain-event-shaped envelope — so a
misbehaving persistence layer could replay it and silently re-route
to the card sink. Comment-only mitigation was acceptable to three
reviewers but not to glm-5.1 / codex.
Eliminate the risk by construction:
* New proto LlmReplyCardStreamChunkEvent in conversation_events.proto
carries the same fields as LlmReplyStreamChunkEvent, dispatched by
the card sink path. The proto type itself signals routing — there
is no boolean to flip.
* LlmReplyStreamChunkEvent.card_mode is removed and the field number
is `reserved` along with its name so any stale serializer or
accidental reuse fails loudly instead of silently re-enabling
card mode.
* TurnStreamingReplySink.DispatchOneAsync now constructs the correct
proto type based on the existing `cardMode` constructor flag.
* ConversationGAgent grows a second [EventHandler]
HandleLlmReplyCardStreamChunkAsync. The legacy
HandleLlmReplyStreamChunkAsync delegates to a private
HandleNyxRelayStreamingChunkCoreAsync helper; the card handler runs
the card-streaming machine and, on CreationFailed, synthesizes an
edit-message chunk and falls through to the same helper so the
user still gets a reply when card creation fails.
* IConversationCardTurnRunner / ChannelCardConversationTurnRunner /
HandleLarkCardStreamingChunkCoreAsync take
LlmReplyCardStreamChunkEvent. Tests update RecordingCardTurnRunner
factories and CreateCardStreamChunk helper accordingly; tests now
exercise HandleLlmReplyCardStreamChunkAsync instead of relying on
a CardMode flag on the legacy event.
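The reservation mechanics in the second bullet look roughly like this in proto syntax (the field numbers and the surviving fields are illustrative — the real layout lives in conversation_events.proto):

```proto
// Illustrative sketch only; actual numbering per conversation_events.proto.
message LlmReplyStreamChunkEvent {
  // card_mode is removed. Reserving both its number and its name makes
  // protoc reject any later definition that reuses either, so the flag
  // cannot quietly come back on the legacy event.
  reserved 7;
  reserved "card_mode";

  string turn_id = 1;     // hypothetical surviving fields
  string chunk_text = 2;
}
```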
Spot-checks for the two manual-eyeball items in the verdict:
* `git grep ConfigureAwait agents/Aevatar.GAgents.Channel.Runtime/
Conversation/ConversationGAgent*.cs` finds only a comment mention
("no ConfigureAwait(false)") — no actual usage. mimo's dissent on
this point was a misread.
* The three classifier helpers
(ClassifyCreateFailure / ClassifyPostSendFailure /
ClassifyStreamFailure) consistently map 11310 → IsTableLimitExceeded
and 230099 / 230100 → IsCardUnavailable. No 230099 lands in any
IsTableLimitExceeded path. codex's "neither classifier sets it"
read is the accurate one.
Tests: 133 protocol + 36 ChannelRuntime streaming + 57 Lark tools,
all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the items the /opencode-pr-fix-check verdict flagged after the first review pass:

- TurnStreamingReplySink #14 (cap branch): clear `_pendingText` when the cap is hit so the reviewer-asked invariant "pending text is never left behind when we release `_dispatchInProgress`" actually holds. The earlier "preserve _pendingText for FinalizeAsync" comment was wrong: FinalizeAsync uses its own `text` parameter, not the stash, so clearing is harmless.
- TurnStreamingReplySink #17 (throttle gate): arm the deferred-flush timer inside the same `lock` acquisition that releases `_dispatchInProgress = false`. A concurrent OnDeltaAsync can no longer observe the (no-timer + not-dispatching) gap. The trailing outside-lock `armTimerDelay` block is gone.
- ConversationReplyGenerator: log a Warning at construction when SkillRegistry is wired but IRemoteSkillFetcher is missing — that config silently advertises use_skill yet can't pull remote skills. A single grepable line for ops.
- SkillRunnerGAgent: replace the drifting `SkillRunnerGAgent.cs:351` reference in the streaming-sink comment with the stable anchor `EnsureToolStatusAllowsCompletion`.
- Ornn.Tests.csproj: add an explicit ProjectReference to Aevatar.AI.ToolProviders.NyxId — the tests instantiate NyxIdApiClient directly, so depending on transitive flow-through from Ornn was fragile.
- MEAILLMProvider Patch.Set: documentation upgrade — call out the three layered mitigations (SDK pin in Directory.Packages.props, the AIComponentCoverageTests JSON-shape integration assertion, try/catch graceful degradation) and the upstream tracking path (file an openai-dotnet issue and retire the SCME0001 hack when a typed reasoning_content lands).

Build clean (Channel.Runtime, NyxidChat, Scheduled, MEAI, Ornn.Tests), 30 streaming-sink tests + 16 Ornn tests + 18 MEAI/coverage tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
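The #17 invariant — arm the timer before the in-progress flag is observably false — can be sketched with plain threading primitives (ThrottledSink and its members are illustrative names, not the C# sink's real API):

```python
import threading

class ThrottledSink:
    """Sketch of the #17 fix: the deferred-flush timer is armed inside
    the same lock region that clears the dispatch-in-progress flag, so
    a concurrent delta handler can never observe the gap state
    (no timer armed AND no dispatch running) and stall a pending flush."""

    def __init__(self, flush):
        self._lock = threading.Lock()
        self._dispatch_in_progress = False
        self._timer = None
        self._flush = flush

    def finish_dispatch(self, pending_left: bool):
        with self._lock:
            self._dispatch_in_progress = False
            # Armed before the lock is released: any concurrent reader
            # sees either a live timer or an active dispatch, never neither.
            if pending_left and self._timer is None:
                self._timer = threading.Timer(60.0, self._flush)
                self._timer.start()
```

The pre-fix bug shape was arming the timer in a trailing block outside the lock, leaving a window between releasing the lock (flag now false) and starting the timer.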
The "safety net" I added in PR #562 review #15 (commit 673809a) was itself the cause of a production outage on 2026-05-08:

    System.InvalidOperationException: NyxID base URL is not configured.
    at NyxIdRelayAuthValidator.ResolveDiscoveryUrl()
    at NyxIdChatEndpoints.HandleRelayWebhookAsync
    ...

`AddAevatarAIFeatures` runs `RegisterOrnnSkills` BEFORE the mainnet host calls `AddNyxIdTools`, so the safety-net `services.TryAddSingleton<NyxIdToolOptions>()` registered an empty default first; the host's real `AddNyxIdTools(o => { o.BaseUrl = ... })` then no-op'd because `TryAddSingleton(options)` saw the empty instance already registered. Every NyxID call (relay JWT validation, channel-relay/reply, proxy) exploded on a null BaseUrl, and Lark stopped replying entirely.

The right behaviour: hosts that enable Ornn skills MUST call `AddNyxIdTools` themselves. If they don't, DI resolution fails fast at startup with a clear "no service for NyxIdApiClient" — which is exactly the signal that needs to surface, not be papered over with empty placeholders.

Updated the docstring with a remarks block so the next person tempted to add this safety net sees the production-incident breadcrumb first. The reviewer concern that triggered #15 (a workflow host enabling Ornn without NyxId) is real but theoretical for this repo today; the cost of the wrong fix was concrete and severe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
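The ordering trap generalizes to any try-add style container: the first registration wins, however useless it is. A Python sketch (ServiceRegistry and the string keys are illustrative, not the .NET DI API):

```python
class ServiceRegistry:
    """Minimal stand-in for a DI container with try-add semantics:
    try_add_singleton no-ops when the key is already registered,
    mirroring TryAddSingleton."""

    def __init__(self):
        self._services = {}

    def try_add_singleton(self, key, instance):
        # setdefault only inserts when the key is absent -- the
        # first registration wins, all later ones silently no-op.
        self._services.setdefault(key, instance)

    def resolve(self, key):
        return self._services[key]

registry = ServiceRegistry()

# 1. The "safety net" runs first and plants an empty default.
registry.try_add_singleton("NyxIdToolOptions", {"base_url": None})

# 2. The host's real configuration then silently no-ops.
registry.try_add_singleton("NyxIdToolOptions", {"base_url": "https://nyxid.example"})

# Every consumer now gets the empty placeholder.
assert registry.resolve("NyxIdToolOptions")["base_url"] is None
```

Removing the first registration makes resolution fail fast at startup when the host forgets its own configuration — the loud signal the commit argues for.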
…ault

PR #590 shipped CardKit streaming behind a flag (default false), and the production cluster's ConfigMap doesn't carry an Aevatar:NyxId:Relay override (verified earlier — the deployed appsettings.Distributed.json has no Relay node), so flipping the flag in repo configuration was a no-op. Move the default to true in the C# field initializer so the aevatar Lark bot uses the card path out of the box.

Risk: low. PR #590's failure-routing already handles missing CardKit scopes — card.create returns the scope-error / rate-limit envelope, ConversationGAgent.HandleLarkCardStreamingChunkCoreAsync transitions to CreationFailed, and the turn falls through to the legacy edit-message sink for the rest of the chunks. The worst case for a deployment without cardkit:card:read + cardkit:card:write granted is one extra POST per turn before the fallback kicks in.

Deployments that explicitly want the legacy path (no Lark bot, or wanting to skip the card.create round-trip) can opt out via Aevatar:NyxId:Relay:StreamingCardKitEnabled=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lt test

The last commit flipped the StreamingCardKitEnabled default to true, which made ProcessAsync_StreamingEnabled_DispatchesChunkEventAndReadyEvent fail: the inbox now stamps card-mode chunks with the structurally distinct LlmReplyCardStreamChunkEvent proto type (the PR #590 routing scheme), while the test asserts the legacy LlmReplyStreamChunkEvent descriptor.

Two changes:
- The existing test sets StreamingCardKitEnabled=false explicitly; a comment notes that it deliberately exercises the legacy text-edit chunk shape.
- A new test, ProcessAsync_StreamingEnabledWithDefaultCardMode_DispatchesCardChunkEvent, pins the new default behaviour: omit the flag, observe LlmReplyCardStreamChunkEvent on the wire.

898 ChannelRuntime tests + 94 NyxId AI tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lark CardKit 2.0's open-apis/cardkit/v1/cards endpoints want the type-tagged `data` / `settings` / `card.data` payloads as JSON-encoded STRINGS, not inline objects. The merged PR #590 implementation always embedded those fields as inline JsonNode trees, which Lark rejects with code 9499 "Invalid parameter type in json: Data | Settings | Card". The production rollout (commit 36b8b5c flipped the StreamingCardKitEnabled default to true) made every Lark turn run card.create, hit the 400, log "Card create failed; falling back to text-edit", and silently spend the rest of the turn on the legacy edit-message path — so users still saw text-edit replies and concluded card mode was off (it was on, but the client was sending the wrong shape).

Verified each shape against the live endpoint via `nyxid proxy request`:

    POST  /cards                  data: STRING when type=card_json|card_id
                                  data: INLINE OBJECT when type=template
    PATCH /cards/{id}/settings    settings: STRING
    PUT   /cards/{id}             card: INLINE { type, data } where data
                                  follows the same string rule

Tests:
- Updated the existing pinning tests that asserted the wrong shape (the inline-object assertion was the contract bug).
- Added explicit cases for card_id (string) and template (inline object).
- Round-trip the card_json string back through JsonDocument.Parse so the test still pins the inner schema/streaming_mode fields.
- UpdateCardAsync now validates that CardJson parses (defensive client-side) and wraps it in `card.{type:"card_json", data:<string>}` automatically.
- All four CardKit methods reject blank / null DataJson upfront.

59 Lark CardKit tests + 37 ChannelRuntime card tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
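The string-vs-inline distinction is easy to miss because both shapes are valid JSON. A Python sketch of the two encodings for the type=card_json case (the inner card body here is invented; only the wrapping rule comes from the commit):

```python
import json

# Hypothetical inner card body; field names follow the commit's examples.
card_body = {"schema": "2.0", "config": {"streaming_mode": True}}

# WRONG (pre-fix): `data` embedded as an inline object.
# This is the shape Lark rejects with code 9499.
rejected = {"type": "card_json", "data": card_body}

# RIGHT: for type=card_json, `data` is the card body serialized to a
# JSON-encoded string -- JSON nested inside JSON.
accepted = {"type": "card_json", "data": json.dumps(card_body)}

assert isinstance(rejected["data"], dict)
assert isinstance(accepted["data"], str)

# Round-tripping the string restores the inner fields, which is how the
# updated tests keep pinning schema / streaming_mode despite the wrapper.
assert json.loads(accepted["data"])["config"]["streaming_mode"] is True
```

The subtlety the table above captures is that the rule is per-type: template payloads stay inline while card_json / card_id payloads must be strings, so a single serializer path cannot handle all four endpoints.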
Production incident 2026-05-08: a Lark turn issued an `ssh_exec` tool
call to sg-office-network. NyxID accepted the POST to
`/api/v1/ssh/{catalog_id}/exec` at 07:17:47 UTC and never returned a
response. Direct curl reproduces the call in ~2s (NyxID side
duration_ms:1666), so the hang is something between aevatar's HttpClient
and NyxID's SSH gateway / NodeAgent — most likely the per-user single-
session SSH lock with a stale prior session. With no client-side cap,
the call sat past the inbox runtime's 120s timeout, the user got the
generic `llm_reply_timeout` fallback ("Sorry, this took too long..."),
and the actual SSH attempt was never observable to the model.
Two layered fixes:
1. NyxIdSshExecTool now wraps the SshExecAsync call in a linked CTS that
cancels at `timeout_secs + 15s` (the +15s gives NyxID's own server-
side timeout a margin to respond with `timed_out: true` before we
give up). On timeout the tool returns
{"error":"ssh_timeout","detail":"NyxID did not return ... within Ns"}
so the LLM sees a real tool error and can either retry against a
different host or summarize "couldn't reach the SSH gateway" instead
of dragging the run.
2. NyxIdApiClient.SendAsync now lets `OperationCanceledException`
propagate. Previously the catch-all wrapped it as
{"error":true,"message":"A task was canceled."}
which silently shadowed any per-call CTS the caller installed on top
of the LLM run's CT (the test for this fix caught it: tool's catch
never fired). Cancellation is a control-flow signal, not an HTTP
error envelope. Other callers of NyxIdApiClient already work with
either path because the LLM run's CT cancellation also unwinds the
inbox runtime via OperationCanceledException.
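The two fixes compose into one pattern: cap the transport call with a margin, convert only the timeout into a structured tool error, and let genuine cancellation propagate. A Python asyncio sketch (ssh_exec_with_cap and its parameters are illustrative names, not the C# tool's API):

```python
import asyncio
import json

async def ssh_exec_with_cap(call, timeout_secs: int, margin_secs: int = 15):
    """Sketch of fix 1: cap the transport call at timeout_secs + margin
    so the server-side timeout gets a chance to answer with
    timed_out:true before the client gives up. `call` stands in for the
    real SshExecAsync."""
    cap = timeout_secs + margin_secs
    try:
        return await asyncio.wait_for(call(), timeout=cap)
    except asyncio.TimeoutError:
        # A structured tool error the LLM can act on (retry another
        # host, or summarize the failure) instead of dragging the run
        # into the inbox watchdog's generic fallback.
        return json.dumps({"error": "ssh_timeout",
                           "detail": f"NyxID did not return within {cap}s"})
    # Deliberately no `except asyncio.CancelledError`: per fix 2, the
    # caller's own cancellation propagates as control flow, not as an
    # error envelope that shadows it.

async def _demo():
    async def hangs():
        await asyncio.sleep(3600)   # simulates the stuck SSH gateway
    # Tiny cap so the sketch finishes quickly.
    return await ssh_exec_with_cap(hangs, timeout_secs=1, margin_secs=0)

result = json.loads(asyncio.run(_demo()))
assert result["error"] == "ssh_timeout"
```

The key property is asymmetry: timeout becomes data the model can read, while cancellation stays an exception that unwinds the run.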
Tests:
- New ExecuteAsync_HardTimesOut_WhenNyxIdHangsOnSshPost exercises the
full fail mode via a `MapHanging` route on the test PathHandler that
awaits Timeout.Infinite until the cancellation token fires.
- Existing 44 SSH + ApiClient tests still pass; 539-test AI suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production observation 2026-05-08: a "pull the sg office device IPs"
turn generated a perfectly-shaped streaming-card reply but ran out of
LLM turn budget mid-flush — the user saw the generic "took too long"
fallback instead of the real answer. The bot did the work; we just
cut it off too early.
Multi-step Lark turns realistically need:
- ornn skill-search / fetch SKILL.md        ~10s
- ssh_exec to a remote host                 ~30-60s (45s is the new
  client-side hard cap on a stuck NyxID SSH gateway)
- LLM thinking + final streaming dispatch   ~30-90s
- per-tool jitter + connection setup        ~10-20s
                                            ──────
                                            ~80-180s typical
120s left no margin once anything but the happy path happened. 300s
buys headroom without giving up the watchdog: a true infinite hang
still resolves to the "took too long" fallback so the user is not
stuck on a "..." reaction forever.
Bumped both knobs to keep them in sync:
- NyxIdRelayOptions.ResponseTimeoutSeconds default 120 → 300
- ChannelLlmReplyInboxRuntime.FallbackTimeoutSecondsDefault 120 → 300
Deployments that want a tighter / looser cap still set
Aevatar:NyxId:Relay:ResponseTimeoutSeconds in config; 0 / negative
still means "no cap" (host has its own watchdog) per the existing
ResolveFallbackTimeout contract.
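The existing contract the paragraph above references can be sketched as a small resolver (resolve_fallback_timeout is an illustrative Python stand-in for the C# ResolveFallbackTimeout; the 300s default comes from this commit):

```python
from typing import Optional

def resolve_fallback_timeout(configured_seconds: Optional[int],
                             default_seconds: int = 300) -> Optional[int]:
    """Unset -> the (newly bumped) 300s default; 0 or negative ->
    explicitly no client-side cap, deferring to the host's own
    watchdog; any positive value -> used as-is."""
    if configured_seconds is None:
        return default_seconds
    if configured_seconds <= 0:
        return None          # "no cap" sentinel
    return configured_seconds

assert resolve_fallback_timeout(None) == 300   # default after this commit
assert resolve_fallback_timeout(0) is None     # host watchdog owns it
assert resolve_fallback_timeout(45) == 45      # explicit override wins
```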
Existing tests that pin specific timeout values
(e.g. ResponseTimeoutSeconds=1 in the timeout-fallback test) are
unaffected; tests that didn't set the option get 300s by default
which is even further from firing for fast tests. 16 inbox runtime +
24 SSH tool tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /daily Lark slash command and the agent template behind it were named daily_report internally — six different casings (daily_report, DailyReport, dailyReport, daily-report, plus compound identifiers like daily_report_form, BuildDailyReportForm, CreateDailyReportAgentAsync). The user-facing command has always been /daily; the internal _report suffix added no signal and made it harder to talk about the skill in docs and Ornn.

Strategy:
- Sed-driven global rename across .cs / .json / .md / .proto / .yaml / csproj. 281 occurrences across 26 files. Order: kebab → camel → Pascal → snake (no shared substrings, so the order is defensive).
- Renamed compound identifiers (daily_report_form → daily_form, BuildDailyReportForm → BuildDailyForm, etc.). All callers updated by the same pass.
- Three test inputs that intentionally used /daily_report as a fake unknown-command literal (testing "this is not a real command, return unknown usage") now collide with the real /daily command after the rename. Dropped those three [InlineData] entries — the same code path is still covered by the /foobar and / cases that remain.

Persistence note: any scheduled agent records persisted with template="daily_report" before this commit will not match the new template="daily" lookup. The current production /daily flow is broken anyway (it always returns eanzhao's data and hangs past the 120s budget, per this morning's incident review), so a clean cutover is acceptable — re-run /daily and the rebuilt agent will land under the new name.

This commit is naming-only. The next commit (separate) replaces the template body to consume the Ornn-hosted `daily` skill via use_skill, the way sg-office-network is reached via ssh_exec.

Build clean across the affected projects (Authoring.Lark, Scheduled, NyxidChat). 895 ChannelRuntime + 17 Lark Platform tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
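The ordered multi-pass rename in the strategy bullet amounts to a plain sequence of literal replaces; a Python sketch (the mapping table abbreviates the real 281-occurrence sed pass):

```python
# The four casings from the commit, applied in the stated order. With
# no shared substrings between the patterns the order is only
# defensive, but fixing kebab -> camel -> Pascal -> snake guarantees a
# later pass can never half-rewrite the output of an earlier one.
RENAMES = [
    ("daily-report", "daily"),   # kebab
    ("dailyReport",  "daily"),   # camel
    ("DailyReport",  "Daily"),   # Pascal
    ("daily_report", "daily"),   # snake
]

def rename(text: str) -> str:
    for old, new in RENAMES:
        text = text.replace(old, new)
    return text

# Compound identifiers fall out of the same pass:
assert rename("BuildDailyReportForm") == "BuildDailyForm"
assert rename("daily_report_form") == "daily_form"
assert rename('template="daily_report"') == 'template="daily"'
```

The Pascal entry maps to "Daily" rather than "daily" so identifiers like CreateDailyReportAgentAsync keep their casing after the rewrite.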
`/daily` and `/social-media` are no longer in-tree commands. Recipes for new persistent agents now live as Ornn skills — the LLM searches via `ornn_search_skills` (or matches a slash-command name to a skill query) and follows the SKILL.md verbatim. `agent_builder` keeps only the lifecycle surface (`list_agents`, `agent_status`, `run_agent`, `disable_agent`, `enable_agent`, `delete_agent`).

The `social_media` template, the `WorkflowAgentGAgent` actor, the `twitter_publish` workflow module, the `IScopeWorkflowCommandPort` LLM plumbing in `agent_builder`, and all related card-flow surfaces are deleted with no replacement; the missing scope-scoped workflow tools needed to bring social-media back as an Ornn skill are tracked in issue #598.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…continuation-phase-a [codex] Refactor LLM reply continuation into agent runs
No description provided.