feat: full-duplex voice, faster MiniMax model, README rewrite#24

Open
tangxiya-star wants to merge 1 commit into main from feat/fullduplex-voice-readme

@tangxiya-star

Collaborator

What

Three independent improvements, all cleanly rebased on top of latest main.

Agent voice loop (agent/src/agent.py)

  • Default model → MiniMax-M2.1-highspeed at reasoning_effort: "low". This was the snappiest combination we measured for voice time to first audio (TTFA): ~200 think tokens / ~6s per verdict vs ~280 / ~8s on M2.7-highspeed.
  • reasoning_split: true so the model's <think> chain-of-thought goes to a separate field and can never leak into the spoken reply (we'd seen fragments like "The user…" escape the SDK's tag stripping and get spoken aloud).
  • Interruptible announcements (allow_interruptions=True) — blocking interruptions left the contractor unable to get a word in while announcements stacked up.
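
For reviewers who want the model settings in one place, here is a minimal sketch of how they might be assembled for an OpenAI-compatible chat call. The model name, `reasoning_effort`, and the `reasoning_split` entry in `extra_body` come from this PR; the helper names `build_completion_kwargs` and `spoken_text`, and the defensive tag strip, are hypothetical and not taken from agent/src/agent.py.

```python
import re
from typing import Any

# Values quoted from the PR description.
MODEL = "MiniMax-M2.1-highspeed"
REASONING_EFFORT = "low"

# Matches a complete <think>...</think> block, including newlines.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def build_completion_kwargs(messages: list[dict[str, str]]) -> dict[str, Any]:
    """Hypothetical helper: keyword arguments for an OpenAI-style chat call."""
    return {
        "model": MODEL,
        "messages": messages,
        "reasoning_effort": REASONING_EFFORT,
        # Asks the provider to return chain-of-thought in a separate field
        # instead of inline in the reply text (per the PR description).
        "extra_body": {"reasoning_split": True},
    }


def spoken_text(message: dict[str, Any]) -> str:
    """Hypothetical helper: return only the text that should reach TTS.

    With reasoning_split enabled the content should already be clean; the
    regex is a fallback assumed here, not something the PR describes.
    """
    content = message.get("content") or ""
    return _THINK_RE.sub("", content).strip()
```

Keeping the strip as a fallback means a provider-side regression would degrade to the old behavior rather than speak reasoning text aloud.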

iOS full-duplex mic (app-ios/.../VoiceAgentSession.swift)

  • Keep the mic open for the whole session instead of closing it whenever the agent speaks. The old half-duplex rule made the user inaudible for most of the conversation. Echo is handled by the .voiceChat AVAudioSession AEC plus the agent's interruption-word gating. Pairs with the interruptible announcements above.
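
With the mic always open, the agent-side gate decides whether incoming speech should cut off an announcement. The sketch below illustrates one way interruption-word gating could work: residual echo or a stray filler word is ignored, while a longer utterance interrupts. The function name `should_interrupt` and the two-word threshold are assumptions for illustration, not the values used in this PR.

```python
def should_interrupt(transcript: str, agent_speaking: bool, min_words: int = 2) -> bool:
    """Hypothetical gate: interrupt only on a sufficiently long utterance.

    transcript: interim speech-to-text result from the open mic.
    agent_speaking: whether an announcement is currently playing.
    min_words: assumed threshold; short fragments are treated as echo or noise.
    """
    if not agent_speaking:
        # Nothing is playing, so there is nothing to interrupt.
        return False
    return len(transcript.split()) >= min_words
```

For example, `should_interrupt("uh", agent_speaking=True)` is `False`, while `should_interrupt("wait hold on", agent_speaking=True)` is `True`.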

Repo

  • README rewritten into the full project overview (demo loop, architecture diagram, stack, what's working).
  • Track agent/livekit.toml (LiveKit Cloud agent config).
  • Remove the empty vision/ placeholder.

Notes

  • Rebased onto origin/main; a stale local events.py/test_events.py raw-number experiment was dropped in favor of main's newer word-spelling reading (which already lists every span and matches the agent prompt).
  • Personal Xcode code-signing changes (DEVELOPMENT_TEAM, .xiya bundle id) were intentionally excluded.

Test

  • pytest: 58 pass. One live-API behavioral test (test_calls_building_code_tool_for_code_question) is flaky — it passed 3/3 on re-run; the model occasionally asks "what city?" instead of calling the tool.

🤖 Generated with Claude Code

Agent voice loop:
- Default to MiniMax-M2.1-highspeed at reasoning_effort "low" (snappiest
  measured combo: ~200 think tokens / ~6s vs ~280 / ~8s on M2.7).
- Add extra_body reasoning_split so the model's <think> chain-of-thought
  can never leak into the spoken reply.
- Make proactive announcements interruptible (allow_interruptions=True) so
  the contractor can talk over a stacked-up announcement.

iOS:
- Keep the mic open full-duplex for the whole session instead of closing it
  while the agent speaks (half-duplex left the user inaudible). Echo is
  handled by .voiceChat AEC + the agent's interruption-word gating.

Repo:
- Rewrite README into the full project overview (loop, architecture, stack).
- Track agent/livekit.toml (LiveKit Cloud agent config).
- Remove empty vision/ placeholder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>