Conversation


@Sahilgul Sahilgul commented Jan 17, 2026

Realtime STT Token Usage Tracking for OpenAI and Duration-Based Providers

This PR implements a previously missing capability: comprehensive STT token usage tracking in the LiveKit Agents framework, covering both token-based and duration-based providers.

Our team was working on tracking STT usage for cost and metrics analysis, but we discovered that GPT-4o Transcribe did not populate token counts in STTMetrics, even though the API does provide them. After investigating, I implemented this missing feature.

Reference: OpenAI GPT-4o Transcribe documentation

Problem Statement

The LiveKit Agents framework previously did not track STT token usage consistently:

  • Token Fields: Providers like GPT-4o Transcribe return input/output token counts, but these were not captured.
  • Duration-Only Providers: Whisper and Azure STT report only audio duration, and the corresponding token fields were either None or missing entirely.
  • Metrics Consistency: A unified STTMetrics structure was needed to support both token-based and duration-based providers.
  • Usage Analysis: Applications could not accurately monitor STT costs or token consumption.

Solution

This PR introduces comprehensive STT metrics tracking across the agents framework (a sketch of the new STTMetrics fields follows this list):

  • STTMetrics: Added input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens fields with default 0 values.
  • UsageCollector: Extended to accumulate STT token metrics alongside LLM and TTS metrics.
  • Base STT class: Updated recognize() method to extract and emit token usage from SpeechEvent.
  • OpenAI Plugin: Parses token counts from transcription API responses.
  • Duration-Only Providers (Whisper, Azure): Token fields remain 0, but audio_duration is captured.
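
For reference, the new fields look roughly like this. This is a simplified sketch of the STTMetrics dataclass based on the changes in this PR; the existing fields (request_id, audio_duration, streamed, and so on) are omitted here.

from dataclasses import dataclass


@dataclass
class STTMetrics:
    # Existing fields (request_id, audio_duration, streamed, ...) omitted in this sketch.
    # All token counters default to 0, so duration-only providers keep working unchanged.
    input_tokens: int = 0
    """Total input tokens used (audio + text tokens)."""
    output_tokens: int = 0
    """Total output tokens generated."""
    total_tokens: int = 0
    """Total tokens used (input + output)."""
    audio_tokens: int = 0
    """Number of audio tokens in the input."""
    text_tokens: int = 0
    """Number of text tokens in the input (e.g. from a prompt)."""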

Core Architecture Changes

  • agents/metrics/base.py: New token fields in STTMetrics.
  • agents/metrics/usage_collector.py: Collects STT token metrics per request.
  • agents/stt/stt.py: Extracts token usage when available and preserves backward compatibility (see the sketch after this list).
  • plugins/openai/stt.py: Parses input/output token usage for supported models (GPT-4o Transcribe, whisper-1).
  • plugins/azure/stt.py (optional): Duration tracking remains, token fields default to 0.
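
To make the plumbing concrete, here is a minimal sketch of the STTTokenUsage TypedDict and the defaulting logic recognize() applies. The TypedDict shape follows the fields named in this PR; the helper function and its name are illustrative, not the exact code.

from __future__ import annotations

from typing import Optional, TypedDict


class STTTokenUsage(TypedDict, total=False):
    """Token usage reported by token-based STT providers; every field is optional."""

    input_tokens: int
    output_tokens: int
    total_tokens: int
    audio_tokens: int
    text_tokens: int


def token_fields(token_usage: Optional[STTTokenUsage]) -> dict[str, int]:
    """Return the token counters for STTMetrics, defaulting each one to 0
    when a provider (e.g. Whisper, Azure) reports no token usage."""
    usage = token_usage or {}
    return {
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "audio_tokens": usage.get("audio_tokens", 0),
        "text_tokens": usage.get("text_tokens", 0),
    }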

Example OpenAI API Response (generic text)

Transcription(
    text='This is an example transcription for testing purposes.',
    logprobs=None,
    usage=UsageTokens(
        input_tokens=393,
        output_tokens=183,
        total_tokens=576,
        type='tokens',
        input_token_details=UsageTokensInputTokenDetails(
            audio_tokens=386,
            text_tokens=7
        )
    )
)

Key Changes

  • STTMetrics now supports both token-based and duration-only STT providers.
  • UsageCollector aggregates STT metrics for billing and analysis (see the usage sketch after this list).
  • Token counts are emitted per request with request_id linking metrics to the transcription.
  • Backward compatible: fields default to 0, no breaking changes to existing code.
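
On the consuming side, an application can read the aggregated totals roughly like this. This is a minimal sketch: it assumes the usual metrics_collected event pattern from the agents framework and the stt_*_tokens summary fields added by this PR.

from livekit.agents.metrics import UsageCollector

usage_collector = UsageCollector()


# Metrics (STT, LLM, TTS) are typically delivered via a "metrics_collected"
# event on the session; feed each batch into the collector as it arrives.
def on_metrics_collected(ev) -> None:
    usage_collector.collect(ev.metrics)


# Later (e.g. at shutdown), read the aggregated STT token counters.
summary = usage_collector.get_summary()
print(
    "stt input:", summary.stt_input_tokens,
    "output:", summary.stt_output_tokens,
    "total:", summary.stt_total_tokens,
    "audio:", summary.stt_audio_tokens,
    "text:", summary.stt_text_tokens,
)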

Benefits

  • Consistent Metrics: Token and duration metrics unified across providers.
  • Cost Monitoring: Token-based usage available for billing/analytics (OpenAI).
  • Backward Compatibility: Duration-only providers like Whisper continue to work without errors.
  • Observability: All STT requests now produce metrics that can be logged or traced.

Notes

  • Whisper returns only duration (UsageDuration) → token counts are 0.
  • GPT-4o Transcribe returns token counts → fields populated in STTMetrics.
  • Azure and other duration-based providers: audio duration tracked, tokens default to 0.
  • No breaking changes; default values ensure older workflows continue to work.

Summary by CodeRabbit

  • New Features

    • Expanded STT reporting: speech events now include detailed token usage (input, output, total, audio, text), and these counts are aggregated into overall usage summaries.
    • New usage emission for final transcripts includes reported audio duration.
  • Tests

    • Adjusted fake audio playback timing to better bound and report playback position for playback-finished callbacks.



CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Jan 17, 2026

📝 Walkthrough

Adds STT token-usage fields and threads token usage from STT plugins (OpenAI) through SpeechEvent -> STTMetrics -> UsageCollector -> UsageSummary; also schedules Azure recognition-usage emission and tweaks FakeAudioOutput playback-duration computation.

Changes

  • Metrics dataclasses (livekit-agents/livekit/agents/metrics/base.py, livekit-agents/livekit/agents/metrics/usage_collector.py): Added STT token fields (input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens) to STTMetrics and corresponding accumulator fields (stt_*_tokens) to UsageSummary; UsageCollector.collect() aggregates these counters.
  • STT core (livekit-agents/livekit/agents/stt/stt.py): Added STTTokenUsage TypedDict and SpeechEvent.token_usage; recognize() extracts token usage (defaulting to 0) and populates STTMetrics with token fields.
  • OpenAI STT plugin (livekit-plugins/livekit-plugins-openai/.../openai/stt.py): _recognize_impl() reads token usage from the OpenAI response (input/output/total plus input token details such as audio/text) and attaches token_usage to the emitted SpeechEvent.
  • Azure STT plugin (livekit-plugins/livekit-plugins-azure/.../azure/stt.py): After the final transcript, computes audio_duration and schedules _emit_recognition_usage(request_id, audio_duration) to publish a RECOGNITION_USAGE event (audio duration payload).
  • Tests / Utilities (tests/fake_io.py): FakeAudioOutput.clear_buffer now computes a single clamped played_duration (bounded by the pushed duration) and uses it as playback_position when invoking on_playback_finished, aligning callback timing with elapsed time.

Sequence Diagram(s)

sequenceDiagram
    participant OpenAI as OpenAI API
    participant OpenAIPlugin as OpenAI STT Plugin
    participant STT as STT Engine
    participant Collector as UsageCollector
    participant Summary as UsageSummary

    OpenAI->>OpenAIPlugin: response (transcript + usage)
    OpenAIPlugin->>OpenAIPlugin: extract token_usage (input/output/total/audio/text)
    OpenAIPlugin->>STT: emit SpeechEvent (with token_usage)
    STT->>STT: construct STTMetrics (include token fields)
    STT->>Collector: emit STTMetrics
    Collector->>Collector: aggregate stt_*_tokens
    Collector->>Summary: update UsageSummary token fields

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I count small tokens in a hop and a run,

Input, output, total—each one is fun.
Audio and text in little neat stacks,
I stash them in metrics and never look back,
A carrot for data, my tally is done 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: adding comprehensive STT token usage tracking fields and infrastructure to the metrics system across multiple files.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 917cfd1 and b6fc2bc.

📒 Files selected for processing (2)
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • tests/fake_io.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (2)
livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py (2)

285-310: Good addition of duration-based usage emission after final transcript.

This aligns with the unified STT metrics goal and keeps duration-only providers reporting usage.


312-321: Helper method for recognition usage emission is clear and scoped.

Encapsulating the usage event emission here keeps the callback flow tidy.
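
For orientation, a helper of that shape would look roughly like the sketch below. It is assembled from the review notes in this thread (the RECOGNITION_USAGE event, the RecognitionUsage payload, the event channel, and the call_soon_threadsafe scheduling at the call site), not copied from the PR; the exact SpeechEvent keyword names and the suppressed exception type are assumptions.

import contextlib

from livekit.agents import stt


class AzureSpeechStreamSketch:
    # Only the piece relevant to the sketch; the real stream class holds the
    # Azure recognizer, event channel, and event loop set up elsewhere.
    def _emit_recognition_usage(self, request_id: str, audio_duration: float) -> None:
        # Scheduled from the Azure SDK callback via loop.call_soon_threadsafe(...),
        # so it runs on the event loop and can publish directly on the channel.
        with contextlib.suppress(Exception):  # assumption: mirrors the existing suppress
            self._event_ch.send_nowait(
                stt.SpeechEvent(
                    type=stt.SpeechEventType.RECOGNITION_USAGE,
                    request_id=request_id,
                    recognition_usage=stt.RecognitionUsage(audio_duration=audio_duration),
                )
            )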



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/metrics/base.py`:
- Around line 41-53: Remove the trailing whitespace on the empty line preceding
the token fields to satisfy the linter, and delete the temporary review comment
"# NEW: Token usage fields" (or replace it with a concise docstring/header)
since each field already has a docstring; update the block with the attributes
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens in
livekit.agents.metrics.base (the variables named input_tokens, output_tokens,
total_tokens, audio_tokens, text_tokens) so only the documented fields remain
and no trailing spaces exist.

In `@livekit-agents/livekit/agents/metrics/usage_collector.py`:
- Around line 25-32: Remove the trailing whitespace on the blank line following
the stt_text_tokens field in the UsageCollector dataclass (the lines defining
stt_input_tokens, stt_output_tokens, stt_total_tokens, stt_audio_tokens,
stt_text_tokens); edit the file to delete the trailing space characters at the
end of that line (or remove the empty line entirely) and re-run the linter to
ensure W293 is resolved.

In `@livekit-agents/livekit/agents/stt/stt.py`:
- Around line 166-182: The blank lines surrounding the token-extraction block
contain trailing whitespace; remove trailing spaces on the empty lines around
the code that handles event._token_usage (the block that sets input_tokens,
output_tokens, total_tokens, audio_tokens, text_tokens) so there are truly blank
lines without trailing whitespace and ruff W293 is resolved.

In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py`:
- Around line 439-471: Remove the trailing whitespace on the blank line (fix the
ruff lint) and stop assigning a dynamic attribute `_token_usage` to SpeechEvent;
instead add an optional, typed field to the SpeechEvent dataclass (e.g.,
token_usage: Optional[RecognitionUsage] or a new dataclass with
input_tokens/output_tokens/total_tokens/audio_tokens/text_tokens) or extend the
existing RecognitionUsage type to include audio_tokens/text_tokens, then set
that typed field when constructing stt.SpeechEvent (the constructed symbol is
stt.SpeechEvent with type stt.SpeechEventType.FINAL_TRANSCRIPT and alternatives
[sd]) so mypy strict mode no longer reports attr-defined errors.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d97b05 and 309d764.

📒 Files selected for processing (4)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/metrics/base.py
🧬 Code graph analysis (3)
livekit-agents/livekit/agents/stt/stt.py (1)
livekit-agents/livekit/agents/metrics/base.py (1)
  • STTMetrics (30-54)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • stt (2773-2774)
livekit-agents/livekit/agents/voice/agent.py (1)
  • stt (508-518)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)
livekit-agents/livekit/agents/telemetry/http_server.py (1)
  • metrics (18-35)
🪛 GitHub Check: ruff
livekit-agents/livekit/agents/stt/stt.py

[failure] 182-182: Ruff (W293)
livekit-agents/livekit/agents/stt/stt.py:182:1: W293 Blank line contains whitespace


[failure] 173-173: Ruff (W293)
livekit-agents/livekit/agents/stt/stt.py:173:1: W293 Blank line contains whitespace

livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py

[failure] 450-450: Ruff (W293)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py:450:1: W293 Blank line contains whitespace

livekit-agents/livekit/agents/metrics/usage_collector.py

[failure] 32-32: Ruff (W293)
livekit-agents/livekit/agents/metrics/usage_collector.py:32:1: W293 Blank line contains whitespace

livekit-agents/livekit/agents/metrics/base.py

[failure] 41-41: Ruff (W293)
livekit-agents/livekit/agents/metrics/base.py:41:1: W293 Blank line contains whitespace

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.13)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.9)
🔇 Additional comments (5)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)

458-461: LGTM - SpeechEvent creation logic is correct.

The event creation preserves the existing behavior for FINAL_TRANSCRIPT events while adding token usage metadata.

livekit-agents/livekit/agents/stt/stt.py (2)

183-199: LGTM - STTMetrics construction correctly includes token fields.

The token fields are properly extracted and passed to the STTMetrics constructor, maintaining consistency with the field definitions in base.py.


390-402: Streaming metrics don't include token usage.

The _metrics_monitor_task creates STTMetrics for streaming recognition without extracting token usage. While the fields default to 0, this creates an inconsistency where batch recognize() reports tokens but streaming doesn't.

If the realtime API doesn't provide token data, this is expected behavior. Otherwise, consider extracting token usage from RECOGNITION_USAGE events similar to the batch path.

livekit-agents/livekit/agents/metrics/usage_collector.py (1)

96-102: LGTM - STT token aggregation logic is correct.

The collection pattern correctly mirrors the existing LLMMetrics accumulation, properly aggregating all five token fields from STTMetrics.

livekit-agents/livekit/agents/metrics/base.py (1)

43-52: LGTM - Token fields are well-defined with appropriate defaults.

The token usage fields are correctly typed with sensible defaults (0) ensuring backward compatibility. The docstrings clearly explain the purpose of each field, distinguishing between total tokens, audio tokens, and text tokens.


Comment on lines 41 to 53

# NEW: Token usage fields
input_tokens: int = 0
"""Total input tokens used (audio + text tokens)."""
output_tokens: int = 0
"""Total output tokens generated."""
total_tokens: int = 0
"""Total tokens used (input + output)."""
audio_tokens: int = 0
"""Number of audio tokens in input."""
text_tokens: int = 0
"""Number of text tokens in input (e.g., from prompt)."""


⚠️ Potential issue | 🟡 Minor

Fix trailing whitespace and consider removing temporary comment.

  1. Linting error (Line 41): Blank line contains trailing whitespace - flagged by ruff.

  2. Code hygiene: The # NEW: Token usage fields comment is useful during review but could be removed before merge since the docstrings adequately document each field's purpose.

Proposed fix
     """Whether the STT is streaming (e.g using websocket)."""
-    
-    # NEW: Token usage fields
+
     input_tokens: int = 0
🧰 Tools
🪛 GitHub Check: ruff

[failure] 41-41: Ruff (W293)
livekit-agents/livekit/agents/metrics/base.py:41:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/metrics/base.py` around lines 41 - 53, Remove
the trailing whitespace on the empty line preceding the token fields to satisfy
the linter, and delete the temporary review comment "# NEW: Token usage fields"
(or replace it with a concise docstring/header) since each field already has a
docstring; update the block with the attributes input_tokens, output_tokens,
total_tokens, audio_tokens, and text_tokens in livekit.agents.metrics.base (the
variables named input_tokens, output_tokens, total_tokens, audio_tokens,
text_tokens) so only the documented fields remain and no trailing spaces exist.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py`:
- Around line 439-471: The current construction of stt.SpeechEvent sets
token_usage to None when input/output/total tokens are zero, which drops
audio_tokens/text_tokens if only detailed counts exist; update the logic in the
block that builds token_usage (around resp/usage handling and the
stt.SpeechEvent creation) so you always populate the token_usage dict with
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens and
then set token_usage to that dict if any of those five values is non-zero (e.g.,
use a any(...) check on the dict values) instead of checking only
input/output/total; reference the resp/usage extraction and the
stt.SpeechEvent(...) call to locate where to change the condition.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/metrics/base.py (1)

41-52: Remove the temporary comment and trailing whitespace.

The inline note is no longer needed, and the blank line appears to include whitespace (ruff W293).

🧹 Suggested cleanup
-    
-    # NEW: Token usage fields
+
     input_tokens: int = 0
livekit-agents/livekit/agents/metrics/usage_collector.py (1)

25-31: Remove trailing whitespace after the STT token fields.

The blank line after stt_text_tokens appears to contain whitespace (ruff W293).

🧹 Suggested cleanup
     stt_audio_tokens: int = 0
     stt_text_tokens: int = 0
-
+
     # properties for naming consistency: prompt = input, completion = output
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 309d764 and 059af5f.

📒 Files selected for processing (4)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • livekit-agents/livekit/agents/stt/stt.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)
livekit-agents/livekit/agents/telemetry/http_server.py (1)
  • metrics (18-35)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
  • GitHub Check: unit-tests
🔇 Additional comments (1)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)

95-101: Aggregation looks correct.

STT token fields are accumulated consistently with the new metrics.


Comment on lines +439 to +471
            # Extract token usage if available
            input_tokens = 0
            output_tokens = 0
            total_tokens = 0
            audio_tokens = 0
            text_tokens = 0
            if hasattr(resp, "usage") and resp.usage:
                usage = resp.usage
                input_tokens = getattr(usage, "input_tokens", 0)
                output_tokens = getattr(usage, "output_tokens", 0)
                total_tokens = getattr(usage, "total_tokens", 0)

                # Extract detailed token breakdown
                if hasattr(usage, "input_token_details") and usage.input_token_details:
                    details = usage.input_token_details
                    audio_tokens = getattr(details, "audio_tokens", 0)
                    text_tokens = getattr(details, "text_tokens", 0)

            # Create the speech event with token usage
            speech_event = stt.SpeechEvent(
                type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                alternatives=[sd],
                token_usage={
                    "input_tokens": input_tokens,
                    "output_tokens": output_tokens,
                    "total_tokens": total_tokens,
                    "audio_tokens": audio_tokens,
                    "text_tokens": text_tokens,
                }
                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
                else None,
            )
            return speech_event

⚠️ Potential issue | 🟡 Minor

Don’t drop audio/text usage when totals are missing.

If only detailed tokens are present, token_usage becomes None and metrics lose audio/text counts.

✅ Suggested fix
-            speech_event = stt.SpeechEvent(
+            has_usage = any(
+                token > 0
+                for token in (input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens)
+            )
+            speech_event = stt.SpeechEvent(
                 type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                 alternatives=[sd],
                 token_usage={
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens,
                     "total_tokens": total_tokens,
                     "audio_tokens": audio_tokens,
                     "text_tokens": text_tokens,
                 }
-                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
-                else None,
+                if has_usage
+                else None,
             )
🤖 Prompt for AI Agents
In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py` around
lines 439 - 471, The current construction of stt.SpeechEvent sets token_usage to
None when input/output/total tokens are zero, which drops
audio_tokens/text_tokens if only detailed counts exist; update the logic in the
block that builds token_usage (around resp/usage handling and the
stt.SpeechEvent creation) so you always populate the token_usage dict with
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens and
then set token_usage to that dict if any of those five values is non-zero (e.g.,
use a any(...) check on the dict values) instead of checking only
input/output/total; reference the resp/usage extraction and the
stt.SpeechEvent(...) call to locate where to change the condition.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fake_io.py (1)

88-99: Duplicate on_playback_finished call and _pushed_duration reset.

The method calls on_playback_finished twice with identical parameters, and resets _pushed_duration = 0.0 twice. Per the base class implementation in io.py, calling on_playback_finished more than expected triggers a warning: "playback_finished called more times than playback segments were captured". The second call (lines 94-98) and second reset (line 99) appear to be accidental duplication.

🐛 Proposed fix to remove duplicate code
         self.on_playback_finished(
             playback_position=played_duration,
             interrupted=True,
             synchronized_transcript=None,
         )
         self._pushed_duration = 0.0
-        self.on_playback_finished(
-            playback_position=played_duration,
-            interrupted=True,
-            synchronized_transcript=None,
-        )
-        self._pushed_duration = 0.0
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f5e06a and ed74c13.

📒 Files selected for processing (1)
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • tests/fake_io.py
🧬 Code graph analysis (1)
tests/fake_io.py (4)
livekit-agents/livekit/agents/voice/io.py (1)
  • on_playback_finished (191-218)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • on_playback_finished (379-486)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (2)
  • on_playback_finished (554-579)
  • synchronized_transcript (281-285)
livekit-agents/livekit/agents/voice/speech_handle.py (1)
  • interrupted (83-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.9)
  • GitHub Check: type-check (3.13)
🔇 Additional comments (1)
tests/fake_io.py (1)

83-87: LGTM!

The explicit calculation with clamping between [0, _pushed_duration] ensures valid playback position bounds and the comments clearly explain the rationale.


bml1g12 (Contributor) commented Jan 17, 2026

Excited to see this PR, as I noticed this week that my STT was missing the data required to track costs. After this PR, any idea whether ElevenLabs and Azure STT will have token counting? Otherwise I guess I need to make a PR to implement these, since we are experimenting with those providers.

Sahilgul (Author):

For Azure STT (Azure Speech Studio), billing is based on audio duration, not tokens. I just finished a fix for Azure STT tracing: it was capturing the duration but not emitting it back to the metrics. I'm not sure about ElevenLabs yet; I haven't checked it.

For Azure OpenAI, the existing PR will work, since it also provides inference for GPT-4o Transcribe and Whisper-like models. GPT-4o Transcribe billing is based on three things: input audio tokens, input text tokens, and output tokens (text by default). Whisper billing is based on duration only.
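
As a rough illustration of how an application might combine the two billing models once these metrics are available, here is a minimal sketch. The rates are placeholders for illustration only, not real provider pricing, and the function is hypothetical rather than part of this PR.

# Placeholder rates for illustration only; substitute your provider's actual pricing.
AUDIO_TOKEN_RATE = 0.000006   # per input audio token
TEXT_TOKEN_RATE = 0.0000025   # per input text token
OUTPUT_TOKEN_RATE = 0.00001   # per output (text) token
PER_MINUTE_RATE = 0.006       # per minute of audio, for duration-billed providers


def estimate_stt_cost(
    audio_tokens: int,
    text_tokens: int,
    output_tokens: int,
    audio_duration_s: float,
    token_billed: bool,
) -> float:
    """Estimate STT cost: token counts for token-billed providers (e.g. GPT-4o
    Transcribe), audio duration for duration-billed ones (e.g. Whisper, Azure)."""
    if token_billed:
        return (
            audio_tokens * AUDIO_TOKEN_RATE
            + text_tokens * TEXT_TOKEN_RATE
            + output_tokens * OUTPUT_TOKEN_RATE
        )
    return (audio_duration_s / 60.0) * PER_MINUTE_RATE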

@Sahilgul Sahilgul force-pushed the feature/stt-tracing branch from ed74c13 to 917cfd1 on January 18, 2026 17:01

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fake_io.py (1)

88-99: Critical: Duplicate on_playback_finished call will emit the event twice.

Lines 88-92 and 94-98 both call on_playback_finished with identical parameters, and _pushed_duration is reset twice (lines 93 and 99). This appears to be a merge/copy-paste error that will cause duplicate playback_finished events to be emitted.

Looking at the on_playback_finished implementation in io.py, it tracks segment counts and will log a warning for the extra call: "playback_finished called more times than playback segments were captured".

🐛 Proposed fix: Remove the duplicate call
         self._flush_handle = None
         # Calculate played duration based on real elapsed time, capped at pushed duration
         # This matches the behavior of ConsoleAudioOutput and accounts for speed_factor
         # in tests (check_timestamp multiplies by speed_factor to convert to test time)
         played_duration = time.time() - self._start_time
         played_duration = min(max(0, played_duration), self._pushed_duration)
         self.on_playback_finished(
             playback_position=played_duration,
             interrupted=True,
             synchronized_transcript=None,
         )
         self._pushed_duration = 0.0
-        self.on_playback_finished(
-            playback_position=played_duration,
-            interrupted=True,
-            synchronized_transcript=None,
-        )
-        self._pushed_duration = 0.0
🤖 Fix all issues with AI agents
In `@livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py`:
- Around line 303-323: The nested call_soon_threadsafe is redundant:
_emit_recognition_usage is already scheduled via self._loop.call_soon_threadsafe
(call site that passes evt.result.result_id, audio_duration), so remove the
inner self._loop.call_soon_threadsafe inside _emit_recognition_usage and
directly call self._event_ch.send_nowait(...) (wrapped in the existing
contextlib.suppress), keeping the SpeechEvent construction and
stt.RecognitionUsage unchanged; this simplifies _emit_recognition_usage and
avoids double-scheduling.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/metrics/base.py (1)

41-53: Remove the temporary review comment before merging.

The # NEW: Token usage fields comment on line 42 is a development marker that should be removed before merge. The docstrings already document each field's purpose.

♻️ Proposed fix
     """Whether the STT is streaming (e.g using websocket)."""

-    # NEW: Token usage fields
     input_tokens: int = 0
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)

457-470: Token usage condition doesn't include audio_tokens or text_tokens.

The condition on line 468 only checks input_tokens, output_tokens, and total_tokens. If only audio_tokens or text_tokens are non-zero (with the others at 0), token_usage will incorrectly be set to None, losing the detailed token breakdown.

🐛 Proposed fix
             speech_event = stt.SpeechEvent(
                 type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                 alternatives=[sd],
                 token_usage={
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens,
                     "total_tokens": total_tokens,
                     "audio_tokens": audio_tokens,
                     "text_tokens": text_tokens,
                 }
-                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
+                if any((input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens))
                 else None,
             )
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed74c13 and 917cfd1.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • tests/fake_io.py
🧬 Code graph analysis (4)
livekit-agents/livekit/agents/stt/stt.py (2)
livekit-agents/livekit/agents/metrics/base.py (1)
  • STTMetrics (30-54)
livekit-agents/livekit/agents/utils/audio.py (1)
  • calculate_audio_duration (22-38)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)
livekit-agents/livekit/agents/stt/stt.py (2)
  • SpeechEvent (89-95)
  • SpeechEventType (33-50)
livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py (1)
livekit-agents/livekit/agents/stt/stt.py (3)
  • SpeechEvent (89-95)
  • SpeechEventType (33-50)
  • RecognitionUsage (66-67)
tests/fake_io.py (3)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • on_playback_finished (379-486)
livekit-agents/livekit/agents/voice/io.py (1)
  • on_playback_finished (191-218)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (2)
  • on_playback_finished (554-579)
  • synchronized_transcript (281-285)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.9)
  • GitHub Check: type-check (3.13)
  • GitHub Check: unit-tests
🔇 Additional comments (4)
livekit-agents/livekit/agents/stt/stt.py (2)

70-86: Well-structured TypedDict for optional token usage.

Using TypedDict with total=False correctly models the optional nature of token fields across different STT providers. The docstrings clearly document each field's purpose.


187-214: Token extraction and metrics emission implemented correctly.

The token extraction safely handles None with .get() and defaults to 0, maintaining backward compatibility. The token fields are properly propagated to STTMetrics.

livekit-agents/livekit/agents/metrics/usage_collector.py (2)

25-31: STT token tracking fields follow existing conventions.

The new fields are consistently named with the stt_ prefix and default to 0 for backward compatibility.


95-101: STT token accumulation correctly integrated.

The accumulation logic properly extends the existing STTMetrics handling to include the new token fields, following the same pattern used for LLM and TTS metrics.

