Conversation


@Sahilgul Sahilgul commented Jan 17, 2026

Realtime STT Token Usage Tracking for OpenAI and Duration-Based Providers

This PR implements a previously missing capability: comprehensive STT token usage tracking in the LiveKit Agents framework, covering both token-based and duration-based providers.

Our team was working on tracking STT usage for cost and metrics analysis, but we discovered that GPT-4o Transcribe did not populate token counts in STTMetrics, even though the API does provide them. After investigating, I implemented this missing feature.

Reference: OpenAI GPT-4o Transcribe documentation

Problem Statement

The LiveKit Agents framework previously did not track STT token usage consistently:

  • Token Fields: Providers like GPT-4o Transcribe return input/output token counts, but these were not captured.
  • Duration-Only Providers: Whisper and Azure STT report only audio duration, and the corresponding token fields were either None or missing entirely.
  • Metrics Consistency: A unified STTMetrics structure was needed to support both token-based and duration-based providers.
  • Usage Analysis: Applications could not accurately monitor STT costs or token consumption.

Solution

This PR introduces comprehensive STT metrics tracking across the agents framework (a sketch of the new STTMetrics fields follows this list):

  • STTMetrics: Added input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens fields with default 0 values.
  • UsageCollector: Extended to accumulate STT token metrics alongside LLM and TTS metrics.
  • Base STT class: Updated recognize() method to extract and emit token usage from SpeechEvent.
  • OpenAI Plugin: Parses token counts from transcription API responses.
  • Duration-Only Providers (Whisper, Azure): Token fields remain 0, but audio_duration is captured.
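
For reference, the new fields look roughly like this. This is a simplified sketch of the STTMetrics dataclass based on the changes in this PR; the existing fields (request_id, audio_duration, streamed, and so on) are omitted here.

from dataclasses import dataclass


@dataclass
class STTMetrics:
    # Existing fields (request_id, audio_duration, streamed, ...) omitted in this sketch.
    # All token counters default to 0, so duration-only providers keep working unchanged.
    input_tokens: int = 0
    """Total input tokens used (audio + text tokens)."""
    output_tokens: int = 0
    """Total output tokens generated."""
    total_tokens: int = 0
    """Total tokens used (input + output)."""
    audio_tokens: int = 0
    """Number of audio tokens in the input."""
    text_tokens: int = 0
    """Number of text tokens in the input (e.g. from a prompt)."""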

Core Architecture Changes

  • agents/metrics/base.py: New token fields in STTMetrics.
  • agents/metrics/usage_collector.py: Collects STT token metrics per request.
  • agents/stt/stt.py: Extracts token usage when available and preserves backward compatibility (see the sketch after this list).
  • plugins/openai/stt.py: Parses input/output token usage for supported models (GPT-4o Transcribe, whisper-1).
  • plugins/azure/stt.py (optional): Duration tracking remains, token fields default to 0.
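
To make the plumbing concrete, here is a minimal sketch of the STTTokenUsage TypedDict and the defaulting logic recognize() applies. The TypedDict shape follows the fields named in this PR; the helper function and its name are illustrative, not the exact code.

from __future__ import annotations

from typing import Optional, TypedDict


class STTTokenUsage(TypedDict, total=False):
    """Token usage reported by token-based STT providers; every field is optional."""

    input_tokens: int
    output_tokens: int
    total_tokens: int
    audio_tokens: int
    text_tokens: int


def token_fields(token_usage: Optional[STTTokenUsage]) -> dict[str, int]:
    """Return the token counters for STTMetrics, defaulting each one to 0
    when a provider (e.g. Whisper, Azure) reports no token usage."""
    usage = token_usage or {}
    return {
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        "audio_tokens": usage.get("audio_tokens", 0),
        "text_tokens": usage.get("text_tokens", 0),
    }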

Example OpenAI API Response (generic text)

Transcription(
    text='This is an example transcription for testing purposes.',
    logprobs=None,
    usage=UsageTokens(
        input_tokens=393,
        output_tokens=183,
        total_tokens=576,
        type='tokens',
        input_token_details=UsageTokensInputTokenDetails(
            audio_tokens=386,
            text_tokens=7
        )
    )
)

Key Changes

  • STTMetrics now supports both token-based and duration-only STT providers.
  • UsageCollector aggregates STT metrics for billing and analysis (see the usage sketch after this list).
  • Token counts are emitted per request with request_id linking metrics to the transcription.
  • Backward compatible: fields default to 0, no breaking changes to existing code.
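
On the consuming side, an application can read the aggregated totals roughly like this. This is a minimal sketch: it assumes the usual metrics_collected event pattern from the agents framework and the stt_*_tokens summary fields added by this PR.

from livekit.agents.metrics import UsageCollector

usage_collector = UsageCollector()


# Metrics (STT, LLM, TTS) are typically delivered via a "metrics_collected"
# event on the session; feed each batch into the collector as it arrives.
def on_metrics_collected(ev) -> None:
    usage_collector.collect(ev.metrics)


# Later (e.g. at shutdown), read the aggregated STT token counters.
summary = usage_collector.get_summary()
print(
    "stt input:", summary.stt_input_tokens,
    "output:", summary.stt_output_tokens,
    "total:", summary.stt_total_tokens,
    "audio:", summary.stt_audio_tokens,
    "text:", summary.stt_text_tokens,
)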

Benefits

  • Consistent Metrics: Token and duration metrics unified across providers.
  • Cost Monitoring: Token-based usage available for billing/analytics (OpenAI).
  • Backward Compatibility: Duration-only providers like Whisper continue to work without errors.
  • Observability: All STT requests now produce metrics that can be logged or traced.

Notes

  • Whisper returns only duration (UsageDuration) → token counts are 0.
  • GPT-4o Transcribe returns token counts → fields populated in STTMetrics.
  • Azure and other duration-based providers: audio duration tracked, tokens default to 0.
  • No breaking changes; default values ensure older workflows continue to work.

Summary by CodeRabbit

  • New Features

    • Expanded STT reporting: speech events now include detailed token usage (input, output, total, audio, text), and these counts are aggregated into overall usage summaries.
    • New usage emission for final transcripts includes reported audio duration.
  • Tests

    • Adjusted fake audio playback timing to better bound and report playback position for playback-finished callbacks.



CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai bot commented Jan 17, 2026

📝 Walkthrough

Adds STT token-usage fields and threads token usage from STT plugins (OpenAI) through SpeechEvent -> STTMetrics -> UsageCollector -> UsageSummary; also schedules Azure recognition-usage emission and tweaks FakeAudioOutput playback-duration computation.

Changes

  • Metrics dataclasses (livekit-agents/livekit/agents/metrics/base.py, livekit-agents/livekit/agents/metrics/usage_collector.py): Added STT token fields (input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens) to STTMetrics and corresponding accumulator fields (stt_*_tokens) to UsageSummary; UsageCollector.collect() aggregates these counters.
  • STT core (livekit-agents/livekit/agents/stt/stt.py): Added STTTokenUsage TypedDict and SpeechEvent.token_usage; recognize() extracts token usage (defaulting to 0) and populates STTMetrics with token fields.
  • OpenAI STT plugin (livekit-plugins/livekit-plugins-openai/.../openai/stt.py): _recognize_impl() reads token usage from the OpenAI response (input/output/total plus input token details such as audio/text) and attaches token_usage to the emitted SpeechEvent.
  • Azure STT plugin (livekit-plugins/livekit-plugins-azure/.../azure/stt.py): After the final transcript, computes audio_duration and schedules _emit_recognition_usage(request_id, audio_duration) to publish a RECOGNITION_USAGE event (audio duration payload).
  • Tests / Utilities (tests/fake_io.py): FakeAudioOutput.clear_buffer now computes a single clamped played_duration (bounded by the pushed duration) and uses it as playback_position when invoking on_playback_finished, aligning callback timing with elapsed time.

Sequence Diagram(s)

sequenceDiagram
    participant OpenAI as OpenAI API
    participant OpenAIPlugin as OpenAI STT Plugin
    participant STT as STT Engine
    participant Collector as UsageCollector
    participant Summary as UsageSummary

    OpenAI->>OpenAIPlugin: response (transcript + usage)
    OpenAIPlugin->>OpenAIPlugin: extract token_usage (input/output/total/audio/text)
    OpenAIPlugin->>STT: emit SpeechEvent (with token_usage)
    STT->>STT: construct STTMetrics (include token fields)
    STT->>Collector: emit STTMetrics
    Collector->>Collector: aggregate stt_*_tokens
    Collector->>Summary: update UsageSummary token fields

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I count small tokens in a hop and a run,

Input, output, total—each one is fun.
Audio and text in little neat stacks,
I stash them in metrics and never look back,
A carrot for data, my tally is done 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: adding comprehensive STT token usage tracking fields and infrastructure to the metrics system across multiple files.



📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 917cfd1 and b6fc2bc.

📒 Files selected for processing (2)
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • tests/fake_io.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
🔇 Additional comments (2)
livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py (2)

285-310: Good addition of duration-based usage emission after final transcript.

This aligns with the unified STT metrics goal and keeps duration-only providers reporting usage.


312-321: Helper method for recognition usage emission is clear and scoped.

Encapsulating the usage event emission here keeps the callback flow tidy.
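
For orientation, a helper of that shape would look roughly like the sketch below. It is assembled from the review notes in this thread (the RECOGNITION_USAGE event, the RecognitionUsage payload, the event channel, and the call_soon_threadsafe scheduling at the call site), not copied from the PR; the exact SpeechEvent keyword names and the suppressed exception type are assumptions.

import contextlib

from livekit.agents import stt


class AzureSpeechStreamSketch:
    # Only the piece relevant to the sketch; the real stream class holds the
    # Azure recognizer, event channel, and event loop set up elsewhere.
    def _emit_recognition_usage(self, request_id: str, audio_duration: float) -> None:
        # Scheduled from the Azure SDK callback via loop.call_soon_threadsafe(...),
        # so it runs on the event loop and can publish directly on the channel.
        with contextlib.suppress(Exception):  # assumption: mirrors the existing suppress
            self._event_ch.send_nowait(
                stt.SpeechEvent(
                    type=stt.SpeechEventType.RECOGNITION_USAGE,
                    request_id=request_id,
                    recognition_usage=stt.RecognitionUsage(audio_duration=audio_duration),
                )
            )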



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/metrics/base.py`:
- Around line 41-53: Remove the trailing whitespace on the empty line preceding
the token fields to satisfy the linter, and delete the temporary review comment
"# NEW: Token usage fields" (or replace it with a concise docstring/header)
since each field already has a docstring; update the block with the attributes
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens in
livekit.agents.metrics.base (the variables named input_tokens, output_tokens,
total_tokens, audio_tokens, text_tokens) so only the documented fields remain
and no trailing spaces exist.

In `@livekit-agents/livekit/agents/metrics/usage_collector.py`:
- Around line 25-32: Remove the trailing whitespace on the blank line following
the stt_text_tokens field in the UsageCollector dataclass (the lines defining
stt_input_tokens, stt_output_tokens, stt_total_tokens, stt_audio_tokens,
stt_text_tokens); edit the file to delete the trailing space characters at the
end of that line (or remove the empty line entirely) and re-run the linter to
ensure W293 is resolved.

In `@livekit-agents/livekit/agents/stt/stt.py`:
- Around line 166-182: The blank lines surrounding the token-extraction block
contain trailing whitespace; remove trailing spaces on the empty lines around
the code that handles event._token_usage (the block that sets input_tokens,
output_tokens, total_tokens, audio_tokens, text_tokens) so there are truly blank
lines without trailing whitespace and ruff W293 is resolved.

In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py`:
- Around line 439-471: Remove the trailing whitespace on the blank line (fix the
ruff lint) and stop assigning a dynamic attribute `_token_usage` to SpeechEvent;
instead add an optional, typed field to the SpeechEvent dataclass (e.g.,
token_usage: Optional[RecognitionUsage] or a new dataclass with
input_tokens/output_tokens/total_tokens/audio_tokens/text_tokens) or extend the
existing RecognitionUsage type to include audio_tokens/text_tokens, then set
that typed field when constructing stt.SpeechEvent (the constructed symbol is
stt.SpeechEvent with type stt.SpeechEventType.FINAL_TRANSCRIPT and alternatives
[sd]) so mypy strict mode no longer reports attr-defined errors.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d97b05 and 309d764.

📒 Files selected for processing (4)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/metrics/base.py
🧬 Code graph analysis (3)
livekit-agents/livekit/agents/stt/stt.py (1)
livekit-agents/livekit/agents/metrics/base.py (1)
  • STTMetrics (30-54)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (2)
livekit-agents/livekit/agents/voice/agent_activity.py (1)
  • stt (2773-2774)
livekit-agents/livekit/agents/voice/agent.py (1)
  • stt (508-518)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)
livekit-agents/livekit/agents/telemetry/http_server.py (1)
  • metrics (18-35)
🪛 GitHub Check: ruff
livekit-agents/livekit/agents/stt/stt.py

[failure] 182-182: Ruff (W293)
livekit-agents/livekit/agents/stt/stt.py:182:1: W293 Blank line contains whitespace


[failure] 173-173: Ruff (W293)
livekit-agents/livekit/agents/stt/stt.py:173:1: W293 Blank line contains whitespace

livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py

[failure] 450-450: Ruff (W293)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py:450:1: W293 Blank line contains whitespace

livekit-agents/livekit/agents/metrics/usage_collector.py

[failure] 32-32: Ruff (W293)
livekit-agents/livekit/agents/metrics/usage_collector.py:32:1: W293 Blank line contains whitespace

livekit-agents/livekit/agents/metrics/base.py

[failure] 41-41: Ruff (W293)
livekit-agents/livekit/agents/metrics/base.py:41:1: W293 Blank line contains whitespace

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.13)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.9)
🔇 Additional comments (5)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)

458-461: LGTM - SpeechEvent creation logic is correct.

The event creation preserves the existing behavior for FINAL_TRANSCRIPT events while adding token usage metadata.

livekit-agents/livekit/agents/stt/stt.py (2)

183-199: LGTM - STTMetrics construction correctly includes token fields.

The token fields are properly extracted and passed to the STTMetrics constructor, maintaining consistency with the field definitions in base.py.


390-402: Streaming metrics don't include token usage.

The _metrics_monitor_task creates STTMetrics for streaming recognition without extracting token usage. While the fields default to 0, this creates an inconsistency where batch recognize() reports tokens but streaming doesn't.

If the realtime API doesn't provide token data, this is expected behavior. Otherwise, consider extracting token usage from RECOGNITION_USAGE events similar to the batch path.

livekit-agents/livekit/agents/metrics/usage_collector.py (1)

96-102: LGTM - STT token aggregation logic is correct.

The collection pattern correctly mirrors the existing LLMMetrics accumulation, properly aggregating all five token fields from STTMetrics.

livekit-agents/livekit/agents/metrics/base.py (1)

43-52: LGTM - Token fields are well-defined with appropriate defaults.

The token usage fields are correctly typed with sensible defaults (0) ensuring backward compatibility. The docstrings clearly explain the purpose of each field, distinguishing between total tokens, audio tokens, and text tokens.


Comment on lines 41 to 53

# NEW: Token usage fields
input_tokens: int = 0
"""Total input tokens used (audio + text tokens)."""
output_tokens: int = 0
"""Total output tokens generated."""
total_tokens: int = 0
"""Total tokens used (input + output)."""
audio_tokens: int = 0
"""Number of audio tokens in input."""
text_tokens: int = 0
"""Number of text tokens in input (e.g., from prompt)."""


⚠️ Potential issue | 🟡 Minor

Fix trailing whitespace and consider removing temporary comment.

  1. Linting error (Line 41): Blank line contains trailing whitespace - flagged by ruff.

  2. Code hygiene: The # NEW: Token usage fields comment is useful during review but could be removed before merge since the docstrings adequately document each field's purpose.

Proposed fix
     """Whether the STT is streaming (e.g using websocket)."""
-    
-    # NEW: Token usage fields
+
     input_tokens: int = 0
🧰 Tools
🪛 GitHub Check: ruff

[failure] 41-41: Ruff (W293)
livekit-agents/livekit/agents/metrics/base.py:41:1: W293 Blank line contains whitespace

🤖 Prompt for AI Agents
In `@livekit-agents/livekit/agents/metrics/base.py` around lines 41 - 53, Remove
the trailing whitespace on the empty line preceding the token fields to satisfy
the linter, and delete the temporary review comment "# NEW: Token usage fields"
(or replace it with a concise docstring/header) since each field already has a
docstring; update the block with the attributes input_tokens, output_tokens,
total_tokens, audio_tokens, and text_tokens in livekit.agents.metrics.base (the
variables named input_tokens, output_tokens, total_tokens, audio_tokens,
text_tokens) so only the documented fields remain and no trailing spaces exist.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py`:
- Around line 439-471: The current construction of stt.SpeechEvent sets
token_usage to None when input/output/total tokens are zero, which drops
audio_tokens/text_tokens if only detailed counts exist; update the logic in the
block that builds token_usage (around resp/usage handling and the
stt.SpeechEvent creation) so you always populate the token_usage dict with
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens and
then set token_usage to that dict if any of those five values is non-zero (e.g.,
use a any(...) check on the dict values) instead of checking only
input/output/total; reference the resp/usage extraction and the
stt.SpeechEvent(...) call to locate where to change the condition.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/metrics/base.py (1)

41-52: Remove the temporary comment and trailing whitespace.

The inline note is no longer needed, and the blank line appears to include whitespace (ruff W293).

🧹 Suggested cleanup
-    
-    # NEW: Token usage fields
+
     input_tokens: int = 0
livekit-agents/livekit/agents/metrics/usage_collector.py (1)

25-31: Remove trailing whitespace after the STT token fields.

The blank line after stt_text_tokens appears to contain whitespace (ruff W293).

🧹 Suggested cleanup
     stt_audio_tokens: int = 0
     stt_text_tokens: int = 0
-
+
     # properties for naming consistency: prompt = input, completion = output
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 309d764 and 059af5f.

📒 Files selected for processing (4)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • livekit-agents/livekit/agents/stt/stt.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)
livekit-agents/livekit/agents/telemetry/http_server.py (1)
  • metrics (18-35)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)
  • GitHub Check: unit-tests
🔇 Additional comments (1)
livekit-agents/livekit/agents/metrics/usage_collector.py (1)

95-101: Aggregation looks correct.

STT token fields are accumulated consistently with the new metrics.


Comment on lines +439 to +471
            # Extract token usage if available
            input_tokens = 0
            output_tokens = 0
            total_tokens = 0
            audio_tokens = 0
            text_tokens = 0
            if hasattr(resp, "usage") and resp.usage:
                usage = resp.usage
                input_tokens = getattr(usage, "input_tokens", 0)
                output_tokens = getattr(usage, "output_tokens", 0)
                total_tokens = getattr(usage, "total_tokens", 0)

                # Extract detailed token breakdown
                if hasattr(usage, "input_token_details") and usage.input_token_details:
                    details = usage.input_token_details
                    audio_tokens = getattr(details, "audio_tokens", 0)
                    text_tokens = getattr(details, "text_tokens", 0)

            # Create the speech event with token usage
            speech_event = stt.SpeechEvent(
                type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                alternatives=[sd],
                token_usage={
                    "input_tokens": input_tokens,
                    "output_tokens": output_tokens,
                    "total_tokens": total_tokens,
                    "audio_tokens": audio_tokens,
                    "text_tokens": text_tokens,
                }
                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
                else None,
            )
            return speech_event

⚠️ Potential issue | 🟡 Minor

Don’t drop audio/text usage when totals are missing.

If only detailed tokens are present, token_usage becomes None and metrics lose audio/text counts.

✅ Suggested fix
-            speech_event = stt.SpeechEvent(
+            has_usage = any(
+                token > 0
+                for token in (input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens)
+            )
+            speech_event = stt.SpeechEvent(
                 type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                 alternatives=[sd],
                 token_usage={
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens,
                     "total_tokens": total_tokens,
                     "audio_tokens": audio_tokens,
                     "text_tokens": text_tokens,
                 }
-                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
-                else None,
+                if has_usage
+                else None,
             )
🤖 Prompt for AI Agents
In `@livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py` around
lines 439 - 471, The current construction of stt.SpeechEvent sets token_usage to
None when input/output/total tokens are zero, which drops
audio_tokens/text_tokens if only detailed counts exist; update the logic in the
block that builds token_usage (around resp/usage handling and the
stt.SpeechEvent creation) so you always populate the token_usage dict with
input_tokens, output_tokens, total_tokens, audio_tokens, and text_tokens and
then set token_usage to that dict if any of those five values is non-zero (e.g.,
use a any(...) check on the dict values) instead of checking only
input/output/total; reference the resp/usage extraction and the
stt.SpeechEvent(...) call to locate where to change the condition.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fake_io.py (1)

88-99: Duplicate on_playback_finished call and _pushed_duration reset.

The method calls on_playback_finished twice with identical parameters, and resets _pushed_duration = 0.0 twice. Per the base class implementation in io.py, calling on_playback_finished more than expected triggers a warning: "playback_finished called more times than playback segments were captured". The second call (lines 94-98) and second reset (line 99) appear to be accidental duplication.

🐛 Proposed fix to remove duplicate code
         self.on_playback_finished(
             playback_position=played_duration,
             interrupted=True,
             synchronized_transcript=None,
         )
         self._pushed_duration = 0.0
-        self.on_playback_finished(
-            playback_position=played_duration,
-            interrupted=True,
-            synchronized_transcript=None,
-        )
-        self._pushed_duration = 0.0
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f5e06a and ed74c13.

📒 Files selected for processing (1)
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • tests/fake_io.py
🧬 Code graph analysis (1)
tests/fake_io.py (4)
livekit-agents/livekit/agents/voice/io.py (1)
  • on_playback_finished (191-218)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • on_playback_finished (379-486)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (2)
  • on_playback_finished (554-579)
  • synchronized_transcript (281-285)
livekit-agents/livekit/agents/voice/speech_handle.py (1)
  • interrupted (83-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.9)
  • GitHub Check: type-check (3.13)
🔇 Additional comments (1)
tests/fake_io.py (1)

83-87: LGTM!

The explicit calculation with clamping between [0, _pushed_duration] ensures valid playback position bounds and the comments clearly explain the rationale.


bml1g12 (Contributor) commented Jan 17, 2026

Excited to see this PR, as I noticed this week that my STT was missing the data required to track costs. After this PR, any idea whether ElevenLabs and Azure STT will have token counting? Otherwise I guess I need to make a PR to implement these, since we are experimenting with those providers.

Sahilgul (Author):

For Azure STT (Azure Speech Studio), billing is based on audio duration, not tokens. I just finished a fix for Azure STT tracing: it was capturing the duration but not emitting it back to the metrics. I'm not sure about ElevenLabs yet; I haven't checked it.

For Azure OpenAI, the existing PR will work, since it also provides inference for GPT-4o Transcribe and Whisper-like models. GPT-4o Transcribe billing is based on three things: input audio tokens, input text tokens, and output tokens (text by default). Whisper billing is based on duration only.
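
As a rough illustration of how an application might combine the two billing models once these metrics are available, here is a minimal sketch. The rates are placeholders for illustration only, not real provider pricing, and the function is hypothetical rather than part of this PR.

# Placeholder rates for illustration only; substitute your provider's actual pricing.
AUDIO_TOKEN_RATE = 0.000006   # per input audio token
TEXT_TOKEN_RATE = 0.0000025   # per input text token
OUTPUT_TOKEN_RATE = 0.00001   # per output (text) token
PER_MINUTE_RATE = 0.006       # per minute of audio, for duration-billed providers


def estimate_stt_cost(
    audio_tokens: int,
    text_tokens: int,
    output_tokens: int,
    audio_duration_s: float,
    token_billed: bool,
) -> float:
    """Estimate STT cost: token counts for token-billed providers (e.g. GPT-4o
    Transcribe), audio duration for duration-billed ones (e.g. Whisper, Azure)."""
    if token_billed:
        return (
            audio_tokens * AUDIO_TOKEN_RATE
            + text_tokens * TEXT_TOKEN_RATE
            + output_tokens * OUTPUT_TOKEN_RATE
        )
    return (audio_duration_s / 60.0) * PER_MINUTE_RATE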

@Sahilgul Sahilgul force-pushed the feature/stt-tracing branch from ed74c13 to 917cfd1 on January 18, 2026 17:01

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fake_io.py (1)

88-99: Critical: Duplicate on_playback_finished call will emit the event twice.

Lines 88-92 and 94-98 both call on_playback_finished with identical parameters, and _pushed_duration is reset twice (lines 93 and 99). This appears to be a merge/copy-paste error that will cause duplicate playback_finished events to be emitted.

Looking at the on_playback_finished implementation in io.py, it tracks segment counts and will log a warning for the extra call: "playback_finished called more times than playback segments were captured".

🐛 Proposed fix: Remove the duplicate call
         self._flush_handle = None
         # Calculate played duration based on real elapsed time, capped at pushed duration
         # This matches the behavior of ConsoleAudioOutput and accounts for speed_factor
         # in tests (check_timestamp multiplies by speed_factor to convert to test time)
         played_duration = time.time() - self._start_time
         played_duration = min(max(0, played_duration), self._pushed_duration)
         self.on_playback_finished(
             playback_position=played_duration,
             interrupted=True,
             synchronized_transcript=None,
         )
         self._pushed_duration = 0.0
-        self.on_playback_finished(
-            playback_position=played_duration,
-            interrupted=True,
-            synchronized_transcript=None,
-        )
-        self._pushed_duration = 0.0
🤖 Fix all issues with AI agents
In `@livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py`:
- Around line 303-323: The nested call_soon_threadsafe is redundant:
_emit_recognition_usage is already scheduled via self._loop.call_soon_threadsafe
(call site that passes evt.result.result_id, audio_duration), so remove the
inner self._loop.call_soon_threadsafe inside _emit_recognition_usage and
directly call self._event_ch.send_nowait(...) (wrapped in the existing
contextlib.suppress), keeping the SpeechEvent construction and
stt.RecognitionUsage unchanged; this simplifies _emit_recognition_usage and
avoids double-scheduling.
♻️ Duplicate comments (2)
livekit-agents/livekit/agents/metrics/base.py (1)

41-53: Remove the temporary review comment before merging.

The # NEW: Token usage fields comment on line 42 is a development marker that should be removed before merge. The docstrings already document each field's purpose.

♻️ Proposed fix
     """Whether the STT is streaming (e.g using websocket)."""

-    # NEW: Token usage fields
     input_tokens: int = 0
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)

457-470: Token usage condition doesn't include audio_tokens or text_tokens.

The condition on line 468 only checks input_tokens, output_tokens, and total_tokens. If only audio_tokens or text_tokens are non-zero (with the others at 0), token_usage will incorrectly be set to None, losing the detailed token breakdown.

🐛 Proposed fix
             speech_event = stt.SpeechEvent(
                 type=stt.SpeechEventType.FINAL_TRANSCRIPT,
                 alternatives=[sd],
                 token_usage={
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens,
                     "total_tokens": total_tokens,
                     "audio_tokens": audio_tokens,
                     "text_tokens": text_tokens,
                 }
-                if (input_tokens > 0 or output_tokens > 0 or total_tokens > 0)
+                if any((input_tokens, output_tokens, total_tokens, audio_tokens, text_tokens))
                 else None,
             )
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed74c13 and 917cfd1.

📒 Files selected for processing (6)
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • tests/fake_io.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/stt/stt.py
  • livekit-agents/livekit/agents/metrics/base.py
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
  • livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py
  • livekit-agents/livekit/agents/metrics/usage_collector.py
  • tests/fake_io.py
🧬 Code graph analysis (4)
livekit-agents/livekit/agents/stt/stt.py (2)
livekit-agents/livekit/agents/metrics/base.py (1)
  • STTMetrics (30-54)
livekit-agents/livekit/agents/utils/audio.py (1)
  • calculate_audio_duration (22-38)
livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py (1)
livekit-agents/livekit/agents/stt/stt.py (2)
  • SpeechEvent (89-95)
  • SpeechEventType (33-50)
livekit-plugins/livekit-plugins-azure/livekit/plugins/azure/stt.py (1)
livekit-agents/livekit/agents/stt/stt.py (3)
  • SpeechEvent (89-95)
  • SpeechEventType (33-50)
  • RecognitionUsage (66-67)
tests/fake_io.py (3)
livekit-agents/livekit/agents/voice/recorder_io/recorder_io.py (1)
  • on_playback_finished (379-486)
livekit-agents/livekit/agents/voice/io.py (1)
  • on_playback_finished (191-218)
livekit-agents/livekit/agents/voice/transcription/synchronizer.py (2)
  • on_playback_finished (554-579)
  • synchronized_transcript (281-285)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: type-check (3.9)
  • GitHub Check: type-check (3.13)
  • GitHub Check: unit-tests
🔇 Additional comments (4)
livekit-agents/livekit/agents/stt/stt.py (2)

70-86: Well-structured TypedDict for optional token usage.

Using TypedDict with total=False correctly models the optional nature of token fields across different STT providers. The docstrings clearly document each field's purpose.


187-214: Token extraction and metrics emission implemented correctly.

The token extraction safely handles None with .get() and defaults to 0, maintaining backward compatibility. The token fields are properly propagated to STTMetrics.

livekit-agents/livekit/agents/metrics/usage_collector.py (2)

25-31: STT token tracking fields follow existing conventions.

The new fields are consistently named with the stt_ prefix and default to 0 for backward compatibility.


95-101: STT token accumulation correctly integrated.

The accumulation logic properly extends the existing STTMetrics handling to include the new token fields, following the same pattern used for LLM and TTS metrics.

