[WIP] OpenTelemetry metrics by IlonaShishov · Pull Request #273 · RHEcosystemAppEng/google-lightspeed-agent

IlonaShishov · 2026-05-31T07:52:05Z

add OpenTelemetry metrics for subscriptions, DCR clients, usage, and tool calls

Introduce a DB-polling MetricsCollector that caches entitlement, DCR client, and usage data, exposed as OTel observable gauges via Prometheus and OTLP exporters. Add a synchronous tool_calls_by_name counter wired into the UsageTrackingPlugin. Metrics are independently toggleable from tracing via OTEL_METRICS_ENABLED.

luis5tb

PR Review: OpenTelemetry Metrics

+980/-119 across 12 files — reviewed the full diff.

Critical (1)

High-cardinality client_id attribute on gauges and counters risks metric explosion [metrics.py:378-380, 430-435]

Every unique (order_id, client_id) combination creates a separate time series in Prometheus/Cloud Monitoring. The tool_calls_by_name counter compounds this with a third dimension (tool_name). With many clients, this can overwhelm the monitoring backend.

→ Consider whether client_id is truly needed on gauge observations. For aggregated billing views, order_id alone may suffice. At minimum, the dcr_clients_active gauge (which emits value=1 per client with account_id + order_id + client_id attributes) should be changed to a grouped count by account_id.

Important (5)

Gauge names use _total suffix, violating OTel naming conventions [metrics.py:451-505]

OTel and Prometheus conventions reserve _total for counters. The Prometheus exporter auto-appends _total to counter names. Using _total on gauges (subscriptions_total, tokens_input_total, tokens_output_total, requests_total) will confuse operators and may clash with Prometheus conventions.

→ Rename: subscriptions_total → subscriptions_count, tokens_input_total → input_tokens, tokens_output_total → output_tokens, requests_total → request_count.
_query_dcr_clients fetches ALL rows without aggregation [metrics.py:260-275]

Unlike subscriptions (grouped+counted) and usage (grouped+summed), DCR clients are fetched row-by-row. Each becomes a gauge observation with value=1. This is unbounded — if the table grows to thousands of rows, the collector loads them all into memory every poll cycle, and each becomes a separate time series.

→ Group by account_id (or account_id + order_id) and COUNT(*) to match the aggregation pattern used by the other queries.
No validation on otel_metrics_collection_interval [config/settings.py:471-474]

A user could set OTEL_METRICS_COLLECTION_INTERVAL=0 (or negative), causing asyncio.sleep(0) in a tight loop that hammers the database.

→ Add ge=1 to the Field validator: otel_metrics_collection_interval: int = Field(default=60, ge=1, ...)
Three sequential DB queries in _collect_once could run concurrently [metrics.py:222-227]

_query_subscriptions, _query_dcr_clients, and _query_usage are independent async calls run sequentially. On a remote PostgreSQL instance with network latency, this triples the collection time unnecessarily.

→ Use asyncio.gather:
```
subscriptions, dcr_clients, usage = await asyncio.gather(
    self._query_subscriptions(),
    self._query_dcr_clients(),
    self._query_usage(),
)
```
No test that increment_tool_call actually increments when metrics are enabled [tests/test_metrics.py:1002-1010]

TestToolCallCounter.test_increment_tool_call only tests the no-op path (when _tool_call_counter is None). There's no test verifying the counter adds to the OTel counter when metrics are enabled. The E2E test creates a counter but doesn't verify the increment goes through increment_tool_call().

→ Add a test that sets up an InMemoryMetricReader, registers instruments, calls increment_tool_call, and asserts the counter data point appears in the reader.

Suggestions (5)

Intermediate Observation dataclass adds unnecessary indirection [metrics.py:335-339]

The Observation dataclass wraps values before converting to otel_metrics.Observation in the lambda callbacks. The lambdas could construct otel_metrics.Observation directly from cache snapshots, eliminating the intermediate type.
Consider namespace-prefixing metric names [metrics.py:451-505]

Names like subscriptions_count are generic. Prefixing with lightspeed. or lightspeed_agent. (e.g., lightspeed.subscriptions_count) avoids collisions if other services share the same Cloud Monitoring project.
@pytest.mark.asyncio markers are redundant [tests/test_metrics.py]

The project uses asyncio_mode = "auto" in pyproject.toml, making explicit @pytest.mark.asyncio decorators unnecessary. Not harmful, but inconsistent with the project's existing pattern.
Usage plugin test refactoring (monkeypatch) is a welcome cleanup [tests/test_usage_plugin.py]

The migration from manual save/restore to monkeypatch.setattr is correct and improves test isolation. Nice improvement.
Consider adding _observe_tokens_output and _observe_requests individual unit tests [tests/test_metrics.py]

Only _observe_tokens_input has an individual test. The other two gauge callbacks are only tested via the E2E test. Adding focused unit tests would make failures easier to diagnose.

Strengths

Clean separation of concerns: The cache/collector/callback architecture is well-layered. Polling the DB on a timer and serving stale data to gauge callbacks is the right pattern for OTel observable gauges.
Atomic cache replacement: self._cache = MetricsCache(...) is a single reference assignment, making the read/write pattern safe without locks in CPython's GIL.
Graceful degradation: DB errors log a warning and serve stale cache. Metrics disabled = no-op path. Missing Prometheus exporter = try/except with warning. Robust.
Independent metrics toggle: OTEL_METRICS_ENABLED independent of OTEL_ENABLED (tracing) is a good design choice for production rollout.
Test coverage is solid: 369 lines of tests covering cache defaults, collector lifecycle, gauge callbacks, DB error handling, settings, and an E2E pipeline.
Usage plugin test cleanup: The monkeypatch refactoring is a genuine improvement in test quality.

Verdict: Comment (leaning toward Request Changes)

The architecture is sound and the implementation is clean. The WIP label is appropriate — the items above (especially metric naming conventions, cardinality risks, and collection interval validation) should be addressed before merging. None are architecturally difficult to fix.

luis5tb · 2026-06-01T10:36:59Z

Besides the AI generated review, I have:

I think this would need some extra information at the cloudrun README to describe how to set it up
What happen if the agent gets scaled to 0? Perhaps this requires a note about this requiring the scale to be at least 1 (if that is the case)

IlonaShishov · 2026-06-01T13:58:44Z

Besides the AI generated review, I have:

I think this would need some extra information at the cloudrun README to describe how to set it up

What happen if the agent gets scaled to 0? Perhaps this requires a note about this requiring the scale to be at least 1 (if that is the case)

Hi, thanks for the review!
Im working on the changes,
just one note, looking at service.yaml: run.googleapis.com/minScale: "1" - the agent is already configured with minimum 1 instance. It doesn't scale to zero. So it's not a concern with the current config. But I will document it in case someone changes minScale to 0, the metrics collector would stop polling and no metrics would flow until an instance spins back up :)

IlonaShishov · 2026-06-07T12:26:49Z

CI (lint, test, build and vulnerability) issues will be resolved by PRs #285 and #286

…, and tool calls Introduce a DB-polling MetricsCollector that caches entitlement, DCR client, and usage data, exposed as OTel observable gauges via Prometheus and OTLP exporters. Add a synchronous tool_calls_by_name counter wired into the UsageTrackingPlugin. Metrics are independently toggleable from tracing via OTEL_METRICS_ENABLED. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The collector runs entirely in a single asyncio event loop, so the threading.Lock was unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add logging for metric reader creation and a debug message when an exporter does not support metrics (jaeger/zipkin). Collapse single-item gRPC import to one line. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PrometheusMetricReader uses the standard OTEL_EXPORTER_PROMETHEUS_PORT env var natively, so the custom setting was redundant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…gin tests Replace try/finally with monkeypatch.setattr for cleaner test teardown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add managed OTel collector sidecar annotation and OTEL_METRICS_ENABLED env var. App defaults (otlp exporter on localhost:4317) already match the collector sidecar, so no other env vars are needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… level All metrics now use account_id as the sole grouping dimension: - DCR clients: GROUP BY account_id with COUNT(*) instead of per-row - Usage gauges: JOIN usage_records to marketplace_entitlements to resolve account_id, replacing order_id/client_id labels - Tool call counter: reduced to tool_name only Reduces time-series growth from O(orders × clients) to O(accounts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Drop _total suffix from gauge names (reserved for counters): - subscriptions_total → subscriptions_count - tokens_input_total → input_tokens - tokens_output_total → output_tokens - requests_total → request_count Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lector The three independent queries (subscriptions, DCR clients, usage) now run concurrently instead of sequentially, reducing collection time to the slowest query rather than the sum of all three. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Enforce ge=10 seconds to prevent tight polling loops when OTEL_METRICS_COLLECTION_INTERVAL is set too low. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document metric instruments, Cloud Run setup, and minimum instance requirements for metrics collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ant asyncio markers Add test verifying the tool call counter actually increments with correct attributes when metrics are enabled. Rename existing test to clarify it covers the disabled/no-op path. Remove redundant @pytest.mark.asyncio decorators — project uses asyncio_mode = "auto". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m chart - Start prometheus_client HTTP server on port 9464 for metrics scraping - Add otel_metrics_prometheus_port setting back to config - Add ServiceMonitor template for Prometheus Operator auto-discovery - Add GrafanaDashboard template with panels for all agent metrics - Expose metrics port on agent Deployment and Service (conditional) - Allow Prometheus scraping in NetworkPolicy - Add monitoring section to values.yaml with prerequisites documentation - Document full Grafana Operator setup (instance, datasource, dashboard) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Make GrafanaDashboard UID and title dynamic (based on Helm fullname) so multiple agents can coexist in Grafana without collisions - Add PromQL fallback to produce numeric 0 for accounts without DCR clients - Add comment on serviceName uniqueness requirement in values.yaml - Document umbrella chart pattern for multi-agent deployments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the non-existent `run.googleapis.com/otel-collector` annotation with a Google-Built OpenTelemetry Collector sidecar (otelcol-google:0.151.0) that receives OTLP metrics on localhost:4317 and exports them to Cloud Monitoring via the googlemanagedprometheus exporter. The collector config is stored in Secret Manager (otel-collector-config) and created automatically by setup.sh. All features used are GA-level (multi-container sidecar, secret volumes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three dashboards (Marketplace, Usage & Billing, Agent Operations) deployed via an idempotent Cloud Build step that auto-discovers JSON files in deploy/cloudrun/dashboards/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

IlonaShishov requested review from dmartinol, luis5tb and yuvalk as code owners May 31, 2026 07:52

luis5tb reviewed Jun 1, 2026

View reviewed changes

IlonaShishov force-pushed the metrics branch 2 times, most recently from ee09acc to 63d025b Compare June 3, 2026 13:40

IlonaShishov force-pushed the metrics branch 2 times, most recently from b1a9f7f to 8c17182 Compare June 11, 2026 14:00

IlonaShishov and others added 15 commits June 14, 2026 12:08

refactor: remove unnecessary threading lock from MetricsCollector

c127f61

The collector runs entirely in a single asyncio event loop, so the threading.Lock was unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: remove unused otel_metrics_prometheus_port setting

6d56aef

PrometheusMetricReader uses the standard OTEL_EXPORTER_PROMETHEUS_PORT env var natively, so the custom setting was redundant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: use monkeypatch instead of manual save/restore in usage plu…

bf20fa9

…gin tests Replace try/finally with monkeypatch.setattr for cleaner test teardown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add minimum validation on metrics collection interval

cb2a0b8

Enforce ge=10 seconds to prevent tight polling loops when OTEL_METRICS_COLLECTION_INTERVAL is set too low. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add metrics section to telemetry documentation

70daab3

Document metric instruments, Cloud Run setup, and minimum instance requirements for metrics collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: document multi-agent metrics distinction via OTEL_SERVICE_NAME

4c8a551

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

IlonaShishov force-pushed the metrics branch from 8c17182 to ff84ca3 Compare June 14, 2026 09:08

IlonaShishov force-pushed the metrics branch 2 times, most recently from ee7576b to bb686a7 Compare June 15, 2026 13:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] OpenTelemetry metrics#273

[WIP] OpenTelemetry metrics#273
IlonaShishov wants to merge 17 commits into
RHEcosystemAppEng:mainfrom
IlonaShishov:metrics

IlonaShishov commented May 31, 2026

Uh oh!

luis5tb left a comment

Uh oh!

luis5tb commented Jun 1, 2026

Uh oh!

IlonaShishov commented Jun 1, 2026

Uh oh!

IlonaShishov commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IlonaShishov commented May 31, 2026

Uh oh!

luis5tb left a comment

Choose a reason for hiding this comment

PR Review: OpenTelemetry Metrics

Critical (1)

Important (5)

Suggestions (5)

Strengths

Verdict: Comment (leaning toward Request Changes)

Uh oh!

luis5tb commented Jun 1, 2026

Uh oh!

IlonaShishov commented Jun 1, 2026

Uh oh!

IlonaShishov commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants