[WIP] OpenTelemetry metrics#273
Conversation
luis5tb
left a comment
There was a problem hiding this comment.
PR Review: OpenTelemetry Metrics
+980/-119 across 12 files — reviewed the full diff.
Critical (1)
-
High-cardinality
client_idattribute on gauges and counters risks metric explosion[metrics.py:378-380, 430-435]Every unique
(order_id, client_id)combination creates a separate time series in Prometheus/Cloud Monitoring. Thetool_calls_by_namecounter compounds this with a third dimension (tool_name). With many clients, this can overwhelm the monitoring backend.→ Consider whether
client_idis truly needed on gauge observations. For aggregated billing views,order_idalone may suffice. At minimum, thedcr_clients_activegauge (which emitsvalue=1per client withaccount_id + order_id + client_idattributes) should be changed to a grouped count byaccount_id.
Important (5)
-
Gauge names use
_totalsuffix, violating OTel naming conventions[metrics.py:451-505]OTel and Prometheus conventions reserve
_totalfor counters. The Prometheus exporter auto-appends_totalto counter names. Using_totalon gauges (subscriptions_total,tokens_input_total,tokens_output_total,requests_total) will confuse operators and may clash with Prometheus conventions.→ Rename:
subscriptions_total→subscriptions_count,tokens_input_total→input_tokens,tokens_output_total→output_tokens,requests_total→request_count. -
_query_dcr_clientsfetches ALL rows without aggregation[metrics.py:260-275]Unlike subscriptions (grouped+counted) and usage (grouped+summed), DCR clients are fetched row-by-row. Each becomes a gauge observation with
value=1. This is unbounded — if the table grows to thousands of rows, the collector loads them all into memory every poll cycle, and each becomes a separate time series.→ Group by
account_id(oraccount_id + order_id) andCOUNT(*)to match the aggregation pattern used by the other queries. -
No validation on
otel_metrics_collection_interval[config/settings.py:471-474]A user could set
OTEL_METRICS_COLLECTION_INTERVAL=0(or negative), causingasyncio.sleep(0)in a tight loop that hammers the database.→ Add
ge=1to the Field validator:otel_metrics_collection_interval: int = Field(default=60, ge=1, ...) -
Three sequential DB queries in
_collect_oncecould run concurrently[metrics.py:222-227]_query_subscriptions,_query_dcr_clients, and_query_usageare independent async calls run sequentially. On a remote PostgreSQL instance with network latency, this triples the collection time unnecessarily.→ Use
asyncio.gather:subscriptions, dcr_clients, usage = await asyncio.gather( self._query_subscriptions(), self._query_dcr_clients(), self._query_usage(), )
-
No test that
increment_tool_callactually increments when metrics are enabled[tests/test_metrics.py:1002-1010]TestToolCallCounter.test_increment_tool_callonly tests the no-op path (when_tool_call_counter is None). There's no test verifying the counter adds to the OTel counter when metrics are enabled. The E2E test creates a counter but doesn't verify the increment goes throughincrement_tool_call().→ Add a test that sets up an
InMemoryMetricReader, registers instruments, callsincrement_tool_call, and asserts the counter data point appears in the reader.
Suggestions (5)
-
Intermediate
Observationdataclass adds unnecessary indirection[metrics.py:335-339]The
Observationdataclass wraps values before converting tootel_metrics.Observationin the lambda callbacks. The lambdas could constructotel_metrics.Observationdirectly from cache snapshots, eliminating the intermediate type. -
Consider namespace-prefixing metric names
[metrics.py:451-505]Names like
subscriptions_countare generic. Prefixing withlightspeed.orlightspeed_agent.(e.g.,lightspeed.subscriptions_count) avoids collisions if other services share the same Cloud Monitoring project. -
@pytest.mark.asynciomarkers are redundant[tests/test_metrics.py]The project uses
asyncio_mode = "auto"in pyproject.toml, making explicit@pytest.mark.asynciodecorators unnecessary. Not harmful, but inconsistent with the project's existing pattern. -
Usage plugin test refactoring (monkeypatch) is a welcome cleanup
[tests/test_usage_plugin.py]The migration from manual save/restore to
monkeypatch.setattris correct and improves test isolation. Nice improvement. -
Consider adding
_observe_tokens_outputand_observe_requestsindividual unit tests[tests/test_metrics.py]Only
_observe_tokens_inputhas an individual test. The other two gauge callbacks are only tested via the E2E test. Adding focused unit tests would make failures easier to diagnose.
Strengths
- Clean separation of concerns: The cache/collector/callback architecture is well-layered. Polling the DB on a timer and serving stale data to gauge callbacks is the right pattern for OTel observable gauges.
- Atomic cache replacement:
self._cache = MetricsCache(...)is a single reference assignment, making the read/write pattern safe without locks in CPython's GIL. - Graceful degradation: DB errors log a warning and serve stale cache. Metrics disabled = no-op path. Missing Prometheus exporter = try/except with warning. Robust.
- Independent metrics toggle:
OTEL_METRICS_ENABLEDindependent ofOTEL_ENABLED(tracing) is a good design choice for production rollout. - Test coverage is solid: 369 lines of tests covering cache defaults, collector lifecycle, gauge callbacks, DB error handling, settings, and an E2E pipeline.
- Usage plugin test cleanup: The monkeypatch refactoring is a genuine improvement in test quality.
Verdict: Comment (leaning toward Request Changes)
The architecture is sound and the implementation is clean. The WIP label is appropriate — the items above (especially metric naming conventions, cardinality risks, and collection interval validation) should be addressed before merging. None are architecturally difficult to fix.
|
Besides the AI generated review, I have:
|
Hi, thanks for the review! |
ee09acc to
63d025b
Compare
b1a9f7f to
8c17182
Compare
…, and tool calls Introduce a DB-polling MetricsCollector that caches entitlement, DCR client, and usage data, exposed as OTel observable gauges via Prometheus and OTLP exporters. Add a synchronous tool_calls_by_name counter wired into the UsageTrackingPlugin. Metrics are independently toggleable from tracing via OTEL_METRICS_ENABLED. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The collector runs entirely in a single asyncio event loop, so the threading.Lock was unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add logging for metric reader creation and a debug message when an exporter does not support metrics (jaeger/zipkin). Collapse single-item gRPC import to one line. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PrometheusMetricReader uses the standard OTEL_EXPORTER_PROMETHEUS_PORT env var natively, so the custom setting was redundant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gin tests Replace try/finally with monkeypatch.setattr for cleaner test teardown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add managed OTel collector sidecar annotation and OTEL_METRICS_ENABLED env var. App defaults (otlp exporter on localhost:4317) already match the collector sidecar, so no other env vars are needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… level All metrics now use account_id as the sole grouping dimension: - DCR clients: GROUP BY account_id with COUNT(*) instead of per-row - Usage gauges: JOIN usage_records to marketplace_entitlements to resolve account_id, replacing order_id/client_id labels - Tool call counter: reduced to tool_name only Reduces time-series growth from O(orders × clients) to O(accounts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop _total suffix from gauge names (reserved for counters): - subscriptions_total → subscriptions_count - tokens_input_total → input_tokens - tokens_output_total → output_tokens - requests_total → request_count Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lector The three independent queries (subscriptions, DCR clients, usage) now run concurrently instead of sequentially, reducing collection time to the slowest query rather than the sum of all three. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enforce ge=10 seconds to prevent tight polling loops when OTEL_METRICS_COLLECTION_INTERVAL is set too low. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document metric instruments, Cloud Run setup, and minimum instance requirements for metrics collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ant asyncio markers Add test verifying the tool call counter actually increments with correct attributes when metrics are enabled. Rename existing test to clarify it covers the disabled/no-op path. Remove redundant @pytest.mark.asyncio decorators — project uses asyncio_mode = "auto". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m chart - Start prometheus_client HTTP server on port 9464 for metrics scraping - Add otel_metrics_prometheus_port setting back to config - Add ServiceMonitor template for Prometheus Operator auto-discovery - Add GrafanaDashboard template with panels for all agent metrics - Expose metrics port on agent Deployment and Service (conditional) - Allow Prometheus scraping in NetworkPolicy - Add monitoring section to values.yaml with prerequisites documentation - Document full Grafana Operator setup (instance, datasource, dashboard) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Make GrafanaDashboard UID and title dynamic (based on Helm fullname) so multiple agents can coexist in Grafana without collisions - Add PromQL fallback to produce numeric 0 for accounts without DCR clients - Add comment on serviceName uniqueness requirement in values.yaml - Document umbrella chart pattern for multi-agent deployments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the non-existent `run.googleapis.com/otel-collector` annotation with a Google-Built OpenTelemetry Collector sidecar (otelcol-google:0.151.0) that receives OTLP metrics on localhost:4317 and exports them to Cloud Monitoring via the googlemanagedprometheus exporter. The collector config is stored in Secret Manager (otel-collector-config) and created automatically by setup.sh. All features used are GA-level (multi-container sidecar, secret volumes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ee7576b to
bb686a7
Compare
Three dashboards (Marketplace, Usage & Billing, Agent Operations) deployed via an idempotent Cloud Build step that auto-discovers JSON files in deploy/cloudrun/dashboards/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
add OpenTelemetry metrics for subscriptions, DCR clients, usage, and tool calls
Introduce a DB-polling MetricsCollector that caches entitlement, DCR client, and usage data, exposed as OTel observable gauges via Prometheus and OTLP exporters. Add a synchronous tool_calls_by_name counter wired into the UsageTrackingPlugin. Metrics are independently toggleable from tracing via OTEL_METRICS_ENABLED.