Skip to content

[WIP] OpenTelemetry metrics#273

Open
IlonaShishov wants to merge 17 commits into
RHEcosystemAppEng:mainfrom
IlonaShishov:metrics
Open

[WIP] OpenTelemetry metrics#273
IlonaShishov wants to merge 17 commits into
RHEcosystemAppEng:mainfrom
IlonaShishov:metrics

Conversation

@IlonaShishov

Copy link
Copy Markdown
Collaborator

add OpenTelemetry metrics for subscriptions, DCR clients, usage, and tool calls

Introduce a DB-polling MetricsCollector that caches entitlement, DCR client, and usage data, exposed as OTel observable gauges via Prometheus and OTLP exporters. Add a synchronous tool_calls_by_name counter wired into the UsageTrackingPlugin. Metrics are independently toggleable from tracing via OTEL_METRICS_ENABLED.

@luis5tb luis5tb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: OpenTelemetry Metrics

+980/-119 across 12 files — reviewed the full diff.


Critical (1)

  1. High-cardinality client_id attribute on gauges and counters risks metric explosion [metrics.py:378-380, 430-435]

    Every unique (order_id, client_id) combination creates a separate time series in Prometheus/Cloud Monitoring. The tool_calls_by_name counter compounds this with a third dimension (tool_name). With many clients, this can overwhelm the monitoring backend.

    → Consider whether client_id is truly needed on gauge observations. For aggregated billing views, order_id alone may suffice. At minimum, the dcr_clients_active gauge (which emits value=1 per client with account_id + order_id + client_id attributes) should be changed to a grouped count by account_id.


Important (5)

  1. Gauge names use _total suffix, violating OTel naming conventions [metrics.py:451-505]

    OTel and Prometheus conventions reserve _total for counters. The Prometheus exporter auto-appends _total to counter names. Using _total on gauges (subscriptions_total, tokens_input_total, tokens_output_total, requests_total) will confuse operators and may clash with Prometheus conventions.

    → Rename: subscriptions_totalsubscriptions_count, tokens_input_totalinput_tokens, tokens_output_totaloutput_tokens, requests_totalrequest_count.

  2. _query_dcr_clients fetches ALL rows without aggregation [metrics.py:260-275]

    Unlike subscriptions (grouped+counted) and usage (grouped+summed), DCR clients are fetched row-by-row. Each becomes a gauge observation with value=1. This is unbounded — if the table grows to thousands of rows, the collector loads them all into memory every poll cycle, and each becomes a separate time series.

    → Group by account_id (or account_id + order_id) and COUNT(*) to match the aggregation pattern used by the other queries.

  3. No validation on otel_metrics_collection_interval [config/settings.py:471-474]

    A user could set OTEL_METRICS_COLLECTION_INTERVAL=0 (or negative), causing asyncio.sleep(0) in a tight loop that hammers the database.

    → Add ge=1 to the Field validator: otel_metrics_collection_interval: int = Field(default=60, ge=1, ...)

  4. Three sequential DB queries in _collect_once could run concurrently [metrics.py:222-227]

    _query_subscriptions, _query_dcr_clients, and _query_usage are independent async calls run sequentially. On a remote PostgreSQL instance with network latency, this triples the collection time unnecessarily.

    → Use asyncio.gather:

    subscriptions, dcr_clients, usage = await asyncio.gather(
        self._query_subscriptions(),
        self._query_dcr_clients(),
        self._query_usage(),
    )
  5. No test that increment_tool_call actually increments when metrics are enabled [tests/test_metrics.py:1002-1010]

    TestToolCallCounter.test_increment_tool_call only tests the no-op path (when _tool_call_counter is None). There's no test verifying the counter adds to the OTel counter when metrics are enabled. The E2E test creates a counter but doesn't verify the increment goes through increment_tool_call().

    → Add a test that sets up an InMemoryMetricReader, registers instruments, calls increment_tool_call, and asserts the counter data point appears in the reader.


Suggestions (5)

  1. Intermediate Observation dataclass adds unnecessary indirection [metrics.py:335-339]

    The Observation dataclass wraps values before converting to otel_metrics.Observation in the lambda callbacks. The lambdas could construct otel_metrics.Observation directly from cache snapshots, eliminating the intermediate type.

  2. Consider namespace-prefixing metric names [metrics.py:451-505]

    Names like subscriptions_count are generic. Prefixing with lightspeed. or lightspeed_agent. (e.g., lightspeed.subscriptions_count) avoids collisions if other services share the same Cloud Monitoring project.

  3. @pytest.mark.asyncio markers are redundant [tests/test_metrics.py]

    The project uses asyncio_mode = "auto" in pyproject.toml, making explicit @pytest.mark.asyncio decorators unnecessary. Not harmful, but inconsistent with the project's existing pattern.

  4. Usage plugin test refactoring (monkeypatch) is a welcome cleanup [tests/test_usage_plugin.py]

    The migration from manual save/restore to monkeypatch.setattr is correct and improves test isolation. Nice improvement.

  5. Consider adding _observe_tokens_output and _observe_requests individual unit tests [tests/test_metrics.py]

    Only _observe_tokens_input has an individual test. The other two gauge callbacks are only tested via the E2E test. Adding focused unit tests would make failures easier to diagnose.


Strengths

  • Clean separation of concerns: The cache/collector/callback architecture is well-layered. Polling the DB on a timer and serving stale data to gauge callbacks is the right pattern for OTel observable gauges.
  • Atomic cache replacement: self._cache = MetricsCache(...) is a single reference assignment, making the read/write pattern safe without locks in CPython's GIL.
  • Graceful degradation: DB errors log a warning and serve stale cache. Metrics disabled = no-op path. Missing Prometheus exporter = try/except with warning. Robust.
  • Independent metrics toggle: OTEL_METRICS_ENABLED independent of OTEL_ENABLED (tracing) is a good design choice for production rollout.
  • Test coverage is solid: 369 lines of tests covering cache defaults, collector lifecycle, gauge callbacks, DB error handling, settings, and an E2E pipeline.
  • Usage plugin test cleanup: The monkeypatch refactoring is a genuine improvement in test quality.

Verdict: Comment (leaning toward Request Changes)

The architecture is sound and the implementation is clean. The WIP label is appropriate — the items above (especially metric naming conventions, cardinality risks, and collection interval validation) should be addressed before merging. None are architecturally difficult to fix.

@luis5tb

luis5tb commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Besides the AI generated review, I have:

  • I think this would need some extra information at the cloudrun README to describe how to set it up
  • What happen if the agent gets scaled to 0? Perhaps this requires a note about this requiring the scale to be at least 1 (if that is the case)

@IlonaShishov

Copy link
Copy Markdown
Collaborator Author

Besides the AI generated review, I have:

  • I think this would need some extra information at the cloudrun README to describe how to set it up
  • What happen if the agent gets scaled to 0? Perhaps this requires a note about this requiring the scale to be at least 1 (if that is the case)

Hi, thanks for the review!
Im working on the changes,
just one note, looking at service.yaml: run.googleapis.com/minScale: "1" - the agent is already configured with minimum 1 instance. It doesn't scale to zero. So it's not a concern with the current config. But I will document it in case someone changes minScale to 0, the metrics collector would stop polling and no metrics would flow until an instance spins back up :)

@IlonaShishov IlonaShishov force-pushed the metrics branch 2 times, most recently from ee09acc to 63d025b Compare June 3, 2026 13:40
@IlonaShishov

Copy link
Copy Markdown
Collaborator Author

CI (lint, test, build and vulnerability) issues will be resolved by PRs #285 and #286

@IlonaShishov IlonaShishov force-pushed the metrics branch 2 times, most recently from b1a9f7f to 8c17182 Compare June 11, 2026 14:00
IlonaShishov and others added 15 commits June 14, 2026 12:08
…, and tool calls

Introduce a DB-polling MetricsCollector that caches entitlement, DCR client,
and usage data, exposed as OTel observable gauges via Prometheus and OTLP
exporters. Add a synchronous tool_calls_by_name counter wired into the
UsageTrackingPlugin. Metrics are independently toggleable from tracing via
OTEL_METRICS_ENABLED.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The collector runs entirely in a single asyncio event loop, so the
threading.Lock was unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add logging for metric reader creation and a debug message when an
exporter does not support metrics (jaeger/zipkin). Collapse single-item
gRPC import to one line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PrometheusMetricReader uses the standard OTEL_EXPORTER_PROMETHEUS_PORT
env var natively, so the custom setting was redundant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gin tests

Replace try/finally with monkeypatch.setattr for cleaner test teardown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add managed OTel collector sidecar annotation and OTEL_METRICS_ENABLED
env var. App defaults (otlp exporter on localhost:4317) already match
the collector sidecar, so no other env vars are needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… level

All metrics now use account_id as the sole grouping dimension:
- DCR clients: GROUP BY account_id with COUNT(*) instead of per-row
- Usage gauges: JOIN usage_records to marketplace_entitlements to
  resolve account_id, replacing order_id/client_id labels
- Tool call counter: reduced to tool_name only

Reduces time-series growth from O(orders × clients) to O(accounts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop _total suffix from gauge names (reserved for counters):
- subscriptions_total → subscriptions_count
- tokens_input_total → input_tokens
- tokens_output_total → output_tokens
- requests_total → request_count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lector

The three independent queries (subscriptions, DCR clients, usage) now
run concurrently instead of sequentially, reducing collection time to
the slowest query rather than the sum of all three.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enforce ge=10 seconds to prevent tight polling loops when
OTEL_METRICS_COLLECTION_INTERVAL is set too low.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document metric instruments, Cloud Run setup, and minimum instance
requirements for metrics collection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ant asyncio markers

Add test verifying the tool call counter actually increments with
correct attributes when metrics are enabled. Rename existing test
to clarify it covers the disabled/no-op path.

Remove redundant @pytest.mark.asyncio decorators — project uses
asyncio_mode = "auto".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m chart

- Start prometheus_client HTTP server on port 9464 for metrics scraping
- Add otel_metrics_prometheus_port setting back to config
- Add ServiceMonitor template for Prometheus Operator auto-discovery
- Add GrafanaDashboard template with panels for all agent metrics
- Expose metrics port on agent Deployment and Service (conditional)
- Allow Prometheus scraping in NetworkPolicy
- Add monitoring section to values.yaml with prerequisites documentation
- Document full Grafana Operator setup (instance, datasource, dashboard)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Make GrafanaDashboard UID and title dynamic (based on Helm fullname)
  so multiple agents can coexist in Grafana without collisions
- Add PromQL fallback to produce numeric 0 for accounts without DCR clients
- Add comment on serviceName uniqueness requirement in values.yaml
- Document umbrella chart pattern for multi-agent deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the non-existent `run.googleapis.com/otel-collector` annotation
with a Google-Built OpenTelemetry Collector sidecar (otelcol-google:0.151.0)
that receives OTLP metrics on localhost:4317 and exports them to Cloud
Monitoring via the googlemanagedprometheus exporter.

The collector config is stored in Secret Manager (otel-collector-config)
and created automatically by setup.sh. All features used are GA-level
(multi-container sidecar, secret volumes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@IlonaShishov IlonaShishov force-pushed the metrics branch 2 times, most recently from ee7576b to bb686a7 Compare June 15, 2026 13:15
Three dashboards (Marketplace, Usage & Billing, Agent Operations)
deployed via an idempotent Cloud Build step that auto-discovers
JSON files in deploy/cloudrun/dashboards/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants