fix(eval overview): hide non-output metrics for evaluator steps#3897

Open
mmabrouk wants to merge 3 commits into main from fix/evaluator-overview-output-metrics

Conversation


@mmabrouk mmabrouk commented Mar 3, 2026

Summary

  • Restrict evaluator metrics in Overview to evaluator output namespaces (attributes.ag.data.outputs.* and normalized equivalents)
  • Filter both live run metrics and fallback evaluator metric definitions using the same namespace check
  • Prevent annotation infra metrics (duration, cost, tokens, errors) from showing as evaluator metrics in the Overview section
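The namespace check the summary describes can be sketched as follows. The names `isEvaluatorOutputMetric` and `EVALUATOR_OUTPUT_PATH_PREFIXES` come from `evaluatorMetrics.ts` as discussed in the review comments, but the exact prefix list and call sites here are assumptions, not the actual implementation:

```typescript
// Sketch of the Overview filter: only metric paths under the evaluator
// output namespaces pass; annotation infra metrics do not.
// The prefix list is an assumption based on the PR description.
const EVALUATOR_OUTPUT_PATH_PREFIXES: string[] = [
  "attributes.ag.data.outputs.", // normalized evaluator output namespace
];

function isEvaluatorOutputMetric(path: string): boolean {
  return EVALUATOR_OUTPUT_PATH_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// The same check would be applied to live run metrics and to fallback
// evaluator metric definitions. Hypothetical metric paths:
const shown = [
  "attributes.ag.data.outputs.score",  // evaluator output: kept
  "attributes.ag.metrics.duration",    // infra metric: filtered out
  "attributes.ag.metrics.costs.total", // infra metric: filtered out
].filter(isEvaluatorOutputMetric);

console.log(shown); // ["attributes.ag.data.outputs.score"]
```

Applying one predicate to both the live and fallback code paths keeps the two views of evaluator metrics consistent.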

Testing

  • Not run (frontend-only filtering change)

Filter evaluator overview metrics by output namespaces so annotation infra metrics (duration, cost, tokens, errors) are not displayed as evaluator metrics.

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: agenta-documentation
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Mar 5, 2026 0:09am

dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files) on Mar 3, 2026

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.



🟡 normalizeMetricPath produces paths for the outputs. prefix that isEvaluatorOutputMetric always rejects

When normalizeMetricPath receives a path starting with outputs. (e.g., "outputs.score"), it produces "attributes.ag.outputs.score". The new isEvaluatorOutputMetric filter then checks this against EVALUATOR_OUTPUT_PATH_PREFIXES, but none of them match "attributes.ag.outputs." — note the missing data. segment.

Root Cause

At evaluatorMetrics.ts:161, normalizeMetricPath maps outputs.X → attributes.ag.outputs.X:

if (trimmed.startsWith("outputs.")) return `attributes.ag.${trimmed}`

But the EVALUATOR_OUTPUT_PATH_PREFIXES at evaluatorMetrics.ts:30-35 does not include "attributes.ag.outputs." — it only includes "attributes.ag.data.outputs." (with the data. segment). So isEvaluatorOutputMetric("attributes.ag.outputs.score") returns false, and the metric is silently dropped at line 179.

Compare with the data. prefix handling at line 160: normalizeMetricPath("data.outputs.score") → "attributes.ag.data.outputs.score", which correctly passes the filter.

This inconsistency means any evaluator definition whose metric path starts with outputs. (e.g., "outputs.score") will have that metric silently excluded from fallback metrics.

Impact: In practice, the standard extractMetrics flow (evaluators.ts:87-100) produces bare key names like "score" which hit the default branch of normalizeMetricPath and correctly get prefixed with attributes.ag.data.outputs.. So this bug would only manifest if an evaluator definition provides a metric path explicitly prefixed with outputs., which is a supported but apparently uncommon code path in normalizeMetricPath.

(Refers to line 161)
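The reported mismatch can be reproduced in isolation. This sketch assumes the normalization rules quoted in the finding above (the real normalizeMetricPath may have more branches), and the suggested fix at the end is one plausible resolution, not the one the PR adopts:

```typescript
// Assumed prefix list, per the finding: only the "data." variant is present.
const EVALUATOR_OUTPUT_PATH_PREFIXES: string[] = ["attributes.ag.data.outputs."];

const isEvaluatorOutputMetric = (path: string): boolean =>
  EVALUATOR_OUTPUT_PATH_PREFIXES.some((p) => path.startsWith(p));

// Simplified reconstruction of normalizeMetricPath's branches.
function normalizeMetricPath(path: string): string {
  const trimmed = path.trim();
  if (trimmed.startsWith("attributes.")) return trimmed;
  if (trimmed.startsWith("data.")) return `attributes.ag.${trimmed}`;    // ~line 160: ok
  if (trimmed.startsWith("outputs.")) return `attributes.ag.${trimmed}`; // ~line 161: bug
  return `attributes.ag.data.outputs.${trimmed}`; // bare keys like "score"
}

// "outputs.score" → "attributes.ag.outputs.score": no "data." segment,
// so the filter rejects it and the metric is silently dropped.
console.log(isEvaluatorOutputMetric(normalizeMetricPath("outputs.score"))); // false
// Bare keys and "data."-prefixed paths normalize correctly and pass:
console.log(isEvaluatorOutputMetric(normalizeMetricPath("score")));         // true

// One possible fix: route "outputs.*" into the data namespace so both
// helpers agree:
//   if (trimmed.startsWith("outputs.")) return `attributes.ag.data.${trimmed}`
```

With that one-line change, normalizeMetricPath("outputs.score") would yield "attributes.ag.data.outputs.score" and survive the filter like the other branches.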




github-actions bot commented Mar 3, 2026

Railway Preview Environment

Preview URL: https://gateway-production-9925.up.railway.app/w
Project: agenta-oss-pr-3897
Image tag: pr-3897-35873c3
Status: Deployed
Updated at: 2026-03-05T12:16:18.448Z

@mmabrouk mmabrouk requested review from ardaerzin March 3, 2026 22:46

Labels

Evaluation, Frontend, size:S (This PR changes 10-29 lines, ignoring generated files)
