test: rewrite eval runtime span tests to exercise real code #1566
Open
smflorentino wants to merge 7 commits into main from
Conversation
All ~60 tests in test_eval_runtime_spans.py were tautological — they
constructed dicts inline and asserted against those same dicts without
ever calling production code (e.g. `assert {"span_type": "eval_set_run"}["span_type"] == "eval_set_run"`).
Replace with 24 tests that inject a SpanCapturingTracer into the
runtime, call the real methods (execute, _execute_eval, run_evaluator),
and assert on the captured spans.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
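The injection pattern described above can be sketched roughly as follows. `SpanCapturingTracer` and its interface here are assumptions for illustration; the real class in the test file and the runtime's tracer hooks may differ:

```python
from dataclasses import dataclass, field

@dataclass
class CapturedSpan:
    """One recorded span: name, attributes, and a link to its parent."""
    name: str
    attributes: dict = field(default_factory=dict)
    parent: "CapturedSpan | None" = None

class SpanCapturingTracer:
    """Records every span started through it so tests can assert on the tree."""
    def __init__(self):
        self.spans = []   # all spans, in start order
        self._stack = []  # currently open spans, for parent linking

    def start_span(self, name, attributes=None):
        span = CapturedSpan(name, dict(attributes or {}),
                            parent=self._stack[-1] if self._stack else None)
        self.spans.append(span)
        self._stack.append(span)
        return span

    def end_span(self):
        self._stack.pop()

# Usage in a test: inject the tracer, run the real pipeline (simulated here),
# then assert on the captured span tree instead of hand-built dicts.
tracer = SpanCapturingTracer()
tracer.start_span("eval_set_run", {"span_type": "eval_set_run"})
tracer.start_span("evaluation", {"span_type": "evaluation"})
tracer.end_span()
tracer.end_span()

assert [s.name for s in tracer.spans] == ["eval_set_run", "evaluation"]
assert tracer.spans[1].parent is tracer.spans[0]
```

The point of the pattern is that the spans being asserted on were produced by production code paths, so a regression in span wiring actually fails the test.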
All ~60 tests in test_eval_runtime_spans.py were tautological — they constructed dicts inline and asserted against those same dicts without ever calling production code. Replace with 25 tests that inject a SpanCapturingTracer into the runtime, call the real methods (execute, _execute_eval, run_evaluator), and assert on the captured spans. TestEvalSetRunSpan tests run the full pipeline: execute() flows real eval items through _execute_eval() and run_evaluator(), then verifies the parent span has aggregate scores and metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Every test now runs the full pipeline: execute() -> initiate_evaluation() -> _execute_eval() per item -> run_evaluator() per evaluator. Only execute_runtime (the agent invocation) is mocked. Each test class asserts on a different level of the span tree that the evaluation produces, rather than testing isolated methods disconnected from the pipeline they belong to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace terse abbreviations (Acc, rel, i1, E1) with meaningful names that convey what is being evaluated (ExactMatchEvaluator, calculator-addition, sentiment-positive-review, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These are mocks, not real evaluator implementations — name them accordingly. Tests that need one evaluator use the default MockEvaluator; tests that need multiple use MockEvaluatorA/B/C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
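The naming convention described above might look like the sketch below. The class bodies and the `run` signature are assumptions; only the MockEvaluator/MockEvaluatorA/B/C names come from the commit message:

```python
class MockEvaluator:
    """Default mock for tests that need a single evaluator: fixed score."""
    id = "mock-evaluator"

    def run(self, item, output):
        # Deterministic result so span assertions are stable.
        return {"score": 1.0, "justification": "mocked"}

# Tests that need several distinct evaluators use suffixed subclasses,
# so captured spans can be matched back to a specific evaluator by id.
class MockEvaluatorA(MockEvaluator):
    id = "mock-evaluator-a"

class MockEvaluatorB(MockEvaluator):
    id = "mock-evaluator-b"

class MockEvaluatorC(MockEvaluator):
    id = "mock-evaluator-c"

evaluators = [MockEvaluatorA(), MockEvaluatorB(), MockEvaluatorC()]
assert [e.id for e in evaluators] == [
    "mock-evaluator-a", "mock-evaluator-b", "mock-evaluator-c",
]
```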
- Test that the evaluation span gets StatusCode.ERROR when the agent returns an error, and StatusCode.OK on success
- Test that the evaluation span carries the eval item's inputs as serialized JSON in the input attribute

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
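The error-to-status mapping being tested can be sketched like this. The `StatusCode` enum stands in for the real one (likely `opentelemetry.trace.StatusCode`), and `status_for` is a hypothetical helper, not the runtime's actual function:

```python
from enum import Enum

class StatusCode(Enum):
    """Stand-in for the tracing library's status codes."""
    OK = "OK"
    ERROR = "ERROR"

def status_for(agent_result: dict) -> StatusCode:
    """Assumed rule under test: an agent error marks the span ERROR, else OK."""
    return StatusCode.ERROR if agent_result.get("error") else StatusCode.OK

# Success path -> OK; agent failure -> ERROR.
assert status_for({"output": "42"}) is StatusCode.OK
assert status_for({"error": "tool call failed"}) is StatusCode.ERROR
```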
- Assert agentId value (not just existence) on both eval set run and evaluation spans
- Assert inputSchema/outputSchema content matches the runtime schema
- Assert StatusCode.OK on the eval set run span
- Assert evaluatorId and justification in the evaluation output span's output JSON (not just as direct span attributes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
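Asserting on values inside the serialized output JSON, rather than only on direct span attributes, might look like the following. The attribute name `output` and the JSON keys are illustrative, not confirmed from the test file:

```python
import json

# Attributes as a span-capturing tracer might record them: the evaluator's
# result is stored as a JSON string, not as individual attributes.
captured_attributes = {
    "output": json.dumps({
        "evaluatorId": "exact-match-evaluator",
        "score": 1.0,
        "justification": "outputs matched exactly",
    })
}

# Deserialize and assert on the content, so a regression in serialization
# (e.g. a dropped field) is caught even if direct attributes look fine.
output = json.loads(captured_attributes["output"])
assert output["evaluatorId"] == "exact-match-evaluator"
assert output["justification"] == "outputs matched exactly"
```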
Summary
- All ~60 tests in `test_eval_runtime_spans.py` were tautological — they constructed dicts inline and asserted against those same dicts without calling any production code
- New tests drive the real pipeline through `execute()` and assert on the resulting span tree
- Only `execute_runtime` (the actual agent invocation) is mocked — `initiate_evaluation`, `_execute_eval`, `run_evaluator`, `compute_evaluator_scores`, and all span configuration functions run for real
- `TestEvalSetRunSpan` — batch-level span with aggregate scores, metadata values, schema content, status, and conditional `eval_set_run_id`
- `TestEvaluationSpan` — per-item spans with item attributes, scores, input data, `agentId` value, and error/success status
- `TestEvaluatorSpan` — per-evaluator spans with evaluator attributes
- `TestEvaluationOutputSpan` — score output spans with justification extraction (pydantic, string, none) verified in both direct attributes and serialized output JSON
- `TestSpanHierarchy` — correct span ordering and counts

Test plan

`pytest tests/cli/eval/test_eval_runtime_spans.py -v` — all 24 tests pass

🤖 Generated with Claude Code