test: rewrite eval runtime span tests to exercise real code #1566
Open
smflorentino wants to merge 7 commits into main from
Conversation
All ~60 tests in test_eval_runtime_spans.py were tautological — they
constructed dicts inline and asserted against those same dicts without
ever calling production code (e.g. `assert {"span_type": "eval_set_run"}["span_type"] == "eval_set_run"`).
Replace with 24 tests that inject a SpanCapturingTracer into the
runtime, call the real methods (execute, _execute_eval, run_evaluator),
and assert on the captured spans.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
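The injection pattern described above can be sketched roughly as follows. `SpanCapturingTracer` and its interface here are assumptions for illustration; the real class in the test file and the runtime's tracer hooks may differ:

```python
from dataclasses import dataclass, field

@dataclass
class CapturedSpan:
    """One recorded span: name, attributes, and a link to its parent."""
    name: str
    attributes: dict = field(default_factory=dict)
    parent: "CapturedSpan | None" = None

class SpanCapturingTracer:
    """Records every span started through it so tests can assert on the tree."""
    def __init__(self):
        self.spans = []   # all spans, in start order
        self._stack = []  # currently open spans, for parent linking

    def start_span(self, name, attributes=None):
        span = CapturedSpan(name, dict(attributes or {}),
                            parent=self._stack[-1] if self._stack else None)
        self.spans.append(span)
        self._stack.append(span)
        return span

    def end_span(self):
        self._stack.pop()

# Usage in a test: inject the tracer, run the real pipeline (simulated here),
# then assert on the captured span tree instead of hand-built dicts.
tracer = SpanCapturingTracer()
tracer.start_span("eval_set_run", {"span_type": "eval_set_run"})
tracer.start_span("evaluation", {"span_type": "evaluation"})
tracer.end_span()
tracer.end_span()

assert [s.name for s in tracer.spans] == ["eval_set_run", "evaluation"]
assert tracer.spans[1].parent is tracer.spans[0]
```

The point of the pattern is that the spans being asserted on were produced by production code paths, so a regression in span wiring actually fails the test.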
All ~60 tests in test_eval_runtime_spans.py were tautological — they constructed dicts inline and asserted against those same dicts without ever calling production code. Replace with 25 tests that inject a SpanCapturingTracer into the runtime, call the real methods (execute, _execute_eval, run_evaluator), and assert on the captured spans. TestEvalSetRunSpan tests run the full pipeline: execute() flows real eval items through _execute_eval() and run_evaluator(), then verifies the parent span has aggregate scores and metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Every test now runs the full pipeline: execute() -> initiate_evaluation() -> _execute_eval() per item -> run_evaluator() per evaluator. Only execute_runtime (the agent invocation) is mocked. Each test class asserts on a different level of the span tree that the evaluation produces, rather than testing isolated methods disconnected from the pipeline they belong to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace terse abbreviations (Acc, rel, i1, E1) with meaningful names that convey what is being evaluated (ExactMatchEvaluator, calculator-addition, sentiment-positive-review, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

These are mocks, not real evaluator implementations — name them accordingly. Tests that need one evaluator use the default MockEvaluator; tests that need multiple use MockEvaluatorA/B/C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
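The naming convention described above might look like the sketch below. The class bodies and the `run` signature are assumptions; only the MockEvaluator/MockEvaluatorA/B/C names come from the commit message:

```python
class MockEvaluator:
    """Default mock for tests that need a single evaluator: fixed score."""
    id = "mock-evaluator"

    def run(self, item, output):
        # Deterministic result so span assertions are stable.
        return {"score": 1.0, "justification": "mocked"}

# Tests that need several distinct evaluators use suffixed subclasses,
# so captured spans can be matched back to a specific evaluator by id.
class MockEvaluatorA(MockEvaluator):
    id = "mock-evaluator-a"

class MockEvaluatorB(MockEvaluator):
    id = "mock-evaluator-b"

class MockEvaluatorC(MockEvaluator):
    id = "mock-evaluator-c"

evaluators = [MockEvaluatorA(), MockEvaluatorB(), MockEvaluatorC()]
assert [e.id for e in evaluators] == [
    "mock-evaluator-a", "mock-evaluator-b", "mock-evaluator-c",
]
```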
- Test that the evaluation span gets StatusCode.ERROR when the agent returns an error, and StatusCode.OK on success
- Test that the evaluation span carries the eval item's inputs as serialized JSON in the input attribute

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
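The error-to-status mapping being tested can be sketched like this. The `StatusCode` enum stands in for the real one (likely `opentelemetry.trace.StatusCode`), and `status_for` is a hypothetical helper, not the runtime's actual function:

```python
from enum import Enum

class StatusCode(Enum):
    """Stand-in for the tracing library's status codes."""
    OK = "OK"
    ERROR = "ERROR"

def status_for(agent_result: dict) -> StatusCode:
    """Assumed rule under test: an agent error marks the span ERROR, else OK."""
    return StatusCode.ERROR if agent_result.get("error") else StatusCode.OK

# Success path -> OK; agent failure -> ERROR.
assert status_for({"output": "42"}) is StatusCode.OK
assert status_for({"error": "tool call failed"}) is StatusCode.ERROR
```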
- Assert agentId value (not just existence) on both eval set run and evaluation spans
- Assert inputSchema/outputSchema content matches the runtime schema
- Assert StatusCode.OK on the eval set run span
- Assert evaluatorId and justification in the evaluation output span's output JSON (not just as direct span attributes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
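Asserting on values inside the serialized output JSON, rather than only on direct span attributes, might look like the following. The attribute name `output` and the JSON keys are illustrative, not confirmed from the test file:

```python
import json

# Attributes as a span-capturing tracer might record them: the evaluator's
# result is stored as a JSON string, not as individual attributes.
captured_attributes = {
    "output": json.dumps({
        "evaluatorId": "exact-match-evaluator",
        "score": 1.0,
        "justification": "outputs matched exactly",
    })
}

# Deserialize and assert on the content, so a regression in serialization
# (e.g. a dropped field) is caught even if direct attributes look fine.
output = json.loads(captured_attributes["output"])
assert output["evaluatorId"] == "exact-match-evaluator"
assert output["justification"] == "outputs matched exactly"
```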
Summary
- All ~60 tests in `test_eval_runtime_spans.py` were tautological — they constructed dicts inline and asserted against those same dicts without calling any production code
- New tests drive the real pipeline through `execute()` and assert on the resulting span tree
- Only `execute_runtime` (the actual agent invocation) is mocked — `initiate_evaluation`, `_execute_eval`, `run_evaluator`, `compute_evaluator_scores`, and all span configuration functions run for real
- `TestEvalSetRunSpan` — batch-level span with aggregate scores, metadata values, schema content, status, and conditional `eval_set_run_id`
- `TestEvaluationSpan` — per-item spans with item attributes, scores, input data, `agentId` value, and error/success status
- `TestEvaluatorSpan` — per-evaluator spans with evaluator attributes
- `TestEvaluationOutputSpan` — score output spans with justification extraction (pydantic, string, none) verified in both direct attributes and serialized output JSON
- `TestSpanHierarchy` — correct span ordering and counts

Test plan

`pytest tests/cli/eval/test_eval_runtime_spans.py -v` — all 24 tests pass

🤖 Generated with Claude Code