feat: Add retrieval metrics, hierarchical assertion scoring, and data-linked questions #52
Open
- Add TextUnitFieldsConfig for flexible column name mapping (id, text, embedding, short_id)
- Support multiple file formats: parquet, csv, json, jsonl
- Add assessor_type config to choose between 'rationale' and 'bing' raters
- Add significance testing utilities (stats.py)
- Update generate-retrieval-reference CLI with embedding generation support
- Add cluster_match_by config option for retrieval metrics (text/id/short_id)
- Fix cluster loading to support both text_units and text_unit_ids formats
- Fix cluster.id attribute access in fidelity.py
- Add entity question generation with bridge/comparison/intersection types
- Add batch validation with entity relevance check
- Use MMR sampling for better deduplication
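The MMR sampling mentioned above greedily picks items that are relevant to a query but dissimilar to items already selected, which suppresses near-duplicates. A minimal sketch assuming dense embedding rows; the function name and the `lam` trade-off parameter are illustrative, not the PR's actual interface:

```python
import numpy as np

def mmr_sample(candidates: np.ndarray, query: np.ndarray, k: int, lam: float = 0.5) -> list[int]:
    """Greedy maximal marginal relevance: balance relevance to the query
    against similarity to already-selected items (lam=1 is pure relevance)."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    rel = c @ q                                   # cosine relevance to the query
    selected: list[int] = []
    remaining = list(range(len(c)))
    while remaining and len(selected) < k:
        if not selected:
            best = remaining[int(np.argmax(rel[remaining]))]
        else:
            # Penalize similarity to the closest already-selected item.
            sim_to_sel = (c[remaining] @ c[selected].T).max(axis=1)
            scores = lam * rel[remaining] - (1 - lam) * sim_to_sel
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lam`, a near-duplicate of an already-selected item loses to a less similar but still relevant candidate, which is the deduplication effect the commit describes.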
- Add match_by param to calculate_single_query_fidelity and calculate_fidelity
- Add match_by param to calculate_single_query_recall and calculate_recall
- Add cluster_match_by param to calculate_retrieval_metrics and extract_per_query_metrics
- Pass cluster_match_by through run_retrieval_evaluation to all metric calculations
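The `match_by`/`cluster_match_by` options above select which text-unit field keys the comparison between retrieved units and a reference cluster. A hypothetical sketch of the idea; the real logic lives in the fidelity/recall functions named above, and `cluster_match_count` is an illustrative helper, not the PR's API:

```python
from dataclasses import dataclass

@dataclass
class TextUnit:
    id: str
    short_id: str
    text: str

def cluster_match_count(retrieved: list[TextUnit], cluster: list[TextUnit], match_by: str = "id") -> int:
    """Count retrieved units present in the reference cluster, keyed by match_by."""
    if match_by not in {"id", "short_id", "text"}:
        raise ValueError(f"unsupported match_by: {match_by}")
    keys = {getattr(u, match_by) for u in cluster}
    return sum(1 for u in retrieved if getattr(u, match_by) in keys)
```

Matching by `text` tolerates datasets whose unit ids differ between the index and the reference set, while `id`/`short_id` matching is cheaper and exact.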
…icance tests
- Save map step assertions for global questions (map_assertions.json)
- Track supporting_assertions (child assertions) for each global assertion
- Save assertion sources to separate files (assertion_sources.json, map_assertion_sources.json)
- Add paired=True parameter for repeated measures statistical design
- Update significance tests to use Friedman/Wilcoxon for repeated measures
- Use paired tests in assertion and retrieval scoring comparisons
The Friedman test requires at least 3 conditions/groups. For exactly 2 groups, we now skip the omnibus test and directly perform a paired comparison:
- Normal data: paired t-test
- Non-normal data: Wilcoxon signed-rank test

Added a _compare_two_groups_paired() helper function that:
- Checks normality of the paired differences
- Runs the appropriate paired test
- Returns a GroupComparisonResult with consistent structure
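A sketch of what `_compare_two_groups_paired()` might look like with SciPy. The return value is simplified to a dict rather than the actual `GroupComparisonResult`, and the 0.05 normality threshold is an assumption:

```python
import numpy as np
from scipy import stats

def compare_two_groups_paired(a, b, alpha: float = 0.05) -> dict:
    """Paired comparison of two conditions measured on the same questions."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Shapiro-Wilk on the paired differences decides which test applies.
    if stats.shapiro(diffs).pvalue > alpha:
        name, res = "paired t-test", stats.ttest_rel(a, b)
    else:
        name, res = "wilcoxon signed-rank", stats.wilcoxon(a, b)
    return {"test": name, "statistic": float(res.statistic), "pvalue": float(res.pvalue)}
```

Both branches condition on the same paired design, so the reported p-value always reflects within-question variation rather than pooled variance.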
Link Questions (renamed from entity_questions):
- Rename entity_questions to link_questions throughout codebase
- Add temporal question type with dedicated system prompt
- Reorganize batch_validation_prompt.txt with clearer structure
- Add question validators for link and global questions

AutoE Module Refactoring:
- Split assertion_scores into assertion/ submodule (hierarchical, standard, aggregation)
- Move pairwise_scores, reference_scores, retrieval_scores to submodules
- Add hierarchical assertion scoring prompts

Other Changes:
- Add global questions batch validation prompt
- Add assertion generation stats tracking
- Update CLI and config for new structure
- Add run_hierarchical_assertion_evaluation pipeline function for multiple RAGs
- Add MultiRAGHierarchicalAssertionConfig for CLI multi-RAG mode
- Fix metric consistency: use per-question averaging throughout
- Fix supporting_pass_rate to not deduplicate (same assertion can have different results under different globals)
- Update CLI to auto-detect single vs multi-RAG config format
- Update autoe.ipynb with hierarchical assertion examples
- Update docs/cli/autoe.md with multi-RAG hierarchical config examples
- Rename retrieval_scores to retrieval_metrics in imports
…or multi-RAG hierarchical config
- Add clustered permutation test as optional secondary analysis for assertion-level significance testing (accounts for within-question correlation by permuting labels at the question/cluster level)
- Add summarize_significance_results() to produce a combined summary table across all metrics and test types
- Remove *_passed conditional metrics from significance tests (keep as descriptive stats only) to avoid selection bias from conditioning on a post-treatment variable
- Add per-question avg global pass rate print in Step 1 for consistency
- Update docs/cli/autoe.md with clustered permutation config and output
- Update autoe.ipynb notebook with clustered permutation examples
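The clustered permutation test can be sketched as a sign-flip test on per-question paired differences: flipping the A/B labels for an entire question at once preserves the within-question correlation between its assertions. The function name and interface below are illustrative, not the PR's actual code:

```python
import numpy as np

def clustered_permutation_test(scores_a, scores_b, n_permutations: int = 10_000, seed: int = 0):
    """Two-sided permutation test where scores_a[i]/scores_b[i] are the
    aggregate scores (e.g. mean assertion pass rate) for question i."""
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Flip the sign of each question's difference with probability 1/2,
    # i.e. permute condition labels at the question (cluster) level.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # Add-one correction so the p-value is never exactly zero.
    pvalue = (1 + int((np.abs(null) >= abs(observed)).sum())) / (1 + n_permutations)
    return observed, pvalue
```

Because whole questions flip together, assertions from the same question never end up split across conditions, which is exactly the correlation structure the commit says the test accounts for.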
- Fix KeyError: use 'support_level' instead of 'support_coverage' in hierarchical mode (column name mismatch)
- Save summary CSVs (by_question, by_assertion) in single-RAG standard mode
- Save eval_summary.json in both standard and hierarchical CLI modes
- Add avg_question_pass_rate metric (per-question pass rate averaged)
- Extract summarize_standard_scores() and compute_hierarchical_eval_summary() into aggregation.py as shared functions
- Refactor CLI and multi-RAG pipeline to use shared functions, eliminating code duplication and ensuring consistent metric computation
- Update module exports in assertion/__init__.py and autoe/__init__.py
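The `avg_question_pass_rate` metric averages per-question pass rates, so every question carries equal weight regardless of how many assertions it has; a pooled rate would overweight assertion-heavy questions. A minimal sketch assuming a question-to-results mapping (the actual data structures in the PR differ):

```python
def avg_question_pass_rate(results: dict[str, list[bool]]) -> float:
    """Mean of per-question pass rates; each question contributes equally."""
    per_question = [sum(passes) / len(passes) for passes in results.values()]
    return sum(per_question) / len(per_question) if per_question else 0.0
```

For a question with 4/4 assertions passed and one with 0/2 passed, this yields 0.5, while pooling all six assertion results would give 4/6.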
Rename all identifiers, file names, directory names, config keys, enum values, and documentation references from 'data_link'/'link' to 'data_linked'/'linked' for grammatical consistency. Changes include:
- Python source: classes, functions, variables, constants, imports (e.g., DataLinkQuestionGen -> DataLinkedQuestionGen, DataLinkConfig -> DataLinkedConfig, QuestionType.DATA_LINK -> DATA_LINKED)
- File/directory renames: link_question_gen.py -> linked_question_gen.py, link_validator.py -> linked_validator.py, link_questions/ -> linked_questions/, data_link_* example answer files -> data_linked_*
- Config: YAML keys, Pydantic field names, prompt paths
- Notebooks: autoq.ipynb, autoe.ipynb, retrieval_metrics.ipynb
- Typos config: add UMBRELA and ba to exception list
- Ruff fixes in changed files: ASYNC230, PTH123, W293, Q000, W291, SLOT000, NPY002, C420, C901 (extracted helper methods)
Run ruff format (--preview) and ruff check --fix (--preview) to resolve 338 pre-existing lint and formatting issues (456 -> 118 remaining). Remaining 118 errors are non-auto-fixable pre-existing issues (T201 print statements, PTH123 open() calls, G004 logging f-strings, etc.).
Resolve all 52 remaining ruff check --preview errors:
- G004: Convert logging f-strings to %s formatting (10 fixes)
- PTH123/ASYNC230: Replace open() with Path.open()/read_text/write_text (12 fixes)
- ERA001: Remove commented-out code in autoe.ipynb (8 fixes)
- FBT003: Use keyword arguments for boolean Fields (3 fixes)
- BLE001: Replace blind except with specific exceptions (2 fixes)
- FURB113: Replace consecutive append() calls with extend() (2 fixes)
- RUF027: Add noqa for intentional template strings (2 fixes)
- TRY003/EM102: Extract exception message to a variable (1 fix)
- RUF022: Sort __all__ alphabetically (1 fix)
- FURB148: Remove unnecessary enumerate (1 fix)
- FURB189: Add noqa for intentional str subclass (1 fix)
- RET504: Remove unnecessary assignment before return (1 fix)
- F841: Remove unused variable (1 fix)
- NPY002: Add noqa for legacy np.random usage (1 fix)
- ARG001: Add noqa for interface parameter (1 fix)
- F821: Add noqa for cross-cell notebook reference (1 fix)
- CPY001: Add copyright header (1 fix)
- T201: Suppress for stats/reporting modules in ruff.toml (66 fixes)

All ruff checks now pass (0 errors).
…large datasets
- Replace per-element scipy cosine calls with vectorized numpy matrix ops in get_semantic_neighbors and compute_similarity_to_references
- Use batch matrix multiply for all rep-to-corpus distances in KmeansTextSampler instead of per-rep Python loops
- Switch to MiniBatchKMeans for datasets >20K samples (10K batch size)
- Use np.argpartition (O(N)) instead of sorted() (O(N log N))
- Add 25 tests verifying vectorized results match original scipy output
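The vectorization pattern described above can be illustrated as follows; the function name is hypothetical and the PR's actual signatures differ:

```python
import numpy as np

def top_k_semantic_neighbors(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most cosine-similar corpus rows to the query.

    One matrix multiply replaces per-element scipy cosine calls, and
    np.argpartition (O(N)) replaces a full sort (O(N log N)).
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                               # similarity to every corpus row
    top = np.argpartition(-sims, k - 1)[:k]    # unordered top-k in linear time
    return top[np.argsort(-sims[top])]         # sort only the k winners
```

Normalizing rows once up front turns every cosine computation into a dot product, which is what makes the single matrix multiply equivalent to the original per-element scipy calls.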
- Add temporal_question_system_prompt field to DataLinkedPromptConfig
- Pass prompt_config to DataLinkedQuestionGen instead of ignoring it
- Use config-driven prompt templates in linked_question_gen.py
Summary
This PR adds several major features to the benchmark-qed evaluation framework: a retrieval metrics evaluation pipeline, hierarchical assertion scoring with multi-RAG comparison support, data-linked cross-document question generation, and improvements to the existing question generation pipeline.
Major Features
Retrieval Metrics Evaluation Pipeline
Retrieval metrics measure how well a RAG system's retrieval step finds the right source documents. This feature adds an end-to-end pipeline to evaluate retrieval quality by comparing retrieved text units against curated reference sets, computing standard IR metrics (precision, recall, fidelity).
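For a single query, the set-based core of these metrics reduces to comparing the retrieved text units with the curated reference set. Fidelity is specific to this framework, so only the standard precision/recall pair is sketched here, with an illustrative function rather than the PR's API:

```python
def retrieval_precision_recall(retrieved: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision: fraction of retrieved units that are in the reference set.
    Recall: fraction of the reference set that was retrieved."""
    hits = len(retrieved & reference)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

Per-query values like these are then aggregated across the query set by the pipeline's metric calculations.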
- retrieval-scores CLI command with configurable settings (TextUnitFieldsConfig)

Hierarchical Assertion Scoring with Multi-RAG Comparison
Assertion-based evaluation scores RAG answers against factual claims (assertions) derived from source documents. This feature extends the existing assertion scoring to support hierarchical assertions — global claims that are backed by supporting sub-claims — and enables side-by-side comparison of multiple RAG methods in a single evaluation run with statistical significance testing.
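One detail noted in the commits above is that supporting (child) results are not deduplicated when computing the supporting pass rate, because the same child assertion can receive different verdicts under different global assertions. A minimal sketch with an assumed global-to-child-results mapping:

```python
def supporting_pass_rate(globals_to_children: dict[str, list[bool]]) -> float:
    """Pass rate over all supporting-assertion results, without deduplication:
    every occurrence of a child assertion counts, even repeats across globals."""
    results = [r for children in globals_to_children.values() for r in children]
    return sum(results) / len(results) if results else 0.0
```

Deduplicating here would silently drop one of two conflicting verdicts for the same child, biasing the rate toward whichever copy survived.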
- aggregation.py module

Data-Linked Question Generation
Data-linked questions test a RAG system's ability to synthesize information across multiple source documents, rather than answering from a single document. This feature adds generation of cross-document questions with four relationship types.
Question Generation Improvements
- min_questions_in_context config option to filter low-question categories

Other Changes
Code Quality
Configuration & CLI
- cluster_match_by config option for retrieval metrics (text/id/short_id)

Datasets
- generated_questions renamed to generated_questions_v1
- generated_questions_v2 added for both AP_News and podcast with assertion files for data_global, data_local, and data_linked question types

Documentation