
feat: Add retrieval metrics, hierarchical assertion scoring, and data-linked questions #52

Open

ha2trinh wants to merge 35 commits into main from feat/retrieval-metrics
Conversation

@ha2trinh
Contributor

Summary

This PR adds several major features to the benchmark-qed evaluation framework: a retrieval metrics evaluation pipeline, hierarchical assertion scoring with multi-RAG comparison support, data-linked cross-document question generation, and improvements to the existing question generation pipeline.

Major Features

Retrieval Metrics Evaluation Pipeline

Retrieval metrics measure how well a RAG system's retrieval step finds the right source documents. This feature adds an end-to-end pipeline that evaluates retrieval quality by comparing retrieved text units against curated reference sets and computing standard IR metrics (precision, recall, fidelity).

  • New retrieval-scores CLI command with configurable settings
  • Support for multiple input formats: parquet, csv, json, jsonl
  • Flexible column name mapping via TextUnitFieldsConfig
  • Relevance assessor types: rationale-based and bing-style raters
  • Built-in caching to avoid redundant LLM assessments
  • Significance testing utilities for comparing retrieval methods
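As a rough illustration, the core set-based metrics can be sketched as follows. This is a minimal sketch with hypothetical function names, not the actual benchmark-qed API; the real pipeline also scores relevance with an LLM assessor rather than exact set membership.

```python
# Illustrative sketch of set-based retrieval metrics over text-unit IDs.
# Function names are hypothetical, not the benchmark-qed API.
def retrieval_precision(retrieved: set[str], reference: set[str]) -> float:
    """Fraction of retrieved text units that appear in the reference set."""
    if not retrieved:
        return 0.0
    return len(retrieved & reference) / len(retrieved)

def retrieval_recall(retrieved: set[str], reference: set[str]) -> float:
    """Fraction of reference text units that were retrieved."""
    if not reference:
        return 0.0
    return len(retrieved & reference) / len(reference)

retrieved = {"tu-1", "tu-2", "tu-3", "tu-9"}
reference = {"tu-1", "tu-3", "tu-5"}
print(retrieval_precision(retrieved, reference))  # 2 of 4 retrieved are relevant -> 0.5
print(retrieval_recall(retrieved, reference))     # 2 of 3 reference units found
```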

Hierarchical Assertion Scoring with Multi-RAG Comparison

Assertion-based evaluation scores RAG answers against factual claims (assertions) derived from source documents. This feature extends the existing assertion scoring to support hierarchical assertions — global claims that are backed by supporting sub-claims — and enables side-by-side comparison of multiple RAG methods in a single evaluation run with statistical significance testing.

  • Hierarchical evaluation pipeline: global assertions with supporting sub-assertions, evaluated in staged mode (global first, then supporting for passed globals) or joint mode (single LLM call)
  • Multi-RAG comparison: evaluate multiple RAG methods against the same assertions with automatic statistical analysis
  • Significance testing: repeated measures tests (Friedman/Wilcoxon), clustered permutation tests (accounting for within-question correlation), and combined p-value summaries
  • Shared metric aggregation functions extracted into a reusable aggregation.py module
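A minimal sketch of staged mode, assuming hypothetical names throughout; in the real pipeline `judge` is an LLM call, not a plain callable:

```python
from dataclasses import dataclass, field

@dataclass
class Assertion:
    """Illustrative stand-in for a hierarchical assertion node."""
    text: str
    supporting: list["Assertion"] = field(default_factory=list)

def staged_evaluate(answer: str, global_assertions: list["Assertion"], judge) -> list[dict]:
    """Staged mode: judge each global assertion first; only evaluate its
    supporting sub-assertions when the global passes. `judge(answer, text)`
    stands in for the LLM assessment and returns a bool."""
    results = []
    for g in global_assertions:
        passed = judge(answer, g.text)
        supporting = [judge(answer, s.text) for s in g.supporting] if passed else []
        results.append({
            "assertion": g.text,
            "passed": passed,
            "supporting_passed": supporting,
        })
    return results
```

Joint mode would instead send the global assertion and its sub-assertions in a single call; staged mode saves calls by skipping supporting checks for failed globals.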

Data-Linked Question Generation

Data-linked questions test a RAG system's ability to synthesize information across multiple source documents, rather than answering from a single document. This feature adds generation of cross-document questions with four relationship types.

  • Bridge questions: connecting facts across two documents
  • Comparison questions: comparing entities/events across documents
  • Intersection questions: finding common themes across documents
  • Temporal questions: reasoning about time-ordered events across documents
  • Batch validation with entity relevance checking
  • Dedicated validators for linked and global question types

Question Generation Improvements

  • MMR (Maximal Marginal Relevance) sampler for better question deduplication
  • Improved entity extraction prompts for higher quality context
  • Weighted sampling for data_global categories (log-weighted by question count)
  • New min_questions_in_context config option to filter low-question categories
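The MMR idea trades relevance against redundancy with the already-selected set. A minimal sketch over L2-normalized embeddings; the function signature and `lam` parameter are illustrative, not the sampler's actual interface:

```python
import numpy as np

def mmr_select(emb: np.ndarray, query: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedy Maximal Marginal Relevance: repeatedly pick the candidate
    maximizing lam * relevance - (1 - lam) * redundancy, where redundancy
    is the max cosine similarity to anything already selected.
    Assumes rows of `emb` and `query` are L2-normalized."""
    rel = emb @ query                       # cosine relevance to the query/centroid
    selected: list[int] = []
    candidates = list(range(len(emb)))
    while candidates and len(selected) < k:
        if selected:
            red = (emb[candidates] @ emb[selected].T).max(axis=1)
        else:
            red = np.zeros(len(candidates))
        scores = lam * rel[candidates] - (1 - lam) * red
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low `lam`, near-duplicate questions score poorly once one copy is selected, which is what makes MMR effective for deduplication.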

Other Changes

Code Quality

  • Fix all pre-existing ruff lint and formatting errors across the codebase (0 remaining)
  • Add type annotations for full pyright typecheck and verifytypes compliance (100% type completeness score)

Configuration & CLI

  • Add cluster_match_by config option for retrieval metrics (text/id/short_id)
  • Add configurable reference filename and flexible reference format parsing

Datasets

  • Rename AP_News generated_questions to generated_questions_v1
  • Add generated_questions_v2 for both AP_News and podcast with assertion files for data_global, data_local, and data_linked question types

Documentation

  • Update example notebooks: autoe.ipynb, autoq.ipynb, retrieval_metrics.ipynb
  • Update CLI docs for autoe and autoq

- Add TextUnitFieldsConfig for flexible column name mapping (id, text, embedding, short_id)
- Support multiple file formats: parquet, csv, json, jsonl
- Add assessor_type config to choose between 'rationale' and 'bing' raters
- Add significance testing utilities (stats.py)
- Update generate-retrieval-reference CLI with embedding generation support
- Add cluster_match_by config option for retrieval metrics (text/id/short_id)
- Fix cluster loading to support both text_units and text_unit_ids formats
- Fix cluster.id attribute access in fidelity.py
- Add entity question generation with bridge/comparison/intersection types
- Add batch validation with entity relevance check
- Use MMR sampling for better deduplication
- Add match_by param to calculate_single_query_fidelity and calculate_fidelity
- Add match_by param to calculate_single_query_recall and calculate_recall
- Add cluster_match_by param to calculate_retrieval_metrics and extract_per_query_metrics
- Pass cluster_match_by through run_retrieval_evaluation to all metric calculations
…icance tests

- Save map step assertions for global questions (map_assertions.json)
- Track supporting_assertions (child assertions) for each global assertion
- Save assertion sources to separate files (assertion_sources.json, map_assertion_sources.json)
- Add paired=True parameter for repeated measures statistical design
- Update significance tests to use Friedman/Wilcoxon for repeated measures
- Use paired tests in assertion and retrieval scoring comparisons
Friedman test requires at least 3 conditions/groups. For exactly 2 groups,
we now skip the omnibus test and directly perform a paired comparison:
- Normal data: Paired t-test
- Non-normal data: Wilcoxon signed-rank test

Added _compare_two_groups_paired() helper function that:
- Checks normality of the paired differences
- Runs appropriate paired test
- Returns a GroupComparisonResult with consistent structure
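A sketch of that helper's logic, assuming two per-question metric arrays as input; the function name and returned dict are illustrative, not the library's `GroupComparisonResult`:

```python
import numpy as np
from scipy import stats

def compare_two_groups_paired(a, b, alpha: float = 0.05) -> dict:
    """For exactly two conditions: check normality of the paired
    differences, then run a paired t-test (normal) or Wilcoxon
    signed-rank test (non-normal)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diffs = a - b
    # Shapiro-Wilk normality check on the paired differences
    normal = stats.shapiro(diffs).pvalue > alpha
    if normal:
        name, res = "paired t-test", stats.ttest_rel(a, b)
    else:
        name, res = "wilcoxon signed-rank", stats.wilcoxon(a, b)
    return {"test": name, "statistic": float(res.statistic), "pvalue": float(res.pvalue)}
```

Note that `scipy.stats.wilcoxon` rejects all-zero differences, so identical score vectors need separate handling in practice.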
Link Questions (renamed from entity_questions):

- Rename entity_questions to link_questions throughout codebase
- Add temporal question type with dedicated system prompt
- Reorganize batch_validation_prompt.txt with clearer structure
- Add question validators for link and global questions

AutoE Module Refactoring:

- Split assertion_scores into assertion/ submodule (hierarchical, standard, aggregation)
- Move pairwise_scores, reference_scores, retrieval_scores to submodules
- Add hierarchical assertion scoring prompts

Other Changes:

- Add global questions batch validation prompt
- Add assertion generation stats tracking
- Update CLI and config for new structure
- Add run_hierarchical_assertion_evaluation pipeline function for multiple RAGs
- Add MultiRAGHierarchicalAssertionConfig for CLI multi-RAG mode
- Fix metric consistency: use per-question averaging throughout
- Fix supporting_pass_rate to not deduplicate (same assertion can have different results under different globals)
- Update CLI to auto-detect single vs multi-RAG config format
- Update autoe.ipynb with hierarchical assertion examples
- Update docs/cli/autoe.md with multi-RAG hierarchical config examples
- Rename retrieval_scores to retrieval_metrics in imports
- Add clustered permutation test as optional secondary analysis for
  assertion-level significance testing (accounts for within-question
  correlation by permuting labels at the question/cluster level)
- Add summarize_significance_results() to produce a combined summary
  table across all metrics and test types
- Remove *_passed conditional metrics from significance tests (keep
  as descriptive stats only) to avoid selection bias from conditioning
  on post-treatment variable
- Add per-question avg global pass rate print in Step 1 for consistency
- Update docs/cli/autoe.md with clustered permutation config and output
- Update autoe.ipynb notebook with clustered permutation examples
- Fix KeyError: use 'support_level' instead of 'support_coverage' in
  hierarchical mode (column name mismatch)
- Save summary CSVs (by_question, by_assertion) in single-RAG standard mode
- Save eval_summary.json in both standard and hierarchical CLI modes
- Add avg_question_pass_rate metric (per-question pass rate averaged)
- Extract summarize_standard_scores() and compute_hierarchical_eval_summary()
  into aggregation.py as shared functions
- Refactor CLI and multi-RAG pipeline to use shared functions, eliminating
  code duplication and ensuring consistent metric computation
- Update module exports in assertion/__init__.py and autoe/__init__.py
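The clustered permutation idea above — permuting method labels at the question level so that correlated assertion scores within a question move together — can be sketched as follows. This is an illustrative sketch, not the library's implementation; inputs are per-question arrays of assertion-level scores for two methods:

```python
import numpy as np

def clustered_permutation_test(a_clusters, b_clusters, n_perm: int = 2000, seed: int = 0) -> float:
    """Permutation test at the question (cluster) level: each question's
    scores for methods A and B are swapped as a unit, preserving the
    within-question correlation that a naive assertion-level shuffle
    would destroy."""
    rng = np.random.default_rng(seed)

    def pooled_mean(clusters):
        return float(np.concatenate(clusters).mean())

    observed = abs(pooled_mean(a_clusters) - pooled_mean(b_clusters))
    hits = 0
    for _ in range(n_perm):
        pa, pb = [], []
        for qa, qb in zip(a_clusters, b_clusters):
            if rng.random() < 0.5:   # one coin flip per question, not per assertion
                pa.append(qb)
                pb.append(qa)
            else:
                pa.append(qa)
                pb.append(qb)
        if abs(pooled_mean(pa) - pooled_mean(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one smoothing avoids p = 0
```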
Rename all identifiers, file names, directory names, config keys,
enum values, and documentation references from 'data_link'/'link' to
'data_linked'/'linked' for grammatical consistency.

Changes include:
- Python source: classes, functions, variables, constants, imports
  (e.g., DataLinkQuestionGen -> DataLinkedQuestionGen,
  DataLinkConfig -> DataLinkedConfig, QuestionType.DATA_LINK -> DATA_LINKED)
- File/directory renames: link_question_gen.py -> linked_question_gen.py,
  link_validator.py -> linked_validator.py,
  link_questions/ -> linked_questions/,
  data_link_* example answer files -> data_linked_*
- Config: YAML keys, Pydantic field names, prompt paths
- Notebooks: autoq.ipynb, autoe.ipynb, retrieval_metrics.ipynb
- Typos config: add UMBRELA and ba to exception list
- Ruff fixes in changed files: ASYNC230, PTH123, W293, Q000, W291,
  SLOT000, NPY002, C420, C901 (extracted helper methods)
Run ruff format (--preview) and ruff check --fix (--preview) to resolve
338 pre-existing lint and formatting issues (456 -> 118 remaining).

Remaining 118 errors are non-auto-fixable pre-existing issues (T201 print
statements, PTH123 open() calls, G004 logging f-strings, etc.).
Resolve all 52 remaining ruff check --preview errors:
- G004: Convert logging f-strings to %s formatting (10 fixes)
- PTH123/ASYNC230: Replace open() with Path.open()/read_text/write_text (12 fixes)
- ERA001: Remove commented-out code in autoe.ipynb (8 fixes)
- FBT003: Use keyword arguments for boolean Fields (3 fixes)
- BLE001: Replace blind except with specific exceptions (2 fixes)
- FURB113: Replace consecutive append() with extend() (2 fixes)
- RUF027: Add noqa for intentional template strings (2 fixes)
- TRY003/EM102: Extract exception message to variable (1 fix)
- RUF022: Sort __all__ alphabetically (1 fix)
- FURB148: Remove unnecessary enumerate (1 fix)
- FURB189: Add noqa for intentional str subclass (1 fix)
- RET504: Remove unnecessary assignment before return (1 fix)
- F841: Remove unused variable (1 fix)
- NPY002: Add noqa for legacy np.random usage (1 fix)
- ARG001: Add noqa for interface parameter (1 fix)
- F821: Add noqa for cross-cell notebook reference (1 fix)
- CPY001: Add copyright header (1 fix)
- T201: Suppress for stats/reporting modules in ruff.toml (66 fixes)

All ruff checks now pass (0 errors).
…large datasets

- Replace per-element scipy cosine calls with vectorized numpy matrix ops
  in get_semantic_neighbors and compute_similarity_to_references
- Use batch matrix multiply for all rep-to-corpus distances in
  KmeansTextSampler instead of per-rep Python loops
- Switch to MiniBatchKMeans for datasets >20K samples (10K batch size)
- Use np.argpartition (O(N)) instead of sorted() (O(N log N))
- Add 25 tests verifying vectorized results match original scipy output
- Add temporal_question_system_prompt field to DataLinkedPromptConfig
- Pass prompt_config to DataLinkedQuestionGen instead of ignoring it
- Use config-driven prompt templates in linked_question_gen.py
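The vectorized neighbor lookup described earlier — one matrix multiply for all similarities, then a linear-time partial sort — can be sketched as follows (illustrative helper, not the sampler's actual API):

```python
import numpy as np

def top_k_neighbors(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Top-k cosine neighbors over L2-normalized embeddings: a single
    matrix multiply replaces per-element scipy cosine calls, and
    np.argpartition (O(N)) replaces a full sort (O(N log N))."""
    sims = corpus @ query                    # all similarities in one pass
    top = np.argpartition(-sims, k)[:k]      # unordered top-k, linear time
    return top[np.argsort(-sims[top])]       # sort only the k winners
```

Since only the `k` winners are fully sorted, this is the standard trick for large-corpus nearest-neighbor lookups without an index.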
