This project uses a local Ollama server running Qwen2.5-3B-Instruct to:
- generate general hallucination prompts
- generate medical hallucination prompts
- merge them into a balanced dataset
- run a multi-agent OVON hallucination pipeline
- analyze the resulting scores and plots
`original.py` is a script adaptation of Gosmar and Dahl's OVON-style hallucination mitigation work:
```bibtex
@misc{gosmar,
  title={Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks},
  author={Diego Gosmar and Deborah A. Dahl},
  year={2025},
  eprint={2501.13946},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.13946},
}
```

The medical-retrieval augmentation is provided by a local MedRAG pipeline built on the MedRAG corpus introduced by Xiong et al.:
```bibtex
@article{xiong2024benchmarking,
  title={Benchmarking Retrieval-Augmented Generation for Medicine},
  author={Xiong, Guangzhi and Jin, Qiao and Lu, Zhiyong and Zhang, Aidong},
  year={2024},
  journal={Findings of the Association for Computational Linguistics: ACL 2024}
}
```

Relative to the original OVON / RAG papers, this repository makes several implementation-oriented changes:
- All OpenAI GPT model calls were rewritten to use a shared local base model, currently `qwen2.5:3b-instruct`.
- Parallelization was added to the relevant scripts.
- The common model/backend settings were centralized into shared config and helper modules.
- Medical retrieval is performed fully locally via LlamaIndex + FAISS over the MedRAG corpus, replacing any web-search or OpenAI Assistants/Vector-Stores dependency; the pipeline runs end-to-end against local Ollama with no external API keys.
- The `SecondLevelReviewer` flow was extended so that MedRAG-routed failures use a dedicated MedRAG-aware reviewer prompt and pass retrieval-failure evidence into the fallback review step. Some generated hallucinatory prompts contain no factual key terms to retrieve on, but MedRAG finding no relevant passages is still treated as important evidence of hallucination.
- KPI scoring now uses a simpler 1-10 rubric for `factuality` and `helpfulness` for each agent, with `THS` defined as the average of those two scores rather than Gosmar and Dahl's formula (a sketch follows this list).
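A minimal sketch of that THS computation (the dictionary keys here are illustrative assumptions, not necessarily the scripts' exact field names):

```python
# Sketch of the simplified THS rubric described above: THS is the plain
# average of the two 1-10 scores. Key names are illustrative assumptions.
def ths(scores: dict) -> float:
    factuality = scores["factuality"]    # 1-10 rubric score
    helpfulness = scores["helpfulness"]  # 1-10 rubric score
    return (factuality + helpfulness) / 2

print(ths({"factuality": 8, "helpfulness": 6}))  # -> 7.0
```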
Install Ollama on macOS:

```bash
brew install ollama
```

Start the local Ollama server:

```bash
ollama serve
```

In a second terminal, pull the model:

```bash
ollama pull qwen2.5:3b-instruct
```

You can verify the model is available with:

```bash
ollama list
```

Run all other scripts in a separate terminal.
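The pipeline scripts route all model calls through their shared config/helper modules, but as a quick smoke test you can talk to the model directly over Ollama's OpenAI-compatible endpoint (the `base_url` and placeholder `api_key` below are Ollama conventions, not project settings):

```python
# Quick smoke test against the local model via Ollama's OpenAI-compatible API.
# base_url/api_key follow Ollama's documented conventions; the project itself
# reads these settings from its shared config modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:3b-instruct",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```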
`medrag_original.py` is an adaptation of `original.py` that adds a `MedicalClassifier` before the second stage and routes medical prompts through a local MedRAG retriever implemented in `medrag.py`.
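Conceptually, the added routing behaves like the sketch below; every helper here is a hypothetical stand-in for the real agent stages, and only the ordering (classify, then retrieve, then fall back) reflects the description above:

```python
# Conceptual sketch of medrag_original.py's routing; all helpers are
# hypothetical stand-ins, only the stage ordering mirrors the description.
def classify_is_medical(prompt: str) -> bool:
    """Stand-in for the MedicalClassifier agent call."""
    return "dose" in prompt.lower()

def retrieve_passages(prompt: str) -> list[str]:
    """Stand-in for thresholded local MedRAG retrieval."""
    return []  # pretend nothing cleared the similarity threshold

def route(prompt: str) -> str:
    if classify_is_medical(prompt):
        if not (passages := retrieve_passages(prompt)):
            # No passages above threshold: hallucination evidence, so the
            # prompt goes through the MedRAG-aware fallback reviewer.
            return "SecondLevelReviewer_MedRAG fallback"
        return f"answer grounded in {len(passages)} retrieved passages"
    return "general (non-medical) path"

print(route("What is the maximum daily dose of acetaminophen?"))
```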
From the project root:

```bash
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
```

The local RAG pipeline retrieves passages using `nomic-embed-text` served by Ollama:

```bash
ollama pull nomic-embed-text
```

Download the MedRAG textbooks subset, embed it with Ollama, and persist a FAISS index to `data/medrag_index/`:

```bash
python3 build_medrag_index.py --subsets textbooks
```

Notes:
- The builder streams the dataset and is resumable: if you Ctrl-C mid-build, re-run the same command to continue from the last checkpoint (the pattern is sketched after these notes).
- `textbooks` alone is ~126k passages and usually completes in about 20–40 minutes on an M-series Mac with Ollama. The vectors weigh in at around 400 MB.
- Add `statpearls` for broader recall (`--subsets textbooks statpearls`, ~1 hr total). `wikipedia` and `pubmed` are orders of magnitude larger and not recommended for a laptop.
- Pass `--persist-dir /some/other/path` to put the index elsewhere; `medrag_original.py` accepts `--medrag-index-dir` to match.
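A minimal sketch of the resume pattern noted above; this is not `build_medrag_index.py`'s actual code, and the checkpoint path and format are assumptions:

```python
# Resume-on-restart sketch: persist a cursor alongside the index so an
# interrupted build can skip already-embedded passages. Path/format assumed.
import json
from pathlib import Path

CKPT = Path("data/medrag_index/checkpoint.json")

def load_cursor() -> int:
    return json.loads(CKPT.read_text())["next_row"] if CKPT.exists() else 0

def save_cursor(next_row: int) -> None:
    CKPT.parent.mkdir(parents=True, exist_ok=True)
    CKPT.write_text(json.dumps({"next_row": next_row}))

def stream_passages():
    """Stand-in for streaming the MedRAG textbooks subset."""
    yield from (f"passage {i}" for i in range(5000))

start = load_cursor()
for i, passage in enumerate(stream_passages()):
    if i < start:
        continue                  # already embedded in an earlier run
    # ... embed `passage` with Ollama and add the vector to FAISS here ...
    if (i + 1) % 1000 == 0:
        save_cursor(i + 1)        # Ctrl-C after this point resumes here
```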
Important implementation choices in the local MedRAG pipeline:

- Nomic prefixes + L2 normalization. `medrag_embed.NomicOllamaEmbedding` subclasses LlamaIndex's `OllamaEmbedding` to transparently prepend `search_document:` to corpus chunks and `search_query:` to user queries, as required by `nomic-embed-text` for best retrieval quality. Outputs are L2-normalized so FAISS `IndexFlatIP` yields exact cosine similarity in `[-1, 1]`, rather than using `IndexFlatL2` with a distance-to-similarity conversion. (A sketch of this scheme follows the list.)
- Cosine similarity threshold. `MedRAG.min_score` defaults to `0.15`. Retrieved passages below this threshold are discarded, which surfaces genuinely off-topic medical prompts (a hallucination signal) as `medrag_status="no_chunks"`. Tune this per your evaluation set.
- Thread-safe index sharing. `medrag_original.py` constructs the `MedRAG` client once in `main()` and passes it into the `ThreadPoolExecutor` workers. `IndexFlatIP` reads, the embedder, and the `OpenAI` client are all thread-safe, so the FAISS index is loaded exactly once per process. The lazy-load path uses a lock to guard the first-access race.
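A self-contained sketch of the prefix + normalization scheme, using numpy/faiss directly instead of the project's LlamaIndex subclass (the `embed` function is a random stand-in for `nomic-embed-text`):

```python
# Sketch: Nomic prefixes + L2 normalization + IndexFlatIP = exact cosine search.
# embed() is a random stand-in for nomic-embed-text served by Ollama.
import faiss
import numpy as np

rng = np.random.default_rng(0)

def embed(texts: list[str]) -> np.ndarray:
    return rng.standard_normal((len(texts), 768)).astype("float32")

docs = ["Aspirin inhibits cyclooxygenase.", "Insulin lowers blood glucose."]
doc_vecs = embed([f"search_document: {d}" for d in docs])  # corpus prefix
faiss.normalize_L2(doc_vecs)                               # unit vectors

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine after norm
index.add(doc_vecs)

query = embed(["search_query: What does aspirin inhibit?"])  # query prefix
faiss.normalize_L2(query)

scores, ids = index.search(query, 2)
MIN_SCORE = 0.15  # MedRAG.min_score default
hits = [(docs[i], float(s)) for i, s in zip(ids[0], scores[0]) if s >= MIN_SCORE]
print(hits or "medrag_status='no_chunks'")
```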
In another terminal, run the general prompt generator from the project root:

```bash
python3 data/generate_general.py --workers 6
```

Default outputs:

- `data/general.jsonl`
- `data/general_summary.json`

Run the medical prompt generator:

```bash
python3 data/generate_medical.py --workers 6
```

Default outputs:

- `data/medical.jsonl`
- `data/medical_summary.json`
Note:

- Records with malformed JSON responses are still saved, with `raw_response` and `error_comment` fields.
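A minimal sketch of that tolerant-save pattern (the two field names match the note above; everything else is illustrative):

```python
# Tolerant-save sketch: keep malformed model output instead of dropping it.
import json

def to_record(prompt_id: int, raw_response: str) -> dict:
    """Parse a model response; on failure keep the raw text plus an error note."""
    try:
        record = json.loads(raw_response)
        record["prompt_id"] = prompt_id
        return record
    except json.JSONDecodeError as exc:
        return {
            "prompt_id": prompt_id,
            "raw_response": raw_response,        # preserved verbatim
            "error_comment": f"malformed JSON: {exc}",
        }

print(to_record(1, '{"prompt": "ok"}'))
print(to_record(2, "not json at all"))
```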
After both datasets are generated, run:

```bash
python3 preprocess_data.py 500
```

This means:

- total output size is `500`
- `250` prompts come from `general.jsonl`
- `250` prompts come from `medical.jsonl`
- within each domain, each of the 5 techniques contributes the same number of prompts
- any row with `error_comment` or a missing `prompt` is skipped

Default outputs:

- `data/data.jsonl`
- `data/data_summary.json`

Important constraint:

- `x` must be divisible by `10`, because the pipeline requires `x/2` prompts per domain and equal counts across the 5 techniques
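The arithmetic behind the constraint, as a quick sketch:

```python
# Quota arithmetic implied by the constraint above (x = 500 shown).
x = 500
assert x % 10 == 0, "x must be divisible by 10"
per_domain = x // 2               # 250 general + 250 medical
per_technique = per_domain // 5   # 50 prompts per technique in each domain
print(per_domain, per_technique)  # 250 50
```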
After creating `data/data.jsonl`, run the multi-agent pipeline:

```bash
python3 original.py --workers 6
```

To run the MedRAG-enabled variant instead (requires the FAISS index from the MedRAG Setup section):

```bash
python3 medrag_original.py --workers 6
```

Default outputs:

- `data/original_results.csv`
- `data/medrag_original_results.csv`
- `data/error.txt`
- logs can be found in the `data` folder
Notes:

- Prompt processing is parallelized across prompts, while the three agent stages remain sequential within each prompt.
- `SecondLevelReviewer` and `KPI_Evaluator` JSON failures do not stop the run; they are logged to `data/error.txt` as `original,<prompt_id>` or `medrag_original,<prompt_id>`.
- Both pipeline scripts support `--checkpoint-every N` (default `10`) to atomically rewrite a partial results CSV every `N` completed prompts (sketched after these notes).
- The MedRAG FAISS index is loaded once at process start and shared across worker threads (read-only), so per-prompt retrieval is a sub-millisecond FAISS lookup plus the usual Ollama embed/chat calls.
- Medical prompts that return no MedRAG passages above the similarity threshold are treated as a hallucination signal and routed through the `SecondLevelReviewer_MedRAG` fallback.
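A minimal sketch of the atomic checkpoint rewrite (write a temp file in the same directory, then `os.replace`); this is the standard pattern, not necessarily the scripts' exact code:

```python
# Atomic CSV checkpoint sketch: readers see either the old file or the new
# one, never a half-written file. Not the pipeline scripts' exact code.
import csv
import os
import tempfile

def checkpoint_csv(path: str, header: list[str], rows: list[dict]) -> None:
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, path)  # atomic rename on POSIX

checkpoint_csv("data/original_results.csv", ["prompt_id", "ths"],
               [{"prompt_id": 1, "ths": 7.0}])
```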
To rerun errored prompts listed in `data/error.txt` and patch the existing CSV rows in place:

```bash
python3 rerun_json_errors.py
```

The repair script reruns only the failing `prompt_id` values, rewrites only the matching rows in `data/original_results.csv` or `data/medrag_original_results.csv`, and then rewrites `data/error.txt` with only the failures that remain.
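A sketch of the in-place row patch by `prompt_id` (illustrative, not the repair script's actual code):

```python
# Illustrative sketch of patching CSV rows keyed by prompt_id.
import csv

def patch_rows(path: str, fixes: dict[str, dict]) -> None:
    """Replace rows whose prompt_id is in `fixes`; leave every other row as-is."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = [fixes.get(row["prompt_id"], row) for row in reader]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)

# e.g. patch_rows("data/original_results.csv", {"42": rerun_row_dict})
```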
After the pipeline finishes, generate the plots and stats:

```bash
python3 analyze_hallucinations.py --results-csv data/original_results.csv
python3 analyze_hallucinations.py --results-csv data/medrag_original_results.csv
```

Default outputs:

- `plots/original_results/all/...`
- `plots/original_results/medical/...`
- `plots/original_results/general/...`

Each subset folder contains:

- line and bar plots
- improvement plots
- `stats.txt`