AutoResearch: An Execution-Grounded Multi-Agent Framework for AI Paper and Code Generation
AutoResearch is a unified multi-agent framework that takes code, data, ideas, and papers as input, runs an execution-grounded research loop, and produces improved code, validated experiment results, and a structured research paper.
In short: Input → Think + Experiment + Learn → New Code + Better Results + Research Paper
AutoResearch is designed around a simple principle: generated research artifacts should be constrained by execution, verification, and iteration, not only by text generation.
The framework combines:
- planning and orchestration for multi-stage task control,
- code generation and self-healing for iterative experiment repair,
- retrieval and verification for grounded citations and supporting evidence,
- paper writing and review for structured manuscript generation,
- memory and meta-learning for cross-run improvement.
🔥 Demo Video: Watch on YouTube
AutoResearch is built to work with mixed research materials, including:
- GitHub repositories and existing codebases
- datasets for training and evaluation
- instructions / ideas such as "improve accuracy" or "try a new architecture"
- research papers for related work, baselines, and methods
- documents such as PDFs, DOCX files, notes, and logs
The architecture operates through a coordinated multi-agent loop (a minimal code sketch follows the list):
- Orchestrator controls the execution pipeline and stage transitions
- Planner / Query Engine parses tasks and decomposes objectives
- CodeAgent / Coding Agent generates and repairs code
- Experiment / Benchmark Agent runs experiments and compares against baselines
- FigureAgent generates charts and visual summaries
- Review Agents / Critic Agent critique outputs and refine results
- Research Agent retrieves literature and structured evidence
- Verification Engine checks references, evidence alignment, and citation validity
- Memory System / Meta-Learning stores reusable lessons and failure-derived skills
- Paper Writer Agent generates the final structured manuscript
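As a rough illustration of this division of labor, the sketch below wires stage names to agent callables through a registry. This is a minimal sketch under stated assumptions: the `AgentRegistry` and `run_pipeline` names are hypothetical and do not reflect the actual contents of `agents/registry.py` or `orchestrator.py`.

```python
# Hypothetical sketch of staged agent dispatch; not AutoResearch's real API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = dict  # shared context passed between agents


@dataclass
class AgentRegistry:
    """Maps stage names to agent callables."""
    agents: Dict[str, Callable[[State], State]] = field(default_factory=dict)

    def register(self, stage: str, agent: Callable[[State], State]) -> None:
        self.agents[stage] = agent


def run_pipeline(registry: AgentRegistry, stages: List[str], state: State) -> State:
    """Run each stage's agent in order, threading the shared state through."""
    for stage in stages:
        state = registry.agents[stage](state)
    return state
```

In the repository itself, the unified registry lives in `agents/registry.py` and stage control in `orchestrator.py` (see the layout below); the flat loop here omits the Orchestrator's handling of stage transitions and failures.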
The framework produces:
- improved code and repository updates
- experiment results, metrics, plots, and benchmark comparisons
- structured research paper drafts
- publication assets such as figures, tables, and BibTeX references
- reusable learned knowledge stored for future runs
At the center of AutoResearch is a self-healing loop:
Generate → Run → Detect Errors → Fix → Re-run
Instead of treating code generation as a one-shot activity, the framework uses execution feedback as a first-class signal for correction and refinement.
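The sketch below shows one minimal way such a loop can be structured, assuming a toy subprocess sandbox. The names `run_in_sandbox`, `generate`, and `repair` are stand-ins for the real sandbox and CodeAgent; none of them are AutoResearch's actual API.

```python
# Hypothetical sketch of the Generate -> Run -> Detect -> Fix -> Re-run loop.
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    ok: bool
    stderr: str


def run_in_sandbox(code: str, timeout: int = 60) -> RunResult:
    """Execute candidate code in a subprocess and capture any failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True,
                          text=True, timeout=timeout)
    return RunResult(ok=proc.returncode == 0, stderr=proc.stderr)


def self_healing_run(task: str,
                     generate: Callable[[str], str],
                     repair: Callable[[str, str], str],
                     max_attempts: int = 5) -> str:
    """Iterate until the generated code executes cleanly or attempts run out."""
    code = generate(task)                      # first draft from the coding agent
    for _ in range(max_attempts):
        result = run_in_sandbox(code)
        if result.ok:
            return code                        # execution succeeded
        code = repair(code, result.stderr)     # the traceback drives the fix
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

The repository's `tools/` directory houses the real sandbox and executor; the toy version here only captures the exit code and stderr.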
AutoResearch distributes responsibilities across specialized agents rather than forcing one model to do everything. This improves modularity, transparency, and controllability.
The framework combines structured retrieval with citation verification to reduce unsupported references and weak literature grounding.
The paper-writing pipeline is grounded in executed experiments, benchmark outputs, and verified references, allowing the system to produce a structured manuscript rather than an ungrounded draft.
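To illustrate the kind of check a verification engine can run, here is a hedged sketch that flags citation keys in a draft with no verified reference behind them. The function name and data shapes are assumptions for illustration, not AutoResearch's actual interface.

```python
# Hypothetical sketch of a citation-validity check; not AutoResearch's API.
import re


def unverified_citations(manuscript: str, verified_keys: set[str]) -> set[str]:
    """Return LaTeX \\cite{...} keys that lack a verified reference."""
    cited: set[str] = set()
    for group in re.findall(r"\\cite\w*\{([^}]+)\}", manuscript):
        cited.update(key.strip() for key in group.split(","))
    return cited - verified_keys


# Example: flags 'smith2020' if only 'lee2019' was verified.
draft = r"Prior work \cite{lee2019, smith2020} reports similar gains."
print(unverified_citations(draft, {"lee2019"}))  # -> {'smith2020'}
```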
A typical run looks like this (the same sequence is expressed as code after the diagram):
```
Input: code + data + idea + papers
          ↓
Orchestrator / Query Engine parses the task
          ↓
Planner creates an execution strategy
          ↓
CodeAgent generates or edits code
          ↓
Execution sandbox runs the code
          ↓
If failure: detect error → repair → re-run
          ↓
Benchmark / Experiment agents evaluate results
          ↓
Retrieval + verification ground claims and citations
          ↓
Paper Writer assembles the research artifact
          ↓
Output: improved code + better results + research paper
```
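Under the same assumptions as the registry sketch above, this run could be expressed as an ordered stage list; the stage names are illustrative, not the framework's actual identifiers.

```python
# Illustrative stage sequence for the hypothetical run_pipeline sketch above.
TYPICAL_RUN = [
    "parse_task",    # Orchestrator / Query Engine
    "plan",          # Planner
    "code",          # CodeAgent + self-healing loop
    "evaluate",      # Benchmark / Experiment agents
    "verify",        # Retrieval + Verification Engine
    "write_paper",   # Paper Writer
]
```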
```
AutoResearch/
├── main.py              → app + routes + entry point
├── orchestrator.py      → pipeline engine and stage control
├── llm.py               → model provider abstraction
│
├── agents/
│   ├── engineering.py   → planner, coder, tester, debugger, critic
│   ├── research.py      → researcher, experiment, paper writer
│   ├── conception.py    → ideation and concept shaping agents
│   ├── paper.py         → outline, section, citation, figure, reviewer
│   ├── experiment.py    → planning, codegen, runner, tracker, evaluator
│   ├── decision.py      → proceed / refine / pivot control
│   ├── memory.py        → cross-run knowledge and state reuse
│   ├── gan.py           → adversarial generate-evaluate loop
│   ├── hooks.py         → session lifecycle events
│   ├── context_modes.py → dev / research / review switching
│   └── registry.py      → unified agent registry
│
├── tools/               → sandbox, file reader, executor, output manager
├── skills/              → knowledge files and workflow skills
├── research/            → retrieval, pipeline, templates, HITL, assessor
├── static/              → web GUI
├── tests/
└── eval/
```
```bash
bash deploy.sh
source .venv/bin/activate
python main.py
```

Then open: http://localhost:8000
- Input: GitHub repo + dataset + idea + related papers
- System: think + experiment + learn
- Output: new code + better results + research paper
| Endpoint | What it does |
|---|---|
| `POST /api/agent/run` | Run a task synchronously |
| `POST /api/agent/stream` | Run with SSE streaming |
| `POST /api/upload` | Upload PDF / DOCX / CSV / JSON assets |
| `POST /api/conception/ideate` | Run the ideation pipeline |
| `POST /api/experiment/run` | Run the experiment pipeline |
| `POST /api/paper/write` | Run the paper-generation pipeline |
| `POST /api/gan/run` | Run the adversarial generate-evaluate loop |
| `GET /api/skills/agents` | List available skill agents |
| `GET /api/skills/rules/{lang}` | Return language-specific rules |
| `GET /api/outputs` | Browse saved outputs |
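As a quick illustration, a synchronous run could be triggered from Python as below. The payload and response fields are assumptions based on the endpoint names above, not a documented schema.

```python
# Hypothetical client call against the local server; payload shape assumed.
import requests

resp = requests.post(
    "http://localhost:8000/api/agent/run",
    json={"task": "improve accuracy on the uploaded dataset"},
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```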
AutoResearch is intended for workflows where a user wants more than code generation alone. It is built for end-to-end research automation scenarios such as:
- improving an existing repository,
- testing new ideas against baselines,
- generating figures and experiment summaries,
- grounding claims in retrieved literature,
- drafting a research manuscript from executed results.
If you use this project, please cite:
```bibtex
@misc{kumar2026autoresearch,
  title={AutoResearch: An Execution-Grounded Multi-Agent Framework for AI Paper and Code Generation},
  author={Rajesh Kumar and Waqar Ali and Junaid Ahmed and Abdullah Aman Khan and Shaoning Zeng and Yong Tang},
  year={2026},
  note={arXiv preprint, update identifier when public}
}
```

- The diagrams in `assets/images/` summarize the architecture, execution loop, and output flow.
- The framework is designed around execution-grounded validation, not text-only generation.
- Retrieval, verification, and paper generation are treated as first-class components of the system.