This repository contains a set of tools for reinforcement learning with LLMs in verifiable environments.
WARNING: This repository should currently be viewed as in-progress research code and is not guaranteed to yield stable or optimal training results. Best results will likely be obtained on reasonable timescales with 7B+ models and at least 8 GPUs.
Note: If you don't need multi-turn tool calling or agentic interactions, you should probably just use TRL (or Unsloth/Axolotl) for GRPO. This is mostly a multi-turn LLM RL repo with some other bells and whistles.
PyPI coming soon, for now just do:

```bash
git clone https://github.com/willccbb/verifiers.git
cd verifiers
uv sync
uv pip install flash-attn --no-build-isolation
source .venv/bin/activate
```

Ensure your wandb and huggingface-cli logins are set up (or set `report_to=None` in `training_args`).
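Both logins can be set up from the shell with the standard CLIs, for example:

```bash
wandb login            # paste your W&B API key when prompted
huggingface-cli login  # paste a Hugging Face access token when prompted
```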
If you encounter version issues, please confirm that you are able to run basic TRL training in your environment before opening an issue (see `verifiers/examples/trl_grpo.py` as a reference).
See `verifiers/examples/math_train.py` for an example with the `ToolEnv` environment + a Python tool.
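For orientation, here is a hedged sketch of the overall shape of such a script. The helper names and arguments (`get_model_and_tokenizer`, `ToolEnv`, `GRPOEnvTrainer`, `get_default_grpo_config`, the `python` tool) are assumptions here; treat `verifiers/examples/math_train.py` as the authoritative version.

```python
import verifiers as vf
from verifiers.tools import python  # assumed location of the Python interpreter tool

# NOTE: helper and argument names below are assumptions; see
# verifiers/examples/math_train.py for the actual API.
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-7B-Instruct")

# Multi-turn tool environment that lets the model call a Python interpreter
vf_env = vf.ToolEnv(dataset="math", tools=[python], max_steps=3)

trainer = vf.GRPOEnvTrainer(
    model=model,
    processing_class=tokenizer,
    env=vf_env,
    reward_funcs=vf_env.get_rubric(),        # environment-provided reward functions
    train_dataset=vf_env.get_dataset(),
    args=vf.get_default_grpo_config(run_name="math-tool", num_gpus=4),
)
trainer.train()
```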
To run on an 8-GPU node with 4 inference GPUs and 4 training GPUs:
```bash
# Launch vLLM inference server from verifiers/, with .venv active
CUDA_VISIBLE_DEVICES=0,1,2,3 python verifiers/inference/vllm_serve.py --model "Qwen/Qwen2.5-7B-Instruct" --tensor_parallel_size 4 --max_model_len 8192 --gpu_memory_utilization 0.9 --enable_prefix_caching True
```

```bash
# Run training script from verifiers/, with .venv active
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --num-processes 4 --config-file configs/zero3.yaml verifiers/examples/math_train.py
```

Multi-node training setups are supported as well; you can specify the host IP + port of your inference server as an argument in the `GRPOConfig` in your training script. See the TRL docs for info on multi-node training via SLURM.
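For example (a hedged sketch: the host/port field names, and `get_default_grpo_config` itself, are assumptions, so confirm them against the `GRPOConfig` your training script actually builds):

```python
import verifiers as vf

training_args = vf.get_default_grpo_config(run_name="math-multinode", num_gpus=4)
# Hypothetical field names; check the GRPOConfig definition in your install.
training_args.vllm_server_host = "10.0.0.12"  # IP of the node running vllm_serve.py
training_args.vllm_server_port = 8000         # port the inference server listens on
```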
You can also use environment classes to evaluate models with multi-turn tool use offline, i.e. without RL training. See `verifiers/examples/math_eval.py` for an example.
To create your own multi-turn environment, inherit from `MultiTurnEnv` and implement:

```python
def is_completed(self, messages: List[Dict[str, str]], **kwargs: Any) -> bool:
    pass

def env_response(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, str]:
    pass
```
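As a minimal sketch of what a subclass might look like (the import path and any constructor behavior are assumptions; only the two method signatures above come from the repo):

```python
from typing import Any, Dict, List

from verifiers.envs import MultiTurnEnv  # import path is an assumption


class GuessingEnv(MultiTurnEnv):
    """Toy environment: the rollout ends once the assistant produces 'ANSWER:'."""

    def is_completed(self, messages: List[Dict[str, str]], **kwargs: Any) -> bool:
        # Stop when the last assistant message contains a final answer marker.
        last = messages[-1]
        return last["role"] == "assistant" and "ANSWER:" in last["content"]

    def env_response(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, str]:
        # Otherwise, reply as the environment (user role) so the model keeps going.
        return {"role": "user", "content": "Not done yet. Keep reasoning, then reply with 'ANSWER: <value>'."}
```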
Key features:

- Environments (`MultiTurnEnv`): `DoubleCheckEnv`, `CodeEnv`, `ToolEnv`, `SmolaToolEnv`
- Multi-turn tool use in `ToolEnv`, `SmolaToolEnv`, `CodeEnv`
- Dataset formatting + XML parsers (an illustrative format check is sketched after this list)
- Basic rubrics for math/code correctness + formatting
- Defaults for GRPO, model, tokenizer, etc.
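To illustrate the kind of check a formatting rubric performs on XML-tagged completions, here is a standalone sketch (the tag names and reward values are assumptions, not the repo's actual parser or rubric):

```python
import re

# Expect completions shaped like: <reasoning>...</reasoning><answer>...</answer>
FORMAT_PATTERN = re.compile(
    r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the expected XML layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0

print(format_reward("<reasoning>2 + 2 = 4</reasoning><answer>4</answer>"))  # 1.0
print(format_reward("the answer is 4"))                                     # 0.0
```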
If you use this code in your research, please cite:
```bibtex
@article{brown2025verifiers,
  title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
  author={Brown, William},
  year={2025}
}
```