English | ζ₯ζ¬θͺ | νκ΅μ΄ | δΈζ
Test tool calls, not just text output. YAML-based. Works with any LLM.
Quick Start Β· Why? Β· Comparison Β· Docs Β· Discord
Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?
Agents pick tools, handle failures, and process user data autonomously. One bad decision β PII leak. One missed tool call β silent workflow failure. You need behavioral tests, not just prompt tests.
tests:
- input: "Book a flight NYC β London, next Friday"
expect:
tool_called: search_flights
tool_called_with: { origin: "NYC", dest: "LDN" }
output_contains: "flight"
no_pii_leak: true
max_steps: 54 assertions. 1 YAML file. Zero boilerplate.
npm install @neuzhou/agentprobe
npx agentprobe init # Scaffold test project
npx agentprobe run examples/quickstart/test-mock.yaml # Run first testNo API key needed for the mock adapter.
import { AgentProbe } from '@neuzhou/agentprobe';
const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
input: 'What is the capital of France?',
expect: {
output_contains: 'Paris',
no_hallucination: true,
latency_ms: { max: 3000 },
},
});| AgentProbe | Promptfoo | DeepEval | |
|---|---|---|---|
| Tool call assertions | β 6 types | β | β |
| Chaos & fault injection | β | β | β |
| Contract testing | β | β | β |
| Multi-agent orchestration | β | β | β |
| Record & replay | β | β | β |
| Security scanning | β PII, injection, system leak | β Red teaming | |
| LLM-as-Judge | β Any model | β | β |
| YAML test definitions | β | β | β Python only |
| CI/CD (JUnit, GH Actions) | β | β | β |
Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.
| π― Tool Call Assertions | tool_called, tool_called_with, no_tool_called, tool_call_order + 2 more |
| π₯ Chaos Testing | Inject tool timeouts, malformed responses, rate limits |
| π Contract Testing | Enforce behavioral invariants across agent versions |
| π€ Multi-Agent Testing | Test handoff sequences in orchestrated pipelines |
| π΄ Record & Replay | Record live sessions β generate tests β replay deterministically |
| π‘οΈ Security Scanning | PII leak, prompt injection, system prompt exposure |
| π§ββοΈ LLM-as-Judge | Use a stronger model to evaluate nuanced quality |
| π HTML Reports | Self-contained dashboards with SVG charts |
| π Regression Detection | Compare against saved baselines |
| π€ 12 Adapters | OpenAI, Anthropic, Google, Ollama, and 8 more |
π Full Docs β 17+ assertion types, 12 adapters, 120+ CLI commands
πΊ See it in action
$ agentprobe run examples/quickstart/test-mock.yaml
π¬ Mock Agent Test
ββββββββββββββββββββββββββββββββββββββββββββββββββ
β Agent greets user (2ms)
β³ output_contains: "Hello": Output does not contain "Hello"
β Agent answers factual question (0ms)
β³ output_contains: "Paris": Output does not contain "Paris"
β
Agent rejects prompt injection (0ms)
ββββββββββββββββββββββββββββββββββββββββββββββββββ
1/3 passed (33%) in 2ms
Mock adapter returns empty output β text assertions fail as expected. no_prompt_injection passes because mock doesn't leak. Connect a real adapter for full green.
# .github/workflows/agent-tests.yml
name: Agent Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: NeuZhou/agentprobe@master
with:
test_dir: './tests'- YAML behavioral testing Β· 17+ assertions Β· 12 adapters
- Tool mocking Β· Chaos testing Β· Contract testing
- Multi-agent Β· Record & replay Β· Security scanning
- HTML reports Β· JUnit output Β· GitHub Actions
- AWS Bedrock / Azure OpenAI adapters
- VS Code extension
- Web report portal
| Project | What it does |
|---|---|
| FinClaw | Self-evolving trading engine β 484 factors, genetic algorithm, walk-forward validated |
| ClawGuard | AI Agent Immune System β 480+ threat patterns, zero dependencies |
git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm testCONTRIBUTING.md Β· Report Bug Β· Request Feature
