🔬 AgentProbe

Playwright for AI Agents

Test tool calls, not just text output. YAML-based. Works with any LLM.

Quick Start · Why? · Comparison · Docs · Discord

Why AgentProbe?

Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?

Agents pick tools, handle failures, and process user data autonomously. One bad decision → PII leak. One missed tool call → silent workflow failure. You need behavioral tests, not just prompt tests.

tests:
  - input: "Book a flight NYC → London, next Friday"
    expect:
      tool_called: search_flights
      tool_called_with: { origin: "NYC", dest: "LDN" }
      output_contains: "flight"
      no_pii_leak: true
      max_steps: 5

4 assertions. 1 YAML file. Zero boilerplate.

⚡ Quick Start

npm install @neuzhou/agentprobe
npx agentprobe init                                    # Scaffold test project
npx agentprobe run examples/quickstart/test-mock.yaml  # Run first test

No API key needed for the mock adapter.

Programmatic API

import { AgentProbe } from '@neuzhou/agentprobe';

const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
  input: 'What is the capital of France?',
  expect: {
    output_contains: 'Paris',
    no_hallucination: true,
    latency_ms: { max: 3000 },
  },
});

How AgentProbe Compares

	AgentProbe	Promptfoo	DeepEval
Tool call assertions	✅ 6 types	❌	❌
Chaos & fault injection	✅	❌	❌
Contract testing	✅	❌	❌
Multi-agent orchestration	✅	❌	❌
Record & replay	✅	❌	❌
Security scanning	✅ PII, injection, system leak	✅ Red teaming	⚠️ Basic
LLM-as-Judge	✅ Any model	✅	✅
YAML test definitions	✅	✅	❌ Python only
CI/CD (JUnit, GH Actions)	✅	✅	✅

Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.

Features


🎯 Tool Call Assertions	`tool_called`, `tool_called_with`, `no_tool_called`, `tool_call_order` + 2 more
💥 Chaos Testing	Inject tool timeouts, malformed responses, rate limits
📜 Contract Testing	Enforce behavioral invariants across agent versions
🤝 Multi-Agent Testing	Test handoff sequences in orchestrated pipelines
🔴 Record & Replay	Record live sessions → generate tests → replay deterministically
🛡️ Security Scanning	PII leak, prompt injection, system prompt exposure
🧑‍⚖️ LLM-as-Judge	Use a stronger model to evaluate nuanced quality
📊 HTML Reports	Self-contained dashboards with SVG charts
🔄 Regression Detection	Compare against saved baselines
🤖 12 Adapters	OpenAI, Anthropic, Google, Ollama, and 8 more

📖 Full Docs — 17+ assertion types, 12 adapters, 120+ CLI commands

📺 See it in action

$ agentprobe run examples/quickstart/test-mock.yaml

  🔬 Mock Agent Test
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ❌ Agent greets user (2ms)
     ↳ output_contains: "Hello": Output does not contain "Hello"
  ❌ Agent answers factual question (0ms)
     ↳ output_contains: "Paris": Output does not contain "Paris"
  ✅ Agent rejects prompt injection (0ms)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  1/3 passed (33%) in 2ms

Mock adapter returns empty output — text assertions fail as expected. no_prompt_injection passes because mock doesn't leak. Connect a real adapter for full green.

🚀 GitHub Action

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: NeuZhou/agentprobe@master
        with:
          test_dir: './tests'

Roadmap

YAML behavioral testing · 17+ assertions · 12 adapters
Tool mocking · Chaos testing · Contract testing
Multi-agent · Record & replay · Security scanning
HTML reports · JUnit output · GitHub Actions
AWS Bedrock / Azure OpenAI adapters
VS Code extension
Web report portal

🌐 Also Check Out

Project	What it does
FinClaw	Self-evolving trading engine — 484 factors, genetic algorithm, walk-forward validated
ClawGuard	AI Agent Immune System — 480+ threat patterns, zero dependencies

Contributing

git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm test

CONTRIBUTING.md · Report Bug · Request Feature

License

MIT © NeuZhou

Name		Name	Last commit message	Last commit date
Latest commit History 154 Commits
.github		.github
assets		assets
benchmarks		benchmarks
docs		docs
examples		examples
references		references
skill		skill
src		src
tests		tests
.eslintrc.json		.eslintrc.json
.gitignore		.gitignore
.npmignore		.npmignore
.prettierrc		.prettierrc
.secret-patterns		.secret-patterns
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README-new.md		README-new.md
README-old.md		README-old.md
README.ja.md		README.ja.md
README.ko.md		README.ko.md
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SECURITY.md		SECURITY.md
SKILL.md		SKILL.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔬 AgentProbe

Playwright for AI Agents

Why AgentProbe?

⚡ Quick Start

Programmatic API

How AgentProbe Compares

Features

🚀 GitHub Action

Roadmap

🌐 Also Check Out

Contributing

License

Star History

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 2

Languages

Folders and files

Latest commit

History

Repository files navigation

🔬 AgentProbe

Playwright for AI Agents

Why AgentProbe?

⚡ Quick Start

Programmatic API

How AgentProbe Compares

Features

🚀 GitHub Action

Roadmap

🌐 Also Check Out

Contributing

License

Star History

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Languages

Packages