Skip to content

NeuZhou/agentprobe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

154 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

English | ζ—₯本θͺž | ν•œκ΅­μ–΄ | δΈ­ζ–‡

πŸ”¬ AgentProbe

Playwright for AI Agents

AgentProbe β€” Test Every Decision Your Agent Makes

Test tool calls, not just text output. YAML-based. Works with any LLM.

npm CI codecov TypeScript MIT License Stars

Quick Start Β· Why? Β· Comparison Β· Docs Β· Discord


Why AgentProbe?

Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?

Agents pick tools, handle failures, and process user data autonomously. One bad decision β†’ PII leak. One missed tool call β†’ silent workflow failure. You need behavioral tests, not just prompt tests.

tests:
  - input: "Book a flight NYC β†’ London, next Friday"
    expect:
      tool_called: search_flights
      tool_called_with: { origin: "NYC", dest: "LDN" }
      output_contains: "flight"
      no_pii_leak: true
      max_steps: 5

4 assertions. 1 YAML file. Zero boilerplate.


⚑ Quick Start

npm install @neuzhou/agentprobe
npx agentprobe init                                    # Scaffold test project
npx agentprobe run examples/quickstart/test-mock.yaml  # Run first test

No API key needed for the mock adapter.

Programmatic API

import { AgentProbe } from '@neuzhou/agentprobe';

const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
  input: 'What is the capital of France?',
  expect: {
    output_contains: 'Paris',
    no_hallucination: true,
    latency_ms: { max: 3000 },
  },
});

How AgentProbe Compares

AgentProbe Promptfoo DeepEval
Tool call assertions βœ… 6 types ❌ ❌
Chaos & fault injection βœ… ❌ ❌
Contract testing βœ… ❌ ❌
Multi-agent orchestration βœ… ❌ ❌
Record & replay βœ… ❌ ❌
Security scanning βœ… PII, injection, system leak βœ… Red teaming ⚠️ Basic
LLM-as-Judge βœ… Any model βœ… βœ…
YAML test definitions βœ… βœ… ❌ Python only
CI/CD (JUnit, GH Actions) βœ… βœ… βœ…

Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.


Features

🎯 Tool Call Assertions tool_called, tool_called_with, no_tool_called, tool_call_order + 2 more
πŸ’₯ Chaos Testing Inject tool timeouts, malformed responses, rate limits
πŸ“œ Contract Testing Enforce behavioral invariants across agent versions
🀝 Multi-Agent Testing Test handoff sequences in orchestrated pipelines
πŸ”΄ Record & Replay Record live sessions β†’ generate tests β†’ replay deterministically
πŸ›‘οΈ Security Scanning PII leak, prompt injection, system prompt exposure
πŸ§‘β€βš–οΈ LLM-as-Judge Use a stronger model to evaluate nuanced quality
πŸ“Š HTML Reports Self-contained dashboards with SVG charts
πŸ”„ Regression Detection Compare against saved baselines
πŸ€– 12 Adapters OpenAI, Anthropic, Google, Ollama, and 8 more

πŸ“– Full Docs β€” 17+ assertion types, 12 adapters, 120+ CLI commands


πŸ“Ί See it in action
$ agentprobe run examples/quickstart/test-mock.yaml

  πŸ”¬ Mock Agent Test
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ❌ Agent greets user (2ms)
     ↳ output_contains: "Hello": Output does not contain "Hello"
  ❌ Agent answers factual question (0ms)
     ↳ output_contains: "Paris": Output does not contain "Paris"
  βœ… Agent rejects prompt injection (0ms)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  1/3 passed (33%) in 2ms

Mock adapter returns empty output β€” text assertions fail as expected. no_prompt_injection passes because mock doesn't leak. Connect a real adapter for full green.


πŸš€ GitHub Action

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: NeuZhou/agentprobe@master
        with:
          test_dir: './tests'

Roadmap

  • YAML behavioral testing Β· 17+ assertions Β· 12 adapters
  • Tool mocking Β· Chaos testing Β· Contract testing
  • Multi-agent Β· Record & replay Β· Security scanning
  • HTML reports Β· JUnit output Β· GitHub Actions
  • AWS Bedrock / Azure OpenAI adapters
  • VS Code extension
  • Web report portal

🌐 Also Check Out

Project What it does
FinClaw Self-evolving trading engine β€” 484 factors, genetic algorithm, walk-forward validated
ClawGuard AI Agent Immune System β€” 480+ threat patterns, zero dependencies

Contributing

git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm test

CONTRIBUTING.md Β· Report Bug Β· Request Feature


License

MIT Β© NeuZhou


Star History

Star History