Context
The Princeton paper "Towards a Science of AI Agent Reliability" (arXiv 2602.16666) proposes 12 metrics across 4 dimensions (consistency, robustness, predictability, safety) for evaluating AI agent reliability. Our judge skill evaluates content quality (semantic, pragmatic, syntactic) but does NOT measure behavioral reliability. A perfect quality score on one run is meaningless if the next run produces contradictory output.
Plan
See plans/agent-reliability-judge-extension.md in cto-executive-system.
Phase 1: Judge Consistency Check (~1.5h)
- Add optional reliability object to verdict-schema.json (runs, outcome_consistency, score_variance)
- Add --reliability flag to run-judge.sh for multi-pass evaluation (3 runs, compute consistency)
- Flag verdicts with consistency < 0.67 as UNRELIABLE
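The Phase 1 computation could be sketched roughly as follows. This is a minimal sketch, not the planned implementation: it assumes outcome_consistency means the share of runs that agree with the most common outcome and that score_variance is the population variance of the scores, neither of which the plan defines. The function names and the PASS/FAIL labels are hypothetical.

```python
from collections import Counter
from statistics import pvariance

UNRELIABLE_THRESHOLD = 0.67  # threshold taken from the plan


def summarize_runs(outcomes, scores):
    """Build a reliability object (runs, outcome_consistency, score_variance).

    Assumption: consistency is the fraction of runs matching the majority outcome.
    """
    if not outcomes or len(outcomes) != len(scores):
        raise ValueError("outcomes and scores must be non-empty and the same length")
    majority_count = Counter(outcomes).most_common(1)[0][1]
    return {
        "runs": len(outcomes),
        "outcome_consistency": round(majority_count / len(outcomes), 4),
        "score_variance": round(pvariance(scores), 4),
    }


def is_unreliable(reliability, threshold=UNRELIABLE_THRESHOLD):
    """Return True when the verdict should be flagged as UNRELIABLE."""
    return reliability["outcome_consistency"] < threshold


# Hypothetical three-pass evaluation where one run disagrees.
reliability = summarize_runs(["PASS", "PASS", "FAIL"], [0.9, 0.8, 0.4])
flagged = is_unreliable(reliability)
```

Under this assumed definition, a 2-of-3 agreement yields roughly 0.667, which falls just below 0.67, so only unanimous three-run verdicts would escape the flag. Whether that is the intended behavior depends on how consistency ends up being defined.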
Phase 2: Quality-Gate Robustness Check (~1.5h)
- Add prompt robustness check for agent-generated outputs (re-run with paraphrase, compare)
- Add robustness section to Quality Gate Report template
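One way the paraphrase comparison might look is sketched below. Everything here is an assumption for illustration: the `evaluate` callable stands in for whatever runs the agent and judge on a prompt, the verdict shape (`outcome`, `score`) is hypothetical, and the 0.1 score tolerance is an arbitrary placeholder rather than a value from the plan.

```python
def robustness_check(evaluate, prompt, paraphrase, max_score_delta=0.1):
    """Compare verdicts for a prompt and its paraphrase.

    `evaluate` is a hypothetical callable: prompt -> {"outcome": str, "score": float}.
    """
    base = evaluate(prompt)
    variant = evaluate(paraphrase)
    same_outcome = base["outcome"] == variant["outcome"]
    score_delta = abs(float(base["score"]) - float(variant["score"]))
    return {
        "same_outcome": same_outcome,
        "score_delta": round(score_delta, 4),
        # Robust only if the outcome is unchanged and the score barely moves.
        "robust": same_outcome and score_delta <= max_score_delta,
    }


def stub_evaluate(prompt):
    """Stand-in evaluator for demonstration; a real one would call the judge."""
    canned = {
        "Summarize the report.": {"outcome": "PASS", "score": 0.90},
        "Give a summary of the report.": {"outcome": "PASS", "score": 0.85},
        "Report summary, please.": {"outcome": "FAIL", "score": 0.40},
    }
    return canned[prompt]


stable = robustness_check(stub_evaluate, "Summarize the report.", "Give a summary of the report.")
unstable = robustness_check(stub_evaluate, "Summarize the report.", "Report summary, please.")
```

The returned dictionary could feed the robustness section of the Quality Gate Report, though the exact fields that section needs are not specified in the plan.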
Phase 3: Safety Compliance Extension (~1h)
- Merge SLB risk-tier model with Princeton S_comp metric (CRITICAL/DANGEROUS/CAUTION/SAFE tiers for file paths)
- Add content_hash (SHA-256) to verdict-schema.json binding verdicts to exact content
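The content_hash binding could work along these lines. This is a sketch under assumptions: the helper names are hypothetical, the content is assumed to be hashed as UTF-8 text without any normalization, and the verdict is treated as a plain dictionary rather than a validated verdict-schema.json instance.

```python
import hashlib


def content_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the content (assumed UTF-8, unnormalized)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def bind_verdict(verdict: dict, content: str) -> dict:
    """Return a copy of the verdict with content_hash set for the judged content."""
    return {**verdict, "content_hash": content_hash(content)}


def verdict_matches(verdict: dict, content: str) -> bool:
    """Check that a stored verdict still refers to exactly this content."""
    return verdict.get("content_hash") == content_hash(content)


# Hypothetical verdict bound to a draft, then checked against an edited draft.
draft = "Quarterly summary, version 1"
verdict = bind_verdict({"outcome": "PASS"}, draft)
still_valid = verdict_matches(verdict, draft)
stale = not verdict_matches(verdict, draft + " (edited)")
```

Because any byte-level change alters the digest, a verdict bound this way cannot silently be reused for modified content. Whether whitespace or line endings should be normalized before hashing is a design choice the plan leaves open.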
Related
- knowledge/agent-reliability-metrics-princeton.md
- knowledge/slb-two-person-rule-agents.md