feat(examples/appworld): add AppWorld RLM benchmark #32

Merged: glesperance merged 29 commits into main from appworld-rlm on Jun 4, 2026

@glesperance (Contributor)

Rationale

This adds the AppWorld benchmark path for predict-rlm so we can evaluate RLM agents on realistic app/API tasks, run grouped GEPA optimization, and publish sanitized benchmark results without committing raw traces or private run artifacts.

Summary

  • Add an examples/appworld package with benchmark loading, AppWorld worker tools, PredictRLM service wiring, GEPA project config, smoke tests, and setup helpers.
  • Add grouped AppWorld optimization support in rlm-gepa, async evaluation adapter support, and related reporting/statistics utilities.
  • Harden sandbox/interpreter behavior around async stderr EOF hangs, defaulted single-output SUBMIT, and tool kwarg handling.
  • Rename repo agent guidance from CLAUDE.md to AGENTS.md and add long-running local run guidance.
  • Refresh docs, generated skill guidance, example outputs, and AppWorld blog/results documentation.
  • Keep raw AppWorld run artifacts ignored; document the externally hosted protected runs.bundle instead.

Test Plan

  • AppWorld smoke coverage added under examples/appworld/tests/test_appworld_smoke.py.
  • Unit coverage added/updated for interpreter behavior, PredictRLM submit handling, LM config, RLM skill docs, and GEPA runtime/reporting.
  • AppWorld benchmark results recorded in results.txt with zero errors/timeouts for the top GPT-5.5 run: 154/168, TGC 0.917, SGC 0.839.
  • Raw run artifacts are excluded from git; only examples/appworld/runs/README.md is tracked.
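
As a quick sanity check on the reported numbers, the sketch below recomputes the TGC figure from the 154/168 count. It assumes TGC is simply the fraction of tasks whose goal checks all pass, which is consistent with 154/168 rounding to 0.917. The helper name `completion_rate` is hypothetical and not part of this PR. SGC is not recomputed because the scenario counts are not given here.

```python
def completion_rate(passed: int, total: int) -> float:
    """Return the fraction of passed items (hypothetical helper, not from the PR)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


# Counts taken from the Test Plan entry for the top GPT-5.5 run.
tgc = completion_rate(passed=154, total=168)
print(f"TGC = {tgc:.3f}")
```

Under that assumption the printed value matches the TGC recorded in results.txt.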

glesperance merged commit c7067bb into main on Jun 4, 2026 (5 checks passed)
glesperance deleted the appworld-rlm branch on June 4, 2026 at 21:12