Add /run-evals comment command #1948

Open
adibarra wants to merge 2 commits into main from feat/run-evals-command

@adibarra

Collaborator

Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a single recipe (no perf sweep):
/run-evals <eval> <config-key> [conc] [master-config]

  • run-evals.yml: new issue_comment command (mirrors pr-comment-sweep.yml auth/SHA-pin/reply); maps <eval> -> framework+task, infers the nvidia/amd master config from the config-key HW token, builds 'test-config ... --evals-only', and calls e2e-tests.yml with eval-framework/eval-task.
  • e2e-tests.yml: new eval-framework/eval-task inputs, threaded into the eval jobs.
  • benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: matching inputs -> env (EVAL_FRAMEWORK / EVAL_TASKS_DIR).
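
For readers who want the parse step spelled out, here is a minimal Python sketch of the logic the first bullet describes. It is not the workflow's actual shell code. The alias pairs, the HW tokens, the master-config file names, the dash-separated config-key layout, and every function and class name are assumptions made for illustration; the PR only names the gpqa_diamond and swebench aliases and the lm-eval default.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical alias table (eval name -> (framework, task)). Assumed pairs.
EVAL_ALIASES = {
    "gpqa_diamond": ("lm-eval", "gpqa_diamond"),
    "swebench": ("swebench", "swebench"),
}

# Hypothetical HW tokens and master-config file names.
HW_VENDOR = {"h100": "nvidia", "h200": "nvidia", "b200": "nvidia",
             "mi300x": "amd", "mi325x": "amd", "mi355x": "amd"}
MASTER_CONFIG = {"nvidia": "nvidia-master.yaml", "amd": "amd-master.yaml"}

USAGE = "/run-evals <eval> <config-key> [conc] [master-config]"


class CommandError(ValueError):
    """The comment could not be turned into an eval run."""


@dataclass
class EvalRequest:
    framework: str
    task: str
    config_key: str
    conc: Optional[str]
    master_config: str


def parse_run_evals(comment: str) -> EvalRequest:
    words = comment.split()
    if not words or words[0] != "/run-evals":
        raise CommandError("not a /run-evals command")
    args = words[1:]
    if len(args) < 2:
        raise CommandError(f"usage: {USAGE}")
    eval_name, config_key = args[0], args[1]
    if eval_name not in EVAL_ALIASES:
        known = ", ".join(sorted(EVAL_ALIASES))
        raise CommandError(f"unknown eval '{eval_name}' (known: {known})")
    framework, task = EVAL_ALIASES[eval_name]
    conc = args[2] if len(args) > 2 else None
    if len(args) > 3:
        # An explicit master-config argument wins over inference.
        master_config = args[3]
    else:
        # Assumption: the config key is dash-separated and one token names the HW.
        vendor = next((HW_VENDOR[t] for t in config_key.split("-")
                       if t in HW_VENDOR), None)
        if vendor is None:
            raise CommandError(f"no recognized HW token in '{config_key}'")
        master_config = MASTER_CONFIG[vendor]
    return EvalRequest(framework, task, config_key, conc, master_config)
```

Raising a single error type for every bad input keeps the caller simple: it either gets a complete request or one message it can post back to the PR.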

Inert by default (eval-framework defaults to lm-eval). The framework-dispatch code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer) lives with the swebench PR and is checked out from the commented PR's head at runtime, so this can merge to main independently.
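
The "inert by default" behavior can be pictured with a small Python sketch. It assumes the consumer reads EVAL_FRAMEWORK from the environment and treats an empty value the same as an unset one (an omitted workflow input typically arrives as an empty string). The function name and the fallback constant are hypothetical; the real consumer lives in the swebench PR.

```python
import os
from typing import Mapping, Optional

DEFAULT_FRAMEWORK = "lm-eval"  # the hardcoded fallback described in this PR


def resolve_framework(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the eval framework to run, falling back to lm-eval."""
    env = os.environ if env is None else env
    # `or` covers both a missing variable and an empty string.
    return env.get("EVAL_FRAMEWORK") or DEFAULT_FRAMEWORK
```

With no input supplied, existing callers keep resolving to lm-eval, which is why the new inputs should not change current runs.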

…orkflow plumbing)

Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a
single recipe (no perf sweep):
  /run-evals <eval> <config-key> [conc] [master-config]

- run-evals.yml: new issue_comment command (mirrors pr-comment-sweep.yml auth/
  SHA-pin/reply); maps <eval> -> framework+task, infers nvidia/amd master config
  from the config-key HW token, builds 'test-config ... --evals-only', calls
  e2e-tests.yml with eval-framework/eval-task.
- e2e-tests.yml: new eval-framework/eval-task inputs, threaded into the eval jobs.
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: matching inputs -> env
  (EVAL_FRAMEWORK / EVAL_TASKS_DIR).

Inert by default (eval-framework defaults to lm-eval). The framework-dispatch
code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer)
lives with the swebench PR and is checked out from the commented PR's head at
runtime, so this can merge to main independently.
adibarra requested a review from a team on June 27, 2026 02:51
@functionstackx

Collaborator

@adibarra isn't there an eval-only tag that already does this?

Comment thread .github/workflows/e2e-tests.yml
Comment thread .github/workflows/run-evals.yml
Comment thread .github/workflows/run-evals.yml Outdated
@adibarra

Collaborator Author

Going for slightly different functionality here than the existing one.

- run-evals.yml: 'Reply with run link' now uses always() + branches on the parse
  step outcome, so a bad eval name / missing config-key / unrecognized HW token
  gets a helpful PR reply instead of silently doing nothing.
- run-evals.yml: document the gpqa_diamond/swebench aliases in the header + both
  error messages (were only in the case statement).
- e2e-tests/benchmark-tmpl/benchmark-multinode-tmpl: align the eval-framework
  input description across all three (drop the misleading 'recipe default'; the
  consumer falls back to a hardcoded lm-eval).
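
The reply behavior in the first bullet can be sketched as follows. In GitHub Actions a step's outcome is one of success, failure, cancelled, or skipped, and a step guarded by always() still runs after an earlier step fails. The Python below only models the branching; the function name, parameters, and message wording are hypothetical, not the workflow's actual text.

```python
from typing import Optional

USAGE = "/run-evals <eval> <config-key> [conc] [master-config]"


def build_reply(parse_outcome: str,
                run_url: Optional[str] = None,
                error: Optional[str] = None) -> Optional[str]:
    """Choose the PR reply from the parse step's outcome."""
    if parse_outcome == "success":
        return f"Eval run started: {run_url}"
    if parse_outcome == "failure":
        # Bad eval name, missing config-key, or unrecognized HW token.
        return f"Could not start evals: {error}\nUsage: {USAGE}"
    # skipped / cancelled: nothing was parsed, so stay quiet.
    return None
```

Branching on the outcome instead of relying on the default success-only condition is what turns a silent no-op into a visible error message.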