This folder contains the evaluation harness for evaluating agents on the Entity-Deduction Arena (EDA) benchmark, introduced in the paper "Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games", presented at the ACL 2024 main conference.
Create a `config.toml` file at the root of the workspace if it does not already exist. Please check `README.md` for how to set this up.
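For reference, below is a minimal sketch of what an LLM config group might look like. The table name and keys are assumptions made for illustration; the authoritative schema is described in `README.md` and may differ across OpenDevin versions.

```toml
# Illustrative config group named "eval_gpt4_1106_preview" in config.toml.
# The exact table/key layout is an assumption; follow README.md for the
# schema used by your OpenDevin version.
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"   # model used by the evaluated agent
api_key = "sk-XXX"             # API key for that model
temperature = 0.0              # low temperature for more reproducible runs
```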
Export your OpenAI API key, e.g. `export OPENAI_API_KEY="sk-XXX"`. This is required for evaluation, since an OpenAI model is used to simulate the other party of the conversation.
To start the evaluation, run:

```bash
./evaluation/EDA/scripts/run_infer.sh [model_config] [git-version] [agent] [dataset] [eval_limit]
```

where `model_config` is mandatory, while `git-version`, `agent`, `dataset`, and `eval_limit` are optional; see the examples below.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would like to evaluate. It can also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent to benchmark, defaulting to `CodeActAgent`.
- `dataset`: there are two tasks in this evaluation; specify `dataset` to test on either the `things` or the `celebs` task.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, it runs inference on all instances.
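Because the arguments are positional, a later argument can only be set if the earlier ones are also given. For instance, a run limited to the first 10 instances of the `celebs` task might look like the following sketch (the model config name is illustrative; use a group defined in your own `config.toml`):

```bash
# Illustrative invocation: evaluate the current HEAD with CodeActAgent on the
# first 10 instances of the celebs task. "eval_gpt4_1106_preview" is assumed to
# match a config group defined in your config.toml.
./evaluation/EDA/scripts/run_infer.sh eval_gpt4_1106_preview HEAD CodeActAgent celebs 10
```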
Another example, evaluating the `0.6.2` release with `CodeActAgent` on the `things` task:

```bash
./evaluation/EDA/scripts/run_infer.sh eval_gpt4o_2024_05_13 0.6.2 CodeActAgent things
```

If you use this benchmark, please cite the paper:

```bibtex
@inproceedings{zhang2023entity,
  title     = {Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games},
  author    = {Zhang, Yizhe and Lu, Jiarui and Jaitly, Navdeep},
  booktitle = {ACL},
  year      = {2024}
}
```