[Eval] DiscoveryBench OpenHands Integration #4627

Ethan0456 · 2024-10-30T10:57:48Z

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench Benchmark into OpenHands, enabling it to evaluate the agent's capability for multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark OpenHands agents on real-world and synthetic discovery tasks, measuring how well they generate hypotheses, analyze data, and reason through complex workflows. Results on the DiscoveryBench test split with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |

Give a summary of what the PR does, explaining any non-trivial design decisions

This PR integrates DiscoveryBench into OpenHands by adding a structured flow that lets the OpenHands agent work through DiscoveryBench tasks.
Non-trivial design decisions:
- Cloning the DiscoveryBench repository: Instead of loading the dataset from Hugging Face, we clone the repository so that each run picks up the latest upstream version.
- process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis.
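The clone-instead-of-download decision can be illustrated with a minimal sketch. This is not the code from this PR: `ensure_repo`, its parameters, and the shallow-clone flag are assumptions made for illustration, and the repository URL is left to the caller. The `run` argument exists only so the command can be inspected without touching the network.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    # Raise CalledProcessError if git exits with a non-zero status.
    subprocess.run(cmd, check=True)


def ensure_repo(
    repo_url: str,
    dest: Path,
    run: Callable[[Sequence[str]], object] = _run_checked,
) -> bool:
    """Clone repo_url into dest unless dest already exists.

    Hypothetical helper, not part of OpenHands or DiscoveryBench.
    Returns True if a clone was started, False if dest was reused.
    """
    if dest.exists():
        # An existing checkout is reused as-is; it is NOT updated here.
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # --depth 1 fetches only the latest commit, which is enough to read data.
    run(["git", "clone", "--depth", "1", repo_url, str(dest)])
    return True
```

A fresh clone gets the upstream state at clone time. In this sketch a reused checkout would need a separate `git pull` to stay current.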

How we structured everything in run_infer.py

run_infer.py is the entry point for running the evaluation. Here's how the process is structured:
- DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
- Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
- Agent configuration: Function calling is disabled, and the Jupyter and browsing-delegate options are enabled for CodeActAgent.
- Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
- Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
- Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
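The inference, parsing, and logging steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of run_infer.py: the instance keys (`instance_id`, `query`, `gold_hypothesis`), the `run_agent` and `evaluate` callables, and `write_jsonl` are all hypothetical stand-ins, and the real parsing and scoring are more involved.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def process_instance(
    instance: dict,
    run_agent: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
) -> dict:
    """Run one task and return a record ready for output.jsonl.

    All field names here are assumptions made for this sketch.
    """
    # 1. Agent inference: the agent returns its final answer as text.
    raw_answer = run_agent(instance["query"])
    # 2. Result parsing: here we only trim surrounding whitespace.
    hypothesis = raw_answer.strip()
    # 3. Evaluation against the gold hypothesis.
    test_result = evaluate(hypothesis, instance["gold_hypothesis"])
    return {
        "instance_id": instance["instance_id"],
        "hypothesis": hypothesis,
        "test_result": test_result,
    }


def write_jsonl(records: Iterable[dict], path: Path) -> None:
    """Write one JSON object per line, the format of output.jsonl."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Passing the agent and the evaluator in as callables keeps the per-instance logic easy to exercise with stubs, without a Docker runtime or an LLM.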

Link of any specific issues this addresses

This PR addresses issue [Evaluation] Add DiscoveryBench Benchmark #4465

Link to the Older PR

This PR is a newer version of [Evaluation] DiscoveryBench OpenHands Integration

Ethan0456 and others added 14 commits October 30, 2024 13:58

init(eval): add baseline DiscoveryBench infer script

7c9ccaa

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): implement create_dataset function to clone and prepare da…

da61d20

…taset Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): implement process_instance function

cfe39b6

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): initialize docker runtime with necessary python libraries

33071aa

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): implement complete_runtime function

a585cee

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): add response parser for DiscoveryBench evaluation

98efba4

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

docs(eval): Add README for discoverybench

8140aae

docs(eval): Add README for DiscoveryBench eval utils

477eb84

refactor(eval): integrate DiscoveryBench evaluation and update script…

f1bf06c

…s for linting compliance Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): add run_infer.sh to execute inference

23a8027

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

feat(eval): add AgentConfig to disable function calling and enable ju…

0374351

…pyter and browsing delegate config Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

chore(eval): set execute permission for run_infer.sh

7d35c51

Signed-off-by: Abhijeetsingh Meena <abhijeet040403@gmail.com>

docs(eval): update README to comply with linting rules

7f3ab98

Merge branch 'main' into test/discoverybench-openhands-integration

adf0b87

suranah mentioned this pull request Oct 30, 2024

[Evaluation] DiscoveryBench OpenHands Integration #4562

Open

1 task

neubig self-requested a review October 30, 2024 15:11
