eval: update aiderbench readme (All-Hands-AI#4209)
xingyaoww authored Oct 4, 2024
1 parent 9cc9b19 commit 80a6313
Showing 3 changed files with 2 additions and 22 deletions.
24 changes: 2 additions & 22 deletions evaluation/aider_bench/README.md
@@ -59,15 +59,13 @@ You can update the arguments in the script
## Summarize Results

```bash
-poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file] [model_name]
-# with optional SKIP_NUM
-poetry run python SKIP_NUM=12 ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file] [model_name]
+poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
```

Full example:

```bash
-poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl claude-3-5-sonnet@20240620
+poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
```

This will list the instances that passed and the instances that failed. For each
@@ -81,21 +79,3 @@ outcome of the tests. If there are no syntax or indentation errors, you can
expect to see something like "`..F...EF..`", where "`.`" means the test case
passed, "`E`" means there was an error while executing the test case and "`F`"
means some assertion failed and some returned output was not as expected.
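
As a rough illustration of how the outcome string described above can be read, the following sketch tallies its markers. The helper name `tally_outcomes` is hypothetical and is not part of `summarize_results.py`; it assumes only the `.`/`E`/`F` convention from the README text.

```python
from collections import Counter

# Hypothetical helper, not part of summarize_results.py.
# Assumes the outcome string uses only the markers described in the README:
# "." = passed, "E" = error while executing, "F" = failed assertion.
MARKERS = {".": "passed", "E": "error", "F": "failed"}


def tally_outcomes(outcome: str) -> dict[str, int]:
    """Count passed, errored, and failed test cases in an outcome string."""
    counts = Counter(outcome)
    unknown = set(counts) - set(MARKERS)
    if unknown:
        # Anything else (for example a traceback from a syntax error)
        # is not a per-test marker, so refuse to guess.
        raise ValueError(f"unexpected markers: {sorted(unknown)}")
    return {label: counts.get(marker, 0) for marker, label in MARKERS.items()}


if __name__ == "__main__":
    print(tally_outcomes("..F...EF.."))
    # {'passed': 7, 'error': 1, 'failed': 2}
```

Rejecting unknown characters is a deliberate choice here: the README notes that the dotted string only appears when there are no syntax or indentation errors, so other output should not be counted as test results.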
-
-## Visualization
-
-If the required Python libraries are installed (`matplotlib.pyplot` and `seaborn`),
-the `summarize_results.py` script will also generate two histograms to
-the output folder.
-
-### Cost Histogram
-
-The cost histogram shows the number of successful and failed instances per cost point.
-
-![Cost Histogram](./examples/cost_histogram.png)
-
-### Actions Histogram
-
-The actions histogram shows per number of actions the number of successful and failed instances.
-
-![Actions Histogram](./examples/actions_histogram.png)
Binary file not shown.
Binary file removed evaluation/aider_bench/examples/cost_histogram.png
Binary file not shown.