diff --git a/evaluation/aider_bench/README.md b/evaluation/aider_bench/README.md
index b3d80ddf6af..a45a1b13967 100644
--- a/evaluation/aider_bench/README.md
+++ b/evaluation/aider_bench/README.md
@@ -59,15 +59,13 @@ You can update the arguments in the script
 ## Summarize Results
 
 ```bash
-poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file] [model_name]
-# with optional SKIP_NUM
-poetry run python SKIP_NUM=12 ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file] [model_name]
+poetry run python ./evaluation/aider_bench/scripts/summarize_results.py [path_to_output_jsonl_file]
 ```
 
 Full example:
 
 ```bash
-poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl claude-3-5-sonnet@20240620
+poetry run python ./evaluation/aider_bench/scripts/summarize_results.py evaluation/evaluation_outputs/outputs/AiderBench/CodeActAgent/claude-3-5-sonnet@20240620_maxiter_30_N_v1.9/output.jsonl
 ```
 
 This will list the instances that passed and the instances that failed. For each
@@ -81,21 +79,3 @@ outcome of the tests. If there are no syntax or indentation errors, you can
 expect to see something like "`..F...EF..`", where "`.`" means the test case
 passed, "`E`" means there was an error while executing the test case and "`F`"
 means some assertion failed and some returned output was not as expected.
-
-## Visualization
-
-If the required Python libraries are installed (`matplotlib.pyplot` and `seaborn`),
-the `summarize_results.py` script will also generate two histograms to
-the output folder.
-
-### Cost Histogram
-
-The cost histogram shows the number of successful and failed instances per cost point.
-
-![Cost Histogram](./examples/cost_histogram.png)
-
-### Actions Histogram
-
-The actions histogram shows per number of actions the number of successful and failed instances.
-
-![Actions Histogram](./examples/actions_histogram.png)
diff --git a/evaluation/aider_bench/examples/actions_histogram.png b/evaluation/aider_bench/examples/actions_histogram.png
deleted file mode 100644
index 894c60b8338..00000000000
Binary files a/evaluation/aider_bench/examples/actions_histogram.png and /dev/null differ
diff --git a/evaluation/aider_bench/examples/cost_histogram.png b/evaluation/aider_bench/examples/cost_histogram.png
deleted file mode 100644
index da6251da349..00000000000
Binary files a/evaluation/aider_bench/examples/cost_histogram.png and /dev/null differ
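
The README text kept by this diff describes a per-instance outcome string such as "`..F...EF..`", where "`.`" is a passed test case, "`E`" an execution error, and "`F`" a failed assertion. A minimal sketch of how such a string could be tallied is shown below. The helper `summarize_test_results` is hypothetical and is not part of `summarize_results.py`. The sketch also assumes the outcome is available as a plain string and that an instance counts as passed only when every test case passed.

```python
from collections import Counter

# Character meanings as described in the README text; anything else is
# counted as "unknown" rather than guessed at.
OUTCOME_NAMES = {".": "passed", "E": "error", "F": "failed"}


def summarize_test_results(test_results: str) -> dict:
    """Tally a unittest-style outcome string (hypothetical helper)."""
    outcomes = test_results.strip()
    counts = Counter(OUTCOME_NAMES.get(ch, "unknown") for ch in outcomes)
    return {
        "passed": counts["passed"],
        "error": counts["error"],
        "failed": counts["failed"],
        "unknown": counts["unknown"],
        # Assumption: an instance passes only if there is at least one
        # test case and all of them passed.
        "instance_passed": len(outcomes) > 0 and counts["passed"] == len(outcomes),
    }


summary = summarize_test_results("..F...EF..")
print(summary)
```

Mapping unrecognized characters to "unknown" keeps the tally from silently misclassifying output that contains syntax or indentation errors, which the README notes can change what the string looks like.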