Document running the entire benchmarking suite #657
Conversation
Thanks for making a start on this @hacobe.
@taufeeque9 we should add documentation for how to generate the table of results in the paper. You might find this draft useful for that purpose, though feel free to modify it as needed.
```
run_name="<name>"
```

To compute a p-value to test whether the differences from the paper are statistically significant:
FWIW, in my test run they were statistically significant so something may have changed since the paper
Not too surprising. Can you share the results, and whether they moved in a positive or negative direction since the paper? ;)
I'll include this once I include the canonical results CSV
Actually, I think we should just do a bulk update of all results - I might need help with getting access to more compute for this though
Created issue for this #710
Okay, for Ant it looks like the original results were mean 1953 (std dev 99) and my new results are mean 1794 (std dev 244). The p-value is 0.20, so it's not a statistically significant difference, though this is different from my run yesterday. That gives more reason to rerun everything in bulk IMO.
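For reference, here's a quick way to sanity-check that p-value from the summary statistics alone. This is just a sketch using Welch's t-test with an assumed 5 runs per configuration (matching the count of 5 used for the baseline table later in this diff); the comparison script in this PR may compute it differently.

```python
# Sketch: reproduce the p-value from the summary statistics quoted above.
# Uses Welch's t-test with 5 runs per configuration (an assumption).
from scipy import stats

n = 5
paper_mean, paper_std = 1953, 99
new_mean, new_std = 1794, 244

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=paper_mean, std1=paper_std, nobs1=n,
    mean2=new_mean, std2=new_std, nobs2=n,
    equal_var=False,  # Welch's t-test: don't assume equal variances
)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")  # roughly p ≈ 0.2, as reported
```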
```python
baseline = pd.DataFrame.from_records(
    [
        {
            "algo": "??exp_command=bc",
```
For whatever reason, this is how the analyze script stores the algo field.
Hmm, we should probably try to clean that up! @taufeeque9 any idea why it's doing this?
That is the default behavior (the else branch below) when the Sacred command is not recognized. The commands to train bc and dagger were changed earlier without updating the _get_algo_name function in analyze.py. A check for preference_comparisons is also missing from the function.
Function reproduced from analyze.py:
```python
def _get_algo_name(sd: sacred_util.SacredDicts) -> str:
    exp_command = _get_exp_command(sd)
    if exp_command == "gail":
        return "GAIL"
    elif exp_command == "airl":
        return "AIRL"
    elif exp_command == "train_bc":
        return "BC"
    elif exp_command == "train_dagger":
        return "DAgger"
    else:
        return f"??exp_command={exp_command}"
```
```
@@ -0,0 +1,100 @@
"""Compare experiment results to baseline results.

This script compares experiment results to the results reported in the
```
I wonder if it's better to compare two CSVs, and we just ship a "reference" CSV in the GitHub repo? That way we can allow things to deviate from the paper over time, which seems desirable (as we add new algorithms and environments). We could still include the paper results as a CSV to allow people to check for regressions over longer periods.
I was thinking I'd just commit the latest results we have, whether that's from the paper or otherwise. We could include the paper results somewhere too, although that feels less essential.
It's likely that many of the behaviors have changed since the paper, so we may need to rerun to get up-to-date results (probably outside of this PR).
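As a rough illustration of the reference-CSV idea, something like the following could work, assuming both CSVs carry per-(algo, env) mean/std/count columns like the ones built up in this script. The file names and column names here are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical paths and schema; the real reference CSV would be committed
# to the repo and its column names fixed by the analyze/compare scripts.
reference = pd.read_csv("benchmarking/reference_results.csv")
latest = pd.read_csv("output/latest_results.csv")

merged = reference.merge(latest, on=["algo", "env_name"], suffixes=("_ref", "_new"))
for _, row in merged.iterrows():
    _, p = stats.ttest_ind_from_stats(
        row["mean_ref"], row["std_ref"], row["count_ref"],
        row["mean_new"], row["std_new"], row["count_new"],
        equal_var=False,
    )
    status = "DIFFERS (p < 0.05)" if p < 0.05 else "ok"
    print(f"{row['algo']} / {row['env_name']}: p = {p:.3f} [{status}]")
```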
```python
summary = summary.reset_index()

# Table 2 (https://arxiv.org/pdf/2211.11972.pdf)
# todo: store results in this repo outside this file
```
👍
```python
    ],
)
baseline["count"] = 5
baseline["confidence_level"] = 0.95
```
Do we want the confidence level stored in the DataFrame? Is there a reason it should vary between algorithms?
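If it shouldn't vary, one alternative (just a sketch, not what the PR currently does) is a module-level constant, keeping the DataFrame for values that actually differ per row:

```python
# Sketch: treat the confidence level as a single constant rather than a
# per-row column, since it is the same for every algorithm.
CONFIDENCE_LEVEL = 0.95

baseline["count"] = 5  # runs per configuration, as in the diff above
# ...and use CONFIDENCE_LEVEL directly wherever the interval is computed.
```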
```
@@ -1075,3 +1076,42 @@ def test_convert_trajs_from_current_format_is_idempotent(
    assert (
        filecmp.dircmp(converted_path, original_path).diff_files == []
    ), "convert_trajs not idempotent"


@pytest.mark.parametrize(
```
Thanks for adding a test!
Codecov Report
```
@@            Coverage Diff             @@
##           master     #657      +/-   ##
==========================================
+ Coverage   95.67%   96.29%   +0.61%
==========================================
  Files         100       92       -8
  Lines        9581     8750     -831
==========================================
- Hits         9167     8426     -741
+ Misses        414      324      -90
```

... and 57 files with indirect coverage changes
```bash
python experiments/commands.py \
    --name <name> \
```
Should this be --name <name> or --name=<name>?
This adds some instructions on running the benchmarking suite. It is still missing the baseline benchmark values and instructions to update them, which I'll include in a later PR. I'm also not totally clear on what util.py is for, so I didn't document it. Let me know if there's any part of the workflow that's still missing.
Test plan
- Ran unit tests
- Ran scripts and saw that they worked