Document running the entire benchmarking suite #657
Conversation
Thanks for making a start on this @hacobe.
@taufeeque9 we should add documentation for how to generate the table of results in the paper. You might find this draft useful for that purpose, though feel free to modify it as needed.
```
run_name="<name>"
```

To compute a p-value to test whether the differences from the paper are statistically significant:
FWIW, in my test run they were statistically significant so something may have changed since the paper
Not too surprising. Can you share the results, and whether they moved in a positive or negative direction since the paper? ;)
I'll include this once I include the canonical results CSV
Actually, I think we should just do a bulk update of all results - I might need help with getting access to more compute for this though
Created issue for this #710
Okay, for Ant it looks like the original results were mean 1953 (std dev 99) and my new results are mean 1794 (std dev 244). The p-value is 0.20, so it's not a statistically significant difference, though this is different from my run yesterday. That gives more reason to rerun everything in bulk IMO.
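For reference, here's a quick way to sanity-check that p-value from the summary statistics alone. This is just a sketch using Welch's t-test with an assumed 5 runs per configuration (matching the count of 5 used for the baseline table later in this diff); the comparison script in this PR may compute it differently.

```python
# Sketch: reproduce the p-value from the summary statistics quoted above.
# Uses Welch's t-test with 5 runs per configuration (an assumption).
from scipy import stats

n = 5
paper_mean, paper_std = 1953, 99
new_mean, new_std = 1794, 244

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=paper_mean, std1=paper_std, nobs1=n,
    mean2=new_mean, std2=new_std, nobs2=n,
    equal_var=False,  # Welch's t-test: don't assume equal variances
)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")  # roughly p ≈ 0.2, as reported
```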
```python
baseline = pd.DataFrame.from_records(
    [
        {
            "algo": "??exp_command=bc",
```
For whatever reason, this is how the analyze script stores the algo field.
Hmm, we should probably try to clean that up! @taufeeque9 any idea why it's doing this?
That is the default behavior (the else branch below) when the Sacred command is not recognized. The commands to train bc and dagger were changed earlier without updating the _get_algo_name function in analyze.py. A check for preference_comparisons is also missing from the function.
Function reproduced from analyze.py:
```python
def _get_algo_name(sd: sacred_util.SacredDicts) -> str:
    exp_command = _get_exp_command(sd)
    if exp_command == "gail":
        return "GAIL"
    elif exp_command == "airl":
        return "AIRL"
    elif exp_command == "train_bc":
        return "BC"
    elif exp_command == "train_dagger":
        return "DAgger"
    else:
        return f"??exp_command={exp_command}"
```
```
@@ -0,0 +1,100 @@
"""Compare experiment results to baseline results.

This script compares experiment results to the results reported in the
```
I wonder if it's better to compare two CSVs, and we just ship a "reference" CSV in the GitHub repo? That way we can allow things to deviate from the paper over time, which seems desirable (as we add new algorithms and environments). We could still include the paper results as a CSV to allow people to check for regressions over longer periods.
I was thinking I'd just commit the latest results we have, whether that's from the paper or otherwise. We could include the paper results somewhere too, although that feels less essential.
It's likely that many of the behaviors have changed since the paper, so we may need to rerun to get up-to-date results (probably outside of this PR).
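As a rough illustration of the reference-CSV idea, something like the following could work, assuming both CSVs carry per-(algo, env) mean/std/count columns like the ones built up in this script. The file names and column names here are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical paths and schema; the real reference CSV would be committed
# to the repo and its column names fixed by the analyze/compare scripts.
reference = pd.read_csv("benchmarking/reference_results.csv")
latest = pd.read_csv("output/latest_results.csv")

merged = reference.merge(latest, on=["algo", "env_name"], suffixes=("_ref", "_new"))
for _, row in merged.iterrows():
    _, p = stats.ttest_ind_from_stats(
        row["mean_ref"], row["std_ref"], row["count_ref"],
        row["mean_new"], row["std_new"], row["count_new"],
        equal_var=False,
    )
    status = "DIFFERS (p < 0.05)" if p < 0.05 else "ok"
    print(f"{row['algo']} / {row['env_name']}: p = {p:.3f} [{status}]")
```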
```python
summary = summary.reset_index()

# Table 2 (https://arxiv.org/pdf/2211.11972.pdf)
# todo: store results in this repo outside this file
```
👍
```python
    ],
)
baseline["count"] = 5
baseline["confidence_level"] = 0.95
```
Do we want the confidence level stored in the DataFrame? Is there a reason it should vary between algorithms?
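If it shouldn't vary, one alternative (just a sketch, not what the PR currently does) is a module-level constant, keeping the DataFrame for values that actually differ per row:

```python
# Sketch: treat the confidence level as a single constant rather than a
# per-row column, since it is the same for every algorithm.
CONFIDENCE_LEVEL = 0.95

baseline["count"] = 5  # runs per configuration, as in the diff above
# ...and use CONFIDENCE_LEVEL directly wherever the interval is computed.
```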
```
@@ -1075,3 +1076,42 @@ def test_convert_trajs_from_current_format_is_idempotent(
    assert (
        filecmp.dircmp(converted_path, original_path).diff_files == []
    ), "convert_trajs not idempotent"


@pytest.mark.parametrize(
```
Thanks for adding a test!
Codecov Report
```
@@            Coverage Diff             @@
##           master     #657      +/-   ##
==========================================
+ Coverage   95.67%   96.29%   +0.61%
==========================================
  Files         100       92       -8
  Lines        9581     8750     -831
==========================================
- Hits         9167     8426     -741
+ Misses        414      324      -90
```

... and 57 files with indirect coverage changes
```bash
python experiments/commands.py \
    --name <name> \
```
Should this be --name <name> or --name=<name>?
This adds some instructions on running the benchmarking suite. It is still missing the baseline benchmark values and instructions to update them, which I'll include in a later PR. I'm also not totally clear on what util.py is for, so I didn't document it. Let me know if there's any part of the workflow that's still missing.
Test plan
- Ran unit tests
- Ran scripts and saw that they worked