
Document running the entire benchmarking suite #657

Open · wants to merge 14 commits into master
Conversation

@hacobe (Contributor) commented Jan 7, 2023

This adds some instructions on running the benchmarking suite. It is still missing the baseline benchmark values and instructions for updating them, which I'll include in a later PR. I'm also not totally clear on what util.py is for, so I didn't document it. Let me know if any part of the workflow is still missing.

Test plan

- Ran unit tests
- Ran scripts and saw that they worked

@hacobe hacobe requested a review from AdamGleave January 7, 2023 02:30
@hacobe hacobe marked this pull request as draft January 7, 2023 02:30
@AdamGleave (Member) left a comment

Thanks for making a start on this @hacobe.

@taufeeque9 we should add documentation for how to generate the table of results in the paper. You might find this draft useful for that purpose, though feel free to modify it as needed.

benchmarking/README.md
```
run_name="<name>"
```

To compute a p-value to test whether the differences from the paper are statistically significant:
Contributor

FWIW, in my test run they were statistically significant, so something may have changed since the paper.

Member

Not too surprising. Can you share the results, and whether they moved in a positive or negative direction since the paper? ;)

Contributor

I'll include this once I add the canonical results CSV.

Contributor

Actually, I think we should just do a bulk update of all results. I might need help getting access to more compute for this, though.

Contributor

Created issue #710 for this.

@timbauman (Contributor) commented May 9, 2023

Okay, for Ant it looks like the original results were mean 1953, std dev 99, and the new results (for me) are mean 1794, std dev 244. The p-value is 0.20, so it's not a statistically significant difference. This is different from my run yesterday, though, which gives more reason to rerun everything in bulk IMO.
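For reference, a minimal sketch of reproducing that p-value from the summary statistics alone, assuming 5 seeds per condition and Welch's t-test (the exact test the comparison script uses may differ):

```python
from scipy import stats

# Summary statistics quoted above for Ant; 5 seeds per condition is an assumption.
paper_mean, paper_std, paper_n = 1953, 99, 5
new_mean, new_std, new_n = 1794, 244, 5

# Welch's two-sample t-test computed directly from summary statistics.
t_stat, p_value = stats.ttest_ind_from_stats(
    paper_mean, paper_std, paper_n,
    new_mean, new_std, new_n,
    equal_var=False,
)
print(f"p = {p_value:.2f}")  # roughly 0.2, consistent with the comment above
```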

```python
baseline = pd.DataFrame.from_records(
    [
        {
            "algo": "??exp_command=bc",
```
Contributor

For whatever reason, this is how the analyze script stores the algo field.

Member

Hmm, we should probably try to clean that up! @taufeeque9 any idea why it's doing this?

Collaborator

That is the default behavior (the else statement below) when the Sacred command is not recognized. The commands to train bc and dagger were changed earlier without updating the _get_algo_name function in analyze.py. A check for preference_comparisons is also missing from the function.

Function reproduced from analyze.py:

```python
def _get_algo_name(sd: sacred_util.SacredDicts) -> str:
    exp_command = _get_exp_command(sd)

    if exp_command == "gail":
        return "GAIL"
    elif exp_command == "airl":
        return "AIRL"
    elif exp_command == "train_bc":
        return "BC"
    elif exp_command == "train_dagger":
        return "DAgger"
    else:
        return f"??exp_command={exp_command}"
```

@timbauman timbauman marked this pull request as ready for review May 8, 2023 22:28
src/imitation/scripts/compare_to_baseline.py
@@ -0,0 +1,100 @@
"""Compare experiment results to baseline results.

This script compares experiment results to the results reported in the
Member

I wonder if it's better to compare two CSVs, and we just ship a "reference" CSV with the GitHub repo? That way we can allow things to deviate from the paper over time, which seems desirable (as we add new algorithms and environments). We could still include the paper results as a CSV to allow people to check for regressions over longer periods.

Contributor

I was thinking I'd just commit the latest results we have, whether that's from the paper or otherwise. We could include the paper results somewhere too, although that feels less essential.

It's likely that many of the behaviors have changed since the paper, so we may need to rerun to get up-to-date results (probably outside of this PR).
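A rough sketch of the two-CSV comparison being discussed, with hypothetical file names and an assumed column layout of one row per (algo, env) with mean, std, and count:

```python
import pandas as pd
from scipy import stats

# Hypothetical paths; the column layout (algo, env, mean, std, count) is assumed.
reference = pd.read_csv("benchmarking/reference_results.csv")
latest = pd.read_csv("output/latest_results.csv")

merged = reference.merge(latest, on=["algo", "env"], suffixes=("_ref", "_new"))

# Per-row Welch's t-test from the summary statistics in each CSV.
_, merged["p_value"] = stats.ttest_ind_from_stats(
    merged["mean_ref"], merged["std_ref"], merged["count_ref"],
    merged["mean_new"], merged["std_new"], merged["count_new"],
    equal_var=False,
)
print(merged[["algo", "env", "mean_ref", "mean_new", "p_value"]])
```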

```python
summary = summary.reset_index()

# Table 2 (https://arxiv.org/pdf/2211.11972.pdf)
# todo: store results in this repo outside this file
```
Member

👍

```python
baseline = pd.DataFrame.from_records(
    [
        {
            "algo": "??exp_command=bc",
            # ... (rest of the baseline records elided)
    ],
)
baseline["count"] = 5
baseline["confidence_level"] = 0.95
```
Member

Do we want the confidence level stored in the DF? Is there a reason it should vary between algos?
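If it shouldn't vary, a minimal alternative (a sketch with hypothetical names) is to keep the confidence level as a single constant in the comparison script and derive intervals from each row's mean/std/count instead of storing it per row:

```python
import scipy.stats as stats

# Hypothetical constant replacing the per-row "confidence_level" column.
CONFIDENCE_LEVEL = 0.95

def confidence_interval(mean: float, std: float, count: int) -> tuple:
    """Two-sided Student-t confidence interval for a sample mean."""
    sem = std / count ** 0.5
    return stats.t.interval(CONFIDENCE_LEVEL, df=count - 1, loc=mean, scale=sem)
```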

tests/scripts/test_scripts.py
@@ -1075,3 +1076,42 @@ def test_convert_trajs_from_current_format_is_idempotent(

```python
    assert (
        filecmp.dircmp(converted_path, original_path).diff_files == []
    ), "convert_trajs not idempotent"


@pytest.mark.parametrize(
```
Member

Thanks for adding a test!

@codecov bot commented May 10, 2023

Codecov Report

Merging #657 (8ccbdb0) into master (20366b0) will increase coverage by 0.61%.
The diff coverage is 93.54%.

❗ Current head 8ccbdb0 differs from pull request most recent head ef3ac84. Consider uploading reports for the commit ef3ac84 to get more accurate results

```
@@            Coverage Diff             @@
##           master     #657      +/-   ##
==========================================
+ Coverage   95.67%   96.29%   +0.61%
==========================================
  Files         100       92       -8
  Lines        9581     8750     -831
==========================================
- Hits         9167     8426     -741
+ Misses        414      324      -90
```
| Files | Coverage | Δ |
|---|---|---|
| tests/scripts/test_scripts.py | 100.00% <100.00%> | (ø) |
| src/imitation/scripts/analyze.py | 91.86% <75.00%> | (+0.39%) ⬆️ |
| src/imitation/scripts/compare_to_baseline.py | 95.23% <95.23%> | (ø) |

... and 57 files with indirect coverage changes



```bash
python experiments/commands.py \
  --name <name> \
```
@ernestum (Collaborator) commented Aug 7, 2023

Should this be `--name <name>` or `--name=<name>`?
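For what it's worth, Python's argparse accepts both spellings for long options, so either form should behave the same if commands.py parses its arguments with argparse (an assumption on my part):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--name")

# Both spellings parse to the same result.
print(parser.parse_args(["--name", "my_run"]))  # Namespace(name='my_run')
print(parser.parse_args(["--name=my_run"]))     # Namespace(name='my_run')
```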
