Benchmark and replicate algorithm performance #388

Open
AdamGleave opened this issue Jan 19, 2022 · 8 comments
Comments

@AdamGleave
Member

Tune hyperparameters / match implementation details / fix bugs until we replicate the performance of reference implementations of algorithms. I'm not concerned about an exact match -- if we do about as well on average, doing better in some environments and worse in others, that seems OK.

Concretely, we should test BC, AIRL, GAIL, DRLHP, and DAgger on at least the seals versions of CartPole, MountainCar, HalfCheetah, and Hopper.

Baselines: paper results are the first port of call, but some are confounded by different environment versions, especially fixed vs. variable horizon. SB2 GAIL is a good sanity check. If need be, reference implementations of most of the other algorithms exist, but they can be hard to run.
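
A rough sketch of the kind of comparison this implies (assuming our runs land in a CSV with `algo`, `env`, `seed`, and `return` columns; the column names and reference entries below are placeholders, not real results):

```python
"""Sketch only: compare our mean returns against reference numbers per environment.

Assumes a results CSV with columns `algo`, `env`, `seed`, `return`; the
REFERENCE values are placeholders, not actual paper or SB2 GAIL numbers.
"""
import pandas as pd

# Placeholder reference returns, to be filled in from the papers / SB2 GAIL runs.
REFERENCE = {
    ("gail", "seals/HalfCheetah-v0"): float("nan"),
    ("airl", "seals/Hopper-v0"): float("nan"),
}


def compare(results_csv: str) -> pd.DataFrame:
    """Mean/std of our runs next to the reference value for each (algo, env)."""
    df = pd.read_csv(results_csv)
    summary = df.groupby(["algo", "env"])["return"].agg(["mean", "std"]).reset_index()
    summary["reference"] = [
        REFERENCE.get((row.algo, row.env), float("nan"))
        for row in summary.itertuples()
    ]
    # We care about being roughly on par on average, not an exact match.
    summary["ratio_to_reference"] = summary["mean"] / summary["reference"]
    return summary
```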

@AdamGleave
Member Author

One useful tool might be airspeed velocity (asv) to keep track of metrics over time.
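
For concreteness, a minimal sketch of what an asv `track_*` benchmark could look like (asv records the return value of any `track_*` function across commits; the helper below is a placeholder, not part of imitation):

```python
# benchmarks/track_returns.py -- sketch of an asv metric benchmark (hypothetical).
# asv records the return value of any `track_*` function across commits, so a
# function like this could chart mean episode return per algorithm/environment
# over time.


def _run_benchmark(algo: str, env_name: str) -> float:
    # Placeholder: in a real benchmark this would train `algo` on `env_name`
    # (e.g. via the imitation training scripts) and evaluate the policy.
    raise NotImplementedError


def track_bc_cartpole_return() -> float:
    return _run_benchmark(algo="bc", env_name="seals/CartPole-v0")


# asv supports a `unit` attribute on track benchmarks.
track_bc_cartpole_return.unit = "mean episode return"
```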

@Rocamonde
Member

  • Has there been any progress on this?
  • Is this still a blocker for v1?

@AdamGleave
Member Author

@taufeeque9 has been working on this, but it's a big enough task that it might make sense to split it up (e.g. you each take some subset of the algorithms and work together on building out a useful pipeline).

I think this is still a blocker for v1. At the least, we don't want to actively start promoting the code until we're confident the algorithms are all performing adequately.

@ernestum
Collaborator

ernestum commented Jan 17, 2023

I am trying to figure out the current state of the benchmarks.

What I could figure out on my own:

What other steps are required before we can close this issue?
Which of those steps could I help with?

@Rocamonde
Member

Rocamonde commented Jan 17, 2023

Things I think would be useful to do (but not necessarily what you should do):

In terms of benchmarking:

  • Not sure if the benchmarking runs we've had so far are satisfactory enough for @AdamGleave to want to go ahead with them for the JMLR submission. If so, perhaps we could add them to wandb and put a link in the repo so they're publicly available?

In terms of the codebase:

  • Try running the benchmark configs and make sure they run well; fix any issues as appropriate (we ran most but not all).
  • Allow users to run benchmarking scripts without having to clone imitation from GitHub, simply by passing the name of the benchmark (see the disclaimer in README.md).
  • Less important: clean up the utils.py file a little so it's more readable and useful to future users.

@AdamGleave
Member Author

@ernestum benchmarking/util.py cleans up the configs generated by automatic hyperparameter tuning; its output is the JSON config files in benchmarking/.
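
Roughly, that clean-up step amounts to something like the following sketch (the key names are hypothetical; the real logic is in benchmarking/util.py):

```python
"""Sketch only: turn a raw hyperparameter-tuning result into a reusable config.

The key names (`best_params`, `seed`, `log_dir`) are hypothetical; see
benchmarking/util.py for the actual logic.
"""
import json
from pathlib import Path


def write_benchmark_config(raw_tuning_result: dict, out_path: Path) -> None:
    # Keep only the tuned hyperparameters; drop run-specific metadata such as
    # seeds, log directories, and tuning bookkeeping.
    config = dict(raw_tuning_result.get("best_params", {}))
    config.pop("seed", None)
    config.pop("log_dir", None)
    out_path.write_text(json.dumps(config, indent=2, sort_keys=True))
```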

My understanding of the outstanding issues is:

  • We still don't have tuned configs that work well for preference comparisons. @taufeeque9 is working on this now. I think this is a must before the JMLR submission.
  • We don't have a summary table of results that's publicly visible (e.g. in README.md) and that we can automatically update. The functionality is almost there, though: experiments/commands.py can be used to run the benchmark, and we can extract a CSV of results from it using imitation.scripts.analyze (see the sketch below).
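
A rough sketch of the missing last step, turning that CSV into a Markdown table we could paste into README.md (the column names are assumptions about the CSV layout, not the actual output schema of imitation.scripts.analyze):

```python
"""Sketch only: render a results CSV as a Markdown table for README.md.

Column names (`algo`, `env`, `return`) are assumptions about the CSV layout,
not the actual output schema of imitation.scripts.analyze.
"""
import pandas as pd


def results_to_markdown(results_csv: str) -> str:
    df = pd.read_csv(results_csv)
    table = (
        df.groupby(["algo", "env"])["return"]
        .agg(["mean", "std"])
        .round(1)
        .reset_index()
    )
    # `to_markdown` needs the optional `tabulate` dependency.
    return table.to_markdown(index=False)


if __name__ == "__main__":
    print(results_to_markdown("benchmark_results.csv"))
```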

I don't think being able to run benchmarks without cloning the repo is that important; this is primarily something developers would want to run.

@ernestum
Collaborator

OK, to me it looks like the primary thing I could contribute here is making the execution of benchmarks a bit more future-proof:

  • When I first read utils.py, it seemed really obscure without the tacit knowledge of how the benchmarks work. As @Rocamonde already mentioned, I could clean it up and improve the documentation here. My outsider's perspective might help ensure that everything is documented explicitly.
  • That the combination of experiments/commands.py and imitation.scripts.analyze is required to generate the summary table is also easily forgotten, I think. Either formalizing this in code or adding it to the documentation (close to the final table) could help with this.

Is there interest in this, or is it a lower-priority thing?

@AdamGleave
Member Author

AdamGleave commented Jan 31, 2023

> Is there interest in this, or is it a lower-priority thing?

Cleaning up and documenting utils.py seems worthwhile.

Documenting how to generate the summary table is also worthwhile. I'm not that happy with the current workflow, though: imitation.scripts.analyze seems overly complex for what it does, so you might want to be on the lookout for ways to improve the table generation.

@ernestum added this to the Release v1 milestone on May 24, 2023