Explain how to run the entire benchmarking suite and how to compare a new algorithm to the benchmark runs.

ernestum committed Oct 20, 2023 · 1 parent 1058f9a · commit 15d4026
benchmarking/README.md
# Benchmarking imitation

The imitation library is benchmarked by running the algorithms BC, DAgger, AIRL, and GAIL
on five different environments from the
[seals environment suite](https://github.com/HumanCompatibleAI/seals),
each with 10 different random seeds.

## Running a Single Benchmark

To run a single benchmark from the command line, you may use:

```bash
python -m imitation.scripts.<train_script> <algo> with <algo>_<env>
```

There are two different `train_scripts`, `train_imitation` and `train_adversarial`, each running different algorithms:

| train_script | algo |
|-------------------|------------|
| train_imitation | bc, dagger |
| train_adversarial | gail, airl |

There are five environment configurations for which we have tuned hyperparameters:

| environment        |
|--------------------|
| seals_ant          |
| seals_half_cheetah |
| seals_hopper       |
| seals_swimmer      |
| seals_walker       |
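
For example, combining the two tables above, training GAIL on the tuned `seals_half_cheetah` configuration looks like this:

```bash
python -m imitation.scripts.train_adversarial gail with gail_seals_half_cheetah
```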


If you want to run the same benchmark from a Python script, you can use the following code:

```python
...
from imitation.scripts.<train_script> import <train_script>_ex
<train_script>_ex.run(command_name="<algo>", named_configs=["<algo>_<env>"])
```
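
For instance, filling in the placeholders with the GAIL/`seals_half_cheetah` configuration from the tables above gives:

```python
from imitation.scripts.train_adversarial import train_adversarial_ex

train_adversarial_ex.run(command_name="gail", named_configs=["gail_seals_half_cheetah"])
```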

### Inputs

The tuned hyperparameters can be found in `src/imitation/scripts/config/tuned_hps`.
For v0.4.0, they correspond to the hyperparameters used in the paper
[imitation: Clean Imitation Learning Implementations](https://arxiv.org/abs/2211.11972).
You may be able to get reasonable performance by using hyperparameters tuned for a similar environment.

The experts and expert demonstrations are loaded from the HuggingFace model hub and
are grouped under the [HumanCompatibleAI Organization](https://huggingface.co/HumanCompatibleAI).

### Outputs

The training scripts are [sacred experiments](https://sacred.readthedocs.io), which place
their output in an output folder structured like this:

```
output
├── airl
│   └── seals-Swimmer-v1
│       └── 20231012_121226_c5c0e4
│           └── sacred -> ../../../sacred/train_adversarial/2
├── dagger
│   └── seals-CartPole-v0
│       └── 20230927_095917_c29dc2
│           └── sacred -> ../../../sacred/train_imitation/1
└── sacred
    ├── train_adversarial
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── ...
    │   └── _sources
    └── train_imitation
        ├── 1
        └── _sources
```

In the `sacred` folder, all runs are grouped by training script, and each run gets a
folder named after its run id.
That run folder contains:
- a `config.json` file with the hyperparameters used for that run
- a `run.json` file with run information, including the final score and the expert score
- a `cout.txt` file with the stdout of the run

Additionally, there are run folders grouped by algorithm and environment.
They contain further log files and model checkpoints as well as a symlink to the
corresponding sacred run folder.

Important entries in the JSON files (see the sketch below for reading them programmatically) are:
- `run.json`
  - `command`: the name of the algorithm
  - `result.imit_stats.monitor_return_mean`: the score of a run
  - `result.expert_stats.monitor_return_mean`: the score of the expert policy that was used for the run
- `config.json`
  - `environment.gym_id`: the environment name of the run
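
As an illustration, here is a minimal Python sketch (not part of the benchmarking scripts; the helper name `iter_run_stats` is made up) that walks an output folder and prints these entries for every run it finds:

```python
import json
from pathlib import Path


def iter_run_stats(output_dir: str):
    """Yield (algo, env, score, expert_score) for every sacred run below output_dir."""
    for run_file in Path(output_dir).rglob("run.json"):
        config_file = run_file.parent / "config.json"
        if not config_file.exists():
            continue  # not a complete sacred run folder
        run = json.loads(run_file.read_text())
        config = json.loads(config_file.read_text())
        result = run.get("result") or {}  # may be empty for failed runs
        yield (
            run["command"],  # algorithm name
            config["environment"]["gym_id"],  # environment name
            result.get("imit_stats", {}).get("monitor_return_mean"),  # run score
            result.get("expert_stats", {}).get("monitor_return_mean"),  # expert score
        )


for algo, env, score, expert_score in iter_run_stats("output/sacred"):
    print(f"{algo} on {env}: {score} (expert: {expert_score})")
```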

## Running the Complete Benchmark Suite

To execute the entire benchmarking suite with 10 seeds for each configuration,
you can use the `run_all_benchmarks.sh` script.
This script runs all configurations consecutively.
To speed up the process, consider parallelizing the runs:
you can either send all commands to GNU Parallel (see the example below),
use SLURM by invoking `run_all_benchmarks_on_slurm.sh`,
or split the lines across multiple scripts and run them on multiple machines manually.
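
For example, assuming `run_all_benchmarks.sh` contains one command per line, you could run eight of them at a time with GNU Parallel:

```shell
# each input line is executed as a command; -j 8 runs up to 8 of them concurrently
parallel -j 8 < run_all_benchmarks.sh
```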

### Generating Benchmark Summaries

There are scripts to summarize all runs in a folder as either a CSV file or a markdown file.
For the CSV, run:

```shell
python sacred_output_to_csv.py output/sacred > summary.csv
```

This generates a CSV file like this:

```csv
algo, env, score, expert_score
gail, seals/Walker2d-v1, 2298.883520464286, 2502.8930135576925
gail, seals/Swimmer-v1, 287.33667667857145, 295.40472964423077
airl, seals/Walker2d-v1, 310.4065185178571, 2502.8930135576925
...
```
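
The CSV can then be post-processed with standard tools. For example, a small pandas snippet (a hypothetical sketch; note `skipinitialspace=True` to cope with the spaces after the commas) that averages the scores per algorithm and environment:

```python
import pandas as pd

# load the summary produced by sacred_output_to_csv.py
df = pd.read_csv("summary.csv", skipinitialspace=True)

# mean score and expert score per (algo, env) pair across seeds
print(df.groupby(["algo", "env"])[["score", "expert_score"]].mean())
```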

For a more comprehensive summary that includes aggregate statistics such as
mean, standard deviation, and IQM (interquartile mean) with confidence intervals,
as recommended by the [rliable library](https://github.com/google-research/rliable), use the following command:

```shell
python sacred_output_to_markdown_summary.py output/sacred --output summary.md
```

This will produce a markdown summary file named `summary.md`.



**Hint:**
If you have multiple output folders because you ran different parts of the
benchmark on different machines, you can copy the output folders into a common root
folder.
The above scripts search all nested directories for folders that contain both
a `run.json` and a `config.json` file.
For example, calling `python sacred_output_to_csv.py benchmark_runs/ > summary.csv`
on an output folder structured like this:
```
benchmark_runs
├── first_batch
│   ├── 1
│   ├── 2
│   ├── 3
│   └── ...
└── second_batch
    ├── 1
    ├── 2
    ├── 3
    └── ...
```
will aggregate all runs from both `first_batch` and `second_batch` into a single
CSV file.

## Comparing an Algorithm against the Benchmark Runs

If you modified one of the existing algorithms or implemented a new one, you might want
to compare it to the benchmark runs to see whether it yields a significant improvement.

If your algorithm writes its output in the same format as described above, you can use the
`compute_probability_of_improvement.py` script to do the comparison.
It uses the "Probability of Improvement" metric recommended by the
[rliable library](https://github.com/google-research/rliable).

```shell
python compute_probability_of_improvement.py <your_runs_dir> <baseline_runs_dir> --baseline-algo <algo>
```

where:
- `your_runs_dir` is the directory containing the runs for your algorithm
- `baseline_runs_dir` is the directory containing runs for a known algorithm. Hint: you do not need to re-run our benchmarks. We provide our run folders as release artifacts.
- `algo` is the algorithm you want to compare against

If `your_runs_dir` contains runs for more than one algorithm, you will have to
disambiguate using the `--algo` option.
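
For example, if your runs live in a (hypothetical) `output/my_runs` folder and you downloaded our GAIL baseline runs into `baseline_runs/`, the comparison could look like this:

```shell
python compute_probability_of_improvement.py output/my_runs baseline_runs --baseline-algo gail

# if output/my_runs contains runs for several algorithms, pick one explicitly
# (replace my_algo with the algorithm name recorded in your runs):
python compute_probability_of_improvement.py output/my_runs baseline_runs --baseline-algo gail --algo my_algo
```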

# Tuning Hyperparameters

The hyperparameters of any algorithm in imitation can be tuned using `src/imitation/scripts/tuning.py`.
