A collection of "easy wins" to make machine learning in research reproducible.
This book aims to provide easy ways to increase the quality of scientific contributions that use machine learning methods. The reproducible aspect will make it easy for fellow researchers to use and iterate on a publication, increasing citations of published work. The use of appropriate validation techniques and increase in code quality accelerates the review process during publication and avoids possible rejection due to deficiencies in the methodology. Making models, code and possibly data available increases the visibility of work and enables easier collaboration on future work.
This book focuses on basics that work, getting you 90% of the way to top-tier reproducibility.
Every scientific conference has seen a massive uptick in applications that use some type of machine learning. Whether it's a linear regression using scikit-learn, a transformer from Hugging Face, or a custom convolutional neural network in Jax, the breadth of applications is matched only by the variation in the quality of contributions.
Making machine learning applications reproducible has an outsized impact compared to the limited additional effort required when using existing Python libraries.
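One such easy win is fixing random seeds so that results can be regenerated exactly. As an illustrative sketch (the seed value and library choice here are examples, not prescriptions from this book):

```python
import numpy as np

# Fixing the seed makes every run of an analysis produce identical
# "random" numbers, so figures and metrics can be regenerated exactly.
rng = np.random.default_rng(seed=42)
sample_a = rng.normal(size=5)

# A fresh generator with the same seed yields the same draw.
rng_again = np.random.default_rng(seed=42)
sample_b = rng_again.normal(size=5)

print(np.allclose(sample_a, sample_b))  # identical draws
```

The same idea applies to `random_state` arguments in scikit-learn estimators and data splitters.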
This tutorial uses the Palmer Penguins dataset.
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Artwork by @allison_horst
If you'd like to develop and/or build the *Increase citations, ease review & collaboration* book, you should:

- Clone this repository
- Run `pip install -r requirements.txt` (it is recommended you do this within a virtual environment)
- (Optional) Edit the book's source files located in the `book/` directory
- (Optional) Jupytext syncs the content between `python_scripts` and `book/notebooks` to enable diffs
- Run `jupyter-book clean book/` to remove any existing builds
- Run `jupyter-book build book/`

A fully-rendered HTML version of the book will be built in `book/_build/html/`.
This repo uses Jupytext (see the Jupytext documentation).

To synchronize the notebooks and the Python scripts (based on timestamps; only the content of input cells is modified in the notebooks), run:

```shell
$ jupytext --sync notebooks/*.ipynb
```

or simply use:

```shell
$ make sync
```

The idea and implementation for using jupytext were copied from the Euroscipy 2019 scikit-learn tutorial. Thanks for the great work!
If you create a new notebook, you need to set up the text files it is going to be paired with:

```shell
$ jupytext --set-formats notebooks//ipynb,python_scripts//auto:percent notebooks/*.ipynb
```

or simply use:

```shell
$ make format
```

To render all the notebooks (from time to time; slow to run):

```shell
$ make render
```
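For reference, a script paired in the percent format marks each notebook cell with a `# %%` comment, which is what makes plain-text diffs readable. A minimal sketch of what such a paired script looks like (the cell contents and values here are invented for illustration, not taken from this repo):

```python
# %% [markdown]
# # Example analysis
# Each `# %%` marker below corresponds to one notebook cell.

# %%
# A code cell: jupytext keeps this in sync with the paired .ipynb file.
bill_lengths_mm = [39.1, 39.5, 40.3]
mean_length = sum(bill_lengths_mm) / len(bill_lengths_mm)
print(mean_length)
```

Because the file is plain Python, it runs unchanged outside Jupyter and produces clean line-based diffs under version control.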
Please see the Jupyter Book documentation to discover options for deploying a book online using services such as GitHub, GitLab, or Netlify.
For GitHub and GitLab deployment specifically, cookiecutter-jupyter-book includes templates for, and information about, optional continuous integration (CI) workflow files that help deploy books online automatically with GitHub or GitLab. For example, if you chose `github` for the `include_ci` cookiecutter option, your book template was created with a GitHub Actions workflow file that, once pushed to GitHub, automatically renders your book and pushes it to the `gh-pages` branch of your repo, hosting it on GitHub Pages whenever a push or pull request is made to the main branch.
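The workflow file generated by the template will differ in detail, but a minimal GitHub Actions sketch of this render-and-deploy pattern might look like the following (the action versions, Python version, and requirements file name are assumptions, not taken from the template):

```yaml
name: deploy-book
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install dependencies and render the book to static HTML
      - run: pip install -r requirements.txt
      - run: jupyter-book build book/
      # Publish the rendered HTML to the gh-pages branch
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: book/_build/html
```

Consult the workflow file actually generated in your repository for the authoritative configuration.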
We welcome and recognize all contributions. You can see a list of current contributors in the contributors tab.
This project is created using the excellent open source Jupyter Book project and the executablebooks/cookiecutter-jupyter-book template. Notebooks are synced with scripts using jupytext for version control.