ML Reproducibility Tools and Best Practices
Reproducibility, meaning obtaining results similar to those presented in a paper using the same code and data, is necessary to verify the reliability of research findings. It is also an important step in promoting open and accessible research, allowing the scientific community to quickly integrate new findings and convert ideas to practice. Finally, it promotes the use of robust experimental workflows, which can reduce unintentional errors.
In this blog post, we will share commonly used tools and explain 12 basic practices that you can use in your research to ensure reproducible science.
Updated: 21st December, 2020
Practice | Tools
Config Management | Hydra, OmegaConf, Pytorch Lightning
Checkpoint Management | Pytorch Lightning, TestTube
Logging | Tensorboard, Comet.ML, Weights & Biases, MLFlow, Visdom, Neptune
Setting the seed | Check best practices below
 | Pytorch Lightning, MLFlow, Determined.AI
Version Control | Github, Gitlab, Replicate.AI
Data Management | DVC, CML, Replicate.AI
Data Analysis | Jupyter Notebook, papermill, JupyterLab, Google Colab
Reporting Results | Matplotlib, Seaborn, Pandas, Overleaf
Managing Dependencies | pip, conda, Poetry, Docker, Singularity, repo2docker
Open Source Release | Squash Commits, Binder
Effective Communication | ML Code Completeness Checklist, ML Reproducibility Checklist
Test and Validate | AWS, GCP, CodeOcean
1. Config Management
When you begin implementing your research code, the first step is usually to write an argument parser that defines the set of parameters your code expects. This set of hyperparameters can typically look like this:
python train.py --hidden_dim 100 --batch_size 32 --num_tasks 10 --dropout 0.2 --with_mask --log_interval 100 --learning_rate 0.001 --optimizer sgd --scheduler plateau --scheduler_gamma 0.9 --weight_decay 0.9
These sets of arguments typically grow over time in your research project, making maintenance and reproducibility a pain. In your code, you should be careful to log all hyperparameters for all experiments, so that you can replicate results from an old version of your code. Pytorch Lightning provides a great way to log all hyperparameters in files in the experiment output folder, allowing for better reproducibility.
An alternative to using a long list of argparse elements is to use config files. Config files can be either in JSON or YAML format (I prefer YAML due to the ability to add comments), where you can set your hyperparams in a logically nested way. The above set of hyperparams could be organized as:
general: # for generic args
optim: # for optimizer args
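As a rough illustration of the nested layout, the sketch below builds such a config as a plain Python dictionary, writes it to disk, and reads it back. The key names come from the command line shown earlier, but how they are grouped under general and optim is an assumption, as are the helper names and the file name config.json. JSON is used only so that the example runs on the standard library; a YAML file read with a YAML parser would be handled the same way.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical grouping of the command-line arguments shown earlier.
CONFIG = {
    "general": {  # generic args
        "hidden_dim": 100,
        "batch_size": 32,
        "num_tasks": 10,
        "dropout": 0.2,
        "with_mask": True,
        "log_interval": 100,
    },
    "optim": {  # optimizer args
        "learning_rate": 0.001,
        "optimizer": "sgd",
        "scheduler": "plateau",
        "scheduler_gamma": 0.9,
        "weight_decay": 0.9,
    },
}


def save_config(config: dict, path: Path) -> None:
    """Write the config next to the experiment outputs."""
    path.write_text(json.dumps(config, indent=2, sort_keys=True))


def load_config(path: Path) -> dict:
    """Read a config file back into a nested dictionary."""
    return json.loads(path.read_text())


def flatten(config: dict, prefix: str = "") -> dict:
    """Turn {'optim': {'optimizer': 'sgd'}} into {'optim.optimizer': 'sgd'} for logging."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "config.json"
        save_config(CONFIG, config_path)
        restored = load_config(config_path)
        print(flatten(restored)["optim.learning_rate"])
```

Flattening the nested keys into dotted names is one convenient way to pass the whole config to a logger that expects a flat set of hyperparameters.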
2. Checkpoint Management
Managing your model checkpoints is very important in terms of reproducibility, as it allows you to release trained models for the community to easily verify your work, as well as build upon it. Ideally, you would save checkpoints as frequently as possible, but given limits on system resources such as disk space, this is usually not feasible. A practical compromise is to save the last checkpoint along with the checkpoint of the best model so far (according to your evaluation metrics). Pytorch Lightning provides an in-built solution to do this efficiently.
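The "last plus best" policy can be sketched without any particular framework. The class below is a hypothetical helper, not Pytorch Lightning's API: it pickles an arbitrary state object, always overwrites last.pkl, and overwrites best.pkl only when the monitored metric improves. The file names and the default assumption that a higher metric is better are illustrative choices.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class CheckpointKeeper:
    """Keep only the most recent checkpoint and the best one seen so far."""

    def __init__(self, directory: Path, higher_is_better: bool = True) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.higher_is_better = higher_is_better
        self.best_metric: Optional[float] = None

    def _improved(self, metric: float) -> bool:
        if self.best_metric is None:
            return True
        if self.higher_is_better:
            return metric > self.best_metric
        return metric < self.best_metric

    def save(self, state: Any, metric: float) -> bool:
        """Write last.pkl every time; write best.pkl only on improvement."""
        payload = {"state": state, "metric": metric}
        (self.directory / "last.pkl").write_bytes(pickle.dumps(payload))
        if self._improved(metric):
            self.best_metric = metric
            (self.directory / "best.pkl").write_bytes(pickle.dumps(payload))
            return True
        return False

    def load(self, name: str) -> Any:
        """Load the 'last' or 'best' checkpoint."""
        return pickle.loads((self.directory / f"{name}.pkl").read_bytes())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        keeper = CheckpointKeeper(Path(tmp))
        # Stand-in for a training loop: (epoch, validation accuracy) pairs.
        for epoch, accuracy in enumerate([0.61, 0.74, 0.70]):
            keeper.save({"epoch": epoch}, accuracy)
        print(keeper.load("best")["state"], keeper.load("last")["state"])
```

Disk usage stays constant at two files regardless of how long training runs, which is the point of the policy.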
3. Logging
When training your model, you may find that for several parameter settings it is not giving you the ideal performance. You then want to check several things. Is the training loss of the model saturating? Is it still going down? What does the validation performance look like over the course of training? You need to log all the metrics efficiently, and later plot those metrics in nice shiny plots for analysis and inspection.
Logging is also important for reproducibility, so researchers can verify the training progression of their replications in great detail.
In the bare-bones setup, you could just log all metrics to the filesystem and then plot them by loading the logs in a Python script using matplotlib. To make this process easy and also to provide live, interactive plots, several services are now available which you can leverage in your work. Tensorboard, for example, is popular in the ML community primarily for its early adoption and ability to deploy locally. Newer entrants, like Comet.ML, WandB or MLFlow, provide exciting features such as shareable online logging interfaces and fine-grained monitoring of experiments and hyperparams. In a future blog post, we will discuss the pros and cons of these systems.
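For the bare-bones setup, a metric log can be as simple as a CSV file with one row per logged value. The sketch below is one way to do it with the standard library; the column names and the log_metrics and read_metrics helpers are hypothetical, and the numbers are placeholders.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

FIELDS = ["step", "split", "metric", "value"]


def log_metrics(path: Path, step: int, split: str, metrics: Dict[str, float]) -> None:
    """Append one row per metric, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for name, value in metrics.items():
            writer.writerow({"step": step, "split": split, "metric": name, "value": value})


def read_metrics(path: Path, split: str, metric: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for one curve, ready for plotting."""
    with path.open(newline="") as handle:
        return [
            (int(row["step"]), float(row["value"]))
            for row in csv.DictReader(handle)
            if row["split"] == split and row["metric"] == metric
        ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "metrics.csv"
        log_metrics(log_path, step=100, split="train", metrics={"loss": 0.92})
        log_metrics(log_path, step=100, split="valid", metrics={"loss": 1.05, "accuracy": 0.58})
        log_metrics(log_path, step=200, split="train", metrics={"loss": 0.71})
        print(read_metrics(log_path, "train", "loss"))
```

The (step, value) pairs returned by read_metrics can be passed straight to a plotting call in a separate analysis script.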
4. Setting the seed
Probably the most important aspect of the exact reproducibility of your research is the seed of the experiment. Although exact reproducibility is not guaranteed, especially in GPU execution environments [2, 8], it’s still beneficial to report the seed due to its impact on your results.
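When you begin your experiments, it is suggested to first set the seed. Assuming you use PyTorch, seed Python's random module, NumPy and PyTorch (for example with random.seed(seed), np.random.seed(seed) and torch.manual_seed(seed)), and then apply settings like these: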
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)
Do not optimize the seed like a hyperparameter. If your algorithm only works on a range of seeds, it’s not a robust contribution.
Reporting the performance of your model on multiple seeds captures the variance of the proposed model. Before beginning your experiments, randomly draw \(n\) seeds and set them aside in your config file, and report all experimental results aggregated over those \(n\) seeds. \(n=5\) is a good starting point, but you can always increase this number.
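One way to follow this advice is to draw the seeds once, store them, and loop over them for every experiment. In the sketch below, draw_seeds, aggregate, run_experiment and the master seed value are all hypothetical; run_experiment merely stands in for a real training run and returns a made-up score.

```python
import random
import statistics
from typing import Callable, List, Tuple


def draw_seeds(n: int, master_seed: int) -> List[int]:
    """Draw n distinct seeds once, up front, so they can be stored in the config."""
    rng = random.Random(master_seed)
    return rng.sample(range(2**31), n)


def aggregate(run: Callable[[int], float], seeds: List[int]) -> Tuple[float, float]:
    """Run the experiment once per seed and return (mean, sample standard deviation)."""
    scores = [run(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


def run_experiment(seed: int) -> float:
    """Placeholder for a training run: returns a fake score that depends on the seed."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


if __name__ == "__main__":
    seeds = draw_seeds(n=5, master_seed=2020)
    mean, std = aggregate(run_experiment, seeds)
    print(f"seeds={seeds}")
    print(f"score: {mean:.3f} +/- {std:.3f} over {len(seeds)} seeds")
```

Because the seeds are drawn before any results are seen and then fixed, there is no opportunity to pick seeds that happen to favor the proposed model.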
5. Version Control
To track your research effectively, we highly recommend setting up version control using git in your repository from the get-go. You can use a service like Github or Gitlab as your hosting provider.
Use git commits to explain to your future self (and your collaborators) what change you made to your experiment at a given time. Ideally, you should always commit before you run an experiment, so that you can tag the results with specific commits. Be as detailed with your commit messages as you can: your future self will thank you!
6. Data Management
Managing your data is extremely important for reproducibility, especially when you propose a new dataset or a new dataset split. In your many rounds of experiments, you would probably work with different splits of the data, hence tracking all those changes should have similar priority as tracking your code.
The easiest way to track your data is to add it to the git version system or use cloud storage solutions such as Google Drive or AWS S3 to store your datasets.
For large datasets, you can also use git-lfs, or maintain an MD5 hash of the dataset in your config file, like this:
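import hashlib
from pathlib import Path
from typing import Union

def md5_update_from_dir(directory: Union[str, Path], hash: "hashlib._Hash") -> "hashlib._Hash":
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        if path.is_file():
            hash = md5_update_from_file(path, hash)  # file-level helper, not shown here
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash

def md5_dir(directory: Union[str, Path]) -> str:
    return str(md5_update_from_dir(directory, hashlib.md5()).hexdigest())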
Having such a hash will allow you to track which dataset or data split you were working on at a certain commit.
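The directory hashing code above calls md5_update_from_file, which is not shown. A minimal sketch of such a helper, assuming files are read in fixed-size chunks so that large datasets need not fit in memory, could look like this. The chunk size, the md5_file wrapper and the matches_config check are illustrative additions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Union


def md5_update_from_file(filename: Union[str, Path], hash):
    """Feed a file's bytes into an existing hashlib object, chunk by chunk."""
    with open(filename, "rb") as handle:
        for chunk in iter(lambda: handle.read(4096), b""):
            hash.update(chunk)
    return hash


def md5_file(filename: Union[str, Path]) -> str:
    """Hex digest of a single file."""
    return md5_update_from_file(filename, hashlib.md5()).hexdigest()


def matches_config(filename: Union[str, Path], expected_md5: str) -> bool:
    """Check a data file against the hash recorded in the config."""
    return md5_file(filename) == expected_md5


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = Path(tmp) / "train.txt"
        data_path.write_bytes(b"example data split\n")
        recorded = md5_file(data_path)  # store this value in the config
        print(matches_config(data_path, recorded))
        data_path.write_bytes(b"a different split\n")
        print(matches_config(data_path, recorded))
```

Checking the recorded hash at the start of every run catches the case where a data file was silently replaced or re-split between experiments.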
7. Data Analysis
Keeping track of the analysis you perform on the data/results is also very important in terms of the reproducibility of your contribution. Jupyter Notebooks are the standard for maintaining all your analysis and plotting functions in one place. Ideally, you should keep separate notebooks for data analysis, result analysis, plot generation, and table generation, and add them to your version control. Pandas' to_latex allows you to directly write your results as a LaTeX table, removing error-prone copying of results into LaTeX.
When you need to update the results in your paper, you can just access the corresponding file and re-run the cells. You can also parameterize and run notebooks with the papermill API so that your notebooks are cleanly executed with your desired analysis parameters.
8. Reporting Results
When reporting your results, it is ideal to run your experiments with different seeds and/or on different datasets. Thus, your results should contain plots with error bars and tables with standard deviations. You should also describe how the descriptive statistics were calculated, e.g. mean reward over multiple seeds. Statistical testing and highlighting statistically significant values is also encouraged. This information provides a more realistic assessment of the performance of a model and avoids the sharing of overly optimistic results [4,5,6,7].
A higher bar of reproducibility is to report the results on multiple datasets to highlight the robustness of your model. Even if the model has larger variance over different datasets, it’s still encouraged to report them all, so that these issues are not discovered only later on.
While reporting your results, consult the ML Reproducibility Checklist, which has detailed guidelines on the best practices for reporting figures and tables.
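As a small illustration of reporting aggregated results, the sketch below turns per-seed scores into a mean and sample standard deviation and prints them as a LaTeX table. The model names and scores are made-up placeholders, and the summarize and latex_table helpers are hypothetical.

```python
import statistics
from typing import Dict, List


def summarize(scores: List[float]) -> str:
    """Format scores from several seeds as 'mean $\\pm$ std' (sample standard deviation)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.2f} $\\pm$ {std:.2f}"


def latex_table(results: Dict[str, List[float]], metric: str) -> str:
    """Build a two-column LaTeX tabular with one row per model."""
    lines = [
        "\\begin{tabular}{lc}",
        "\\hline",
        f"Model & {metric} \\\\",
        "\\hline",
    ]
    for model, scores in results.items():
        lines.append(f"{model} & {summarize(scores)} \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder numbers: accuracy of two hypothetical models over five seeds.
    results = {
        "Baseline": [71.0, 72.0, 73.0, 72.0, 72.0],
        "Proposed": [74.0, 76.0, 75.0, 75.0, 75.0],
    }
    print(latex_table(results, metric="Accuracy (mean $\\pm$ std, 5 seeds)"))
```

Putting the aggregation rule and the number of seeds in the column header keeps the description of how the statistics were computed attached to the numbers themselves.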
9. Managing Dependencies
Irreproducibility often stems from software deprecation. To replicate a published work, the first thing to do is to recreate the development environment, containing the same libraries that the program expects. Thus, it is crucial to document the libraries and their versions that you use in your experiments. After your experiments are stable, you can use pip or conda to collect all libraries that have been used:
$ pip freeze > requirements.txt
$ conda env export > environment.yml
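The installed package versions can also be captured from inside Python at the start of a run and saved with the experiment outputs. The sketch below uses the standard library's importlib.metadata (Python 3.8+); the snapshot format and the function name are assumptions, and this complements rather than replaces the commands above.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict


def environment_snapshot() -> Dict[str, object]:
    """Collect the interpreter version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda item: item[0].lower())),
    }


if __name__ == "__main__":
    snapshot = environment_snapshot()
    # In a real run this would be written into the experiment output folder.
    print(json.dumps(snapshot, indent=2)[:400])
```

Unlike a requirements file exported once after the fact, a per-run snapshot records the environment that actually produced each result.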
You can also leverage container platforms such as Docker or Singularity to provide the exact dev environment used for the experiments. Singularity, in particular, is supported on many HPC systems (such as Compute Canada), where it can be used to train and then subsequently release your experiments to the public. You can also convert your existing repository into a Docker environment using repo2docker.
10. Open Source Release
After you have published your paper, consider open sourcing your experiments. This not only encourages reproducible research but also adds more visibility to your paper. Once you release your code, consider adding it to Papers With Code for added visibility. You can also release a demo on Binder or Colab to encourage people to use your model.
For good examples of model demos check out .
Before releasing your code, check the following:
- Squash the commits in the public branch (master) into a single commit
  - This helps remove your private experiment commit messages (and the awkward comments!)
- Make sure your code does not contain any API keys (for loggers such as WandB or Comet.ML)
- Keep an eye out for hardcoded file paths
- Improve readability of your code using formatters such as Black. Obscure, poorly written codebases, even when they run, are oftentimes impossible to reuse or build on top of
- Document your functions and classes appropriately. In ML, it’s beneficial to the reader if you annotate your code with input and output tensor dimensions.
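Some of these checks can be partly automated. The sketch below scans Python source files for lines that look like hardcoded API keys or absolute file paths. The regular expressions are rough, assumed heuristics that will produce both false positives and misses, so treat the output as a prompt for manual review rather than a guarantee.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

# Heuristic patterns (assumptions, not an exhaustive list).
PATTERNS = {
    "possible API key": re.compile(
        r"(api[_-]?key|secret|token)\s*=\s*['\"][^'\"]{8,}['\"]", re.IGNORECASE
    ),
    "hardcoded path": re.compile(r"['\"](/home/|/Users/|[A-Za-z]:\\)[^'\"]*['\"]"),
}


def scan_file(path: Path) -> List[Tuple[int, str, str]]:
    """Return (line number, issue, line text) for every suspicious line in one file."""
    findings = []
    for number, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        for issue, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, issue, line.strip()))
    return findings


def scan_repo(root: Path) -> List[Tuple[Path, int, str, str]]:
    """Scan every .py file under root."""
    results = []
    for path in sorted(Path(root).rglob("*.py")):
        results.extend((path, *finding) for finding in scan_file(path))
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "train.py"
        sample.write_text(
            'API_KEY = "abcd1234efgh5678"\n'
            'data_dir = "/home/alice/data/split_a"\n'
            "learning_rate = 0.001\n"
        )
        for path, number, issue, text in scan_repo(Path(tmp)):
            print(f"{path.name}:{number}: {issue}: {text}")
```

Note that squashing commits does not help if a key is still present in the final tree, so a scan like this is worth running on the exact state you are about to publish.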
11. Effective Communication
When releasing your code, try to add as much information about the code as possible to the README file. Papers With Code released the ML Code Completeness Checklist, which suggests adding the following to your README:
- Dependency information
- Training scripts
- Evaluation scripts
- Pre-trained models
- Results
Papers With Code evaluated repositories released after NeurIPS 2019 and found that repositories that do not address any of the above only got a median of 1.5 Github stars, whereas repositories that meet all five of the above criteria got a median of 196.5 stars! Only 9% of the repositories fulfilled all five points, so we can definitely do better at communicating our research. The better the communication, the better it is in terms of reproducibility.
You should always clearly mention the source of the dataset used in the work. If you are releasing a new dataset or pretrained model for the community, consider adding proper documentation for easy access, such as a datasheet or model card. These are READMEs for the dataset or model that contain:
- Collection Process
- Use cases
12. Test and Validate
Finally, it’s important from the reproducibility perspective to test your implementation in a different environment than the training setup. This testing doesn’t necessarily mean you have to re-train the entire pipeline. Specifically, you should make sure that the training and evaluation scripts are running in the test environment.
To get an isolated test environment, you can use AWS or GCP cloud instances. You can also check out CodeOcean, which provides isolated AWS instances tied to Jupyter Notebooks for easy evaluation.
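A smoke test for this step can be as small as invoking each script in the fresh environment and checking that it exits cleanly. In the sketch below, the script and its arguments are hypothetical stand-ins; a real test would point at your own training and evaluation entry points, ideally with a tiny configuration so the run finishes quickly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def runs_cleanly(script: Path, args: List[str], timeout_s: float = 60.0) -> bool:
    """Run a script with the current interpreter and report whether it exited with code 0."""
    try:
        completed = subprocess.run(
            [sys.executable, str(script), *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    if completed.returncode != 0:
        print(completed.stderr, file=sys.stderr)  # surface the failure for debugging
    return completed.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a real training script.
        script = Path(tmp) / "train.py"
        script.write_text("import sys\nprint('training with', sys.argv[1:])\n")
        print(runs_cleanly(script, ["--batch_size", "2"]))
```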
Reproducibility is hard. Maintaining a reproducible research codebase is harder when the incentive is to publish your ideas quicker than your competitor. Nevertheless, we agree with what Joelle Pineau said at NeurIPS 2018: “Science is not a competitive sport”. We need to invest more time and care in our research, and as computer scientists we need to ensure our work is reproducible, so that it adds value to the readers and practitioners who would build upon it.
We hope this post will be useful in your research. Feel free to comment if you have any particular points or libraries that we missed; we would be happy to add them.
Many thanks to Joelle Pineau for encouraging us to write this draft and helping formulate the best practices. Thanks to Shagun Sodhani, Matthew Muckley and Michela Paganini for providing feedback on the draft. Thanks to Deep Learning for Science School for inviting Koustuv to speak about reproducibility in August 2020, for which this blog post is a point of reference.
[1] Rule A, Birmingham A, Zuniga C, Altintas I, Huang SC, Knight R, Moshiri N, Nguyen MH, Rosenthal SB, Pérez F, Rose PW. Ten simple rules for reproducible research in Jupyter notebooks. arXiv preprint arXiv:1810.08055. 2018 Oct 13.
[2] Nvidia CUDNN Developer Guides
[3] Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Daumé III H, Crawford K. Datasheets for datasets. arXiv preprint arXiv:1803.09010. 2018 Mar 23.
[4] Bouthillier X, Laurent C, Vincent P. Unreproducible research is reproducible. In International Conference on Machine Learning 2019 May 24 (pp. 725-734).
[5] Lucic M, Kurach K, Michalski M, Gelly S, Bousquet O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems 2018 (pp. 700-709).
[6] Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence 2018 Apr 29.
[7] Raff E. A Step Toward Quantifying Independently Reproducible Machine Learning Research. In Advances in Neural Information Processing Systems 2019 (pp. 5485-5495).
[8] Pytorch note on reproducibility
[9] Forde JZ, Paganini M. The Scientific Method in the Science of Machine Learning. In ICLR Debugging Machine Learning Models Workshop 2019.
[10] Mordvintsev A, Pezzotti N, Schubert L, Olah C. Differentiable Image Parameterizations. Distill 2018.
[11] Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). Association for Computing Machinery, New York, NY, USA, 220–229.