Chapter 4 Reproducible

We want our code to be reproducible so that:

  • it can be used by others (both for collaboration and to allow effective review and accountability);
  • it keeps working over time (protected from external changes);
  • it can be easily reused by others in their own projects.

There are a number of steps that we can take to ensure that our code is as reproducible as possible.

4.1 Manage project dependencies

Your project will depend on a number of external factors, such as software or packages. Because of these dependencies, your project may not work on other people’s machines, or may stop working on your own machine at a later date (e.g. as external packages are updated over time). To prevent this, you should use a dependency management tool.

Dependency management tools

Language    Tools
R           We recommend using Conda. Alternatives are Packrat and renv.
Python      We recommend using Conda.
JavaScript  Include third-party library dependencies in the project as .js files.
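
To illustrate why exact versions matter, the following is a minimal Python sketch that records which version of each package is installed. It is not a substitute for Conda or any other dependency management tool; the function names `installed_versions` and `as_pinned_lines` are hypothetical, and the pip-style `name==version` output format is an assumption chosen for illustration.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


def as_pinned_lines(versions):
    """Format the installed packages as sorted "name==version" lines."""
    return [
        f"{name}=={ver}"
        for name, ver in sorted(versions.items())
        if ver is not None
    ]
```

Saving the output of `as_pinned_lines` next to your results gives a simple record of the environment that produced them, which can help when a later package update changes behaviour.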

Include a git hash

If practical, the output of your code should include the git hash of the code that produced it. This makes the analysis more reproducible, because there is no ambiguity about which version of the code generated the output.

R

You can access the git hash using either of the following code snippets.

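library(git2r)
# Open the repository in the current working directory
repo <- repository(".")
# Print HEAD (the branch or commit currently checked out)
print(repository_head(repo))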

or

print(system("git rev-parse --short HEAD", intern = TRUE))

Python

You can access the git hash using the following code:

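import subprocess

def get_git_revision_hash():
    # check_output returns bytes with a trailing newline, so decode and strip
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()

def get_git_revision_short_hash():
    # Same as above, but returns the abbreviated hash
    return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD']).decode('ascii').strip()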
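
Once the hash is available, it can be stored alongside the results. The following is a minimal Python sketch of one way to do this, assuming the script is run inside a git working tree. The function names `current_git_hash` and `stamp_output`, the "unknown" fallback, and the output layout are illustrative assumptions, not an established convention.

```python
import json
import subprocess


def current_git_hash(short=False):
    """Return the hash of HEAD, or "unknown" if git cannot provide one."""
    cmd = ["git", "rev-parse", "HEAD"]
    if short:
        cmd.insert(2, "--short")
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, or git is not installed.
        return "unknown"
    return out.decode("ascii").strip()


def stamp_output(results, git_hash=None):
    """Wrap results in a dict that records which commit produced them."""
    if git_hash is None:
        git_hash = current_git_hash()
    return {"git_hash": git_hash, "results": results}


if __name__ == "__main__":
    # Hypothetical results; in practice these come from your analysis.
    stamped = stamp_output({"mean_value": 4.2})
    print(json.dumps(stamped, indent=2))
```

Keeping the hash in the same file as the results means the two cannot be separated by accident. Note that the hash only identifies committed code, so uncommitted changes in the working tree are not captured by this sketch.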

4.2 Format

If the output is a report, the write-up should be fully reproducible, or as close to it as possible.

  • Avoid workflows that require manually copying and pasting results between documents.
  • For Python, consider using Jupyter notebooks. For R, use rmarkdown.
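
The idea behind avoiding copy and paste can be sketched in plain Python: the report text is generated from the data, so the numbers in the write-up can never drift from the numbers the code computed. The function name `render_report` and the report layout are hypothetical; a Jupyter notebook or rmarkdown document achieves the same goal with richer formatting.

```python
from statistics import mean


def render_report(values):
    """Return report text whose numbers are computed directly from the data."""
    lines = [
        "Summary report",
        f"Number of observations: {len(values)}",
        f"Mean: {mean(values):.2f}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical data, used only for illustration.
    print(render_report([1, 2, 3]))
```

Re-running the script after the data changes regenerates the whole report, with no manual step in between.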

4.3 Optimize for change

  • Don’t try to solve every conceivable problem up-front; instead, focus on making your code easy to change when needed.
  • Don’t prematurely optimize - choose clarity over performance, unless there is a serious performance issue that needs to be addressed.
  • Change can come in several forms, including hardware: your code will eventually be run on a colleague’s machine or a server somewhere. Without over-complicating things, write your code with this in mind. For example, use relative paths (e.g. ./file_in_the_project_directory.R rather than /Users/my_username/development/my_project/file_in_the_project_directory.R).
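
The relative-path advice can be sketched in Python as follows. This is a minimal example, assuming the code is run from the top of the project directory; the helper name `project_path` and the file names are hypothetical.

```python
from pathlib import Path


def project_path(*parts, root=None):
    """Build a path inside the project from relative parts.

    `root` defaults to the current working directory, on the assumption
    that the code is run from the top of the project.
    """
    base = Path(root) if root is not None else Path.cwd()
    return base.joinpath(*parts)


if __name__ == "__main__":
    # "data" and "input.csv" are hypothetical names used for illustration.
    print(project_path("data", "input.csv"))
```

Because no user name or machine-specific directory appears in the code, the same script can run unchanged on a colleague’s machine or a server, provided it is started from the project directory.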