2 Key RAP components in coding projects
The default position is that all MoJ coding projects should be done in a reproducible manner.
To be considered fully reproducible, an MoJ RAP project should include the components listed below. These are arranged in three levels to inform prioritisation. All MoJ coding projects, whether considered RAP or not, should at least include the level 1 components to be considered reproducible.
Regardless of level, and in appropriate measure, all MoJ coding projects should:
- Adhere to the MoJ Analytical IT Tools Strategy to ensure suitable IT tools are used.
- Follow the MoJ Analytical IT Tools Strategy: Recommended Ways of Working.
- Apply quality assurance which (as described by the Aqua Book) is proportionate to the complexity and risk of the analysis – see the relevant Duck Book section.
2.1 Level 1:
- Include clear code along with suitably embedded documentation. At minimum, code should be set out clearly (e.g. in R you can use the styler package to format code according to the tidyverse style guide) and be well commented (comments are for explaining why something is needed, not how it works). A README (often the first item a visitor will see when visiting a GitHub repository) should set out how a user can reproduce the process. It should include how the project is structured, any associated manual steps, the datasets and code used (including any functions and dependencies), any access requirements, and definitions of specialist terms. For more information about understandable code structure and style, see the Data and Analysis coding standards chapter: Understandable.
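As an illustrative sketch of this commenting principle (the suppression rule and function name below are hypothetical), a comment should record why a step exists rather than restate what the code does:

```python
def suppress_small_counts(counts):
    """Replace small counts with None before publication."""
    # Why: counts below 5 are suppressed to reduce the risk of
    # identifying individuals in published tables. The "how" (the
    # list comprehension) needs no comment; it is clear from the code.
    return [c if c is not None and c >= 5 else None for c in counts]
```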
- Make code recyclable. Manual steps should generally be minimised, and any that remain fully documented. For a future iteration with new data, would the code take less than 30 minutes of editing before it could be run again? Consider whether any:
- Steps would need re-editing or could become irrelevant.
- Variables that require manual input (e.g. table names, years) could be assigned at the top of the code, so they are easy to edit in one place with each iteration.
- Fixed variables are prone to change (such as geographic boundaries), so that you can start preparing now by making them easy to adapt in future.
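The points above can be sketched as follows (the table and variable names are hypothetical): assigning manual inputs once, at the top of the script, means each new iteration is a one-place edit.

```python
# Manual inputs gathered at the top of the script: update these once
# per iteration rather than hunting through the code.
PUBLICATION_YEAR = 2024            # hypothetical: update each publication
TABLE_NAME = "prison_population"   # hypothetical table name

def build_query(table: str, year: int) -> str:
    """Build a simple extract query from the parameters above."""
    return f"SELECT * FROM {table} WHERE year = {year}"

query = build_query(TABLE_NAME, PUBLICATION_YEAR)
```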
- Use an appropriate project structure. Typically it is appropriate to have one repository for each endpoint in your process and within a repository a folder structure similar to that for a package – see the Project Structure section.
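As a minimal sketch of a package-like layout for an R project (folder names illustrative):

```
my_project/
├── README.md      # how to reproduce the process
├── R/             # functions used by the pipeline
├── tests/         # automated tests
├── data/          # small, non-sensitive example data
└── renv.lock      # snapshot of package dependencies
```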
- Use version control. Utilise Git/GitHub repositories to store and share code, following the Git Flow workflow. For more information see the Introduction to Git/GitHub course and the Bite-sized session: Learning to love GitHub with Git Flow.
- Make the code as widely available as possible via the GitHub repository or repositories. If a repository cannot be shared publicly, at least not without major revisions to the code, can it be shared within Data and Analysis or within a more localised team? For more information about the steps you need to take to make your repository public, please see the Analytical Platform User Guidance Acceptable use policy section on GitHub.
- Take additional steps to prevent any sensitive data being pushed to GitHub.com, e.g. through use of a .gitignore file – for more information, see the Developing R packages & RAP ways of working course section on excluding sensitive data. Projects that include Jupyter notebooks should use nbstripout to ensure data are not visible; see the Analytical Platform user guidance on GitHub security and the nbstripout documentation.
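As a minimal sketch, a .gitignore file at the repository root can exclude common data file types before they are ever committed (the patterns below are illustrative; tailor them to your project):

```
# Example .gitignore entries - keep data files out of version control
*.csv
*.xlsx
data/
.Renviron
```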
- Use package dependency management (to prevent code breaking when packages are updated) – for R projects, see the Analytical Platform user guidance section on renv and the Coffee and Coding introduction to renv. For Python projects, see the Analytical Platform user guidance on venv and pip.
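For a Python project, for example, pinning package versions in a requirements.txt file (installed with pip install -r requirements.txt) makes the environment reproducible; the packages and versions below are purely illustrative:

```
# requirements.txt - versions pinned so the code runs the same everywhere
pandas==2.2.2
numpy==1.26.4
```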
- Ensure data are in a tidy data format at each stage – see the Data Structures section.
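As a hedged illustration of reshaping to tidy format (the data and function name are hypothetical): in a tidy data set each variable is a column and each observation is a row, so a "wide" table with one column per year is lengthened so that year becomes a variable.

```python
# A wide table: one row per region, one column per year.
wide = [
    {"region": "North", "2022": 10, "2023": 12},
    {"region": "South", "2022": 8, "2023": 9},
]

def to_tidy(rows, id_col, value_name):
    """Reshape wide rows into tidy (long) format: one observation per row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy.append({id_col: row[id_col], "year": key, value_name: value})
    return tidy

tidy = to_tidy(wide, "region", "count")
# Each element is now one observation, e.g.
# {"region": "North", "year": "2022", "count": 10}
```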
- Have the code peer reviewed, covering both reproducibility (according to the RAP level) and quality assurance requirements. For more information about reviewing, see the Data and Analysis coding standards section on project review. You could also consider obtaining a coding and/or RAP mentor to help with this – for more information, see the Analytical Platform and related tools training section on mentoring. If needed, you could advertise for a peer reviewer via Slack – see the Analytical Platform and related tools training section on Slack.
2.2 Level 2:
- Consist of code that is modular and generalisable, using functions (whether made by others or developed as part of the project). Functions are a way to bundle up bits of code to make them easy to reuse. For more information, see the generalisable code section and the Writing functions in R course. When writing or maintaining functions you should:
- Ensure there is appropriate non-sensitive data available in tidy data format for use in development and testing.
- Utilise the condition system to flag up if something unusual is happening when a function is run. For more information, see the Developing R packages & RAP ways of working section on using the condition system.
- Include automated quality assurance checks on input data sets. For more information, see the Developing R packages & RAP ways of working section on automating quality assurance checks on input data sets.
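The points above can be sketched in Python, where raising errors and warnings plays a similar role to R's condition system (the function and its validation rule are hypothetical):

```python
import warnings

def occupancy_rate(occupied: int, capacity: int) -> float:
    """Compute an occupancy rate, flagging unusual inputs."""
    # Automated check on input data: a zero capacity makes the rate
    # undefined, so stop with an informative error.
    if capacity == 0:
        raise ValueError("capacity is zero: occupancy rate is undefined")
    # Unusual but not necessarily wrong, so warn rather than error
    # (analogous to signalling a warning via R's condition system).
    if occupied > capacity:
        warnings.warn("occupied exceeds capacity; check the input data")
    return occupied / capacity
```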
2.3 Level 3:
- Include all code within a package (or packages), apart from a single function call that resides in a separate repository and is run to produce the output1; see the Developing R packages course for how to develop a package. The packages should include:
- Documentation that is integrated2 and useful. In R, use the roxygen2 package; see the Developing R packages section on documenting functions and the Developing R packages and RAP ways of working section on documenting package data.
- Automated tests, so that when any changes are pushed to GitHub.com, tests are run to ascertain whether there are any problems. The testing should consist of both:
- Unit tests (generally there should be at least one for each function), at least covering what really needs to be tested (what is of high risk?). For more information, including using the testthat R package, see the Developing R packages section on testing your code. In Python, use the pytest module to write unit and integration tests – see the pytest documentation.
- Integration tests (testing everything in the whole pipeline). For more information on these and automating software workflows, see the Developing R packages section on continuous integration.
- A NEWS file, a change-log which should be kept updated. For more information, see the Developing R packages section on adding a NEWS file.
- The use of GitHub releases to generate easily accessible snapshots at relevant intervals. For more information, see the Developing R packages section on managing package releases.
- Ensure ongoing maintenance of the coding project, keeping the code updated as its dependencies (e.g. packages used) change.
- Make the package(s) publicly available, amending the code where necessary to enable this. For more information about the steps you need to take to make your repository public, please see the Analytical Platform User Guidance Acceptable use policy section on GitHub.
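A unit test in the style collected by pytest can be sketched as follows (the function names are hypothetical; pytest discovers functions whose names begin with test_ and runs their assertions):

```python
def count_missing(values):
    """Count None entries in a list of values."""
    return sum(1 for v in values if v is None)

def test_count_missing():
    # One unit test per function, covering the high-risk behaviour.
    assert count_missing([1, None, 3, None]) == 2
    assert count_missing([]) == 0
```

Running `pytest` from the repository root executes every such test; a continuous integration workflow can run the same command automatically on each push.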
So for a statistical publication, the publication date could be the only argument in the function call that resides in a separate repository, i.e. one not containing the package code. The publication date would be recorded together with associated dependencies (including their hashes) and be version controlled via a GitHub release.↩︎
Documentation that is integrated with the package functions in a standard format and is accessible in a standard way i.e. via ? and ??.↩︎