3 Project Structure
Structuring your project in certain ways makes it easier to develop a pipeline like the one in section 1.
3.1 Use one repository (repo) for each endpoint in your process
As in the diagram above, define a set of endpoints or outputs that your process will produce. For a statistical publication this could include:
- Cleaned datasets for internal use
- Publishable datasets (these may be the same as those above)
- Publication outputs (charts, tables, publication text)
- Separate repos for other outputs (e.g. MI packs, briefing packs, data visualisation tools)
- A separate repo for your functions (a function being a set of commands that are bundled together so that they can all be repeated with a single line of code)
This makes it easier to use outputs for multiple purposes, rather than having to extract them from the middle of a larger process. It also allows other users who want to adapt your code or outputs to find the section of code that they need, without having to understand the whole codebase.
In the diagram above, the aim of the first repo should be to put the data into a format that can be used to produce the broadest range of other outputs, each of which is then created within its own repo. You may also need to consider the point at which you want to include disclosure control in the process. For example, the first repo might create a full, unredacted dataset for internal use. You may then want to include a further stage that aggregates the data so that small numbers are not released.
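As a minimal sketch of such a disclosure control stage, the base R code below aggregates record-level data to counts and then blanks out any count below a threshold. The data, the function name `apply_disclosure_control` and the threshold of 5 are all assumptions made for illustration; the rules you actually apply will depend on your data and the relevant disclosure guidance.

```r
# Hypothetical record-level data: one row per case (all values invented).
cases <- data.frame(
  region  = c("North", "North", "North", "North", "North", "North",
              "South", "South"),
  outcome = c("A", "A", "A", "A", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Aggregate to counts per group, then suppress counts below a threshold.
# The default threshold of 5 is an illustrative assumption, not a rule.
apply_disclosure_control <- function(data, group_cols, threshold = 5) {
  counts <- aggregate(
    x   = list(n = rep(1L, nrow(data))),  # one unit per record
    by  = data[group_cols],               # grouping columns
    FUN = sum
  )
  counts$suppressed <- counts$n < threshold
  counts$n[counts$suppressed] <- NA_integer_  # blank out small counts
  counts
}

publishable <- apply_disclosure_control(cases, c("region", "outcome"))
print(publishable)
```

Keeping this step in its own stage means the unredacted dataset stays available for internal use, while every published output is built from the controlled version.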
Lastly, it is recommended that any functions you need to write for the project are kept in a separate repo (see the Generalisable Code section).
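To illustrate what is meant by a function here, the sketch below bundles a few commands for tidying column names into a single reusable function. The function name `clean_names` and the example data are hypothetical and use only base R.

```r
# Hypothetical helper: standardise column names so they are lower case
# and use underscores instead of spaces or punctuation.
clean_names <- function(data) {
  new_names <- trimws(names(data))
  new_names <- gsub("[^A-Za-z0-9]+", "_", new_names)
  names(data) <- tolower(new_names)
  data
}

# Invented example data with untidy column names.
raw <- data.frame(
  "Court Name" = c("A", "B"),
  "Case Count" = c(10, 20),
  check.names = FALSE
)

# The bundled commands are now repeated with a single line of code.
tidy <- clean_names(raw)
names(tidy)
# "court_name" "case_count"
```

Once a helper like this lives in its own repo, each output repo can reuse it rather than copying the same commands.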
3.2 Common structure of a project repository (repo)
A project repo should have a structure similar to that of a package, which bundles together code, data, documentation, and tests. The directory structure of an R package is typically as follows:
- R code is in ‘R’.
- Documentation is in ‘man’.
- Data is in ‘data’ (although in MoJ data should generally be stored in S3).
- Tests are in ‘tests’.
- Long-form guides on how to use the package are in ‘vignettes’.
- Markdown templates and other files are in ‘inst’.
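The layout above can be sketched in base R as follows. This is only an illustration: the repo name `example.project` is hypothetical, and the folders are created in a temporary directory so that nothing is written to your working directory. In practice you may prefer a tool such as `usethis::create_package()` to generate a package skeleton.

```r
# Hypothetical repo location inside a temporary directory.
repo_root <- file.path(tempdir(), "example.project")

# Folder names taken from the typical R package structure listed above.
dirs <- c("R", "man", "data", "tests", "vignettes", "inst")

# Create each folder, including the repo root if it does not yet exist.
for (d in dirs) {
  dir.create(file.path(repo_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist in the repo.
created <- list.dirs(repo_root, full.names = FALSE, recursive = FALSE)
print(created)
```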
Any functions you write for the project should be kept in a separate repo (functions are an MoJ level 2 RAP component; see the Generalisable Code section for more details).
For further information about package structure see the R Packages book section on package structure.
Best practice is for repos, at least those which contain functions, to be created as packages (an MoJ level 3 RAP component). Functions can either be added to a suitable existing package that is used across multiple projects or, if necessary, to a project-specific package.