Chapter 1 Understandable

It is crucial that we write code in a way that can be understood easily by others (and our future selves) and passed on to other staff with minimal additional instruction.

1.1 Code structure

There should be clear logical separation of parameters, assumptions, data and code, with appropriate documentation.

Data

  • It should be easy for others to change parameters and data without needing to understand the full codebase.

  • Where possible, it’s best practice to use lookup/reference files as much as possible, rather than hard-coding variables. An alternative approach is to store all parameters in one script, making it easy for the variables to be located and updated.

  • Small data files that are not MoJ data can be committed to the Github repo (e.g. parameters, csvs containing assumptions, mock data for unit tests, lookup tables) - these should be stored in a ‘data’ folder within the project folder.

  • Prefer text format (e.g. csv) to Excel.

  • Please note that unpublished MoJ data should never be committed to your Github repository - even a ‘private’ repo isn’t secure. Instead, data should be stored in an S3 bucket or in Athena.

Documentation

  • Your project should include a README - by writing a README for your project you’re helping others (and your future self) to understand and run your project. Find our template README here.

  • You should add a description to your Github repository and tag it with appropriate topics. This will allow your project to be discoverable and reusable by others. Github will also automatically tag it according to the language used.

  • Any functions you create should have appropriate documentation.

  • Try to make your Git commit messages as informative as possible.

  • Your code should be appropriately commented. Comments are for explaining why something is needed, not how it works. Note that there are no hard and fast rules on how to do this - this blog gives an overview on some of the differing opinions on this.

Good code is its own best documentation. As you’re about to add a comment, ask yourself, “How can I improve the code so that this comment isn’t needed?” Improve the code and then document it to make it even clearer.” - Steve McConnell

Abstractions

Abstractions (e.g. functions, packages, modules etc.) should be used where possible. This makes code easier to understand, maintain and extend.

It will often be the case that will be a pre-existing package or module that contains the functions you need for your project. The best way to find these is usually through a quick google search and, if you’re struggling to find what you’re looking for, it’s worth asking on the relevant slack channel (e.g. #r and #python in the ASD workspace).

If you end up using a piece of code three times, it’s probably worth turning it into a function and separating it out into a separate script. For more information on how to write functions, see here. As a general rule, functions should be less than about 50 lines long.

All non trivial functions should be documented using the programming language’s accepted standard:
* For Python follow PEP8 and particularly PEP257.
* For R, use roxygen2.

If your functions are short and well documented, there is often little need for additional code comments. See the Tidyverse Style Guide for some basic guidance on how to make your functions understandable.

1.2 Code style

There are often many ways of tackling a given problem. As a team, it makes sense to standardise our approach, not because one approach is necessarily better than all others, but because collaboration is easier if there is more commonality in our approaches.

Defaults

This section sets out sensible defaults which you are expected to follow. They are not strict rules, but you will be expected to explain the benefits of alternative approaches if you want to do something different.

Language Defaults
R
  • Follow the Tidyverse Style Guide
  • Default to packages from the Tidyverse, because they have been carefully designed to work together effectively as part of a modern data analysis workflow. More info can be found here: R for Data Science by Hadley Wickham. For example:
    • Prefer tibbles to data.frames
    • Use ggplot2 rather than base graphics
    • Use the pipe %>% appropriately, but not always e.g. see here.
    • Prefer purrr to the apply family of functions. See here
  • Use the package name when calling a function. For example, using dplyr::mutate() rather than just mutate()
  • R Packages are the fundamental unit of reproducible R code.
Python
  • Follow PEP8
  • Use Python 3
  • Use pandas for data analysis
  • Use loc and iloc to write to data frames
  • Use Altair for basic data visualisation
  • Use Scikit Learn for machine learning
  • Use SQLAlchemy and pandas for database interactions, rather than writing your own SQL
Encoding and CSVs Use unicode. This means you should convert inputs that include non-ASCII characters to unicode as early as possible in your data processing workflow. If you are outputting to text files, these should be encoded in utf-8.
SQL Use Postgres or SQLite where possible, rather than other SQL database backends.
GIS
  • Use PostGIS as your GIS backend
  • Prefer conducting your GIS analysis in code, e.g. using SQL, rather than point and click in a GUI
  • Use QGIS if you need a GUI.

In addition to the above coding defaults, the following actions will help you to write understandable code.

Naming conventions

  • Use meaningful names - well-named functions and variables can remove the need for a comment and make life a little easier for other readers, including your future self. Avoid meaningless names like ‘obj’ / ‘result’ / ‘foo’.
  • Don’t be cute or jokey when naming things.
  • Use single-letter variables only where the letter represents a well-known mathematical property (e.g. e = mc^2), or where their meaning is otherwise clear.

Clear and concise code

  • Choose clarity over cleverness - use advanced language tricks with care.
  • Avoid unnecessary repetition in you code. For example, if you end up using the same piece of code 3 times it’s probably worth turning it into a function.
  • Less code is usually better - but not at the expense of clarity.
  • Use code comments well (see above).

Linters

A linter is a tool that analyses code to check for programmatic and stylistic errors. You should apply a linter to review your code formatting, which will mean our coding style will be consistent across projects and make it easier for others to understand.

In RStudio, the keyboard shortcut ‘ctrl+shift+A’ will reformat your code and automate some of the process of passing the linter. If you apply the linter as you work, rather than at the end, you will find it much easier to write code that passes the linter first time.

If possible, set up your linters and tests to run automatically on all pull requests, using Github Actions.

Error messages

Errors will occur, so write your error messages in a way that provide useful information to end users and people working on the code.