Statistical Methods Guidance

1.1 Introduction

This document signposts useful statistical methods guidance on key topics for MoJ analysts. The contents have been selected because they provide one or more of the following:

An accessible introduction to the topic
‘Under the bonnet’ theory in accessible as well as technical language
Key steps to take (including common issues and key assumptions)
A practical demonstration

Please bear in mind:

It’s intended to be a ‘live’ document.
Readers are welcome to make suggestions about content including the addition of other resources and topics. Please do this either by creating a GitHub issue or by completing this MS form.
It’s deliberately kept brief; further statistical methods resources are signposted in the Online analytical training Trello board.
Guidance on training resources for the Analytical Platform and related tools (Git/GitHub, Python, R, SQL etc.) is in this separate document.

1.2 Overview sources

Helpful overview sources include:

Tables providing general guidelines for choosing a statistical test and data analysis examples along with links to R code
The Data Science Textbook which provides brief overviews of many techniques
Much-liked textbooks covering a wide range of techniques:
- Discovering statistics using R
- An Introduction to Statistical Learning with R or Python code

1.3 Exploratory Data Analysis

“Getting familiar with the data”

Exploratory versus explanatory analysis: a reminder of why data need to be visualised, and recommended chart formats for each type of data
Helpful textbooks:
- R for Data Science (EDA section)
- Discovering statistics using R; see Chapter 4 on exploring data with graphs, Chapter 5 on exploring assumptions, and Chapter 6 on correlation

1.4 Outliers, missing values and data imputation

“Dealing with extreme and missing values”

Outlier detection and treatment
Missing data imputation - a chapter from Data Analysis Using Regression and Multilevel/Hierarchical Models (2006) by Andrew Gelman and Jennifer Hill
Datacamp courses:

1.5 Statistical inference

“Making inferences about a population based on certain sample characteristics”

Introductions to statistical inference:
- Emergency Medicine Journal article
- Wikipedia entry on statistical inference
- Discovering statistics using R; see Chapter 1
Datacamp courses:
- Foundations of inference in R
- Foundations of inference in Python

1.6 Hypothesis testing

“Do the sample data sufficiently support a particular population hypothesis?”

Internal resources:
- “Old” GSS introduction to hypothesis testing; see pages 26-30
- MoJ hypothesis testing workbook with Excel based examples and data
- MoJ junior statistician group presentations on parametric and non-parametric hypothesis testing
External resources:
- Table providing general guidelines for choosing a statistical test
- Discovering statistics using R; see Chapter 2 on introduction to testing, Chapter 5 on exploring assumptions, Chapter 6 on correlation, Chapter 9 on comparing two means, and Chapter 15 on non-parametric tests
Analytical Function courses:
- Statistics in R and Hypothesis testing in R
- Hypothesis testing in Python
Datacamp courses:
- Hypothesis testing in R
- Hypothesis testing in Python
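
As a rough illustration of the question in the section heading, the sketch below runs a two-sided permutation test for a difference in means using only the Python standard library. It is one of many possible tests, not a method recommended by the resources above; the function name `permutation_test` and the two small data sets are hypothetical and exist purely for demonstration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: group labels are exchangeable (no difference in means).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups (illustrative values only)
group_a = [12.1, 11.8, 13.4, 12.9, 12.5, 13.0]
group_b = [10.2, 10.9, 11.1, 10.5, 11.4, 10.8]

p_value = permutation_test(group_a, group_b)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unusual if the two groups came from the same population. The parametric and non-parametric tests covered in the resources above make different assumptions and will often be more appropriate.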

1.7 Linear regression

“Modelling the relationship between a continuous dependent variable and explanatory variables by fitting a linear equation to observed data”

Internal resources:
External resources:
- Discovering statistics using R; see Chapter 7
- An Introduction to Statistical Learning; see Chapter 3 with R or Python code
- Centre for Multilevel Modelling Online Course Module 3 - Multiple regression; with R code
Analytical Function courses including linear regression:
- Statistics in R and Machine learning in R
- Introduction to machine learning in Python
Datacamp courses:
- Introduction to Regression in R, Intermediate Regression in R and Inference for Linear Regression in R
- Introduction to Regression with statsmodels in Python and Intermediate Regression with statsmodels in Python
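
To make "fitting a linear equation to observed data" concrete, here is a minimal sketch of ordinary least squares with a single explanatory variable, written with the standard library only. The helper `fit_simple_ols` and the data are hypothetical. In practice the packages covered in the courses above (for example statsmodels) also report standard errors and diagnostics, which this sketch does not.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical observations (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_ols(x, y)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")

# Residuals are the gaps between observed and fitted values;
# plotting them is a common first check of the model assumptions
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```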

1.8 Risk, Odds and Generalised Linear Models

“Risk, odds and the extension of linear modelling ideas to a wider class of response types, such as count data or binary responses”

Internal resources:
External resources:
- Discovering statistics using R; see Chapter 8
- An Introduction to Statistical Learning; see Chapter 4 with R or Python code
- Centre for Multilevel Modelling Online Course Module 6 - Regression models for binary responses; with R code
Analytical Function courses including generalised linear modelling:
- Statistics in R and Machine learning in R
- Introduction to machine learning in Python
Datacamp courses:
- Introduction to Regression in R, Intermediate Regression in R and Supervised Learning in R: Regression
- Introduction to Regression with statsmodels in Python, Intermediate Regression with statsmodels in Python, Machine Learning with PySpark, and Linear Classifiers in Python
For examples of logistic regression modelling in crime/offending contexts, including how its performance compares with machine-learning approaches, see this Online analytical training Trello card.
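
Because risk and odds are easily confused, the following sketch computes both from a two-group table of counts, along with the risk ratio and odds ratio. The counts and the helper `risk_and_odds` are hypothetical. The final line reflects the standard result that, in a logistic regression with a single binary explanatory variable, the fitted coefficient equals the log of the odds ratio.

```python
from math import log


def risk_and_odds(events, total):
    """Return (risk, odds) for a group with `events` out of `total`."""
    risk = events / total            # probability of the event
    odds = events / (total - events)  # events per non-event
    return risk, odds


# Hypothetical counts: 20 of 100 in group A and 10 of 100 in group B
risk_a, odds_a = risk_and_odds(20, 100)
risk_b, odds_b = risk_and_odds(10, 100)

risk_ratio = risk_a / risk_b
odds_ratio = odds_a / odds_b
log_odds_ratio = log(odds_ratio)  # the scale logistic regression works on

print(f"risk ratio: {risk_ratio:.2f}, odds ratio: {odds_ratio:.2f}")
```

The two ratios differ here (2.00 versus 2.25) because the event is not rare; they move closer together as the event becomes rarer.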

1.9 Survival analysis

“Analysing the expected time to an event of interest”

Survival Analysis in R
Introduction to Regression Methods for Public Health Using R; see Chapter 7
An Introduction to Statistical Learning; see Chapter 11 with R or Python code
Datacamp courses:
- Survival Analysis in R
- Survival Analysis in Python
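
As a rough sketch of one common starting point in survival analysis, the code below implements the Kaplan-Meier estimator of the survival function in plain Python. The function name `kaplan_meier` and the follow-up data are hypothetical. It assumes the usual convention that a subject censored at time t is still counted as at risk at t. The packages covered in the resources above add confidence intervals and plotting.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : observed follow-up time for each subject
    events : 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather every subject whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (weeks) and event indicators
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"S({t}) = {s:.2f}")
```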

1.10 Multilevel and cluster robust models

“Models that build in data hierarchies”

Brief introductions:
Centre for Multilevel Modelling course
Discovering statistics using R; see Chapter 19
Clustered standard errors with R
Nonlinear multilevel models: Generalised Additive Mixed Models (GAMMs) using the R package mgcv, which are more interpretable than other nonlinear models such as Random Forest.
Datacamp course Hierarchical and Mixed Effects Models in R

1.11 Time series analysis & forecasting

“Analysing a sequence of data points collected over an interval of time”

Forecasting: Principles and Practice

1.12 Bayesian regression

“The Bayesian approach to linear regression”

Bayesian Regression Using NumPyro, a practical guide to using Python to infer the distributions of regression coefficients.
For R programmers, see the brms package, which provides getting-started links.

1.13 Sample size determination

“Choosing an appropriate sample size”

Bitesize session on sample size calculations in R; drawing on resources on this R sample size calculations Trello card
Guide to clustered sample sizes using R at the Evaluation & Prototyping One-Stop Shop
Sampling: Design and Analysis, a textbook by Sharon Lohr.
Further resources signposted on this sample calculations Trello card include:
- how to reduce the minimum sample size needed in trials
- sample sizes for difference-in-differences and regression discontinuity designs
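
For orientation only, the sketch below applies the standard normal-approximation formula for the sample size per group when comparing two means, n = 2 * ((z_(1-alpha/2) + z_(power)) / d)^2, where d is the standardised effect size. The function name `n_per_group` and the default values are assumptions made for illustration. It ignores clustering, attrition and unequal group sizes, all of which the resources above address.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation.

    effect_size : difference in means divided by the common standard deviation
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need much larger samples
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {n_per_group(d)} per group")
```

Calculations based on the t distribution give slightly larger answers than this approximation, so treat the result as a lower bound.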

1.14 Survey analysis

“Analysing the results of a survey”

Analyzing Survey Data in R; essentially someone’s freely accessible notes from the Datacamp course
Datacamp course: Analyzing Survey Data in R

1.15 Inter-rater reliability analysis

“Measuring the agreement between subjective ratings”
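
One widely used measure of agreement between two raters is Cohen's kappa, which compares the observed agreement with the agreement expected by chance. The sketch below is a minimal standard-library version. The function name `cohens_kappa` and the ratings are hypothetical, and other measures may suit other designs, such as more than two raters or ordinal scales.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning categories to the same items."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must rate the same items")
    n = len(ratings_a)
    # Proportion of items on which the raters agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    # Undefined (division by zero) if both raters use one identical category
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four items by two raters
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance.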

1.16 Evaluation and Prototyping analysis

“In particular, to understand the impact of an intervention”

Evaluation & Prototyping One-Stop Shop

1.17 Further sources

There are a number of Centres of Expertise within Data and Analysis, which offer advice and support.
Information on specific methods used in the process to produce statistics across the Government Statistical Service (GSS)
In addition to the internal Statistical Methodology Team, there is also the mostly free GSS Methodology Advice Service
Lastly, don’t forget that you may be able to get useful help on methods from ChatGPT. When prompting ChatGPT, be clear, specific and concise, use correct grammar and spelling, and provide an example if necessary.