Statistical Methods Guidance
2024-11-20
1 Topics
1.1 Introduction
This document signposts useful statistical methods guidance on key topics for MoJ analysts. The contents have been selected on the basis of providing one or more of the following:
- An accessible introduction to the topic
- ‘Under the bonnet’ theory in accessible as well as technical language
- Key steps to take (including common issues and key assumptions)
- A practical demonstration
Please bear in mind:
- It’s intended to be a ‘live’ document.
- Readers are welcome to make suggestions about content including the addition of other resources and topics. Please do this either by creating a GitHub issue or by completing this MS form.
- It’s deliberately kept brief, with further statistical methods resources signposted in the Online analytical training Trello board.
- Guidance about training resources for the Analytical Platform and related tools (Git/GitHub, Python, R, SQL etc.) is in this separate document.
1.2 Overview sources
Helpful overview sources include:
- Tables providing general guidelines for choosing a statistical test and data analysis examples along with links to R code
- The Data Science Textbook which provides brief overviews of many techniques
- Much-liked textbooks covering a wide range of techniques:
- Discovering statistics using R
- An Introduction to Statistical Learning with R or Python code
1.3 Exploratory Data Analysis
“Getting familiar with the data”
- Exploratory versus explanatory analysis, a reminder about why data need to be visualised and recommended chart formats depending on the type of data
- Helpful textbooks:
- R for Data Science (EDA section)
- Discovering statistics using R; see Chapter 4 on exploring data with graphs, Chapter 5 on exploring assumptions, and Chapter 6 on correlation
1.4 Outliers, missing values and data imputation
“Dealing with extreme and missing values”
- Outlier detection and treatment
- Missing data imputation - a chapter from Data Analysis Using Regression and Multilevel/Hierarchical Models (2006) by Andrew Gelman and Jennifer Hill
- Datacamp courses:
1.5 Statistical inference
“Making inferences about a population based on certain sample characteristics”
- Introductions to statistical inference:
- Emergency Medicine Journal article
- Wikipedia entry on statistical inference
- Discovering statistics using R; see Chapter 1
- Datacamp courses:
1.6 Hypothesis testing
“Do the sample data sufficiently support a particular population hypothesis?”
- Internal resources:
- “Old” GSS introduction to hypothesis testing; see Pages 26-30
- MoJ hypothesis testing workbook with Excel based examples and data
- MoJ junior statistician group presentations on parametric and non-parametric hypothesis testing
- External resources:
- Table providing general guidelines for choosing a statistical test
- Discovering statistics using R; see Chapter 2 on introduction to testing, Chapter 5 on exploring assumptions, Chapter 6 on correlation, Chapter 9 on comparing two means, and Chapter 15 on non-parametric tests
- Analytical Function courses:
- Datacamp courses:
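As a rough illustration of the logic behind a non-parametric test, the sketch below runs an exact two-sided permutation test for a difference in means using only the Python standard library. The function name `permutation_test` and the data are made up for illustration; this is one simple approach under the assumption of small samples (every relabelling is enumerated), not a recommended default.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test for a difference in means.

    Hypothetical helper for illustration; only practical for small samples
    because it enumerates every possible relabelling of the pooled data.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    # Under the null hypothesis the group labels are exchangeable, so every
    # way of choosing n_a values to call "group A" is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    # Proportion of relabellings at least as extreme as the observed split.
    return extreme / count


# Made-up data for illustration only
group_a = [12.1, 14.3, 13.8, 15.2, 14.9]
group_b = [10.2, 11.5, 9.8, 12.0, 10.9]
p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value here says that few relabellings of the pooled data produce a difference as large as the one observed, which is the same question a parametric two-sample test asks under stronger distributional assumptions.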
1.7 Linear regression
“Modelling the relationship between a continuous dependent variable and explanatory variables by fitting a linear equation to observed data”
- Internal resources:
- External resources:
- Discovering statistics using R; see Chapter 7
- An Introduction to Statistical Learning; see Chapter 3 with R or Python code
- Centre for Multilevel Modelling Online Course Module 3 - Multiple regression; with R code
- Analytical Function courses including linear regression:
- Datacamp courses:
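To make "fitting a linear equation to observed data" concrete, here is a minimal sketch of simple (one-predictor) ordinary least squares using the closed-form formulas. The function name `fit_simple_ols` and the data are illustrative assumptions; in practice a statistical package would also report standard errors and diagnostics, which this sketch omits.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration; returns (slope, intercept, r_squared).
    """
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squares of x and sum of cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_simple_ols(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```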
1.8 Risk, Odds and Generalised Linear Models
“Risk, odds and the extension of linear modelling ideas to a wider class of response types, such as count data or binary responses”
- Internal resources:
- External resources:
- Discovering statistics using R; see Chapter 8
- An Introduction to Statistical Learning; see Chapter 4 with R or Python code
- Centre for Multilevel Modelling Online Course Module 6 - Regression models for binary responses; with R code
- Analytical Function courses including generalised linear modelling:
- Datacamp courses:
- For examples of logistic regression modelling in crime/offending contexts, along with comparisons of its performance against machine learning approaches, see this Online analytical training Trello card.
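The difference between risk and odds is easiest to see with a small worked example. The sketch below computes both, plus the risk ratio and odds ratio, from a 2x2 table. The function name `risk_and_odds` and the counts are made up for illustration. The log of the odds ratio is what a logistic regression with this single binary predictor would estimate as its coefficient.

```python
from math import log


def risk_and_odds(events_exposed, non_events_exposed,
                  events_unexposed, non_events_unexposed):
    """Summarise a 2x2 table (hypothetical helper for illustration)."""
    a, b = events_exposed, non_events_exposed
    c, d = events_unexposed, non_events_unexposed
    risk_exposed = a / (a + b)      # risk = events / everyone in the group
    risk_unexposed = c / (c + d)
    odds_exposed = a / b            # odds = events / non-events
    odds_unexposed = c / d
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_exposed": odds_exposed,
        "odds_unexposed": odds_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
        "log_odds_ratio": log(odds_exposed / odds_unexposed),
    }


# Made-up counts for illustration only:
# 30 of 100 in the exposed group and 15 of 100 in the unexposed group
# experience the event.
summary = risk_and_odds(30, 70, 15, 85)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the risk ratio and odds ratio differ for the same table; the two are only close when the event is rare in both groups.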
1.9 Survival analysis
“Analysing the expected time to an event of interest”
- Survival Analysis in R
- Introduction to Regression Methods for Public Health Using R; see Chapter 7
- An Introduction to Statistical Learning; see Chapter 11 with R or Python code
- Datacamp courses:
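As a hedged illustration of how survival analysis handles censored observations, the sketch below implements the Kaplan-Meier product-limit estimator in plain Python. The function name `kaplan_meier` and the data are assumptions made for illustration; dedicated survival packages additionally provide confidence intervals and plotting.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate (hypothetical helper for illustration).

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= n_leaving
    return curve


# Made-up data for illustration only (0 = censored)
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a drop in the curve, which is what distinguishes this from simply dropping incomplete cases.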
1.10 Multilevel and cluster robust models
“Models when building in data hierarchies”
- Brief introductions:
- Centre for multilevel modelling course
- Discovering statistics using R; see Chapter 19
- Clustered standard errors with R
- Nonlinear multilevel models: Generalised Additive Mixed Models (GAMMs) using the R package mgcv, which are generally more interpretable than nonlinear models such as Random Forest.
- Datacamp course Hierarchical and Mixed Effects Models in R
1.11 Time series analysis & forecasting
“Analysing a sequence of data points collected over an interval of time”
1.12 Bayesian regression
“The Bayesian approach to linear regression”
- Bayesian Regression Using NumPyro, a practical guide to using Python to infer the distributions of regression coefficients.
- For R programmers see the brms package which provides getting started links.
1.13 Sample size determination
“Choosing an appropriate sample size”
- Bitesize session on sample size calculations in R; drawing on resources on this R sample size calculations Trello card
- Guide to clustered sample sizes using R at the Evaluation & Prototyping One-Stop Shop
- Sampling: Design and Analysis, a textbook by Sharon Lohr.
- Further resources signposted on this sample calculations Trello card include:
- how to reduce the minimum sample size needed in trials
- sample sizes for difference in difference and regression discontinuity designs
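For a feel of how a sample size calculation works, the sketch below uses the standard normal-approximation formula for comparing two means with equal group sizes, n per group = 2 * ((z_alpha + z_beta) / d)^2, where d is the standardised effect size. The function name `sample_size_two_means` and its default values are assumptions for illustration. Calculations based on the t distribution, as used by many R packages, typically give a slightly larger answer.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample comparison.

    Hypothetical helper for illustration, using the normal approximation.
    effect_size is the standardised difference in means (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Illustrative effect sizes only
for d in (0.2, 0.5, 0.8):
    print(f"d={d}: {sample_size_two_means(d)} per group")
```

Because the effect size enters as a squared term in the denominator, halving the detectable effect roughly quadruples the required sample.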
1.14 Survey analysis
“Analysing the results of a survey”
- Analyzing Survey Data in R; essentially someone’s freely accessible notes from the Datacamp course
- Datacamp course: Analyzing Survey Data in R
1.15 Inter-rater reliability analysis
“Measuring the agreement between subjective ratings”
- A Wikipedia introduction to inter-rater reliability analysis
- Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial
- Inter-Rater Reliability for binary, categorical, and ordinal ratings
- Intraclass Correlation Coefficients for ordinal and continuous ratings
- Inter-rater reliability measures in R
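As a small illustration of an agreement measure for categorical ratings, the sketch below computes Cohen's kappa for two raters: observed agreement corrected for the agreement expected by chance. The function name `cohens_kappa` and the ratings are made up for illustration. It assumes exactly two raters who each rate every item, and it is undefined when both raters use a single identical category throughout.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters (hypothetical helper for illustration)."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("Both raters must rate the same non-empty set of items")
    n = len(rater_1)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_1 = Counter(rater_1)
    counts_2 = Counter(rater_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n ** 2
    # Raises ZeroDivisionError if expected == 1 (kappa is undefined there).
    return (observed - expected) / (1 - expected)


# Made-up ratings for illustration only
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa: {kappa:.3f}")
```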
1.16 Evaluation and Prototyping analysis
“In particular, to understand the impact of an intervention”
1.17 Further sources
- There are a number of Centres of Expertise within Data and Analysis, which offer advice and support.
- Information on specific methods used in the process to produce statistics across the Government Statistical Service (GSS)
- In addition to the internal Statistical Methodology Team, there is also the mostly free GSS Methodology Advice Service
- Lastly, don’t forget that you may be able to get useful help from ChatGPT on methods. When prompting ChatGPT, be clear, specific and concise, use correct grammar and spelling, and provide an example if necessary.