2 Recommended Analytical IT tools

The default position of analysts should be to use Analytical Platform (AP) tools unless there is a good reason otherwise. The recommended approach, depending on consumer requirements, is set out below. This should be complemented by adherence to the recommended ways of working.

1. Storage: Data should be loaded into an Amazon S3 bucket. This can be done either manually via the AWS console or in code from RStudio or JupyterLab (a minimal sketch of a coded upload is given after this list). Once the Data Uploader tool is fully rolled out along with suitable guidance, this should generally be used.1

2. Processing 1: Use Athena/SQL for as much data processing as possible.2 This can be done via the Athena UI, or in code via the pydbtools package in Python (using JupyterLab) or the dbtools (or Rdbtools3) package in R (using RStudio), all of which work through Athena (see the second sketch after this list). Athena/SQL may be necessary to process large datasets4 and there is an accessibility and maintainability benefit in everyone using it for as much processing as possible. However, for smaller data (see Example 2) it may be easier to do all processing in the AP tool of choice.

3. Processing 2: Use the AP tool of choice (if not Athena/SQL, e.g. R or Python) to produce any final tables/analyses, charts and text; the second sketch after this list carries the query result through to a final chart.

4. Presentation: Tools (not all of which can be used on the AP) include HTML, PowerPoint, Excel, CSV, Power BI, Tableau, R Shiny5 and QuickSight. The appropriateness of each of these tools will depend on consumer needs. Questions to ask include:

  • What do users want to achieve, and how much expertise do they have?
  • How much interactivity is needed?6
  • What tool(s) will enable the product to be as reproducible as possible? This is particularly important if the task is unlikely to be a one-off; an HTML document, for instance, can easily be updated through RStudio (using the rmarkdown package) or JupyterLab (see the third sketch after this list), while ‘hands-on’ work done in Excel is much less reproducible.
  • How accessible to users will the product be? For instance, an HTML document is platform agnostic; all the user needs is a web browser.
  • How much time and skill development will be needed to build and maintain the product? For instance, building a Power BI, Tableau or R Shiny dashboard takes a substantial amount of time (R Shiny more than the others, although it is also more reproducible).

5. Automation/Scheduling: Use Airflow to schedule regular R/Python scripts or to run high-memory R/Python jobs (a minimal DAG sketch is given after this list). Use create-a-derived-table for regular Athena/SQL data processing runs.7
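
To illustrate step 1, the sketch below shows a coded upload to S3 from Python using boto3. It is illustrative only; the bucket name, local file and destination key are hypothetical placeholders to be replaced with your own AP resources.

```python
import boto3

# Hypothetical bucket, local file and destination key - replace with your own
bucket = "alpha-my-project"
local_file = "data/cases_2024.csv"
s3_key = "raw/cases_2024.csv"

# Upload the local file to s3://alpha-my-project/raw/cases_2024.csv
s3 = boto3.client("s3")
s3.upload_file(local_file, bucket, s3_key)
```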
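
For steps 2 and 3, the sketch below assumes pydbtools' read_sql_query function, which runs a query through Athena and returns a pandas DataFrame; the database and table names are hypothetical. The heavy aggregation is pushed to Athena/SQL, and only the small summary is brought back to the AP tool of choice (here, Python) for the final chart.

```python
import pydbtools as pydb
import matplotlib.pyplot as plt

# Hypothetical Athena database and table - replace with your own
sql = """
    SELECT region, COUNT(*) AS n_cases
    FROM my_database.cases
    WHERE year = 2024
    GROUP BY region
    ORDER BY n_cases DESC
"""

# Step 2: let Athena do the heavy processing and return a small summary
cases_by_region = pydb.read_sql_query(sql)

# Step 3: produce the final output in the AP tool of choice
fig, ax = plt.subplots()
ax.bar(cases_by_region["region"], cases_by_region["n_cases"])
ax.set_xlabel("Region")
ax.set_ylabel("Number of cases")
ax.set_title("Cases by region, 2024")
fig.savefig("cases_by_region.png", dpi=150)
```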
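
For step 4, one way a reproducible HTML product can be regenerated in code is sketched below, using nbconvert to render a Jupyter notebook to HTML; the notebook and output file names are hypothetical. (The RStudio equivalent would be re-knitting an R Markdown document.)

```python
from nbconvert import HTMLExporter

# Hypothetical notebook containing the analysis - replace with your own
exporter = HTMLExporter()
body, resources = exporter.from_filename("case_report.ipynb")

# Write the rendered HTML; re-running this script reproduces the product
with open("case_report.html", "w", encoding="utf-8") as f:
    f.write(body)
```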
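
For step 5, the final sketch below is a minimal, generic Airflow DAG that runs a Python script on a weekly schedule. On the AP, Airflow pipelines are set up through their own repositories and deployment process, so this is only an illustration; the DAG id, schedule and script path are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that runs a processing script every Monday at 07:00
with DAG(
    dag_id="weekly_case_summary",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 7 * * 1",
    catchup=False,
) as dag:
    run_processing = BashOperator(
        task_id="run_processing",
        bash_command="python /path/to/process_cases.py",  # hypothetical script path
    )
```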

A process map you can use to help with decision making, along with some practical examples, is provided in the next section.


  1. This tool automatically stores the data in the highly efficient Parquet format, which takes much less storage space and is much quicker to process than CSV. While the savings may be negligible for small data (e.g. up to 50MB), they increase dramatically for larger data. The tool will also enable data to be loaded by people without an AP account.↩︎

  2. See the Analytical Platform user guidance on using our databases for analysis.↩︎

  3. See the Analytical Platform user guidance section on packages for querying databases.↩︎

  4. Where the memory limitations associated with RStudio and JupyterLab need to be bypassed.↩︎

  5. Currently R Shiny on the AP should only be used to test/prototype tools, or be shared with a limited group of customers for this purpose.↩︎

  6. The MoJ Data Visualisation Standards outlines the process and standards to be followed in producing interactive online data visualisation tools.↩︎

  7. This tool enables SQL to be run in a standardised way for regular queries, with scheduled runs and dependencies that are easier to trace than in Airflow (the platform currently used to programmatically author, schedule and monitor workflows).↩︎