3 Process Map and Examples

This section provides a process map to help you decide which analytical IT tools to use, followed by three practical examples.

3.1 Process Map

3.2 Practical Examples

Example 1: Setting up the monthly production of simple summary outputs using large derived tables on the AP.

Storage / Processing 1: Starting from derived data provided by Data Engineering, use Athena/SQL via the pydbtools package in Python ⁸ (in JupyterLab) to create the aggregations consumers need, and store the aggregate tables in S3 as part of an Athena database ⁹.
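A minimal sketch of this step, assuming a derived table with an `event_date` (DATE) column and a `category` column. The table names, column names and S3 location are hypothetical. The sketch also assumes that the installed version of pydbtools provides `start_query_execution_and_wait`; check this against the package documentation before relying on it.

```python
from datetime import date
from textwrap import dedent


def build_ctas_sql(source_table, target_table, s3_location, month_start):
    """Build an Athena CTAS statement storing one month's counts per category.

    All table and column names are hypothetical placeholders.
    """
    # Parse and re-format the date so only a valid ISO date reaches the SQL text.
    month = date.fromisoformat(month_start).isoformat()
    return dedent(
        f"""
        CREATE TABLE {target_table}
        WITH (format = 'PARQUET', external_location = '{s3_location}') AS
        SELECT
            date_trunc('month', event_date) AS month,
            category,
            COUNT(*) AS n_records
        FROM {source_table}
        WHERE event_date >= DATE '{month}'
          AND event_date < date_add('month', 1, DATE '{month}')
        GROUP BY 1, 2
        """
    ).strip()


def store_monthly_aggregate(source_table, target_table, s3_location, month_start):
    """Run the CTAS statement in Athena (needs Analytical Platform credentials)."""
    # Imported lazily so the SQL builder can be used and tested without pydbtools.
    import pydbtools as pydb

    sql = build_ctas_sql(source_table, target_table, s3_location, month_start)
    # Assumed pydbtools call: runs the statement and waits for it to finish.
    pydb.start_query_execution_and_wait(sql)
```

The table names and location are interpolated directly into the SQL text, so they should come from trusted configuration rather than user input.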

Processing 2: Access the aggregate tables via the pydbtools package in Python; explore and analyse them with consumer requirements in mind, then transform them into summary tables and charts.
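The transformation into a summary table might look like the sketch below. It assumes the aggregate table has been read into a list of dictionaries with hypothetical `month`, `category` and `n_records` fields (for example, from a pandas DataFrame via `df.to_dict("records")`), and uses only the standard library.

```python
from collections import defaultdict


def pivot_counts(rows):
    """Pivot aggregate rows into one summary row per month with a total column."""
    table = defaultdict(dict)
    for row in rows:
        table[row["month"]][row["category"]] = row["n_records"]

    # Use the same category columns for every month, filling gaps with zero.
    categories = sorted({c for counts in table.values() for c in counts})
    summary = []
    for month in sorted(table):
        counts = {c: table[month].get(c, 0) for c in categories}
        summary.append({"month": month, **counts, "total": sum(counts.values())})
    return summary
```

In practice a team working in pandas would probably use a pivot table instead; the plain-Python version is shown only to make the logic explicit.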

Presentation: Summary outputs presented in HTML and CSV format with PowerPoint slides.
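As an illustration of the HTML and CSV outputs only (slides are not covered), the sketch below renders a summary table held as a list of dictionaries. The function names and the example columns are hypothetical, and it assumes every row has the same keys.

```python
import csv
import html
import io


def to_csv(summary):
    """Return the summary rows as CSV text, with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(summary[0]))
    writer.writeheader()
    writer.writerows(summary)
    return buffer.getvalue()


def to_html_table(summary):
    """Return the summary rows as a simple HTML table, escaping all values."""
    headers = list(summary[0])
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(row[h]))}</td>" for h in headers)
        + "</tr>"
        for row in summary
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
```

Both functions return text, so the caller decides where the files are written (for example, to an S3 location the consumer can access).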

Automation/Scheduling: Use Airflow for scheduling the running of the Python script.
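A rough sketch of a monthly schedule, assuming Airflow 2.4 or a later 2.x release and a plain `PythonOperator`. The DAG name, the run time and the `produce_monthly_summary` function are hypothetical, and the way jobs are packaged and deployed on the AP may differ from this.

```python
from datetime import datetime

# Cron expression: 06:00 on the 1st day of every month (assumed timing).
MONTHLY_SCHEDULE = "0 6 1 * *"


def produce_monthly_summary():
    """Placeholder for the monthly script; replace with the real processing steps."""
    print("Running monthly summary production")


def build_dag():
    """Build the DAG. Airflow is imported here so this module loads without it."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="monthly_summary_outputs",  # hypothetical name
        schedule=MONTHLY_SCHEDULE,
        start_date=datetime(2024, 1, 1),
        catchup=False,  # do not back-fill runs for past months
    )
    PythonOperator(
        task_id="produce_monthly_summary",
        python_callable=produce_monthly_summary,
        dag=dag,
    )
    return dag
```

In a real DAG file, Airflow only discovers the DAG if it exists at module level, so the file would end with `dag = build_dag()`.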

Example 2: Analysing a small survey conducted using Smart Survey

Storage: Load a 300-row dataset exported from Smart Survey as a CSV file into an S3 bucket.

Processing 2: Undertake analysis via R mindful of consumer requirements. Produce summary tables and charts as well as a refined and anonymised dataset to enable users to undertake manual interactive exploration and/or machine learning.

Presentation: Create an R Shiny or Power BI dashboard to enable internal users to interact with the anonymised data, which is also provided as a CSV file. Use the RMarkdown package to provide the summary outputs in HTML and PowerPoint slides.

Example 3: Produce exploratory analysis of a large dataset not housed on the AP

Storage: Load the dataset received into an S3 bucket as part of an Athena database.

Processing 1: Use Athena/SQL via RStudio (using the dbtools package) to explore, understand and manipulate the data, including joining with other tables on S3 as suitable and addressing any data quality issues. Summarise the data (mindful of GDPR / DPA and customer requirements) to the lowest level of granularity required for analysis.
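Before summarising, a first pass at data quality could count missing values and duplicated keys. The sketch below is in Python rather than R and uses only the standard library. It assumes a sample of the data has been read into a list of dictionaries, and the `id` and `court` column names in the usage line are hypothetical.

```python
from collections import Counter


def quality_report(rows, key):
    """Count missing values per column and list key values that appear more than once."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[column] += 1

    key_counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    return {"n_rows": len(rows), "missing": dict(missing), "duplicate_keys": duplicates}
```

For example, `quality_report(rows, key="id")` on three rows where `id` 1 appears twice and `court` is blank or missing in two of them reports two missing `court` values and `[1]` as the duplicated key. For a genuinely large dataset the same checks would be better expressed as Athena SQL, so that the data does not need to be pulled into memory.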

Processing 2: Use R ¹⁰ to undertake further (e.g. statistical) analysis and produce summary tables and charts mindful of customer requirements.

Presentation: Summary outputs presented (using the RMarkdown package) in HTML with an accompanying CSV file provided for the consumer’s own exploration.

⁸ It may be desirable for R to be used instead in this example, particularly if the team generally uses R rather than Python.
⁹ Assuming the aggregate tables should remain available to other users. Otherwise, temporary Athena tables could be created using dbtools/pydbtools in Processing 1 with Processing 2 run straight afterwards (e.g. using the same script).
¹⁰ It may be desirable for Python to be used instead in this example, particularly if the team generally uses Python rather than R.