Building a new chart in Splink¶

As mentioned in the Understanding Splink Charts topic guide, Splink charts are made up of three distinct parts:

A function to create the dataset for the chart
A template chart definition (in a JSON file)
A function to read the chart definition, add the data to it, and return the chart itself
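
The three parts above can be sketched without Splink at all. The following is a minimal illustration, not Splink's actual implementation: the function names, the example data and the inline template are all hypothetical, and the template is held in a string rather than a JSON file so that the sketch is self-contained.

```python
import json


# Part 1: a function that builds the dataset for the chart
# (hypothetical example data, not produced by Splink)
def build_chart_records():
    return [
        {"category": "a", "value": 1.5},
        {"category": "b", "value": -0.5},
    ]


# Part 2: a template chart definition. In Splink this would live in a
# JSON file; an inline string is used here to keep the sketch standalone.
TEMPLATE_JSON = """
{
  "mark": "bar",
  "data": {"values": []},
  "encoding": {
    "y": {"field": "category", "type": "nominal"},
    "x": {"field": "value", "type": "quantitative"}
  }
}
"""


# Part 3: a function that reads the template, adds the data to it,
# and returns the populated chart specification
def build_chart(records):
    spec = json.loads(TEMPLATE_JSON)
    spec["data"]["values"] = records
    return spec


chart_spec = build_chart(build_chart_records())
print(json.dumps(chart_spec, indent=2))
```

Keeping the template separate from the data means the same chart definition can be reused with any set of records that has the expected fields.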

Worked Example¶

Below is a worked example of how to create a new chart that shows all comparison levels ordered by match weight:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("email", 2),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ]
)

linker = Linker(df, settings, DuckDBAPI())
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
for rule in [block_on("first_name"), block_on("dob")]:
    linker.training.estimate_parameters_using_expectation_maximisation(rule)

Generate data for chart¶

# Extract the detailed parameter records (one per comparison level) from the linker's settings
records = linker._settings_obj._parameters_as_detailed_records

cols_to_keep = [
    "comparison_name",
    "sql_condition",
    "label_for_charts",
    "m_probability",
    "u_probability",
    "bayes_factor",
    "log2_bayes_factor",
    "comparison_vector_value"
]

# Keep the fields needed for a match weights chart, dropping null levels
# (comparison_vector_value == -1) and records with a sort order of -1
records = [{k: r[k] for k in cols_to_keep}
           for r in records
           if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]

records[:3]

[{'comparison_name': 'first_name',
  'sql_condition': '"first_name_l" = "first_name_r"',
  'label_for_charts': 'Exact match on first_name',
  'm_probability': 0.5009783629340309,
  'u_probability': 0.0057935713975033705,
  'bayes_factor': 86.4714229896119,
  'log2_bayes_factor': 6.434151525637829,
  'comparison_vector_value': 4},
 {'comparison_name': 'first_name',
  'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92',
  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.92',
  'm_probability': 0.15450921411813767,
  'u_probability': 0.0023429457903817435,
  'bayes_factor': 65.9465595629351,
  'log2_bayes_factor': 6.043225490816602,
  'comparison_vector_value': 3},
 {'comparison_name': 'first_name',
  'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88',
  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.88',
  'm_probability': 0.07548037415770431,
  'u_probability': 0.0015484319951285285,
  'bayes_factor': 48.7463281533646,
  'log2_bayes_factor': 5.607221645966225,
  'comparison_vector_value': 2}]
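
As a quick sanity check on these fields, the Bayes factor of a level should be its m probability divided by its u probability, and the match weight (log2_bayes_factor) should be the base-2 logarithm of that ratio. The sketch below recomputes both for the first record printed above, using only the standard library; the variable names are illustrative.

```python
import math

# m and u probabilities copied from the first record shown above
m = 0.5009783629340309
u = 0.0057935713975033705

# Bayes factor = m / u, match weight = log2(m / u)
bayes_factor = m / u
match_weight = math.log2(bayes_factor)

print(bayes_factor)
print(match_weight)
```

The printed values should agree with the record's bayes_factor and log2_bayes_factor fields up to floating-point rounding.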

Create a chart template¶

Build prototype chart in Altair¶

import pandas as pd
import altair as alt

df = pd.DataFrame(records)

# Need a unique name for each comparison level - easier to create in pandas than altair
df["cl_id"] = df["comparison_name"] + "_" + \
    df["comparison_vector_value"].astype("str")

# Simple start - bar chart with x, y and color encodings
alt.Chart(df).mark_bar().encode(
    y="cl_id",
    x="log2_bayes_factor",
    color="comparison_name"
)

Sort bars, edit axes/titles¶

alt.Chart(df).mark_bar().encode(
    y=alt.Y("cl_id",
        sort="-x",
        title="Comparison level"
    ),
    x=alt.X("log2_bayes_factor",
        title="Comparison level match weight = log2(m/u)",
        scale=alt.Scale(domain=[-10,10])
    ),
    color="comparison_name"
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)

Add tooltip¶

alt.Chart(df).mark_bar().encode(
    y=alt.Y("cl_id",
            sort="-x",
            title="Comparison level"
            ),
    x=alt.X("log2_bayes_factor",
            title="Comparison level match weight = log2(m/u)",
            scale=alt.Scale(domain=[-10, 10])
            ),
    color="comparison_name",
    tooltip=[
        "comparison_name",
        "label_for_charts",
        "sql_condition",
        "m_probability",
        "u_probability",
        "bayes_factor",
        "log2_bayes_factor"
        ]
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)

Add text layer¶

# Create base chart with shared data and encodings (mark type not specified)
base = alt.Chart(df).encode(
    y=alt.Y("cl_id",
            sort="-x",
            title="Comparison level"
            ),
    x=alt.X("log2_bayes_factor",
            title="Comparison level match weight = log2(m/u)",
            scale=alt.Scale(domain=[-10, 10])
            ),
    tooltip=[
        "comparison_name",
        "label_for_charts",
        "sql_condition",
        "m_probability",
        "u_probability",
        "bayes_factor",
        "log2_bayes_factor"
    ]
)

# Build bar chart from base (color legend made redundant by text labels)
bar = base.mark_bar().encode(
    color=alt.Color("comparison_name", legend=None)
)

# Build text layer from base
text = base.mark_text(dx=0, align="right").encode(
    text="comparison_name"
)

# Final layered chart
chart = bar + text

# Add global config
chart.resolve_axis(
    y="shared",
    x="shared"
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)

Sometimes things go wrong in Altair and it's not clear why or how to fix it. If the docs and Stack Overflow don't have a solution, the answer is usually that Altair is making decisions under the hood about the Vega-Lite schema that are out of your control.

In this example, the y-axis sorting breaks when the charts are layered. Shown side by side, bar and text each sort as expected; the ordering is lost only when they are combined into a single layered chart.

bar | text
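
One possible workaround, offered here as an untested assumption rather than part of this walkthrough, is to stop relying on sort="-x" and instead compute the order of the y-axis values explicitly. Vega-Lite accepts an array of values as the sort of a discrete field, so the same explicit order can be given to every layer. The records below are hypothetical and trimmed to the two fields that matter for sorting.

```python
# Hypothetical, trimmed-down records with only the fields needed for sorting
records = [
    {"cl_id": "first_name_4", "log2_bayes_factor": 6.43},
    {"cl_id": "first_name_0", "log2_bayes_factor": -2.36},
    {"cl_id": "dob_5", "log2_bayes_factor": 7.89},
]

# Order the comparison level ids by descending match weight
sort_order = [
    r["cl_id"]
    for r in sorted(records, key=lambda r: r["log2_bayes_factor"], reverse=True)
]

# A Vega-Lite y encoding that uses the explicit order instead of "-x";
# the same dictionary could be reused for both the bar and the text layer
y_encoding = {
    "field": "cl_id",
    "type": "nominal",
    "sort": sort_order,
    "title": "Comparison level",
}

print(sort_order)
```

In Altair, the equivalent would be to pass the list as the sort argument, for example alt.Y("cl_id", sort=sort_order). The drawback is that the order is then tied to one particular dataset, which is less convenient for a reusable template.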

Once we get to this stage (or whenever you're comfortable), we can switch to Vega-Lite by exporting the JSON from our chart object, or opening the chart in the Vega-Lite editor.

chart.to_json()

Chart JSON

  {
  "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 300
    }
  },
  "data": {
    "name": "data-3901c03d78701611834aa82ab7374cce"
  },
  "datasets": {
    "data-3901c03d78701611834aa82ab7374cce": [
      {
        "bayes_factor": 86.62949969575988,
        "cl_id": "first_name_4",
        "comparison_name": "first_name",
        "comparison_vector_value": 4,
        "label_for_charts": "Exact match first_name",
        "log2_bayes_factor": 6.436786480320881,
        "m_probability": 0.5018941916173814,
        "sql_condition": "\"first_name_l\" = \"first_name_r\"",
        "u_probability": 0.0057935713975033705
      },
      {
        "bayes_factor": 82.81743551783742,
        "cl_id": "first_name_3",
        "comparison_name": "first_name",
        "comparison_vector_value": 3,
        "label_for_charts": "Damerau_levenshtein <= 1",
        "log2_bayes_factor": 6.371862624533329,
        "m_probability": 0.19595791797531015,
        "sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
        "u_probability": 0.00236614327345483
      },
      {
        "bayes_factor": 35.47812468678278,
        "cl_id": "first_name_2",
        "comparison_name": "first_name",
        "comparison_vector_value": 2,
        "label_for_charts": "Jaro_winkler_similarity >= 0.9",
        "log2_bayes_factor": 5.148857848140163,
        "m_probability": 0.045985303626033085,
        "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
        "u_probability": 0.001296159366708712
      },
      {
        "bayes_factor": 11.266641370022352,
        "cl_id": "first_name_1",
        "comparison_name": "first_name",
        "comparison_vector_value": 1,
        "label_for_charts": "Jaro_winkler_similarity >= 0.8",
        "log2_bayes_factor": 3.493985601438375,
        "m_probability": 0.06396730257493154,
        "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
        "u_probability": 0.005677583982137938
      },
      {
        "bayes_factor": 0.19514855669673956,
        "cl_id": "first_name_0",
        "comparison_name": "first_name",
        "comparison_vector_value": 0,
        "label_for_charts": "All other comparisons",
        "log2_bayes_factor": -2.357355302129234,
        "m_probability": 0.19219528420634394,
        "sql_condition": "ELSE",
        "u_probability": 0.9848665419801952
      },
      {
        "bayes_factor": 113.02818119005431,
        "cl_id": "surname_4",
        "comparison_name": "surname",
        "comparison_vector_value": 4,
        "label_for_charts": "Exact match surname",
        "log2_bayes_factor": 6.820538712806792,
        "m_probability": 0.5527050424941531,
        "sql_condition": "\"surname_l\" = \"surname_r\"",
        "u_probability": 0.004889975550122249
      },
      {
        "bayes_factor": 80.61351958508214,
        "cl_id": "surname_3",
        "comparison_name": "surname",
        "comparison_vector_value": 3,
        "label_for_charts": "Damerau_levenshtein <= 1",
        "log2_bayes_factor": 6.332949906378981,
        "m_probability": 0.22212752320956386,
        "sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
        "u_probability": 0.0027554624131641246
      },
      {
        "bayes_factor": 48.57568460485815,
        "cl_id": "surname_2",
        "comparison_name": "surname",
        "comparison_vector_value": 2,
        "label_for_charts": "Jaro_winkler_similarity >= 0.9",
        "log2_bayes_factor": 5.602162423566203,
        "m_probability": 0.0490149338194711,
        "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
        "u_probability": 0.0010090425738347498
      },
      {
        "bayes_factor": 13.478820689774516,
        "cl_id": "surname_1",
        "comparison_name": "surname",
        "comparison_vector_value": 1,
        "label_for_charts": "Jaro_winkler_similarity >= 0.8",
        "log2_bayes_factor": 3.752622370380284,
        "m_probability": 0.05001678986356945,
        "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
        "u_probability": 0.003710768991942586
      },
      {
        "bayes_factor": 0.1277149376863226,
        "cl_id": "surname_0",
        "comparison_name": "surname",
        "comparison_vector_value": 0,
        "label_for_charts": "All other comparisons",
        "log2_bayes_factor": -2.969000820703079,
        "m_probability": 0.1261357106132424,
        "sql_condition": "ELSE",
        "u_probability": 0.9876347504709363
      },
      {
        "bayes_factor": 236.78351486807742,
        "cl_id": "dob_5",
        "comparison_name": "dob",
        "comparison_vector_value": 5,
        "label_for_charts": "Exact match",
        "log2_bayes_factor": 7.887424832202931,
        "m_probability": 0.41383785481447766,
        "sql_condition": "\"dob_l\" = \"dob_r\"",
        "u_probability": 0.0017477477477477479
      },
      {
        "bayes_factor": 65.74625268345359,
        "cl_id": "dob_4",
        "comparison_name": "dob",
        "comparison_vector_value": 4,
        "label_for_charts": "Damerau_levenshtein <= 1",
        "log2_bayes_factor": 6.038836762842662,
        "m_probability": 0.10806341031654734,
        "sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
        "u_probability": 0.0016436436436436436
      },
      {
        "bayes_factor": 29.476860590690453,
        "cl_id": "dob_3",
        "comparison_name": "dob",
        "comparison_vector_value": 3,
        "label_for_charts": "Within 1 month",
        "log2_bayes_factor": 4.881510974428093,
        "m_probability": 0.11300938544779224,
        "sql_condition": "\n            abs(date_diff('month',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 1\n        ",
        "u_probability": 0.003833833833833834
      },
      {
        "bayes_factor": 3.397551460259144,
        "cl_id": "dob_2",
        "comparison_name": "dob",
        "comparison_vector_value": 2,
        "label_for_charts": "Within 1 year",
        "log2_bayes_factor": 1.7644954026183992,
        "m_probability": 0.17200656922328977,
        "sql_condition": "\n            abs(date_diff('year',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 1\n        ",
        "u_probability": 0.05062662662662663
      },
      {
        "bayes_factor": 0.6267794172297388,
        "cl_id": "dob_1",
        "comparison_name": "dob",
        "comparison_vector_value": 1,
        "label_for_charts": "Within 10 years",
        "log2_bayes_factor": -0.6739702908716182,
        "m_probability": 0.19035523041792068,
        "sql_condition": "\n            abs(date_diff('year',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 10\n        ",
        "u_probability": 0.3037037037037037
      },
      {
        "bayes_factor": 0.004272180302776005,
        "cl_id": "dob_0",
        "comparison_name": "dob",
        "comparison_vector_value": 0,
        "label_for_charts": "All other comparisons",
        "log2_bayes_factor": -7.870811748958801,
        "m_probability": 0.002727549779972325,
        "sql_condition": "ELSE",
        "u_probability": 0.6384444444444445
      },
      {
        "bayes_factor": 10.904938885948333,
        "cl_id": "city_1",
        "comparison_name": "city",
        "comparison_vector_value": 1,
        "label_for_charts": "Exact match",
        "log2_bayes_factor": 3.4469097796586596,
        "m_probability": 0.6013808934279701,
        "sql_condition": "\"city_l\" = \"city_r\"",
        "u_probability": 0.0551475711801453
      },
      {
        "bayes_factor": 0.42188504195296994,
        "cl_id": "city_0",
        "comparison_name": "city",
        "comparison_vector_value": 0,
        "label_for_charts": "All other comparisons",
        "log2_bayes_factor": -1.2450781575619725,
        "m_probability": 0.3986191065720299,
        "sql_condition": "ELSE",
        "u_probability": 0.9448524288198547
      },
      {
        "bayes_factor": 269.6074384240141,
        "cl_id": "email_2",
        "comparison_name": "email",
        "comparison_vector_value": 2,
        "label_for_charts": "Exact match",
        "log2_bayes_factor": 8.07471649055784,
        "m_probability": 0.5914840252879943,
        "sql_condition": "\"email_l\" = \"email_r\"",
        "u_probability": 0.0021938713143283602
      },
      {
        "bayes_factor": 222.9721189153553,
        "cl_id": "email_1",
        "comparison_name": "email",
        "comparison_vector_value": 1,
        "label_for_charts": "Levenshtein <= 2",
        "log2_bayes_factor": 7.800719512398763,
        "m_probability": 0.3019669634613132,
        "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
        "u_probability": 0.0013542812658830492
      },
      {
        "bayes_factor": 0.10692840956298139,
        "cl_id": "email_0",
        "comparison_name": "email",
        "comparison_vector_value": 0,
        "label_for_charts": "All other comparisons",
        "log2_bayes_factor": -3.225282884575804,
        "m_probability": 0.10654901125069259,
        "sql_condition": "ELSE",
        "u_probability": 0.9964518474197885
      }
    ]
  },
  "layer": [
    {
      "encoding": {
        "color": {
          "field": "comparison_name",
          "legend": null,
          "type": "nominal"
        },
        "tooltip": [
          {
            "field": "comparison_name",
            "type": "nominal"
          },
          {
            "field": "label_for_charts",
            "type": "nominal"
          },
          {
            "field": "sql_condition",
            "type": "nominal"
          },
          {
            "field": "m_probability",
            "type": "quantitative"
          },
          {
            "field": "u_probability",
            "type": "quantitative"
          },
          {
            "field": "bayes_factor",
            "type": "quantitative"
          },
          {
            "field": "log2_bayes_factor",
            "type": "quantitative"
          }
        ],
        "x": {
          "field": "log2_bayes_factor",
          "scale": {
            "domain": [
              -10,
              10
            ]
          },
          "title": "Comparison level match weight = log2(m/u)",
          "type": "quantitative"
        },
        "y": {
          "field": "cl_id",
          "sort": "-x",
          "title": "Comparison level",
          "type": "nominal"
        }
      },
      "mark": {
        "type": "bar"
      }
    },
    {
      "encoding": {
        "text": {
          "field": "comparison_name",
          "type": "nominal"
        },
        "tooltip": [
          {
            "field": "comparison_name",
            "type": "nominal"
          },
          {
            "field": "label_for_charts",
            "type": "nominal"
          },
          {
            "field": "sql_condition",
            "type": "nominal"
          },
          {
            "field": "m_probability",
            "type": "quantitative"
          },
          {
            "field": "u_probability",
            "type": "quantitative"
          },
          {
            "field": "bayes_factor",
            "type": "quantitative"
          },
          {
            "field": "log2_bayes_factor",
            "type": "quantitative"
          }
        ],
        "x": {
          "field": "log2_bayes_factor",
          "scale": {
            "domain": [
              -10,
              10
            ]
          },
          "title": "Comparison level match weight = log2(m/u)",
          "type": "quantitative"
        },
        "y": {
          "field": "cl_id",
          "sort": "-x",
          "title": "Comparison level",
          "type": "nominal"
        }
      },
      "mark": {
        "align": "right",
        "dx": 0,
        "type": "text"
      }
    }
  ]
  }

Edit in Vega-Lite¶

When the JSON from the above chart is opened in the Vega-Lite editor, the chart behaves as intended, with both the bar and text layers sorted by match weight.

If the chart is working as intended, only one step is required before saving the JSON file: removing the data from the template.

In the exported JSON, the data appears as a dictionary of all included datasets keyed by name, and each chart references the dataset it uses by that name:

"data": {"name": "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185"},
"datasets": {
  "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185": [

    ...

  ]
},

Where only one dataset is required, this is equivalent to:

"data": {"values": [...]}

After removing the data references, the template can be saved in Splink as splink/files/chart_defs/my_new_chart.json.
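
The data-removal step can also be done programmatically. The sketch below is one way to do it, assuming a single dataset; the helper name strip_data_from_spec and the dataset name "data-abc" are hypothetical, and the example operates on a small stand-in for the output of chart.to_json() rather than on the full chart JSON.

```python
import json


def strip_data_from_spec(spec_json):
    """Return the spec as a dict with embedded data replaced by an empty placeholder."""
    spec = json.loads(spec_json)
    # Drop the named datasets and the reference to them ...
    spec.pop("datasets", None)
    # ... and leave an empty inline dataset to be filled in later
    spec["data"] = {"values": []}
    return spec


# Small stand-in for the JSON string returned by chart.to_json()
exported = json.dumps({
    "data": {"name": "data-abc"},
    "datasets": {"data-abc": [{"cl_id": "first_name_4"}]},
    "layer": [{"mark": {"type": "bar"}}],
})

template = strip_data_from_spec(exported)
print(json.dumps(template, indent=2))
```

The resulting dictionary could then be written to the template file with json.dump.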

Combine the chart dataset and template¶

Putting all of the above together, Splink needs definitions for the methods that generate the chart and the data behind it (these can be separate or performed by the same function if relatively simple).

Chart definition¶

In splink/charts.py we can add a new function to populate the chart definition with the provided data:

def my_new_chart(records, as_dict=False):
    chart_path = "my_new_chart.json"
    chart = load_chart_definition(chart_path)

    chart["data"]["values"] = records
    return altair_or_json(chart, as_dict=as_dict)

Note - only the data is being added to a fixed chart definition here. Other elements of the chart spec can be changed by editing the chart dictionary in the same way.

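For example, to add a color_scheme argument that replaces the default scheme ("tableau10"), this function could include the line: chart["layer"][0]["encoding"]["color"]["scale"] = {"scheme": color_scheme}. The template exported above has no "scale" entry under "color", so the key is created here rather than indexed into.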
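
A fuller sketch of such a function is shown below. It is an illustration only: the inline TEMPLATE stands in for the definition that Splink would load from the JSON file, populate_chart is a hypothetical name, and it returns a plain dictionary instead of calling Splink's helpers.

```python
import copy

# Stand-in for the chart definition loaded from the template file
TEMPLATE = {
    "data": {"values": []},
    "layer": [
        {
            "mark": {"type": "bar"},
            "encoding": {
                "color": {
                    "field": "comparison_name",
                    "type": "nominal",
                    "legend": None,
                }
            },
        }
    ],
}


def populate_chart(records, color_scheme=None):
    # Work on a copy so the shared template is never mutated
    chart = copy.deepcopy(TEMPLATE)
    chart["data"]["values"] = records

    if color_scheme is not None:
        color = chart["layer"][0]["encoding"]["color"]
        # Create the "scale" entry if the template does not define one
        color.setdefault("scale", {})["scheme"] = color_scheme

    return chart


chart = populate_chart([{"comparison_name": "dob"}], color_scheme="category10")
print(chart["layer"][0]["encoding"]["color"])
```

Leaving color_scheme as None keeps the template untouched, so the chart falls back to the default scheme.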

Chart method¶

Then we can add a method to the linker in splink/linker.py so the chart can be generated by linker.my_new_chart():

from .charts import my_new_chart

...

class Linker:

    ...

    def my_new_chart(self):

        # Extract the detailed parameter records (one per comparison level) from the linker's settings
        records = self._settings_obj._parameters_as_detailed_records

        cols_to_keep = [
            "comparison_name",
            "sql_condition",
            "label_for_charts",
            "m_probability",
            "u_probability",
            "bayes_factor",
            "log2_bayes_factor",
            "comparison_vector_value"
        ]

        # Keep the fields needed for a match weights chart, dropping null levels
        # (comparison_vector_value == -1) and records with a sort order of -1
        records = [{k: r[k] for k in cols_to_keep}
                   for r in records
                   if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]

        return my_new_chart(records)

Previous new chart PRs¶

Real-life Splink chart additions, for reference: