Building a new chart in Splink¶
As mentioned in the Understanding Splink Charts topic guide, Splink charts are made up of three distinct parts:
- A function to create the dataset for the chart
- A template chart definition (in a JSON file)
- A function to read the chart definition, add the data to it, and return the chart itself
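The three parts can be sketched without Splink at all. The snippet below is a minimal illustration using only the standard library; the function names (`build_records`, `load_template`, `fill_chart`) and the inline template are hypothetical stand-ins, not Splink's internals.

```python
import copy
import json

# Part 2: a template chart definition. In Splink this lives in a JSON file;
# here it is an inline string so the sketch is self-contained (assumption).
TEMPLATE_JSON = """
{
  "mark": "bar",
  "encoding": {
    "x": {"field": "log2_bayes_factor", "type": "quantitative"},
    "y": {"field": "label", "type": "nominal"}
  },
  "data": {"values": []}
}
"""


def build_records(levels):
    """Part 1: create the dataset for the chart (hypothetical helper)."""
    return [
        {"label": name, "log2_bayes_factor": weight}
        for name, weight in levels.items()
    ]


def load_template():
    """Part 2: read the template chart definition (hypothetical helper)."""
    return json.loads(TEMPLATE_JSON)


def fill_chart(template, records):
    """Part 3: add the data to a copy of the template and return the spec."""
    chart = copy.deepcopy(template)  # leave the template itself untouched
    chart["data"]["values"] = records
    return chart


# Illustrative values only
records = build_records({"exact_match": 6.4, "all_other": -2.4})
template = load_template()
spec = fill_chart(template, records)
```

Copying the template before filling it keeps the chart definition reusable: the same template can be filled with different records without one call affecting the next.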
Worked Example¶
Below is a worked example of how to create a new chart that shows all comparison levels ordered by match weight:
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
df = splink_datasets.fake_1000
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("email", 2),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ]
)
linker = Linker(df, settings, DuckDBAPI())
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
for rule in [block_on("first_name"), block_on("dob")]:
    linker.training.estimate_parameters_using_expectation_maximisation(rule)
Generate data for chart¶
# Take linker object and extract complete settings dict
records = linker._settings_obj._parameters_as_detailed_records
cols_to_keep = [
    "comparison_name",
    "sql_condition",
    "label_for_charts",
    "m_probability",
    "u_probability",
    "bayes_factor",
    "log2_bayes_factor",
    "comparison_vector_value"
]
# Keep useful information for a match weights chart
records = [{k: r[k] for k in cols_to_keep}
           for r in records
           if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]
records[:3]
[{'comparison_name': 'first_name',
'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name',
'm_probability': 0.5009783629340309,
'u_probability': 0.0057935713975033705,
'bayes_factor': 86.4714229896119,
'log2_bayes_factor': 6.434151525637829,
'comparison_vector_value': 4},
{'comparison_name': 'first_name',
'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.92',
'm_probability': 0.15450921411813767,
'u_probability': 0.0023429457903817435,
'bayes_factor': 65.9465595629351,
'log2_bayes_factor': 6.043225490816602,
'comparison_vector_value': 3},
{'comparison_name': 'first_name',
'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.88',
'm_probability': 0.07548037415770431,
'u_probability': 0.0015484319951285285,
'bayes_factor': 48.7463281533646,
'log2_bayes_factor': 5.607221645966225,
'comparison_vector_value': 2}]
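To see what the filter removes, the sketch below applies the same comprehension to three hand-written records. The values are made up for illustration, and the `extra` key is a hypothetical stand-in for any column not listed in `cols_to_keep`.

```python
# Hand-written records (illustrative values, not real Splink output)
sample_records = [
    {"comparison_name": "first_name", "comparison_vector_value": 1,
     "comparison_sort_order": 0, "log2_bayes_factor": 6.4, "extra": "dropped"},
    {"comparison_name": "first_name", "comparison_vector_value": 0,
     "comparison_sort_order": 0, "log2_bayes_factor": -2.4, "extra": "dropped"},
    {"comparison_name": "first_name", "comparison_vector_value": -1,
     "comparison_sort_order": 0, "log2_bayes_factor": 0.0, "extra": "dropped"},
]
cols = ["comparison_name", "comparison_vector_value", "log2_bayes_factor"]

# Same shape as the filter above: keep selected columns, skip rows flagged -1
filtered = [
    {k: r[k] for k in cols}
    for r in sample_records
    if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1
]
print(filtered)
```

The third record is excluded because its `comparison_vector_value` is -1, and the `extra` key disappears from the remaining two because it is not in the column list.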
Create a chart template¶
Build prototype chart in Altair¶
import pandas as pd
import altair as alt
df = pd.DataFrame(records)
# Need a unique name for each comparison level - easier to create in pandas than altair
df["cl_id"] = df["comparison_name"] + "_" + \
    df["comparison_vector_value"].astype("str")
# Simple start - bar chart with x, y and color encodings
alt.Chart(df).mark_bar().encode(
    y="cl_id",
    x="log2_bayes_factor",
    color="comparison_name"
)
Sort bars, edit axes/titles¶
alt.Chart(df).mark_bar().encode(
    y=alt.Y("cl_id",
        sort="-x",
        title="Comparison level"
    ),
    x=alt.X("log2_bayes_factor",
        title="Comparison level match weight = log2(m/u)",
        scale=alt.Scale(domain=[-10, 10])
    ),
    color="comparison_name"
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)
Add tooltip¶
alt.Chart(df).mark_bar().encode(
    y=alt.Y("cl_id",
        sort="-x",
        title="Comparison level"
    ),
    x=alt.X("log2_bayes_factor",
        title="Comparison level match weight = log2(m/u)",
        scale=alt.Scale(domain=[-10, 10])
    ),
    color="comparison_name",
    tooltip=[
        "comparison_name",
        "label_for_charts",
        "sql_condition",
        "m_probability",
        "u_probability",
        "bayes_factor",
        "log2_bayes_factor"
    ]
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)
Add text layer¶
# Create base chart with shared data and encodings (mark type not specified)
base = alt.Chart(df).encode(
    y=alt.Y("cl_id",
        sort="-x",
        title="Comparison level"
    ),
    x=alt.X("log2_bayes_factor",
        title="Comparison level match weight = log2(m/u)",
        scale=alt.Scale(domain=[-10, 10])
    ),
    tooltip=[
        "comparison_name",
        "label_for_charts",
        "sql_condition",
        "m_probability",
        "u_probability",
        "bayes_factor",
        "log2_bayes_factor"
    ]
)
# Build bar chart from base (color legend made redundant by text labels)
bar = base.mark_bar().encode(
    color=alt.Color("comparison_name", legend=None)
)
# Build text layer from base
text = base.mark_text(dx=0, align="right").encode(
    text="comparison_name"
)
# Final layered chart
chart = bar + text
# Add global config
chart.resolve_axis(
    y="shared",
    x="shared"
).properties(
    title="New Chart - WOO!"
).configure_view(
    step=15
)
Sometimes things go wrong in Altair and it's not clear why or how to fix it. If the docs and Stack Overflow don't have a solution, the answer is usually that Altair is making decisions under the hood about the Vega-Lite schema that are out of your control.
In this example, the sorting of the y-axis is broken when layering charts. If we show bar and text side-by-side, you can see they work as expected, but the sorting is broken in the layering process.
bar | text
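One possible workaround, offered as an untested assumption rather than a confirmed fix, is to compute the bar order in Python and give both layers the same explicit list, since Vega-Lite's `sort` property also accepts an array of values. The sketch below only builds that list from a few illustrative records; whether it resolves the layering issue in a given Altair version would need checking.

```python
# Illustrative records (rounded, made-up values for the sketch)
rows = [
    {"cl_id": "first_name_4", "log2_bayes_factor": 6.43},
    {"cl_id": "first_name_0", "log2_bayes_factor": -2.36},
    {"cl_id": "dob_5", "log2_bayes_factor": 7.89},
]

# Highest match weight first, mirroring what sort="-x" is meant to do
order = [
    r["cl_id"]
    for r in sorted(rows, key=lambda r: r["log2_bayes_factor"], reverse=True)
]

# A y encoding, written as a plain dict, that carries the explicit order
y_encoding = {"field": "cl_id", "type": "nominal", "sort": order}
print(order)
```

In Altair the same list could be passed as the `sort` argument of `alt.Y` in the shared base chart, so that both layers receive an identical, data-independent ordering.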
Once we get to this stage (or whenever you're comfortable), we can switch to Vega-Lite by exporting the JSON from our chart object, or opening the chart in the Vega-Lite editor.
chart.to_json()
Chart JSON
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 300
}
},
"data": {
"name": "data-3901c03d78701611834aa82ab7374cce"
},
"datasets": {
"data-3901c03d78701611834aa82ab7374cce": [
{
"bayes_factor": 86.62949969575988,
"cl_id": "first_name_4",
"comparison_name": "first_name",
"comparison_vector_value": 4,
"label_for_charts": "Exact match first_name",
"log2_bayes_factor": 6.436786480320881,
"m_probability": 0.5018941916173814,
"sql_condition": "\"first_name_l\" = \"first_name_r\"",
"u_probability": 0.0057935713975033705
},
{
"bayes_factor": 82.81743551783742,
"cl_id": "first_name_3",
"comparison_name": "first_name",
"comparison_vector_value": 3,
"label_for_charts": "Damerau_levenshtein <= 1",
"log2_bayes_factor": 6.371862624533329,
"m_probability": 0.19595791797531015,
"sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
"u_probability": 0.00236614327345483
},
{
"bayes_factor": 35.47812468678278,
"cl_id": "first_name_2",
"comparison_name": "first_name",
"comparison_vector_value": 2,
"label_for_charts": "Jaro_winkler_similarity >= 0.9",
"log2_bayes_factor": 5.148857848140163,
"m_probability": 0.045985303626033085,
"sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
"u_probability": 0.001296159366708712
},
{
"bayes_factor": 11.266641370022352,
"cl_id": "first_name_1",
"comparison_name": "first_name",
"comparison_vector_value": 1,
"label_for_charts": "Jaro_winkler_similarity >= 0.8",
"log2_bayes_factor": 3.493985601438375,
"m_probability": 0.06396730257493154,
"sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
"u_probability": 0.005677583982137938
},
{
"bayes_factor": 0.19514855669673956,
"cl_id": "first_name_0",
"comparison_name": "first_name",
"comparison_vector_value": 0,
"label_for_charts": "All other comparisons",
"log2_bayes_factor": -2.357355302129234,
"m_probability": 0.19219528420634394,
"sql_condition": "ELSE",
"u_probability": 0.9848665419801952
},
{
"bayes_factor": 113.02818119005431,
"cl_id": "surname_4",
"comparison_name": "surname",
"comparison_vector_value": 4,
"label_for_charts": "Exact match surname",
"log2_bayes_factor": 6.820538712806792,
"m_probability": 0.5527050424941531,
"sql_condition": "\"surname_l\" = \"surname_r\"",
"u_probability": 0.004889975550122249
},
{
"bayes_factor": 80.61351958508214,
"cl_id": "surname_3",
"comparison_name": "surname",
"comparison_vector_value": 3,
"label_for_charts": "Damerau_levenshtein <= 1",
"log2_bayes_factor": 6.332949906378981,
"m_probability": 0.22212752320956386,
"sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
"u_probability": 0.0027554624131641246
},
{
"bayes_factor": 48.57568460485815,
"cl_id": "surname_2",
"comparison_name": "surname",
"comparison_vector_value": 2,
"label_for_charts": "Jaro_winkler_similarity >= 0.9",
"log2_bayes_factor": 5.602162423566203,
"m_probability": 0.0490149338194711,
"sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
"u_probability": 0.0010090425738347498
},
{
"bayes_factor": 13.478820689774516,
"cl_id": "surname_1",
"comparison_name": "surname",
"comparison_vector_value": 1,
"label_for_charts": "Jaro_winkler_similarity >= 0.8",
"log2_bayes_factor": 3.752622370380284,
"m_probability": 0.05001678986356945,
"sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
"u_probability": 0.003710768991942586
},
{
"bayes_factor": 0.1277149376863226,
"cl_id": "surname_0",
"comparison_name": "surname",
"comparison_vector_value": 0,
"label_for_charts": "All other comparisons",
"log2_bayes_factor": -2.969000820703079,
"m_probability": 0.1261357106132424,
"sql_condition": "ELSE",
"u_probability": 0.9876347504709363
},
{
"bayes_factor": 236.78351486807742,
"cl_id": "dob_5",
"comparison_name": "dob",
"comparison_vector_value": 5,
"label_for_charts": "Exact match",
"log2_bayes_factor": 7.887424832202931,
"m_probability": 0.41383785481447766,
"sql_condition": "\"dob_l\" = \"dob_r\"",
"u_probability": 0.0017477477477477479
},
{
"bayes_factor": 65.74625268345359,
"cl_id": "dob_4",
"comparison_name": "dob",
"comparison_vector_value": 4,
"label_for_charts": "Damerau_levenshtein <= 1",
"log2_bayes_factor": 6.038836762842662,
"m_probability": 0.10806341031654734,
"sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
"u_probability": 0.0016436436436436436
},
{
"bayes_factor": 29.476860590690453,
"cl_id": "dob_3",
"comparison_name": "dob",
"comparison_vector_value": 3,
"label_for_charts": "Within 1 month",
"log2_bayes_factor": 4.881510974428093,
"m_probability": 0.11300938544779224,
"sql_condition": "\n abs(date_diff('month',\n strptime(\"dob_l\", '%Y-%m-%d'),\n strptime(\"dob_r\", '%Y-%m-%d'))\n ) <= 1\n ",
"u_probability": 0.003833833833833834
},
{
"bayes_factor": 3.397551460259144,
"cl_id": "dob_2",
"comparison_name": "dob",
"comparison_vector_value": 2,
"label_for_charts": "Within 1 year",
"log2_bayes_factor": 1.7644954026183992,
"m_probability": 0.17200656922328977,
"sql_condition": "\n abs(date_diff('year',\n strptime(\"dob_l\", '%Y-%m-%d'),\n strptime(\"dob_r\", '%Y-%m-%d'))\n ) <= 1\n ",
"u_probability": 0.05062662662662663
},
{
"bayes_factor": 0.6267794172297388,
"cl_id": "dob_1",
"comparison_name": "dob",
"comparison_vector_value": 1,
"label_for_charts": "Within 10 years",
"log2_bayes_factor": -0.6739702908716182,
"m_probability": 0.19035523041792068,
"sql_condition": "\n abs(date_diff('year',\n strptime(\"dob_l\", '%Y-%m-%d'),\n strptime(\"dob_r\", '%Y-%m-%d'))\n ) <= 10\n ",
"u_probability": 0.3037037037037037
},
{
"bayes_factor": 0.004272180302776005,
"cl_id": "dob_0",
"comparison_name": "dob",
"comparison_vector_value": 0,
"label_for_charts": "All other comparisons",
"log2_bayes_factor": -7.870811748958801,
"m_probability": 0.002727549779972325,
"sql_condition": "ELSE",
"u_probability": 0.6384444444444445
},
{
"bayes_factor": 10.904938885948333,
"cl_id": "city_1",
"comparison_name": "city",
"comparison_vector_value": 1,
"label_for_charts": "Exact match",
"log2_bayes_factor": 3.4469097796586596,
"m_probability": 0.6013808934279701,
"sql_condition": "\"city_l\" = \"city_r\"",
"u_probability": 0.0551475711801453
},
{
"bayes_factor": 0.42188504195296994,
"cl_id": "city_0",
"comparison_name": "city",
"comparison_vector_value": 0,
"label_for_charts": "All other comparisons",
"log2_bayes_factor": -1.2450781575619725,
"m_probability": 0.3986191065720299,
"sql_condition": "ELSE",
"u_probability": 0.9448524288198547
},
{
"bayes_factor": 269.6074384240141,
"cl_id": "email_2",
"comparison_name": "email",
"comparison_vector_value": 2,
"label_for_charts": "Exact match",
"log2_bayes_factor": 8.07471649055784,
"m_probability": 0.5914840252879943,
"sql_condition": "\"email_l\" = \"email_r\"",
"u_probability": 0.0021938713143283602
},
{
"bayes_factor": 222.9721189153553,
"cl_id": "email_1",
"comparison_name": "email",
"comparison_vector_value": 1,
"label_for_charts": "Levenshtein <= 2",
"log2_bayes_factor": 7.800719512398763,
"m_probability": 0.3019669634613132,
"sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
"u_probability": 0.0013542812658830492
},
{
"bayes_factor": 0.10692840956298139,
"cl_id": "email_0",
"comparison_name": "email",
"comparison_vector_value": 0,
"label_for_charts": "All other comparisons",
"log2_bayes_factor": -3.225282884575804,
"m_probability": 0.10654901125069259,
"sql_condition": "ELSE",
"u_probability": 0.9964518474197885
}
]
},
"layer": [
{
"encoding": {
"color": {
"field": "comparison_name",
"legend": null,
"type": "nominal"
},
"tooltip": [
{
"field": "comparison_name",
"type": "nominal"
},
{
"field": "label_for_charts",
"type": "nominal"
},
{
"field": "sql_condition",
"type": "nominal"
},
{
"field": "m_probability",
"type": "quantitative"
},
{
"field": "u_probability",
"type": "quantitative"
},
{
"field": "bayes_factor",
"type": "quantitative"
},
{
"field": "log2_bayes_factor",
"type": "quantitative"
}
],
"x": {
"field": "log2_bayes_factor",
"scale": {
"domain": [
-10,
10
]
},
"title": "Comparison level match weight = log2(m/u)",
"type": "quantitative"
},
"y": {
"field": "cl_id",
"sort": "-x",
"title": "Comparison level",
"type": "nominal"
}
},
"mark": {
"type": "bar"
}
},
{
"encoding": {
"text": {
"field": "comparison_name",
"type": "nominal"
},
"tooltip": [
{
"field": "comparison_name",
"type": "nominal"
},
{
"field": "label_for_charts",
"type": "nominal"
},
{
"field": "sql_condition",
"type": "nominal"
},
{
"field": "m_probability",
"type": "quantitative"
},
{
"field": "u_probability",
"type": "quantitative"
},
{
"field": "bayes_factor",
"type": "quantitative"
},
{
"field": "log2_bayes_factor",
"type": "quantitative"
}
],
"x": {
"field": "log2_bayes_factor",
"scale": {
"domain": [
-10,
10
]
},
"title": "Comparison level match weight = log2(m/u)",
"type": "quantitative"
},
"y": {
"field": "cl_id",
"sort": "-x",
"title": "Comparison level",
"type": "nominal"
}
},
"mark": {
"align": "right",
"dx": 0,
"type": "text"
}
}
]
}
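Before editing, it can help to confirm that both layers really do carry the same `y` encoding, including the sort. A minimal sketch using the standard library, with a trimmed-down spec string standing in for the output of `chart.to_json()`:

```python
import json

# Trimmed stand-in for the string returned by chart.to_json()
# (assumption: only the keys needed for this check are kept)
spec_json = """
{
  "layer": [
    {"mark": {"type": "bar"},
     "encoding": {"y": {"field": "cl_id", "sort": "-x", "type": "nominal"}}},
    {"mark": {"type": "text"},
     "encoding": {"y": {"field": "cl_id", "sort": "-x", "type": "nominal"}}}
  ]
}
"""

spec = json.loads(spec_json)
marks = [layer["mark"]["type"] for layer in spec["layer"]]
y_encodings = [layer["encoding"]["y"] for layer in spec["layer"]]

# True when every layer has exactly the same y encoding as the first
same_y = all(enc == y_encodings[0] for enc in y_encodings)
print(marks, same_y)
```

If the encodings match, as they do in the export above, the specification itself is consistent across layers.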
Edit in Vega-Lite¶
Opening the JSON from the above chart in the Vega-Lite editor, it is now behaving as intended, with both bar and text layers sorted by match weight.
If the chart is working as intended, there is only one step required before saving the JSON file - removing data from the template schema.
The data appears as follows, with a dictionary of all included datasets by name, and then each chart referencing the data it uses by name:
"data": {"name": "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185"},
"datasets": {
    "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185": [
        ...
    ]
},
Where only one dataset is required, this is equivalent to:
"data": {"values": [...]}
After removing the data references, the template can be saved in Splink as splink/files/chart_defs/my_new_chart.json
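Removing the data by hand is error-prone for a large export, so the step can be scripted. The sketch below is one way to do it with the standard library; the helper name `strip_data` is hypothetical, and replacing `data` with an empty `values` list is an assumption chosen so that a later `chart["data"]["values"] = records` assignment has somewhere to go.

```python
import json


def strip_data(spec):
    """Return a copy of a Vega-Lite spec with its embedded data removed."""
    # Drop the named datasets and keep every other top-level key
    template = {k: v for k, v in spec.items() if k != "datasets"}
    # Replace the named data reference with an empty inline dataset
    template["data"] = {"values": []}
    return template


# Small stand-in for an exported chart (illustrative names and values)
exported = {
    "data": {"name": "data-abc"},
    "datasets": {"data-abc": [{"cl_id": "first_name_4"}]},
    "layer": [{"mark": {"type": "bar"}}],
}

template = strip_data(exported)
print(json.dumps(template, indent=2))
```

The result can then be written to the template file with `json.dump`.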
Combine the chart dataset and template¶
Putting all of the above together, Splink needs definitions for the methods that generate the chart and the data behind it (these can be separate or performed by the same function if relatively simple).
Chart definition¶
In splink/charts.py we can add a new function to populate the chart definition with the provided data:
def my_new_chart(records, as_dict=False):
    chart_path = "my_new_chart.json"
    chart = load_chart_definition(chart_path)
    chart["data"]["values"] = records
    return altair_or_json(chart, as_dict=as_dict)
Note - only the data is being added to a fixed chart definition here. Other elements of the chart spec can be changed by editing the chart dictionary in the same way.
For example, if you wanted to add a color_scheme argument to replace the default scheme ("tableau10"), this function could include the line:
chart["layer"][0]["encoding"]["color"]["scale"]["scheme"] = color_scheme
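Note that the bar layer's `color` encoding in the exported JSON above has no `scale` key, so assigning to `["scale"]["scheme"]` directly would raise a `KeyError` unless the saved template defines one. A defensive sketch, assuming a chart dictionary shaped like that export (`set_color_scheme` is a hypothetical helper, not part of Splink):

```python
def set_color_scheme(chart, color_scheme):
    """Set the colour scheme on the first layer, creating "scale" if needed."""
    color = chart["layer"][0]["encoding"]["color"]
    # setdefault returns the existing scale dict, or inserts an empty one
    color.setdefault("scale", {})["scheme"] = color_scheme
    return chart


# Chart dictionary mirroring the bar layer's colour encoding in the export
chart = {
    "layer": [
        {"encoding": {"color": {"field": "comparison_name",
                                "legend": None,
                                "type": "nominal"}}}
    ]
}

set_color_scheme(chart, "category10")
print(chart["layer"][0]["encoding"]["color"])
```

Alternatively, an empty `"scale": {}` entry could be added to the template once, which would make the one-line assignment safe.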
Chart method¶
Then we can add a method to the linker in splink/linker.py so the chart can be generated by linker.my_new_chart():
from .charts import my_new_chart
...
class Linker:
    ...
    def my_new_chart(self):
        # Take linker object and extract complete settings dict
        records = self._settings_obj._parameters_as_detailed_records
        cols_to_keep = [
            "comparison_name",
            "sql_condition",
            "label_for_charts",
            "m_probability",
            "u_probability",
            "bayes_factor",
            "log2_bayes_factor",
            "comparison_vector_value"
        ]
        # Keep useful information for a match weights chart
        records = [{k: r[k] for k in cols_to_keep}
                   for r in records
                   if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]
        return my_new_chart(records)
Previous new chart PRs¶
Real-life Splink chart additions, for reference: