
completeness_chart

At a glance

Useful for: Looking at which columns are populated across datasets.

API Documentation: completeness_chart()

What is needed to generate the chart? One or more input datasets and a database API (db_api), as used in the worked example.

What the chart shows

The completeness_chart shows the proportion of populated (non-null) values in the columns of multiple datasets.

What the chart tooltip shows

The tooltip shows a number of values based on the panel that the user is hovering over, including:

  • The dataset and column name
  • The count and percentage of non-null values in the column for the relevant dataset.
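
To make the tooltip values concrete, the sketch below computes the same two quantities (non-null count and percentage) for each column of a single dataset. It is a plain-Python illustration, not Splink's implementation: the `completeness` helper and the toy records are hypothetical, and `None` stands in for a null value.

```python
def completeness(rows, columns):
    """Return {column: (non_null_count, non_null_percentage)} for a list of dict records."""
    total = len(rows)
    result = {}
    for col in columns:
        # Count records where the column is present and not None.
        non_null = sum(1 for row in rows if row.get(col) is not None)
        pct = 100.0 * non_null / total if total else 0.0
        result[col] = (non_null, pct)
    return result


# Hypothetical toy records; None represents a null value.
toy_records = [
    {"first_name": "Ann", "dob": "1990-01-01"},
    {"first_name": None, "dob": "1985-06-30"},
    {"first_name": "Raj", "dob": None},
    {"first_name": "Li", "dob": None},
]

stats = completeness(toy_records, ["first_name", "dob"])
for col, (count, pct) in stats.items():
    print(f"{col}: {count} non-null ({pct:.0f}%)")
```

Each (dataset, column) pair in the chart corresponds to one such count and percentage.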

How to interpret the chart

Each panel represents the percentage of non-null values in a given dataset-column combination. The darker the panel, the lower the percentage of non-null values.


Actions to take as a result of the chart

When building a linkage model, only choose features (columns) that are sufficiently populated across all of the datasets being linked.
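
One way to apply this rule is to keep only the columns whose completeness meets a minimum threshold in every dataset. The sketch below is a plain-Python illustration under stated assumptions: the `well_populated_columns` helper, the 0.7 threshold, and the toy records are all hypothetical, and a suitable threshold depends on the data and the model.

```python
def well_populated_columns(datasets, columns, threshold=0.7):
    """Return the columns whose non-null fraction is >= threshold in every dataset.

    `datasets` maps a dataset name to a list of dict records.
    """
    selected = []
    for col in columns:
        fractions = []
        for rows in datasets.values():
            non_null = sum(1 for row in rows if row.get(col) is not None)
            fractions.append(non_null / len(rows) if rows else 0.0)
        # A column qualifies only if it is well populated in all datasets.
        if all(fraction >= threshold for fraction in fractions):
            selected.append(col)
    return selected


# Hypothetical toy datasets; None represents a null value.
toy_datasets = {
    "left": [
        {"first_name": "Ann", "email": None},
        {"first_name": "Bo", "email": None},
        {"first_name": "Cy", "email": "cy@example.com"},
        {"first_name": "Di", "email": None},
    ],
    "right": [
        {"first_name": "Ann", "email": "ann@example.com"},
        {"first_name": None, "email": "bo@example.com"},
        {"first_name": "Cy", "email": "cy@example.com"},
        {"first_name": "Di", "email": "di@example.com"},
    ],
}

candidates = well_populated_columns(toy_datasets, ["first_name", "email"])
print(candidates)
```

Here `email` is dropped because it is sparsely populated in the left dataset, even though it is fully populated in the right one.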

Worked Example

from splink import splink_datasets, DuckDBAPI
from splink.exploratory import completeness_chart

df = splink_datasets.fake_1000

# Split a simple dataset into two separate datasets that can be linked together.
df_l = df.sample(frac=0.5)
df_r = df.drop(df_l.index)


chart = completeness_chart([df_l, df_r], db_api=DuckDBAPI())
chart