Linking the pseudopeople Census and ACS datasets¶
In this tutorial we will configure and link two realistic, simulated datasets generated by the pseudopeople Python package. These datasets reflect a fictional sample population of ~10,000 simulants living in Anytown, Washington, USA, but pseudopeople can also generate datasets about two larger fictional populations, one simulating the state of Rhode Island, and the other simulating the entire United States.
Here we will link Anytown's 2020 Decennial Census dataset to all years of its American Community Survey (ACS) dataset. Some people surveyed in the ACS, particularly in years other than 2020, will not be represented in the 2020 Census because they did not live in Anytown, or were not alive, when the 2020 Census was conducted - pseudopeople simulates people moving residence, being born, and dying. So, our aim will be to link as many as possible of the ACS respondents who are represented in the 2020 Census.
This tutorial is adapted from the Febrl4 linking example.
Configuring pseudopeople¶
pseudopeople is designed to generate realistic datasets which are challenging to link to one another in many of the same ways that actual datasets are challenging to link. This requires adding noise to the data in the form of various types of errors that occur in real data collection and entry. The frequencies of each type of noise in the dataset can be configured for each column.
Because the ACS dataset is small and therefore has fewer opportunities for noise to create linkage challenges, let's increase the noise for the race_ethnicity and last_name columns in both datasets from their default values. For race_ethnicity we increase the frequency with which respondents choose the wrong response option to that survey question, and for last_name we increase both the frequency of respondents typing their last names carelessly and the probability of a mistake on each character when typing carelessly. See here and here for more details.
import numpy as np
import pandas as pd
import pseudopeople as psp
pd.set_option("display.max_columns", 7)
config_census = {
    "decennial_census": {  # Dataset
        # "Choose the wrong option" is in the column-based noise category
        "column_noise": {
            "race_ethnicity": {  # Column
                "choose_wrong_option": {  # Noise type
                    "cell_probability": 0.3,  # Default = .01
                },
            },
            "last_name": {  # Column
                "make_typos": {  # Noise type
                    "cell_probability": 0.1,  # Default = .01
                    "token_probability": 0.15,  # Default = .10
                },
            },
        },
    },
}
config_acs = {
    "american_community_survey": {  # Dataset
        # "Choose the wrong option" is in the column-based noise category
        "column_noise": {
            "race_ethnicity": {  # Column
                "choose_wrong_option": {  # Noise type
                    "cell_probability": 0.3,  # Default = .01
                },
            },
            "last_name": {  # Column
                "make_typos": {  # Noise type
                    "cell_probability": 0.1,  # Default = .01
                    "token_probability": 0.15,  # Default = .10
                },
            },
        },
    },
}
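Since the two configurations differ only in the top-level dataset key, one optional tidy-up (a sketch, not required by pseudopeople) is to build them from a single shared noise specification:
shared_noise = {
    "column_noise": {
        "race_ethnicity": {"choose_wrong_option": {"cell_probability": 0.3}},
        "last_name": {
            "make_typos": {"cell_probability": 0.1, "token_probability": 0.15}
        },
    }
}
config_census = {"decennial_census": shared_noise}
config_acs = {"american_community_survey": shared_noise}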
Exploring the data¶
Next, let's get the data ready for Splink. The Census data has about 10,000 rows, while the ACS data only has around 200. Note that each dataset has a column called simulant_id, which uniquely identifies a simulated person in our fictional population. The simulant_id is consistent across datasets and can be used to check the accuracy of our model. Because it represents the truth, and we wouldn't have it in a real-life linkage task, we will not use it for blocking, comparisons, or any other part of our model, except to check the accuracy of our predictions at the end.
census = psp.generate_decennial_census(config=config_census)
acs = psp.generate_american_community_survey(
    config=config_acs, year=None
)  # generate all years of data with year=None
# Make a year column for ACS so we can combine it with the Census column
# for exploratory analysis charts
acs["year"] = acs.survey_date.dt.year
acs = acs.drop("survey_date", axis=1)
# uniquely identify each row in the datasets (regardless of simulant_id)
census["id"] = census.index
acs["id"] = len(census) + acs.index
# remove ACS respondents born after 2020 (because they would not be in the 2020 Census)
census["age_in_2020"] = pd.to_numeric(census.age, errors="coerce")
acs["age_in_2020"] = pd.to_numeric(acs.age, errors="coerce") - (acs.year - 2020)
acs = acs[acs.age_in_2020 >= 0]
# create row ID to simulant_id lookup table
census_ids = census[["id", "simulant_id"]]
acs_ids = acs[["id", "simulant_id"]]
simulant_lookup = pd.concat([census_ids, acs_ids]).set_index("id")
def prepare_data(data):
    # concatenate address fields, setting the new field to NaN
    # if any address fields besides unit number are missing
    unit_number_no_na = data.unit_number.fillna("")
    columns_to_concat = [
        "street_number",
        "street_name",
        "city",
        "state",
        "zipcode",
    ]
    has_addr_nan = data[columns_to_concat].isna().any(axis=1)
    address_data = data[columns_to_concat].astype(str)
    address_data["unit_number"] = unit_number_no_na
    data["address"] = address_data.agg(" ".join, axis=1).str.strip()
    data.loc[has_addr_nan, "address"] = np.nan
    return data
dfs = [prepare_data(dataset) for dataset in [census, acs]]
dfs[0] # Census
 | simulant_id | household_id | first_name | ... | id | age_in_2020 | address |
---|---|---|---|---|---|---|---|
0 | 0_2 | 0_7 | Diana | ... | 0 | 25.0 | 5112 145th st Anytown WA 00000 |
1 | 0_3 | 0_7 | Anna | ... | 1 | 25.0 | 5112 145th st Anytown WA 00000 |
2 | 0_923 | 0_8033 | Gerald | ... | 2 | 76.0 | 1130 mallory ln Anytown WA 00000 |
3 | 0_2641 | 0_1066 | Loretta | ... | 3 | 61.0 | NaN |
4 | 0_2801 | 0_1138 | Richard | ... | 4 | 73.0 | 950 caribou lane Anytown WA 00000 |
... | ... | ... | ... | ... | ... | ... | ... |
10226 | 0_11994 | 0_8051 | Lauren | ... | 10226 | 17.0 | 3304 ethan allen way Anytown WA 00000 unit 200 |
10227 | 0_19693 | 0_6152 | Johana | ... | 10227 | 20.0 | 1095 ernst st Anytown WA 00000 |
10228 | 0_19556 | 0_2064 | Benjamin | ... | 10228 | 19.0 | 2002 203rd pl se Anytown WA 00000 |
10229 | 0_19579 | 0_1802 | Brielle | ... | 10229 | 19.0 | 233 saint peters road Anytown WA 00000 |
10230 | 0_19666 | 0_881 | Kyle | ... | 10230 | 19.0 | 224 s moraine st Anytown WA 00000 |
10231 rows × 21 columns
dfs[1] # ACS
 | simulant_id | household_id | first_name | ... | id | age_in_2020 | address |
---|---|---|---|---|---|---|---|
0 | 0_10873 | 0_4411 | Betty | ... | 10231 | 87.0 | 2403 magnolia park rd Anytown WA 00000 |
1 | 0_10344 | 0_4207 | Dina | ... | 10232 | 47.0 | 4826 stone ridge ln Anytown WA 00000 |
2 | 0_10345 | 0_4207 | Fiona | ... | 10233 | 13.0 | 4826 stone ridge ln Anytown WA 00000 |
3 | 0_10346 | 0_4207 | Molly | ... | 10234 | 9.0 | 4826 stone ridge ln Anytown WA 00000 |
4 | 0_10347 | 0_4207 | Daniel | ... | 10235 | 19.0 | 4826 stone ridge ln Anytown WA 00000 |
... | ... | ... | ... | ... | ... | ... | ... |
206 | 0_18143 | 0_14148 | Bentley | ... | 10437 | 7.0 | 2040 magnolia court Anytown WA 00000 |
207 | 0_24866 | 0_14148 | Jordan | ... | 10438 | 25.0 | 2040 magnolia court Anytown WA 00000 |
208 | 0_18238 | 0_14832 | Sarah | ... | 10439 | 32.0 | NaN |
209 | 0_9712 | 0_15932 | Mom | ... | 10440 | 65.0 | 3619 hanging moss loop Anytown WA 00000 |
210 | 0_12239 | 0_14049 | Sherry | ... | 10441 | 48.0 | 212a apple blossom dr Anytown WA 00000 |
201 rows × 21 columns
Because we are using all years of ACS surveys, and only the 2020 Decennial Census, there will be ACS respondents who were not in the 2020 Census because they moved away from Anytown before it was conducted, or moved to Anytown afterwards. Let's check how many of these there are and display some of them.
not_in_census = acs[~acs.simulant_id.isin(census.simulant_id)]
display(f"{len(not_in_census)} ACS simulants not in 2020 Census")
not_in_census.head(5)
'45 ACS simulants not in 2020 Census'
 | simulant_id | household_id | first_name | ... | id | age_in_2020 | address |
---|---|---|---|---|---|---|---|
4 | 0_10347 | 0_4207 | Daniel | ... | 10235 | 19.0 | 4826 stone ridge ln Anytown WA 00000 |
16 | 0_2713 | 0_14807 | Kristi | ... | 10247 | 36.0 | 6045 s patterson pl Anytown WA 00000 |
73 | 0_19708 | 0_3 | Ariana | ... | 10304 | 20.0 | 8203 west farwell avenue Anytown WA 00000 |
93 | 0_16379 | 0_2246 | Gregory | ... | 10324 | 56.0 | 12679 kingston ave Anytown WA 00000 |
101 | 0_1251 | 0_9630 | NaN | ... | 10332 | 16.0 | 17947 newman dr Anytown WA 00000 |
5 rows × 21 columns
Next, to better understand which variables will prove useful in linking, we have a look at how populated each column is, as well as the distribution of unique values within each.
It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing.
from splink import DuckDBAPI
from splink.exploratory import completeness_chart
completeness_chart(
    dfs,
    db_api=DuckDBAPI(),
    table_names_for_chart=["census", "acs"],
    cols=[
        "age_in_2020",
        "last_name",
        "sex",
        "middle_initial",
        "date_of_birth",
        "race_ethnicity",
        "first_name",
        "address",
        "street_number",
        "street_name",
        "unit_number",
        "city",
        "state",
        "zipcode",
    ],
)
from splink.exploratory import profile_columns
profile_columns(dfs, db_api=DuckDBAPI())
You will notice that six addresses have many more simulants living at them than the others. Simulants in our fictional population may live either in a residential household or in group quarters (GQ). pseudopeople models six types of GQ establishment: three institutional (carceral, nursing homes, and other institutional) and three non-institutional (college, military, and other non-institutional). Each of these six addresses corresponds to one of these GQ "households".
In the ACS data there are 102 unique street names (including typos), with 58 residents of the West Farwell Avenue GQ represented.
acs.street_name.value_counts()
street_name
west farwell avenue 58
grove street 6
n holman st 4
stone ridge ln 4
glenview rd 3
..
ecton la 1
w 47th st 1
kelton ave 1
montgomery street 1
south parker road 1
Name: count, Length: 102, dtype: int64
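For comparison, we can look at the most common full addresses in the Census data - assuming no residential address houses more simulants than a GQ establishment, the top six should be the six GQ addresses described above (a rough check, not part of the original workflow):
census.address.value_counts().head(6)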
Defining the model¶
Next, let's come up with some candidate blocking rules, which define the record comparisons to generate, and have a look at how many comparisons each rule will generate.
For blocking rules that we use in prediction, our aim is to have the union of all rules cover all true matches, whilst avoiding generating so many comparisons that the problem becomes computationally intractable - i.e. each true match should satisfy at least one of the rules below.
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

blocking_rules = [
    block_on("first_name"),
    block_on("last_name"),
    block_on("date_of_birth"),
    block_on("street_name"),
    block_on("age_in_2020", "sex"),
    block_on("age_in_2020", "race_ethnicity"),
    block_on("age_in_2020", "middle_initial"),
    block_on("street_number"),
    block_on("middle_initial", "sex", "race_ethnicity"),
]
db_api = DuckDBAPI()
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=dfs,
    blocking_rules=blocking_rules,
    db_api=db_api,
    link_type="link_only",
    unique_id_column_name="id",
    source_dataset_column_name="source_dataset",
)
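Because we happen to know the true simulant_id, we can also sanity-check the coverage aim above with a rough pandas sketch (not part of a normal Splink workflow, where the truth would be unavailable): for each true Census-ACS pair, does at least one of the blocked-on field combinations match exactly?
# Sketch: fraction of true Census-ACS pairs that agree exactly on at least
# one of the field combinations used in the blocking rules above
true_pairs = census.merge(acs, on="simulant_id", suffixes=("_l", "_r"))
rule_columns = [
    ["first_name"],
    ["last_name"],
    ["date_of_birth"],
    ["street_name"],
    ["age_in_2020", "sex"],
    ["age_in_2020", "race_ethnicity"],
    ["age_in_2020", "middle_initial"],
    ["street_number"],
    ["middle_initial", "sex", "race_ethnicity"],
]
covered = pd.Series(False, index=true_pairs.index)
for cols in rule_columns:
    rule_match = pd.Series(True, index=true_pairs.index)
    for col in cols:
        rule_match &= true_pairs[f"{col}_l"].notna() & (
            true_pairs[f"{col}_l"] == true_pairs[f"{col}_r"]
        )
    covered |= rule_match
print(f"{covered.mean():.1%} of true matches satisfy at least one blocking rule")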
For columns like age_in_2020 that would create too many comparisons, we combine them with one or more other columns which would also create too many comparisons if blocked on alone.
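As a rough back-of-the-envelope check (a pandas sketch, not a Splink API), we can count the Census × ACS candidate pairs a rule would generate by summing, over each shared value, the product of its Census and ACS counts:
def pair_count(cols):
    # number of cross-dataset pairs produced by blocking on `cols`:
    # for each combination of values, census count * acs count
    n_census = census.groupby(cols).size().rename("n_census")
    n_acs = acs.groupby(cols).size().rename("n_acs")
    joined = pd.concat([n_census, n_acs], axis=1, join="inner")
    return int((joined.n_census * joined.n_acs).sum())

print(pair_count(["age_in_2020"]))  # blocking on age alone
print(pair_count(["age_in_2020", "sex"]))  # combining with sex generates fewer pairs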
Now we define our model's settings, specifying the blocking rules along with the comparisons we will include in our model.
To estimate probability_two_random_records_match, we will make the assumption that everyone in the ACS is also in the Census - we know from our check using the "true" simulant IDs above that this is not the case, but it is a good approximation. Depending on our knowledge of the dataset, we might be able to get a more accurate value by defining a set of deterministic matching rules and a guess of the number of matches captured by those rules.
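As a quick sanity check on that prior (a sketch, not part of the model definition): under this assumption there are len(acs) matches among len(census) * len(acs) possible cross-dataset pairs, so the value passed to SettingsCreator below reduces to 1 / len(census):
prior = len(dfs[1]) / (len(dfs[0]) * len(dfs[1]))  # value used below
print(prior, 1 / len(dfs[0]))  # identical: roughly 1e-4 for ~10,000 Census rows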
import splink.comparison_library as cl
from splink import Linker, SettingsCreator
settings = SettingsCreator(
    unique_id_column_name="id",
    link_type="link_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.NameComparison("first_name", jaro_winkler_thresholds=[0.9]).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("middle_initial").configure(term_frequency_adjustments=True),
        cl.NameComparison("last_name", jaro_winkler_thresholds=[0.9]).configure(
            term_frequency_adjustments=True
        ),
        cl.DamerauLevenshteinAtThresholds(
            "date_of_birth", distance_threshold_or_thresholds=[1]
        ),
        cl.DamerauLevenshteinAtThresholds("address").configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("sex"),
        cl.ExactMatch("race_ethnicity").configure(term_frequency_adjustments=True),
    ],
    retain_intermediate_calculation_columns=True,
    probability_two_random_records_match=len(dfs[1]) / (len(dfs[0]) * len(dfs[1])),
)
linker = Linker(
    dfs, settings, db_api=DuckDBAPI(), input_table_aliases=["census", "acs"]
)
Estimating model parameters¶
Next we estimate u and m values for each comparison, so that we can move to generating predictions.
# We generally recommend setting max pairs higher (e.g. 1e7 or more)
# But this will run faster for the purpose of this demo
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- first_name (no m values are trained).
- middle_initial (no m values are trained).
- last_name (no m values are trained).
- date_of_birth (no m values are trained).
- address (no m values are trained).
- sex (no m values are trained).
- race_ethnicity (no m values are trained).
When training the m values using expectation maximisation, we need some more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches, nor too few.
We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. date_of_birth, then we cannot compute the m values for the date_of_birth comparison, as we have only looked at records where these match.
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("date_of_birth"), estimate_without_term_frequencies=True
)
session_ln = linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("last_name"), estimate_without_term_frequencies=True
)
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."date_of_birth" = r."date_of_birth"
Parameter estimates will be made for the following comparison(s):
- first_name
- middle_initial
- last_name
- address
- sex
- race_ethnicity
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- date_of_birth
Iteration 1: Largest change in params was -0.439 in the m_probability of race_ethnicity, level `Exact match on race_ethnicity`
Iteration 2: Largest change in params was 0.00823 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.000295 in probability_two_random_records_match
Iteration 4: Largest change in params was 2.96e-05 in the m_probability of first_name, level `All other comparisons`
EM converged after 4 iterations
Your model is not yet fully trained. Missing estimates for:
- date_of_birth (no m values are trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."last_name" = r."last_name"
Parameter estimates will be made for the following comparison(s):
- first_name
- middle_initial
- date_of_birth
- address
- sex
- race_ethnicity
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- last_name
Iteration 1: Largest change in params was 0.0149 in the m_probability of address, level `Exact match on address`
Iteration 2: Largest change in params was 0.00255 in the m_probability of date_of_birth, level `All other comparisons`
Iteration 3: Largest change in params was 0.0011 in the m_probability of first_name, level `All other comparisons`
Iteration 4: Largest change in params was 0.000561 in the m_probability of first_name, level `All other comparisons`
Iteration 5: Largest change in params was 0.000278 in the m_probability of first_name, level `All other comparisons`
Iteration 6: Largest change in params was -0.000138 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
Iteration 7: Largest change in params was -6.83e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
EM converged after 7 iterations
Your model is fully trained. All comparisons have at least one estimate for their m and u values
If we wish, we can have a look at how our parameter estimates changed over these training sessions.
session_dob.m_u_values_interactive_history_chart()
session_ln.m_u_values_interactive_history_chart()
For variables that aren't used in the m-training blocking rules, we have two estimates - one from each of the training sessions (see for example address). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.
linker.visualisations.parameter_estimate_comparisons_chart()
We can now visualise some of the details of our model. We can look at the match weights, which tell us the relative evidence for or against a match for each of our comparison levels.
linker.visualisations.match_weights_chart()
As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual u and m parameter estimates, which tell us about the prevalence of coincidences and mistakes (for further details/explanation about this see this article). We might want to revise aspects of our model based on the information we ascertain here.
Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.
linker.visualisations.m_u_parameters_chart()
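For reference, each comparison level's match weight is simply the log (base 2) of the Bayes factor m / u. A minimal sketch with made-up values (not taken from the trained model):
import math

m = 0.9  # hypothetical P(level observed | records are a match)
u = 0.01  # hypothetical P(level observed | records are not a match)

bayes_factor = m / u  # evidence in favour of a match
match_weight = math.log2(bayes_factor)
print(bayes_factor, match_weight)  # 90.0, ~6.49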
It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out by seeing whether records are able to be matched with themselves.
We have low column missingness, so almost all of our records are linkable for almost all match thresholds.
linker.evaluation.unlinkables_chart()
Making predictions and evaluating results¶
predictions = linker.inference.predict() # include all match_probabilities
df_predictions = predictions.as_pandas_dataframe()
pd.set_option("display.max_columns", None)
columns_to_show = [
    "match_probability",
    "first_name_l",
    "first_name_r",
    "last_name_l",
    "last_name_r",
    "date_of_birth_l",
    "date_of_birth_r",
    "address_l",
    "address_r",
]
df_predictions[columns_to_show]
Blocking time: 0.10 seconds
Predict time: 2.29 seconds
 | match_probability | first_name_l | first_name_r | last_name_l | last_name_r | date_of_birth_l | date_of_birth_r | address_l | address_r |
---|---|---|---|---|---|---|---|---|---|
0 | 1.000000e+00 | Clarence | Clarence | Weinmann | Weinmann | 12/31/1959 | 12/31/1959 | 2540 grand river boulevard east Anytown WA 00000 | 2540 grand river boulevard east Anytown WA 00000 |
1 | 1.989803e-05 | Mark | Mark | Witczak | Leath | 53/10/1954 | 12/11/1962 | 84 nichole dr Anytown WA 00000 | 2044 heyden Anytiwn WA 00000 |
2 | 5.711622e-06 | Edward | Edward | Svenson | None | 12/23/1961 | 02/05/1957 | 20 oakbridge py Anytown WA 00000 | 1129 sanford ave Anytown WA 00000 |
3 | 1.945598e-06 | Joyce | Joyce | Janz | Kea | 08/21/1944 | 12/12/1964 | 8203 west farwell avenue Anytown WA 00500 | 8603 nw 302nd st Anytown WA 00000 |
4 | 1.848822e-05 | Joseph | Joseph | Samuel | Hunt | 04/04/1982 | 11/14/1966 | 8203 west farwell avenue Anytown WA 00000 | 69 s hwy 701 Anytown WA 00000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
51286 | 1.506998e-07 | Karla | Tracy | Broughman | Killoran | 05/15/1973 | 07/23/1969 | 5101 garden st ext Anytown WA 00000 | 25846 s 216th ave Anytown WA 00000 |
51287 | 1.506998e-07 | Karla | Jacqueline | Broughman | Day | 05/15/1973 | 02/03/1930 | 5101 garden st ext Anytown WA 00000 | 113 east 219 street Anytown WA 00000 |
51288 | 1.506998e-07 | Karla | Jessica | Broughman | Marrin | 05/15/1973 | 10/05/1981 | 5101 garden st ext Anytown WA 00000 | 122 rue royale Anytown WA 00000 |
51289 | 1.506998e-07 | Karla | Ruth | Broughman | White | 05/15/1973 | 01/18/1943 | 5101 garden st ext Anytown WA 00000 | 1231 riverside dr Anytown WA 00000 |
51290 | 1.506998e-07 | Karla | Lauren | Broughman | Jacinto | 05/15/1973 | 06/15/1991 | 5101 garden st ext Anytown WA 00000 | 273 carolyn drive Anytown WA 00000 |
51291 rows × 9 columns
We can see how our model performs at different probability thresholds, with a couple of options depending on whether we prefer to view results on the match weight or the match probability scale. The chart below shows that at a match weight of around 6.6, or a match probability of around 99%, our model has 0 false positives and 2 false negatives.
linker.evaluation.accuracy_analysis_from_labels_column(
    "simulant_id", output_type="accuracy"
)
Blocking time: 0.09 seconds
Predict time: 2.31 seconds
And we can easily see how many individuals we identify and link by looking at clusters generated at some threshold match probability of interest - let's choose 99% again for this example.
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.99
)
df_clusters = clusters.as_pandas_dataframe().sort_values("cluster_id")
df_clusters.groupby("cluster_id").size().value_counts()
Completed iteration 1, num representatives needing updating: 0
1 10124
2 151
3 2
Name: count, dtype: int64
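The two clusters of size 3 are worth a quick look - for a Census-to-ACS link they could be, for example, a simulant surveyed in more than one ACS year. A sketch for pulling them out (assuming the input columns are carried through to the cluster table):
cluster_sizes = df_clusters.groupby("cluster_id").size()
big_clusters = cluster_sizes[cluster_sizes > 2].index
df_clusters[df_clusters.cluster_id.isin(big_clusters)][
    ["cluster_id", "simulant_id", "first_name", "last_name", "date_of_birth"]
]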
There are a couple of interactive visualizations which can be useful for understanding and evaluating results.
from IPython.display import IFrame
linker.visualisations.cluster_studio_dashboard(
    predictions,
    clusters,
    "cluster_studio.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)
# You can view the cluster_studio.html file in your browser,
# or inline in a notebook as follows
IFrame(src="./cluster_studio.html", width="100%", height=1000)
linker.visualisations.comparison_viewer_dashboard(
    predictions, "scv.html", overwrite=True
)
IFrame(src="./scv.html", width="100%", height=1000)
In this example we know what the true links are, so we can also manually inspect the ones with the lowest match weights to see what our model is not capturing - i.e. where we have false negatives.
Similarly, we can look at the non-links with the highest match weights, to see whether we have an issue with false positives.
Ordinarily we would not have this luxury, and so would need to dig a bit deeper for clues as to how to improve our model, such as manually inspecting records across threshold probabilities.
# add simulant_id_l and simulant_id_r columns by looking up
# id_l and id_r in the simulant id lookup table
df_predictions = pd.merge(
    df_predictions, simulant_lookup, left_on="id_l", right_on="id", how="left"
)
df_predictions = df_predictions.rename(columns={"simulant_id": "simulant_id_l"})
df_predictions = pd.merge(
    df_predictions, simulant_lookup, left_on="id_r", right_on="id", how="left"
)
df_predictions = df_predictions.rename(columns={"simulant_id": "simulant_id_r"})

# sort links by lowest match_probability to see if we missed any
links = df_predictions[
    df_predictions["simulant_id_l"] == df_predictions["simulant_id_r"]
].sort_values("match_weight")

# sort nonlinks by highest match_probability to see if we wrongly matched any
nonlinks = df_predictions[
    df_predictions["simulant_id_l"] != df_predictions["simulant_id_r"]
].sort_values("match_weight", ascending=False)
links[columns_to_show]
 | match_probability | first_name_l | first_name_r | last_name_l | last_name_r | date_of_birth_l | date_of_birth_r | address_l | address_r |
---|---|---|---|---|---|---|---|---|---|
4928 | 0.405903 | Rmily | Emily | Yancey | Yancey | 12/24/1992 | 12/24/1993 | 8 kline st Anytown WA 00000 ap # 97 | 610 n 54th ln Anytown WA 00000 |
848 | 0.449588 | Benjamin | Benjamin | Allen | Allen | 02/26/2001 | 02/26/Z001 | 8203 west farwell avenue Anytown WA 00000 | 2002 203rd pl se Anytown WA 00000 |
6040 | 0.989385 | Fxigy | Faigy | Wallace | Wqolace | 05/18/1980 | 05/18/1980 | 12679 kingston ave Anytown WA 00000 | 12679 kingston ave Anytown WA 00000 |
5567 | 0.993288 | Harold | Harry | Thomas | Thomas | 09/20/1982 | 09/20/1982 | None | 8203 west farwell avenue Anytown WA 00000 |
1360 | 0.993489 | Mark | Mark | Witczak | Witczak | 53/10/1954 | 03/11/1954 | 84 nichole dr Anytown WA 00000 | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
0 | 1.000000 | Clarence | Clarence | Weinmann | Weinmann | 12/31/1959 | 12/31/1959 | 2540 grand river boulevard east Anytown WA 00000 | 2540 grand river boulevard east Anytown WA 00000 |
65 | 1.000000 | Eniya | Eniya | Cameron | Cameron | 03/20/2003 | 03/20/2003 | 617 cadillac st Anytown WA 00000 | 617 cadillac st Anytown WA 00000 |
111 | 1.000000 | Tonya | Tonya | Maiden | Maiden | 04/02/1975 | 04/02/1975 | 7214 wild plum ct Anytown WA 00000 | 7214 wild plum ct Anytown WA 00000 |
381 | 1.000000 | Charlene | Charlene | Semidey | Semidey | 06/03/1976 | 06/03/1976 | 7382 jamison dr Anytown WA 00000 | 7382 jamison dr Anytown WA 00000 |
56 | 1.000000 | Quinn | Quinn | Enright | Enright | 09/21/2008 | 09/21/2008 | 12319 northside pkwy nw Anytown WA 00000 | 12319 northside pkwy nw Anytown WA 00000 |
158 rows × 9 columns
nonlinks[columns_to_show]
 | match_probability | first_name_l | first_name_r | last_name_l | last_name_r | date_of_birth_l | date_of_birth_r | address_l | address_r |
---|---|---|---|---|---|---|---|---|---|
5931 | 9.539832e-01 | Francis | Angela | Turner | Turner | 05/12/1971 | 05/12/1971 | 1432 isaac place Anytown WA 00000 | 1432 isaac place Anytown WA 00000 |
4803 | 9.368626e-01 | Tonya | Ryan | Maiden | Maiden | 04/02/1975 | 03/01/2008 | 7214 wild plum ct Anytown WA 00000 | 7214 wild plum ct Anytown WA 00000 |
4821 | 9.147120e-01 | Charles | Lorraine | Estrada Canche | Estrada Canche | 01/06/1931 | 11/25/1934 | 1009 northwest topeka boulevard Anytown WA 000... | 1009 northwest topeka boulevard Anytown WA 000... |
4865 | 7.418094e-01 | David | Alex | Middleton | Middleton | 09/19/1967 | 03/29/2001 | 6634 beachplum way Anytown WA 00000 | 6634 beachplum way Anytown WA 00000 |
4968 | 7.418094e-01 | David | Frederick | Middleton | Middleton | 09/19/1967 | 07/23/1999 | 6634 beachplum way Anytown WA 00000 | 6634 beachplum way Anytown WA 00000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29651 | 2.164944e-12 | Susan | Justin | Winchell | Duarte | 05/18/1967 | 02/20/1986 | 301 downer st Anytown WA 00000 | 301 e calle gran desierto Anytown WA 00000 |
29646 | 2.164944e-12 | Liam | Stephanie | Donoso | Dorsey | 09/09/2012 | 04/20/1998 | 158 hamilton avenue Anytown WA 00000 | 158 blak mountain rd Anytown WA 00000 |
29644 | 2.164944e-12 | Gerard | Stephanie | Donoso | Dorsey | 05/28/1982 | 04/20/1998 | 158 hamilton avenue Anytown WA 00000 | 158 blak mountain rd Anytown WA 00000 |
29745 | 2.164944e-12 | Sandra | Ant | Arnt | Widmann | 09/16/1972 | 02/16/2013 | 18 85 street Anytown WA 00000 | 18 skyway Anytown WA 00000 |
29670 | 2.164944e-12 | Shanna | William | Donoso | Dorsfy | 08/31/1984 | 06/24/1966 | 158 hamilton avenue Anytoqn WA 00008 | 158 black mountain rd Anytown WA 00000 |
51133 rows × 9 columns
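Since we have the true simulant IDs, we can also tally the errors directly at a chosen threshold (a sketch; note it only counts pairs that survived blocking, so true matches missed by every blocking rule would not appear here):
threshold = 0.99
false_negatives = (links.match_probability < threshold).sum()
false_positives = (nonlinks.match_probability >= threshold).sum()
print(f"At a {threshold:.0%} threshold: {false_positives} false positives, {false_negatives} false negatives")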
From the false positive/negative chart we saw that our model does pretty well when we choose a high match threshold. The small size of the ACS dataset reduces the chances for noise to make linking difficult. For additional challenges, try using a larger dataset like W2, or further increasing the noise levels.
Note that not all columns used for comparisons are displayed in the tables above due to space. The waterfall charts below show some of the lowest match weight true links and highest match weight true nonlinks in more detail.
records_to_view = 5
linker.visualisations.waterfall_chart(
    links.head(records_to_view).to_dict(orient="records")
)
records_to_view = 5
linker.visualisations.waterfall_chart(
    nonlinks.head(records_to_view).to_dict(orient="records")
)
We may also wish to evaluate the effects of using term frequencies for some columns, such as address, by looking at examples of the values tf_address for both common and uncommon address values. Common addresses such as the GQ household, displayed first in the waterfall chart below, will have smaller (negative) values for tf_address, while uncommon addresses will have larger (positive) values.
linker.visualisations.waterfall_chart(
    # choose comparisons that have a term frequency adjustment for address
    df_predictions[df_predictions.bf_tf_adj_address != 1]
    .head(10)  # only display some of the first such comparisons
    .sort_values(
        "bf_tf_adj_address"
    )  # sort by lowest adjustment (common addresses) first
    .to_dict(orient="records")
)
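Finally, a quick way (a sketch using the columns retained by retain_intermediate_calculation_columns) to confirm that the most common addresses receive the smallest term-frequency adjustment Bayes factors, while rarer addresses are boosted:
(
    df_predictions.loc[
        df_predictions.bf_tf_adj_address != 1,
        ["address_l", "bf_tf_adj_address"],
    ]
    .drop_duplicates()
    .sort_values("bf_tf_adj_address")
    .head(5)  # smallest adjustments correspond to the most common addresses
)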