Types of Record Linkage

There are two main types of record linkage: Deterministic and Probabilistic.

Deterministic Linkage

Deterministic Linkage is a rules-based approach for joining records together.

For example, consider a single table with duplicates:

AID	Name	DOB	Postcode
A00001	Bob Smith	1990-05-09	AB12 3CD
A00002	Robert Smith	1990-05-09	AB12 3CD
A00003	Robert “Burglar Bob” Smith	1990-05-09	-

and some deterministic rules:

IF Name + DOB match (Rule 1), the records are a match
ELSE IF Forename + DOB + Postcode match (Rule 2), the records are a match
ELSE the records do not match

Applying these rules to the table above leads to no matches:

A00001-A00002 No match (different forename)
A00001-A00003 No match (different forename)
A00002-A00003 No match (missing postcode)

So, even a relatively simple dataset, with duplicates that are obvious to a human, will require more complex rules.
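The two rules above can be sketched in a few lines of Python to confirm that no pair matches. This is a minimal illustration, not how any particular tool implements deterministic linkage. It assumes that the forename is the first whitespace-separated token of the name and that a missing value never counts as agreement; the helper names `forename`, `equal_and_present` and `matching_rule` are hypothetical.

```python
from itertools import combinations

# Records from the table above; a missing postcode ("-") is stored as None.
records = [
    {"id": "A00001", "name": "Bob Smith", "dob": "1990-05-09", "postcode": "AB12 3CD"},
    {"id": "A00002", "name": "Robert Smith", "dob": "1990-05-09", "postcode": "AB12 3CD"},
    {"id": "A00003", "name": "Robert “Burglar Bob” Smith", "dob": "1990-05-09", "postcode": None},
]


def forename(name):
    # Assumption: the forename is the first whitespace-separated token.
    return name.split()[0]


def equal_and_present(a, b):
    # Assumption: a missing value never counts as agreement.
    return a is not None and b is not None and a == b


def matching_rule(left, right):
    """Return the name of the first rule that fires, or None if neither does."""
    dob_match = equal_and_present(left["dob"], right["dob"])
    # Rule 1: full name and date of birth agree.
    if dob_match and left["name"] == right["name"]:
        return "Rule 1"
    # Rule 2: forename, date of birth and postcode agree.
    if (
        dob_match
        and forename(left["name"]) == forename(right["name"])
        and equal_and_present(left["postcode"], right["postcode"])
    ):
        return "Rule 2"
    return None


# Evaluate every unordered pair of records.
results = {
    (left["id"], right["id"]): matching_rule(left, right)
    for left, right in combinations(records, 2)
}

for pair, rule in results.items():
    print(pair, rule or "No match")
```

Catching these duplicates would mean adding further rules, for example one that treats "Bob" as a variant of "Robert", which is where rule sets start to become hard to maintain.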

In general, Deterministic linkage is:

✅ Computationally cheap
✅ Capable of achieving high precision (few False Positives)

❌ Difficult to do systematically
❌ Difficult to optimise
❌ Lacking in subtlety
❌ Prone to low recall (many False Negatives)

Deterministic Linkage in Splink

While Splink is primarily a tool for Probabilistic linkage, it also supports Deterministic linkage (by means of blocking rules). See the example notebooks for how Deterministic linkage is implemented in Splink.

Probabilistic Linkage

Probabilistic Linkage is an evidence-based approach for joining records together.

Linkage is probabilistic in the sense that it relies on the balance of evidence. In a large dataset, observing that two records match on the full name 'Robert Smith' provides some evidence that these two records may refer to the same person, but this evidence is inconclusive. However, the cumulative evidence from across multiple features within the dataset (e.g. date of birth, home address, email address) can provide conclusive evidence of a match. The evidence for a match is commonly represented as a probability.
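One common way to accumulate evidence like this is to give each agreeing feature a Bayes factor and combine the factors with a prior, assuming the features are conditionally independent. The sketch below illustrates that idea only. Every number in it (the `m` and `u` probabilities and the prior) is a made-up value chosen for illustration, and the function name `match_probability` is hypothetical.

```python
import math

# Hypothetical (m, u) probabilities for features that agree on a record pair:
#   m = P(feature agrees | the records are a true match)
#   u = P(feature agrees | the records are not a match)
# These values are illustrative only, not estimated from any dataset.
agreeing_features = {
    "full_name": (0.90, 0.01),
    "date_of_birth": (0.95, 0.001),
    "email": (0.80, 0.0001),
}

# Hypothetical prior: 1 in 10,000 candidate pairs is a true match.
prior_probability = 1 / 10_000


def match_probability(features, prior):
    """Combine per-feature Bayes factors (m / u) with the prior odds."""
    log_odds = math.log2(prior / (1 - prior))
    for m, u in features.values():
        log_odds += math.log2(m / u)  # evidence contributed by this feature
    odds = 2 ** log_odds
    return odds / (1 + odds)


name_only = match_probability(
    {"full_name": agreeing_features["full_name"]}, prior_probability
)
all_features = match_probability(agreeing_features, prior_probability)

print(f"name only:    {name_only:.4f}")
print(f"all features: {all_features:.4f}")
```

With these illustrative values, agreement on name alone leaves the pair far more likely to be a non-match than a match, while agreement on all three features pushes the probability close to 1.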

For example, putting the first two records of the table above through a probabilistic model gives an overall probability that the records are a match.

In addition, the contribution of each feature's evidence to this probability can be shown in a waterfall chart.

Unlike Deterministic linkage, which gives a binary answer, these probabilities let the user choose an evidence threshold for what they consider a match before creating a new unique identifier.

This is important, as it allows the linkage to be customised to best support the specific use case. For example, if it is important to:

minimise False Positive matches (i.e. where False Negatives are less of a concern), a higher threshold for a match can be chosen.
maximise True Positive matches (i.e. where False Positives are less of a concern), a lower threshold can be chosen.
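The effect of the threshold can be sketched as a simple filter over scored pairs. The pair identifiers and match probabilities below are hypothetical values chosen for illustration, and `pairs_above` is a hypothetical helper rather than part of any library.

```python
# Hypothetical match probabilities for scored record pairs (illustrative only).
scored_pairs = [
    (("R1", "R2"), 0.99),
    (("R1", "R3"), 0.80),
    (("R2", "R3"), 0.60),
    (("R4", "R5"), 0.10),
]


def pairs_above(pairs, threshold):
    """Keep the pairs whose match probability meets the threshold."""
    return [pair for pair, probability in pairs if probability >= threshold]


# A high threshold accepts fewer pairs, favouring fewer False Positives.
strict = pairs_above(scored_pairs, 0.95)

# A low threshold accepts more pairs, favouring fewer False Negatives.
lenient = pairs_above(scored_pairs, 0.50)

print("strict: ", strict)
print("lenient:", lenient)
```

Every pair accepted at the stricter threshold is also accepted at the more lenient one, so lowering the threshold can only add matches, never remove them.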