Linking financial transactions

Linking banking transactions¶

This example shows how to perform a one-to-one link on banking transactions.

The data is fake data, and was generated has the following features:

Money shows up in the destination account with some time delay
The amount sent and the amount received are not always the same - there are hidden fees and foreign exchange effects
The memo is sometimes truncated and content is sometimes missing

Since each origin payment should end up in the destination account, the probability_two_random_records_match of the model is known.

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import altair as alt

df_origin = splink_datasets.transactions_origin
df_destination = splink_datasets.transactions_destination

display(df_origin.head(2))
display(df_destination.head(2))

	ground_truth	memo	transaction_date	amount	unique_id
0	0	MATTHIAS C paym	2022-03-28	36.36	0
1	1	M CORVINUS dona	2022-02-14	221.91	1

	ground_truth	memo	transaction_date	amount	unique_id
0	0	MATTHIAS C payment BGC	2022-03-29	36.36	0
1	1	M CORVINUS BGC	2022-02-16	221.91	1

In the following chart, we can see this is a challenging dataset to link: - There are only 151 distinct transaction dates, with strong skew - Some 'memos' are used multiple times (up to 48 times) - There is strong skew in the 'amount' column, with 1,400 transactions of around 60.00

# Simple settings just for exploratory analysis
settings = {"link_type": "link_only"}
linker = DuckDBLinker([df_origin, df_destination], settings,input_table_aliases=["__ori", "_dest"])
linker.profile_columns(["transaction_date", "memo", "round(amount/5, 0)*5"])

# Design blocking rules that allow for differences in transaction date and amounts
from splink.duckdb.blocking_rule_library import block_on, and_

blocking_rule_date_1 = """
    strftime(l.transaction_date, '%Y%m') = strftime(r.transaction_date, '%Y%m')
    and substr(l.memo, 1,3) = substr(r.memo,1,3)
    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
"""

# Offset by half a month to ensure we capture case when the dates are e.g. 31st Jan and 1st Feb
blocking_rule_date_2 = """
    strftime(l.transaction_date+15, '%Y%m') = strftime(r.transaction_date, '%Y%m')
    and substr(l.memo, 1,3) = substr(r.memo,1,3)
    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
"""

blocking_rule_memo = block_on("substr(memo,1,9)")

blocking_rule_amount_1 = """
round(l.amount/2,0)*2 = round(r.amount/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date)
"""

blocking_rule_amount_2 = """
round(l.amount/2,0)*2 = round((r.amount+1)/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date + 4)
"""

blocking_rule_cheat = block_on("unique_id")


linker.cumulative_num_comparisons_from_blocking_rules_chart(
    [
        blocking_rule_date_1,
        blocking_rule_date_2,
        blocking_rule_memo,
        blocking_rule_amount_1,
        blocking_rule_amount_2,
        blocking_rule_cheat,
    ]
)

# Full settings for linking model
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_level_library as cll

comparison_amount = {
    "output_column_name": "amount",
    "comparison_levels": [
        cll.null_level("amount"),
        cll.exact_match_level("amount"),
        cll.percentage_difference_level("amount",0.01),
        cll.percentage_difference_level("amount",0.03),
        cll.percentage_difference_level("amount",0.1),
        cll.percentage_difference_level("amount",0.3),
        cll.else_level()
    ],
    "comparison_description": "Amount percentage difference",
}

settings = {
    "link_type": "link_only",
    "probability_two_random_records_match": 1 / len(df_origin),
    "blocking_rules_to_generate_predictions": [
        blocking_rule_date_1,
        blocking_rule_date_2,
        blocking_rule_memo,
        blocking_rule_amount_1,
        blocking_rule_amount_2,
        blocking_rule_cheat
    ],
    "comparisons": [
        comparison_amount,
        cl.jaccard_at_thresholds(
            "memo", [0.9, 0.7]
        ),
        cl.datediff_at_thresholds("transaction_date", 
                                date_thresholds = [1, 4, 10, 30],
                                date_metrics = ["day", "day", "day", "day"],
                                include_exact_match_level=False
                                )
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}

linker = DuckDBLinker([df_origin, df_destination], settings,input_table_aliases=["__ori", "_dest"])

linker.estimate_u_using_random_sampling(max_pairs=1e6)

----- Estimating u probabilities using random sampling -----



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - amount (no m values are trained).
    - memo (no m values are trained).
    - transaction_date (no m values are trained).

linker.estimate_parameters_using_expectation_maximisation(block_on("memo"))

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."memo" = r."memo"

Parameter estimates will be made for the following comparison(s):
    - amount
    - transaction_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - memo

Iteration 1: Largest change in params was -0.529 in the m_probability of amount, level `Exact match`
Iteration 2: Largest change in params was -0.237 in the m_probability of transaction_date, level `Within 1 day`
Iteration 3: Largest change in params was -0.0185 in the m_probability of transaction_date, level `Within 1 day`
Iteration 4: Largest change in params was 0.00459 in the m_probability of transaction_date, level `Within 30 days`
Iteration 5: Largest change in params was 0.00129 in the m_probability of transaction_date, level `Within 30 days`
Iteration 6: Largest change in params was 0.000332 in the m_probability of transaction_date, level `Within 30 days`
Iteration 7: Largest change in params was 8.32e-05 in the m_probability of transaction_date, level `Within 30 days`

EM converged after 7 iterations

Your model is not yet fully trained. Missing estimates for:
    - memo (no m values are trained).





<EMTrainingSession, blocking on l."memo" = r."memo", deactivating comparisons memo>

session = linker.estimate_parameters_using_expectation_maximisation(block_on("amount"))

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."amount" = r."amount"

Parameter estimates will be made for the following comparison(s):
    - memo
    - transaction_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - amount

Iteration 1: Largest change in params was -0.356 in the m_probability of memo, level `Exact match`
Iteration 2: Largest change in params was -0.0328 in the m_probability of memo, level `Exact match`
Iteration 3: Largest change in params was 0.0442 in the m_probability of memo, level `All other comparisons`
Iteration 4: Largest change in params was 0.0483 in the m_probability of memo, level `All other comparisons`
Iteration 5: Largest change in params was 0.0427 in the m_probability of memo, level `All other comparisons`
Iteration 6: Largest change in params was 0.0318 in the m_probability of memo, level `All other comparisons`
Iteration 7: Largest change in params was 0.0211 in the m_probability of memo, level `All other comparisons`
Iteration 8: Largest change in params was 0.0131 in the m_probability of memo, level `All other comparisons`
Iteration 9: Largest change in params was 0.00792 in the m_probability of memo, level `All other comparisons`
Iteration 10: Largest change in params was 0.00474 in the m_probability of memo, level `All other comparisons`
Iteration 11: Largest change in params was 0.00283 in the m_probability of memo, level `All other comparisons`
Iteration 12: Largest change in params was 0.00169 in the m_probability of memo, level `All other comparisons`
Iteration 13: Largest change in params was 0.00101 in the m_probability of memo, level `All other comparisons`
Iteration 14: Largest change in params was 0.000602 in the m_probability of memo, level `All other comparisons`
Iteration 15: Largest change in params was 0.00036 in the m_probability of memo, level `All other comparisons`
Iteration 16: Largest change in params was 0.000216 in the m_probability of memo, level `All other comparisons`
Iteration 17: Largest change in params was 0.000129 in the m_probability of memo, level `All other comparisons`
Iteration 18: Largest change in params was 7.74e-05 in the m_probability of memo, level `All other comparisons`

EM converged after 18 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values

linker.match_weights_chart()

df_predict = linker.predict(threshold_match_probability=0.001)

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

linker.comparison_viewer_dashboard(df_predict,"dashboards/comparison_viewer_transactions.html", overwrite=True)
from IPython.display import IFrame
IFrame(
    src="./dashboards/comparison_viewer_transactions.html", width="100%", height=1200
)

pred_errors =  linker.prediction_errors_from_labels_column("ground_truth", include_false_positives=True, include_false_negatives=False)
linker.waterfall_chart(pred_errors.as_record_dict(limit=5))

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

pred_errors =  linker.prediction_errors_from_labels_column("ground_truth", include_false_positives=False, include_false_negatives=True)
linker.waterfall_chart(pred_errors.as_record_dict(limit=5))