Skip to content

Linking financial transactions

Linking banking transactionsΒΆ

This example shows how to perform a one-to-one link on banking transactions.

The data is fake data, and was generated has the following features:

  • Money shows up in the destination account with some time delay
  • The amount sent and the amount received are not always the same - there are hidden fees and foreign exchange effects
  • The memo is sometimes truncated and content is sometimes missing

Since each origin payment should end up in the destination account, the probability_two_random_records_match of the model is known.

from splink.datasets import splink_datasets
from splink.duckdb.linker import DuckDBLinker
import altair as alt

df_origin = splink_datasets.transactions_origin
df_destination = splink_datasets.transactions_destination

display(df_origin.head(2))
display(df_destination.head(2))
ground_truth memo transaction_date amount unique_id
0 0 MATTHIAS C paym 2022-03-28 36.36 0
1 1 M CORVINUS dona 2022-02-14 221.91 1
ground_truth memo transaction_date amount unique_id
0 0 MATTHIAS C payment BGC 2022-03-29 36.36 0
1 1 M CORVINUS BGC 2022-02-16 221.91 1

In the following chart, we can see this is a challenging dataset to link: - There are only 151 distinct transaction dates, with strong skew - Some 'memos' are used multiple times (up to 48 times) - There is strong skew in the 'amount' column, with 1,400 transactions of around 60.00

# Simple settings just for exploratory analysis
settings = {"link_type": "link_only"}
linker = DuckDBLinker([df_origin, df_destination], settings,input_table_aliases=["__ori", "_dest"])
linker.profile_columns(["transaction_date", "memo", "round(amount/5, 0)*5"])
# Design blocking rules that allow for differences in transaction date and amounts
from splink.duckdb.blocking_rule_library import block_on, and_

blocking_rule_date_1 = """
    strftime(l.transaction_date, '%Y%m') = strftime(r.transaction_date, '%Y%m')
    and substr(l.memo, 1,3) = substr(r.memo,1,3)
    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
"""

# Offset by half a month to ensure we capture case when the dates are e.g. 31st Jan and 1st Feb
blocking_rule_date_2 = """
    strftime(l.transaction_date+15, '%Y%m') = strftime(r.transaction_date, '%Y%m')
    and substr(l.memo, 1,3) = substr(r.memo,1,3)
    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
"""

blocking_rule_memo = block_on("substr(memo,1,9)")

blocking_rule_amount_1 = """
round(l.amount/2,0)*2 = round(r.amount/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date)
"""

blocking_rule_amount_2 = """
round(l.amount/2,0)*2 = round((r.amount+1)/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date + 4)
"""

blocking_rule_cheat = block_on("unique_id")


linker.cumulative_num_comparisons_from_blocking_rules_chart(
    [
        blocking_rule_date_1,
        blocking_rule_date_2,
        blocking_rule_memo,
        blocking_rule_amount_1,
        blocking_rule_amount_2,
        blocking_rule_cheat,
    ]
)
# Full settings for linking model
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_level_library as cll

comparison_amount = {
    "output_column_name": "amount",
    "comparison_levels": [
        cll.null_level("amount"),
        cll.exact_match_level("amount"),
        cll.percentage_difference_level("amount",0.01),
        cll.percentage_difference_level("amount",0.03),
        cll.percentage_difference_level("amount",0.1),
        cll.percentage_difference_level("amount",0.3),
        cll.else_level()
    ],
    "comparison_description": "Amount percentage difference",
}

settings = {
    "link_type": "link_only",
    "probability_two_random_records_match": 1 / len(df_origin),
    "blocking_rules_to_generate_predictions": [
        blocking_rule_date_1,
        blocking_rule_date_2,
        blocking_rule_memo,
        blocking_rule_amount_1,
        blocking_rule_amount_2,
        blocking_rule_cheat
    ],
    "comparisons": [
        comparison_amount,
        cl.jaccard_at_thresholds(
            "memo", [0.9, 0.7]
        ),
        cl.datediff_at_thresholds("transaction_date", 
                                date_thresholds = [1, 4, 10, 30],
                                date_metrics = ["day", "day", "day", "day"],
                                include_exact_match_level=False
                                )
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}
linker = DuckDBLinker([df_origin, df_destination], settings,input_table_aliases=["__ori", "_dest"])
linker.estimate_u_using_random_sampling(max_pairs=1e6)
----- Estimating u probabilities using random sampling -----



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))



Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - amount (no m values are trained).
    - memo (no m values are trained).
    - transaction_date (no m values are trained).
linker.estimate_parameters_using_expectation_maximisation(block_on("memo"))
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."memo" = r."memo"

Parameter estimates will be made for the following comparison(s):
    - amount
    - transaction_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - memo

Iteration 1: Largest change in params was -0.529 in the m_probability of amount, level `Exact match`
Iteration 2: Largest change in params was -0.237 in the m_probability of transaction_date, level `Within 1 day`
Iteration 3: Largest change in params was -0.0185 in the m_probability of transaction_date, level `Within 1 day`
Iteration 4: Largest change in params was 0.00459 in the m_probability of transaction_date, level `Within 30 days`
Iteration 5: Largest change in params was 0.00129 in the m_probability of transaction_date, level `Within 30 days`
Iteration 6: Largest change in params was 0.000332 in the m_probability of transaction_date, level `Within 30 days`
Iteration 7: Largest change in params was 8.32e-05 in the m_probability of transaction_date, level `Within 30 days`

EM converged after 7 iterations

Your model is not yet fully trained. Missing estimates for:
    - memo (no m values are trained).





<EMTrainingSession, blocking on l."memo" = r."memo", deactivating comparisons memo>
session = linker.estimate_parameters_using_expectation_maximisation(block_on("amount"))
----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."amount" = r."amount"

Parameter estimates will be made for the following comparison(s):
    - memo
    - transaction_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - amount

Iteration 1: Largest change in params was -0.356 in the m_probability of memo, level `Exact match`
Iteration 2: Largest change in params was -0.0328 in the m_probability of memo, level `Exact match`
Iteration 3: Largest change in params was 0.0442 in the m_probability of memo, level `All other comparisons`
Iteration 4: Largest change in params was 0.0483 in the m_probability of memo, level `All other comparisons`
Iteration 5: Largest change in params was 0.0427 in the m_probability of memo, level `All other comparisons`
Iteration 6: Largest change in params was 0.0318 in the m_probability of memo, level `All other comparisons`
Iteration 7: Largest change in params was 0.0211 in the m_probability of memo, level `All other comparisons`
Iteration 8: Largest change in params was 0.0131 in the m_probability of memo, level `All other comparisons`
Iteration 9: Largest change in params was 0.00792 in the m_probability of memo, level `All other comparisons`
Iteration 10: Largest change in params was 0.00474 in the m_probability of memo, level `All other comparisons`
Iteration 11: Largest change in params was 0.00283 in the m_probability of memo, level `All other comparisons`
Iteration 12: Largest change in params was 0.00169 in the m_probability of memo, level `All other comparisons`
Iteration 13: Largest change in params was 0.00101 in the m_probability of memo, level `All other comparisons`
Iteration 14: Largest change in params was 0.000602 in the m_probability of memo, level `All other comparisons`
Iteration 15: Largest change in params was 0.00036 in the m_probability of memo, level `All other comparisons`
Iteration 16: Largest change in params was 0.000216 in the m_probability of memo, level `All other comparisons`
Iteration 17: Largest change in params was 0.000129 in the m_probability of memo, level `All other comparisons`
Iteration 18: Largest change in params was 7.74e-05 in the m_probability of memo, level `All other comparisons`

EM converged after 18 iterations

Your model is fully trained. All comparisons have at least one estimate for their m and u values
linker.match_weights_chart()
df_predict = linker.predict(threshold_match_probability=0.001)
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
linker.comparison_viewer_dashboard(df_predict,"dashboards/comparison_viewer_transactions.html", overwrite=True)
from IPython.display import IFrame
IFrame(
    src="./dashboards/comparison_viewer_transactions.html", width="100%", height=1200
)

pred_errors =  linker.prediction_errors_from_labels_column("ground_truth", include_false_positives=True, include_false_negatives=False)
linker.waterfall_chart(pred_errors.as_record_dict(limit=5))
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
pred_errors =  linker.prediction_errors_from_labels_column("ground_truth", include_false_positives=False, include_false_negatives=True)
linker.waterfall_chart(pred_errors.as_record_dict(limit=5))