Understanding and debugging Splink
Understanding and debugging Splink's computations¶
Splink contains tooling to help developers understand the underlying computations, how caching and pipelining is working, and debug problems.
There are two main mechanisms:
debug_mode, and setting different logging levels
You can turn on debug mode by setting
linker.debug_mode = True.
This has the following effects:
- Each step of Splink's calculations are executed in turn. That is, pipelining is switched off.
- The SQL statements being executed by Splink are displayed
- The results of the SQL statements are displayed in tabular format
This is probably the best way to understand each step of the calculations being performed by Splink - because a lot of the implementation gets 'hidden' within pipelines for performance reasons.
Note that enabling debug mode will dramatically reduce Splink's performance!
Splink has a range of logging modes that output information about what Splink is doing at different levels of verbosity.
Unlike debug mode, logging doesn't affect the performance of Splink.
You can set the logging level with code like
logging.getLogger("splink").setLevel(desired_level) although see notes below about gotyas.
The logging levels in Splink are:
20): This outputs user facing messages about the training status of Splink models
15: Outputs additional information about time taken and parameter estimation
10): Outputs information about the names of the SQL statements executed
7): Outputs information about the names of the components of the SQL pipelines
5): Outputs the SQL statements themselves
How to control logging¶
Note that by default Splink sets the logging level to
INFO on initialisation
With basic logging¶
import logging linker = DuckDBLinker(df, settings, set_up_basic_logging=False) # This must come AFTER the linker is intialised, because the logging level # will be set to INFO logging.getLogger("splink").setLevel(logging.DEBUG)
Without basic logging¶
# This code can be anywhere since set_up_basic_logging is False import logging logging.basicConfig(format="%(message)s") splink_logger = logging.getLogger("splink") splink_logger.setLevel(logging.INFO) linker = DuckDBLinker(df, settings, set_up_basic_logging=False)