Enhancing the Settings Validator

Overview of Current Validation Checks

Below is a summary of the key validation checks currently implemented by our settings validator. For detailed information, please refer to the source code:

  • Blocking Rules and Comparison Levels Validation: Ensures that the user's blocking rules and comparison levels have been imported from the correct library, and that they contain the information necessary for use within Splink.
  • Column Existence Verification: Verifies the presence of columns specified in the user’s settings across all input dataframes, preventing errors due to missing data fields.
  • Miscellaneous Checks: A range of additional checks designed to produce clear and informative error messages when a user deviates from typical Splink usage.

Extending Validation Logic

If you are introducing new validation checks that deviate from the existing ones, please incorporate them as functions within a new script located in the splink/settings_validation directory. This ensures that all validation logic is centrally managed and easily maintainable.


Error handling and logging

Error handling and logging in the settings validator takes the following forms:

  • Logging at the INFO level - These logs are emitted when the settings validator detects an issue with the user's settings dictionary. They are intended to tell the user how to rectify the issue, but should not halt the program.
  • Raising single exceptions - Raise a built-in Python or Splink exception in response to finding an error.
  • Concurrently raising multiple exceptions - In some instances, it makes sense to raise multiple errors simultaneously, so the user receives a complete picture of what needs fixing rather than one error at a time. This is achieved using the ErrorLogger class.

The first two use standard Python logging and exception handling. The third is a custom class, covered in more detail below.

Use whichever approach makes the most sense for your requirements.
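
As a rough illustration of the first two forms, a check might log at the INFO level for a recoverable issue and raise a single exception for a fatal one. The check functions and messages below are hypothetical sketches; only the logging and exception mechanics are representative:

import logging

logger = logging.getLogger(__name__)

def warn_if_no_blocking_rules(settings_dict):
    # Recoverable issue: inform the user, but do not halt the program
    if not settings_dict.get("blocking_rules_to_generate_predictions"):
        logger.info(
            "No blocking rules were found in your settings dictionary. "
            "Predictions may be slow to generate without them."
        )

def check_settings_is_a_dictionary(settings_dict):
    # Fatal issue: raise a single built-in exception
    if not isinstance(settings_dict, dict):
        raise TypeError(
            f"The settings object must be a dictionary, "
            f"not {type(settings_dict).__name__}."
        )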

Raising multiple exceptions concurrently

Raising multiple exceptions simultaneously provides users with faster and more manageable feedback, avoiding the tedious back-and-forth that typically occurs when errors are reported and addressed one at a time.

To log multiple errors within a single check, use the ErrorLogger class. It is designed to operate much like a list, allowing errors to be collected with its append method.

Once all errors have been logged, you can raise them with the raise_and_log_all_errors method. This will raise an exception of your choice and report all stored errors to the user.

ErrorLogger in practice
from splink.exceptions import ErrorLogger

# Create an error logger instance
e = ErrorLogger()

# Log your errors
e.append(SyntaxError("The syntax is wrong"))
e.append(NameError("Invalid name entered"))

# Raise your errors
e.raise_and_log_all_errors()


Expanding miscellaneous checks

Miscellaneous checks should be added as standalone functions within an appropriate script inside splink/settings_validation. These functions can then be integrated into the linker's startup process for validation.

An example of a miscellaneous check is the validate_dialect function. This assesses whether the settings dialect aligns with the linker's dialect.

This is then called within the _validate_settings method of our linker.
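
For reference, the core of such a check can be very small. The following is a hedged approximation of the dialect comparison described above; the argument names are illustrative rather than Splink's exact signature:

def validate_dialect(settings_dialect, linker_dialect, linker_type):
    # Raise if the dialect in the settings disagrees with the linker's dialect
    if settings_dialect != linker_dialect:
        raise ValueError(
            f"Incompatible SQL dialect! `settings` dictionary uses dialect "
            f"'{settings_dialect}', but expected '{linker_dialect}' "
            f"for your linker of type '{linker_type}'."
        )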


Additional comparison and blocking rule checks

Comparison and Blocking Rule checks can be found within the valid_types.py script.

These checks currently interface with the ErrorLogger class, which is used to store and raise multiple errors simultaneously (see above).

If you wish to expand the current set of checks, it is advised that you incorporate any new ones into either log_comparison_errors or _validate_settings (mentioned above), following the pattern sketched below.
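
A new check in this style would typically append failures to a shared ErrorLogger rather than raising immediately. Below is a hedged sketch; the validation performed inside the loop is hypothetical:

from splink.exceptions import ErrorLogger

def log_comparison_errors(comparisons):
    error_logger = ErrorLogger()
    for comparison in comparisons:
        # Hypothetical check: every comparison should be a dictionary
        if not isinstance(comparison, dict):
            error_logger.append(
                TypeError(f"Comparison is not a dictionary: {comparison!r}")
            )
    # Raise everything that was collected in a single pass
    error_logger.raise_and_log_all_errors()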


Checking for the existence of user specified columns

Column and SQL validation is performed within log_invalid_columns.py.

The aim of this script is to check that the columns specified by the user exist within the input dataframe(s). If any invalid columns are found, the script logs the details for the user.

Should you need to include extra checks to assess the validity of columns supplied by a user, your primary focus should be on the log_invalid_columns.py script.

There are two main classes within this script that can be used or extended to perform additional column checks:

InvalidCols

InvalidCols is a NamedTuple, used to construct the bulk of our log strings. This accepts a list of columns and the type of error, producing a complete log string when requested.

For simplicity, there are three partial implementations to cover the most common cases:

  • MissingColumnsLogGenerator - missing column identified.
  • InvalidTableNamesLogGenerator - table name entered by the user is missing or invalid.
  • InvalidColumnSuffixesLogGenerator - _l and _r suffixes are missing or invalid.

In practice, this can be used as follows:

# Store our invalid columns
my_invalid_cols = MissingColumnsLogGenerator(["first_col", "second_col"])
# Construct the corresponding log string
my_invalid_cols.construct_log_string()

InvalidColumnsLogger

InvalidColumnsLogger takes in a series of cleansed columns from your settings object (see SettingsColumnCleaner) and runs a series of validation checks to assess whether the column(s) are present within the underlying dataframes.

Any invalid columns are stored in an InvalidCols instance (see above), which is then used to construct a log string.

Logs are output to the user at the INFO level.

To extend the column checks, add a new validation method to the InvalidColumnsLogger class and call it within construct_output_logs, so that its output is surfaced to the user.
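
In outline, the pattern looks like the following hedged sketch. The constructor, the input_columns attribute and the check itself are hypothetical, and MissingColumnsLogGenerator is assumed to be importable from the settings validation scripts discussed above:

import logging

logger = logging.getLogger(__name__)

class InvalidColumnsLogger:
    def __init__(self, input_columns):
        # Hypothetical attribute: the cleaned set of input dataframe columns
        self.input_columns = input_columns

    def check_uid_column(self):
        # Hypothetical new check: flag the unique ID column if it is missing
        missing = [c for c in ("unique_id",) if c not in self.input_columns]
        if missing:
            return MissingColumnsLogGenerator(missing)  # assumed importable

    def construct_output_logs(self):
        # Each check must be called here for its log to reach the user
        invalid = self.check_uid_column()
        if invalid:
            logger.info(invalid.construct_log_string())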

Single column, multi-column and SQL checks

Single and multi-column

Single and multi-column checks are relatively straightforward. Assuming you have a clean set of columns, you can leverage the check_for_missing_settings_column function.

This expects the following arguments:

  • settings_id: the name of the settings ID. This is only used for logging and does not need to match the true ID.
  • settings_column_to_check: the column(s) you wish to validate.
  • valid_input_dataframe_columns: the cleaned columns from all of your input dataframes.
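
For illustration, a call might look like the sketch below. The column names are invented, and the function's exact return value and logging behaviour should be confirmed against the source:

# Hedged usage sketch of check_for_missing_settings_column
check_for_missing_settings_column(
    settings_id="unique_id_column_name",
    settings_column_to_check=["my_unique_id"],
    valid_input_dataframe_columns=["my_unique_id", "first_name", "surname"],
)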

Checking columns in SQL statements

Checking SQL statements is a little more complex, given the need to parse SQL in order to extract your column names.

To do this, you can leverage the check_for_missing_or_invalid_columns_in_sql_strings function.

This expects the following arguments:

  • sql_dialect: The SQL dialect used by the linker.
  • sql_strings: A list of SQL strings.
  • valid_input_dataframe_columns: The list of columns identified in your input dataframe(s).
  • additional_validation_checks: Functions used to check for other issues with the parsed SQL string, namely table name and column suffix validation.
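
As a hedged illustration, a call might look as follows. The SQL string and column list are invented, and check_table_names is a hypothetical stand-in for whatever additional validation functions you pass in:

check_for_missing_or_invalid_columns_in_sql_strings(
    sql_dialect="duckdb",
    sql_strings=["l.first_name = r.first_name"],
    valid_input_dataframe_columns=["first_name", "surname"],
    additional_validation_checks=[check_table_names],  # hypothetical helper
)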

NB: for nested SQL statements, you'll need to add an additional loop. See check_comparison_for_missing_or_invalid_sql_strings for more details.