Documentation for splink.exploratory¶

completeness_chart(table_or_tables, db_api, cols=None, table_names_for_chart=None) ¶

Generate a summary chart of data completeness (the proportion of non-null values) for the columns of the input table or tables. By default, completeness is assessed for all columns in the input data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table_or_tables | Sequence[AcceptableInputTableType] | A single table or a list of tables of data | required |
| db_api | DatabaseAPISubClass | The backend database API to use | required |
| cols | List[str] | List of column names to calculate completeness for. If None, completeness is computed for all columns. | None |
| table_names_for_chart | List[str] | A list of names. Must be the same length as table_or_tables. | None |
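
To make the metric concrete, here is a minimal plain-Python sketch of what "completeness" means for each column: the share of rows whose value is not null. It is not Splink's implementation, which runs against the chosen database backend; the `completeness` helper and the sample records are hypothetical.

```python
def completeness(rows, cols=None):
    """Return {column: proportion of non-null values} for a list of dict records."""
    if not rows:
        return {}
    if cols is None:
        # Default to every column that appears in any record
        cols = sorted({key for row in rows for key in row})
    total = len(rows)
    return {
        col: sum(row.get(col) is not None for row in rows) / total
        for col in cols
    }


# Hypothetical sample records; None stands in for a SQL NULL
records = [
    {"first_name": "Stephen", "dob": "1990-01-01"},
    {"first_name": "Steven", "dob": None},
    {"first_name": None, "dob": None},
    {"first_name": "Stephan", "dob": "1985-06-30"},
]

result = completeness(records)
print(result)
```

Passing `cols=["dob"]` restricts the calculation to that column, mirroring the role of the `cols` parameter described above.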

profile_columns(table_or_tables, db_api, column_expressions=None, top_n=10, bottom_n=10) ¶

Profiles the specified columns of the input table or tables.

This can be computationally expensive if the dataframe is large.

For the columns provided in column_expressions (or for all columns if left empty), calculate:
  • A distribution plot that shows the count of values at each percentile.
  • A top n chart showing the count of the top n values within the column.
  • A bottom n chart showing the count of the bottom n values within the column.

This should be used to explore the dataframe, determine if columns have sufficient completeness for linking, analyse the cardinality of columns, and identify the need for standardisation within a given column.

Args:

column_expressions (list, optional): A list of strings containing the
    specified column names.
    If left empty this will default to all columns.
top_n (int, optional): The number of top n values to plot.
bottom_n (int, optional): The number of bottom n values to plot.

Returns:

| Type | Description |
| --- | --- |
| Optional[ChartReturnType] | altair.Chart or dict: A visualization or JSON specification describing the profiling charts. |

Note
  • The linker object should be an instance of the initiated linker.
  • The provided column_expressions can be a list of column names to profile. If left empty, all columns will be profiled.
  • The top_n and bottom_n parameters determine the number of top and bottom values to display in the respective charts.
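
As an illustration of what the top n and bottom n charts summarise, the following sketch counts value frequencies for a single column in plain Python. It is not how Splink computes the profile; the `top_and_bottom_values` helper and the sample column are hypothetical.

```python
from collections import Counter


def top_and_bottom_values(values, top_n=10, bottom_n=10):
    """Return (top, bottom) lists of (value, count) pairs, ignoring nulls."""
    counts = Counter(v for v in values if v is not None)
    ranked = counts.most_common()  # sorted by count, highest first
    top = ranked[:top_n]
    # Least frequent values, rarest first
    bottom = ranked[::-1][:bottom_n]
    return top, bottom


# Hypothetical column of first names; None stands in for a SQL NULL
names = ["Stephen", "Stephen", "Stephen", "Steven", "Steven", "Stephan", None]

top, bottom = top_and_bottom_values(names, top_n=2, bottom_n=1)
print(top)
print(bottom)
```

A column dominated by a handful of values, or one with many near-duplicate rare values, is a typical sign that standardisation may be needed before linking.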

Documentation for splink.exploratory.similarity_analysis¶

comparator_score(str1, str2, decimal_places=2) ¶

Helper function to give the similarity between two strings for the string comparators in splink.

Examples:

import splink.exploratory.similarity_analysis as sa

sa.comparator_score("Richard", "iRchard")
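
For intuition about what a string distance measures, here is a minimal pure-Python sketch of Levenshtein distance, one widely used edit distance. It is an illustration only: the `levenshtein` helper is hypothetical, and the exact set of comparators reported by comparator_score is not listed in this section.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("Richard", "iRchard")
print(distance)  # 2
```

Plain Levenshtein counts the swapped first two letters as two substitutions; a distance that treats a transposition as a single edit would score this pair lower.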

comparator_score_chart(list, col1, col2) ¶

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Named `data` rather than `list` to avoid shadowing the Python builtin
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_chart(data, "string1", "string2")

comparator_score_df(list, col1, col2, decimal_places=2) ¶

Helper function returning a dataframe showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Named `data` rather than `list` to avoid shadowing the Python builtin
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_df(data, "string1", "string2")

comparator_score_threshold_chart(list, col1, col2, similarity_threshold=None, distance_threshold=None) ¶

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings given a threshold.

Examples:

import splink.exploratory.similarity_analysis as sa

# The dictionary must be bound to the same name that is passed to the chart function
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_threshold_chart(
    data,
    "string1",
    "string2",
    similarity_threshold=0.8,
    distance_threshold=2,
)
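
The sketch below illustrates the idea of applying a similarity threshold to pairs of strings. It rests on two assumptions that this section does not confirm: that a pair passes when its similarity is at or above similarity_threshold, and that a generic similarity in the range 0 to 1 behaves comparably to Splink's comparators. The standard library's `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure.

```python
from difflib import SequenceMatcher

data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

similarity_threshold = 0.8

flags = []
for left, right in zip(data["string1"], data["string2"]):
    # SequenceMatcher.ratio() is a stand-in similarity in [0, 1],
    # not one of Splink's own comparators
    score = SequenceMatcher(None, left, right).ratio()
    # Assumed rule: a pair passes when score >= threshold
    flags.append((left, right, round(score, 2), score >= similarity_threshold))

for row in flags:
    print(row)
```

Under the same assumption, a distance threshold would flip the comparison: a pair passes when its edit distance is at or below distance_threshold.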

phonetic_match_chart(list, col1, col2) ¶

Helper function returning a heatmap showing the phonetic transforms and matches for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

# Call the function documented here, which takes no threshold arguments
sa.phonetic_match_chart(data, "string1", "string2")

phonetic_transform(string) ¶

Helper function to give the phonetic transformation of a string with Soundex, Metaphone and Double Metaphone.

Examples:

import splink.exploratory.similarity_analysis as sa

sa.phonetic_transform("Richard")
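
To show what a phonetic transform does, here is a minimal pure-Python sketch of classic American Soundex, the simplest of the three algorithms named above. It is an illustration under stated assumptions: the `soundex` helper is hypothetical, it is not the code Splink calls, and a library implementation may differ in edge cases.

```python
# Consonant groups that share a Soundex digit
SOUNDEX_CODES = {}
for letters, digit in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")  # vowels map to ""
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for name in ["Richard", "Stephen", "Steven", "Stephan"]:
    print(name, soundex(name))
```

"Stephen", "Steven" and "Stephan" all map to the same code, which is why phonetic transforms are useful for matching spelling variants of a name.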

phonetic_transform_df(list, col1, col2) ¶

Helper function returning a dataframe showing the phonetic transforms for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

# Call the function documented here, which returns a dataframe rather than a chart
sa.phonetic_transform_df(data, "string1", "string2")