Documentation for `splink.exploratory`

`completeness_chart(table_or_tables, db_api, cols=None, table_names_for_chart=None)`

Generate a summary chart of data completeness (proportion of non-nulls) of columns in each of the input table or tables. By default, completeness is assessed for all columns in the input data.

Parameters:

Name	Type	Description	Default
`table_or_tables`	`Sequence[AcceptableInputTableType]`	A single table or a list of tables of data	required
`db_api`	`DatabaseAPISubClass`	The backend database API to use	required
`cols`	`List[str]`	List of column names for which to calculate completeness. If None, completeness is computed for all columns.	`None`
`table_names_for_chart`	`List[str]`	A list of table names to display in the chart. Must be the same length as `table_or_tables`.	`None`
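
To make "proportion of non-nulls" concrete, here is a minimal pure-Python sketch of the quantity the chart summarises. It is not Splink's implementation: the `completeness` helper and the sample `records` are hypothetical, and the real function works on tables through the chosen `db_api` rather than on lists of dictionaries.

```python
def completeness(rows, cols=None):
    """Return {column: proportion of non-null values} for a list of dict records."""
    if cols is None:
        # Mirror the documented default: assess every column seen in the data
        cols = list(dict.fromkeys(key for row in rows for key in row))
    total = len(rows)
    return {
        col: (sum(row.get(col) is not None for row in rows) / total if total else 0.0)
        for col in cols
    }


# Hypothetical sample data with some missing values
records = [
    {"first_name": "Stephen", "dob": "1990-01-01"},
    {"first_name": None, "dob": "1985-06-30"},
    {"first_name": "Steven", "dob": None},
    {"first_name": "Stephan", "dob": None},
]

result = completeness(records)
print(result)  # {'first_name': 0.75, 'dob': 0.5}
```

Passing `cols=["dob"]` restricts the calculation to that column, in the same spirit as the `cols` parameter above.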

`profile_columns(table_or_tables, db_api, column_expressions=None, top_n=10, bottom_n=10)`

Profiles the specified columns of the input table or tables.

This can be computationally expensive if the dataframe is large.

For the provided columns with column_expressions (or for all columns if left empty) calculate:

- A distribution plot that shows the count of values at each percentile.
- A top n chart, that produces a chart showing the count of the top n values within the column.
- A bottom n chart, that produces a chart showing the count of the bottom n values within the column.

This should be used to explore the dataframe, determine if columns have sufficient completeness for linking, analyse the cardinality of columns, and identify the need for standardisation within a given column.

Args:

column_expressions (list, optional): A list of strings containing the
    specified column names.
    If left empty this will default to all columns.
top_n (int, optional): The number of top n values to plot.
bottom_n (int, optional): The number of bottom n values to plot.

Returns:

Type	Description
`Optional[ChartReturnType]`	altair.Chart or dict: A visualization or JSON specification describing the profiling charts.

Note

The linker object should be an instance of the initiated linker.
The provided column_expressions can be a list of column names to profile. If left empty, all columns will be profiled.
The top_n and bottom_n parameters determine the number of top and bottom values to display in the respective charts.
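
The top n and bottom n counts can be illustrated with a short standard-library sketch. It assumes that "top" and "bottom" mean the most and least frequent values in a column. The helper name `top_and_bottom_values` and the sample surnames are hypothetical, and this is not how Splink computes its charts internally.

```python
from collections import Counter


def top_and_bottom_values(values, top_n=10, bottom_n=10):
    """Return (most frequent, least frequent) value counts, ignoring nulls."""
    counts = Counter(v for v in values if v is not None)
    ranked = counts.most_common()  # (value, count) pairs, highest count first
    top = ranked[:top_n]
    # Guard against bottom_n == 0, because ranked[-0:] would return everything
    bottom = ranked[-bottom_n:][::-1] if bottom_n else []
    return top, bottom


# Hypothetical column of surnames, including one null
surnames = ["Smith"] * 3 + ["Jones"] * 2 + ["Patel"] + [None]

top, bottom = top_and_bottom_values(surnames, top_n=2, bottom_n=1)
print(top)     # [('Smith', 3), ('Jones', 2)]
print(bottom)  # [('Patel', 1)]
```

A column whose top values dominate the counts has low cardinality, which is the kind of signal the profiling charts are meant to surface.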

Documentation for `splink.exploratory.similarity_analysis`

`comparator_score(str1, str2, decimal_places=2)`

Helper function to give the similarity between two strings for the string comparators in splink.

Examples:

import splink.exploratory.similarity_analysis as sa

sa.comparator_score("Richard", "iRchard")
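
This page does not list which comparators `comparator_score` reports, so the sketch below shows only one common string distance, Levenshtein edit distance, as an illustration of the kind of measure involved. The `levenshtein` function is a hypothetical stand-alone helper, not part of Splink.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute if different
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("Richard", "iRchard"))  # 2
print(levenshtein("Stephen", "Steven"))   # 2
```

A transposition such as "Ri" to "iR" costs two edits under plain Levenshtein. Distances that treat a swap of adjacent characters as a single edit would score that pair lower.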

`comparator_score_chart(list, col1, col2)`

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Column-oriented input: each key is a column of strings to compare row by row
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_chart(data, "string1", "string2")

`comparator_score_df(list, col1, col2, decimal_places=2)`

Helper function returning a dataframe showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Column-oriented input: each key is a column of strings to compare row by row
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_df(data, "string1", "string2")

`comparator_score_threshold_chart(list, col1, col2, similarity_threshold=None, distance_threshold=None)`

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings given a threshold.

Examples:

import splink.exploratory.similarity_analysis as sa

# Column-oriented input: each key is a column of strings to compare row by row
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.comparator_score_threshold_chart(
    data,
    "string1",
    "string2",
    similarity_threshold=0.8,
    distance_threshold=2,
)
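
As a rough illustration of what applying a similarity threshold means, the sketch below flags each row whose score is at or above the threshold. Two assumptions are made here: the standard library's `difflib.SequenceMatcher` ratio stands in for Splink's similarity comparators, and a pair passes when its score is greater than or equal to the threshold. A distance threshold would presumably work in the opposite direction, with smaller values passing. The helper `meets_similarity_threshold` is hypothetical.

```python
from difflib import SequenceMatcher


def meets_similarity_threshold(s1, s2, similarity_threshold):
    """Return (score, passes) using difflib's ratio as a stand-in similarity."""
    score = SequenceMatcher(None, s1, s2).ratio()
    return score, score >= similarity_threshold


data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

# Compare the two columns row by row and keep only the pass/fail flag
flags = [
    meets_similarity_threshold(s1, s2, similarity_threshold=0.8)[1]
    for s1, s2 in zip(data["string1"], data["string2"])
]
print(flags)  # [True, False, True]
```

The scores produced by Splink's own comparators will differ from this stand-in, so the flags above should not be read as what the chart would show for the same inputs.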

`phonetic_match_chart(list, col1, col2)`

Helper function returning a heatmap showing the phonetic transforms and matches for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Column-oriented input: each key is a column of strings to compare row by row
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.phonetic_match_chart(data, "string1", "string2")

`phonetic_transform(string)`

Helper function to give the phonetic transformation of a string with Soundex, Metaphone and Double Metaphone.

Examples:

import splink.exploratory.similarity_analysis as sa

sa.phonetic_transform("Richard")
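
For readers unfamiliar with phonetic encodings, here is a minimal sketch of classic American Soundex, one of the three transforms named above. The `soundex` function is a hypothetical stand-alone helper. The library Splink uses may differ in edge cases, and Metaphone and Double Metaphone are not sketched here.

```python
def soundex(word):
    """Classic American Soundex: first letter plus three digits, e.g. 'S315'."""
    codes = {}
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    for letters, digit in groups:
        for letter in letters:
            codes[letter] = digit

    chars = [c for c in word.lower() if c.isalpha()]
    if not chars:
        return ""

    result = chars[0].upper()
    previous = codes.get(chars[0], "")
    for char in chars[1:]:
        if char in "hw":
            continue  # h and w do not separate letters that share a code
        code = codes.get(char, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset this, so repeated codes can recur
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Stephen"))  # S315
print(soundex("Steven"))   # S315
print(soundex("Richard"))  # R263
```

"Stephen" and "Steven" share a code even though their spellings differ, which is why phonetic transforms are useful for catching spelling variants of the same name.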

`phonetic_transform_df(list, col1, col2)`

Helper function returning a dataframe showing the phonetic transforms for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa

# Column-oriented input: each key is a column of strings to transform row by row
data = {
    "string1": ["Stephen", "Stephen", "Stephen"],
    "string2": ["Stephen", "Steven", "Stephan"],
}

sa.phonetic_transform_df(data, "string1", "string2")
