Documentation for splink.exploratory
completeness_chart(table_or_tables, db_api, cols=None, table_names_for_chart=None)
Generate a summary chart of data completeness (the proportion of non-null values) for the columns of each input table. By default, completeness is assessed for all columns in the input data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_or_tables | Sequence[AcceptableInputTableType] | A single table or a list of tables of data | required |
db_api | DatabaseAPISubClass | The backend database API to use | required |
cols | List[str] | List of column names for which to calculate completeness. If None, completeness is computed for all columns. | None |
table_names_for_chart | List[str] | A list of table names to use in the chart. Must be the same length as table_or_tables. | None |
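To make the completeness measure concrete, here is a minimal pure-Python sketch of the proportion-of-non-nulls calculation described above. It is an illustration only and not Splink's implementation: the `completeness` function and the `records` sample are hypothetical names introduced here, and the sketch returns plain numbers rather than a chart.

```python
from typing import Any, Dict, List, Optional


def completeness(
    rows: List[Dict[str, Any]], cols: Optional[List[str]] = None
) -> Dict[str, float]:
    """Return the proportion of non-null (non-None) values per column."""
    if not rows:
        return {}
    if cols is None:
        # Default: assess every column seen in the data, in first-seen order
        cols = list(dict.fromkeys(key for row in rows for key in row))
    total = len(rows)
    return {
        col: sum(row.get(col) is not None for row in rows) / total
        for col in cols
    }


# Hypothetical sample data, invented for this illustration
records = [
    {"first_name": "Stephen", "surname": "Smith", "dob": None},
    {"first_name": "Steven", "surname": None, "dob": "1990-01-01"},
    {"first_name": None, "surname": "Jones", "dob": None},
    {"first_name": "Stephan", "surname": "Brown", "dob": None},
]

result = completeness(records)
dob_only = completeness(records, cols=["dob"])
```

Passing `cols` restricts the calculation to the named columns, mirroring the role of the `cols` parameter in the table above.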
profile_columns(table_or_tables, db_api, column_expressions=None, top_n=10, bottom_n=10)
Profiles the specified columns of the input table or tables.
This can be computationally expensive if the data is large.
For the provided column_expressions (or for all columns if left empty), calculate:
- A distribution plot that shows the count of values at each percentile.
- A top n chart showing the count of the top n values within the column.
- A bottom n chart showing the count of the bottom n values within the column.
This should be used to explore the dataframe, determine if columns have sufficient completeness for linking, analyse the cardinality of columns, and identify the need for standardisation within a given column.
Args:
column_expressions (list, optional): A list of strings containing the specified column names. If left empty this will default to all columns.
top_n (int, optional): The number of top n values to plot.
bottom_n (int, optional): The number of bottom n values to plot.
Returns:
Type | Description |
---|---|
Optional[ChartReturnType] | altair.Chart or dict: A visualization or JSON specification describing the profiling charts. |
Note
- The linker object should be an instance of the initiated linker.
- The provided column_expressions can be a list of column names to profile. If left empty, all columns will be profiled.
- The top_n and bottom_n parameters determine the number of top and bottom values to display in the respective charts.
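The top n and bottom n charts described above rest on a simple value count per column. The following sketch shows that idea in plain Python, assuming nulls are excluded from the counts. The `top_and_bottom_values` helper and the `names` sample are hypothetical, and the tie-breaking rule is an assumption made here for determinism. Splink's own profiling runs against the chosen backend and returns charts.

```python
from collections import Counter
from typing import Any, Iterable, List, Tuple


def top_and_bottom_values(
    values: Iterable[Any], top_n: int = 10, bottom_n: int = 10
) -> Tuple[List[Tuple[Any, int]], List[Tuple[Any, int]]]:
    """Return the most and least frequent non-null values with their counts."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending count, then by value text for a deterministic order
    ranked = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    top = ranked[:top_n]
    bottom = ranked[-bottom_n:] if bottom_n > 0 else []
    return top, bottom


# Hypothetical column of first names, invented for this illustration
names = ["Stephen", "Steven", "Stephen", None, "Stephan", "Stephen", "Steven"]

top, bottom = top_and_bottom_values(names, top_n=2, bottom_n=1)
```

A column dominated by a handful of values, or one with many near-duplicate spellings in the bottom n, is the kind of signal that suggests standardisation may be needed.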
Documentation for splink.exploratory.similarity_analysis
comparator_score(str1, str2, decimal_places=2)
Helper function to give the similarity between two strings for the string comparators in splink.
Examples:
import splink.exploratory.similarity_analysis as sa
sa.comparator_score("Richard", "iRchard")
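For intuition about what a string comparator measures, the sketch below implements Levenshtein edit distance and a length-normalised similarity in pure Python. This is an assumption-laden illustration: the page does not state which measures `comparator_score` reports or how it computes them, and `levenshtein` and `normalised_similarity` are hypothetical names introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str, decimal_places: int = 2) -> float:
    """Scale edit distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return round(1 - levenshtein(a, b) / longest, decimal_places)


distance = levenshtein("Richard", "iRchard")
similarity = normalised_similarity("Richard", "iRchard")
```

Plain Levenshtein counts the swapped first two letters as two substitutions. A measure that treats an adjacent transposition as a single edit, such as Damerau-Levenshtein, would score the same pair as closer.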
comparator_score_chart(list, col1, col2)
Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings.
Examples:
import splink.exploratory.similarity_analysis as sa
list = {
"string1": ["Stephen", "Stephen", "Stephen"],
"string2": ["Stephen", "Steven", "Stephan"],
}
sa.comparator_score_chart(list, "string1", "string2")
comparator_score_df(list, col1, col2, decimal_places=2)
Helper function returning a dataframe showing the string similarity scores and string distances for a list of strings.
Examples:
import splink.exploratory.similarity_analysis as sa
list = {
"string1": ["Stephen", "Stephen", "Stephen"],
"string2": ["Stephen", "Steven", "Stephan"],
}
sa.comparator_score_df(list, "string1", "string2")
comparator_score_threshold_chart(list, col1, col2, similarity_threshold=None, distance_threshold=None)
Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings given a threshold.
Examples:
import splink.exploratory.similarity_analysis as sa
list = {
"string1": ["Stephen", "Stephen", "Stephen"],
"string2": ["Stephen", "Steven", "Stephan"],
}
sa.comparator_score_threshold_chart(list,
"string1", "string2",
similarity_threshold=0.8,
distance_threshold=2)
phonetic_match_chart(list, col1, col2)
Helper function returning a heatmap showing the phonetic transforms and matches for a list of strings.
Examples:
import splink.exploratory.similarity_analysis as sa
list = {
"string1": ["Stephen", "Stephen", "Stephen"],
"string2": ["Stephen", "Steven", "Stephan"],
}
sa.phonetic_match_chart(list, "string1", "string2")
phonetic_transform(string)
Helper function to give the phonetic transformation of a string with Soundex, Metaphone and Double Metaphone.
Examples:
import splink.exploratory.similarity_analysis as sa
sa.phonetic_transform("Richard")
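As a rough illustration of what a phonetic transform does, the sketch below implements the classic American Soundex rules in pure Python. It is not the code Splink uses, and the exact variant behind `phonetic_transform` is not stated on this page. `SOUNDEX_CODES` and `soundex` are hypothetical names introduced here.

```python
# Consonant groups that share a Soundex digit; vowels, H, W and Y have no code
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first
    previous_code = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal codes
            continue
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous_code:
            encoded += code
        # Uncoded letters (vowels, Y) reset previous_code, so the same
        # digit may repeat when a vowel sits between two consonants
        previous_code = code
    # Pad with zeros or truncate to exactly four characters
    return (encoded + "000")[:4]


codes = {name: soundex(name) for name in ["Stephen", "Steven", "Stephan"]}
```

All three spellings map to the same code, which is the property a phonetic match relies on. A transposed first letter, as in "iRchard", changes the leading character of the code, so Soundex alone would not match it to "Richard".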
phonetic_transform_df(list, col1, col2)
Helper function returning a dataframe showing the phonetic transforms for a list of strings.
Examples:
import splink.exploratory.similarity_analysis as sa
list = {
"string1": ["Stephen", "Stephen", "Stephen"],
"string2": ["Stephen", "Steven", "Stephan"],
}
sa.phonetic_transform_df(list, "string1", "string2")