cleanX.dataset_processing.dataframes module¶
Library for cleaning radiological data used in machine learning applications.
Module dataframes: a module for processing datasets related to images. Its functionality can be used either through standalone functions or through the classes it provides.
- exception cleanX.dataset_processing.dataframes.GuesserError¶
Bases: TypeError
This error is raised when the loading code cannot figure out what kind of source it is dealing with.
- class cleanX.dataset_processing.dataframes.ColumnsSource¶
Bases: ABC
Formal superclass for all sources that should be used to produce a DataFrame.
- class cleanX.dataset_processing.dataframes.CSVSource(csv, **pd_args)¶
Bases: ColumnsSource
Class that helps turn a CSV file into a dataframe.
- __init__(csv, **pd_args)¶
Initializes this class with the path to the CSV file from which to create the dataframe and the pass-through arguments to pandas.read_csv().
- Parameters:
csv – The path to the CSV file or an open file handle. Must be suitable for pandas.read_csv().
**pd_args – The pass-through arguments to pandas.read_csv().
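A minimal sketch of constructing a CSVSource; the file name "train.csv" and the sep argument are illustrative, and any keyword arguments are passed through to pandas.read_csv():

from cleanX.dataset_processing.dataframes import CSVSource

# Hypothetical CSV path; "sep" is forwarded to pandas.read_csv()
source = CSVSource("train.csv", sep=",")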
- class cleanX.dataset_processing.dataframes.JSONSource(json, **pd_args)¶
Bases: ColumnsSource
Class that helps turn a JSON file into a dataframe for later exploration.
- __init__(json, **pd_args)¶
Initializes this class with the path to the JSON file from which to create the dataframe and the pass-through arguments to pandas.read_json().
- Parameters:
json – The path to the JSON file or an open file handle. Must be suitable for pandas.read_json().
**pd_args – The pass-through arguments to pandas.read_json().
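A minimal sketch of constructing a JSONSource; the file name "cases.json" and the orient argument are illustrative, and any keyword arguments are passed through to pandas.read_json():

from cleanX.dataset_processing.dataframes import JSONSource

# Hypothetical JSON path; "orient" is forwarded to pandas.read_json()
source = JSONSource("cases.json", orient="records")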
- class cleanX.dataset_processing.dataframes.DFSource(df)¶
Bases: ColumnsSource
This class is a no-op source. Use this when you already have a dataframe ready.
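A minimal sketch of wrapping an existing dataframe; the column names are illustrative:

import pandas as pd

from cleanX.dataset_processing.dataframes import DFSource

# DFSource performs no conversion; it simply wraps the dataframe you already have
df = pd.DataFrame({"PatientID": [1, 2], "Label": [0, 1]})
source = DFSource(df)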
- class cleanX.dataset_processing.dataframes.MultiSource(*sources)¶
Bases: ColumnsSource
This class allows aggregation of multiple sources.
- __init__(*sources)¶
Initializes this class with any number of sources that will be combined in a manner similar to itertools.chain().
- Parameters:
*sources – Sources to be concatenated together to create a single dataframe.
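A minimal sketch of aggregating two sources; the file names are illustrative:

from cleanX.dataset_processing.dataframes import CSVSource, JSONSource, MultiSource

# The rows from both sources are concatenated into a single dataframe
combined = MultiSource(CSVSource("site_a.csv"), JSONSource("site_b.json"))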
- cleanX.dataset_processing.dataframes.string_source(raw_src)¶
Helper function to select source based on file extension.
- Parameters:
raw_src (Suitable for os.path.splitext()) – The path to the file to be interpreted as a source.
- Returns:
Either CSVSource or JSONSource, depending on the file extension.
- Return type:
Union[CSVSource, JSONSource]
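A minimal sketch; the paths are illustrative and only their extensions matter:

from cleanX.dataset_processing.dataframes import string_source

csv_source = string_source("cases.csv")    # returns a CSVSource
json_source = string_source("cases.json")  # returns a JSONSource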
- class cleanX.dataset_processing.dataframes.MLSetup(train_src, test_src, unique_id=None, label_tag='Label', sensitive_list=None)¶
Bases: object
This class allows configuration of the train and test datasets, organized into pandas dataframes, to be checked for problems, and creates reports which can be rendered in multiple output formats.
- known_sources = {<class 'str'>: <function string_source>, <class 'bytes'>: <function string_source>, <class 'pathlib.Path'>: <function string_source>, <class 'pandas.core.frame.DataFrame'>: <function MLSetup.<lambda>>}¶
Mapping of types of sources to the factory functions for creating source objects.
- __init__(train_src, test_src, unique_id=None, label_tag='Label', sensitive_list=None)¶
Initializes this class with various aspects of a typical machine-learning study.
- Parameters:
train_src (If this is a path-like object, it is interpreted as a path to a file that needs to be read into a DataFrame. If this is already a dataframe, it is used as is. If this is an iterable, it is interpreted as a sequence of different sources and will be processed using MultiSource.) – The source for the training dataset.
test_src (Same as train_src) – Similar to train_src, but for the testing dataset.
unique_id (Suitable for accessing columns in a pandas dataframe.) – The name of the column that uniquely identifies the cases (typically, this is the patient's id).
label_tag (Suitable for accessing columns in a pandas dataframe.) – Usually, the training dataset has an assessment column that assigns the case to the category of interest. Typically, this is a diagnosis, a finding, etc.
sensitive_list (A sequence of regular expressions that will be applied to the column names (converted to strings if necessary).) – The list of columns that you suspect might be affecting the fairness of the study. These are typically gender, age, ethnicity, etc.
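A minimal sketch of creating a setup; the file names, column names and regular expressions are illustrative:

from cleanX.dataset_processing.dataframes import MLSetup

setup = MLSetup(
    train_src="train.csv",             # read into a DataFrame via CSVSource
    test_src="test.csv",
    unique_id="PatientID",             # hypothetical patient-id column
    label_tag="Label",
    sensitive_list=["gender", "age"],  # regexes matched against column names
)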
- get_unique_id()¶
Tries to find the column in the training and testing datasets that uniquely identifies the entries in both. Typically, this is a patient id of some sort. If the setup was initialized with unique_id, then that is used. Otherwise, a rather simple heuristic is used.
- get_sensitive_list()¶
Returns a list of regular expressions that will be applied to the list of columns to identify the sensitive categories (those that might bias the training towards overrepresented categories).
- guess_source(raw_src)¶
Helper method to convert sources given by external factors to internal representation.
- Parameters:
raw_src – The externally supplied source. This is typically either a path to a file, or an existing dataframe or a collection of such sources.
- Returns:
An internal representation of the source.
- Return type:
ColumnsSource
- metadata()¶
Returns a tuple of column names of train and test datasets.
- Returns:
Column names of train and test datasets.
- concat_dataframe()¶
Helper method to generate the dataset containing both training and test data.
- Returns:
A combined dataframe.
- Return type:
DataFrame
- duplicated()¶
Provides information on duplicates found in training and test data.
- Returns:
A dataframe with information about duplicates.
- Return type:
DataFrame
- duplicated_frame()¶
Provides more detailed information about the duplicates found in training and test data.
- Returns:
A tuple of two dataframes: the first listing duplicates in the training data, the second listing duplicates in the test data.
- duplicates()¶
Calculates the number of duplicates in training and test datasets separately.
- pics_in_both_groups(unique_id)¶
Generates a dataframe listing cases that appear in both the training and testing datasets.
- Returns:
A dataframe with images found in both training and testing datasets.
- Return type:
DataFrame
- generate_report(duplicates=True, leakage=True, bias=True, understand=True)¶
Generates a report object summarizing the properties of this setup. This report can later be used to produce formatted output for inspection by a human.
- Parameters:
duplicates (bool) – Whether the information on duplicates needs to be included in the report.
leakage (bool) – Whether the information on data leakage (cases from the training set appearing in the test set) should be included in the report.
bias (bool) – Whether the information about distribution in sensitive categories should be included in the report.
understand (bool) – Whether information about general properties of the dataset should be included in the report.
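A minimal sketch, assuming the illustrative file and column names from the __init__ example above:

from cleanX.dataset_processing.dataframes import MLSetup

setup = MLSetup("train.csv", "test.csv", unique_id="PatientID")
# Request all four sections of the report
report = setup.generate_report(duplicates=True, leakage=True, bias=True, understand=True)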
- leakage()¶
This method explores the data in terms of any instances found in both training and test sets.
- bias()¶
This method sorts the data instances by sensitive categories for each label, e.g. if the ML model is intended to diagnose pneumonia, then cases of pneumonia and of no pneumonia would each get counts of gender or other specified sensitive categories.
- class cleanX.dataset_processing.dataframes.Report(mlsetup, duplicates=True, leakage=True, bias=True, understand=True)¶
Bases: object
This class is for a report which can be produced about the data.
- __init__(mlsetup, duplicates=True, leakage=True, bias=True, understand=True)¶
Initializes report instance with flags indicating what parts to include in the report.
- Parameters:
mlsetup (MLSetup) – The setup this report is about.
duplicates (bool) – Whether information about duplicates is to be reported.
leakage (bool) – Whether information about leakage is to be reported.
bias (bool) – Whether information about bias is to be reported.
understand (bool) – Whether general information about the given setup is to be reported.
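A minimal sketch of building a report directly; the file names are illustrative:

from cleanX.dataset_processing.dataframes import MLSetup, Report

setup = MLSetup("train.csv", "test.csv")
# Only the leakage and bias sections are requested here
report = Report(setup, duplicates=False, leakage=True, bias=True, understand=False)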
- report_duplicates()¶
This method extracts information on duplicates in the datasets, once they are made into dataframes. The information can then be reported.
- report_leakage()¶
Adds a report section on data leakage (training results found in testing samples).
- report_bias()¶
Adds a report section on distribution in sensitive categories.
- report_understand()¶
This method extracts general information on the datasets, once they are made into dataframes. The information can then be reported.
- subsection_html(data, level=2)¶
Utility method to recursively generate subsections for HTML report.
- subsection_text(data, level=2)¶
Utility method to recursively generate subsections for text report.
- Parameters:
data (Various data structures constituting the report) – The data to be reported.
level (int) – How deeply this section is indented.
- Returns:
A string containing the text of the subsection. Only the subsections of the returned subsection are indented, i.e. you need to indent the result according to the nesting level.
- Return type:
str
- to_ipwidget()¶
Generates an HTML widget. This is mostly usable when running in a Jupyter notebook context.
Warning
This will try to import the widget class, which is not installed as a dependency of cleanX. It relies on it being available as part of the Jupyter installation.
- Returns:
An HTML widget with the formatted report.
- Return type:
HTML
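A minimal usage sketch, assuming a Report instance named report (for example, from generate_report()) and a Jupyter environment where ipywidgets is available:

# The widget renders when it is the last expression in a notebook cell
widget = report.to_ipwidget()
widget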
- cleanX.dataset_processing.dataframes.check_paths_for_group_leakage(train_df, test_df, unique_id)¶
Finds train samples that have been accidentally leaked into the test samples.
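A minimal sketch with toy dataframes; "PatientID" is an illustrative unique-id column:

import pandas as pd

from cleanX.dataset_processing.dataframes import check_paths_for_group_leakage

train_df = pd.DataFrame({"PatientID": [1, 2, 3], "Label": [0, 1, 0]})
test_df = pd.DataFrame({"PatientID": [3, 4], "Label": [0, 1]})
# Samples found in both train_df and test_df, matched on PatientID
leaked = check_paths_for_group_leakage(train_df, test_df, "PatientID")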
- cleanX.dataset_processing.dataframes.see_part_potential_bias(df, label, sensitive_column_list)¶
This function gives you a tabulated DataFrame of sensitive columns, e.g. gender, race, or whichever you think are relevant, in terms of the labels (put in the label column name). You may discover that all your pathologically labeled samples are of one ethnic group, gender, or other category in your DataFrame. Remember that some early neural nets for chest X-rays were less accurate for women, and the fact that there were fewer X-rays of women in the datasets they were built on did not help.
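A minimal sketch with a toy dataframe; the column names are illustrative:

import pandas as pd

from cleanX.dataset_processing.dataframes import see_part_potential_bias

df = pd.DataFrame({
    "Label": [1, 1, 0, 0],
    "gender": ["F", "M", "M", "M"],
})
# Tabulation of the gender column per label value
tabulated = see_part_potential_bias(df, "Label", ["gender"])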
- cleanX.dataset_processing.dataframes.understand_df(df)¶
Takes a DataFrame (if you have a DataFrame for images) and prints information including length, data types, nulls, and the number of duplicated rows.
- Parameters:
df (DataFrame) – The DataFrame you are interested in getting the features of.
- Returns:
Prints out information on the DataFrame.
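A minimal sketch with a toy dataframe containing a duplicated row and null values:

import pandas as pd

from cleanX.dataset_processing.dataframes import understand_df

df = pd.DataFrame({"PatientID": [1, 2, 2], "Label": [0.0, None, None]})
understand_df(df)  # prints length, data types, null counts and the number of duplicated rows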