cleanX.dataset_processing.dataframes module

Library for cleaning radiological data used in machine learning applications.

Module dataframes: a module for processing image-related datasets. Its functionality is available both through standalone functions and through classes.

exception cleanX.dataset_processing.dataframes.GuesserError

Bases: TypeError

This error is raised when the loading code cannot figure out what kind of source it is dealing with.

class cleanX.dataset_processing.dataframes.ColumnsSource

Bases: ABC

Formal superclass for all sources that should be used to produce a DataFrame.

abstract to_dataframe()

Descendants of this class must implement this method.

Returns:

Dataframe produced from the source represented by this class.

Return type:

DataFrame.

class cleanX.dataset_processing.dataframes.CSVSource(csv, **pd_args)

Bases: ColumnsSource

Class that helps turn a CSV file into a dataframe.

__init__(csv, **pd_args)

Initializes this class with the path to the CSV file from which to create the dataframe and the pass-through arguments to pandas.read_csv().

Parameters:
  • csv – The path to the CSV file from which to create the dataframe.

  • **pd_args – Pass-through arguments forwarded to pandas.read_csv().

to_dataframe()

Necessary implementation of the abstractmethod.

Returns:

Dataframe produced from the source represented by this class.

Return type:

DataFrame.
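The conversion performed by CSVSource.to_dataframe() presumably amounts to a call to pandas.read_csv() with the stored pass-through arguments. A minimal pandas sketch (using an in-memory buffer in place of a file path, and sep="," as a stand-in for any pd_args):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,Label\n1,pneumonia\n2,normal\n"

# sep="," illustrates a pass-through argument such as **pd_args would carry.
df = pd.read_csv(io.StringIO(csv_text), sep=",")
print(df.shape)  # (2, 2)
```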

class cleanX.dataset_processing.dataframes.JSONSource(json, **pd_args)

Bases: ColumnsSource

Class that helps turn a JSON file into a dataframe for later exploration.

__init__(json, **pd_args)

Initializes this class with the path to the JSON file from which to create the dataframe and the pass-through arguments to pandas.read_json().

Parameters:
  • json – The path to the JSON file from which to create the dataframe.

  • **pd_args – Pass-through arguments forwarded to pandas.read_json().

to_dataframe()

Necessary implementation of the abstractmethod.

Returns:

Dataframe produced from the source represented by this class.

Return type:

DataFrame.
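Analogously, JSONSource.to_dataframe() presumably wraps pandas.read_json(). A minimal sketch with an in-memory buffer in place of a file path:

```python
import io

import pandas as pd

# Hypothetical JSON records standing in for a file on disk.
json_text = '[{"id": 1, "Label": "pneumonia"}, {"id": 2, "Label": "normal"}]'

df = pd.read_json(io.StringIO(json_text))
print(df.shape)  # (2, 2)
```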

class cleanX.dataset_processing.dataframes.DFSource(df)

Bases: ColumnsSource

This class is a no-op source. Use this when you already have a dataframe ready.

__init__(df)

Initializes this class with the existing dataframe.

Parameters:

df (DataFrame) – The existing dataframe to use as the source.

to_dataframe()

Necessary implementation of the abstractmethod.

Returns:

Dataframe produced from the source represented by this class.

Return type:

DataFrame.

class cleanX.dataset_processing.dataframes.MultiSource(*sources)

Bases: ColumnsSource

This class allows aggregation of multiple sources.

__init__(*sources)

Initializes this class with any number of sources, which will be chained together in a way similar to itertools.chain().

Parameters:

*sources – Sources to be concatenated together to create a single dataframe.

to_dataframe()

Necessary implementation of the abstractmethod.

Returns:

Dataframe produced from the source represented by this class.

Return type:

DataFrame.
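Aggregating several sources into one dataframe presumably comes down to converting each source and concatenating the results. A sketch of that idea in plain pandas:

```python
import pandas as pd

# Two hypothetical frames, as if produced by two separate sources.
part_a = pd.DataFrame({"id": [1, 2], "Label": ["pneumonia", "normal"]})
part_b = pd.DataFrame({"id": [3], "Label": ["normal"]})

# Chain the per-source frames into a single one, similar in spirit
# to itertools.chain() over the individual sources.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(len(combined))  # 3
```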

cleanX.dataset_processing.dataframes.string_source(raw_src)

Helper function to select source based on file extension.

Parameters:

raw_src (Suitable for os.path.splitext()) – The path to the file to be interpreted as a source.

Returns:

Either CSVSource or JSONSource, depending on file extension.

Return type:

Union[CSVSource, JSONSource]
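The dispatch on file extension can be sketched with os.path.splitext(). The helper name and the string return values below are illustrative only; the real function returns CSVSource or JSONSource instances:

```python
import os

def guess_by_extension(raw_src):
    # Split off the extension and dispatch on it, as string_source
    # presumably does; unknown extensions yield None here.
    _, ext = os.path.splitext(raw_src)
    return {".csv": "CSVSource", ".json": "JSONSource"}.get(ext.lower())

print(guess_by_extension("cases.csv"))   # CSVSource
print(guess_by_extension("cases.JSON"))  # JSONSource
```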

class cleanX.dataset_processing.dataframes.MLSetup(train_src, test_src, unique_id=None, label_tag='Label', sensitive_list=None)

Bases: object

This class organizes the train and test datasets into pandas dataframes, checks them for problems, and creates reports that can be rendered in multiple output formats.

known_sources = {<class 'str'>: <function string_source>, <class 'bytes'>: <function string_source>, <class 'pathlib.Path'>: <function string_source>, <class 'pandas.core.frame.DataFrame'>: <function MLSetup.<lambda>>}

Mapping of types of sources to the factory functions for creating source objects.

__init__(train_src, test_src, unique_id=None, label_tag='Label', sensitive_list=None)

Initializes this class with various aspects of a typical machine-learning study.

Parameters:
  • train_src (If this is a path-like object, it is interpreted to be a path to a file that needs to be read into a DataFrame. If this is already a dataframe, it’s used as is. If this is an iterable it is interpreted as a sequence of different sources and will be processed using MultiSource.) – The source for training dataset.

  • test_src (Same as train_src) – Similar to train_src, but for testing dataset.

  • unique_id (Suitable for accessing columns in pandas dataframe.) – The name of the column that uniquely identifies the cases (typically, this is patient’s id).

  • label_tag (Suitable for accessing columns in pandas dataframe.) – Usually, the training dataset has an assessment column that assigns the case to the category of interest. Typically, this is diagnosis, or finding, etc.

  • sensitive_list (A sequence of regular expressions that will be applied to the column names (converted to strings if necessary).) – The list of columns that you suspect might be affecting the fairness of the study. These are typically gender, age, ethnicity, etc.

get_unique_id()

Tries to find the column in the training and testing datasets that uniquely identifies the entries in both. Typically, this is a patient id of some sort. If the setup was initialized with unique_id, then that is used. Otherwise, a rather simple heuristic is used.

Returns:

The name of the column that should uniquely identify the cases being studied.

Return type:

The value from columns. Typically, this is a str.

get_sensitive_list()

Returns a list of regular expressions that will be applied to the list of columns to identify the sensitive categories (those that might bias the training towards overrepresented categories).

guess_source(raw_src)

Helper method to convert sources given by external factors to internal representation.

Parameters:

raw_src – The externally supplied source. This is typically either a path to a file, or an existing dataframe or a collection of such sources.

Returns:

An internal representation of source.

Return type:

ColumnsSource

metadata()

Returns a tuple of column names of train and test datasets.

Returns:

Column names of train and test datasets.

concat_dataframe()

Helper method to generate the dataset containing both training and test data.

Returns:

A combined dataframe.

Return type:

DataFrame

duplicated()

Provides information on duplicates found in training and test data.

Returns:

A dataframe with information about duplicates.

Return type:

DataFrame

duplicated_frame()

Provides more detailed information about the duplicates found in training and test data.

Returns:

A tuple of two dataframes: the first lists duplicates in the training data, the second lists duplicates in the test data.

duplicates()

Calculates the number of duplicates in training and test datasets separately.

Returns:

A tuple with number of duplicates in training and test datasets.

Return type:

Tuple[int, int]
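Counting duplicates per dataset can be sketched with DataFrame.duplicated(), which flags every repeat of a fully identical row:

```python
import pandas as pd

# Hypothetical train and test frames; the train frame repeats one row.
train = pd.DataFrame({"id": [1, 2, 2], "Label": ["a", "b", "b"]})
test = pd.DataFrame({"id": [3, 4], "Label": ["a", "b"]})

# Count fully duplicated rows in each dataset separately.
dup_counts = (int(train.duplicated().sum()), int(test.duplicated().sum()))
print(dup_counts)  # (1, 0)
```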

pics_in_both_groups(unique_id)

Generates dataframe listing cases that appear in both training and testing datasets.

Returns:

A dataframe with images found in both training and testing datasets.

Return type:

DataFrame

generate_report(duplicates=True, leakage=True, bias=True, understand=True)

Generates a report object summarizing the properties of this setup. This report can later be used to produce formatted output for human inspection.

Parameters:
  • duplicates (bool) – Whether the information on duplicates needs to be included in the report.

  • leakage (bool) – Whether the information on data leakage (cases from training set appearing in test set) should be included in report.

  • bias (bool) – Whether the information about distribution in sensitive categories should be included in the report.

  • understand (bool) – Whether information about general properties of the dataset should be included in the report.

leakage()

This method explores the data in terms of any instances found in both training and test sets.

bias()

This method sorts the data instances by sensitive categories for each label. For example, if the model is intended to diagnose pneumonia, both the pneumonia and the non-pneumonia cases would be tallied by gender or by the other specified sensitive categories.

class cleanX.dataset_processing.dataframes.Report(mlsetup, duplicates=True, leakage=True, bias=True, understand=True)

Bases: object

This class is for a report which can be produced about the data.

__init__(mlsetup, duplicates=True, leakage=True, bias=True, understand=True)

Initializes report instance with flags indicating what parts to include in the report.

Parameters:
  • mlsetup (MLSetup) – The setup this report is about.

  • duplicates (bool) – Whether information about duplicates is to be reported.

  • leakage (bool) – Whether information about leakage is to be reported.

  • bias (bool) – Whether information about bias is to be reported.

  • understand (bool) – Whether general information about the given setup is to be reported.

report_duplicates()

This method extracts information on duplicates in the datasets, once they are made into dataframes, so that it can be reported.

report_leakage()

Adds a report section on data leakage (training results found in testing samples).

report_bias()

Adds a report section on distribution in sensitive categories.

report_understand()

This method extracts general information on the datasets, once they are made into dataframes, so that it can be reported.

subsection_html(data, level=2)

Utility method to recursively generate subsections for HTML report.

Parameters:
  • data (Various data structures constituting the report) – The data to be reported.

  • level (int) – How deeply this section is indented.

Returns:

A list containing HTML markup elements.

Return type:

List[str]

subsection_text(data, level=2)

Utility method to recursively generate subsections for text report.

Parameters:
  • data (Various datastructures constituting the report) – The data to be reported.

  • level (int) – How deeply this section is indented.

Returns:

A string containing the text of the subsection. Only the subsections nested within the returned subsection are indented, i.e. you need to indent the result according to its nesting level.

Return type:

str

to_ipwidget()

Generates an HTML widget. This is mostly usable when running in Jupyter notebook context.

Warning

This will try to import the widget class, which is not installed as a dependency of cleanX. It relies on the class being available as part of the Jupyter installation.

Returns:

An HTML widget with the formatted report.

Return type:

HTML

to_text()

Generates plain text representation of this report.

Returns:

A string suitable for either printing to the screen or saving in a text file.

Return type:

str

cleanX.dataset_processing.dataframes.check_paths_for_group_leakage(train_df, test_df, unique_id)

Finds train samples that have been accidentally leaked into test samples.

Parameters:
  • train_df (DataFrame) – Pandas DataFrame containing information about train assets.

  • test_df (DataFrame) – Pandas DataFrame containing information about test assets.

Returns:

A new DataFrame listing the images that appear in both sets.

Return type:

DataFrame
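Finding the overlap presumably amounts to an inner merge of the two frames on the unique id column. A sketch with a hypothetical patient_id column:

```python
import pandas as pd

# Hypothetical train/test frames sharing one patient.
train_df = pd.DataFrame({"patient_id": [1, 2, 3]})
test_df = pd.DataFrame({"patient_id": [3, 4]})

# An inner merge on the unique id keeps only the cases present in both sets.
leaked = train_df.merge(test_df, on="patient_id", how="inner")
print(leaked["patient_id"].tolist())  # [3]
```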

cleanX.dataset_processing.dataframes.see_part_potential_bias(df, label, sensitive_column_list)

This function gives you a tabulated DataFrame of sensitive columns, e.g. gender, race, or whichever you think are relevant, in terms of the labels (put in the label column name). You may discover that all your pathologically labeled samples are of one ethnic group, gender, or other category in your DataFrame. Remember that some early neural nets for chest X-rays were less accurate in women, and the fact that there were fewer X-rays of women in the datasets they were built on did not help.

Parameters:
  • df (DataFrame) – DataFrame including sample IDs, labels, and sensitive columns

  • label (str) – The name of the column with the labels

  • sensitive_column_list (list) – List of names of the sensitive columns in the DataFrame

Returns:

tab_fight_bias2, a neatly sorted DataFrame

Return type:

DataFrame
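Such a tabulation can be sketched as a groupby over the label and the sensitive columns, counting cases per combination. The column names below are illustrative:

```python
import pandas as pd

# Hypothetical labeled data with one sensitive column, "gender".
df = pd.DataFrame({
    "Label": ["pneumonia", "pneumonia", "normal", "normal"],
    "gender": ["F", "M", "M", "M"],
})

# Count the cases in each (label, sensitive-category) combination.
tab = df.groupby(["Label", "gender"]).size().reset_index(name="count")
print(tab.to_string(index=False))
```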

cleanX.dataset_processing.dataframes.understand_df(df)

Takes a DataFrame (if you have a DataFrame for images) and prints information including its length, data types, nulls, and the number of duplicated rows.

Parameters:

df (DataFrame) – DataFrame you are interested in getting features of.

Returns:

Prints out information about the DataFrame.
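The facts listed above (length, data types, nulls, duplicated rows) can each be sketched with standard pandas calls:

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row.
df = pd.DataFrame({"id": [1, 2, 2], "Label": ["a", "b", "b"]})

n_rows = len(df)                          # length
dtypes = df.dtypes.to_dict()              # data types per column
nulls = df.isna().sum().to_dict()         # null count per column
n_dupes = int(df.duplicated().sum())      # number of duplicated rows

print(n_rows, nulls, n_dupes)  # 3 {'id': 0, 'Label': 0} 1
```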

cleanX.dataset_processing.dataframes.show_duplicates(df)

Takes a DataFrame (if you have a DataFrame for images) and prints the duplicated rows.

Parameters:

df (DataFrame) – Dataframe that needs to be searched for duplicates.
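Selecting the duplicated rows themselves can be sketched with a boolean mask; keep=False flags every copy of a repeated row, not just the later ones:

```python
import pandas as pd

# Hypothetical frame where the row (2, "b") appears twice.
df = pd.DataFrame({"id": [1, 2, 2], "Label": ["a", "b", "b"]})

# Select all copies of each fully duplicated row.
dupes = df[df.duplicated(keep=False)]
print(dupes["id"].tolist())  # [2, 2]
```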