htstools package

Submodules

htstools.cli module

Command-line interface for hts-tools.

htstools.cli.main() → None

htstools.io module

Utilities for reading files from platereaders and writing columnar data.

class htstools.io.BiotekData(Meta, Procedure_Details, Layout, Results)

Bases: tuple

Layout: Alias for field number 2

Meta: Alias for field number 0

Procedure_Details: Alias for field number 1

Results: Alias for field number 3

htstools.io.from_platereader(file: IO | str | Iterable[IO | str], shape: str, vendor: str, delimiter: str | None = None, measurement_prefix: str = 'measured_') → DataFrame

Convert raw platereader files to columnar format.

Currently supports only Biotek platereaders.

Loads files exported from platereader software according to the parameters, and does the necessary parsing to extract metadata and measured values into a Pandas DataFrame, with one row per well and each column corresponding to a variable such as a measurement or metadata.

Measurement columns can be identified by the measurement_prefix value, and wavelengths are annotated in columns starting with “fluor” or “abs” and ending with “_wavelength”.

Parameters:

file (str, file-like, or list) – File to parse. Must be CSV, TSV, or XLSX format.
shape (str) – “plate” or “row”, indicating whether the data are in a plate format or row-wise table format.
vendor (str) – Platereader manufacturer. Currently only “Biotek” is implemented.
delimiter (str, optional) – Override inference of file format with this delimiter. Must be either “,”, “\t”, or “xlsx” (to enforce XLSX parsing).
measurement_prefix (str) – The prefix to add to columns containing raw measured variables.

Returns:

Parsed data.

Return type:

pandas.DataFrame

Raises:

NotImplementedError – Where the indicated platereader export format is not yet supported.
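
The column naming scheme described above can be used to pick out measurement and wavelength columns from a parsed table. The following is a minimal sketch using plain Python lists; the column names are hypothetical and only illustrate the documented prefix and suffix rules, not the exact headers that from_platereader() produces.

```python
# Hypothetical column names that follow the documented naming scheme.
columns = [
    "well_id",
    "plate_id",
    "measured_abs_ch1",
    "abs_ch1_wavelength",
    "measured_fluor_ch1",
    "fluor_ch1_wavelength",
]

measurement_prefix = "measured_"  # default value from the signature

# Measurement columns carry the measurement prefix.
measurement_cols = [c for c in columns if c.startswith(measurement_prefix)]

# Wavelength annotations start with "fluor" or "abs" and end with "_wavelength".
wavelength_cols = [
    c for c in columns
    if c.startswith(("fluor", "abs")) and c.endswith("_wavelength")
]

print(measurement_cols)
print(wavelength_cols)
```

With a real DataFrame, the same filters would be applied to its column labels.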

htstools.normalize module

Functions for normalizing data.

htstools.normalize.normalize(data: DataFrame, measurement_col: str, control_col: str, neg: str, method: str | None = None, pos: str | None = None, group: str | List[str] | None = None, flip: bool = False) → DataFrame

Normalize a column based on controls, optionally within groups.

Positive controls should represent the 0% signal, and negative controls should represent the 100% signal. If you set flip = True, then this is reversed.

Calculations are performed within groups, such as batches or plates, indicated by the group column. This function takes the group-wise mean of the negative controls, $\mu_n$, and, optionally, of the positive controls, $\mu_p$, then calculates the normalized signal within each group.

Two methods are offered:

Normalized proportion of growth (NPG)

Within each group calculates the normalized signal, $s$, of each measured datapoint, $m$:

$$s = \frac{m - \mu_p}{\mu_n - \mu_p}$$

If you set flip = True, then this equation is used instead:

$$s = \frac{m - \mu_n}{\mu_p - \mu_n}$$

Requires both positive and negative controls.

Proportion of negative (PON)

Within each group calculates the normalized signal, $s$, of each measured datapoint, $m$:

$$s = \frac{m}{\mu_n}$$

If you set flip = True, then this equation is used instead:

$$s = 1 - \frac{m}{\mu_n}$$

Requires only negative controls.

Parameters:

data (pandas.DataFrame) – Input dataframe.
measurement_col (str) – Name of column containing raw data.
control_col (str) – Name of column containing control indicators.
neg (str) – Name of negative controls.
method (str) – One of PON or NPG. Default PON.
pos (str, optional) – Name of positive controls.
group (str or list, optional) – Name of column containing the grouping variable, such as plates or batches. If not set, then the entire dataset is taken as one big group.
flip (bool, optional) – Set positive controls as 100% signal, and negative controls as 0% signal.

Returns:

Input data with additional columns, containing mean positive and negative control values (headers ending with “_neg_mean” and “_pos_mean”) and normalized data values (header ending with “_norm”).

Return type:

pandas.DataFrame

Raises:

KeyError – If measurement_col is not in data.
ValueError – If neg or pos is not in data.
TypeError – If control_col is not a str column.

Examples

>>> import pandas as pd
>>> a = pd.DataFrame(dict(compound=['p', 'p', 'c1', 'c2', 'n', 'n'],
...                       m_abs_ch1=[.1, .2, .5, .4, .9, .8],
...                       abs_ch1_wavelength=['600nm'] * 6))
>>> a  
    compound  m_abs_ch1 abs_ch1_wavelength
0        p        0.1              600nm
1        p        0.2              600nm
2       c1        0.5              600nm
3       c2        0.4              600nm
4        n        0.9              600nm
5        n        0.8              600nm
>>> normalize(a, control_col='compound', pos='p', neg='n', measurement_col='m_abs_ch1')  
    compound  m_abs_ch1 abs_ch1_wavelength  m_abs_ch1_neg_mean  m_abs_ch1_pos_mean  m_abs_ch1_norm.pon
0        p        0.1              600nm                0.85                0.15            0.117647
1        p        0.2              600nm                0.85                0.15            0.235294
2       c1        0.5              600nm                0.85                0.15            0.588235
3       c2        0.4              600nm                0.85                0.15            0.470588
4        n        0.9              600nm                0.85                0.15            1.058824
5        n        0.8              600nm                0.85                0.15            0.941176
>>> normalize(a, control_col='compound', pos='p', neg='n', measurement_col='m_abs_ch1', flip=True)  
    compound  m_abs_ch1 abs_ch1_wavelength  m_abs_ch1_neg_mean  m_abs_ch1_pos_mean  m_abs_ch1_norm.pon
0        p        0.1              600nm                0.85                0.15            0.882353
1        p        0.2              600nm                0.85                0.15            0.764706
2       c1        0.5              600nm                0.85                0.15            0.411765
3       c2        0.4              600nm                0.85                0.15            0.529412
4        n        0.9              600nm                0.85                0.15           -0.058824
5        n        0.8              600nm                0.85                0.15            0.058824
>>> normalize(a, control_col='compound', pos='p', neg='n', measurement_col='m_abs_ch1', method='npg')  
    compound  m_abs_ch1 abs_ch1_wavelength  m_abs_ch1_neg_mean  m_abs_ch1_pos_mean  m_abs_ch1_norm.npg
0        p        0.1              600nm                0.85                0.15           -0.071429
1        p        0.2              600nm                0.85                0.15            0.071429
2       c1        0.5              600nm                0.85                0.15            0.500000
3       c2        0.4              600nm                0.85                0.15            0.357143
4        n        0.9              600nm                0.85                0.15            1.071429
5        n        0.8              600nm                0.85                0.15            0.928571
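
To make the two formulas concrete, here is a minimal pure-Python sketch that applies them to single values. The helper names npg and pon are hypothetical and are not part of htstools; the control values are taken from the example table above.

```python
from statistics import mean


def npg(m, mu_n, mu_p, flip=False):
    """Normalized proportion of growth for one measurement m."""
    if flip:
        return (m - mu_n) / (mu_p - mu_n)
    return (m - mu_p) / (mu_n - mu_p)


def pon(m, mu_n, flip=False):
    """Proportion of negative for one measurement m."""
    ratio = m / mu_n
    return 1.0 - ratio if flip else ratio


# Control wells from the example above (single group).
mu_n = mean([0.9, 0.8])  # negative controls
mu_p = mean([0.1, 0.2])  # positive controls

pon_c1 = pon(0.5, mu_n)
npg_c1 = npg(0.5, mu_n, mu_p)
print(round(pon_c1, 6), round(npg_c1, 6))
```

For compound c1 (m = 0.5) this gives approximately 0.588235 for PON and 0.5 for NPG, in line with the tables above.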

htstools.plot module

Utilities for generating plots.

htstools.plot.plot_dose_response(data: DataFrame, x: str, y: str, file_prefix: str, color: str | None = None, color_control: str | None = None, facet: str | None = None, files: str | None = None, hlines: Iterable[float] | None = None, panel_size: float = 2.5, format: str = 'pdf', sharey: bool = False, sharex: bool = False, x_log: bool = False, y_log: bool = False) → List[str]

Plot dose response curves, optionally splitting data across files, facets and colors.

This is a flexible function for data exploration and presentation. Uses a color-blind friendly palette.

Parameters:

data (pandas.DataFrame) – Input data in columnar format.
x (str) – Column to use as x-axis.
y (str) – Column to use as y-axis.
file_prefix (str) – Prefix to use in output filenames.
color (str, optional) – If provided, use this column to split data into separate colored lines.
color_control (str, optional) – If provided, plot this value from the color column in dark grey.
facet (str, optional) – If provided, split plots into separate facets (panels) based on this column.
files (str, optional) – If provided, split plots into separate files based on this column.
hlines (list of float, optional) – Plot horizontal guidelines at these y-intercepts. Default: [0.].
panel_size (float, optional) – Size of a single panel (facet) in inches. Default: 2.5.
format (str, optional) – File format to save plots. Default: “pdf”.
sharex (bool, optional) – Whether to have shared x-axis ranges. Default: False.
sharey (bool, optional) – Whether to have shared y-axis ranges. Default: False.
x_log (bool, optional) – Whether to make x-axis log scale. Default: False.
y_log (bool, optional) – Whether to make y-axis log scale. Default: False.

Returns:

Filenames in which plots were saved.

Return type:

list

htstools.plot.plot_heatmap(data: DataFrame, x: str | Iterable[str], y: str, panel_size: float = 2.5) → Tuple[Axes, Figure]

htstools.plot.plot_histogram(data: DataFrame, x: str, control_col: str, negative: str, positive: str, panel_size: float = 2.5) → Tuple[Axes, Figure]

htstools.plot.plot_mean_sd(data: DataFrame, x: str | Iterable[str], y: str, panel_size: float = 2.5) → Tuple[Axes, Figure]

htstools.plot.plot_replicates(data: DataFrame, x: str, grouping: str | Iterable[str], control_col: str, negative: str, positive: str, panel_size: float = 2.5) → Tuple[Axes, Figure]

htstools.plot.plot_scatter(data: DataFrame, measurement_col: str, x: str, y: str, color: str | None = None, log_color: bool = False, hlines: Iterable[float] | None = None, vlines: Iterable[float] | None = None, x_log: bool = False, y_log: bool = False, panel_size: float = 2.5, **kwargs) → Tuple[Axes, Figure]

htstools.plot.plot_zprime(data: DataFrame, x: str | Iterable[str], y: str, panel_size: float = 4.5) → Tuple[Axes, Figure]

htstools.qc module

Functions for performing quality control checks on data.

htstools.qc.ssmd(data: DataFrame, measurement_col: str, control_col: str, pos: str, neg: str, group: str | Iterable[str] | None = None, robust: bool = False) → DataFrame

Calculate SSMD based on positive and negative controls, optionally within groups.

Calculations are performed within groups, such as batches or plates, indicated by the group column.

This function takes the group-wise mean and variance of positive and negative controls ($\mu_p$, $\mu_n$, $\sigma_p^2$, $\sigma_n^2$), and then within each group calculates the SSMD, $s$:

$$s = \frac{\mu_n - \mu_p}{\sqrt{\sigma_n^2 + \sigma_p^2}}$$

Parameters:

data (pandas.DataFrame) – Input dataframe.
measurement_col (str) – Name of column containing raw data.
control_col (str) – Name of column containing control indicators.
pos (str) – Name of positive controls.
neg (str) – Name of negative controls.
group (str or list, optional) – Name of column containing the grouping variable, such as plates or batches. If not set, then the entire dataset is taken as one big group.
robust (bool, optional) – Use median instead of mean (still uses variance). Default: False.

Returns:

Summary dataframe with columns for mean, variance, and SSMD.

Return type:

pandas.DataFrame
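
The SSMD formula above can be checked by hand for a single group. The following is a minimal sketch, not the library's implementation: the helper name ssmd_sketch and the control values are made up for illustration, and it assumes the sample variance (n - 1 denominator), which the documentation does not specify.

```python
from math import sqrt
from statistics import mean, median, variance


def ssmd_sketch(pos, neg, robust=False):
    """SSMD for one group of positive and negative control values."""
    # robust=True swaps the mean for the median but keeps the variance,
    # as described for the robust parameter.
    center = median if robust else mean
    # Assumption: sample variance (n - 1 denominator).
    return (center(neg) - center(pos)) / sqrt(variance(neg) + variance(pos))


# Hypothetical control wells from a single plate.
neg = [0.90, 0.80, 0.85]
pos = [0.10, 0.20, 0.15]

s = ssmd_sketch(pos, neg)
print(round(s, 4))
```

Here both controls have a variance of 0.0025 and their means differ by 0.7, so the SSMD is 0.7 / sqrt(0.005), roughly 9.9.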

htstools.qc.z_prime_factor(data: DataFrame, measurement_col: str, control_col: str, pos: str, neg: str, group: str | Iterable[str] | None = None, robust: bool = False) → DataFrame

Calculate Z’-factor based on positive and negative controls, optionally within groups.

Calculations are performed within groups, such as batches or plates, indicated by the group column.

This function takes the group-wise mean and standard deviation of positive and negative controls ($\mu_p$, $\mu_n$, $\sigma_p$, $\sigma_n$), and then within each group calculates the Z’-factor, $s$:

$$s = 1 - 3 \frac{\sigma_n + \sigma_p}{|\mu_n - \mu_p|}$$

Parameters:

data (pandas.DataFrame) – Input dataframe.
measurement_col (str) – Name of column containing raw data.
control_col (str) – Name of column containing control indicators.
pos (str) – Name of positive controls.
neg (str) – Name of negative controls.
group (str or list, optional) – Name of column containing the grouping variable, such as plates or batches. If not set, then the entire dataset is taken as one big group.
robust (bool, optional) – Use median and MAD instead of mean and standard deviation. Default: False.

Returns:

Summary dataframe with columns for mean, variance, and Z’-factor.

Return type:

pandas.DataFrame
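
As a worked instance of the Z’-factor formula above, here is a minimal sketch for a single group. The helper name z_prime_sketch and the control values are hypothetical, the sample standard deviation is an assumption, and the robust (median and MAD) variant is not covered.

```python
from statistics import mean, stdev


def z_prime_sketch(pos, neg):
    """Z'-factor for one group of positive and negative control values."""
    # Assumption: sample standard deviation (n - 1 denominator).
    spread = stdev(neg) + stdev(pos)
    window = abs(mean(neg) - mean(pos))
    return 1.0 - 3.0 * spread / window


# Hypothetical control wells from a single plate.
neg = [0.90, 0.80, 0.85]
pos = [0.10, 0.20, 0.15]

z = z_prime_sketch(pos, neg)
print(round(z, 4))
```

Both standard deviations are 0.05 and the means differ by 0.7, so the result is 1 - 3 × 0.1 / 0.7, roughly 0.571.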

htstools.summarize module

Statistical testing and hit-calling.

htstools.summarize.summarize(data: DataFrame, measurement_col: str, neg: List[str] | str, control_col: List[str] | str, group: List[str] | str) → DataFrame

Add summary statistics to dataframe.

Calculates log fold-change (LFC), strictly standardized mean difference (SSMD), T-test, and Mann-Whitney U for the measurement_col column of data. There must also be a column heading starting with measurement_col and ending in “_wavelength”.

Statistics are calculated within a group indicating repeated measurements of the same condition and, where appropriate, relative to the negative control label neg from the column control_col.

Parameters:

data (pandas.DataFrame) – Input dataframe. Must contain a column measurement_col along with a column heading starting with measurement_col and ending in “_wavelength”.
measurement_col (str) – The column for which statistics should be calculated.
group (str or list) – Columns which indicate the grouping within which statistics should be calculated. These groups indicate repeated measurements of the same experimental condition.
neg (str or list) – Negative control label(s). If more than one, should be in the same order as control_col.
control_col (str or list) – Column(s) from which to take the negative control label(s).

Returns:

Dataframe with summary statistics.

Return type:

pandas.DataFrame

Raises:

ValueError – If length of neg and control_col are not identical.
KeyError – If a column heading starting with measurement_col and ending in “_wavelength” is not present in data.

Examples

>>> import pandas as pd
>>> a = pd.DataFrame(dict(gene=['g1', 'g1', 'g2', 'g2', 'g1', 'g1', 'g2', 'g2'],
...                       compound=['n', 'n', 'n', 'n', 'cmpd1', 'cmpd1', 'cmpd2', 'cmpd2'],
...                       m_abs_ch1=[.1, .2, .9, .8, .1, .3, .5, .45],
...                       abs_ch1_wavelength=['600nm'] * 8))
>>> a  
    gene compound  m_abs_ch1 abs_ch1_wavelength
0    g1        n       0.10              600nm
1    g1        n       0.20              600nm
2    g2        n       0.90              600nm
3    g2        n       0.80              600nm
4    g1    cmpd1       0.10              600nm
5    g1    cmpd1       0.30              600nm
6    g2    cmpd2       0.50              600nm
7    g2    cmpd2       0.45              600nm
>>> summarize(a, measurement_col='m_abs_ch1', control_col='compound', neg='n', group='gene')  
  gene abs_ch1_wavelength  m_abs_ch1_mean  m_abs_ch1_std  ...  m_abs_ch1_t.stat  m_abs_ch1_t.p  m_abs_ch1_ssmd  m_abs_ch1_log10fc
0   g1              600nm          0.1750       0.095743  ...          0.361158       0.742922        0.210042           0.066947
1   g2              600nm          0.6625       0.221265  ...         -1.544396       0.199787       -0.807183          -0.108233

[2 rows x 12 columns]
>>> summarize(a, measurement_col='m_abs_ch1', control_col='compound', neg='n', group=['gene', 'compound'])  # doctest: +SKIP
  gene compound abs_ch1_wavelength  m_abs_ch1_mean  ...  m_abs_ch1_t.stat  m_abs_ch1_t.p  m_abs_ch1_ssmd  m_abs_ch1_log10fc
0   g1        n              600nm           0.150  ...          0.000000       1.000000        0.000000           0.000000
1   g2        n              600nm           0.850  ...          0.000000       1.000000        0.000000           0.000000
2   g1    cmpd1              600nm           0.200  ...          0.447214       0.711723        0.316228           0.124939
3   g2    cmpd2              600nm           0.475  ...         -6.708204       0.044534       -4.743416          -0.252725

[4 rows x 13 columns]
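
The log fold-change and SSMD columns for the g1 row of the first example (grouped by gene) can be reproduced by hand. The sketch below is an inference from the example output, not a description of the implementation: it assumes the fold-change is the group mean over the negative-control mean within that group, and that the SSMD uses sample variances.

```python
from math import log10, sqrt
from statistics import mean, variance

# g1 rows from the example: all measurements, and the negative-control subset.
g1_all = [0.1, 0.2, 0.1, 0.3]
g1_neg = [0.1, 0.2]  # rows where compound == 'n'

# Assumption: log10 of group mean over negative-control mean.
log10fc = log10(mean(g1_all) / mean(g1_neg))

# Assumption: difference of means over the root of summed sample variances.
ssmd = (mean(g1_all) - mean(g1_neg)) / sqrt(variance(g1_all) + variance(g1_neg))

print(round(log10fc, 6), round(ssmd, 6))
```

Under these assumptions the values come out at roughly 0.066947 and 0.210042, matching the m_abs_ch1_log10fc and m_abs_ch1_ssmd entries shown for g1.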

htstools.tables module

Utilities for joining and pivoting tables.

htstools.tables.join(left: DataFrame, right: DataFrame | Dict[str, DataFrame], how: str = 'inner') → DataFrame

Perform a database-style join (merge) between two dataframes.

This is simply a wrapper around pandas.merge() to catch errors and return the shared columns for joining.

Parameters:

left (pandas.DataFrame) – Left dataframe.
right (pandas.DataFrame or dict) – Right dataframe. If a dict, this should map a str to a pandas.DataFrame, as returned by pandas.read_excel() when reading multiple sheets. In this case, the sheets will be joined in order.
how (str, optional) – Style of join: “inner”, “outer”, “left”, or “right”. Default: “inner”.

Returns:

Shared column headers and joined dataframe.

Return type:

Tuple[str, pandas.DataFrame]

Raises:

AttributeError – When there are no shared columns.
ValueError – When attempting to join on columns of different types. This often happens when integers are stored as int in one dataframe and str in the other.
NotImplementedError – If anything other than a pd.DataFrame or a dictionary mapping to pd.DataFrame is provided to the right parameter.

Examples

>>> import pandas as pd
>>> a = pd.DataFrame(dict(column=['A', 'B', 'A', 'B'], abs=[.1, .2, .23, .11]))
>>> a  
    column   abs
0      A  0.10
1      B  0.20
2      A  0.23
3      B  0.11
>>> b = pd.DataFrame(dict(column=['B', 'A'], drug=['TMP', 'RIF']))
>>> b  
    column drug
0      B  TMP
1      A  RIF
>>> shared_cols, data = join(a, b)
>>> shared_cols
('column',)
>>> data  
  column   abs drug
0      A  0.10  RIF
1      A  0.23  RIF
2      B  0.20  TMP
3      B  0.11  TMP

htstools.tables.pivot_plate(df: DataFrame | Mapping[str, DataFrame], value_name: str = 'value') → DataFrame

Pivot from a row x column plate format to a columnar format.

Handy to convert a visual plate layout to a columnar format for data analysis.

Parameters:

df (pandas.DataFrame, Dict[str, pandas.DataFrame]) – Either a dataframe containing row labels as index and column labels as headings, or a dictionary of names mapping to such dataframes (as returned by pandas.read_excel()).
value_name (str, optional) – The column heading to give the values within the plate. Default: “value”.

Returns:

Columnar dataframe containing data from df.

Return type:

pandas.DataFrame

Raises:

ValueError – If df is not a dataframe or a dictionary.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> a = pd.DataFrame(index=list("ABCDEFGH"),
...                  columns=range(1, 13),
...                  data=np.arange(1, 97).reshape(8, 12))
>>> a  
    1   2   3   4   5   6   7   8   9   10  11  12
A   1   2   3   4   5   6   7   8   9  10  11  12
B  13  14  15  16  17  18  19  20  21  22  23  24
C  25  26  27  28  29  30  31  32  33  34  35  36
D  37  38  39  40  41  42  43  44  45  46  47  48
E  49  50  51  52  53  54  55  56  57  58  59  60
F  61  62  63  64  65  66  67  68  69  70  71  72
G  73  74  75  76  77  78  79  80  81  82  83  84
H  85  86  87  88  89  90  91  92  93  94  95  96
>>> pivot_plate(a, value_name="well_number")    
row_id column_id  well_number well_id plate_id
0       A         1            1     A01
1       B         1           13     B01
2       C         1           25     C01
3       D         1           37     D01
4       E         1           49     E01
..    ...       ...          ...     ...      ...
91      D        12           48     D12
92      E        12           60     E12
93      F        12           72     F12
94      G        12           84     G12
95      H        12           96     H12

[96 rows x 5 columns]
>>> pivot_plate({'sheet_1': a}, value_name="well_number")    
row_id column_id  well_number well_id plate_id
0       A         1            1     A01  sheet_1
1       B         1           13     B01  sheet_1
2       C         1           25     C01  sheet_1
3       D         1           37     D01  sheet_1
4       E         1           49     E01  sheet_1
..    ...       ...          ...     ...      ...
91      D        12           48     D12  sheet_1
92      E        12           60     E12  sheet_1
93      F        12           72     F12  sheet_1
94      G        12           84     G12  sheet_1
95      H        12           96     H12  sheet_1

[96 rows x 5 columns]
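
The reshaping that pivot_plate() performs can be pictured without pandas. The following is a minimal sketch of the same idea on a nested dict; it is an illustration rather than the library's implementation, and it assumes the column-major row order and two-digit well IDs seen in the output above.

```python
rows = "ABCDEFGH"
cols = range(1, 13)

# A 96-well plate laid out as row label -> column label -> value,
# numbered 1..96 row by row, like the dataframe in the example above.
plate = {r: {c: i * 12 + c for c in cols} for i, r in enumerate(rows)}

# One record per well, walking down each plate column in turn.
records = [
    {
        "row_id": r,
        "column_id": c,
        "well_number": plate[r][c],
        "well_id": f"{r}{c:02d}",
    }
    for c in cols
    for r in rows
]

print(len(records))
print(records[0])
print(records[-1])
```

The first record is well A01 with value 1, the second is B01 with value 13, and the last is H12 with value 96.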

htstools.tables.replicate_table(data: DataFrame, group: str | Iterable[str] | None = None, wide: str | None = None) → DataFrame

Annotate a dataframe with replicates within a group.

Adds a column called “replicate” which contains integer labels randomly assigned within groups indicating repeated measurements of the same experimental condition.

Parameters:

data (pandas.DataFrame) – Input dataframe.
group (str or list) – Columns which indicate the grouping within which replicates should be labeled. These groups indicate repeated measurements of the same experimental condition.
wide (str, optional) – If provided, returns a “wide” dataframe with replicate labels as column headings and the values of the named column as table values.

Returns:

Dataframe with a new column “replicate” with labels randomly assigned within the group. If a column name is provided to wide, then the table has the replicate labels as columns and the values from that column as values.

Return type:

pandas.DataFrame

Raises:

KeyError – If wide is provided and not a column in the data.

Examples

>>> import pandas as pd
>>> a = pd.DataFrame(dict(group=['g1', 'g1', 'g2', 'g2'],
...                  control=['n', 'n', 'p', 'p'],
...                  m_abs_ch1=[.1, .2, .9, .8],
...                  abs_ch1_wavelength=['600nm'] * 4))
>>> a  
    group control  m_abs_ch1 abs_ch1_wavelength
0    g1       n        0.1              600nm
1    g1       n        0.2              600nm
2    g2       p        0.9              600nm
3    g2       p        0.8              600nm
>>> replicate_table(a, group='group')  
    group control  m_abs_ch1 abs_ch1_wavelength  replicate
0    g1       n        0.1              600nm          1
1    g1       n        0.2              600nm          2
2    g2       p        0.9              600nm          2
3    g2       p        0.8              600nm          1
>>> replicate_table(a, group='group', wide='m_abs_ch1')   
replicate  rep_1  rep_2
group
g1           0.2    0.1
g2           0.8    0.9
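
Random assignment of replicate labels within groups can be sketched in plain Python as follows. The helper assign_replicates and its seed parameter are hypothetical and only illustrate the idea; the library may assign labels differently.

```python
import random
from collections import defaultdict


def assign_replicates(groups, seed=None):
    """Return one replicate label per row, shuffled within each group."""
    rng = random.Random(seed)

    # Collect the row positions belonging to each group.
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)

    labels = [0] * len(groups)
    for idxs in members.values():
        # Labels 1..n for a group of n rows, in random order.
        shuffled = list(range(1, len(idxs) + 1))
        rng.shuffle(shuffled)
        for idx, label in zip(idxs, shuffled):
            labels[idx] = label
    return labels


groups = ["g1", "g1", "g2", "g2"]
labels = assign_replicates(groups, seed=0)
print(labels)
```

Each group receives the labels 1 to n exactly once, so the order of labels within a group varies between runs unless a seed is fixed.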

htstools.utils module

Miscellaneous utilities for hts-tools.

htstools.utils.row_col_to_well(row, col, pad: bool = True) → Series

Concatenate row label and column label columns to a well label column.

Optionally left zero-pads the column label.

Parameters:

row (pandas.Series, numpy.ndarray, or list) – Row labels.
col (pandas.Series, numpy.ndarray, or list) – Column labels.
pad (bool, optional) – Whether to left zero-pad the column labels, e.g. A, 1 -> A01. Default: True.

Returns:

Well labels.

Return type:

pandas.Series

Examples

>>> row_col_to_well(row=['A', 'B', 'C'], col=[1, 6, 12])
0    A01
1    B06
2    C12
dtype: object
>>> row_col_to_well(row=['A', 'B', 'C'], col=[1, 6, 12], pad=False)
0     A1
1     B6
2    C12
dtype: object
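
The labeling rule itself is simple enough to sketch without pandas. The function below is a hypothetical stand-in that returns a plain list rather than a pandas.Series, and it assumes a fixed padding width of two digits, as seen in the examples above.

```python
def wells_sketch(row, col, pad=True):
    """Join row and column labels into well labels such as 'A01'."""
    if pad:
        # Assumption: column numbers are padded to two digits.
        return [f"{r}{int(c):02d}" for r, c in zip(row, col)]
    return [f"{r}{int(c)}" for r, c in zip(row, col)]


padded = wells_sketch(["A", "B", "C"], [1, 6, 12])
unpadded = wells_sketch(["A", "B", "C"], [1, 6, 12], pad=False)
print(padded)
print(unpadded)
```

Zero-padding keeps well labels sortable as strings, so A02 comes before A10.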
