# Usage - [First pipeline example](#first-pipeline-example) - [Parsing platereader exports](#parsing-platereader-exports) - [Adding experimental conditions](#adding-experimental-conditions) - [Normalization within batches](#normalization-within-batches) - [Plotting dose responses](#plotting-dose-response) - [Second pipeline example](#second-pipeline-example) - [Converting plate-shaped data to columns](#converting-plate-shaped-data-to-columns) - [Statistical testing](#statistical-testing) - [Other commands](#other-commands) **hts-tools** provides command-line utlities to analyse and plot data from platereaders, starting with the _raw_ exported data with no manual conversion or copy-pasting needed. The tools complete specific tasks which can be easily composed into analysis pipelines, because the TSV table output goes to `stdout` by default so they can be piped from one tool to another. To get a list of commands (tools), do ```bash hts --help ``` And to get help for a specific command, do ```bash hts --help ``` For the Python API, [see here](#python). ## First pipeline example This command takes several exported Excel files (matching the pattern `plate-?.xlsx`) from a Biotek platereader, adds annotations on experimental conditions, and normalizes the data based on positive and negative controls. Finally, dose-response curves are plotted. ```bash hts parse plate-?.xlsx --data-shape row --vendor Biotek \ | hts join --right layout.xlsx \ | hts normalize --control compound_name --positive RIF --negative DMSO --grouping plate_id \ | hts plot-dose -x concentration --facets guide_name --color compound_name \ --output plt-test ``` ### Parsing platereader exports The command `hts parse` converts Excel or CSV or TSV files exported from platereader software from a specified vendor into a uniform columnar table format used by all downstream hts-tools. The `data-shape` option indicates whether the export was in plate format or row-wise table format. It should usually be the first command in a pipeline. `hts parse` produces a table with at least the following columns (with example entries): | row_id | column_id | well_id | plate_id | data_filename | data_sheet | measured_abs_ch1 | abs_ch1_wavelength | | ------- | --------- | -------- | -------- | ------------- | ---------------- | ---------------- | ------------------ | | A | 1 | A01 | Plate 6 | plate-6.xlsx | Plate6 - Sheet 1 | 0.366 | 600nm | | A | 2 | A02 | Plate 6 | plate-6.xlsx | Plate6 - Sheet 1 | 0.402 | 600nm | If only fluorescence was measured, the last two columns would be called `measured_fluor_ch1` and `fluor_ch1_wavelength`. If there were multiple measurements, then they will appear as additional columns with increasing `ch` (channel) numbers, for example, `measured_fluor_ch2`, `fluor_ch2_wavelength`. There will also be additional columns using information scraped from the input file. These will usually have headings starting with `meta_`, such as `meta_protocol_file_path`, `meta_date`, `meta_time`, and `meta_reader_serial_number`. ### Adding experimental conditions The output from `hts` parse is passed to `hts join`, which combines two tables based on values in shared columns (known as a database join). For example, if `layout.xlsx` contains a sheet with this table: | column_id | plate_id | compound_name | | --------- | -------- | ------------- | | 2 | Plate 6 | trimethoprim | | 1 | Plate 6 | moxifloxacin | then `hts parse plate-?.xlsx --data-shape row | hts join --right layout.xlsx` will result in | row_id | column_id | well_id | plate_id | data_filename | data_sheet | measured_abs_ch1 | abs_ch1_wavelength | compound_name | | ------- | --------- | -------- | -------- | ------------- | ---------------- | ---------------- | ------------------ | ------------- | | A | 1 | A01 | Plate 6 | plate-6.xlsx | Plate6 - Sheet 1 | 0.366 | 600nm | moxifloxacin | | A | 2 | A02 | Plate 6 | plate-6.xlsx | Plate6 - Sheet 1 | 0.402 | 600nm | trimethoprim | Since `column_id` and `plate_id` are the only column headings in common, the entries of these columns are used to match the rows of the tables. So where `column_id = 2`, `compound_name = trimethoprim` will be added. If you join an Excel XLSX file containing multiple sheets, these will each be joined in order. In this way, you can add experimental conditions easily by, for example, first joining the conditions that vary by plate and row (such as compound), then by column (such as concentration). This approach is very flexible, and you can join on any number of columns and add any new ones you like as long as the column headings aren't repeated. ### Normalization within batches The `hts normalize` command normalizes raw measured data based on controls. This can add power to downstream statsical analyses by mitigating batch-to-batch variation. `hts normalize` adds new columns for each measured column. These columns start with `calc_` and end with `_norm.{method}`, for example `calc_abs_ch1_norm.npg` and `calc_abs_ch2_norm.npg`. Two methods are offered which optionally normalize within within groups, such as batches or plates. #### Normalized proportion of growth (NPG) This method scales raw data to be between $0$ and $1$ based on positive _and_ negative controls, optionally within groups (or batches) of measurements. In the example above, the positive and negative controls are defined as `RIF` and `DMSO`, and should be found in the column `compound_name` (which may have been added by `hts join`). The positive and negative controls are averaged within each value in the `--grouping` column. In the example above, they will be averaged for each `plate_id`, and these will be used to normalize the measured values of that `plate_id` according to: $$s = \frac{m - \mu_p}{\mu_n - \mu_p}$$ where $s$ is the normalized value, $m$ is the measured value, and $\mu_p$ and $\mu_n$ are the mean positive and negative controls. #### Proportion of negative (PON) [default] This method scales raw data relative negative controls _only_, optionally within groups (or batches) of measurements. The negative controls are averaged within each value in the `--grouping` column. In the example above, they would be averaged for each `plate_id`, and these will be used to normalize the measured values of that `plate_id` according to: $$s = \frac{m}{\mu_n}$$ where $s$ is the normalized value, $m$ is the measured value, and $\mu_n$ is the mean negative control. ### Plotting dose response `hts plot-dose` is a very flexible command which takes the columnar tables as input and plots the data in almost any breakdown using a color-blind palette. The required `-x` option indicates which column to use as the x-axis (usually concentration). The y-axis will be values in all the measured and calculated columns (`hts plot-dr` plots them all automatically in seaparte files). The other options allow splitting the plots by file, facet (panel) and color according to the values in columns. The example above, `hts plot-dose -x concentration --facets guide_name --color compound_name`, will produce plots like this: The panels each value with the same `guide_name` is in a `facet` (panel), and the lines are colored by `compound_name`. ## Second pipeline example Here is another example showing the sequential use of `hts join` to join two tables of experimental data, and two other commands: `hts pivot` and `hts summarize`. ```bash hts pivot compounds.xlsx \ --name compound_name hts parse plate-*.txt --data-shape plate \ | hts join --right sample-sheet.csv \ | hts join --right pivoted-compounds.tsv \ | hts normalize --control compound_name --positive RIF --negative DMSO --grouping strain_name plate_id \ | hts summarize --control compound_name --positive RIF --negative DMSO --grouping strain_name compound_name --plot summary \ > summary.tsv ``` These commands are explained below. ### Converting plate shaped data to columns Sometimes you will want to use data, such as plate layouts, which are in a plate-shaped layout instead of a column format. For example: You can convert this to column format using `hts pivot`, which produces a table in the following format: | row_id | column_id | compound_name | well_id | plate_id | filename | | ------ | --------- | ------------- | ------- | ------------------------ | ------------------------ | | C | 2 | RIF | C02 | Plate 7 | compounds.xlsx | | D | 2 | LYSINE | D02 | Plate 7 | compounds.xlsx | | E | 2 | DMSO | E02 | Plate 7 | compounds.xlsx | | F | 2 | RIF | F02 | Plate 7 | compounds.xlsx | It is assumed that there is one plate per sheet for Excel files, and one plate per file for TSV and CSV files. The plate name is taken from the sheet name (Excel) or filename (other formats). You can prepend the names of the `plate_id` and `filename` columns with the `-x` option. for example, `hts pivot compounds.xlsx -x compound_source --name compound_name` would have columns `compound_source_plate_id` `compound_source_filename`. This is helpful when usign `hts join` later where the plate and filename columns would otherwise be shared but have different meanings and values, and you don't want to accidentally join on them. ### Statistical testing Groups of values (such as replicates) can be compared against a negative control for statistical testing using `hts summarize`. The `--grouping` option indicates the columns whose values together indicate values which are replicates of a particular condition of interest. For example, `--grouping strain_name compound_name` would indicate that values which have the same `strain_name` and `compound_name` are replicates. Statistical tests compare to the `--negative` values, and use all measured and normalized (`calc_*_norm`) columns. Currently, the Student's t-test and Mann-whitney U-test are implemented. The t-test is best suited to Normal-distributed data while the MWU is better for other distributions which might not have a nice bell curve distribution. Although `hts summarize` calcualtes both tests simultaneously, it's not a good idea to look for "significant" $p$-values in both. This is called p-hacking, and leads to false positives. Instead, decide which test is most appropriate for your data and stick with that one. This command also calculates other summary statistics such as between-replicate mean, variance, and SSMD. If a filename prefix is provided to the `--plot` option, then volcano and flashlight plots are produced, which may be useful for identifying hits of high throughput screens. ## Other commands There are several other commands from **hts-tools** which take as input the output from `hts parse`, joined to an experimental data table (`layout.xlsx` in these examples). - `hts qc` Do quality control checks by calculating mean, standard deviation, Z\'-factor, and SSMD, and plotting them. ```bash hts parse plate-?.xlsx --data-shape row \ | hts join --right layout.xlsx \ | hts qc --control compound_name --positive RIF --negative DMSO --grouping strain_name plate_id --plot qc-plot \ > qc.tsv ``` - `hts plot-hm` Plot heatmaps of signal intensity arranged by plate well. This can be useful to identify unwanted within-plate variability. ```bash hts parse plate-?.xlsx --data-shape row \ | hts join --right layout.xlsx \ | hts plot-hm --grouping strain_name sample_id plate_id --output hm ``` Here, `--grouping` identifies the columns which indicate values coming from the same plate. One file is produced per measured and normalized (`calc_*_norm`) column. - `hts plot-rep` Plot two replicates against each other for each condition. ```bash hts parse plate-?.xlsx --data-shape row \ | hts join --right layout.xlsx \ | hts plot-rep --control compound_name --positive RIF --negative DMSO --grouping strain_name compound_name --output rep ``` Here, `--grouping` identifies the unique conditions within which values are treated as replicates. The positives and negatives are plotted as different colors. One file is produced per measured and normalized (`calc_*_norm`) column. In the plots, the left column is on a linear-linear scale and the right column is on a log-log scale. There is one row of plots per wavelength set in the dataset. - `hts plot-hist` Plot histograms of the data values. ```bash hts parse plate-?.xlsx --data-shape row \ | hts join --right layout.xlsx \ | hts plot-hist --control compound_name --positive RIF --negative DMSO --output hist ``` The positives and negatives are plotted as different colors. One file is produced per measured and normalized (`calc_*_norm`) column. In the plots, the left two columns are on a linear-linear scale and the right two columns are on a log-log scale. There is one row of plots per wavelength set in the dataset.