# Analyze m1A data

This module contains all code for analyzing the m1A sites in the COX1, COX2 and COX3 samples.

## Environment variables

The provided `.env` file contains the following variables:
- `MOD_SITE`: The one-based index of the m1A site on the modified reference sequence (must be set to **71**)
- `UNMOD_SITE`: The one-based index of the A site on the unmodified reference sequence that corresponds to the m1A site in the modified sequence (**78**)
- `MOD_SITE_ZEROBASED`: The zero-based index of the m1A site (**70**)
- `UNMOD_SITE_ZEROBASED`: The zero-based index of the A equivalent (**77**)
- `HALF_WINDOW_SIZE`: The half-window size needed to capture the entire sequence shared between the modified and unmodified samples (**52**; corresponds to 52 bases up- and downstream of the central m1A/A site)
- `MAX_READS`: The number of reads to display (used when preparing the aligned data for plotting) (**10000**)
- `SAMPLES`: The sample names (**(cox1 cox2 cox3)**)
- `THREADS`: The number of parallel threads to use (**24**)

The coordinates and sample names should not be changed, since they are inherent to the data.
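
As an illustration of how these coordinates relate to each other, the following minimal sketch derives the zero-based window around each site from the one-based site index and the half-window size. The values are copied from the list above; the helper `window_bounds` and the half-open `[start, end)` convention are assumptions made for this example and are not taken from the scripts.

```python
# Values copied from the variable list above.
MOD_SITE = 71            # one-based m1A position on the modified reference
UNMOD_SITE = 78          # one-based A position on the unmodified reference
HALF_WINDOW_SIZE = 52    # bases kept on each side of the central site


def window_bounds(site_one_based: int, half_window: int) -> tuple[int, int]:
    """Return a zero-based, half-open [start, end) window around a site.

    Hypothetical helper; the half-open convention is an assumption.
    """
    center = site_one_based - 1  # one-based -> zero-based
    return center - half_window, center + half_window + 1


mod_start, mod_end = window_bounds(MOD_SITE, HALF_WINDOW_SIZE)
unmod_start, unmod_end = window_bounds(UNMOD_SITE, HALF_WINDOW_SIZE)

print(mod_start, mod_end)      # 18 123
print(unmod_start, unmod_end)  # 25 130
```

Both windows span 105 bases (52 + 1 + 52), so the modified and unmodified windows can be compared position by position.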

## Create Alignments

To create the alignments for each sample, run the [`01_run_alignments.sh`](./01_run_alignments.sh) script. This calculates the reference-to-signal alignment using Fishnet and prepares the generated alignments for visualization.

The alignments for each sample are written to the [./alignments/](./alignments/) directory.

To prepare the aligned signals for visualization, the alignments for a subset of reads are filtered, trimmed and written to a pickle file in the [./pickle_files/](./pickle_files/) directory (`./pickle_files/<condition>_<sample>_signal.pkl`).

## Calculate base-wise statistics

Base-wise statistics are calculated for all samples in the [`02_run_stats.sh`](./02_run_stats.sh) script. Here, Fishnet's `align` function is called for all alignment files, calculating the mean, the standard deviation and the dwell time for each base of each read. Bases are restricted to the 52 bases up- and downstream of the central modification site. These statistics are written to the [./reformatted/](./reformatted/) directory as `<condition>_<sample>_stats.parquet`.
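
To make the three per-base quantities concrete, here is a minimal NumPy sketch for a single read. It does not use Fishnet and is not the implementation behind the script: the function name `base_wise_stats`, the representation of an alignment as an array of segment boundaries, and measuring dwell time in samples are all assumptions made for illustration.

```python
import numpy as np


def base_wise_stats(signal, boundaries):
    """Compute mean, standard deviation and dwell time per base for one read.

    signal:     1D array of signal values (hypothetical input format).
    boundaries: sample indices of length n_bases + 1; base i covers
                signal[boundaries[i]:boundaries[i + 1]] (assumption).
    """
    signal = np.asarray(signal, dtype=float)
    rows = []
    for i in range(len(boundaries) - 1):
        chunk = signal[boundaries[i]:boundaries[i + 1]]
        rows.append(
            {
                "base_index": i,
                "mean": float(chunk.mean()),
                "std": float(chunk.std()),  # population standard deviation
                "dwell": len(chunk),        # dwell time in samples (assumption)
            }
        )
    return rows


# Toy read with three bases covering 3, 2 and 2 samples.
toy_signal = [1.0, 1.0, 1.0, 3.0, 5.0, 2.0, 2.0]
toy_boundaries = [0, 3, 5, 7]
stats = base_wise_stats(toy_signal, toy_boundaries)
for row in stats:
    print(row)
```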

Since the statistics from multiple reads are collected for each base, the resulting distributions can be further processed and statistically compared. 

Here percentiles are calculated from the modified and unmodified conditions for each sample. The resulting dataset contains the columns `base_index`, `base`, `feature`, `p05`, `p25`, `p50`, `p75`, `p95` and `sample` (which refers to the condition here).
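
A percentile table with these columns could be built roughly as follows. This is a sketch with pandas, assuming the per-read statistics are available in long format with a `value` column; that input layout and the function name `percentile_table` are hypothetical, only the output column names come from the description above.

```python
import pandas as pd

# Output column name -> quantile level.
QUANTILES = {"p05": 0.05, "p25": 0.25, "p50": 0.50, "p75": 0.75, "p95": 0.95}


def percentile_table(long_df: pd.DataFrame, condition: str) -> pd.DataFrame:
    """Summarise per-read values into percentiles per base and feature.

    long_df is assumed to have the columns base_index, base, feature, value.
    """
    grouped = long_df.groupby(["base_index", "base", "feature"])["value"]
    table = pd.DataFrame(
        {name: grouped.quantile(q) for name, q in QUANTILES.items()}
    ).reset_index()
    table["sample"] = condition  # `sample` holds the condition here
    return table


# Toy input: five reads, one base, one feature.
toy = pd.DataFrame(
    {
        "base_index": [0] * 5,
        "base": ["A"] * 5,
        "feature": ["mean"] * 5,
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
percentiles = percentile_table(toy, "modified")
print(percentiles)
```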

Modified and unmodified data is statistically compared using a two-sample Kolmogorov-Smirnov test and Cohen's d. Here, the resulting dataset contains the columns `base_index`, `base`, `feature`, `stat`, `p_val`, `p_val_corrected` and `cohens_d`.
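
The comparison for a single base and feature can be sketched with SciPy as shown below. The pooled-standard-deviation form of Cohen's d and the Bonferroni correction are assumptions: the text above does not say which variant of Cohen's d or which multiple-testing correction produces `cohens_d` and `p_val_corrected`. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp


def cohens_d(a, b) -> float:
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def compare_conditions(mod, unmod, n_tests: int) -> dict:
    """Two-sample KS test plus effect size for one base and one feature.

    n_tests is the number of comparisons corrected for; the Bonferroni
    correction used here is an assumption, not the documented method.
    """
    result = ks_2samp(mod, unmod)
    return {
        "stat": float(result.statistic),
        "p_val": float(result.pvalue),
        "p_val_corrected": min(1.0, float(result.pvalue) * n_tests),
        "cohens_d": cohens_d(mod, unmod),
    }


# Toy data: two normal distributions with shifted means.
rng = np.random.default_rng(0)
mod_values = rng.normal(loc=1.0, scale=1.0, size=200)
unmod_values = rng.normal(loc=0.0, scale=1.0, size=200)
comparison = compare_conditions(mod_values, unmod_values, n_tests=105)
print(comparison)
```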

Percentiles and results of the statistical comparisons are written to pickle files `<sample>_base_wise_stats_percentiles.pkl` and `<sample>_base_wise_stats_stats.pkl` in the [./pickle_files/](./pickle_files/) directory.

## Interpolate aligned chunks

The [`03_run_interpolation.sh`](./03_run_interpolation.sh) script interpolates the signal chunk of each base to a uniform number of samples. This is done for windows with a varying number of included bases: only the central m1A/A site, and the central site with +/- 1, 2 and 4 base(s) around it.
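
The idea of resampling each base to a fixed length can be sketched with linear interpolation in NumPy. This is an illustration only: the interpolation method used by the script is not stated here, and the function names, the boundary-array representation and the `n_points` parameter are assumptions.

```python
import numpy as np


def resample_chunk(chunk, n_points: int) -> np.ndarray:
    """Linearly resample one base's signal chunk to n_points samples."""
    chunk = np.asarray(chunk, dtype=float)
    if len(chunk) == 1:
        # A single sample cannot be interpolated; repeat it instead.
        return np.full(n_points, chunk[0])
    old_x = np.linspace(0.0, 1.0, len(chunk))
    new_x = np.linspace(0.0, 1.0, n_points)
    return np.interp(new_x, old_x, chunk)


def interpolate_window(signal, boundaries, center: int, distance: int,
                       n_points: int) -> np.ndarray:
    """Resample the central base and `distance` bases on each side.

    boundaries[i]:boundaries[i + 1] is assumed to delimit base i, and every
    base is assumed to cover at least one sample.
    """
    signal = np.asarray(signal, dtype=float)
    pieces = []
    for base in range(center - distance, center + distance + 1):
        chunk = signal[boundaries[base]:boundaries[base + 1]]
        pieces.append(resample_chunk(chunk, n_points))
    return np.concatenate(pieces)


# Toy read with three bases of different dwell times.
toy_signal = [0.0, 2.0, 5.0, 5.0, 5.0, 1.0]
toy_boundaries = [0, 2, 5, 6]
vector = interpolate_window(toy_signal, toy_boundaries, center=1, distance=1,
                            n_points=5)
print(vector.shape)  # (15,)
```

With a fixed `n_points`, every read yields a vector of length `(2 * distance + 1) * n_points`, regardless of how long each base dwelled.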

Interpolation is performed for modified and unmodified data and the interpolated data is written to `<condition>_<sample>_interp_<distance-from-central-base>.parquet` in the [./reformatted/](./reformatted/) directory.

Afterwards, the interpolated modified and unmodified data for each sample is combined and a UMAP is calculated with both conditions for a given sample. The dataset containing the `read_id`, `UMAP1`, `UMAP2` and `sample` columns is written to `<sample>_umap_<distance-from-central-base>.pkl` in the [./pickle_files/](./pickle_files/) directory.
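
A rough sketch of this step, assuming the interpolated data for each condition is a 2D array with one row per read, might look as follows. The function names, the condition labels `modified`/`unmodified` and the UMAP parameters are assumptions; the embedding relies on the `umap-learn` package, which is imported inside the function so that the combining step can be used without it.

```python
import numpy as np
import pandas as pd


def combine_conditions(mod, unmod, mod_ids, unmod_ids):
    """Stack both conditions into one feature matrix with matching metadata."""
    features = np.vstack([mod, unmod])
    meta = pd.DataFrame(
        {
            "read_id": list(mod_ids) + list(unmod_ids),
            # Condition labels are placeholders chosen for this example.
            "sample": ["modified"] * len(mod_ids)
            + ["unmodified"] * len(unmod_ids),
        }
    )
    return features, meta


def embed(features, meta, random_state=0):
    """Project the combined features to two dimensions with UMAP."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    coords = reducer.fit_transform(features)
    result = meta.copy()
    result["UMAP1"] = coords[:, 0]
    result["UMAP2"] = coords[:, 1]
    return result[["read_id", "UMAP1", "UMAP2", "sample"]]
```

Calling `embed(*combine_conditions(mod, unmod, mod_ids, unmod_ids))` would then return a table with the four columns listed above. Fitting one embedding on both conditions together, rather than one per condition, is what places modified and unmodified reads in the same coordinate space.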

## Visualization

The visualization is split into three parts:
1. [`plot_signal.ipynb`](./plot_signal.ipynb) handles visualization of the aligned signal
2. [`plot_base_wise_stats.ipynb`](./plot_base_wise_stats.ipynb) creates plots for the (statistical) comparison of the base-wise statistics
3. [`plot_dim_reduction.ipynb`](./plot_dim_reduction.ipynb) visualizes the UMAPs

Here the previously created pickle files in the [./pickle_files/](./pickle_files/) directory are used. The generated figures are stored in the [./figures/](./figures/) directory.
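
For inspecting the pickle files outside the notebooks, a small helper along these lines may be useful. It is a sketch using only the standard library; the function name `load_pickles` is hypothetical, and it assumes the files can be read with `pickle.load` (if they contain pandas objects, pandas must be installed). Only unpickle files you created yourself, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickles(directory, pattern="*.pkl"):
    """Load all matching pickle files into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open("rb") as handle:
            loaded[path.stem] = pickle.load(handle)
    return loaded
```

For example, `load_pickles("./pickle_files", "cox1_*.pkl")` would return every object stored for the `cox1` sample, keyed by file name without the `.pkl` extension.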

