# Comparing produced alignments

This module contains all code for comparing alignments produced by `Fishnet`, `Remora`, `f5c` and `Uncalled4`. Here, the subset containing [100000 medium-length reads](../01_data/dna_giab/subsets/medium_100000/) is used.

## Environment variables

This module does not require any additional environment variables.

## Note

The *eventalign* format produced by the reference-to-signal alignment functions of f5c and Uncalled4 does not contain the start and end signal indices for each base, so the corresponding signal chunks cannot be extracted directly. Accordingly, the reference-to-signal alignment comparison is only done with Fishnet and Remora.

The query-to-signal alignments are compared between all four tools.

## Create alignments

To calculate both reference- and query-to-signal alignments for all four tools, execute the [`01_create_alignments.sh`](./01_create_alignments.sh) script. The alignments are written to the [./alignments/](./alignments/) directory. Fishnet alignments are written to PARQUET format; all others are written to TSV. The output files are named as follows: `<tool-name>_<alignment-type>.<tsv/parquet>`

## Parse alignments

The [02_parse_alignments.sh](./02_parse_alignments.sh) script processes the output files produced by Remora, f5c and Uncalled4 into a uniform format. The parsed data is written to the [./alignments_parsed/](./alignments_parsed/) directory.

## Merge alignments

The [03_merge_alignments.sh](./03_merge_alignments.sh) script merges the (parsed) alignments into one large dataset, where each row holds the alignments of one read from all tools as NumPy arrays. The dataset contains a `read_id` column and an `alignment_<tool-name>` column for each tool.

The merged datasets are written to PARQUET format and are placed in the [./alignments_merged/](./alignments_merged/) directory.
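
The row layout described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the `parsed` input, the lower-cased tool labels, the use of plain lists instead of NumPy arrays, and the choice to keep only reads present for every tool are all assumptions.

```python
# Hypothetical parsed alignments: tool label -> {read_id: signal boundary indices}.
parsed = {
    "fishnet": {"read_a": [0, 5, 9], "read_b": [0, 4, 8]},
    "remora": {"read_a": [0, 6, 9], "read_b": [0, 4, 7]},
}


def merge_alignments(parsed):
    """Build one row per read with an `alignment_<tool-name>` column per tool."""
    # Assumption: keep only reads that every tool aligned (inner join on read_id).
    shared = set.intersection(*(set(reads) for reads in parsed.values()))
    rows = []
    for read_id in sorted(shared):
        row = {"read_id": read_id}
        for tool, reads in parsed.items():
            row[f"alignment_{tool}"] = reads[read_id]
        rows.append(row)
    return rows


merged = merge_alignments(parsed)
```

Each element of `merged` then corresponds to one row of the merged dataset, with the `read_id` key and one `alignment_<tool-name>` key per tool.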

## Compare alignments

The [04_compare_alignments.sh](./04_compare_alignments.sh) script calculates pairwise statistics for each read from the merged data. The following statistics are calculated between given alignments A and B:
1. Normalized mean difference (NMD)
2. Normalized maximum difference (the largest normalized difference between alignments A and B)
3. Percent of identical boundaries (the fraction of exact boundary matches between alignments A and B)
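
The three statistics can be illustrated with a small sketch. The normalization is not defined here, so this example makes explicit assumptions: both alignments are equal-length lists of signal boundary indices, differences are absolute, and they are divided by a hypothetical `signal_length` argument. The percent scaling (0 to 100) is also an assumption; the script may use a different definition.

```python
def compare_alignments(a, b, signal_length):
    """Compute the three pairwise statistics for boundary lists `a` and `b`."""
    if len(a) != len(b) or not a:
        raise ValueError("alignments must be non-empty and of equal length")
    # Absolute per-boundary difference in signal samples.
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        # Assumption: normalize by the length of the raw signal.
        "norm_mean_diff": sum(diffs) / len(diffs) / signal_length,
        "normalized_max_diff": max(diffs) / signal_length,
        # Share of boundaries placed at exactly the same signal index, in percent.
        "pct_identical_boundaries": 100 * sum(d == 0 for d in diffs) / len(diffs),
    }


metrics = compare_alignments([0, 5, 10, 20], [0, 7, 10, 18], signal_length=20)
```

In this toy example two of the four boundaries coincide, and the remaining two differ by two samples each.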

The statistics are collected for [query-to-signal](./alignments_compared/query_metrics.parquet) and [reference-to-signal](./alignments_compared/reference_metrics.parquet) alignments in the [./alignments_compared/](./alignments_compared/) directory.

The reference-to-signal dataset contains the `read_id`, `norm_mean_diff`, `normalized_max_diff` and `pct_identical_boundaries` columns, where each read gets one row, since only Fishnet and Remora are compared. The query-to-signal dataset contains multiple rows for each read, since multiple pairwise comparisons are done between the four tools. As such, this dataset additionally contains the `tool1` and `tool2` columns.
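
When more than two tools are compared, each unordered tool pair contributes one row per read, identified by `tool1` and `tool2`. A minimal sketch of that enumeration for the four tools in this module, where the lower-cased labels and the pair ordering are assumptions:

```python
from itertools import combinations

# Assumed labels for the four tools; the actual column values may differ.
tools = ["fishnet", "remora", "f5c", "uncalled4"]

# Every unordered pair (tool1, tool2) yields one comparison row per read.
pairs = list(combinations(tools, 2))
rows_per_read = len(pairs)
```

With four tools this gives six pairs, so a read that all tools aligned would appear in six rows.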

## Visualize

To visualize the calculated statistics, run the code in the [analyze_alignment.ipynb](./analyze_alignment.ipynb) Jupyter notebook. Generated figures are collected in the [./figures/](./figures/) directory.
