# Data preparation

This module contains the code needed to prepare the *Genome in a Bottle* (GIAB) DNA dataset and the COX1/2/3 direct RNA dataset for the subsequent modules.

## Environment variables

For execution, the following variables must be set in the `.env` file:
- `DORADO_MODEL`: Path to the dorado `rna004_130bps_sup@v5.2.0` model
- `DORADO_CUDA`: CUDA device configuration (e.g. `cuda:0`)
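
As a minimal sketch, a `.env` file could look like the following. The model path is a hypothetical placeholder and must be replaced with the actual location of the downloaded model on your system; the device value is the example given above.

```sh
# .env (example values only)
# Hypothetical path: point this to where the dorado model is stored locally.
DORADO_MODEL="/path/to/dorado_models/rna004_130bps_sup@v5.2.0"
# CUDA device passed to dorado.
DORADO_CUDA="cuda:0"
```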

## GIAB DNA

To prepare the GIAB DNA data, execute the [`prepare_giab_data.sh`](./prepare_giab_data.sh) script. This does the following:
1. Sets up the Python venv and installs dependencies
2. Downloads the reference genome
3. Downloads the kmer-levels table
4. Downloads the POD5 files
5. Basecalls and maps the POD5 data
6. Creates subsets of 100/1000/10000/100000 short/medium/long reads
7. For each subset, prepares various input files that are required later:
    - POD5 -> BLOW5
    - BAM -> FASTQ
    - Sorts and indexes the BAM file
    - Runs the f5c `index` command (this creates the `subset.blow5.idx`, `subset.fastq.index`, `subset.fastq.index.fai` and `subset.fastq.index.gzi` files)

All output is written to [`./dna_giab/`](./dna_giab/), and all subsets are written to the [`./dna_giab/subsets/`](./dna_giab/subsets/) directory.
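
The per-subset preparation in step 7 can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `prepare_giab_data.sh`: the use of `blue-crab` for the POD5 to BLOW5 conversion and of `samtools` for the BAM steps is assumed, as are the helper names `run` and `prepare_subset` and the `.sorted.bam` file name. Only the `f5c index` step is named in the description above.

```sh
#!/usr/bin/env bash
# Sketch only: blue-crab and samtools are assumed tools; the actual script
# may use different tools or options.

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Prepare the input files for one subset.
# $1: path prefix of the subset files (hypothetical), e.g. <dir>/subset
prepare_subset() {
    local prefix="$1"

    # POD5 -> BLOW5
    run blue-crab p2s "${prefix}.pod5" -o "${prefix}.blow5"

    # BAM -> FASTQ (-0 receives reads without paired-end flags)
    run samtools fastq -0 "${prefix}.fastq" "${prefix}.bam"

    # Sort and index the BAM file
    run samtools sort -o "${prefix}.sorted.bam" "${prefix}.bam"
    run samtools index "${prefix}.sorted.bam"

    # Index reads and signal data for f5c
    run f5c index --slow5 "${prefix}.blow5" "${prefix}.fastq"
}
```

Calling `DRY_RUN=1 prepare_subset subset` prints the five commands without executing them, which is a quick way to check the file names before running the tools.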


## COX direct RNA

To prepare the direct RNA data for the m1A analysis, execute the [`prepare_m1a_data.sh`](./prepare_m1a_data.sh) script. It requires the POD5 files to be placed in [`./rna_m1a/pod5/`](./rna_m1a/pod5/):
```
./rna_m1a/
└── pod5
    ├── mod_cox1.pod5
    ├── mod_cox2.pod5
    ├── mod_cox3.pod5
    ├── unmod_cox1.pod5
    ├── unmod_cox2.pod5
    └── unmod_cox3.pod5
```
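
Before running the script, it may help to verify that all six files from the layout above are in place. The following is a small sketch of such a check; the function name `check_pod5_inputs` is hypothetical and not part of the repository.

```sh
#!/usr/bin/env bash
# Sketch: report any of the six expected POD5 files that are missing.
# $1: directory to check (defaults to ./rna_m1a/pod5)
check_pod5_inputs() {
    local dir="${1:-./rna_m1a/pod5}"
    local missing=0
    local cond gene

    for cond in mod unmod; do
        for gene in cox1 cox2 cox3; do
            if [ ! -f "${dir}/${cond}_${gene}.pod5" ]; then
                echo "missing: ${dir}/${cond}_${gene}.pod5" >&2
                missing=1
            fi
        done
    done

    # Returns 0 if all files exist, 1 otherwise.
    return "$missing"
}
```

For example, `check_pod5_inputs && ./prepare_m1a_data.sh` would start the preparation only if no file is reported as missing.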

For each sample, the script performs basecalling and maps the reads against the matching reference provided in [`./rna_m1a/ref/`](./rna_m1a/ref/).