# Processing speed benchmark

This module contains all code for benchmarking the processing times of `Fishnet`, `Remora`, `f5c` and `Uncalled4`.

## Environment variables

For execution, the following variables must be set in the `.env` file:
- `SLURM_MEM`: The memory allocated to the Slurm jobs (64G)
- `SLURM_CPU`: The number of CPU cores allocated to the Slurm jobs (32)
- `SLURM_EMAIL`: The email address that Slurm notifications are sent to
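
Before generating jobs, it can help to confirm that all three variables are present. The following is a minimal sketch, not part of this module: it assumes the `.env` file uses plain `KEY=VALUE` lines, and the helper names `parse_env` and `missing_vars` are hypothetical.

```python
REQUIRED_VARS = ("SLURM_MEM", "SLURM_CPU", "SLURM_EMAIL")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required variables that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Example content in which SLURM_EMAIL has been forgotten.
example = """
SLURM_MEM=64G
SLURM_CPU=32
"""
print(missing_vars(parse_env(example)))  # ['SLURM_EMAIL']
```

To check a real file, the text could be read with `open(".env").read()` and passed to `parse_env`.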

## Generating jobs

To keep the benchmark modular, each benchmark run is contained in an individual bash script. To generate these scripts, run the [`01_generate_jobs.sh`](./01_generate_jobs.sh) script. It generates the command to be benchmarked and wraps it in a job script that contains the hyperfine logic.
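
To illustrate the wrapping step, the sketch below renders a job script from a template. It is an assumption about the general shape, not the template that `01_generate_jobs.sh` uses: the `#SBATCH` directives and hyperfine's `--export-json` option are standard, but the layout, the `render_job` helper, the output path, and the placeholder command are hypothetical. The timestamp in the result name is omitted here for brevity.

```python
import shlex

# Hypothetical template; the real job scripts may differ.
JOB_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --mem={mem}
#SBATCH --cpus-per-task={cpus}
#SBATCH --mail-user={email}
#SBATCH --mail-type=END,FAIL

hyperfine --export-json {json_path} {command}
"""


def render_job(name, command, mem, cpus, email, results_dir="results"):
    """Wrap a benchmark command in a Slurm job script that calls hyperfine."""
    return JOB_TEMPLATE.format(
        name=name,
        mem=mem,
        cpus=cpus,
        email=email,
        # Quote both arguments so the shell passes each one as a single word.
        json_path=shlex.quote(f"{results_dir}/{name}.json"),
        command=shlex.quote(command),
    )


# "echo placeholder" stands in for the actual tool invocation.
script = render_job(
    "fishnet_short_100_query_1", "echo placeholder", "64G", 32, "user@example.com"
)
print(script)
```

The `mem`, `cpus`, and `email` arguments correspond to `SLURM_MEM`, `SLURM_CPU`, and `SLURM_EMAIL` from the `.env` file.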

The job scripts are written to [./scripts/jobs](./scripts/jobs/) and each file name follows the convention:
```
<tool-name>_<read-length>_<number-of-reads>_<alignment-type>_<number-of-threads>.sh
```
With the following values:
- `tool-name`: fishnet | remora | f5c | uncalled4
- `read-length`: short | medium | long
- `number-of-reads`: 100 | 1000 | 10000 | 100000
- `alignment-type`: query | reference
- `number-of-threads`: 1 | 8 | 16 | 24 (note that remora is only benchmarked single-threaded, i.e. with 1 thread)

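After executing the script, the jobs directory should therefore contain **312** scripts: three tools at four thread counts plus remora at one thread give 13 tool/thread combinations, each run for 3 read lengths × 4 read counts × 2 alignment types (13 × 24 = 312).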
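
The expected count can be reproduced by enumerating the parameter grid. This is a small sketch for checking the arithmetic, not the generation script itself, and the function name `job_names` is hypothetical.

```python
from itertools import product

TOOLS = ("fishnet", "remora", "f5c", "uncalled4")
READ_LENGTHS = ("short", "medium", "long")
READ_COUNTS = (100, 1000, 10000, 100000)
ALIGNMENT_TYPES = ("query", "reference")
THREADS = (1, 8, 16, 24)


def job_names():
    """List every expected job script name, following the naming convention."""
    names = []
    for tool, length, count, alignment, threads in product(
        TOOLS, READ_LENGTHS, READ_COUNTS, ALIGNMENT_TYPES, THREADS
    ):
        # remora is only benchmarked single-threaded.
        if tool == "remora" and threads != 1:
            continue
        names.append(f"{tool}_{length}_{count}_{alignment}_{threads}.sh")
    return names


print(len(job_names()))  # 312
```

Comparing this list against the contents of the jobs directory is a quick way to spot missing or unexpected scripts.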

## Submitting jobs

To start benchmark runs, execute the [`02_run_jobs.sh`](./02_run_jobs.sh) script. This submits the Slurm jobs. Jobs are executed sequentially so as not to flood the node with hundreds of concurrent jobs. This has the added benefit that the individual Slurm jobs are unlikely to time out.
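
How `02_run_jobs.sh` enforces sequential execution is not described here. One common way to achieve it on Slurm is to chain jobs with `sbatch --dependency=afterany:<jobid>`, so each job starts only after the previous one has ended. The sketch below shows that technique under this assumption; `submit_chain` and the `run` parameter are hypothetical names.

```python
import subprocess


def submit_chain(scripts, run=None):
    """Submit scripts so that each one starts only after the previous one ends."""
    if run is None:
        def run(cmd):
            # Default runner: call sbatch and return what it prints.
            result = subprocess.run(cmd, check=True, capture_output=True, text=True)
            return result.stdout

    previous_id = None
    job_ids = []
    for script in scripts:
        cmd = ["sbatch", "--parsable"]
        if previous_id is not None:
            cmd.append(f"--dependency=afterany:{previous_id}")
        cmd.append(script)
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        previous_id = run(cmd).strip().split(";")[0]
        job_ids.append(previous_id)
    return job_ids


# Dry run with a fake runner, so the sketch works without a Slurm cluster.
def fake_sbatch(cmd, _counter=[100]):
    _counter[0] += 1
    print(" ".join(cmd))
    return f"{_counter[0]}\n"


submit_chain(["job_a.sh", "job_b.sh"], run=fake_sbatch)
```

`afterany` releases the next job regardless of the previous job's exit status, so a single failed run does not block the rest of the queue.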

Upon execution, the following options are given:
1. Submit all jobs: All 312 jobs are submitted at once.
2. Submit a specific subset of jobs: A pattern can be provided that uses the wildcard character (`*`) to filter for specific options. The pattern should match the naming convention stated above. For example, to submit all reference alignment runs for Fishnet, use:
```
fishnet_*_reference_*
``` 

Afterwards, the selected jobs are added to the queue for sequential execution (note that this can take a while, depending on the parameter combination).
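
To preview which scripts a pattern would select, the wildcard matching can be approximated with Python's `fnmatch` module. This is a sketch that assumes the run script uses shell-style globbing against the script file names; `select_jobs` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_jobs(names, pattern):
    """Return the job script names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


jobs = [
    "fishnet_short_100_reference_1.sh",
    "fishnet_long_1000_query_8.sh",
    "f5c_short_100_reference_1.sh",
    "fishnet_medium_10000_reference_24.sh",
]
print(select_jobs(jobs, "fishnet_*_reference_*"))
# ['fishnet_short_100_reference_1.sh', 'fishnet_medium_10000_reference_24.sh']
```

Here the trailing `*` also absorbs the `.sh` extension, which is why the pattern does not need to spell it out.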

The results of each benchmarking run are written to the [./results/](./results/) directory in JSON format. Each result file is named after the input file, with an added timestamp that corresponds to the start time of the job.
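
A result file can be loaded for a quick check with a few lines of Python. The sketch below relies on hyperfine's JSON export, which stores a `results` list with fields such as `mean` and `stddev` (in seconds). The file name layout `<job-name>_<timestamp>.json`, the timestamp format, and the helper names are assumptions made for illustration. The demo file and its numbers are placeholders, not measurements.

```python
import json
import tempfile
from pathlib import Path


def parse_result_name(path):
    """Split a result file name into benchmark parameters.

    Assumed layout: <tool>_<length>_<reads>_<alignment>_<threads>_<timestamp>.json
    """
    fields = Path(path).stem.split("_", 5)
    tool, length, reads, alignment, threads = fields[:5]
    return {
        "tool": tool,
        "read_length": length,
        "reads": int(reads),
        "alignment": alignment,
        "threads": int(threads),
        # Everything after the fifth field is kept as an opaque timestamp.
        "timestamp": fields[5] if len(fields) > 5 else None,
    }


def load_result(path):
    """Combine the file-name parameters with hyperfine's timing summary."""
    data = json.loads(Path(path).read_text())
    run = data["results"][0]  # assumes one benchmarked command per file
    return {
        **parse_result_name(path),
        "mean_s": run["mean"],
        "stddev_s": run.get("stddev"),
    }


# Demo with a made-up result file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "fishnet_short_100_query_1_20240101T120000.json"
    payload = {"results": [{"command": "echo placeholder", "mean": 1.5, "stddev": 0.1}]}
    demo.write_text(json.dumps(payload))
    record = load_result(demo)

print(record["tool"], record["threads"], record["mean_s"])  # fishnet 1 1.5
```

Records of this shape can be collected into a list and passed to a plotting or table library.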

## Visualization

Visualization functions are provided in the [`visualize.ipynb`](./visualize.ipynb) Jupyter notebook. The generated figures are written to the [./figures/](./figures/) directory.