Authors

Lexi E. Frank1,2*, Nicole Flack1,2, Christopher Faulk3, Alyssa J. Block4, 5, Jason C. Bartz2,4, Peter A. Larsen1,2

1 Department of Veterinary and Biomedical Sciences, University of Minnesota, St. Paul, MN, U.S. 55108, 2 Minnesota Center for Prion Research and Outreach, St. Paul, MN, U.S. 55108, 3 Department of Animal Science, University of Minnesota, St. Paul, MN, U.S. 55108, 4 Department of Medical Microbiology and Immunology, School of Medicine, Creighton University, Omaha, NE, U.S., 5 Current Address: Department of Microbiology, Immunology, and Pathology, Colorado State University, Fort Collins, CO, U.S. 80523

*Corresponding Author

Lexi E. Frank

fran1464@umn.edu

Department of Veterinary and Biomedical Sciences

1971 Commonwealth Ave. St. Paul, MN 55108

OCRiD: 0009-0006-7736-4146

Published in Acta Neuropathologica

Experiment Details

Samples: DNA extracted from Mesocricetus auratus (syrian hamsters) brain homogenate.

Extraction method: Qiagen MagAttract kit

Sequencing Instrument: Promethion 2 solo, R.10 Promethion flow cell

Library Preparation: SQK-LSK114, SQK-NBD114.24, SQK-NBD114.96

The experiment for this study had 3 controls and 3 treated hamsters per time point. There were 3 time points.

DNA Methylation Analysis

File Manipulation

Sort BAM files using samtools

samtools sort -@ 32 hamster.bam -o hamster.sorted.bam

Index Sorted BAMs

Index file is needed for modkit pileup in the next step.

samtools index -@ 32 hamster.sorted.bam 

Create .bed Methyl Files with modkit pileup

You need the sorted BAMs with samtools index files (.bai) in the folder.

modkit pileup -t 32 --cpg --ref GCF_017639785.1_BCM_Maur_2.0_genomic.fa --combine-strands --only-tabs hamster.sorted.bam hamster.modkit.bed

Awk to Separate 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) Data

This works for all files in a directory.

for FILE in *.modkit.bed ; do
  PREFIX=$(echo ${FILE} | sed 's/\.bed//')
  awk -F'\t' '($4=="h") {print $0}' ${FILE} > ${PREFIX}.h.bed 
  awk -F'\t' '($4=="m") {print $0}' ${FILE} > ${PREFIX}.m.bed
done

Global Methylation Calculation

This works to create a .txt file for each individual with the proportion of canonical Cs, 5mc, and 5hmcs

for FILE in hamster*.modkit.m.bed ; do
  PREFIX=$(echo ${FILE} | sed 's/\.bed//')
  awk '{can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified \n" (oth/valid) " 5hmCpG   modified"}' ${FILE} > ${PREFIX}.txt
done

Global Methylation Differences Across Time Points

Calculate if there is a significant difference between global methylation levels across time points. Requires sample_info object defined below.

global_methyl_files <- list.files("modkit_out_m", "*modkit.m.txt",
                                  full.names = TRUE)

names(global_methyl_files) <- str_extract(global_methyl_files, "hamster\\d{1,2}")

global_methyl <- map_dfr(as.list(global_methyl_files),
                         ~ read_delim(.x,
                            delim = " ",
                            col_names = c("ratio", "modification", "note"),
                            show_col_types = FALSE),
                         .id = "ID") %>% 
  filter(modification != "CpG") %>% 
  inner_join(sample_info, by = "ID") %>% 
  group_by(timepoint, modification) %>% 
  nest() %>% 
  mutate(ttest = map(data, ~ t.test(ratio ~ group, data = .x))) %>% 
  mutate(pvalue = map_chr(ttest, ~ pluck(.x, "p.value")))

global_methyl

Differential Methylation Analysis with Methylkit

Use Methylkit for differential DNA methylation analysis. See methylkit vignette.

Create Sample Info

Create a file that contains information such as sample and file names, treatment, time point, etc.

Read in Data

Load in files with methylation data into R studio.

# modkit output column specs
modkit_cols <- list(
  fraction = FALSE,
  chr.col = 1,
  start.col = 2,
  end.col = 3,
  coverage.col = 5,
  strand.col = 6,
  freqC.col = 11
)

# use the file names to load the data
mk_obj <- methRead(
  as.list(sample_info$file),
  sample.id = as.list(sample_info$ID),
  assembly = "hamster",
  pipeline = modkit_cols,
  treatment = sample_info$treatment_binary,
  mincov = 10
)

mk_obj

Summery stats

Coverage

# Iterate over the GRangesList to plot all of the samples
map(mk_obj, getCoverageStats, plot = T, both.strands = F)

# Print one sample's coverage data. You can do both.strands = T to check strand bias.
getCoverageStats(mk_obj[[2]], plot = T, both.strands = F)

Methylation Histogram

In a histogram of methylation percentages across a given sample, we expect a large number of low and high methylation CpGs (peaks on left and right ends) with less density across the middle.

getMethylationStats(mk_obj[[1]],
                    plot = T,
                    both.strands = F
                    )

# Iterate across all
map(mk_obj, getMethylationStats, plot = T, both.strands = F)

Normalization and Coverage Filtering

Normalizing coverage between samples is useful for avoiding bias when coverage across samples is uneven. Filtering anomalously high or low coverage is also helpful.

mk_obj_normal <- filterByCoverage(mk_obj,
                                  lo.count = 10,
                                  hi.perc = 99.9) %>%
  normalizeCoverage(.)

Recheck the summary stat plots after normalization and filtering. Setting both.strands = T can detect strand bias.

# Print one sample's coverage data
getCoverageStats(mk_obj_normal[[2]], plot = T, both.strands = F)

# one sample's methylation stats
getMethylationStats(mk_obj_normal[[2]], plot = T, both.strands = F)
# Iterate over the GRangesList to plot all of the samples
map(mk_obj_normal, getMethylationStats, plot = T, both.strands = F)

# Print one sample's coverage data. You can do both.strands = T to check strand bias.
getCoverageStats(mk_obj_normal[[2]], plot = T, both.strands = F)

Unite

This operation selects CpGs that are covered in all samples.

# Destrand = F allows us to examine DMRs
mk_m_meth <-
  methylKit::unite(mk_obj_normal, destrand = F)

Tiling for Differentially Methylated Regions (DMRs)

We choose windows of 1,000 bases for our regional analysis.

DMRs per Time Point

Separate data for each time point. Time point 3 used as an example.

mk_windows_1000 <- tileMethylCounts(mk_m_meth,win.size=1000,step.size=1000,cov.bases = 3, mc.cores = 10)

## reorganize data to include only timpepoint 3 (do this for all time points)
mk_windows_1000_reorg3 <- reorganize(mk_windows_1000, sample.ids = c("hamster11", "hamster17", "hamster18", "hamster2", "hamster6", "hamster7"), treatment = c(0,0,1,1,0,1), save.db = TRUE)

 
# diff meth statistics for all regions (do this for all time points)
diff_mkwindows3_1000 <- calculateDiffMeth(mk_windows_1000_reorg3,
                             mc.cores = 6)

# filter for significant regions (do this for all time points)
delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000,
                          difference = 10,
                          qvalue = 0.05)

Get Stats for Methylation

Do this for each time point. Time point 3 used as an example.

# Separate out hyper and hypomethylation

# get hyper-methylated
hyper_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hyper")

# get hypo-methylated
hypo_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hypo")

## make it a data frame
delta_mkdfwindows3_1000 <- getData(delta_mkwindows3_1000)
hypo_delta_mkdfwindows3_1000 <- getData(hypo_delta_mkwindows3_1000)
hyper_delta_mkdfwindows3_1000 <- getData(hyper_delta_mkwindows3_1000)

## make the background a dataframe
diff_mkdfwindows3df_1000 <- getData(diff_mkwindows3_1000)


##summary stats, hyper or hypo and how much, genic region enrichment (i.e. exons, promoters)

## summary stats,number and  percent average hyper and hypo, and absolute
summarise(delta_mkdfwindows3_1000, mean(abs(meth.diff)))

##summary of count of all
nrow(delta_mkdfwindows3_1000)

## mean of hypermethylated regions
delta_mkdfwindows3_1000 %>% 
  filter(meth.diff > 0) %>%
  summarize(mean(meth.diff))

## calculate count of hypermethylation
delta_mkdfwindows3_1000 %>%
  summarize(count = sum(meth.diff > 0))

## mean of hypomethylated regions
delta_mkdfwindows3_1000 %>% 
  filter(meth.diff < 0) %>%
  summarize(mean(meth.diff))

## calculate count of hypomethylation
delta_mkdfwindows3_1000 %>%
  summarize(count = sum(meth.diff < 0))

## make a histogram
hist(delta_mkdfwindows3_1000$meth.diff)

## see distinct chr 
delta_mkdfwindows3_1000 %>% distinct(chr)

## see shared chr DMRs between timepoint
delta_mkdfwindows3_1000 %>% inner_join(delta_mkdfwindows1_1000, by = c('chr', 'start')) %>% nrow()

Clustering of Individuals

PCA

Setting transpose = FALSE (default is TRUE) turns this into a “regular” PCA plot, where you’re looking at the contribution of samples to the variance.

PCASamples(mk_m_meth, transpose = FALSE)

PCA Plot using ggplot2

##create the objects to make the pca
pca.obj = PCASamples(mk_m_meth, transpose = FALSE, obj.return = TRUE)
pca.obj2 <- pca.obj$rotation[, 1:2] %>% as.data.frame()

##ggplot
p1<- ggplot(pca.obj2, aes(PC1, PC2)) + 
  geom_point(aes(color=sample_info$treatment)) +
  geom_text_repel(label=rownames(pca.obj2), max.overlaps = 15, aes(color=sample_info$treatment)) +
  ggtitle("Hamster CpG 5mc Methylation PCA plot") +
  theme_bw()
p1

Dendrogram

clusterSamples(mk_m_meth)

Scree plot

PCASamples(mk_m_meth, screeplot = TRUE)

Genomation

Use Genomation for summary and annotation of genomic intervals.

RefSeqAll track downloaded from UCSC Genome Browser on 27 march 2024 as “all fields”. Replace ID column with symbol to retain in annotation object:

awk 'BEGIN {OFS="\t"} NR>1 {$4=$13; NF=12; print}' BCM_Maur_2.0_Mar2021_refseqall.bed > BCM_Maur_2.0_Mar2021_refseqall_symbol.bed

Create Annotation Object

This can be done by treatment, time point, etc., comparing the group to the background.

gene_obj <-
  genomation::readTranscriptFeatures("~/Desktop/reference_data/BCM_Maur_2.0_Mar2021_refseqall_bed12.bed", remove.unusual = F)

hyper_delta3W_gr <- makeGRangesFromDataFrame(hyper_delta_mkdfwindows3_1000, keep.extra.columns = T)

# export as a bed file
library(rtracklayer)
export.bed(hyper_delta3W_gr, con = 'hyper_delta3W.bed')

## background
diff3W_gr <- makeGRangesFromDataFrame(diff_mkdfwindows3df_1000, keep.extra.columns = T)

# annotate
annot <- annotateWithGeneParts(delta3W_gr, gene_obj, intersect.chr = T)
getTargetAnnotationStats(annot)
plotTargetAnnotation(annot)

## annotate background
annot_null <- annotateWithGeneParts(diff3W_gr, gene_obj, intersect.chr = T)
getTargetAnnotationStats(annot_null)
plotTargetAnnotation(annot_null)

Chi-Square Test

# Generate pie charts for DMRs and background
##par(mfrow = c(1,2))
##plotTargetAnnotation(annot,
                     ##main = "DMRs by Genic Region")
##plotTargetAnnotation(annot_null,
                    ##main = "All CpGs by Genic Region")

# Count DMRs that overlap with target gene features vs. background
dmr3W_stats <- genomation::getTargetAnnotationStats(annot, percentage = F)
null3W_stats <- genomation::getTargetAnnotationStats(annot_null,
                                                   percentage = F) 
dmr_null_matrix3W <- rbind(dmr3W_stats, null3W_stats)

# 2-way table chisq function
test_annot_matrix <- function(x,y){
  matrix <- x %>% 
  data.frame %>%
  rownames_to_column(var = "source") %>%
  mutate(other_features = rowSums(across(!c(source, all_of(y))))) %>% 
  dplyr::select(source, all_of(y), other_features) %>%
  column_to_rownames(var = "source") %>%
  as.matrix()
  
  test <- chisq.test(matrix)
  
  return(list(matrix = matrix, test = test))
}

# Calculate enrichment in genic regions vs. background
map(list("promoter", "exon", "intron"), 
         ~ test_annot_matrix(dmr_null_matrix3W, .x))

Get Top DMR-Associated Genes

annot_bed <- read_tsv("~/Desktop/reference_data/BCM_Maur_2.0_Mar2021_refseqall.bed", show_col_types = F) %>%
  dplyr::select(name, name2)

topgenes <- annot@dist.to.TSS %>%
  left_join(annot_bed, by = c("feature.name" = "name")) %>%
  group_by(name2) %>%
  tally() %>%
  arrange(desc(n))
topgenes

## wrtie a tsv of top DMR associated genes
write_tsv(topgenes, file = "topDMRassociatedgenes_tp1.tsv")

Lift Over Genes to Enhance Reference

Use NCBI genome, lift over mm39 (Mus musculus) genes to hamster due to lack of dedicated gene ontology resources for the hamster.

Installed LiftOff version 1.6.3 via conda: conda install -c bioconda liftoff. Downloaded GFFs from NCBI.

liftoff -g ../GCF_000001635.27_GRCm39_genomic.gff -o mm39_lift_to_Maur2.gff ../GCF_017639785.1_BCM_Maur_2.0_genomic.fna ../GCF_000001635.27_GRCm39_genomic.fna

Get Genome Data

# pull scaffold info and gene IDs for most recent hamster assembly
lt = getGenomeDataFromNCBI("GCF_017639785.1", return_granges = FALSE)

# generate named vector to translate RefSeq scaffold IDs to Genbank
seqlevels_ncbi2genbank = structure(lt$genome$genbankAccession,
                       names = lt$genome$refseqAccession)

# isolate data frames from list and edit to gene ID column name
df_genome = lt$genome
df_genes = lt$gene %>% dplyr::rename(gene_id = `EntreZ ID`)

# make gene data into GRanges object
genes = makeGRangesFromDataFrame(df_genes,
                                 seqnames.field = "RefSeq Contig Accession",
                                 start.field = "Start",
                                 end.field = "End",
                                 keep.extra.columns = TRUE)

# rename GRanges object seqlevels (chromosome/scaffold IDs)
genes <- renameSeqlevels(genes, seqlevels_ncbi2genbank)

# assign seqlevel and seqlength metadata using genome data
seqlevels(genes) = df_genome$genbankAccession
seqlengths(genes) = df_genome$length

# get rds files
delta3W_gr <- readRDS("delta3W_gr.rds")

# translate NCBI seqnames to GenBank in for all timepoints
deltaW_gr_list <- list(delta1W_gr, delta2W_gr, delta3W_gr)

names(deltaW_gr_list) <- c('delta1W_gr', 'delta2W_gr', 'delta3W_gr')

deltaW_gr_list <- map(deltaW_gr_list,
                      ~ renameSeqlevels(.x, seqlevels_ncbi2genbank))


# genome = lt$genome
# genome = genome[!is.na(genome[, 2]), ]
df_genome$start = 1
df_genome$strand = "+"
genome = GRanges(
  seqnames = df_genome[, 2],
  ranges = IRanges(df_genome[, 6], df_genome[, 5]),
  strand = df_genome[, 4],
  gene_id = df_genome[, 5] 
)
genome <- renameSeqlevels(genome, seqlevels_ncbi2genbank)
seqlevels(genome) = df_genome$genbankAccession
seqlengths(genome) = df_genome$length

Format Lifted Genes

# use liftoff gff and create GRanges
lift_genes <- gffToGRanges("liftoff_out/mm39_lift_to_Maur2.gff", filter = "gene")

# change to genbank names
lift_genes <- renameSeqlevels(lift_genes,
                              seqlevels_ncbi2genbank)
# make dataframe of lift_genes
lift_genes_df <- lift_genes %>%
  data.frame() %>%
  # relocate(gene_id, .after = "strand") %>%
  dplyr::select(!(model_evidence:last_col())) %>% 
  filter(gene_biotype == "protein_coding" & type == "gene") %>% 
  mutate(gene_id = map_chr(Dbxref,
                           ~ pluck(.x, 1) %>% 
           str_remove(., "GeneID:"))) %>% 
  relocate(gene_id, .after = "strand")

#make granges from dataframe
lift_genes_filter <- makeGRangesFromDataFrame(lift_genes_df,
                                              keep.extra.columns = TRUE)

seqlevels(lift_genes_filter) = df_genome$genbankAccession
seqlengths(lift_genes_filter) = df_genome$length

#EU660218.1 5310-6854 

genome(lift_genes_filter) <- "GCF_017639785.1"
genome(genome) <- "GCF_017639785.1"

lift_genes_extend <- extendTSS(lift_genes_filter,
                               extend_from = "gene",
                               seqlengths = seqlengths(lift_genes_filter),
                               mode = "basalPlusExt",
                        basal_upstream = 5000,      
                        basal_downstream = 5000,
                               extension = 0,
                               gene_id_type = "ENTREZ")
genome(lift_genes_extend) <- "GCF_017639785.1"

Find Closest Genes to DMRs

Use bedtools closests to find the intersecting/nearest gene to differentially methylated regions (in bed file format) from bed file of liftoff annotation to get a list for enrichment analysis (DAVID). Do this for all time points data. Time point 3 is used as an example here.

# All data for one time point
delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05)

# Separate out hyper and hypo
# get hyper-methylated
hyper_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hyper")

# get hypo-methylated
hypo_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hypo")

## make it a data frame
delta_mkdfwindows3_1000 <- getData(delta_mkwindows3_1000)
hypo_delta_mkdfwindows3_1000 <- getData(hypo_delta_mkwindows3_1000)
hyper_delta_mkdfwindows3_1000 <- getData(hyper_delta_mkwindows3_1000)

## Make it a granges
hypo_delta3W_gr <- makeGRangesFromDataFrame(hypo_delta_mkdfwindows3_1000, keep.extra.columns = T)
hyper_delta3W_gr <- makeGRangesFromDataFrame(hyper_delta_mkdfwindows3_1000, keep.extra.columns = T)

#export as a bed file
library(rtracklayer)
export.bed(hyper_delta3W_gr, con = 'hyper_delta1W.bed')
export.bed(hypo_delta3W_gr, con = 'hypo_delta1W.bed')
export.bed(delta3W_gr, con = 'delta1W.bed')
# In command line
# make reference bed only contain genes
# awk -F'\t' '{if($8=="gene")print $0}' mm39_lift_to_Maur2.bed > genes_mm39_lift_to_Maur2.bed

# Find closest gene to methylkit DMR output and only print column with gene name information
bedtools closest -a hypo_delta3W.bed -b genes_mm39_lift_to_Maur2.bed | awk '{"\t"; print $16}' > genes_hypo_delta3W.tsv

# Find closest gene to methylkit DMR output and print everything
bedtools closest -a hypo_delta3W.bed -b genes_mm39_lift_to_Maur2.bed > ALLgenes_hypo_delta3W.tsv

# sed to print only lines with no .
sed '/^[[:space:]]*\.[[:space:]]*$/d' genes_delta3W.tsv > cleaned_genes_delta3W.tsv

# awk to only print $2 column separated by ;
awk -F';' '{print $2}' cleaned_genes_delta3W.tsv > 2cleaned_genes_delta3W.tsv

# sed to print only gene name
sed -n 's/.*GeneID:\([0-9]*\),.*/\1/p' 2cleaned_genes_delta3W.tsv > final_genes_delta3w.tsv

Take this list of genes into DAVID.

GO Term Visualization

Visualize DAVID output with ggplot2.

library(ggplot2)

# load in data
meth_alltp_DAVID_BP <- read.csv("~/Desktop/DAVID/meth_alltp_0.05_DAVID_BP.csv")

subset_meth_alltp_DAVID_BP <- subset(meth_alltp_DAVID_BP, `Benjamini` < 0.002)

## plot
ggplot(data = subset_meth_alltp_DAVID_BP, aes(x = Group, y = Term, color = `Benjamini`, size = Count)) + geom_point() + scale_color_gradient(low = "red", high = "blue", name = "Benjamini-Hochberg\n Adjusted P-Value") + theme_bw() + ylab("Gene Ontology Term") + xlab("") + ggtitle("Gene Ontology Enrichment by Time Point") + theme(axis.text.y = element_text(size=12, colour = "black")) + theme(axis.title.y = element_text(size=18, face="bold", colour = "black")) + theme(plot.title = element_text(size = 20, face = "bold",hjust = 0.5) ) + theme(axis.text.x = element_text(face="bold",size=12, colour = "black"))

## save plot
ggsave("meth_GOenrich_pvalue_0.002.jpeg", plot = last_plot(), path = "/home/Desktop/DAVID", scale = 1, width=17, height=10, units='in', dpi = 300)

RNAseq Analysis

RNAseq data was generated with Dual-indexed libraries using Takara/Clontech Stranded Total RNA-seq kit v2 - Pico Input Mammalian reagents.

Fastqc

Use fastqc v0.12.1 to check quality of raw reads.

gunzip 1_Shield_S2_L003_R1_001.fastq.gz
fastqc 1_Shield_S2_L003_R1_001.fastq --out fastqc_out

Trimming Sequencing Adapters

Use BBduk from BBmap v35.85 to trim the sequencing adapters.

 # bbduk 
 bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq
 
 # bbduk for one set of paired ends
 ~/Desktop/bbmap/bbduk.sh -Xmx1g ref= ~/Desktop/adapters.fa in1=10_Shield_S9_L003_R1_001.fastq in2=10_Shield_S9_L003_R2_001.fastq out1=10_Shield_S9_L003_R1_001_trim.fastq out2=10_Shield_S9_L003_R2_001_trim.fastq 
 
 # bbduk loop for multiple sets of paired end samples
 for R1 in *R1*
do
  R2=${R1//R1_001.fastq.gz/R2_001.fastq.gz}
  R1trim=${R1//.fastq.gz/_trim.fastq.gz}
  R2trim=${R2//.fastq.gz/_trim.fastq.gz}
~/Desktop/bbmap/bbduk.sh -Xmx1g ref= ~/Desktop/bbmap/resources/adapters.fa in1=$R1 in2=$R2 out1=$R1trim out2=$R2trim
done

Align to Reference Genome with STAR

Use STAR aligner to align sequence data to a reference genome.

Generate Genome Index


# for GFF instead of GTF file use --sjdbGTFtagExonParentTranscript instead of --sjdbGTFfile
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir ~/Desktop/rnaseq_hamster/genomeDir --genomeFastaFiles ~/Desktop/rnaseq_hamster/reference_data/GCF_017639785.1_BCM_Maur_2.0_genomic.fa --sjdbGTFfile ~/Desktop/rnaseq_hamster/liftoff_out/mm39_lift_to_Maur2.gff 

Align

Align one sample’s paired end files.

mkdir alignments

STAR --runThreadN 30 \
      --genomeDir ~/Desktop/rnaseq_hamster/genomeDir \
      --readFilesIn 1_Shield_S2_L003_R1_001_trim.fastq.gz 1_Shield_S2_L003_R2_001_trim.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts \
      --outFileNamePrefix alignments/1_Shield_S2_L003

Loop to align files R1 and R2 for all samples at once.

# define variables
index=~/Desktop/rnaseq_hamster/genomeDir/

for i in *_001_trim.fastq.gz
do 
  PREFIX=$(echo ${i} | sed 's/\_R._001_trim.fastq.gz//')
  echo $PREFIX

  # define R1 fastq filename
  R1=~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/${PREFIX}_R1_001_trim.fastq.gz
  echo $R1

 # define R2 fastq filename
  R2=~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/${PREFIX}_R2_001_trim.fastq.gz
  echo $R2

 # align with STAR
  STAR --runThreadN 30 \
    --genomeDir $index \
    --readFilesIn ${R1} ${R2} \
    --outSAMtype BAM   SortedByCoordinate \
    --quantMode GeneCounts \
    --readFilesCommand zcat 
    --outFileNamePrefix alignments/${PREFIX}_
done

echo "done!"

Use featureCounts to Count Transcripts

Use featureCounts v2.0.3 to count mapped reads.

featureCounts -M -p -t exon -F GFF -g "ID" -a ~/Desktop/rnaseq_hamster/liftoff_out/mm39_lift_to_Maur2.gff -o ~/Desktop/rnaseq_hamster/featurecounts_out/1_M_counts.txt ~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/alignments/test/1_Shield_S2_L003Aligned.sortedByCoord.out.bam

DESeq2

Use DESeq2 to determine differential methylation.

Import Data

It is VERY important that the rows of coldata are in the same order as the columns of the featurecounts matrix

# read in metadata file 
coldata <- read.csv("sample_data.csv", row.names=1)

# manipulate colData
coldata <- coldata[,c("treatment","timepoint","group")]
coldata$treatment <- factor(coldata$treatment)
coldata$timepoint <- factor(coldata$timepoint)
coldata$group <- factor(coldata$group)

# read in feature counts file
featurecounts <- read.delim("featurecounts_out/featurecounts/all_counts.txt", header = TRUE, skip = 1)

# change the column names to match coldata
colnames(featurecounts)
colnames(featurecounts) <- c("Geneid", "Chr", "Start", "End", "Strand", "Length", "10", "11", "12", "13","14","15","16","17","18","1","20","3","4","5","6","7","8","9")

# make row names the Geneid column
#rownames(featurecounts) <- featurecounts[ , 1]
#featurecounts = as.matrix(featurecounts[ , -1])

# the first column right of Length is where the counts start, until the end of the matrix
column.from = which(colnames(featurecounts) == "Length") + 1
column.end  = ncol(featurecounts)

Create DESeq Dataset

# import into DESeq2 matrix
dds <- 
DESeqDataSetFromMatrix(countData = featurecounts[, column.from:column.end],
                       colData = coldata,
                       design = ~ treatment)
# add gene names as row names to dds
rownames(dds) <- featurecounts$Geneid

dds

# look at assay
assay(dds)

# look at assay names
assayNames(dds)

# examine count sums of all individual samples gene expression
colSums(assay(dds))

Data Exploration

Explore the data following the DESeq2 vignettes.

Pre-filtering for Data Exploration

Perform pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples. For data exploration only, NOT for statistical analysis.

nrow(dds)
# pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds_ex <- dds[keep,]
nrow(dds_ex)

Transform Data to VST

Transform data to stabilize the variance across the mean using the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend.

vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)

Transform Data to rlog

Transform data to stabilize the variance across the mean using the regularized-logarithm transformation or rlog

rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)

Compare Transformations

dds <- estimateSizeFactors(dds)

df <- bind_rows(
  as.data.frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
         mutate(transformation = "log2(x + 1)"),
  as.data.frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  as.data.frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
  
colnames(df)[1:2] <- c("x", "y")  

lvls <- c("log2(x + 1)", "vst", "rlog")
df$transformation <- factor(df$transformation, levels=lvls)

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)  

Compare Sample Distances

To ensure we have a roughly equal contribution from all genes, we use it on the VST data. calculate the Euclidean distance between samples.

sampleDists <- dist(t(assay(vsd)))
sampleDists

# make the heatmap of sample distances using euclidean distances
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vsd$treatment, vsd$timepoint, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Calculate Sample Distance using Poisson

This measure of dissimilarity between counts takes the inherent variance structure of counts into consideration when calculating the distances between samples.

library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds_ex)))

# Heatmap of sample-to-sample distances using the Poisson Distance.
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste( dds_ex$treatment, dds_ex$timepoint, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)

PCA

Principle Component Analysis of the VSD transformed data.

plotPCA(vsd, intgroup=c("treatment", "timepoint"))

# PCA but more detailed (using ggplot)
pcaData <- plotPCA(vsd, intgroup=c("treatment", "timepoint"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=treatment, shape=timepoint)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

Differential Expression Analysis on Raw Data

Treatments versus controls only here, no time series included. DESeq2 uses the Benjamini-Hochberg (BH) adjustment.

dds <- DESeq(dds)
resultsNames(dds)
res <- results(dds)
res

## order results by smallest p-value
resOrdered <- res[order(res$pvalue),]
resOrdered

## Summarize basic tallies
summary(res)

## Summarize how many adjusted p-values were less than 0.1?
sum(res$padj < 0.1, na.rm=TRUE)

# results function, lower the false discovery rate threshold
res10 <- results(dds, alpha=0.10)
res05 <- results(dds, alpha=0.05)
summary(res10)

# Raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment. For example, by specifying lfcThreshold = 1, we test for genes that show significant effects of treatment on gene counts more than doubling or less than halving, because 2^1=2.
resLFC1 <- results(dds, lfcThreshold=.5)
table(resLFC1$padj < 0.1)

# Information about which variables and tests were used can be found by calling the function mcols on the results object.
mcols(res)$description
mcols(res, use.names = TRUE)

# The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment in comparison to untreated samples. This value is reported on a logarithmic scale to base 2.

Volcano Plot

plotMA(res10, ylim=c(-6,6))

Histogram

hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")

Heatmap of Counts

library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("treatment","timepoint")])
pheatmap(assay(dds)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=TRUE, annotation_col=df)

Another Histogram

The ratio of small p values for genes binned by mean normalized count. The p values are from a test of log2 fold change greater than 1 or less than -1.

qs <- c(0, quantile(resLFC1$baseMean[resLFC1$baseMean > 0], 0:6/6))
bins <- cut(resLFC1$baseMean, qs)
levels(bins) <- paste0("~", round(signif((qs[-1] + qs[-length(qs)])/2, 2)))
fractionSig <- tapply(resLFC1$pvalue, bins, function(p)
                          mean(p < .05, na.rm = TRUE))
barplot(fractionSig, xlab = "mean normalized count",
                     ylab = "fraction of small p values")

Time Course DESeq2 Analysis

Separate Out Time Points

Separate out data from each time point using the count data. For example, here is time point 3.

# Time point 3
dds3 <- DESeqDataSetFromMatrix(countData = featurecounts[, c("11","17","18","20","6","7")], colData = coldata[c("11","17","18","20","6","7"),], ~ treatment)
# add gene names as row names to dds
rownames(dds3) <- featurecounts$Geneid

dds3

# Results
dds3 <- DESeq(dds3)
resultsNames(dds3)
res3 <- results(dds3)
summary(res3)
res3

# results function
res3_05 <- results(dds3, alpha=0.05)
summary(res3_05)
res3_05

#order by smallest pvalue
res3 <- res3[order(res3$padj),]
res3

Export Results to csv

write.csv(res3, file = "~/Desktop/rnaseq_hamster/results/tp1_gene_results.csv")

Volcano plots

plotMA(res3, ylim=c(-6,6))

GO Term Visualization

library(ggplot2)

# load in data
alltp_DAVID_BP <- read.csv("~/Desktop/rnaseq_hamster/DAVID/alltp_0.05_DAVID_BP.csv")

## plot
ggplot(data = alltp_DAVID_BP, aes(x = Group, y = Term, color = `Benjamini`, size = Count)) + geom_point() + scale_color_gradient(low = "red", high = "blue", name = "Benjamini-Hochberg\n Adjusted P-Value") + scale_size(name = "Gene Count") + theme_bw() + ylab("GO Biological Processes Term") + xlab("") + ggtitle("Gene Ontology Enrichment by Time Point") + theme(axis.text.y = element_text(size=14, colour = "black")) + theme(plot.title = element_text(size = 20, face = "bold",hjust = 0.5) ) + theme(axis.title.y = element_text(size=14, face="bold", colour = "black")) + theme(axis.text.x = element_text(face="bold",size=14, colour = "black")) 

## save plot
ggsave("rnaseq_GOenrich_all.jpeg", plot = last_plot(), path = "/home/larsenlab/Desktop/rnaseq_hamster/DAVID", scale = 1, width=15, height=10, units='in', dpi = 300)
---
title: "Epigenetic Changes Associated with the Progression of Prion Disease in Syrian Hamsters (*Mesocricetus auratus*) - Bioinformatic Pipeline"
date: "2025-10-03"
output: 
  html_notebook:
    toc: true
    toc_depth: 3
    self_contained: true
eval: false
editor_options: 
  chunk_output_type: inline
---

## Authors

Lexi E. Frank1,2\*, Nicole Flack1,2, Christopher Faulk3, Alyssa J. Block4, 5, Jason C. Bartz2,4, Peter A. Larsen1,2

1 Department of Veterinary and Biomedical Sciences, University of Minnesota, St. Paul, MN, U.S. 55108, 2 Minnesota Center for Prion Research and Outreach, St. Paul, MN, U.S. 55108, 3 Department of Animal Science, University of Minnesota, St. Paul, MN, U.S. 55108, 4 Department of Medical Microbiology and Immunology, School of Medicine, Creighton University, Omaha, NE, U.S., 5 Current Address: Department of Microbiology, Immunology, and Pathology, Colorado State University, Fort Collins, CO, U.S. 80523

\*Corresponding Author

Lexi E. Frank

fran1464\@umn.edu

Department of Veterinary and Biomedical Sciences

1971 Commonwealth Ave. St. Paul, MN 55108

OCRiD: 0009-0006-7736-4146

Published in Acta Neuropathologica

# Experiment Details

Samples: DNA extracted from *Mesocricetus auratus* (syrian hamsters) brain homogenate.

Extraction method: Qiagen MagAttract kit

Sequencing Instrument: Promethion 2 solo, R.10 Promethion flow cell

Library Preparation: SQK-LSK114, SQK-NBD114.24, SQK-NBD114.96

The experiment for this study had 3 controls and 3 treated hamsters per time point. There were 3 time points.

# DNA Methylation Analysis

## File Manipulation

Sort BAM files using [samtools](https://www.htslib.org/)

```{bash}
samtools sort -@ 32 hamster.bam -o hamster.sorted.bam
```

## Index Sorted BAMs

Index file is needed for modkit pileup in the next step.

```{bash}
samtools index -@ 32 hamster.sorted.bam 
```

## Create .bed Methyl Files with modkit pileup

You need the sorted BAMs with samtools index files (.bai) in the folder.

```{bash}
modkit pileup -t 32 --cpg --ref GCF_017639785.1_BCM_Maur_2.0_genomic.fa --combine-strands --only-tabs hamster.sorted.bam hamster.modkit.bed
```

## Awk to Separate 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) Data

This works for all files in a directory.

```{bash}
for FILE in *.modkit.bed ; do
  PREFIX=$(echo ${FILE} | sed 's/\.bed//')
  awk -F'\t' '($4=="h") {print $0}' ${FILE} > ${PREFIX}.h.bed 
  awk -F'\t' '($4=="m") {print $0}' ${FILE} > ${PREFIX}.m.bed
done
```

## Global Methylation Calculation

This works to create a .txt file for each individual with the proportion of canonical Cs, 5mc, and 5hmcs

```{bash}
for FILE in hamster*.modkit.m.bed ; do
  PREFIX=$(echo ${FILE} | sed 's/\.bed//')
  awk '{can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified \n" (oth/valid) " 5hmCpG   modified"}' ${FILE} > ${PREFIX}.txt
done
```

## Global Methylation Differences Across Time Points

Calculate if there is a significant difference between global methylation levels across time points. Requires `sample_info` object defined below.

```{r warning=FALSE}
global_methyl_files <- list.files("modkit_out_m", "*modkit.m.txt",
                                  full.names = TRUE)

names(global_methyl_files) <- str_extract(global_methyl_files, "hamster\\d{1,2}")

global_methyl <- map_dfr(as.list(global_methyl_files),
                         ~ read_delim(.x,
                            delim = " ",
                            col_names = c("ratio", "modification", "note"),
                            show_col_types = FALSE),
                         .id = "ID") %>% 
  filter(modification != "CpG") %>% 
  inner_join(sample_info, by = "ID") %>% 
  group_by(timepoint, modification) %>% 
  nest() %>% 
  mutate(ttest = map(data, ~ t.test(ratio ~ group, data = .x))) %>% 
  mutate(pvalue = map_chr(ttest, ~ pluck(.x, "p.value")))

global_methyl
```

## Differential Methylation Analysis with Methylkit

Use [Methylkit](https://github.com/al2na/methylKit) for differential DNA methylation analysis. See [methylkit vignette](https://github.com/al2na/methylKit/blob/master/vignettes/methylKit.Rmd).

### Create Sample Info

Create a file that contains information such as sample and file names, treatment, time point, etc.

### Read in Data

Load in files with methylation data into R studio.

```{r}
# modkit output column specs
modkit_cols <- list(
  fraction = FALSE,
  chr.col = 1,
  start.col = 2,
  end.col = 3,
  coverage.col = 5,
  strand.col = 6,
  freqC.col = 11
)

# use the file names to load the data
mk_obj <- methRead(
  as.list(sample_info$file),
  sample.id = as.list(sample_info$ID),
  assembly = "hamster",
  pipeline = modkit_cols,
  treatment = sample_info$treatment_binary,
  mincov = 10
)

mk_obj

```

### Summery stats

#### Coverage

```{r}
# Iterate over the GRangesList to plot all of the samples
map(mk_obj, getCoverageStats, plot = T, both.strands = F)

# Print one sample's coverage data. You can do both.strands = T to check strand bias.
getCoverageStats(mk_obj[[2]], plot = T, both.strands = F)
```

#### Methylation Histogram

In a histogram of methylation percentages across a given sample, we expect a large number of low and high methylation CpGs (peaks on left and right ends) with less density across the middle.

```{r}
getMethylationStats(mk_obj[[1]],
                    plot = T,
                    both.strands = F
                    )

# Iterate across all
map(mk_obj, getMethylationStats, plot = T, both.strands = F)
```

### Normalization and Coverage Filtering

Normalizing coverage between samples is useful for avoiding bias when coverage across samples is uneven. Filtering anomalously high or low coverage is also helpful.

```{r}
mk_obj_normal <- filterByCoverage(mk_obj,
                                  lo.count = 10,
                                  hi.perc = 99.9) %>%
  normalizeCoverage(.)
```

Recheck the summary stat plots after normalization and filtering. Setting `both.strands = T` can detect strand bias.

```{r}
# Print one sample's coverage data
getCoverageStats(mk_obj_normal[[2]], plot = T, both.strands = F)

# one sample's methylation stats
getMethylationStats(mk_obj_normal[[2]], plot = T, both.strands = F)

```

```{r}
# Iterate over the GRangesList to plot all of the samples
map(mk_obj_normal, getMethylationStats, plot = T, both.strands = F)

# Print one sample's coverage data. You can do both.strands = T to check strand bias.
getCoverageStats(mk_obj_normal[[2]], plot = T, both.strands = F)
```

### Unite

This operation selects CpGs that are covered in all samples.

```{r}
# Destrand = F allows us to examine DMRs
mk_m_meth <-
  methylKit::unite(mk_obj_normal, destrand = F)

```

### Tiling for Differentially Methylated Regions (DMRs)

We choose windows of 1,000 bases for our regional analysis.

#### DMRs per Time Point

Separate data for each time point. Time point 3 used as an example.

```{r}
mk_windows_1000 <- tileMethylCounts(mk_m_meth,win.size=1000,step.size=1000,cov.bases = 3, mc.cores = 10)

## reorganize data to include only timpepoint 3 (do this for all time points)
mk_windows_1000_reorg3 <- reorganize(mk_windows_1000, sample.ids = c("hamster11", "hamster17", "hamster18", "hamster2", "hamster6", "hamster7"), treatment = c(0,0,1,1,0,1), save.db = TRUE)

 
# diff meth statistics for all regions (do this for all time points)
diff_mkwindows3_1000 <- calculateDiffMeth(mk_windows_1000_reorg3,
                             mc.cores = 6)

# filter for significant regions (do this for all time points)
delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000,
                          difference = 10,
                          qvalue = 0.05)
```

#### Get Stats for Methylation

Do this for each time point. Time point 3 used as an example.

```{r}
# Separate out hyper and hypomethylation

# get hyper-methylated
hyper_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hyper")

# get hypo-methylated
hypo_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hypo")

## make it a data frame
delta_mkdfwindows3_1000 <- getData(delta_mkwindows3_1000)
hypo_delta_mkdfwindows3_1000 <- getData(hypo_delta_mkwindows3_1000)
hyper_delta_mkdfwindows3_1000 <- getData(hyper_delta_mkwindows3_1000)

## make the background a dataframe
diff_mkdfwindows3df_1000 <- getData(diff_mkwindows3_1000)


##summary stats, hyper or hypo and how much, genic region enrichment (i.e. exons, promoters)

## summary stats,number and  percent average hyper and hypo, and absolute
summarise(delta_mkdfwindows3_1000, mean(abs(meth.diff)))

##summary of count of all
nrow(delta_mkdfwindows3_1000)

## mean of hypermethylated regions
delta_mkdfwindows3_1000 %>% 
  filter(meth.diff > 0) %>%
  summarize(mean(meth.diff))

## calculate count of hypermethylation
delta_mkdfwindows3_1000 %>%
  summarize(count = sum(meth.diff > 0))

## mean of hypomethylated regions
delta_mkdfwindows3_1000 %>% 
  filter(meth.diff < 0) %>%
  summarize(mean(meth.diff))

## calculate count of hypomethylation
delta_mkdfwindows3_1000 %>%
  summarize(count = sum(meth.diff < 0))

## make a histogram
hist(delta_mkdfwindows3_1000$meth.diff)

## see distinct chr 
delta_mkdfwindows3_1000 %>% distinct(chr)

## see shared chr DMRs between timepoint
delta_mkdfwindows3_1000 %>% inner_join(delta_mkdfwindows1_1000, by = c('chr', 'start')) %>% nrow()
```

### Clustering of Individuals

#### PCA

Setting `transpose = FALSE` (default is `TRUE`) turns this into a "regular" PCA plot, where you're looking at the contribution of samples to the variance.

```{r}
PCASamples(mk_m_meth, transpose = FALSE)
```

PCA Plot using [ggplot2](https://ggplot2.tidyverse.org/)

```{r dev ='png'}
##create the objects to make the pca
pca.obj = PCASamples(mk_m_meth, transpose = FALSE, obj.return = TRUE)
pca.obj2 <- pca.obj$rotation[, 1:2] %>% as.data.frame()

##ggplot
p1<- ggplot(pca.obj2, aes(PC1, PC2)) + 
  geom_point(aes(color=sample_info$treatment)) +
  geom_text_repel(label=rownames(pca.obj2), max.overlaps = 15, aes(color=sample_info$treatment)) +
  ggtitle("Hamster CpG 5mc Methylation PCA plot") +
  theme_bw()
p1
```

Dendrogram

```{r}
clusterSamples(mk_m_meth)
```

Scree plot

```{r}
PCASamples(mk_m_meth, screeplot = TRUE)
```

## Genomation

Use [Genomation](https://bioconductor.org/packages/release/bioc/html/genomation.html) for summary and annotation of genomic intervals.

RefSeqAll track downloaded from UCSC Genome Browser on 27 march 2024 as "all fields". Replace ID column with symbol to retain in annotation object:

```{bash}
awk 'BEGIN {OFS="\t"} NR>1 {$4=$13; NF=12; print}' BCM_Maur_2.0_Mar2021_refseqall.bed > BCM_Maur_2.0_Mar2021_refseqall_symbol.bed
```

### Create Annotation Object

This can be done by treatment, time point, etc., comparing the group to the background.

```{r}
gene_obj <-
  genomation::readTranscriptFeatures("~/Desktop/reference_data/BCM_Maur_2.0_Mar2021_refseqall_bed12.bed", remove.unusual = F)

hyper_delta3W_gr <- makeGRangesFromDataFrame(hyper_delta_mkdfwindows3_1000, keep.extra.columns = T)

# export as a bed file
library(rtracklayer)
export.bed(hyper_delta3W_gr, con = 'hyper_delta3W.bed')

## background
diff3W_gr <- makeGRangesFromDataFrame(diff_mkdfwindows3df_1000, keep.extra.columns = T)

# annotate
annot <- annotateWithGeneParts(delta3W_gr, gene_obj, intersect.chr = T)
getTargetAnnotationStats(annot)
plotTargetAnnotation(annot)

## annotate background
annot_null <- annotateWithGeneParts(diff3W_gr, gene_obj, intersect.chr = T)
getTargetAnnotationStats(annot_null)
plotTargetAnnotation(annot_null)
```

### Chi-Square Test

```{r}
# Generate pie charts for DMRs and background
##par(mfrow = c(1,2))
##plotTargetAnnotation(annot,
                     ##main = "DMRs by Genic Region")
##plotTargetAnnotation(annot_null,
                    ##main = "All CpGs by Genic Region")

# Count DMRs that overlap with target gene features vs. background
dmr3W_stats <- genomation::getTargetAnnotationStats(annot, percentage = F)
null3W_stats <- genomation::getTargetAnnotationStats(annot_null,
                                                   percentage = F) 
dmr_null_matrix3W <- rbind(dmr3W_stats, null3W_stats)

# 2-way table chisq function
test_annot_matrix <- function(x,y){
  matrix <- x %>% 
  data.frame %>%
  rownames_to_column(var = "source") %>%
  mutate(other_features = rowSums(across(!c(source, all_of(y))))) %>% 
  dplyr::select(source, all_of(y), other_features) %>%
  column_to_rownames(var = "source") %>%
  as.matrix()
  
  test <- chisq.test(matrix)
  
  return(list(matrix = matrix, test = test))
}

# Calculate enrichment in genic regions vs. background
map(list("promoter", "exon", "intron"), 
         ~ test_annot_matrix(dmr_null_matrix3W, .x))
```

### Get Top DMR-Associated Genes

```{r}
annot_bed <- read_tsv("~/Desktop/reference_data/BCM_Maur_2.0_Mar2021_refseqall.bed", show_col_types = F) %>%
  dplyr::select(name, name2)

topgenes <- annot@dist.to.TSS %>%
  left_join(annot_bed, by = c("feature.name" = "name")) %>%
  group_by(name2) %>%
  tally() %>%
  arrange(desc(n))
topgenes

## wrtie a tsv of top DMR associated genes
write_tsv(topgenes, file = "topDMRassociatedgenes_tp1.tsv")
```

## Lift Over Genes to Enhance Reference

Use NCBI genome, lift over mm39 (*Mus musculus*) genes to hamster due to lack of dedicated gene ontology resources for the hamster.

Installed [LiftOff](https://github.com/agshumate/Liftoff) version 1.6.3 via conda: `conda install -c bioconda liftoff`. Downloaded GFFs from NCBI.

``` {#bash}
liftoff -g ../GCF_000001635.27_GRCm39_genomic.gff -o mm39_lift_to_Maur2.gff ../GCF_017639785.1_BCM_Maur_2.0_genomic.fna ../GCF_000001635.27_GRCm39_genomic.fna
```

### Get Genome Data

```{r, warning=FALSE}
# pull scaffold info and gene IDs for most recent hamster assembly
lt = getGenomeDataFromNCBI("GCF_017639785.1", return_granges = FALSE)

# generate named vector to translate RefSeq scaffold IDs to Genbank
seqlevels_ncbi2genbank = structure(lt$genome$genbankAccession,
                       names = lt$genome$refseqAccession)

# isolate data frames from list and edit to gene ID column name
df_genome = lt$genome
df_genes = lt$gene %>% dplyr::rename(gene_id = `EntreZ ID`)

# make gene data into GRanges object
genes = makeGRangesFromDataFrame(df_genes,
                                 seqnames.field = "RefSeq Contig Accession",
                                 start.field = "Start",
                                 end.field = "End",
                                 keep.extra.columns = TRUE)

# rename GRanges object seqlevels (chromosome/scaffold IDs)
genes <- renameSeqlevels(genes, seqlevels_ncbi2genbank)

# assign seqlevel and seqlength metadata using genome data
seqlevels(genes) = df_genome$genbankAccession
seqlengths(genes) = df_genome$length

# get rds files
delta3W_gr <- readRDS("delta3W_gr.rds")

# translate NCBI seqnames to GenBank in for all timepoints
deltaW_gr_list <- list(delta1W_gr, delta2W_gr, delta3W_gr)

names(deltaW_gr_list) <- c('delta1W_gr', 'delta2W_gr', 'delta3W_gr')

deltaW_gr_list <- map(deltaW_gr_list,
                      ~ renameSeqlevels(.x, seqlevels_ncbi2genbank))


# genome = lt$genome
# genome = genome[!is.na(genome[, 2]), ]
df_genome$start = 1
df_genome$strand = "+"
genome = GRanges(
  seqnames = df_genome[, 2],
  ranges = IRanges(df_genome[, 6], df_genome[, 5]),
  strand = df_genome[, 4],
  gene_id = df_genome[, 5] 
)
genome <- renameSeqlevels(genome, seqlevels_ncbi2genbank)
seqlevels(genome) = df_genome$genbankAccession
seqlengths(genome) = df_genome$length
```

### Format Lifted Genes

```{r}
# use liftoff gff and create GRanges
lift_genes <- gffToGRanges("liftoff_out/mm39_lift_to_Maur2.gff", filter = "gene")

# change to genbank names
lift_genes <- renameSeqlevels(lift_genes,
                              seqlevels_ncbi2genbank)
# make dataframe of lift_genes
lift_genes_df <- lift_genes %>%
  data.frame() %>%
  # relocate(gene_id, .after = "strand") %>%
  dplyr::select(!(model_evidence:last_col())) %>% 
  filter(gene_biotype == "protein_coding" & type == "gene") %>% 
  mutate(gene_id = map_chr(Dbxref,
                           ~ pluck(.x, 1) %>% 
           str_remove(., "GeneID:"))) %>% 
  relocate(gene_id, .after = "strand")

#make granges from dataframe
lift_genes_filter <- makeGRangesFromDataFrame(lift_genes_df,
                                              keep.extra.columns = TRUE)

seqlevels(lift_genes_filter) = df_genome$genbankAccession
seqlengths(lift_genes_filter) = df_genome$length

#EU660218.1 5310-6854 

genome(lift_genes_filter) <- "GCF_017639785.1"
genome(genome) <- "GCF_017639785.1"

lift_genes_extend <- extendTSS(lift_genes_filter,
                               extend_from = "gene",
                               seqlengths = seqlengths(lift_genes_filter),
                               mode = "basalPlusExt",
                        basal_upstream = 5000,      
                        basal_downstream = 5000,
                               extension = 0,
                               gene_id_type = "ENTREZ")
genome(lift_genes_extend) <- "GCF_017639785.1"
```

## Find Closest Genes to DMRs

Use [bedtools closests](https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the intersecting/nearest gene to differentially methylated regions (in bed file format) from bed file of liftoff annotation to get a list for enrichment analysis (DAVID). Do this for all time points data. Time point 3 is used as an example here.

```{r}
# All data for one time point
delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05)

# Separate out hyper and hypo
# get hyper-methylated
hyper_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hyper")

# get hypo-methylated
hypo_delta_mkwindows3_1000 <- getMethylDiff(diff_mkwindows3_1000, difference=10,qvalue=0.05,type="hypo")

## make it a data frame
delta_mkdfwindows3_1000 <- getData(delta_mkwindows3_1000)
hypo_delta_mkdfwindows3_1000 <- getData(hypo_delta_mkwindows3_1000)
hyper_delta_mkdfwindows3_1000 <- getData(hyper_delta_mkwindows3_1000)

## Make it a granges
hypo_delta3W_gr <- makeGRangesFromDataFrame(hypo_delta_mkdfwindows3_1000, keep.extra.columns = T)
hyper_delta3W_gr <- makeGRangesFromDataFrame(hyper_delta_mkdfwindows3_1000, keep.extra.columns = T)

#export as a bed file
library(rtracklayer)
export.bed(hyper_delta3W_gr, con = 'hyper_delta1W.bed')
export.bed(hypo_delta3W_gr, con = 'hypo_delta1W.bed')
export.bed(delta3W_gr, con = 'delta1W.bed')
```

```{bash}
# In command line
# make reference bed only contain genes
# awk -F'\t' '{if($8=="gene")print $0}' mm39_lift_to_Maur2.bed > genes_mm39_lift_to_Maur2.bed

# Find closest gene to methylkit DMR output and only print column with gene name information
bedtools closest -a hypo_delta3W.bed -b genes_mm39_lift_to_Maur2.bed | awk '{"\t"; print $16}' > genes_hypo_delta3W.tsv

# Find closest gene to methylkit DMR output and print everything
bedtools closest -a hypo_delta3W.bed -b genes_mm39_lift_to_Maur2.bed > ALLgenes_hypo_delta3W.tsv

# sed to print only lines with no .
sed '/^[[:space:]]*\.[[:space:]]*$/d' genes_delta3W.tsv > cleaned_genes_delta3W.tsv

# awk to only print $2 column separated by ;
awk -F';' '{print $2}' cleaned_genes_delta3W.tsv > 2cleaned_genes_delta3W.tsv

# sed to print only gene name
sed -n 's/.*GeneID:\([0-9]*\),.*/\1/p' 2cleaned_genes_delta3W.tsv > final_genes_delta3w.tsv
```

Take this list of genes into [DAVID](https://davidbioinformatics.nih.gov/).

### GO Term Visualization

Visualize [DAVID](https://davidbioinformatics.nih.gov/) output with ggplot2.

```{r}
library(ggplot2)

# load in data
meth_alltp_DAVID_BP <- read.csv("~/Desktop/DAVID/meth_alltp_0.05_DAVID_BP.csv")

subset_meth_alltp_DAVID_BP <- subset(meth_alltp_DAVID_BP, `Benjamini` < 0.002)

## plot
ggplot(data = subset_meth_alltp_DAVID_BP, aes(x = Group, y = Term, color = `Benjamini`, size = Count)) + geom_point() + scale_color_gradient(low = "red", high = "blue", name = "Benjamini-Hochberg\n Adjusted P-Value") + theme_bw() + ylab("Gene Ontology Term") + xlab("") + ggtitle("Gene Ontology Enrichment by Time Point") + theme(axis.text.y = element_text(size=12, colour = "black")) + theme(axis.title.y = element_text(size=18, face="bold", colour = "black")) + theme(plot.title = element_text(size = 20, face = "bold",hjust = 0.5) ) + theme(axis.text.x = element_text(face="bold",size=12, colour = "black"))

## save plot
ggsave("meth_GOenrich_pvalue_0.002.jpeg", plot = last_plot(), path = "/home/Desktop/DAVID", scale = 1, width=17, height=10, units='in', dpi = 300)

```

# RNAseq Analysis

RNAseq data was generated with Dual-indexed libraries using Takara/Clontech Stranded Total RNA-seq kit v2 - Pico Input Mammalian reagents.

## Fastqc

Use [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.12.1 to check quality of raw reads.

```{bash}
gunzip 1_Shield_S2_L003_R1_001.fastq.gz
fastqc 1_Shield_S2_L003_R1_001.fastq --out fastqc_out
```

## Trimming Sequencing Adapters

Use BBduk from [BBmap](https://sourceforge.net/projects/bbmap/) v35.85 to trim the sequencing adapters.

```{bash}
 # bbduk 
 bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq
 
 # bbduk for one set of paired ends
 ~/Desktop/bbmap/bbduk.sh -Xmx1g ref= ~/Desktop/adapters.fa in1=10_Shield_S9_L003_R1_001.fastq in2=10_Shield_S9_L003_R2_001.fastq out1=10_Shield_S9_L003_R1_001_trim.fastq out2=10_Shield_S9_L003_R2_001_trim.fastq 
 
 # bbduk loop for multiple sets of paired end samples
 for R1 in *R1*
do
  R2=${R1//R1_001.fastq.gz/R2_001.fastq.gz}
  R1trim=${R1//.fastq.gz/_trim.fastq.gz}
  R2trim=${R2//.fastq.gz/_trim.fastq.gz}
~/Desktop/bbmap/bbduk.sh -Xmx1g ref= ~/Desktop/bbmap/resources/adapters.fa in1=$R1 in2=$R2 out1=$R1trim out2=$R2trim
done
```

## Align to Reference Genome with STAR

Use [STAR](https://github.com/alexdobin/STAR) aligner to align sequence data to a reference genome.

### Generate Genome Index

```{bash}

# for GFF instead of GTF file use --sjdbGTFtagExonParentTranscript instead of --sjdbGTFfile
STAR --runThreadN 30 --runMode genomeGenerate --genomeDir ~/Desktop/rnaseq_hamster/genomeDir --genomeFastaFiles ~/Desktop/rnaseq_hamster/reference_data/GCF_017639785.1_BCM_Maur_2.0_genomic.fa --sjdbGTFfile ~/Desktop/rnaseq_hamster/liftoff_out/mm39_lift_to_Maur2.gff 

```

### Align

Align one sample's paired end files.

```{bash}
mkdir alignments

STAR --runThreadN 30 \
      --genomeDir ~/Desktop/rnaseq_hamster/genomeDir \
      --readFilesIn 1_Shield_S2_L003_R1_001_trim.fastq.gz 1_Shield_S2_L003_R2_001_trim.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts \
      --outFileNamePrefix alignments/1_Shield_S2_L003

```

Loop to align files R1 and R2 for all samples at once.

```{bash}
# define variables
index=~/Desktop/rnaseq_hamster/genomeDir/

for i in *_001_trim.fastq.gz
do 
  PREFIX=$(echo ${i} | sed 's/\_R._001_trim.fastq.gz//')
  echo $PREFIX

  # define R1 fastq filename
  R1=~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/${PREFIX}_R1_001_trim.fastq.gz
  echo $R1

 # define R2 fastq filename
  R2=~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/${PREFIX}_R2_001_trim.fastq.gz
  echo $R2

 # align with STAR
  STAR --runThreadN 30 \
    --genomeDir $index \
    --readFilesIn ${R1} ${R2} \
    --outSAMtype BAM   SortedByCoordinate \
    --quantMode GeneCounts \
    --readFilesCommand zcat 
    --outFileNamePrefix alignments/${PREFIX}_
done

echo "done!"
```

## Use featureCounts to Count Transcripts

Use [featureCounts](https://subread.sourceforge.net/featureCounts.html) v2.0.3 to count mapped reads.

```{bash}
featureCounts -M -p -t exon -F GFF -g "ID" -a ~/Desktop/rnaseq_hamster/liftoff_out/mm39_lift_to_Maur2.gff -o ~/Desktop/rnaseq_hamster/featurecounts_out/1_M_counts.txt ~/Desktop/rnaseq_hamster/rnaseq_fastq/trim_bbduk/alignments/test/1_Shield_S2_L003Aligned.sortedByCoord.out.bam
```

## DESeq2

Use [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) to determine differential methylation.

### Import Data

It is VERY important that the rows of coldata are in the same order as the columns of the featurecounts matrix

```{r}
# read in metadata file 
coldata <- read.csv("sample_data.csv", row.names=1)

# manipulate colData
coldata <- coldata[,c("treatment","timepoint","group")]
coldata$treatment <- factor(coldata$treatment)
coldata$timepoint <- factor(coldata$timepoint)
coldata$group <- factor(coldata$group)

# read in feature counts file
featurecounts <- read.delim("featurecounts_out/featurecounts/all_counts.txt", header = TRUE, skip = 1)

# change the column names to match coldata
colnames(featurecounts)
colnames(featurecounts) <- c("Geneid", "Chr", "Start", "End", "Strand", "Length", "10", "11", "12", "13","14","15","16","17","18","1","20","3","4","5","6","7","8","9")

# make row names the Geneid column
#rownames(featurecounts) <- featurecounts[ , 1]
#featurecounts = as.matrix(featurecounts[ , -1])

# the first column right of Length is where the counts start, until the end of the matrix
column.from = which(colnames(featurecounts) == "Length") + 1
column.end  = ncol(featurecounts)
```

### Create DESeq Dataset

```{r}
# import into DESeq2 matrix
dds <- 
DESeqDataSetFromMatrix(countData = featurecounts[, column.from:column.end],
                       colData = coldata,
                       design = ~ treatment)
# add gene names as row names to dds
rownames(dds) <- featurecounts$Geneid

dds

# look at assay
assay(dds)

# look at assay names
assayNames(dds)

# examine count sums of all individual samples gene expression
colSums(assay(dds))
```

## Data Exploration

Explore the data following the DESeq2 vignettes.

### Pre-filtering for Data Exploration

Perform pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples. For data exploration only, NOT for statistical analysis.

```{r}
nrow(dds)
# pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds_ex <- dds[keep,]
nrow(dds_ex)
```

### Transform Data to VST

Transform data to stabilize the variance across the mean using the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend.

```{r}
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
```

### Transform Data to rlog

Transform data to stabilize the variance across the mean using the regularized-logarithm transformation or rlog

```{r}
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
```

### Compare Transformations

```{r}
dds <- estimateSizeFactors(dds)

df <- bind_rows(
  as.data.frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
         mutate(transformation = "log2(x + 1)"),
  as.data.frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  as.data.frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
  
colnames(df)[1:2] <- c("x", "y")  

lvls <- c("log2(x + 1)", "vst", "rlog")
df$transformation <- factor(df$transformation, levels=lvls)

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid( . ~ transformation)  
```

### Compare Sample Distances

To ensure we have a roughly equal contribution from all genes, we use it on the VST data. calculate the Euclidean distance between samples.

```{r}
sampleDists <- dist(t(assay(vsd)))
sampleDists

# make the heatmap of sample distances using euclidean distances
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vsd$treatment, vsd$timepoint, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)
```

### Calculate Sample Distance using Poisson

This measure of dissimilarity between counts takes the inherent variance structure of counts into consideration when calculating the distances between samples.

```{r}
library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds_ex)))

# Heatmap of sample-to-sample distances using the Poisson Distance.
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste( dds_ex$treatment, dds_ex$timepoint, sep=" - " )
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)
```

### PCA

Principle Component Analysis of the VSD transformed data.

```{r}
plotPCA(vsd, intgroup=c("treatment", "timepoint"))

# PCA but more detailed (using ggplot)
pcaData <- plotPCA(vsd, intgroup=c("treatment", "timepoint"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=treatment, shape=timepoint)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
```

## Differential Expression Analysis on Raw Data

Treatments versus controls only here, no time series included. DESeq2 uses the Benjamini-Hochberg (BH) adjustment.

```{r}
dds <- DESeq(dds)
resultsNames(dds)
res <- results(dds)
res

## order results by smallest p-value
resOrdered <- res[order(res$pvalue),]
resOrdered

## Summarize basic tallies
summary(res)

## Summarize how many adjusted p-values were less than 0.1?
sum(res$padj < 0.1, na.rm=TRUE)

# results function, lower the false discovery rate threshold
res10 <- results(dds, alpha=0.10)
res05 <- results(dds, alpha=0.05)
summary(res10)

# Raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment. For example, by specifying lfcThreshold = 1, we test for genes that show significant effects of treatment on gene counts more than doubling or less than halving, because 2^1=2.
resLFC1 <- results(dds, lfcThreshold=.5)
table(resLFC1$padj < 0.1)

# Information about which variables and tests were used can be found by calling the function mcols on the results object.
mcols(res)$description
mcols(res, use.names = TRUE)

# The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment in comparison to untreated samples. This value is reported on a logarithmic scale to base 2.

```

### Volcano Plot

```{r}
plotMA(res10, ylim=c(-6,6))
```

### Histogram

```{r}
hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")
```

### Heatmap of Counts

```{r}
library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("treatment","timepoint")])
pheatmap(assay(dds)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=TRUE, annotation_col=df)
```

Another Histogram

The ratio of small p values for genes binned by mean normalized count. The p values are from a test of log2 fold change greater than 1 or less than -1.

```{r}
qs <- c(0, quantile(resLFC1$baseMean[resLFC1$baseMean > 0], 0:6/6))
bins <- cut(resLFC1$baseMean, qs)
levels(bins) <- paste0("~", round(signif((qs[-1] + qs[-length(qs)])/2, 2)))
fractionSig <- tapply(resLFC1$pvalue, bins, function(p)
                          mean(p < .05, na.rm = TRUE))
barplot(fractionSig, xlab = "mean normalized count",
                     ylab = "fraction of small p values")
```

## Time Course DESeq2 Analysis

### Separate Out Time Points

Separate out data from each time point using the count data. For example, here is time point 3.

```{r}
# Time point 3
dds3 <- DESeqDataSetFromMatrix(countData = featurecounts[, c("11","17","18","20","6","7")], colData = coldata[c("11","17","18","20","6","7"),], ~ treatment)
# add gene names as row names to dds
rownames(dds3) <- featurecounts$Geneid

dds3

# Results
dds3 <- DESeq(dds3)
resultsNames(dds3)
res3 <- results(dds3)
summary(res3)
res3

# results function
res3_05 <- results(dds3, alpha=0.05)
summary(res3_05)
res3_05

#order by smallest pvalue
res3 <- res3[order(res3$padj),]
res3
```

### Export Results to csv

```{r}
write.csv(res3, file = "~/Desktop/rnaseq_hamster/results/tp1_gene_results.csv")
```

### Volcano plots

```{r}
plotMA(res3, ylim=c(-6,6))
```

### GO Term Visualization

```{r}
library(ggplot2)

# load in data
alltp_DAVID_BP <- read.csv("~/Desktop/rnaseq_hamster/DAVID/alltp_0.05_DAVID_BP.csv")

## plot
ggplot(data = alltp_DAVID_BP, aes(x = Group, y = Term, color = `Benjamini`, size = Count)) + geom_point() + scale_color_gradient(low = "red", high = "blue", name = "Benjamini-Hochberg\n Adjusted P-Value") + scale_size(name = "Gene Count") + theme_bw() + ylab("GO Biological Processes Term") + xlab("") + ggtitle("Gene Ontology Enrichment by Time Point") + theme(axis.text.y = element_text(size=14, colour = "black")) + theme(plot.title = element_text(size = 20, face = "bold",hjust = 0.5) ) + theme(axis.title.y = element_text(size=14, face="bold", colour = "black")) + theme(axis.text.x = element_text(face="bold",size=14, colour = "black")) 

## save plot
ggsave("rnaseq_GOenrich_all.jpeg", plot = last_plot(), path = "/home/larsenlab/Desktop/rnaseq_hamster/DAVID", scale = 1, width=15, height=10, units='in', dpi = 300)

```
