# Dataset associated with publication "Higher evolutionary dynamics of gene copy number for Drosophila glue genes located near short repeat sequences"
# Manon Monier, Isabelle Nuez, Flora Borne, Virginie Courtier-Orgogozo


---

## Description of the data and file structure

Table S1. List of species and genome assemblies used in this study. All genome assemblies are PacBio-based or Nanopore-based, except the D. eugracilis and D. takahashii genome assemblies, which relied on Illumina GAIIx data only. P: genome assembly based on PacBio and Illumina reads, N: genome assembly based on Nanopore and Illumina reads, I: genome assembly based on Illumina reads only.


Table S2: (Table_S2.csv): Genomic coordinates of all the Sgs genes studied here in 24 Drosophila species. 

Table S3: (.csv files compressed in zip file): Correspondence between NCBI gene names and the gene names used in this study, together with a description of the changes made to the gene annotations. ‘no change’ indicates that the annotation obtained from NCBI was not modified. ‘based on Borne et al, 2021 annotation’ means that the annotation was taken from the Borne et al. 2021 study. ‘annotations transferred from’ means that the gene was annotated manually, based on the existing annotation of the corresponding gene in a closely related species. There are four .csv files: (1) Sgs1 and neighboring genes, (2) Sgs3x and neighboring genes, (3) Sgs3/7/8 and neighboring genes, (4) ng genes annotated in the 3C11-12, 87A1 and 88C3-4 loci. In the third .csv file, the ‘Newly annotated ng genes’ column indicates whether an ng gene was newly annotated in this study (‘Y’), was already annotated (‘N’), or is not an ng gene (‘Not applicable’).

Table S4: (Table_S4.csv): Sgs exon and intron sizes for the studied species. For each species, the sizes of the first coding exon (CDS1), the intron and the second coding exon (CDS2) are given in base pairs (bp). The amino acid encoded at the position of the unique phase 1 intron is also indicated.
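
The tables above are plain .csv files, so their columns can be listed before any analysis. The sketch below is illustrative only and is not one of the published scripts. It assumes comma-delimited files with a single header row, and the function name `read_table` is hypothetical. It makes no assumption about column names, since these are not listed in this description.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_table(handle: TextIO, delimiter: str = ",") -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (column names, rows) from an open CSV handle.

    Assumption: the first line of the file is a header row.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = list(reader)
    # fieldnames is None for an empty file, so fall back to an empty list
    columns = list(reader.fieldnames or [])
    return columns, rows


if __name__ == "__main__":
    # In-memory stand-in for a real table; the column names are made up.
    demo = io.StringIO("col_a,col_b\nx,1\ny,2\n")
    columns, rows = read_table(demo)
    print(columns)
    print(len(rows), "rows")
```

For a downloaded file such as Table_S2.csv or Table_S4.csv, open it with `open(path, newline="")` and pass the handle to `read_table`. If the file turns out to use another separator, pass it through the `delimiter` argument.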

Supplementary Files

File S1. Compressed zip file of the gene annotations (GenBank .gb files, inputs for Easyfig) of large genomic regions containing all the Sgs genes and their neighboring genes in the 24 studied species.

File S2. Fasta file of all the Sgs amino acid sequences used to create Figure 1B and Figure S1.

File S3. Compressed zip file of reference and corrected nucleotide sequences used to create Figure S2.

File S4. Compressed zip file of Sgs protein alignments (.fasta files) used to compute phylogenetic trees and make Weblogo figures.

File S5. Sgs coding sequence length in bp for species having an Sgs3x copy (.csv file, input for R script sgs_size.R).

File S6. Sgs coding sequence length in bp for species not having an Sgs3x copy (.csv file, input for R script sgs_size.R).

File S7. Compressed zip file of comparisons between pairs of large genomic regions (.out files obtained as outputs from Easyfig).

File S8. Table of pairwise percent identity between several Sgs1 and Sgs3 amino-acid sequences (.csv).

File S9. Compressed zip file of the repeat annotations (.csv files) obtained with FindRepeat in Geneious on large genomic regions for D. melanogaster Sgs1, Sgs3/7/8, Sgs3x, D. teissieri Sgs3/7/8, D. subobscura Sgs3, D. eugracilis Sgs3.

File S10. Compressed zip file of new glue protein alignments (.fasta files) used to make Fig. S9.

File S11. Fasta file of all the Sgs nucleotide sequences studied here.

File S12. Fasta file of the 154 ng nucleotide sequences found at loci 68C11 and 68C13.

File S13. Fasta file of the 41 ng nucleotide sequences found at loci 3C11-12, 28E6-28E7, 87A1 and 88C3-4.

File S14. Compressed zip file of all the R scripts (.R files) used to create the figures.

File S15. Bam file of raw reads mapped to D. rhopaloa Sgs1 corrected nucleotide sequence, used to create Figure S2A.

File S16. Bam file of raw reads mapped to D. ficusphila Sgs1 reference nucleotide sequence, used to create Figure S2B.

File S17. Bam file of raw reads mapped to D. biarmipes Sgs3x corrected nucleotide sequence, used to create Figure S2C.
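
The Fasta files listed above (Files S2, S11, S12 and S13) can be read with the Python standard library alone. The sketch below is a minimal illustration and not part of the published scripts. The function name `parse_fasta` is hypothetical, and the demo sequences are invented. One possible use is a sanity check on a download: counting the records in File S12 and File S13 should give the 154 and 41 sequences stated above.

```python
from typing import Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted lines.

    Multi-line sequences are joined; blank lines are ignored.
    """
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Invented example records, not taken from the dataset
    demo = [">seq1 example", "ATGGCT", "AAGTGA", ">seq2", "ATGTAA"]
    for name, seq in parse_fasta(demo):
        print(name, len(seq))
```

With a real file, pass the open file handle directly, for example `with open(path) as handle: records = parse_fasta(handle)`, then compare `len(records)` with the expected number of sequences.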


## Sharing/Access information

These supplementary data are available on Dryad. Please email the authors if you need further information.


## Code/Software

See Methods section in the article entitled "Higher evolutionary dynamics of gene copy number for Drosophila glue genes located near short repeat sequences".
