An ancient and still ongoing genetic conflict between transposable elements and their repressors

Rachele Cagliani 1✉ Emailrachele.cagliani@lanostrafamiglia.it

Diego Forni 1✉ Emaildiego.forni@lanostrafamiglia.it

Alessandra Mozzi 1

Roudin Sarama 2

Uberto Pozzoli 1

Matteo Fumagalli 2

Manuela Sironi 3

IRCCS E. Institute 4

MEDEA 4

1 Scientific Institute IRCCS E. MEDEA, Computational Biology Unit 23842 Bosisio Parini Italy

School of Biological and Behavioural Sciences Queen Mary University of London UK

3 School of Medicine and Surgery University of Milano-Bicocca 20900 Monza Italy

4 Computational Biology Unit Bosisio Parini Italy

Rachele Cagliani1*, Diego Forni1*, Alessandra Mozzi1, Roudin Sarama2, Uberto Pozzoli1, Matteo Fumagalli2, Manuela Sironi3

1Scientific Institute IRCCS E. MEDEA, Computational Biology Unit, Bosisio Parini 23842, Italy;

2School of Biological and Behavioural Sciences, Queen Mary University of London, UK;

3School of Medicine and Surgery, University of Milano-Bicocca, Monza 20900, Italy.

Corresponding authors:

Rachele Cagliani, rachele.cagliani@lanostrafamiglia.it. Scientific Institute IRCCS E. MEDEA, Computational Biology Unit, Bosisio Parini, Italy.

Diego Forni, diego.forni@lanostrafamiglia.it. Scientific Institute IRCCS E. MEDEA, Computational Biology Unit, Bosisio Parini, Italy.

Rachele Cagliani and Diego Forni contributed equally to this work.

Abstract

Background. Transposable element (TE) mobilization poses a significant fitness challenge to host genomes. Consequently, a variety of systems have emerged to silence TE activity. Just like TEs, such systems are widespread across the tree of life and their evolution is expected to be shaped by intra-genomic conflicts. To test this hypothesis, we performed an evolutionary analysis of TE control systems across different timescales.

Results. We show that a substantial fraction of TE control genes were targets of positive selection during primate evolution, as well as during the more recent history of human populations, with abundant signatures in proteins of the piRNA pathway. In these proteins, selection was strongest in intrinsically disordered regions (IDRs), particularly those with low conformational entropy, and contributed to modulate ensemble features and sequence patterning. In primates, positive selection in a larger number of genes that silence TEs also resulted in reduced accumulation of new mobile elements, as assessed by genomic analysis or recent TE occurrences. Across longer evolutionary time frames, we uncover wide variability in the genomic content of PIWI-containing proteins and an unprecedented diversity of domain architectures for such proteins in eukaryotes. Finally, we identify a bacterial Argonaute as the closest prokaryotic relative of human Argonaute proteins, and we show unusual conservation of IDR sequence and ensemble features across huge evolutionary distances.

Conclusions. Our data provide insight into the evolution and diversity of TE control systems in eukaryotes and, through analyses over deep evolutionary distances, contribute information to the rapidly growing evo-immuno discipline.

Keywords:

Transposable element control genes

Positive selection

PIWI domain

intrinsically disordered regions

Argonaute proteins

Background

Transposable elements (TEs) are mobile genetic units that are able to move and self-amplify within host cells. TEs have successfully populated eukaryotic and prokaryotic genomes, and the typical mammal has ~ 45% of its genome derived from TEs [1–4]. TEs are highly diverse, but can be classified into two major groups depending on the mobilization mechanism [5, 6]. Class I elements, also known as retrotransposons, propagate through an RNA intermediate. This allows dissemination in a “copy-and-paste” style, with the original TE copy remaining intact and the newly generated copies integrating at other locations in the genome [7]. Class I elements can be further divided into short interspersed elements (SINEs), long interspersed elements (LINEs), and long terminal repeat (LTR) retrotransposons [5]. Class II TEs, also known as DNA transposons, use a DNA intermediate to propagate. DNA transposons mobilize through a “cut-and-paste” mechanism or by rolling circle replication [8, 9]. In mammalian genomes, the large majority of TEs belong to class I. In other eukaryotes, the overall frequency of TEs and their distribution in the two classes is highly diverse [10]. In prokaryotes, most mobile units are DNA elements, which include transposons and insertion sequences, but also plasmids and phages [11].

Although increasing evidence indicates that they significantly contribute to genome evolution, TEs behave as selfish elements and their replication at the genomic level can be detrimental [12–18]. TEs can be mutagenic, either directly (e.g., by insertional disruption of coding or regulatory sequences) or indirectly (e.g., by mediating chromosomal rearrangements), and TE insertions can affect the regulatory environment of nearby genes [15]. For these reasons, cellular organisms have evolved several mechanisms to suppress TE activity and to limit their propagation. In mammals, a major TE control pathway hinges on P-element induced Wimpy testis (PIWI)-interacting RNAs (piRNAs). piRNAs form effector complexes with Argonaute proteins (Agos) and Piwi proteins (Piwis), the latter a germline-specific class of Agos. These complexes guide recognition and silencing of TEs at the transcriptional or post-transcriptional level [19, 20]. For transcriptional silencing, complexes of piRNAs and Piwi/Ago use specific interactors (e.g., SPOCD1 and TEX15) to promote the recruitment of DNA methyltransferases (DNMTs) and chromatin remodeling complexes to generate a repressive chromatin environment [15]. In the case of post-transcriptional control, piRNAs complexed with Piwi/Ago, together with co-factors (e.g., GTSF1), promote the degradation of TE transcripts in the cytoplasm [15, 21]. In mammals, the piRNA/Piwi pathway is particularly important in the germline, where TEs tend to mobilize. As a consequence, mutations in several components of the pathway result in the sterility of one or both sexes [15].

Another mechanism of TE control is based on Krüppel-associated box (KRAB) domain-containing zinc finger proteins (KZFPs), which are encoded by an expanded gene family in vertebrates [22]. Through the KAP1/TRIM28 interactor, KZFPs recruit heterochromatin silencing factors, including DNMTs, heterochromatin protein 1 (CBX1/CBX3), SETDB1, and the NuRD complex (nucleosome remodeling and deacetylation complex). Moreover, additional systems exploit different TE features for recognition and silencing. For instance, the HUSH (human silencing hub) complex (which comprises TASOR, MPHOSPH8, and PPHLN1), together with its effectors (MORC2 and SETDB1), identifies introneless DNA fragments with high adenine content in the sense strand and epigenetically represses them [23]. Other mechanisms of TE suppression are instead based on RNA modifications, including N6-methyladenosine, 3′ uridylation, and 5-hydroxymethylcytosine [15].

Whereas some of these TE control mechanisms, such as KZFPs, represent evolutionary innovations only found in vertebrates, others are evolutionary conserved across the tree of life [15]. For instance, based on the distribution and diversity of Piwi/Ago modules in prokaryotes and eukaryotes, it is generally considered that a minimal RNA interference pathway was already present in the last eukaryotic common ancestor (LECA) and probably originated from prokaryotic modules that safeguard host cells against foreign nucleic acids (viruses, plasmids, TEs) [15, 24, 25]. On one hand, this testifies the central role of these systems in the maintenance of genome integrity and organism fitness. On the other, the antagonistic effect exerted by these cellular pathways against TE propagation is expected to result in intra-genomic conflicts [26, 27]. Indeed, it was previously shown that KZFPs diverged and amplified in vertebrate lineages in response to the emergence of novel retroelements [22, 28] and de novo DNA methyltransferases show dynamic evolution in primates and rodents [29]. Likewise, signals of adaptive evolution were also reported in the HUSH complex effector MORC2 in primates [30]. Despite these insights, a comprehensive analysis of the evolution of TE control genes across different evolutionary timescales is missing. In this study, we integrated molecular evolution analysis, population genetics, structural homology and domain organization searches to delve into the evolutionary history of TE control systems, with a focus on Piwi/Ago proteins.

Results

Positive selection drives the evolution of several TE control genes

We assembled a list of sixty coding genes involved in TE control (Fig. 1A, Additional file 1: Table S1). KZFPs were not included because i) they were previously analyzed and ii) their fast evolution in terms of gene copies makes it difficult to reconstruct orthology relationships [22, 28, 31]. We first sought to determine whether the protein products on these genes are engaged in a genetic conflict with TEs. If this were the case, evidence of positive selection might be expected as a signature of fast evolution in response to TE challenge. We thus focused on primates and retrieved sequence information of the coding sequences from public databases, with the aim to analyze evolutionary patterns (Additional file 1: Table S1). To test the hypothesis that positive selection has been driving the evolution of TE control genes, we applied likelihood ratio tests (LRTs) implemented in the PAML suite [32, 33]. All genes were screened for recombination and split into different regions if necessary. The neutral models (M7 and M8a) were rejected in favor of the positive selection model (M8) for 14 genes, corresponding to a considerably high fraction of 23.3% (Fig. 1A and 1B, Additional file 1: Table S2). The positively selected genes included three Piwi proteins and other molecules that participate to the piRNA pathway (SPOCD1 and TEX15), as well as components of the HUSH and NuRD complexes. In line with previous findings, we detected evidence of positive selection in MORC2 [30] and in the N-terminal region of DNMT3A [29] (Fig. 1B).

Fig. 1

Evolution of TE control systems in primates. (A) Protein-protein interaction network predicted for the 60 analyzed proteins. The network was generated using STRING (https://string-db.org/), which covers both physical interactions and functional associations between proteins. Proteins under positive selection are colored in red. NuRD and HUSH complex components are shown in green and blue, respectively. (B) Domain structures of the 14 proteins under positive selection in the primate phylogeny. Schematic domain structures of human proteins are drawn to scale. The protein domains were obtained from the InterPro website (https://www.ebi.ac.uk/interpro/) [129], with the exception of the Piwi proteins. The domains of PIWIL2 were obtained from Li et al. [130], while the domains of PIWIL3 and PIWIL4 were defined by sequence homology from PIWIL2. IDRs identified by the Metapredict tool based on human proteins are shown in blue. Positively selected sites are shown above each domain structure as arrowheads. (C) Structural alignment of the human PIWIL3 (dark grey) Alphafold model (Alphafold DB: AF-Q7Z3Z3-F1-v4) with the human PIWIL4 (light grey) Alphafold model (Alphafold DB:AF-Q7Z3Z4-F1-v4). Positively selected sites are reported as ball and sticks, and color-coded according to the domain they map to (see panel B). Positively selected sites in PIWIL3 (red) and PIWIL4 (magenta) falling in the same structural region are shown the enlargement.

We were also interested in identifying the positively selected sites in these genes. To this aim, we used four methods: the BEB (Bayes empirical Bayes) analysis from M8, MEME (Mixed Effects Model of Evolution), FUBAR (Fast, Unconstrained Bayesian AppRoximation), and FEL (Fixed Effects Likelihood). To be conservative, a site was called as selected if it was detected by at least two methods. A total of 144 selected sites were identified (Additional file 1: Table S2). The average fraction of selected sites per protein resulted equal to 0.84%, with the highest proportion being observed for SPOCD1 (3.29%). In general, components of the piRNA pathway were found to display a high number of selection signals (Fig. 1B). Among these, we detected three sites that were independent targets of selection in the PAZ (PIWI-Argonaute-Zwille) domains of the two paralogous genes PIWIL3 (I310 and R311) and PIWIL4 (Q282). These sites define the same structural patch in the 3D structures of the corresponding Piwi proteins (Fig. 1C), suggesting that changes in this region are functionally relevant and adaptive. Although a number of positively selected sites was found to be located in the MID and PAZ domains, none of them is involved in the binding of guide RNAs.

Positive selection signals in TE control genes are enriched in intrinsically disordered regions

The proteins encoded by these TE control genes have different domain architectures and, because several of them bind nucleic acids, they are expected to contain intrinsically disordered regions (IDRs) [34]. We thus used Metapredict v2.0 to predict IDRs in all proteins of the reference human proteome. We found that IDRs are significantly more common in these proteins compared to the proteome average (IDR fraction in TE control proteins = 0.43, average IDR fraction in the human proteome = 0.29, Wilcoxon rank sum test p-value = 1.7 x10-5), and some of them are very long. By analyzing the location of positively selected sites, we observed that a large number of them map to IDRs (Fig. 1B, Additional file 1: Table S2). To determine whether this represents a statistically significant enrichment, we used a binomial test that accounts for the fraction of sites embedded in IDRs. We found that the proportion of sites in IDRs is indeed significantly higher than expected by chance (p-value = 1.795x10-7).

Sequence patterns and ensemble features differ in IDRs depending on the selective regime

We next aimed to investigate whether any relation exists between IDR properties and selection patterns. Although the structural properties of IDRs cannot be predicted, some ensemble properties are quantifiable and provide information on three dimensional features and IDR function [35]. These include two length-independent parameters: the conformational entropy per residue (Sconf/N) and the Flory scaling exponent (ν), a measure of chain compactness. These two features are non-independent, as compact IDRs tend to have low Sconf/N [35]. We thus used a predictor based on support vector regression (SVR) models to calculate Sconf/N and ν for all orthologous IDRs in the primate TE control proteins [35](Additional file 1: Table S3). To determine whether the conformational properties influence selective constraints, for all IDRs, we also calculated the fraction of sites targeted by purifying selection. No correlation was observed between the fraction of negatively selected sites and either Sconf/N or ν (Kendall’s correlation coefficients: 0.081 and 0.027, both p-values > 0.05). This indicates that the selective constraint acting on IDRs is not related to their ensemble conformational features. We next compared ensemble features between IDRs that were or were not targeted by positive selection (i.e., that display at least one positively selected site or none). Interestingly, we found that positively selected IDRs have significantly lower Sconf/N and tend to be more compact, although the difference in ν was not statistically significant (Fig. 2A).

Fig. 2

Conformational features across primate orthologous IDRs. Violin wrapping box and whiskers plots of conformational features calculated for all IDRs in TE control proteins. Plots show the mean values (A) or the standard deviation (B) among orthologs and colors represent IDRs that are (red) or are not (gray) targeted by positive selection. Wilcoxon rank sum test p-values are also reported.

Previous studies showed that the amino acid composition and patterning of residues are important determinants of sequence-ensemble relationships in IDRs [36–41]. We thus used the SVR predictor to calculate the following parameters: sequence hydropathy decoration (SHD, a measure of the patterning of hydrophobic residues) [42], fraction of charged residues (FCR), and net charge per residue (NCPR) (Fig. 2A, Additional file 1: Table S3). We found that positively selected IDRs have significantly higher SHD compared to the non-selected ones. They also tend to have fewer charged residues and more negative charge, although the differences are not significant (Fig. 2A). Overall this suggest that compaction and low entropy in selected IDRs are due to the patterning of hydrophobic/aromatic residues.

We next sought to investigate how positive selection may impact IDR compositional patterns and ensemble properties. For this purpose, we calculated the standard deviations among primate orthologs of conformational parameters (Sconf/N and ν) and sequence features (SHD, FCR, and NCPR). By this approach, we obtained a single measure of feature conservation for each IDR. A comparison between positively selected and non-positively selected IDRs indicated that the former have more variability in both ensemble features and sequence parameters than the latter, although statistical significance was only reached for SHD and NCPR (Fig. 2B). These results suggest that positive selection modulates IDR properties by introducing changes in sequence patterns and ensemble features properties, which are in turn related to function[35].

Positive selection influences the accumulation of new TEs

In general, mammalian genomes display similar levels of TE content and diversity [4]. However, it was recently shown that they show substantial differences in term of gaining new elements [4]. These differences may be related to the ability of different species to control TE mobilization. Thus, in order to evaluate how the selective pressure acting on the TE control genes is related to TE genomic content, we analyzed the accumulation of recently acquired transposable elements in our primate dataset. To this aim, for the sixty genes, we analyzed the presence of episodic positive selection at the terminal branches of the primate phylogenetic tree and, for each species, we counted the number of genes showing a proportion of sites that have evolved under positive selection. We found 25 genes with at least one branch showing evidence of episodic selection, involving different species across the tree (Fig. 3, Additional file 1: Table S2). We also retrieved TE content data generated in a previous work [4] where elements having a genetic distances < 4% when compared with their consensus sequences were considered young or recent TEs (Fig. 3) [4]. We next ran a phylogenetic generalized least squares (PGLS) regression using a Brownian motion to model trait evolution and we found a negative correlation (coefficient: -0.27, p-value = 0.008) between recently acquired TEs and the count of genes under episodic selection at each tip. This correlation is mostly driven by LINE elements, which are the only elements that are significantly associated with gene counts (coefficient: -0.23, p-value = 0.004). These results suggest that episodic positive selection in TE control genes can result in reduced accumulation of new TEs, at least in some species.

Fig. 3

Episodic positive selection on TE control genes and TE accumulation. The phylogenetic tree shows the action of episodic positive selection on a set of primate species. Red branches represent species that are identified as under episodic selection by aBSREL in at least one TE control gene and branch width is proportional to the number of genes under selection. Bar plots indicating the frequency of newly acquired TE elements (all TE and LINEs only) as defined by [4] are also shown. The phylogenetic tree was retrieved from the Zoonomia database [127].

TE control genes were targets of positive selection during the recent evolution of human populations

We next sought to investigate selective patterns during the more recent history of human populations. We thus used Tajima’s D [43] and Fay and Wu’s H [44] statistical tests to assess evidence of long-term, non-recent positive selection in the set of TE control genes. Low values of Tajima’s D identify deviations from genetic neutrality, while low values Fay and Wu’s H capture signals of high-frequency derived alleles [45, 46]. To account for the effect of demographic effects [47], for each test, we examined the distribution of genes with extremely ranked values (low 5%) of both statistical tests across the populations. We identified 22 and 17 genes with extreme low values of D and H respectively, in at least 10 human populations (Fig. 4). Notably, 13 genes exhibited concurrent extreme values for both D and H tests in 10 or more populations, indicating a strong consistent signal of long-term positive selection of these genes across a diverse group of populations (Fig. 4).

Fig. 4

Distribution of TE control genes with extremely low summary statistics values across populations. The y-axis shows genes of interest, while the x-axis indicates the number of populations where each gene exhibits extreme low values (below the 5% threshold) for Tajima’s D or Fay and Wu’s H. Bars are color-coded by statistic, with transparency reduced for counts below the threshold of 10 populations (indicated by the dashed red line marks). Gene names in bold reflect those with extreme values for both D and H statistics, in at least 10 populations.

Extremely dynamic evolution of Piwi/Ago proteins in eukaryotes

Given their relevance for TE silencing and the signatures of positive selection we detected, we next focused on Piwi/Ago proteins and sought to determine their distribution in eukaryotes [48]. To this aim, we used the HMM (hidden Markov model) profile of the PIWI domain, which is the defining feature of this protein family, to search the reference proteomes of representative eukaryotic phyla. After exclusion of reference proteomes with a BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness score lower than 80%, we analyzed 1,815 proteomes and identified 11,642 hits (Additional file 1: Table S4).

We found an extreme variability in the number of PIWI-domain containing proteins in eukaryotes (Fig. 5A). Complete loss was observed in 7.7% of the organisms we analyzed; most of these are fungi, apicomplexa protozoa, and green algae (Chlorophyta). The highest number of Piwi proteins was instead detected in two species of flowering plants in the genus Gossypium (G. barbadense and G. anomalum, phylum Streptophyta), with more than 100 proteins each. Other plant proteomes also showed a remarkable expansion of Piwi proteins, whereas some comprised relatively few (Fig. 5A). This is in line with the notion that the Piwi/Ago family has recurrently expanded and contracted during plant evolution [49].

Fig. 5

Distribution of PIWI modules in eukaryotes. (A) The numbers of PIWI domain-containing proteins in the reference proteomes of eukaryotes in different phyla are reported in the form of box plots. Modules were searched using the HMM profile of the PIWI domain (PF02171). (B) The domain composition of PIWI-containing proteins was analyzed by retrieving UniProt annotations. The frequency with which each domain appears in the PIWI-containing proteins of eukaryotes in different phyla is proportional to the size of the circles. (C) Domain architecture of representative proteins mentioned in the text.

In most metazoa, the number of Piwi proteins had a relatively narrow range, with some exceptions accounted for by species in the phyla Nematoda and Rotifera, plus the single proteomes of Tardigrada (Ramazzottius varieornatus, 18 Piwi hits) and Platyhelminths (Macrostomum lignano, 23 hits) (Fig. 5A). The wide variability in PIWI module content observed in nematodes is in line with the observation that some lineages encode an extended array of these proteins [50], whereas others have lost part of the piRNA pathway [51]. With respect to rotifers, these animals were recently shown to host a wide diversity of active TEs, which most likely triggered expansion of the Piwi pathway [52].

In simpler eukaryotes, a substantial expansion in the number of PIWI-containing proteins was observed in Ciliophora (Fig. 5A). Interestingly, in these organisms, Piwis have taken over novel functions and contribute to programmed DNA elimination, a process whereby somatic nuclear development occurs by removal of TEs and other repeat types [53]. Finally, the situation was very variegated for fungal proteomes (Fig. 5A, Additional file 1: Table S4). A remarkable expansion of Piwi proteins was observed in some organisms in the phylum Mucoromycota. Notably, all these are fungi in the Glomeromycotina subphylum of arbuscular mycorrhizal symbionts, which establish mutualistic interactions with land plants and display high TE proportions in their genomes. Remarkably, these asexual fungi and use TEs to generate genomic diversity while avoiding TE overproliferation through the Piwi pathway and other mechanisms [54].

Eukaryotic Piwi/Ago proteins are known to commonly display a similar architecture, which comprises the signature PIWI domain, plus PAZ, MID, and AgoN (Argonaute N-terminal) domains, as well as linkers and terminal IDRs (Fig. 1B). Prokaryotic Agos have instead a more diverse domain composition [55, 56]. Given the large number of PIWI-containing proteins we obtained with the HMM search, we set out to explore the diversity of eukaryotic PIWI-containing proteins in terms of domain architecture (Additional file 1: Table S4). We thus retrieved from UniProt all domains that are annotated in the eukaryotic PIWI-containing proteins. We detected a total of 72 different domains, with similar domain diversity in fungi, plants, and metazoans (Fig. 5B). Conversely, the few proteins in the other kingdoms showed that PIWI modules only co-occur with kinase domains in Alveolata. As expected, the PAZ domain was the most frequently detected, although it should be noted that its frequency may be underestimated in our analyses as we only relied on available annotations. The same applies to other domains (MID, AgoN, and Argonaute linkers) commonly found in Piwi/Ago proteins (Fig. 5B). A number of domains annotated in PIWI-containing proteins were found to function in the maintenance of genome integrity (e.g., BRCT, HORMA, MCM), in nucleic acid binding (e.g., BZIP, DRBM, HTH, MADF, RRM, SAMD1 and different types of zinc fingers) and in protein-protein interaction (DH, F box, PLAT, PUL, and SAM) (Fig. 5B). We also detected some remarkable domain associations. In three instances from fungi and green algae, the PIWI module was acquired independently by dicer-like proteins, suggesting that such proteins bring together the recognition and cleavage of mobile elements operated by dicer with the effector functions of the PIWI domain [57] (Fig. 5B and Fig. 5C). Another interesting example is represented by three related proteins encoded by fungi in the genus Colletotrichum, where the PIWI module appears in association with cytidine/dCMP deaminase (CDA) domains (Fig. 5B and Fig. 5C). The latter domains play different roles in eukaryotes, including defense from viruses or mobile elements [58]. It is thus possible that the CDA domains in these fungal Piwi/Ago proteins contribute a further effector function in the silencing of TEs or viruses. Similar considerations may apply to metazoan proteins that carry a PIWI module together with domains that are commonly found in antiviral proteins, such as DEAD box helicase and PARP (poly(ADP-ribose) polymerase) [59, 60] (Fig. 5B and Fig. 5C). Notably, we also found two plant proteins that carry both PIWI and reverse transcriptase (RT) domains (Fig. 5B and Fig. 5C). The latter have been co-opted to serve functions in cellular organisms and, in prokaryotes, they often have antiviral roles [61, 62]. Whether this is also the case in eukaryotes remains to be evaluated, but the association of PIWI and RT in these proteins suggests that it may be the case. Finally, we identified three histone acetyltransferases (HAT) in metazoa that recruited PIWI and PAZ modules (Fig. 5B and Fig. 5C). Because histone acetylation is usually associated with an active chromatin conformation, the PIWI domains in these proteins might function in gene regulation rather than TE repression [63].

Structural homology searches identify a bacterial close relative of eukaryotic Piwi/Ago proteins with conserved IDR location

We next sought to explore the deep evolutionary history of Piwi/Ago proteins. Growing evidence has indicated that a number of eukaryotic innate immunity modules, some of which also control TE replication, originated from prokaryotic antiphage systems [64]. These include Piwis/Agos, other components of the same pathway (e.g., PLD6), as well as MOV10/MOV10L1 [64]. Recently, using an HMM-based approach, Bastiaanssen and coworkers showed that Asgard achaea (or Asgardarchaeota) encode a wide diversity of Piwi/Ago genes and identified a protein (HrAgo1) encoded by the Lokiarchaeon ‘Candidatus Harpocratesius repetitus’ that shares a common origin with eukaryotic Piwi and Ago proteins [24]. An alternative strategy to identify distantly related proteins is to rely on structural conservation. The recent development of Foldseek allows the rapid screening of protein structure databases with high sensitivity [65]. We thus used the structures (or Alphafold models) of human Ago1-4 (hAgo1-4) and Piwi1-4 (hPiwi1-4) to search for proteins with similar folds in bacteria and archaea. Surprisingly, in all cases the best hit, with E values < 1.49 x 10–53, was an Argonaute protein encoded by a Planctomycetota bacterium (PlaAgo, A0A7C6F4G8) sampled from hot springs in the USA (strain SpSt-977) (Additional file 1: Table S5). PlaAgo shows higher sequence identity with hAgos than HrAgo1, making it the closest relative of human argonautes identified in bacteria or archaea. Sequence identity with hPiwis is instead lower (Fig. 6A). Pairwise structural alignments of PlaAgo with hAgo4 and hPiwi2 using TM-align showed very high scores and structural similarity covering all domains (AgoN, L1, PAZ, L2, MID, and PIWI) (Fig. 6B and 6C). Based on the comparison with hAgos, the catalytic tetrad of PlaAgo is formed by residues Asp728, Glu774, Asp806, and His946.

Fig. 6

Structural homology results and characterization of PlaAgo. (A) Pairwise sequence identity between human Piwis/Agos and PlaAgo or HrAgo1. (B) Domain organization of PlaAgo. IDRs are shown as blue bars, as in Fig. 1B (C) Structural alignment of the PlaAgo Alphafold model (Alphafold DB: AF-A0A7C6F4G8-F1-v4) with the structures of hAgo4 (PDB: 6OON) or hPiwi2 (PDB: 7YFX). Domains are color-coded as in (B). (D) Phylogeneetic tree of the MID-PIWI domains of representative prokaryotic and eukaryotic Piwis/Agos. The tree was generated with IQtree and it is mid-point rooted. We performed 1000 ultrafast bootstrap replicates and nodes with support higher than 0.75 are indicated with a gray dot. (E) Phylogenetic tree generated from the FoldMason structural alignment of PlaAgo and well characterized eukaryotic and prokaryotic Piwis/Agos. Abbreviations: siwi, Bombyx mori Piwi; EfPiwi, Ephydatia flauviatilis Piwi; EfAgo, Ephydatia flauviatilis Ago, dmPiwi, Drosophila melanogaster Piwi; KpAgo, Kluyveromyces polysporus Ago; CbAgo, Clostridium butyricum Ago; PfAgo, Pyrococcus furiosus Ago; MjAgo, Methanocaldococcus jannaschii Ago; MpAgo, Marinitoga piezophila Ago; RsAgo, Rhodobacter sphaeroides Ago; AfAgo, Archaeoglobus fulgidus Ago; TtAgo, Thermus thermophilus Ago; AaAgo, Aquifex aeolicus Ago. The IDs for all structures/models are reported in Table S7 (Additional File 1). Organism silhouettes were obtained from the PhyloPic database (https://www.phylopic.org/). (F) Local similarity, computed with the PRSS program, between PlaAgo and either hPiwi1 or hAgo1. The regions of significant similarity are denoted by the green shadow. The N- and C- terminal IDRs are depicted in blue and red, and the alignments are shown.

To explore in further detail the relationships among PlaAgo and other prokaryotic and eukaryotic Piwis/Agos, we generated a phylogenetic tree using the MID-PIWI domains of several Piwi/Ago proteins (Additional file 1: Table S6). These included both short (solely consisting of the PIWI and MID domains) and long (also including the PAZ and AgoN domains) prokaryotic Agos, either catalytically active (long AgoA) or not (long AgoB and short Agos). Results indicated that plaAgo is sister to eukaryotic Agos, whereas, in line with previous data, HrAgo1 is sister to eukaryotic Agos and Piwis (Fig. 6D) [24]. A phylogenetic tree generated from a FoldMason structural alignment of a subset of well characterized Piwis/Agos confirmed these findings (Fig. 6E, Additional file 1: Table S7).

Interestingly, in contrast to HrAgo1 (which only comprises a short disordered N-terminal region), PlaAgo presents IDRs at both the N- and C-termini (Fig. 6C), reminiscent of the domain organization of hAgos and hPiwis. We thus focused on the IDRs in PlaAgo, as protein disorder is known to be relatively uncommon in prokaryotes [66–68]. Indeed, out of 126 prokaryotic Agos we included in the phylogenetic tree, only 9 have IDRs (most of these internal rather than terminal). We first asked whether some sequence conservation between PlaAgo and hPiwis/hAgos was detectable within the IDRs. We thus inspected local similarity using the PRSS program, which computes the significance of a pairwise alignment by shuffling the amino acids in one of the two sequences. Remarkably, results indicated that, when PlaAgo is compared to hAgo1-4, significant sequence similarity extends through the C-terminal IDR. With respect to the N-terminal PlaAgo IDR, it shows significant similarity to the IDRs in hPiwi1 and hPiwi3 (Fig. 6F).

Sequence conservation of IDRs across huge evolutionary distances is quite unusual. We thus retrieved the N- and C-terminal IDRs of Piwi/Ago proteins of representative organisms from different realms of life (Additional file 1: Table S8). We found higher sequence conservation for the C-terminal IDR than for the N-terminal one, which also widely differed in length, with the former being shorter (Fig. 7A). Previous studies indicated that whereas sequence and size conservation is low in IDRs, specific features may be instead conserved across orthologs [37, 38, 69–71]. We thus used the SVR predictor [35] to calculate length-independent parameters. In addition to Sconf/N and ν, we computed SHD and NCPR. Results indicated that N-and C- terminal IDRs are relatively homogeneous in terms of conformational parameters and sequence features (Fig. 7B). Despite relatively high NCPR, those at the N-terminus have low entropy and are compact, a feature likely determined by the clustering of hydrophobic residues [42, 72] (Fig. 7B). The C-terminal IDRs are extended and have high Sconf/N; with two exceptions, they tend to be negatively charged with low SHD (Fig. 7B). Overall, these results indicated that, despite the huge phylogenetic distance and poor sequence/size similarity, the IDRs in Piwi/Ago proteins maintain some patterning features that in turn reflect into conserved conformational features.

Fig. 7

IDR sequence divergence and conservation of conformational features. (A) Alignment of IDRs located in the N-terminal and C-terminal regions of Piwi/Ago proteins from different organisms. The shading denotes the percentage of identity and was generated using Jalview. Abbreviations that were not defined in the text or in the legend of Fig. 6 are as follows: oryAgo, Oryza sativa Ago; miwi, Mus musculus Piwi; dnPiwi, Danio rerio Piwi; ggPiwi, Gallus gallus Piwi; seawi, Strongylocentrotus purpuratus Piwi; seaAgo, Strongylocentrotus purpuratus Ago; xPiwi, Xenopus laevis Piwi; mmAgo, Mus musculus Ago; ggAgo, Gallus gallus Ago; dnAgo1, Danio rerio Ago1; xAgo, Xenopus laevis Ago; boAgo, Bombyx mori; dmAgo1, Drosophila melanogaster Ago1 (all protein IDs in Additional File 1: Table S8). (B) The conformational features of the IDRs located at the N- or C-termini of Piwi/Ago proteins are plotted, together with SHD and NCPR. All parameters were calculated using the SVR predictor [35].

Discussion

TE mobilization poses a significant fitness challenge to host genomes. As a consequence, a variety of systems have emerged to curb and silence TE activity. Just like TEs, such systems are widespread across the tree of life and their evolution is expected to be shaped by recurrent arms races, as new TE families emerge and start amplifying. In addition to TEs, viruses can also exert a selective pressure on some of these genes, as the pathways that control retrotransposition are often also involved in the restriction of exogenous viruses [73]. For instance, the HUSH complex is a crucial repressor of HIV and it is counteracted by the Vpr and Vpx viral proteins [74, 75].

Here, we explored the evolution of proteins that control TEs over different timescales. Our data indicate that, in primates, a considerable fraction of TE control genes evolved under positive selection, as expected in a genetic conflict scenario [76]. In particular, proteins that participate in the piRNA pathway were found to have experienced very strong selection, resulting in the identification of many sites showing fast amino acid replacement. These proteins play a central role in the silencing of retrotransposon activity in the germline, where epigenetic reprogramming results in the derepression of TEs [15]. The germline also represents the stage where the genomic conflict plays out, as new TE insertions can be transmitted through vertical inheritance only if they occur in the germline or before its specification (i.e., in the early stages of embryonic development). Indeed, the expression of many TEs seems to be restricted to various stages of gametogenesis or early embryogenesis, whereas activity in somatic tissues holds no apparent benefit to TEs [77]. That primates (like most cellular organisms) have evolved additional layers of TE control in the germline, and that these layers experience the strongest selective pressure, are thus in line with expectations under an arms race scenario.

We next reasoned that changes resulting from positive selection at these genes might impact the ability of individual primate species to control TE activity. A large-scale study of TE content in mammalian genomes showed that recently accumulated TEs occur at different frequencies across mammals and that dietary habits in a few orders, but not in primates, are responsible for some differences. The authors thus suggested that other factors, such as TE defense systems, may also contribute [4]. Here, we found that primate taxa that experienced positive selection at more TE control genes have a lower proportion of recent TE insertions compared to the ones that experienced selection at no or fewer genes. Albeit preliminary and based on a relatively small set of observations, these data suggest that positively selected changes in proteins that control TE activity have an impact on the mobilization of these elements in the genomes of extant species.

Recent TE activity has also occurred in humans and a few elements, all of them retrotransposons, are still mobile in the our genomes, as exemplified by polymorphic insertions [4]. For instance, SVAs (a family of SINEs specific to hominins) show evidence of relatively high recent activity, as also testified by the existence of two subfamilies specific to humans [4]. We thus screened the TE control genes for evidence of positive selection in human populations, which may result from the ongoing pressure exerted by mobile elements. We detected several signals of strong positive selection across populations. In line with findings in primates, we identified as selection targets three genes (TEX15, GTSF, and GTSFL1) that participate to the piRNA pathway [15]. With the exclusion of TEX15, however, the genes that experienced the strongest selective pressure during the recent evolution of human populations did not overlap with those undergoing fast evolution in primates. Interestingly, four of the 13 genes with strong evidence of selection encode components of the NuRD complex. In the context of TE suppression, this complex interacts with KAP1 and mediates de novo heterochromatin formation, thus transcriptionally suppressing LTR retrotransposons [78]. However, NuRD plays additional functions in the cell and its activity is essential for the regulation of gene expression, cell fate specification, and cell cycle progression [79]. As a consequence, deregulation of NuRD has been associated with cancer and neurodevelopmental disorders [79, 80]. Moreover, NuRD is the target of several infectious viruses, most notably herpesviruses [81–83], that hijack transcriptional regulation. It is thus possible that the selective pressures acting in human populations on NuRD components are not directly linked to its TE-silencing activity. It is also worth noting that variants in some of the genes (MTA1, RBBP4, AGO1, SAFB2, and TET1) showing strong evidence of positive selection in human populations are associated to anthropometric traits (waist-to-hip ratio adjusted for BMI, height) (https://www.ebi.ac.uk/gwas/home). In turn, such traits were shown to carry evidence of polygenic adaptation in humans [84–87], raising the possibility that the signals we detected are related to the pleiotropic functions of these genes.

Given that their biological functions often involve nucleic acid binding, the proteins encoded by the genes we analyzed are particularly rich in IDRs, which pose challenges in their analysis: they are in general poorly conserved and, because they lack stable secondary or tertiary structures, the biological and biochemical functions of IDRs are difficult to predict [34]. As a consequence, IDRs and their evolutionary trajectories remain under-investigated [35]. We found that signals of positive selection are significantly enriched within the disordered portions of proteins involved in TE control. Whereas this finding is in line with a previous study of human IDRs, it opens the questions of how changes in IDRs can promote adaptation and which biological functions are affected by such changes [88]. We thus applied recently developed methods to investigate IDR sequence-ensemble properties in relation to evolutionary parameters. As growing evidence indicates that chain compaction and conformational entropy are relevant descriptors of IDR function, we first asked whether these parameters correlate with the level of selective constraint, measured as the fraction of sites experiencing purifying selection. We found no evidence of correlation, which was somehow unexpected because previous data showed that human pathogenic variants are more likely to be located in IDRs with low conformational entropy compared to benign variants [35], suggesting that such IDRs are less tolerant to change. It is however possible that human IDRs are heterogeneous with respect to the relationship between tolerance and entropy, which may depend on protein function. Analysis of positive selection signals however showed that they are preferentially located in regions of low conformational entropy. We thus hypothesized that, by altering the sequence composition or patterning of the IDRs, the positively selected sites might modulate the conformational ensemble features and, hence, IDR functions. By calculating across-orthologs variance, we verified this hypothesis, although statistical significance was often borderline, most likely due to the small number of comparisons. Experimental investigation or molecular dynamic simulations will be required to determine the effect of positive selection on IDR function in these proteins. Nonetheless, our data show that the evolutionary trajectories of IDRs may be linked to their conformational characteristics, exactly in the same way as structured domains often evolve by changes that affect the solvent-exposed surface rather than the inner hydrophobic cores. This conclusion is reminiscent of findings on IDRs in proteins that derived from retrotransposon domestication, suggesting that it may represent a general feature [89].

In addition to evolving changes within a fixed repertoire of proteins, another strategy to achieve TE control is by gene amplification. For instance, in the case of KZFPs new gene copies appeared during vertebrate evolution in response to recent LTR retroelement activity and, in mammals, the emergence of new families of endogenous retroviruses is predictive of the appearance of new duplicate KZNF genes [22]. Piwi/Ago proteins are ubiquitous across the tree of life. Together with the signature PIWI module, these proteins carry all or some of the domains that are characteristic of the family [48]. In eukaryotes, the number of paralogous Piwi/Ago proteins is known to vary widely as a consequence of gene duplication and losses [48]. However, compared to prokaryotic Agos, eukaryotic Agos are are thought to be less diverse in terms of sequence and domain composition [55, 56]. Because a comprehensive analysis of the representation and domain architecture of PIWI-containing proteins in eukaryotes was missing, we systematically analyzed the proteomes of eukaryotes that are representative of the diversity in the superkingdom. In line with previous reports on model organisms or individual taxa [49–54, 90], we identified a large variation in the number of PIWI-containing proteins in these organisms, with the highest number detected in plants and, among metazoa, in nematodes, rotifers, tardigrades, and flat worms. However, we also found that the domain architecture of these PIWI-containing proteins is much more assorted than previously thought, and that the number of different domains associated with these proteins is similar in higher eukaryotes and fungi. Notably, some of these PIWI-containing proteins combine additional effector domains, which may strengthen their defense role against TEs and other mobile elements (e.g., viruses). For instance, CDA domains are thought to have originated from bacterial toxins and, in eukaryotes, often function in DNA/RNA editing and in the defense from viruses or mobile elements [58]. In tetrapods, the CDA domains in APOBEC and ADAR proteins are essential for antiviral defense by hypermutation and other mechanisms [91, 92]. Likewise, PARP family members are important elements of innate immunity in mammals, whit broad antiviral activity [60]. These domain architectures, which contribute additional effector domains to Piwis/Agos, are reminiscent of prokaryotic defense systems. In fact, many prokaryotic Agos contain domains absent in eukaryotic Agos, including the SIR2 (Silent information regulator 2) and TIR (Toll-interleukin receptor) modules, which confer protection either by NAD + depletion or by generating signaling molecules [55, 56]. Additionally, catalytically inactive prokaryotic Agos were found to commonly act in complex with effector molecules, which in most cases kill the infected host cell in response to the detection of non-self genetic elements [55, 56, 93].

Given, the recent pieces of evidence indicating that several eukaryotic defense systems originated in prokaryotes, and the consequent emergence of an evo-immuno discipline [64, 94], we also explored the origin of eukaryotic Piwi/Ago proteins in prokaryotes. By relying on structural homology searches, we identified a bacterial protein, PlaAgo, very closely related to human Agos. This was unexpected, as recent works showed that Asgard archaea host a wide diversity of Piwi/Ago proteins, suggesting that they contributed to the early diversification of these modules in prokaryotes and to their origin in eukaryotes [24, 25]. The identification of HrAgo1 as sister to eukaryotic Piwi/Ago, and its structural features characteristic of both Piwis and Agos, support this idea [24]. As opposed to several Asgard Agos, HrAgo1 displays all structured domains that are characteristic of eukaryotic Piwis/Agos, a feature also observed in PlaAgo. The latter, despite showing higher sequence and structure similarity to hAgos than to hPiwis, comprises IDRs at both termini. This feature is reminiscent of the module structure of both eukaryotic Piwis (most of which have IDRs at the N-terminus) and of eukaryotic Agos (with IDRs at the C-terminus). This is also remarkable because IDRs are known to be relatively scant in prokaryotes [66–68]. It is difficult to synthesize these findings into a model of Piwi/Ago evolution. The evidence for an origin in Asgard archaea is very strong and supported by the phylogenetic placement of HrAgo1 [24]. However, it is also established that horizontal gene transfer (HGT) among prokayotes and between prokaryotes and early eukaryotes most like contributed the overall genetic makeup of extant living organisms [64]. It is thus possible that PlaAgo emerged through HGT of an ancestral Piwi/Ago protein from early eukaryotes to bacteria.

Although we cannot exactly reconstruct the events that originated PlaAgo and eukaryotic Piwi/Ago, our analyses reveal surprising evolutionary relationships across very long timeframes. In this respect, the positional and sequence conservation of the IDRs in eukaryotes and bacteria is remarkable. We have no knowledge of instances of conservation across similarly deep time scales. This particularly applies to the C-terminal IDRs found in several Ago proteins across eukaryotes and in PlaAgo. The N-terminal IDR, characteristic of Piwi proteins, is much less conserved in sequence and in length and, when multiple proteins from distantly related organisms were aligned, only a few positions were found to be conserved. However, previous studies indicated that IDRs with highly divergent primary sequences can perform similar functions, which are in turn recapitulated by sequence patterning and ensemble properties [37, 38, 69, 71]. Indeed, computation of chain compaction and conformational entropy, as well as of hydropathy decoration and net charge per residue, showed relatively similar values despite large differences in IDR size and primary sequence. It is thus possible that, across organisms as disparate as bacteria, plants, sponges, and vertebrates, the IDRs of Piwi/Ago proteins evolve under stabilizing selection to maintain a shared function, as previously shown for IDRs in yeasts [37]. Experimental analysis by IDR swapping will be necessary to test whether this is the case and to clarify the functional role of IDRs in Piwi/Ago proteins.

Conclusions

We performed an evolutionary analysis of TE control systems across different timescales. Our working hypothesis was that the evolution of such systems has been dynamic, as the result of genetic conflicts with TEs. In line with this scenario, we show that positive selection, module duplication, and tinkering of domain architectures contributed to shape the arsenal of mechanisms that silence TE activity in eukaryotes. In prokaryotes, we identify a close relative of human Agos and we report unusually deep conservation of IDR sequence and ensemble features across huge evolutionary distances. Overall, our data provide insight into the evolution and diversity of TE control systems and contribute information to the rapidly growing evo-immuno discipline.

Methods

Gene selection

The list of sixty coding genes involved in TE control was derived from previous reviews and individual studies [15, 95–98] (Additional file 1: Table S1).

Primate sequence retrieval and positive selection analysis

We obtained primate orthologous coding sequences from the NCBI database (http://www.ncbi.nlm.gov). Primate genes carrying premature stop codons or with a low sequence coverage were excluded. A list of primate species analyzed for each gene is reported in Additional File 1: Table S1.

The RevTrans 2.0 utility was used to generate multiple sequence alignments (MSAs) using MAFFT v6.240 as an aligner [99, 100]. The resulting alignments were manually trimmed and adjusted to contain only well-aligning regions. MSAs were analyzed for the presence of recombination signals using Genetic Algorithm Recombination Detection (GARD, parameters: general discrete model, three rate classes) [101]. GARD is a genetic algorithm based on phylogenetic incongruence among alignment segments which detects the best-fit number and location of recombination breakpoints. When evidence of recombination was detected, the coding alignment was split based on the recombination breakpoints and sub-regions (of at least 50 nucleotides) were used as input for subsequent molecular evolution analyses. Recombination breakpoints were identified for MOV10L1, PIWIL3, and TNRC18 alignments (Additional File 1: Table S2).

Phylogenetic trees were reconstructed using the phyML program (version 3.1) with a General Time Reversible (GTR) model plus gamma-distributed rates and 4 substitution rate categories with a fixed proportion of invariable sites [102]. The MSAs and phyML trees were used as inputs to perform evolutionary analysis. To detect positive selection, the codon-based codeml program implemented in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite was applied [32], using the F3x4 codon frequencies model (codon frequencies estimated from the nucleotide frequencies in the data at each codon site) [32, 33]. A model (M8, positive selection model) that allows a class of sites to evolve with dN/dS > 1 was compared to two models (M7 and M8a, neutral models) that do not allow dN/dS > 1. To assess statistical significance, twice the difference of the likelihood (ΔlnL) for each model (M8a vs M8 and M7 vs M8) is compared to a χ2 distribution (1 and 2 degrees of freedom for M8a vs M8 and M7 vs M8 comparisons, respectively). To be conservative and to obtain robust results, we called a gene as a target of positive selection only if it was detected by both M7 versus M8 and M8a versus M8 comparisons.

In order to identify specific sites subject to positive selection, we applied four different methods: 1) the Bayes Empirical Bayes (BEB) analysis (with a cutoff of 0.90), which calculates the posterior probability that each codon is from the positive selection site class (under M8 model) [103]; 2) Fast Unbiased Bayesian AppRoximation (FUBAR) (with a cutoff of 0.90), an approximate hierarchical Bayesian method that generates an unconstrained distribution of selection parameters to estimate the posterior probability of positive diversifying selection at each site in a given alignment [104]; 3) Mixed Effects Model of Evolution (MEME) (with a p-value cutoff < 0.1), which allows the distribution of dN/dS to vary from site to site and from branch to branch at a site [105]; 4) Fixed Effects Likelihood (FEL) (with a p-value cutoff < 0.1), a maximum-likelihood (ML) approach to infer dN/dS on a per-site basis, assuming that the selection pressure for each site is constant along the entire phylogeny [106]. Again, to be conservative and to limit false positives, only sites detected using at least two methods were considered as positive selection targets.

Episodic positive selection was detected using the adaptive branch-site REL test for episodic diversification (aBSREL) [107], using a subset of primate species for which data of TE genomic content is available [4]. aBSREL tests which branches in a phylogeny have a proportion of sites that have evolved under positive selection. In this case, for each gene alignment, we tested the tips of a primate phylogenetic tree derived from the zoonomia database [4].

The single-likelihood ancestor counting (SLAC) method was applied to identify sites under negative/purifying selection [106]. GARD, FUBAR, MEME, FEL, SLAC, and aBSREL analyses were run locally through the HyPhy suite V2.5.29 [108].

Prediction of intrinsically disordered regions

The sequences of reviewed (Swiss-Prot) canonical proteins (20,420) of the reference human proteome (UP000005640) were retrieved from Uniprot ( https://www.uniprot.org/ ). Intrinsically disordered regions (IDRs) were estimated using the Metapredict V2 tool [109, 110], that defines which residues from a protein sequence are disordered by applying a deep-learning algorithm based on a consensus score calculated from eight different predictors [109]. Metapredict V2 was run using default parameters and consecutive disordered stretches equal to or longer than 30 residues were labeled as IDRs. For primate genes, IDRs were kept only if the same orthologous region was predicted in at least 50% of the species analyzed.

Analysis of IDR conformational properties and sequence patterns

The conformational entropy per residue (Sconf/N) and the Flory scaling exponent (ν) were calculated for all analyzed IDRs. Sconf/N is a measure of the landscape of different structures accessible to an IDR. ν derives from the scaling laws of polymers that describe how chain dimensions vary as a function of chain length [111]. Both parameter were estimated using a Colab notebook (https://colab.research.google.com/github/KULL-Centre/_2023_Tesei_IDRome/blob/main/IDR_SVR_predictor.ipynb) [35, 112], which uses a support vector regression model trained on simulations performed using the CALVADOS model [112, 113]. The same SVR predictor was used to derive other measures: SHD (sequence hydropathy decoration) [42], FCR (the fraction of charged residue), and NCPR (net charge per residue).

Population genetics analyses

We conducted our analysis on a dataset of variant call format (VCF) files comprising genetic variation data of autosomal chromosomes from 54 distinct human populations and 828 individuals from the Human Genome Diversity Panel [114]. VCF files were further indexed and filtered for minimum quality parameters (-minQ/-minMapQ < 20) using bcftools and samtools [115, 116].

For each population, we calculated two summary statistics of genetic diversity: Tajima’s D and Fay’s H. These statistics were calculated in sliding windows, with a window size of 50 kilobases (kb) with a step size of 10 kb, using the software ANGSD [117]. We ranked the values for each summary statistics, and each population separately, then assigned the minimum rank value to each gene, considering all the windows encompassing its genomic coordinates.

We found no significant difference in lengths between candidate genes and the rest of the human genes. To test for signals of selection, we ranked the calculated summary statistics within each population. For each gene, we identified windows that fell into the extreme 5% of ranked values for Tajima’s D and Fay and Wu’s H (Additional File 1: Table S9). We then tagged the TE control genes that fell in those lower values and visualized the frequency of which those genes are found across the multiple populations. The analysis was conducted using RStudio and R version 4.2.2 and utilized ggplot2 package [118] for visualization.

Search of the PIWI domain in eukaryotic proteomes

The HMM profile of the PIWI domain (PF02171) was downloaded from InterPro (https://www.ebi.ac.uk/interpro/). The profile was used as the input for hmmsearch (version 2.39.0) to search the reference proteomes of representative eukaryotic phyla [119]. Default parameters for hit identification were applied (Significance E-value < 0.01 and < 0.03 for sequences and hits). BUSCO scores [120] were obtained from UniProt and only proteomes with a completeness score higher than 80% were retained.

Structure similarity search and comparison with bacterial/archaeal proteins

Protein structures or models were obtained from the RCSB PDB (www.rcsb.org) or Alphafold database [121, 122]. IDs are available in figure legends or in Tables S5 and S7 (Additional File 1).

Foldseek (https://search.foldseek.com/) [65] was run using either the solved structures or Alphafold models of hAgo1-4 and hPiwi1-4 (Additional File 1: Table S5) limiting the search to Eubacteria and Archaea.

For pairwise structure alignment, TM-align [123] was run from the dedicated website (https://zhanggroup.org/TM-align/) using the model of PlaAgo (AF-A0A7C6F4G8-F1-model_v4) and the resolved structures of hAgo4 (6OON) or hPiwi2 (7YFX).

A structural alignment of PlaAgo and well characterized Piwis/Agos was obtained using FoldMason [124]. The multiple structural alignment (MSTA) file was used as the input for IQ-TREE [125]

A phylogenetic tree of prokaryote and eukaryote Piwis/Agos was generated by aligning representative sequences (including some from Asgard archaea in Bastiaanssen and coworkers) [24] with MAFFT [100] (Additional File 1: Table S6). The aligned sequences were used as an input for IQ-TREE.

PRSS was accessed through the University of Virginia website (https://fasta.bioch.virginia.edu/) [126] and was run using default parameters (Blosum50 scoring matrix, gap open and extension penalty − 8 and − 2, 200 uniform shuffles).

Statistical analysis

Linear Models were run using a phylogenetic generalized least squares (PGLS) model. In particular, for each primate species, we counted the number of genes showing evidence of episodic positive selection and we correlated these counts with the frequency of recently acquired transposable elements as defined by [4]. To account for the phylogenetic relationship among species, a phylogenetic tree retrieved from the zoonomia database [127] was implemented in the model under a Brownian motion process of evolution. PGLS models were run using the R package ape [128].

To evaluate whether positively selected sites are enriched in IDRs, we applied a binomial test, using the counts of sites falling in IDRs as number of successes, the number of total selected sites as trials, and the ratio between all IDR length divided by the length of all analyzed regions as the probability of success.

Wilcoxon rank sum and Kendall’s correlation tests were performed in the R v.4.0.5 environment.

Data availability

All data are available in the manuscript and its supplementary files.

Acknowledgements

Not applicable.

Funding

This work was supported by the Italian Ministry of Health (“Ricerca Corrente” to MS). RS is funded by a studentship from the Qaddumi Foundation and the Faculty of Medicine at Al-Quds University.

Author Contribution

MS, RC, and DF conceived the study and designed the experiments; RC, DF, UP, AM, and RS performed the experiments and generated the data, with assistance from MS and MF; RC, DF, UP, AM, and RS conducted the data analysis with support from MS and MF; MS, RC, DF, and RS drafted the manuscript with input from UP, MF and AM; MS, DF and RC organized the data and finalized the manuscript.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Supplementary Information:

Additional file 1

Table S1. List of primate species used for the positive selection analysis.

Table S2. Likelihood ratio test statistics for models of variable selective pressure among sites (F3x4).

Table S3. Conformational parameters for all primate IDRs.

Table S4. Domain composition of PIWI-containing proteins in eukaryotes.

Table S5. Foldseek results of structural homology searches with human Ago/Piwi proteins.

Table S6. List of proteins used to generate the tree of MID-PIWI domains of representative prokaryotic and eukaryotic Agos/Piwis.

Table S7. List of proteins used to generate the multiple structure alignment.

Table S8. List of proteins used to generate the multiple alignments of N-terminal and C-terminal IDRs.

Table S9. Summary statistics minimum ranks for TE control genes.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Data Availability

Data is provided within the manuscript or supplementary information files

References

De Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. Copenhaver GP, editor. PLoS Genet. 2011;7:e1002384.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The Sequence of the Human Genome. Science. 2001;291:1304–51.

Haudiquet M, De Sousa JM, Touchon M, Rocha EPC. Selfish, promiscuous and sometimes useful: how mobile genetic elements drive horizontal gene transfer in microbial populations. Phil Trans R Soc B. 2022;377:20210234.

Osmanski AB, Paulat NS, Korstian J, Grimshaw JR, Halsey M, Sullivan KAM, et al. Insights into mammalian TE diversity through the curation of 248 genome assemblies. Science. 2023;380:eabn1430.

Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007;8:973–82.

Finnegan DJ. Eukaryotic transposable elements and genome evolution. Trends Genet. 1989;5:103–7.

Eickbush TH, Jamburuthugoda VK. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res. 2008;134:221–34.

Thomas J, Pritham EJ. Helitrons, the Eukaryotic Rolling-circle Transposable Elements. Chandler M, Craig N, editors. Microbiol Spectr. 2015;3:3.4.03.

Kapitonov VV, Jurka J. Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci USA. 2006;103:4540–5.

10.

Wells JN, Feschotte C. A Field Guide to Eukaryotic Transposable Elements. Annu Rev Genet. 2020;54:539–61.

11.

Hall JPJ, Harrison E, Baltrus DA. Introduction: the secret lives of microbial mobile genetic elements. Phil Trans R Soc B. 2022;377:20200460.

12.

Doolittle WF, Sapienza C. Selfish genes, the phenotype paradigm and genome evolution. Nature. 1980;284:601–3.

13.

Almojil D, Bourgeois Y, Falis M, Hariyani I, Wilcox J, Boissinot S. The Structural, Functional and Evolutionary Impact of Transposable Elements in Eukaryotes. Genes. 2021;12:918.

14.

Joly-Lopez Z, Bureau TE. Exaptation of transposable element coding sequences. Curr Opin Genet Dev. 2018;49:34–42.

15.

Almeida MV, Vernaz G, Putman ALK, Miska EA. Taming transposable elements in vertebrates: from epigenetic silencing to domestication. Trends Genet. 2022;38:529–53.

16.

Jangam D, Feschotte C, Betrán E. Transposable Element Domestication As an Adaptation to Evolutionary Conflicts. Trends Genet. 2017;33:817–31.

17.

Schrader L, Schmitz J. The impact of transposable elements in adaptive evolution. Mol Ecol. 2019;28:1537–49.

18.

Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19:199.

19.

Iwasaki YW, Siomi MC, Siomi H, PIWI-Interacting RNA. Its Biogenesis and Functions. Annu Rev Biochem. 2015;84:405–33.

20.

Ozata DM, Gainetdinov I, Zoch A, O’Carroll D, Zamore PD. PIWI-interacting RNAs: small RNAs with big functions. Nat Rev Genet. 2019;20:89–108.

21.

Arif A, Bailey S, Izumi N, Anzelon TA, Ozata DM, Andersson C, et al. GTSF1 accelerates target RNA cleavage by PIWI-clade Argonaute proteins. Nature. 2022;608:618–25.

22.

Thomas JH, Schneider S. Coevolution of retroelements and tandem zinc finger genes. Genome Res. 2011;21:1800–12.

23.

Seczynska M, Lehner PJ. The sound of silence: mechanisms and implications of HUSH complex function. Trends Genet. 2023;39:251–67.

24.

Bastiaanssen C, Bobadilla Ugarte P, Kim K, Finocchio G, Feng Y, Anzelon TA, et al. RNA-guided RNA silencing by an Asgard archaeal Argonaute. Nat Commun. 2024;15:5499.

25.

Leão P, Little ME, Appler KE, Sahaya D, Aguilar-Pine E, Currie K, et al. Asgard archaea defense systems and their roles in the origin of eukaryotic immunity. Nat Commun. 2024;15:6386.

26.

Werren JH. Selfish genetic elements, genetic conflict, and evolutionary innovation. Proc Natl Acad Sci USA. 2011;108:10863–70.

27.

Hurst LD, Atlan A, Bengtsson BO. Genetic Conflicts. Q Rev Biol. 1996;71:317–64.

28.

Jacobs FMJ, Greenberg D, Nguyen N, Haeussler M, Ewing AD, Katzman S, et al. An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons. Nature. 2014;516:242–5.

29.

Molaro A, Malik HS, Bourc’his D. Dynamic Evolution of De Novo DNA Methyltransferases in Rodent and Primate Genomes. Mol Biol Evol. 2020;37:1882–92.

30.

Lasserre A, Marie S, Morel M, Martin MM, Legrand A, Vauthier V et al. MORC2 restriction factor silences HIV proviral expression [Internet]. 2023 [cited 2024 Dec 11]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.03.29.534756

31.

Kosuge M, Ito J, Hamada M. Landscape of evolutionary arms races between transposable elements and KRAB-ZFP family. Sci Rep. 2024;14:23358.

32.

Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–91.

33.

Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl biosciences: CABIOS. 1997;13:555–6.

34.

Holehouse AS, Kragelund BB. The molecular basis for cellular function of intrinsically disordered protein regions. Nat Rev Mol Cell Biol. 2024;25:187–211.

35.

Tesei G, Trolle AI, Jonsson N, Betz J, Knudsen FE, Pesce F, et al. Conformational ensembles of the human intrinsically disordered proteome. Nature. 2024;626:897–904.

36.

Das RK, Ruff KM, Pappu RV. Relating sequence encoded information to form and function of intrinsically disordered proteins. Curr Opin Struct Biol. 2015;32:102–12.

37.

Zarin T, Tsai CN, Nguyen Ba AN, Moses AM. Selection maintains signaling function of a highly diverged intrinsically disordered region. Proc Natl Acad Sci USA [Internet]. 2017 [cited 2024 May 16];114. Available from: https://pnas.org/doi/full/10.1073/pnas.1614787114

38.

Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman-Kay JD, Moses AM. Proteome-wide signatures of function in highly diverged intrinsically disordered regions. eLife. 2019;8:e46883.

39.

Sherry KP, Das RK, Pappu RV, Barrick D. Control of transcriptional activity by design of charge patterning in the intrinsically disordered RAM region of the Notch receptor. Proc Natl Acad Sci USA [Internet]. 2017 [cited 2024 May 16];114. Available from: https://pnas.org/doi/full/10.1073/pnas.1706083114

40.

Beveridge R, Migas LG, Das RK, Pappu RV, Kriwacki RW, Barran PE. Ion Mobility Mass Spectrometry Uncovers the Impact of the Patterning of Oppositely Charged Residues on the Conformational Distributions of Intrinsically Disordered Proteins. J Am Chem Soc. 2019;141:4908–18.

41.

Holehouse AS, Das RK, Ahad JN, Richardson MOG, Pappu RV. CIDER: Resources to Analyze Sequence-Ensemble Relationships of Intrinsically Disordered Proteins. Biophys J. 2017;112:16–21.

42.

Zheng W, Dignon G, Brown M, Kim YC, Mittal J. Hydropathy Patterning Complements Charge Patterning to Describe Conformational Preferences of Disordered Proteins. J Phys Chem Lett. 2020;11:3408–15.

43.

Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–95.

44.

Fay JC, Wu C-I. Hitchhiking Under Positive Darwinian Selection. Genetics. 2000;155:1405–13.

45.

Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG. Recent and ongoing selection in the human genome. Nat Rev Genet. 2007;8:857–68.

46.

Ramos-Onsins SE, Marmorini G, Achaz G, Ferretti L. A General Framework for Neutrality Tests Based on the Site Frequency Spectrum. Genes. 2023;14:1714.

47.

Voight BF, Kudaravalli S, Wen X, Pritchard JK. A Map of Recent Positive Selection in the Human Genome. Hurst L, editor. PLoS Biol. 2006;4:e72.

48.

Gutbrod MJ, Martienssen RA. Conserved chromosomal functions of RNA interference. Nat Rev Genet. 2020;21:311–31.

49.

Zhang H, Xia R, Meyers BC, Walbot V. Evolution, functions, and mysteries of plant ARGONAUTE proteins. Curr Opin Plant Biol. 2015;27:84–90.

50.

Yigit E, Batista PJ, Bei Y, Pang KM, Chen C-CG, Tolia NH, et al. Analysis of the C. elegans Argonaute Family Reveals that Distinct Argonautes Act Sequentially during RNAi. Cell. 2006;127:747–57.

51.

Sarkies P, Selkirk ME, Jones JT, Blok V, Boothby T, Goldstein B et al. Ancient and Novel Small RNA Pathways Compensate for the Loss of piRNAs in Multiple Independent Nematode Lineages. Hurst LD, editor. PLoS Biol. 2015;13:e1002061.

52.

Nowell RW, Wilson CG, Almeida P, Schiffer PH, Fontaneto D, Becks L, et al. Evolutionary dynamics of transposable elements in bdelloid rotifers. eLife. 2021;10:e63194.

53.

Fang W, Wang X, Bracht JR, Nowacki M, Landweber LF. Piwi-Interacting RNAs Protect DNA against Loss during Oxytricha Genome Rearrangement. Cell. 2012;151:1243–55.

54.

Dallaire A, Manley BF, Wilkens M, Bista I, Quan C, Evangelisti E, et al. Transcriptional activity and epigenetic regulation of transposable elements in the symbiotic fungus Rhizophagus irregularis. Genome Res. 2021;31:2290–302.

55.

Koopal B, Mutte SK, Swarts DC. A long look at short prokaryotic Argonautes. Trends Cell Biol. 2023;33:605–18.

56.

Ryazansky S, Kulbachinskiy A, Aravin AA. The Expanded Universe of Prokaryotic Argonaute Proteins. Papoutsakis ET, editor. mBio. 2018;9:e01935-18.

57.

Cornec A, Poirier EZ. Interplay between RNA interference and transposable elements in mammals. Front Immunol. 2023;14:1212086.

58.

Iyer LM, Zhang D, Rogozin IB, Aravind L. Evolution of the deaminase fold and multiple origins of eukaryotic editing and mutagenic nucleic acid deaminases from bacterial toxin systems. Nucleic Acids Res. 2011;39:9473–97.

59.

Andrisani O, Liu Q, Kehn P, Leitner WW, Moon K, Vazquez-Maldonado N, et al. Biological functions of DEAD/DEAH-box RNA helicases in health and disease. Nat Immunol. 2022;23:354–7.

60.

Zhu H, Zheng C. When PARPs Meet Antiviral Innate Immunity. Trends Microbiol. 2021;29:776–8.

61.

Gladyshev EA, Arkhipova IR. A widespread class of reverse transcriptase-related cellular genes. Proc Natl Acad Sci USA. 2011;108:20311–6.

62.

Arkhipova IR, Yushenova IA. To Be Mobile or Not: The Variety of Reverse Transcriptases and Their Recruitment by Host Genomes. Biochem Mosc. 2023;88:1754–62.

63.

Grunstein M. Histone acetylation in chromatin structure and transcription. Nature. 1997;389:349–52.

64.

Bernheim A, Cury J, Poirier EZ. The immune modules conserved across the tree of life: Towards a definition of ancestral immunity. PLoS Biol. 2024;22:e3002717.

65.

Van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024;42:243–6.

66.

Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004;337:635–45.

67.

Uversky VN. Intrinsically Disordered Proteins and Their Mysterious (Meta)Physics. Front Phys. 2019;7:10.

68.

Basile W, Salvatore M, Bassot C, Elofsson A. Why do eukaryotic proteins contain more intrinsically disordered regions? Wilke CO. editor PLoS Comput Biol. 2019;15:e1007186.

69.

Shinn MK, Cohan MC, Bullock JL, Ruff KM, Levin PA, Pappu RV. Connecting sequence features within the disordered C-terminal linker of Bacillus subtilis FtsZ to functions and bacterial cell division. Proc Natl Acad Sci USA. 2022;119:e2211178119.

70.

Hsu IS, Strome B, Lash E, Robbins N, Cowen LE, Moses AM. A functionally divergent intrinsically disordered region underlying the conservation of stochastic signaling. Fay JC, editor. PLoS Genet. 2021;17:e1009629.

71.

González-Foutel NS, Glavina J, Borcherds WM, Safranchik M, Barrera-Vilarmau S, Sagar A, et al. Conformational buffering underlies functional selection in intrinsically disordered protein regions. Nat Struct Mol Biol. 2022;29:781–90.

72.

Lotthammer JM, Ginell GM, Griffith D, Emenecker RJ, Holehouse AS. Direct prediction of intrinsically disordered protein conformational properties from sequence. Nat Methods. 2024;21:465–76.

73.

Bobadilla Ugarte P, Barendse P, Swarts DC. Argonaute proteins confer immunity in all domains of life. Curr Opin Microbiol. 2023;74:102313.

74.

Yurkovetskiy L, Guney MH, Kim K, Goh SL, McCauley S, Dauphin A, et al. Primate immunodeficiency virus proteins Vpx and Vpr counteract transcriptional repression of proviruses by the HUSH complex. Nat Microbiol. 2018;3:1354–61.

75.

Chougui G, Munir-Matloob S, Matkovic R, Martin MM, Morel M, Lahouassa H, et al. HIV-2/SIV viral protein X counteracts HUSH repressor complex. Nat Microbiol. 2018;3:891–7.

76.

Sironi M, Cagliani R, Forni D, Clerici M. Evolutionary insights into host-pathogen interactions from mammalian sequence data. Nat Rev Genet. 2015;16:224–36.

77.

Chuong EB, Elde NC, Feschotte C. Regulatory activities of transposable elements: from conflicts to benefits. Nat Rev Genet. 2017;18:71–86.

78.

Rowe HM, Jakobsson J, Mesnard D, Rougemont J, Reynard S, Aktas T, et al. KAP1 controls endogenous retroviruses in embryonic stem cells. Nature. 2010;463:237–40.

79.

Hoffmann A, Spengler D. Chromatin Remodeling Complex NuRD in Neurodevelopment and Neurodevelopmental Disorders. Front Genet. 2019;10:682.

80.

Lai AY, Wade PA. Cancer biology and NuRD: a multifaceted chromatin remodelling complex. Nat Rev Cancer. 2011;11:588–96.

81.

Salamun SG, Sitz J, De La Cruz-Herrera CF, Yockteng-Melgar J, Marcon E, Greenblatt J et al. The Epstein-Barr Virus BMRF1 Protein Activates Transcription and Inhibits the DNA Damage Response by Binding NuRD. Longnecker RM, editor. J Virol. 2019;93:e01070-19.

82.

Naik NG, Nguyen TH, Roberts L, Fischer LT, Glickman K, Golas G et al. Epigenetic factor siRNA screen during primary KSHV infection identifies novel host restriction factors for the lytic cycle of KSHV. Feng P, editor. PLoS Pathog. 2020;16:e1008268.

83.

Savaryn JP, Reitsma JM, Bigley TM, Halligan BD, Qian Z, Yu D, et al. Human Cytomegalovirus pUL29/28 and pUL38 Repression of p53-Regulated p21CIP1 and Caspase 1 Promoters during Infection. J Virol. 2013;87:2463–74.

84.

Polimanti R, Yang BZ, Zhao H, Gelernter J. Evidence of Polygenic Adaptation in the Systems Genetics of Anthropometric Traits. Caramelli D, editor. PLoS ONE. 2016;11:e0160654.

85.

Robinson MR, Hemani G, Medina-Gomez C, Mezzavilla M, Esko T, Shakhbazov K, et al. Population genetic differentiation of height and body mass index across Europe. Nat Genet. 2015;47:1357–62.

86.

Turchin MC, Chiang CW, Palmer CD, Sankararaman S, Reich D et al. Genetic Investigation of ANthropometric Traits (GIANT) Consortium,. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat Genet. 2012;44:1015–9.

87.

Guo J, Wu Y, Zhu Z, Zheng Z, Trzaskowski M, Zeng J, et al. Global genetic differentiation of complex traits shaped by natural selection in humans. Nat Commun. 2018;9:1865.

88.

Afanasyeva A, Bockwoldt M, Cooney CR, Heiland I, Gossmann TI. Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res. 2018;28:975–82.

89.

Cagliani R, Forni D, Mozzi A, Fuchs R, Tussia-Cohen D, Arrigoni F et al. Evolution of Virus-like Features and Intrinsically Disordered Regions in Retrotransposon-derived Mammalian Genes. Zhang Y, editor. Molecular Biology and Evolution. 2024;41:msae154.

90.

Hutvagner G, Simard MJ. Argonaute proteins: key players in RNA silencing. Nat Rev Mol Cell Biol. 2008;9:22–32.

91.

Pecori R, Giorgio SD, Lorenzo JP, Papavasiliou FN. Functions and consequences of AID/APOBEC-mediated DNA and RNA deamination. Nat reviewsGenetics. 2022;23:505–18.

92.

Pujantell M, Riveira-Muñoz E, Badia R, Castellví M, Garcia-Vidal E, Sirera G, et al. RNA editing by ADAR1 regulates innate and antiviral immune functions in primary macrophages. Sci Rep. 2017;7:13339.

93.

Prostova M, Kanevskaya A, Panteleev V, Lisitskaya L, Perfilova Tugaeva KV, Sluchanko NN, et al. DNA-targeting short Argonautes complex with effector proteins for collateral nuclease activity and bacterial population immunity. Nat Microbiol. 2024;9:1368–81.

94.

Imler J-L, Cai H, Meignin C, Martins N. Evolutionary immunology to explore original antiviral strategies. Phil Trans R Soc B. 2024;379:20230068.

95.

Aktaş T, Avşar Ilık İ, Maticzka D, Bhardwaj V, Pessoa Rodrigues C, Mittler G, et al. DHX9 suppresses RNA processing defects originating from the Alu invasion of the human genome. Nature. 2017;544:115–9.

96.

Wang X, Ramat A, Simonelig M, Liu M-F. Emerging roles and functional mechanisms of PIWI-interacting RNAs. Nat Rev Mol Cell Biol. 2023;24:123–41.

97.

Ilık İA, Glažar P, Tse K, Brändl B, Meierhofer D, Müller F-J, et al. Autonomous transposons tune their sequences to ensure somatic suppression. Nature. 2024;626:1116–24.

98.

Zhao S, Lu J, Pan B, Fan H, Byrum SD, Xu C, et al. TNRC18 engages H3K9me3 to mediate silencing of endogenous retrotransposons. Nature. 2023;623:633–42.

99.

Wernersson R. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003;31:3537–9.

100.

Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.

101.

Pond SLK, Posada D, Gravenor MB, Woelk CH, Frost SD. Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol. 2006;23:1891–901.

102.

Guindon S, Delsuc F, Dufayard JF, Gascuel O. Estimating maximum likelihood phylogenies with PhyML. Methods in molecular biology. (Clifton NJ). 2009;537:113–37.

103.

Anisimova M, Bielawski JP, Yang Z. Accuracy and Power of Bayes Prediction of Amino Acid Sites Under Positive Selection. Mol Biol Evol. 2002;19:950–8.

104.

Murrell B, Moola S, Mabona A, Weighill T, Sheward D, Pond SLK, et al. FUBAR: a fast, unconstrained bayesian approximation for inferring selection. Mol Biol Evol. 2013;30:1196–205.

105.

Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Pond SLK. Detecting individual sites subject to episodic diversifying selection. PLoS Genet. 2012;8:e1002764.

106.

Kosakovsky Pond SL, Frost SDW. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol. 2005;22:1208–22.

107.

Smith MD, Wertheim JO, Weaver S, Murrell B, Scheffler K, Pond SLK. Less is more: an adaptive branch-site random effects model for efficient detection of episodic diversifying selection. Mol Biol Evol. 2015;32:1342–53.

108.

Pond SLK, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21:676–9.

109.

Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J. 2021;120:4312–9.

110.

Emenecker RJ, Griffith D, Holehouse AS. Metapredict V2: An update to metapredict, a fast, accurate, and easy-to-use predictor of consensus disorder and structure [Internet]. 2022 [cited 2024 May 16]. Available from: http://biorxiv.org/lookup/doi/10.1101/2022.06.06.494887

111.

Flory PJ, Volkenstein M. Statistical mechanics of chain molecules. Biopolymers. 1969;8:699–700.

112.

Tesei G, Lindorff-Larsen K. Improved predictions of phase behaviour of intrinsically disordered proteins by tuning the interaction range. Open Res Europe. 2023;2:94.

113.

Tesei G, Schulze TK, Crehuet R, Lindorff-Larsen K. Accurate model of liquid–liquid phase behavior of intrinsically disordered proteins from optimization of single-chain properties. Proc Natl Acad Sci USA. 2021;118:e2111696118.

114.

Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367:eaay5012.

115.

Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinf (Oxford England). 2011;27:2987–93.

116.

Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. 10.1093/gigascience/giab008.

117.

Korneliussen TS, Albrechtsen A, Nielsen R. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics. 2014;15:356.

118.

Wickham H. ggplot2: elegant graphics for data analysis. springer; 2016.

119.

Eddy SR. Accelerated Profile HMM Searches. Pearson WR, editor. PLoS Comput Biol. 2011;7:e1002195.

120.

Feron R, Waterhouse RM. Assessing species coverage and assembly quality of rapidly accumulating sequenced genomes. GigaScience. 2022;11:giac006.

121.

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.

122.

Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, Radhakrishnan M, et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024;52:D368–75.

123.

Zhang Y. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–9.

124.

Gilchrist CLM, Mirdita M, Steinegger M. Multiple Protein Structure Alignment at Scale with FoldMason [Internet]. 2024 [cited 2024 Dec 11]. Available from: http://biorxiv.org/lookup/doi/10.1101/2024.08.01.606130

125.

Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Von Haeseler A et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Molecular Biology and Evolution. 2020;37:1530–4.

126.

Pearson WR. Empirical statistical estimates for sequence similarity searches. J Mol Biol. 1998;276:71–84.

127.

Foley NM, Mason VC, Harris AJ, Bredemeyer KR, Damas J, Lewin HA, et al. A genomic timescale for placental mammal evolution. Science. 2023;380:eabl8189.

128.

Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in. Bioinf (Oxford England). 2019;35:526–8.

129.

Blum M, Andreeva A, Florentino LC, Chuguransky SR, Grego T, Hobbs E et al. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 2024;gkae1082.

130.

Li Z, Li Z, Zhang Y, Zhou L, Xu Q, Li L, et al. Mammalian PIWI–piRNA–target complexes reveal features for broad and efficient target silencing. Nat Struct Mol Biol. 2024;31:1222–31.

Yes