Advancing Chlamydia trachomatis genomic surveillance and research with a novel core-genome MLST (cgMLST) approach

ZohraLodhia1

VerónicaMixão2

JoanaIsidro2

RitaFerreira2

DoraCordeiro1

CristinaCorreia1

InêsJoão1

JoãoPauloGomes2,3

MariaJoséBorrego

PhD

1,4✉Phone+351217519241Emailm.jose.borrego@insa.min-saude.pt

VítorBorges

PhD

2,5✉Phone+351217519227Emailvitor.borges@insa.min-saude.pt

1National Reference Laboratory (NRL) for Sexually Transmitted Infections (STI), Department of Infectious DiseasesNational Institute of Health Doutor Ricardo Jorge (INSA)LisbonPortugal

2Genomics and Bioinformatics Unit, Department of Infectious DiseasesNational Institute of Health Doutor Ricardo Jorge (INSA)LisbonPortugal

3Veterinary and Animal Research Center (CECAV), Faculty of Veterinary MedicineLusófona UniversityLisbonPortugal

4National Reference Laboratory (NRL) for Sexually Transmitted Infections (STI)National Institute of Health, Instituto Nacional de Saúde Doutor Ricardo Jorge, INSAAvenida Padre Cruz1649-016LisbonIPPortugal

5Genomics and Bioinformatics UnitNational Institute of Health, Instituto Nacional de Saúde Doutor Ricardo Jorge, INSA, IP)Avenida Padre Cruz1649-016LisbonPortugal

Zohra Lodhia^1,*, Verónica Mixão^2,*, Joana Isidro², Rita Ferreira², Dora Cordeiro¹, Cristina Correia¹, Inês João¹, João Paulo Gomes^2,3, Maria José Borrego^1,#, Vítor Borges^2,#

¹ National Reference Laboratory (NRL) for Sexually Transmitted Infections (STI), Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal

² Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal

³ Veterinary and Animal Research Center (CECAV), Faculty of Veterinary Medicine, Lusófona University, Lisbon, Portugal

^#Corresponding authors (joint last authors): Maria José Borrego, PhD, National Reference Laboratory (NRL) for Sexually Transmitted Infections (STI), National Institute of Health (Instituto Nacional de Saúde Doutor Ricardo Jorge, INSA, IP), Avenida Padre Cruz, 1649-016, Lisbon, Portugal. Phone number: +351217519241. E-mail: m.jose.borrego@insa.min-saude.pt; Vítor Borges, PhD, Genomics and Bioinformatics Unit, National Institute of Health (Instituto Nacional de Saúde Doutor Ricardo Jorge, INSA, IP), Avenida Padre Cruz, 1649-016, Lisbon, Portugal. Phone number: +351217519227. E-mail: vitor.borges@insa.min-saude.pt

Zohra Lodhia and Verónica Mixão contributed equally to this work.

Abstract

Chlamydia trachomatis is the most common sexually transmitted bacterial infection, with an estimated 129 million new cases annually. Its classification traditionally relies on ompA-genotyping, but whole-genome sequencing (WGS) offers transformative resolution to study evolution, transmission dynamics and epidemiological patterns. Yet, WGS-based surveillance of C. trachomatis remains very limited by technical challenges and the lack of standardized typing frameworks. Core-genome multilocus sequence typing (cgMLST) is a scalable and portable approach widely applied to bacterial pathogens, but remains little explored for C. trachomatis. In this context, we compiled and curated the largest C. trachomatis genome dataset to date (1230 samples from 26 countries), including publicly available and newly generated assemblies, to develop a novel cgMLST schema optimized for standardized local deployment. Fueled by existing (like ReporTree) and newly developed bioinformatic resources, the extensive cgMLST analyses performed in this study allowed an in-depth and unprecedented exploration of C. trachomatis global phylogenomic diversity and recombination-driven evolution. Indeed, the novel cgMLST schema (n = 846 loci) robustly recapitulated the four major evolutionary lineages of C. trachomatis and showed high congruence with core-SNP approaches, while providing high resolution to resolve intra-lineage genogroup diversity and detect recombination mosaicisms. Also, it efficiently captured the clonal expansion of epidemiologically relevant strains, including the lymphogranuloma venereum (LGV) epidemic “L2b” and the emergent L4 strains, further consolidating its robustness for contemporary transmission and outbreak monitoring. By enabling a rapid link between loci/alleles and specific phylogenomic/phenotypic traits, the novel cgMLST approach not only elucidated C. trachomatis genome-wide recombination landscape (e.g., through straightforward detection of major genotype-lineage incongruences), but also identified lineage-specific alleles (and disrupted loci) with potential diagnostic and/or functional relevance. Finally, to further advance C. trachomatis genomic surveillance and research, this novel schema is released (https://doi.org/10.5281/zenodo.17177579) accompanied by a hierarchical cgMLST-based nomenclature that supports harmonized genogroup tracking across laboratories and countries. In summary, this work delivers both an expanded global C. trachomatis genomic resource and a robust cgMLST framework, with immediate utility for research and standardized, high-resolution genome-scale routine surveillance.

Keywords:

Chlamydia trachomatis

Whole-Genome Sequencing

cgMLST

genomic surveillance

Introduction

Chlamydia trachomatis is an obligate intracellular bacterium and causes the most common bacterial sexually transmitted infection (STI) worldwide, with 129 million new cases estimated annually, according to the World Health Organization (WHO) ¹. The classification of C. trachomatis strains has historically relied on ompA-genotyping, a method based on sequencing the gene encoding the major outer membrane protein (MOMP), which is highly variable and immunodominant ². Fifteen major ompA-genotypes are defined: A, B/Ba, C, D/Da, E, F, G, H, I/Ia, J/Ja, K, L1, L2, L2b and L3 ³, which correlate with three main clinical disease groups. Specifically, genotypes A-C are associated with ocular infections and trachoma, constituting the leading cause of preventable blindness in the world ^4,5. Genotypes D-K are associated with urogenital infections, the most frequent manifestation of C. trachomatis, which are often asymptomatic, particularly in women, contributing to underdiagnosis and increased risk of pelvic inflammatory disease (PID), ectopic pregnancy, and/or infertility ^5,6. Finally, ompA-genotypes L1-L3 are associated with lymphogranuloma venereum (LGV), characterised by its invasive nature and severe clinical manifestations such as inguinal lymphadenopathy and proctitis, often observed in high-risk populations such as men who have sex with men (MSM) ^5,7–9.

In 2003, an outbreak of L2b C. trachomatis strains, first detected in Europe, marked the re-emergence of LGV as a significant public health concern, prompting renewed attention to its surveillance and diagnosis strategies ^13–15. Consequently, rapid PCR assays targeting potential LGV-specific molecular markers, such as the 36 bp deletion in the pmpH gene, were introduced and widely adopted for diagnostic and epidemiological purposes ^16–19. While these assays improved routine detection of LGV in clinical settings, their inability to capture circulating diversity, for example, to identify novel, hybrid, or recombinant strains, has raised concerns about potential misclassifications and limited epidemiological insight ^10–12. Other complementary methods, such as multi-locus sequence typing (MLST), have been proposed to improve strain discrimination beyond rapid LGV detection and ompA-genotyping ^9,20,21. However, their uptake in routine diagnostic settings remains reduced ²². In fact, most laboratories do not even perform ompA-genotyping, and practical constraints associated with MLST have limited its widespread adoption. Indeed, MLST requires samples with a relatively high bacterial load and provides only modest resolution, limiting its effectiveness for epidemiological surveillance, outbreak detection, and transmission tracking ^20,23.

In this context, whole-genome sequencing (WGS) has emerged as the decisive tool for studying C. trachomatis evolution and transmission. With its superior resolution, WGS can identify and track new clones, reveal recombination events, and uncover microevolutionary changes with significant clinical and epidemiological relevance, all at a finer scale than traditional typing methods. Foundational WGS studies ^7,24,25 revealed that C. trachomatis genome is characterized by a high protein-coding density, few pseudogenes and few non-essential genes (in contrast to other obligate intracellular pathogens, such as Rickettsia prowazekii) ^26,27. Also, it presents a nearly overlapping core-genome [i.e., genes shared by (virtually) all strains of a given species] and pan-genome [i.e., all genes described in a given species], with a single non-chromosomal element, a ~ 7.5 Kbp plasmid that co-evolved with the chromosome ^5,28,29 and is known to play a key functional role in regulating chromosomal genes ^30–38. Subsequent landmark WGS studies by Harris et al. (2012) ⁵ and Hadfield et al. (2017) ²⁹ unveiled the presence of four deep-branching, evolutionarily distinct lineages, which broadly correlate with the three historical C. trachomatis disease groups (LGV, ocular and urogenital), namely: the LGV lineage, the “T1 urogenital” lineage (essentially composed of prevalent genotypes E and F), the “T2 urogenital” lineage (mainly enrolling non-E/F genital genotypes) and the “T2 ocular” lineage. Following this ancient diversification, C. trachomatis genomes have been extensively reshaped by frequent homologous recombination and marked by recent clonal expansions ^5,29,39. Comparative genomic analyses have further demonstrated that recombination occurs both intragenically and across dispersed genomic regions, involving strains with similar or distinct tissue tropisms, and with no absolute barriers to genetic exchange ^5,29,39–42. Opportunities for recombination appear higher among strains infecting the same tissue niche, with higher gene flow observed within urogenital lineages compared to LGV or ocular lineages ⁴⁰. Not unexpectedly, WGS has enabled the identification of epidemiologically and clinically relevant clones that challenged conventional molecular typing strategies, such as the L2b epidemic strain associated with a proctitis outbreak among MSM, including the recombinant L2b/D-Da strains ^{29,39,43–45}. More recently, genomic analyses have described the emergence of a novel LGV strain, designated L4 due to its ompA signature distinct from the classical L1-L3 genotypes (45). First identified in South America ^46,47, L4 has shown signs of potential increasing international circulation ⁴⁸. These findings underscore the added value of implementing high-resolution WGS-based surveillance to adequately and timely capture the evolving landscape of C. trachomatis, including the emergence of sublineages with potential clinical and epidemiological relevance. Still, routine implementation of WGS in C. trachomatis surveillance remains very limited, primarily due to technical and operational constraints associated with its obligate intracellular nature. In order to overcome these limitations, advances including targeted enrichment approaches such as RNA- or DNA-based hybrid capture have enabled successful WGS directly from clinical samples with sufficient bacterial load allowing for culture-independent, high-resolution genomic surveillance ^29,49–51. However, these protocols are often costly, labor-intensive, and not widely standardized across laboratories. Moreover, the lack of harmonized analytical pipelines and curated genomic typing schemas for C. trachomatis, like those applied for other human bacterial pathogens, such as Listeria monocytogenes or Salmonella enterica, limits the practical integration of WGS into public health systems ^52,53. For instance, core-genome MLST (cgMLST) schemas are well-established for outbreak investigation and surveillance of many bacterial pathogens, but very little explored for C. trachomatis. Although a cgMLST schema has been developed for C. trachomatis in 2018 ⁵⁴, being publicly available in the PubMLST platform ⁵⁵, to our knowledge, its application has been limited to the original study. Its design for online usage, together with the absence of benchmark studies with state-of-the-art allele callers for local deployment, like the widely used chewBBACA ⁵⁶, may justify its lower adoption.

Recognizing the value of cgMLST for advancing C. trachomatis molecular surveillance, this study introduces a novel cgMLST schema tailored for local deployment, providing a standardized and scalable framework for comparing genome data across laboratories and studies. By analysing over 1200 genome sequences, this work provides new insights into the global phylogenetic structure and lineage diversity of C. trachomatis, representing one of the most extensive genomics efforts to date. Beyond public health applications, the generated cgMLST resources, like the in-depth allelic mapping of current C. trachomatis global diversity, might be of great potential to accelerate research, thereby bridging surveillance and scientific investigation.

Material and Methods

C. trachomatis samples collection and selection for WGS

The National Reference Laboratory (NRL) for STIs, at the National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal, maintains a vast collection of C. trachomatis-positive clinical specimens collected through routine diagnostics or voluntarily sent by collaborating laboratories across the country since 1990. By the end of 2021, a total of 5824 samples had been successfully ompA-genotyped, as described in Lodhia et al., 2025 ⁵⁷. Among these, 1188 samples were further subtyped for LGV-specific ompA subvariants ⁴⁸, and 502 samples were subjected to multiplex sequencing to survey genetic markers of antimicrobial resistance (AMR) and infer strain-specific genomic backbones ⁵⁸. To select samples for WGS, a dataset of 850 DNA C. trachomatis-positive samples was assembled (Supplementary file S1). These samples had been processed using two distinct approaches: 15 DNA samples derived from clinical specimens that were cultured in McCoy cells (standard conditions, as previously described ⁵⁹, and 835 DNA extracts obtained directly from C. trachomatis-positive clinical specimens that were further sub-selected for targeted enrichment WGS protocol. The genomic DNA of the 15 cultured samples was obtained using a scale‑down version of the urografin‑based purification protocol and subjected to standard WGS, as described by Pereira et al. ⁶⁰. For the 835 C.trachomatis-positive clinical DNA samples, bacterial load was quantified using an absolute quantification real-time PCR (qPCR) assay targeting the number of C. trachomatis copies, as previously described by Gomes et al., 2006 ⁶¹. This method enabled absolute quantification of bacterial genome copies per µL of DNA. Based on our previous experiences with SureSelect targeted enrichment approach ^29,50, indicating a positive correlation between higher bacterial load and successful genome recovery, samples with ≥ 10⁴ genome copies in the input volume (7 µL) were prioritized for sequencing. A total of 246 samples were evaluated for dsDNA quantification (using Qubit, ThermoFisher Scientific). A final subset of 125 samples, passing both the bacterial load (≥ 10⁴ genome copies in 7 µL) and DNA quantity (> 10ng in 7 µL) minimum thresholds defined, was selected for targeted WGS. Additionally, 23 samples not meeting the ≥ 10⁴ genome copies threshold were also included, given their epidemiological/clinical interest and acceptable DNA quantity to proceed with the protocol. Full details on bacterial load and DNA quantification are provided

in Supplementary file S1.

Targeted C. trachomatis whole-genome capture and sequencing directly from clinical samples

Whole-genome sequencing was performed using a targeted enrichment approach based on the SureSelect XT HS technology (G9702-90000, version F0, September 2022, Agilent Technologies, Santa Clara, CA, United States), optimized for C. trachomatis, employing custom-designed RNA oligonucleotide baits targeting the C. trachomatis genome, as previously described by Borges et al., 2019 ⁴³, using a 1:5 bait dilution. Library preparation began with the fragmentation of high-quality DNA using the Agilent’s SureSelectXT HS Low input Enzymatic Fragmentation kit and was further carried out following the manufacturer’s instructions. Libraries fragments size, concentration and molarity were determined on a Fragment Analyzer system (Agilent, Santa Clara, CA, United States). Libraries were then sequenced on Illumina MiSeq or NextSeq platforms (Illumina, San Diego, CA, United States), applying paired-end multiplexed sequencing. Descriptive and laboratory data, including sequencing metrics, are detailed in Supplementary file S1.

Quality control and de novo genome assembly

The analysis of the WGS data of C. trachomatis relied on an assembly-based approach targeting C. trachomatis sequencing reads. First, Quality Control (QC) and trimming of sequencing reads was performed with INNUca pipeline v4.2.2 (https://github.com/B-UMMI/INNUca), using FastQC v0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Trimmomatic v0.38 ⁶². Kraken v2.1.3 ⁶³ was used to classify trimmed reads using Kraken2 standard 16Gb database downloaded on September 4th, 2024. Trimmed reads classified by Kraken2 as Chlamydiales (taxon ID: 204428) were then extracted using the extract_kraken_reads.py script of Kraken Tools v1.2 ⁶⁴ and using the “--include-children” argument. Samples without sequencing reads classified as belonging to this taxon ID or respective “children” were discarded from the analysis. For the remaining, the filtered sequencing reads classified as belonging to this taxon ID or respective “children” were assembled with INNUca pipeline v4.2.2 (https://github.com/B-UMMI/INNUca), using SPAdes v3.14 ⁶⁵ and Pilon v1.23 ⁶⁶. Only assemblies with at least 90% and no more than 110% of the expected genome size (compared with the reference genome NC_000117.1: 1 042 519 bp) were used for further analysis. At a later stage, the number of loci called in the novel cgMLST schema (see subsection “Creation of a novel cgMLST schema for C. trachomatis”) was also used as a proxy for assembly quality, and all isolates with less than 95% loci called were excluded.

Compilation of a public dataset for cgMLST schema creation

In order to create a novel cgMLST schema for C. trachomatis and explore the global species diversity, a dataset of publicly available WGS data was compiled. First, 53 complete chromosomal sequences representing the species diversity ^5,67, as well as all 1319 C. trachomatis genome assemblies available through the AllTheBacteria project ⁶⁸ on January 31th, 2025 (https://osf.io/xv7q9/; https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/) and WGS data of 14 C. trachomatis isolates previously generated in our laboratory ^43,59,69 were gathered. Additionally, WGS data of the samples used in two important genomic studies in the field, namely Hadfield et al. 2017 and Olagoke et al. 2025 ^29,70 were de novo assembled, following the methodology described in the “Quality control and de novo genome assembly” section. Isolates classified as mutants or with evidence of possible genetic manipulation according to the available information at the European Nucleotide Archive (ENA) were excluded. The assembly QC criteria described in the “Quality control and de novo genome assembly” section (i.e., 90–110% expected genome size and 95% loci called) were also applied to all the genomes of public samples. Of note, when the same sample (i.e., isolates sharing the same ENA run accession number) was present in the AllTheBacteria ⁶⁸ and in Hadfield et al. 2017 or/and in Olagoke et al. 2025 datasets ^29,70, preference was given to the AllTheBacteria genome assembly, except if it did not pass the QC criteria. In this case, only the genome assembly passing the QC was kept. Details on the final dataset are provided in Supplementary file S2.

Creation of a novel cgMLST schema for C. trachomatis

The C. trachomatis cgMLST schema was constructed using the chewBBACA suite v3.3.10 ⁵⁶ available at https://github.com/B-UMMI/chewBBACA. Independent schemas were created for the chromosome and plasmid sequences using the CreateSchema module and later merged following curation. CDS prediction for the chromosomal dataset was performed using default parameters with a Prodigal training file derived from the reference genome C. trachomatis D/UW-3 (RefSeq accession NC_000117.1) ²⁴. For the plasmid dataset, the schema was generated using default settings, with exception of --prodigal-mode (meta) to better fit to plasmids. The input dataset included 53 complete chromosomal sequences and 42 complete plasmid sequences (listed in Supplementary file S2) representing the species diversity ^5,67,71. Manual curation of loci was then conducted using the SchemaEvaluator module ⁵⁶, combined with prior in-depth gene-specific knowledge, which includes assessments of allelic diversity, pseudogene detection, and lineage-specific gene disruptions, etc. ^71–74. A few loci of interest were deliberately retained in the cgMLST schema despite their low allele calling proportion, in order to enrich downstream evolutionary and functional application of the schema, beyond basic clustering. For instance, these included the highly polymorphic CT_456 (TARP) gene, which chewBBACA initially predicted as three separate CDSs but that was manually merged due to its known biological significance, or loci that are frequently pseudogenized or disrupted in lineage- or tropism-specific contexts ^73,74. Following manual inspection, the curated chromosomal and plasmid schemas were merged and loci were renamed using historical and community-recognized CDS identifiers: chromosomal genes followed the RefSeq D/UW-3 annotation ²⁴, while plasmid ORFs followed the established ORF and pgp nomenclature ^31,71,75. To populate the schema, an initial allele calling run was performed using all available assemblies, as described above. Assemblies with < 95% of loci called were excluded from the final dataset and schema population step. The final version of the schema was populated iteratively using the curated dataset (n = 1230) (Supplementary file S2) until no new alleles were detected (3 rounds). Further cgMLST analysis were performed using the final allele matrix for the 1230 samples (Supplementary file S3) or the filtered matrix only with the 53 complete above-mentioned assemblies, depending on the objective of the evaluation.

In silico typing

All genome assemblies (n = 1230) were subjected to MLST, ompA-genotyping and screening for the presence of a potentially LGV-specific 74 bp insertion in a region upstream of CT_105 ^58,76. Sequence type (ST) determination was performed with mlst v2.23.0 (https://github.com/tseemann/mlst) based on Chlamydiales MLST schema ⁷⁷. For ompA-genotyping, a rapid BLAST-based classification was initially performed on the alleles defined by cgMLST (a total of 116 alleles identified across the 1230 assemblies), using ABRicate v1.0.1 (https://github.com/tseemann/ABRIcate) embedded within ReporType v1.0.0 (https://github.com/insapathogenomics/ReporType) ⁷⁸, against a custom database containing representative partial sequences of main ompA-genotypes (https://github.com/insapathogenomics/ReporType/blob/main/databases/c_trachomatis.fasta). For the few samples with unassigned ompA alleles by cgMLST (n = 10), genotype inference with ReporType was performed directly on the assemblies. Additionally, the non-redundant ompA sequences were aligned using MAFFT v7.525 ⁷⁹, and a neighbor-joining phylogenetic tree was constructed in MEGA v.7.0.21 (https://www.megasoftware.net/) with 1000 bootstrap replicates and evolutionary distances calculated using the Kimura two-parameter method, as previously described ⁸⁰. ompA sequences belonging to LGV genotypes were also subjected to “subvariant” classification, as previously described ⁴⁸. Finally, for screening the presence of a potentially LGV-specific 74 bp insertion in a region upstream of CT_105 ^58,76, ReportType v1.0.0 was run over all genome assemblies using a custom database including two sequences fragments compatible with the presence/absense of the insertion (https://github.com/insapathogenomics/ReporType/blob/main/databases_extra/CtrachomatisLGVinsertionUpCT105.fasta). For two samples (ERR164836 and ERR140781) in which no hits were found due to contig fragmentation, the presence or absence of the insertion was verified by inspecting in IGV ⁸¹ the BAM files after mapping against the L2b/UCH-1/proctitis reference sequences (AM884177.2 and AM886279.1), as described below.

Cluster detection and characterization

Genetic clustering using the novel cgMLST schema was obtained for all possible threshold levels with ReporTree v2.5.4 ⁸², providing the desired allele matrix and metadata as input, and using the single-linkage or the GrapeTree ⁸³ clustering method, depending on the analysis. Samples with less than 95% of the cgMLST loci called were excluded with the ‘--loci-called’ argument. The arguments ‘--columns_summary_report’ and ‘--metadata2report’ were used to characterize the different genetic clusters or metadata groups, respectively, with the available metadata (Supplementary file S2). ReporTree was also used to obtain count and frequency tables with the distribution of a given metadata variable of interest over another one (selected compilation in Supplementary file S4, and full compilation in Zenodo repository https://doi.org/10.5281/zenodo.17177579).

Genogroup identification and nomenclature

In order to establish a hierarchical nomenclature that allows monitoring the main circulating clusters, a first threshold grouping the whole dataset into four clusters, corresponding to the four expected lineages ^5,29, was determined based on the results obtained for the hierarchical clustering of the whole dataset (Supplementary file S5). Then, to define thresholds with higher discriminatory power, cluster stability regions, i.e. threshold ranges providing similar clustering, were determined with comparing_partitions_v2.py (https://github.com/insapathogenomics/ComparingPartitions) ⁸², following a previously described approach ^84–87. Stability regions were defined as at least 5 subsequent thresholds yielding an Adjusted Wallace Coefficient (AWC) ≥ 0.98 (Supplementary file S6). Two stability phases at high resolution were selected to cluster isolates into high- and low-level WGS-based genogroups. An additional ReporTree run was then performed to classify each isolate into a specific genogroup by requesting an hierarchical nomenclature code attributed based on the hierarchical clustering results at the thresholds corresponding to lineage, high- and low-level genogroup classification, using a threshold that fell between the edges of the respective stability region. The final code has the structure LIN-HG-LG, where “LIN” is a lineage code (LGV for LGV lineage, T2O for T2 ocular, T2G for T2 urogenital and T1G for T1 urogenital), and “HG” and “LG” correspond to the high- and low-level genogroup classification, respectively. The genogroup codes correspond to the cluster or singleton nomenclature attributed by ReporTree ⁸²: “Cn” when the isolate belongs to a cluster and “Sn” when the isolate is a singleton, with “n” corresponding to the respective cluster/singleton number.

Congruence analysis

The assessment of cluster congruence between two different typing approaches (e.g., cgMLST-based and SNP-based clustering analysis) was performed using a previously developed methodology ^84,85,87 with EvalTree v1.0.0 (https://github.com/insapathogenomics/ACDTree/tree/main/EvalTree), providing the ReporTree output folder of each method (obtained as described in the “Cluster detection and characterization” section) as input. Briefly, EvalTree compares the genetic clusters at each possible threshold levels in the two methods (available in ReporTree “partitions” table) and calculates a congruence score (CS) between them based on the Adjusted Rand and Adjusted Wallace coefficients ^84,85,87. Inter-pipeline corresponding points, i.e. the allele/SNP threshold in one pipeline that provided the most similar clustering results in the other, were assessed only considering comparisons rendering CS ≥ 2.85.

Comparison with C. trachomatis PubMLST cgMLST schema

To compare the resolution and clustering congruence between the novel cgMLST and the cgMLST schema available in PubMLST (https://pubmlst.org/) ^54,55, which was designed for online usage within this platform, each of the 817 loci of the latter was downloaded on August 6th, 2025. For its local deployment with chewBBACA allele caller ⁵⁶, the schema was prepared with chewBBACA v3.10 PrepExternalSchema module using default settings and providing the same Prodigal training file as used for the novel schema creation (see section “Creation of a novel cgMLST schema for C. trachomatis”). Afterwards, allele calling with chewBBACA v3.10 AlleleCall module was iteratively performed using the curated dataset (n = 1230) (Supplementary file S2) until no new alleles were detected (5 rounds). The final allele matrix was used for cluster analysis, as described in the “Cluster detection and characterization” section.

Comparison with core-genome SNP-based analysis

To compare the resolution and clustering congruence of the cgMLST approach with traditionally applied SNP-based methods, three comparative exercises were conducted. The first analysis involved comparison with the 53 complete C. trachomatis chromosome sequences used for the initial construction of the cgMLST schema. For this, the full chromosome sequences were aligned using the progressive alignment algorithm implemented in Mauve v2.3.1 ⁸⁸. A core alignment was then extracted by retaining only regions where all chromosome sequences aligned over at least 500 bp. The resulting core genome alignment was manually inspected and corrected, yielding 1 007 702 nucleotide positions (~ 97% of the chromosome), including 16 532 single-nucleotide variant (SNV) sites.

The second analysis included the entire dataset comprising 1230 genome assemblies that passed QC criteria (see Supplementary file S2). These assemblies were aligned using Parsnp v2.0.5 ⁸⁹, with default parameters, except for the –C parameter, which was set to 2000 to maximize reference coverage. The resulting core SNV alignment used in this evaluation contained 20 941 variant sites.

Finally, targeted analyses were conducted using a reference-based read mapping and SNP calling approach for two LGV clusters of epidemiological interest: the epidemic L2b strain ^5,7,29,69 and the emergent L4 strain ⁴⁶. For this, QC-filtered reads from samples identified by cgMLST within each cluster (L2b or L4) (Supplementary file S2) were mapped using Snippy v4.6.0 (https://github.com/tseemann/snippy; --mapqual 30 --mincov 5 --minfrac 0.9 --basequal 20) against representative genome sequences for each strain: the complete genome of L2b/UCH-1/proctitis (AM884177.2/AM886279.1) ⁷, and the assembly of an L4 isolate from 2018 (the earliest year with an L4 genome analyzed in the present study) selected based on best contig metrics (number and N50) among those with available reads (ENA accession: ERR13385367), respectively. Core genome SNV alignments were then extracted from the Snippy-derived whole alignment using the “alignment processing module” of ReporTree v2.5.4 ⁸², with “--site-inclusion” 0.9 (i.e., only keeping variant positions with < 10% missing data). For the L2b cluster, all recombination-driven SNVs within the previously identified ompA containing recombinant region (NC_010280; approximate positions 55221–59461) of the recombinant L2b/D-Da strain ^43,45 were additionally excluded. The final core SNV alignments, comprising 156 SNVs for L2b and 16 SNVs for L4, were analyzed using single-linkage hierarchical clustering in ReporTree v2.5.4 (65) and maximum-likelihood phylogenetic reconstruction with IQ-TREE v2.1.4 ⁹⁰, employing the automatically detected best-fit model (K3P + ASC) and 1000 bootstrap replicates.

Identification of exclusive and shared alleles across groups of interest

In order to explore allelic variation across categories of isolates, a python-based script (“allelic_heatmap.py”, available at https://github.com/insapathogenomics/allelic_diversity) that generates an interactive, tree-ordered allelic heatmap was developed. The script takes as input the cgMLST allelic matrix, the derived single-linkage hierarchical clustering dendrogram (tree in Newick format) and a sample metadata file. The category of interest to assess allelic distribution (lineage) was defined by a user-specified metadata column. In the output heatmap, each allele is colored according to the category (lineage) in which it is most prevalent. Loci that are identified as fully or nearly conserved (defined by any allele that is dominant in at least 3 lineages; "ncat_dominant 3") across all samples are displayed in black, and loci not called (denoted by allele “0”) in specific samples are displayed in white. Moreover, samples are ordered according to the input tree, facilitating visual comparison with genetic relationships. The interactive HTML version, allowing hover-based inspection and export as PNG/SVG, is provided as Supplementary file S7. Summaries of lineage-exclusive and shared alleles as well as an annotated allelic matrix are compiled in Supplementary file S8.

Additionally, for the identification of alleles that are exclusive or shared across certain metadata-defined groups (e.g., lineage, ompA-genotype, etc.), another python-based script (“allelic_distribution.py”, available at https://github.com/insapathogenomics/allelic_diversity) was developed. Given a metadata table (or clustering table) and an allele matrix, it generates a summary report with the proportion of isolates of a given group that harbors each of the alleles present in the allele matrix. The provided report was then used to determine not only the alleles present in (or specific of) a certain group, but also the allele distribution in the species population. This script was run using the allele matrix obtained with the novel cgMLST schema for the curated dataset (n = 1230) (Supplementary file S2), and requesting the allele distribution across lineage, ompA-genotype, source and other available metadata. The obtained summary reports are presented in Supplementary file S9.

Together, these cgMLST resources provide both a global visualization and quantitative summaries of category-specific, shared and conserved alleles, enabling comparative analyses of allelic variation, recombination events, and category-specific markers.

Results

Compilation of a WGS dataset with the global C. trachomatis genomic diversity

With the aim of creating a novel cgMLST schema for C. trachomatis and explore the global species diversity, a large and diverse C. trachomatis genome dataset with 1173 assemblies (publicly available or newly generated) was compiled and curated (Supplementary file S2, details in Material and Methods). Given the broad lack of investment in WGS of this prevalent human pathogen (e.g., only 83 of 1173 genomes had available collection year of 2015 onwards), this study further sought to advance national and international genomic surveillance by sequencing more clinical isolates from the Portuguese NRL collection. For this, 850 C. trachomatis-positive DNA samples were evaluated for eligibility for targeted-enrichment WGS by estimating bacterial genome copies via qPCR and quantifying dsDNA in the required input volume (Supplementary file S1). In total, 125 samples met both selection criteria (≥ 10⁴ genome copies and > 10 ng DNA in 7 µL), being considered as eligible, while a small subset (n = 23) not fulfilling the copy number threshold was still selected for targeted-enrichment WGS (Supplementary figure S1A). At the end, 45 (30.41%) of the samples sequenced with a “targeted-enrichement” approach yielded complete genome assemblies (Supplemental table S1), with samples with higher bacterial loads generally producing a greater proportion of “on-target” reads (Supplementary figure S1B). Indeed, although the genome assembly success was 34.4% (43/125) for samples fulfilling the adopted threshold of ≥ 10⁴ copies in WGS input, this value increases, for example, to 64.7% (22/34) when considering samples with ≥ 10⁵ copies in WGS input. This pattern was reflected in the proportion of “on-target” reads, which had a median percentage of 48.84% (IQR: 39.60–70.17) among samples yielding complete genomes (n = 45), compared with 5.16% (IQR: 1.43–12.43) observed in samples that failed to meet the criteria for inclusion in the final genome dataset (n = 103) (Supplementary figure S1B). Besides the 45 genomes recovered directly from DNA samples, complete genome assemblies were also obtained for 12 cultured clinical isolates (Supplementary file S1), which were included in the final curated dataset (Supplementary file S2). Overall, this effort led to the inclusion of 57 new C. trachomatis genomes from Portugal, increasing by 17% the proportion of European isolates in the dataset and by 56% the amount of anorectal samples. The final dataset comprises 1230 C. trachomatis genome assemblies, spanning isolates collected between 1957 and 2023, from different clinical origins, and representing at least 26 countries (Supplementary file S2).

Development of a novel cgMLST schema for large-scale genomic analysis and monitoring of C. trachomatis diversity and evolution

To leverage a high resolution WGS-based typing method that can be easily deployed in local settings for routine genomic monitoring of C. trachomatis diversity and evolution, in this study, a novel cgMLST schema was developed for this species. Schema construction was based on 53 complete chromosomal sequences representing the species diversity ^5,67 and 42 complete plasmid sequences (listed in Supplementary file S2). The final version was populated iteratively using the curated dataset (n = 1230) (Supplementary file S2) until no new alleles were detected (3 rounds). The final cgMLST schema is composed of 846 loci (838 CDSs from the chromosome and 8 from the plasmid), covering ~ 85–88% of expected genome size and representing a total of 26,528 unique alleles (median: 27; min: 2; max: 182 per locus, Supplementary figure S2). When compared with the local deployment of the previously available PubMLST schema (designed primarily for online usage) ⁵⁴, the novel schema showed highly congruent clustering but a slight increase in resolution power (Supplementary figure S2). For instance, clusters defined at 10 allelic differences (ADs) in the novel schema corresponded to 9 ADs (CS = 2.999) in PubMLST schema (Supplementary file S5). Most importantly, the novel schema achieved a higher proportion of loci called per sample (median 99.2%; range 95.1–99.9%) than PubMLST schema (median 96.7%; range 92.4–97.8%). These results demonstrate the enhanced resolution and typeability of the novel schema in local laboratory settings using the widely used chewBBACA allele caller ⁵⁶. In addition, the performance of the novel schema was tested by comparing two popular clustering algorithms, single-linkage hierarchical clustering and GrapeTree MSTreeV2 ⁸³. As expected ⁸⁷, they presented a high cluster congruence at all resolution levels, (Supplementary file S5), demonstrating the flexibility for implementation of this cgMLST schema in the commonly used workflows for bacterial genomic surveillance ⁹¹.

To further assess its discriminatory power and ability to capture C. trachomatis genomic population structure, the novel C. trachomatis cgMLST schema was also benchmarked with SNP-based approaches. First, it was compared with a curated core genome alignment of the set of 53 complete C. trachomatis genomes (Supplementary file S2), covering ~ 97% genome and enrolling 16 532 SNVs. This exercise demonstrated that, despite inferring lower genetic distances between the samples (as expected), the novel cgMLST approach is able to capture the expected species population structure, presenting absolute clustering congruence with the core genome alignment at several thresholds of resolution (Supplementary figure S2 and Supplementary file S5). To further support these observations, a second exercise was performed with the larger and more diverse dataset of C. trachomatis (n = 1230 assemblies) (Supplementary file S2), comparing the novel cgMLST with an assembly-based SNP approach with Parsnp ⁸⁹, which yielded a core SNP alignment of 20 941 SNVs. The results not only corroborated the previous observation but also demonstrated that the novel cgMLST provides similar, or even better, resolution than the assembly-based SNP approach (Supplementary figure S2). Indeed, both methods discriminate > 900 profiles for the 1230 isolates, showcasing the high discriminatory power of this novel schema (Supplementary file S5).

The implementation of cgMLST-based hierarchical nomenclature systems, increasingly adopted in genomic surveillance of multiple bacterial pathogens ^92,93, can add substantial value by enabling stable and comparable tracking of main circulating genogroups across studies and laboratories. Therefore, in order to leverage a surveillance-oriented nomenclature for the novel cgMLST schema, C. trachomatis clustering patterns were explored towards the identification of hierarchical thresholds that could be adequate for nomenclature design. First, the minimum threshold required to segregate the global hierarchical clustering tree (Fig. 1) into the expected C. trachomatis four main lineages (“T1 urogenital”, “T2 ocular”, “T2 urogenital” and LGV) ^5,29 was identified, corresponding to 449 ADs (Supplementary files S5). This threshold fell within a large clustering stability region (Supplementary file S6), which supported the rationale of having a first nomenclature level with direct correlation with lineage classification. This analysis further highlighted a large stability region between 63 and 86 ADs, which could be used to define genogroups for longitudinal surveillance, and another stability region between 23 and 30 ADs, which opened the possibility of extending the genogroup definition to a more discriminatory level. Based on these results, an intuitive cgMLST-based nomenclature system for C. trachomatis was designed, consisting on a hierarchical code that reflects the isolate clustering at lineage level (475 ADs), high-level genogroup (75 ADs) and low-level genogroup (25 ADs) (see Material and Methods), thus capturing the global population structure at different resolution levels. In total, 473 different nomenclature codes were attributed to the 1230 C. trachomatis isolates, covering 179 and 473 high- and low-level genogroups, respectively (Figs. 1 and 2, and Supplementary file S2). Of note, taking advantage of the clustering congruence analysis described above, the GrapeTree thresholds providing the most congruent results with these hierarchical code levels selected for nomenclature were identified and confirmed to also fall within stability regions, specifically 525 ADs for lineage classification, 86 ADs for high-level genogroup and 27 ADs for low-level genogroup (Supplementary file S5). These data provide the basis to directly implement this novel nomenclature system into workflows relying on any of these two clustering methods (Supplementary file S5).

Fig. 1

Characterization of the cgMLST-based diversity of the 1230 C. trachomatis genomes. A) Circular dendrogram of single-linkage hierarchical clustering depicting phylogenomic relationships among genomes. Concentric metadata blocks (from the center outward) show: ompA class (historical “disease groups” inferred from the traditional ompA classification), ompA-genotype, lineage, and the presence/absence profile of the 74 bp insertion upstream of CT_105. B) Geographical map displaying the distribution of sequences across countries, with circular plots indicating the proportion of isolates per lineage in each country (only sequences with available countries are shown). An interactive version of both figure panels, including genogroup classification and the complete set of metadata (Supplementary file S2), is available at Microreact ¹¹⁴ (https://microreact.org/project/5y7PVeTiw9vZxkzEomGqHK-advancing-chlamydia-trachomatis-genomic-surveillance-and-research-with-a-novel-core-genome-mlst-cgmlst-approach).

Fig. 2

cgMLST allelic variation across C. trachomatis isolates (n = 1230) colored by lineage. The heatmap shows allele presence across loci (columns) and isolates (rows), ordered by cgMLST-based single-linkage hierarchical clustering of the 1230 C. trachomatis genomes (dendrogram) (as in Fig. 1). Metadata blocks aligned with the tree indicate (from left to right) ompA class (historical “disease groups” inferred from the traditional ompA classification), cgMLST-inferred lineage, and high- and low-level genogroup classifications defined at the respective allelic distance (AD) threshold levels (475, 75 and 25, respectively) that are indicated with dashed red line in the dendrogram (dark and fainted lineage colors were alternately attributed to each genogroup). Heatmap colors represent the dominant lineage in which each allele occurs: faint shades mark alleles exclusive of one lineage, while darker shades highlight alleles shared across lineages, showcasing potential homologous recombination. Alleles that are dominant in at least 3 lineages (i.e., denoting fully or nearly conserved loci) are displayed in black. Loci that are absent in specific samples (denoted by allele “0”) are displayed in white. An interactive HTML version provided as Supplementary file S7 allows hover-based inspection and export as PNG/SVG. Summaries of lineage-exclusive and shared alleles as well as an annotated allelic matrix are provided in Supplementary file S8.

The novel cgMLST schema and associated resources for immediate applicability into standardized, high-resolution WGS-based routine genomic surveillance are publicly available at Zenodo repository (https://doi.org/10.5281/zenodo.17177579). An adapted version of the schema is also available at chewie-NS (https://chewbbaca.online/) ⁹⁴.

Capturing the global population structure of C. trachomatis and major genotype-lineage incongruences by cgMLST

The cgMLST clustering results of the studied 1230 C. trachomatis genomes were used to explore the genomic diversity of this species. As mentioned before, this analysis confirmed the expected clustering into the four previously defined lineages (Figs. 1 and 2). As expected, the lineages strongly - but not exclusively - correlate with the three historical “disease groups” (ocular, genital and LGV) inferred from traditional ompA classification (Figs. 1–3 and Supplementary file S2). For instance, T1 (n = 376) and T2 urogenital (n = 281) lineages are composed mostly of ompA genital strains (97.3% and 93.2%, respectively), although both also included ompA ocular strains (2.7% and 6.8%, respectively). While T1 lineage predominantly includes the prevalent genotypes E (61.2%) and F (20.7%), the T2 urogenital lineage shows a broader ompA-genotype composition, comprising genotypes G (31.7%), D (20.3%), K (19.2%), J (7.5%), B (6.8%), Ia (6.4%), H (6.0%), F (1.1%), I (0.7%), and Da (0.4%). By other side, the T2 ocular lineage (n = 371) consisted exclusively of ompA ocular strains (100.0%), distributed as follows: A (67.7%), Ba (31.3%), B (0.5%) and C (0.5%). Finally, the LGV lineage (n = 202) included predominantly ompA LGV strains (94.6%), namely L1 (8.9%), L1/L2 hybrid (3.0%), L2 (41.6%), L2b (34.2%), and L3 (1.5%), and the emerging L4 (5.4%), but also atypical strains carrying non-LGV ompA-genotypes (5.4%) (Fig. 3). These included 10 isolates of the recently described recombinant L2b/D-Da strain ^43,45, as well as a genotype E strain (accession ERR351523) displaying a highly atypical recombinant profile (detailed below).

Fig. 3

Distribution of C. trachomatis (n = 1230) ompA genotypes across cgMLST-inferred lineages and high-level genogroup classification. A) Heatmap of genome counts across ompA genotypes and cgMLST-inferred C. trachomatis lineages. Each cell indicates the number of genomes sharing a specific ompA-genotype within a given lineage, with the intensity of colour corresponding to the number of genomes (see scale bar). Red circles highlight atypical genotype-lineage combinations, with major incongruences represented by thicker lines. *Recombinant LGV strain with hybrid non-LGV ompA-genotype (L2-L2b/D-Da) as most of its ompA sequence (~ 75%) derives from a genital genotype (D/Da) ⁴³. B) Sankey plot illustrating the relationships between C. trachomatis classifications at four levels: ompA class (historical “disease groups” inferred from the traditional ompA classification*), ompA-genotype, cgMLST-inferred lineage, and high-level genogroup (excluding singletons). Each node represents a category within a given stage, and the width of the connecting flows is proportional to the number of isolates sharing the corresponding combination of categories. Grouping in “ompA_class” was strictly based on ompA sequence, so the recombinant L2/L2b–D/Da strain ^43,45 was classified as “genital”, as most of its sequence (~ 75%) derives from a genital genotype (D/Da). High-level genogroups were obtained by single-linkage hierarchical clustering at 75 allelic differences. Supplementary file S2 provides complete metadata and additional information.

In total, 30 samples (besides the recombinant L2-L2b/D-Da strain characterized elsewhere ^43,45) displayed major incongruence between their ompA-genotype and phylogenomic lineage (Figs. 1–3 and Supplementary file S2). As described above, the most striking profile corresponded to a sequence (accession ERR351523; genogroup LGV-S3-S5) carrying an ompA-genotype E and yet clustering within the LGV lineage (Figs. 1–3). Comparative allelic profiling suggested that this strain likely resulted from the transfer of at least 229 cgMLST loci from a T1 strain into an LGV genomic background, enrolling the exchange of at least seven distinct genomic regions (each spanning a minimum of 3 genes) via homologous recombination (Fig. 2, Supplementary files S7 and S8). Noteworthy, within the LGV lineage, a distinct group of 22 strains (composing the high-level genogroups LGV-C4, LGV-C7 and LGV-C10) that, while retaining an LGV “L2” ompA-genotype, exhibit substantial mosaicism LGV-T1 was also identified (Fig. 2, Supplementary files S7 and S8). Several of the exchanged regions, particularly the large regions between loci CT_374-CT_402, CT_440-CT446 and CT_542-CT650, partially overlap those identified in ERR351523, suggesting they might represent potential recombination hotspots between T1 and LGV strains (Fig. 2).

Other notable inter-lineage ompA exchanges include 19 strains carrying ocular ompA-genotype B, but clustering within the T2 urogenital lineage. The cgMLST analysis shows that they are not confined to a single monophyletic (clonal) branch but are instead dispersed across different high-level genogroups, consistent with multiple and independent ocular-urogenital ompA exchange events. These include three genogroups composed exclusively of recombinant B isolates (T2G-C60, T2G-C84 and T2G-S77), as well as two genogroups that also harbor isolates with “genital” ompA (T2G-C57, containing ompA-genotypes B, D/Da, G and H, and T2G-C61, containing ompA-genotypes B and D) (Figs. 1 and 3B). Noteworthy, only the high-level genogroup T2G-C84 includes isolates from the past decade, among them two recombinant B anorectal isolates (from 2019 and 2023) from Portugal, possibly representing a recent sublineage that is currently circulating. Comparative allelic profiling (Fig. 2, Supplementary files S7 and S8) suggests that these recombinant T2G-C84 isolates possess a genomic backbone largely consistent with the T2 lineage, with only a limited set of loci (n = 126) carrying strong evidence of inter-lineage recombination (i.e., having alleles exclusively shared between ocular and urogenital T2 lineages).

Regarding the detection of ompA-genotypes Ba (n = 2) and C (n = 8) within the T1 lineage (Fig. 3A), these isolates belong to two divergent and highly clonal genogroups (T1G-C25-C55 and T1G-C17-C20, respectively) exclusively composed of ocular ompA-genotype isolates from the 1980s and 1990s. Moreover, among other infrequent genotype-lineage combinations (Fig. 3A), an anorectal G strain (SRR25447256) within T1 (genogroup T1G-C16-C17) is noted, since this genotype is typically associated with the T2 urogenital lineage (Figs. 1 and 3). Conversely, three ompA-genotype F strains (genogroup T2G-C78-C134) appeared as tightly clustered within the T2 urogenital lineage, although this genotype is normally predominant in T1 (Fig. 3) ^5,29,80.

Finally, a closer examination of the ompA allelic profile distribution by lineage (Supplementary figure S3 and Supplementary file S4) revealed additional notable trends. For instance, although “D-like” genotypes are generally reported in both T1 and T2 urogenital lineages (23), with the exception of one sequence in each group, D ompA alleles occurred exclusively in T2, while Da alleles were restricted to T1. Similarly, all ompA-genotype J strains (ompA alleles 14 and 26) were found in the T2 urogenital lineage, whereas genotype Ja strains (ompA alleles 56 and 110) appeared exclusively in T1. Noteworthy, cases where all strains of a given ompA-genotype belonged to the same lineage and formed a well-defined, genetically homogeneous cluster at cgMLST level were rare (Fig. 1). These cases were mainly observed in genotypes with very low representation in the dataset, such as genotype I (n = 2, genogroup T2G-C83-C149) and L3 (n = 3, genogroup LGV-C9-C9), or in cases of clonal expansion of recently emerged, epidemiologically relevant LGV strains, namely the the epidemic “L2b” strain (genogroup LGV-C1-C1) ^5,7,29,69, which includes the recombinant strain with hybrid L2-L2b/D-Da ompA (alleles 63 and 77) (n = 10) ^43,45, and the emergent L4 strain (genogroup LGV-C5-C4; ompA alleles 49 and 89) (n = 11) ⁴⁶. A more in-depth analysis within the LGV lineage is presented in the next section.

Exploring C. trachomatis epidemiologically-relevant genogroups and novel insights for genomic surveillance

The cgMLST analysis of 202 C. trachomatis genomes belonging to the LGV lineage (Supplementary file S2) revealed considerable intra-lineage diversity, allowing clear and straightforward identification of both previously described and emerging subvariants (Fig. 1) ^{5,7,29,70,95–99}. Of greatest current epidemiological and clinical relevance, two well-defined LGV clusters stand out: the epidemic “L2b” strain ^5,7,29,69 and the emergent L4 strain ⁴⁶.

Regarding the epidemic “L2b” strain, the cgMLST analysis effectively captured its monophyly ^5,29, clustering all isolates (n = 112) at a single-linkage threshold of 23 ADs, thus placing them all within the same low-level genogroup (LGV-C1-C1) (Fig. 1). Within this genogroup, there was a clear separation of a subcluster containing the earliest L2b isolates (n = 2) identified in the USA in the 1980s ^7,97–99, and another comprising all genomes (n = 10) of the recombinant L2b/D-Da strain, which grouped within a 10 AD threshold, consistently with its ongoing global expansion ^43,45. Notably, a typical core SNP maximum likelihood analysis (mapping against the L2b/UCH-1 reference genome) also segregated the same subclusters of interest (Supplementary figure S4), showing that both genome-scale typing approaches present comparable discriminatory power in capturing microevolution-driven divergence during L2b clonal expansion. In the studied dataset, within the “L2b” epidemic clone, three main ompA-genotypes were observed, namely L2b (n = 69), L2 (n = 33), and the hybrid L2-L2b/D-Da (n = 10), comprising seven, four and two distinct ompA alleles, respectively. Noteworthy, although most isolates genotyped as “L2” clustered within a separate sub-branch, L2 sequences were also found in other sub-branches. This pattern, which was observed in both SNP (Supplementary figure S4) and cgMLST (Fig. 1) analyses, supports the previously proposed hypothesis that the reported reversion of the ompA “L2b” signature” (mutation A485G) ^100,101 had occurred independently on multiple occasions throughout the epidemic. When examining the L2b phylogeny at an in-depth sub-cluster level, particularly the placement of ompA-genotype LGV subvariants (i.e., harboring distinct ompA alleles), which were recently compiled ⁴⁸, all “L2bV5” isolates (n = 17) were found to be clustered together in the core SNP phylogeny (Supplementary figure S4), a pattern less evident in the cgMLST clustering analysis (Fig. 1). L2bV5 is the second most frequent L2b subvariant in the genome dataset (after the prototype “L2b”) and was also frequent in a large LGV ompA-genotyping study in Portugal (18.5%, 2007–2023) ⁴⁸, supporting its epidemiological relevance. By suggesting a L2bV5 monophyletic nature, these results indicate that its characteristic ompA allele (allele 48 in cgMLST, with G271A and C493A mutations leading to Ala91Thr and His165Asn protein changes ⁴⁸) can still be a reliable marker for tracking its international spread. Of note, a GyrA mutation potentially associated with fluoroquinolone resistance (Ser83Ile; allele 36 of the locus "CT_189_gyrA") was previously identified in LGV clinical samples, including recombinant L2b/D-Da (42, 44). In the present genome dataset it was only observed in one genome of that recombinant strain, corroborating its potential rarity in vivo.

Regarding the emergent L4 strain, cgMLST robustly clustered all L4 genomes (n = 11) within a deeply separated and highly homogeneous genetic cluster within LGV lineage, consistent with its recent description (66). Indeed, the L4 genogroup (LGV-C5-C4) could be segregated from other LGV genomes at a cgMLST threshold as high as 229 ADs. As with L2b, the intra-subcluster diversity detected by cgMLST was very low, and of similar magnitude to that obtained with core SNP analysis, with all L4 genomes presenting a pairwise median distance of 8 ADs and 8 SNPs.

In addition to the two LGV clusters of highest contemporary relevance (“L2b” and “L4”), cgMLST analysis within the LGV lineage supported the expected high diversity of ompA-genotyped L1 or L1-like genomes, as previously observed ^5,29. Indeed, despite they predominantly grouped into a single monophyletic clade (genogroup LGV-C3-C3; n = 15), other two largely divergent genogroups were observed: i) LGV-C8-C8 with three L1 sequences, including the historical prototype strain L1/440; and ii) LGV-C6-C6 with 6 genomes presenting a hybrid L1/L2 ompA-genotype from South Africa ^5,102 (Fig. 1).

Finally, regarding the analysis of potentially epidemiologically relevant genogroups within non-LGV lineages, this study was limited not only by scarce metadata (e.g., missing sampling dates for 497 isolates), which hinders temporal resolution, but also by the paucity of confidently dated contemporary genomes. Indeed, the dataset only included 73 non-LGV genomes dated from the last decade (since 2015), comprising one from China, sequences from a recent study in the Fiji Islands ^51,103, and the new genomes generated in Portugal. However, it was still possible to extract some insights that shed light on strains that have recently circulated or are currently in circulation. For instance,the recent non-LGV Portuguese isolates were split into two singletons and four clusters at the low-level genogroup assignment. Besides the genogroup described above that harbored two recombinant “B” anorectal isolates from Portugal (T2G-C84-C152), the remaining three genogroups included strains from multiple countries, namely genogroups T1G-C20-C28 (Portugal and unknown origin; ompA-genotype Da), T2G-C66-C122 (Sweden, Portugal, and Argentina; ompA-genotype G), and T2G-C69-C125 (Sweden, Portugal, and the UK; ompA-genotype J), possibly reflecting strains with international dissemination.

Linking cgMLST alleles to traits of epidemiological and research relevance

An additional added-value of cgMLST is its utility to rapidly explore associations between specific alleles and clinically or epidemiologically relevant traits, whether phylogenomic (e.g., lineage) or phenotypic (e.g., tissue tropism or disease). Therefore, in this study, the distribution of alleles across the 1230 C. trachomatis samples and the 846 cgMLST loci was straightforwardly assessed and evaluated by following a novel approach that maps the allele predominance and “lineage exclusive/shared” profile across the single-linkage tree generating an “allelic heatmap” (“allelic_heatmap.py”, see Material and Methods) (Fig. 2). This allowed the detection of potential homologous recombination events and enabled the streamlined investigation of genetic mosaicism of isolates of interest, such as those showing major genotype–lineage incongruences (addressed above). An interactive HTML version of this heatmap is provided as Supplementary file S7, and a complete list of loci and their allelic classification, including “lineage exclusive/shared” profile and category association is compiled in Supplementary file S8. Additionally, by developing a complementary approach (“allelic_distribution.py”, see Material and Methods) that assesses the allele distribution across any metadata variable, such as lineage (Supplementary file S9), this study also streamlined the identification of loci with category-specific alleles, like those frequently absent or displaying pseudogene-like profiles in particular lineages (explored in the next section). In practice, these innovative tools and complementary output tables enabled several levels of insight with relevance for both surveillance and research activities.

Indeed, the systematic analysis and coloring scheme of the allelic heatmap (Fig. 2, Supplementary file S7) allowed the straightforward distinction between highly conserved and variable loci, as well as the identification of all cgMLST loci with alleles classified as “exclusive” of a given lineage or “shared” between isolates of two or more lineages (Table 1 and Supplementary files S8), thereby supporting the detection of candidate targets for diagnostics or functional studies. For instance, 214 loci presented a single and exclusive allele in LGV, 49 of which being called in all samples (Table 1 and Supplementary file S8), thus representing potential robust markers for LGV identification. Moreover, 124 genes with alleles exclusive to the emergent L4 strain were detected, 120 of which fully conserved (i.e., unique allele across all L4), thus constituting potential targets of future wet- or dry-lab assays for its direct tracking.

Table 1
Summary of the number of loci exclusive of a given C. trachomatis lineage or shared between any combination of lineages (n loci) and the total number of allele occurrences across those loci (n alleles). For exclusive "n loci" counts, the number of conserved loci (i.e., with only one allele, which is exclusive of the lineage) is indicated within brackets. Detailed data is provided in Supplementary file S8.
Lineage	Profile	n loci (n_loci conserved)	n alleles
LGV	exclusive	794 (214)	2393
T1	exclusive	833 (29)	7521
T2_ocular	exclusive	837 (28)	7931
T2_urogenital	exclusive	834 (32)	6814
LGV, T1	shared	295	305
LGV, T2_ocular	shared	8	8
LGV, T2_urogenital	shared	38	39
T1, T2_ocular	shared	62	65
T1, T2_urogenital	shared	374	606
T2_ocular, T2_urogenital	shared	254	283
LGV, T1, T2_ocular	shared	6	6
LGV, T1, T2_urogenital	shared	109	117
LGV, T2_ocular, T2_urogenital	shared	48	48
T1, T2_ocular, T2_urogenital	shared	277	290
LGV, T1, T2_ocular, T2_urogenital	shared	100	101

Although distinguishing between signals of common ancestry and those of recombination being always a challenge, the comprehensive allelic heatmap (Fig. 2 and Supplementary file S7) concordantly reflected the known C. trachomatis evolutionary history, marked by ancient diversification into four lineages, followed by contemporary mixing, as discussed elsewhere ^{5,7,29,40,41,72–74,104}. Indeed, on the one hand, it clearly highlighted the presence of shared alleles between lineages that diverged from a more recent common ancestor (T2 ocular and T2 urogenital), reflecting alleles retained from that common ancestry rather than ongoing exchange. On the other hand, it supported the absence of strict barriers to genetic exchange within the species, demonstrating that homologous recombination has occurred repeatedly within and between lineages, with a higher frequency being observed among lineages infecting similar tissue niches (e.g., urogenital and anorectal) compared with the more specialized LGV and ocular lineages. As explored previously ^5,29,71–73, a distinctive genetic feature of the LGV lineage is the markedly high allele sharing with the prevalent urogenital T1 lineage compared with other lineages. In this study, the precise identification of all these shared profiles (loci and alleles) may serve as a starting point for selecting candidate genes to explore previously raised hypotheses about the ecological success of T1 strains, in which LGV-T1 shared antigenic and genetic traits could have contributed to the remarkable expansion and persistence of the T1 lineage ⁴⁰. Ultimately, it is noteworthy that recombination signals were found to be extended across nearly the entire genome (Fig. 2, Supplementary files S7 and S8), with only 20 cgMLST loci showing exclusively lineage-specific alleles (no evidence of inter-lineage exchange) (Supplementary file S8). Moreover, only 69 loci displayed inter-lineage allelic exchanges confined to T1 and T2 urogenital lineages, with no evidence of exchange across the three main disease-associated groups (LGV, ocular, genital), suggesting fully concordant tropism-specific allelic signatures.

In practice, the vast allelic database, and associated tools for rapid link of cgMLST loci/alleles to specific traits, may serve as a useful resource to identify molecular signatures for epidemiological surveillance or outbreak tracing (e.g., for the development of multiplex sequencing approaches or rapid computational classification of genomes), as well as for research activities (e.g., functional studies to identify determinants of tissue tropism, virulence, etc.).

Other cgMLST-derived genomic fingerprints with surveillance and typing relevance

Plasmid insights

The plasmid of C. trachomatis has long been of epidemiological and diagnostic relevance, historically serving as a key target for molecular assays ^5,28–38. Increasing evidence also points to its critical role in the transcriptional regulation of chromosomal genes with important implications for virulence, tissue tropism, and disease outcome ^30–38. For these reasons, plasmid genes were deliberately incorporated into the novel cgMLST schema. Consistent with previous observations ^5,28,29, this study corroborated that plasmids share the same evolutionary history as their respective chromosomes, indicating predominantly vertical inheritance (Fig. 1 and Supplementary figure S5). Still, comparative cgMLST clustering analysis also found the previously described major topological incongruence between plasmid and whole cgMLST clustering (Supplementary figure S5), due to plasmid replacement in a group of strains from T2 urogenital lineage (including the Ia strains from Southampton ⁵). Moreover, although being rare, inter-lineage exchange of plasmid was observed (Fig. 2 and Supplementary files S7 and S8). For example, a T2 urogenital isolate (ERR189741) that, despite harboring a chromosome backbone similar to the same-genogroup isolates (T2G-C73-C119), possesses plasmid alleles that are exclusive of the LGV lineage was detected. Another remarkable example stands for T2 urogenital isolates of two genogroups of interest highlighted above, T2G-C66-C122 (ompA-genotype G) and T2G-C84-C152 (ompA-genotype B), whose most plasmid CDSs share alleles with the T2 ocular lineage. Apart from these major events, the in-depth allelic profiling (Fig. 2 and Supplementary files S7 and S8) also evidenced several instances of recombination within the plasmid, as also noticed in previous WGS-based comparative genomic studies ^5,28,29. In summary, the plasmid component of the cgMLST scheme offers a practical resource for rapidly examining plasmid sequence variation, protein signatures, and their correspondence with chromosomal genes, specially plasmid-regulated genes. Given the co-evolution of plasmid and chromosome in C. trachomatis, plasmid cgMLST can be particularly valuable for strain tracking in scenarios where whole-genome recovery is incomplete, as the plasmid’s multiple copy nature might increase the likelihood of successful sequencing.

Lineage- and disease-specific pseudogenes

Taking advantage of the in-depth analysis of allele distribution across multiple metadata variables, cgMLST loci with absent or pseudogene-like profiles (i.e., lacking predicted CDS) that were highly prevalent within particular lineage(s) were also identified (Supplementary file S8 and S9). The results are fully consistent with previous literature, namely the patterns observed in smaller genomic datasets ^73,74, reinforcing the existence of a subset of loci frequently disrupted or pseudogenized that are useful genomic markers for lineage or disease-group inference (Fig. 5). Specifically, these include: i) for LGV lineage: CT_373/aaxB ¹⁰⁵ and CT_855/fumC; ii) for T1 lineage: CT_473 ^74,106; ii) for T2 ocular lineage: CT_058 ¹⁰⁷, CT_105 ^73,76,106, the trpRBA operon ¹⁰⁸, CT_374/aaxC ¹⁰⁵ and CT_845; iii) for T2 urogenital lineages: CT_832/nusB ^29,106; and iv) for T1 and T2 urogenital lineages: CT_101 ¹⁰⁷. Regarding the tryptophan operon (trpRBA), it is historically known to be functional in C. trachomatis strains infecting the genital tract, being expectedly inactive in most ocular trachoma strains ^108–110. Concordantly, 95% of genomes belonging to the T2 ocular lineage had at least one of the three CDSs of the operon not called. Still, despite harboring the ocular-specific 3 bp deletion in trpA (nucleotides 408–410; loss of Phe-136), 40 (11%) ocular strains did not present the downstream 1bp-deletion (nucleotide 528, ocular numbering) that typically yields a truncated nonfunctional TrpA ^109,110. Noteworthy, with exception of two isolates from Ethiopia, all these ocular isolates with predicted undisrupted trpA (trpA alleles 2, 14 and 15) were collected in Tanzania, including several recently described isolates ⁹⁹ belonging to cgMLST high-level genogroups T2O-C32 and T2O-C33. Not unexpectedly, the LGV lineage exclusively possessed the two TrpA marker amino acids (“YE”, at protein positions 177–178), and all recombinant T1 and T2 isolates with an ocular ompA-genotype had all three trpRBA loci successfully called, reinforcing the historically known genotype-phenotype associations related with tryptophan biosynthesis (reviewed elsewhere ⁴⁰). Of note, the very few genomes (n = 15) from the T1 and T2 urogenital lineages that unexpectedly had at least one trpRBA locus not called were dispersed throughout the species tree. So, it cannot be ruled out that this observation reflects assembly issues. Still, for other loci, different explanations can also account for the lack of absolute lineage specificity (Fig. 5). For example, in the case of CT_473, the very few T1 isolates (n = 2) in which this gene was called carried allele 3, which otherwise occurs exclusively in LGV strains. This indicates that the apparent lack of specificity is due to discrete recombination events. By contrast, for CT_058, the allelic distribution is more consistent with ancestral pseudogenization during the evolutionary diversification of the ocular lineage. Indeed, CT_058 was not called in 63% of T2 ocular isolates, but these strains were not scattered across the lineage. Instead, the “not called” status was largely confined to three major clusters. In summary, the (nearly) absolute lineage-specific or group-specific status of this subset of cgMLST loci (Fig. 5) might render them particularly useful for rapid computational classification of genomes, even from incomplete assemblies, and might open further research insights into the evolutionary and functional diversification of C. trachomatis.

Known and potential real-time PCR targets for rapid LGV genotyping

So far, most molecular diagnostic assays designed to detect LGV strains - and, within them, the epidemic clone L2b - target presumed specific insertions or deletions in the pmpH gene ^10,16,18. In the present cgMLST dataset, it was concordantly observed that all pmpH alleles within the LGV lineage carried the characteristic 36 bp deletion commonly used in LGV-specific tests ^16,18. However, unexpectedly, the 9 bp insertion applied to specifically detect the epidemic L2b ¹⁰ was not unique to this strain. Indeed, while present in allele 13, which is exclusive from isolates with an L2b genomic backbone (genogroup LGV-C1-C1), the 9 bp insertion also occurs in allele 28, which appeared exclusive to the emerging L4 strain (genogroup LGV-C5-C4) (Supplementary file S8). In practice, this indicates that the real-time PCR target that relies on this 9 bp insertion to detect the epidemic L2b ¹⁰ most likely does not reliably discriminate between the two co-circulating LGV strains, L2b and L4, a limitation that may hinder their accurate tracking and obscure the emergence of the latter.

Regarding this topic, an alternative candidate marker previously proposed as LGV-specific ^58,76, namely a 74 bp insertion located 23 bp upstream of the predicted transcriptional start site (TSS) of CT_105 (position related to L2/434/Bu reference strain) was also evaluated. This feature could serve as a valuable target for future rapid LGV genotyping assays, particularly to anticipate potential future limitations of traditional ompA-genotyping and pmpH-based rapid assays, which already yielded discordant results in the presence of recombinants ¹⁰, such as the L2-L2b/D-Da strain ^43,45. In the 1230 genomes used in this study, the 74 bp insertion was present in all LGV genomes and absent from non-LGV lineages, confirming a 100% LGV-specificity rate (Fig. 1). This high specificity supports its potential for the development of rapid LGV/non-LGV discriminatory tests, enabling precise strain differentiation while minimizing misclassification due to recombination.

cgMLST lineages, traditional MLST sequence types (STs) ompA-genotypes versus anatomical sites

Taking advantage of the comprehensive analytical outputs from the genomic surveillance tool ReporTree (selected compilation in Supplementary file S4, and full compilation in Zenodo repository https://doi.org/10.5281/zenodo.17177579), cgMLST-inferred lineages, traditional MLST sequence types (STs), and ompA-genotypes distributions were evaluated across anatomical sites (a historical proxy for C. trachomatis “tissue tropism”). Although anatomical site data were unavailable for a substantial subset of isolates, the analysis of available records (n = 755) still revealed noteworthy trends (Supplementary file S2 and S4). As expected, all isolates belonging to T2 ocular lineage were collected from the conjunctival area (n = 181, Supplementary file S4). The anorectal site was the most frequently reported source among the LGV lineage (86.3%) and the LGV strains accounted for the majority (76.5%) of all anorectal isolates in the dataset. Although T1 and T2 urogenital lineages were mostly composed of isolates from the urogenital area (88.9% and 90.7%, respectively), they also presented a non-negligible proportion of isolates collected from anorectal (6.3% and 7.4%, respectively) and conjunctival (4.7% and 2.0%, respectively) sites. Noteworthy, within T1 and T2 urogenital lineages, ompA-genotype Ja (exclusively in T1) (Figs. 1 and 3A) displayed the highest proportion of anorectal isolates (6/23, 26.1%), followed by ompA-genotype G (11/74, 14.9%). In another perspective, the ompA-genotypes G and Ja were among the most frequent non-LGV genotypes detected in isolates from anorectal origin (8.7% and 4.8%, respectively), together with E (n = 6/126, 4.8%). When investigating the placement of these anorectal isolates in the cgMLST tree, it was noticed that the Ja isolates fall within three genogroups, which are exclusively composed by isolates from Fiji islands ^51,103: T1G-C21-C36, T1G-C16-C17 and T1G-C16-C37 (the latter two belong to the same high-level genogroup) (Fig. 1). Regarding the anorectal G isolates, they were dispersed across nine low-level genogroups (seven of them belonging to two high-level T2G-C56 and T2G-C66 genogroups) (Fig. 1), with highlight for the T2G-C66-C122 detected at multicountry level, as described above. Regarding the anorectal E isolates, they were dispersed across six clusters within two large high-level genogroups (T1G-C14 and T1G-C16), which together contained 86.1% (199/231) of all E isolates from the dataset.

Analysis of the distribution of STs (based on Chlamydiales MLST schema ⁷⁷ per lineage revealed that no single ST was present in more than one lineage (Fig. 1, Supplementary figure S6 and Supplementary file S4). The T1 lineage showed the highest ST diversity, encompassing 23 distinct STs, followed by the T2 urogenital lineage with 19 STs. In contrast, the LGV lineage comprised only 6 STs, while the T2 ocular lineage was the least diverse, containing 4 STs. Despite traditional MLST providing much less resolution than the two genogroup-levels defined by cgMLST, a good overall correspondence between the most frequent STs and the major cgMLST lineages was observed (Supplementary figure S6). Within the LGV lineage, ST44 (englobing the L1, L2, and epidemic L2b clone) was the most frequently observed and, along with ST11, to which the emergent L4 strain belongs ⁴⁷, was exclusively found among LGV lineage strains. The T2 ocular lineage was mainly represented by ST3, ST45 and ST117. The T1 lineage was dominated by ST4, ST38 and ST12, while T2 urogenital lineage was represented mainly by ST9 and ST13 (among other less frequent STs) (Supplementary figure S6).

Analysis of ompA-genotype distribution within STs showed that the majority of them were associated with a single ompA-genotype (n = 33), while 15 STs harboured more than one ompA-genotype (Fig. 1 and Supplementary files S2 and S4). Among them, several coincided with the major ompA-lineage incongruences, such as the strain ERR351523 (ompA-genotype E, LGV lineage) that was assigned to ST44, which typically linked to ompA LGV genotypes. Similarly, ompA-genotype B strains clustering within the T2 urogenital lineage were assigned to ST95, a type that also includes genotype G, concordant with their similar genomic backbone. Likewise, the ompA-genotype F strains detected in T2 belonged to ST38 and ST95, both of which also including other urogenital genotypes such as D, Da, E, F, G, and Ja, while the ompA-genotypes Ba and C strains clustering within T1 corresponded to ST4, a type frequently associated with E/F backgrounds. These associations illustrate that, although traditional MLST has much lower resolution than cgMLST, ST assignments generally reflect the underlying lineage background rather than the recombinant ompA allele, thereby partially mitigating ompA-lineage incongruences.

Discussion

To advance WGS-based surveillance of C. trachomatis and improve the routine analysis of its population structure and diversity, this study assembled the largest curated C. trachomatis genome dataset to date and developed a novel cgMLST schema for the species. While WGS offers transformative resolution for studying genomic diversity, evolutionary dynamics, and epidemiology, far surpassing traditional molecular methods such as ompA-genotyping or MLST ^{5,9,21,29,42,43,49,50}, its systematic use in C. trachomatis surveillance remains limited. The main challenges are wet-lab constraints associated with the obligate intracellular nature of the pathogen, uneven sampling, and the absence of standardized genome-scale typing frameworks ^21,45,49,50.

On the wet-lab side, obtaining complete C. trachomatis genomes from culture or directly from clinical material is still technically demanding in routine laboratory settings, due to the overwhelming presence of host nucleic acids in extracted DNA samples ^{49–51,111,112}. In line with these challenges, despite our efforts to pre-screen the bacterial load of hundreds of C. trachomatis-positive clinical samples by quantitative PCR, only a small fraction met the criteria for targeted enrichment WGS, and fewer than one-third of the sequenced samples yielded complete genome assemblies. Enrichment success was variable (Supplementary figure S1), and some samples with high bacterial loads above the selection threshold (≥ 10⁴ genome copies in the WGS input) still failed to generate complete genomes. This suggests that successful genome recovery is multifactorial and cannot be explained by bacterial load alone ^49–51. Technical and biological variables, such as DNA integrity, sample storage conditions, presence of PCR inhibitors, and the ratio of bacterial to host DNA all likely influence enrichment outcomes ^49–51. In addition, technical aspects of library preparation (including fragmentation size, probe design and coverage, and hybridization conditions) can directly affect capture efficiency and the proportion of on-target reads ^43,49,50. While this reinforces the need for continued community efforts to optimize and improve culture-independent WGS protocols ^51,113, our quantitative criteria for sample selection may provide useful guidance for prioritizing samples in future studies. Opportunities for optimization include refining capture probe panels to maximize coverage across diverse lineages, adjusting hybridization protocols to improve specificity, or adopting complementary pathogen enrichment strategies. Nonetheless, the new genomic data contributed by this study (57 genomes from Portugal collected between 1995 and 2023) represent an important addition, expanding the representation of European isolates in the global dataset by 17% and anorectal samples by 56%.

On the analytical side, this study advances C. trachomatis molecular surveillance through the development of a novel cgMLST schema, designed for local deployment in both research and surveillance settings, providing a standardized and scalable approach to compare genome data across laboratories and studies. Indeed, as vastly consolidated for other important bacterial pathogens ^52,53, the availability of robust, user-friendly, and portable cgMLST schemes can be pivotal for integrating WGS into routine surveillance, and our results underscore its potential to accelerate this transition by delivering an efficient and harmonized framework for genomic monitoring and research. Taking advantage of the most comprehensive C. trachomatis genome collection to date, covering the currently known species diversity, we constructed a novel cgMLST schema of 846 loci spanning chromosomal and plasmid genes. The schema was tested with commonly used clustering algorithms (single-linkage hierarchical clustering and GrapeTree ^82,83, which showed strong overall cluster congruence, demonstrating its flexibility for implementation in commonly used bacterial genomic surveillance workflows. Using the 1230 genomes analysed here, cgMLST recapitulated the expected four deep evolutionary lineages of C. trachomatis (Figs. 1 and 2) ^5,29,41, enabling a straightforward identification of multiple instances where the ompA-genotype was incongruent with the ancestral phylogenomic lineage (Fig. 3). The most striking and previously undocumented example was that of an “E” strain clustering within the LGV lineage, with the in-depth allelic profiling (Fig. 2) revealing multiple large regions imported from a T1 lineage strain. At the same time, cgMLST achieved very high resolution to resolve intra-lineage diversity and detect mosaic profiles, thereby extending and refining beyond previous large-scale SNP-based evolutionary analysis ^51,54. Importantly, the novel cgMLST design includes a proposal of hierarchical genogroup nomenclature, which might streamline a future harmonized tracking of major circulating genogroups across laboratories and regions. For instance, within the LGV lineage, cgMLST robustly grouped the epidemic L2b clone into a genetically homogeneous cluster (genogroup LGV-C1-C1), consistent with its clonal expansion across MSM networks in Europe since the early 2000s ^5,43,100. At the same time, cgMLST clearly segregated the recently described L4 strain into a well-supported cluster (genogroup LGV-C5-C4), in agreement with recent genomic reports ^46,113. This demonstrates the cgMLST ability to discriminate and capture microevolution-driven divergence within epidemic clones and demarcate emergent lineages, illustrating its immediate translational value for genomic surveillance. Of note, these observations relied on comparisons with typical core SNP analysis (mapping against a reference genome, followed by maximum likelihood phylogenetics), which confirmed that both genome-scale typing approaches (novel cgMLST and core SNP) achieve comparable resolution. Despite large-scale C. trachomatis genomic analyses having, thus far, been almost exclusively performed with SNP-based approaches ^5,29,50,54, their routine application to large and diverse datasets is computationally demanding (especially if reference-based mapping is used) and typically does not deliver a comprehensive or readily exploitable view of the recombination landscape. In contrast, as demonstrated in the present study, cgMLST not only ensures high discriminatory power, but also inherently buffers the noise introduced by recombination, while still being able to more effectively capture its signal, and thereby providing a clearer and more consistent framework for routine phylogenomic analysis of C. trachomatis. Notwithstanding, it should be noted that cgMLST can face limitations in resolving the direction of evolution (or transmission route) among highly similar genomes. Unlike SNP-based phylogenies, which can infer ancestry and evolutionary relationships, the rooted-ancestry–independent cgMLST framework is less suited for such investigatory fine-scale resolution purposes.

Beyond phylogenetic resolution, cgMLST provided a versatile framework to link allelic variation with traits of epidemiological and research relevance. By implementing an innovative allelic heatmap approach across the 846 loci and 1230 samples (Fig. 2, and Supplementary files S7 and S8), we were able to comprehensively explore allele distributions. Although distinguishing between signals of common ancestry and those of recombination is always challenging, our simple yet comprehensive strategy allowed to identify lineage-specific and shared alleles, thereby pinpointing potential recombination events, confirming the existence of loci disrupted or pseudogenized in specific lineages, and highlighting candidate molecular signatures for lineage or disease-group. Also, all genomes were screened for the presence of a 74 bp insertion located upstream of CT_105, confirming its 100% LGV-specificity (Fig. 1) and potential suitability for the development of rapid LGV/non-LGV discriminatory tests, in line with previous observations by multiplex NGS of hundreds of clinical samples ⁵⁸. This in-depth exploration of multiple C. trachomatis genomic fingerprints, linking loci and alleles to traits of epidemiological and research relevance, illustrates how cgMLST can identify robust molecular signatures for both practical surveillance applications (e.g., development of diagnostic panels, rapid genome classification, outbreak tracing) and research directions (e.g., functional studies on determinants of tissue tropism or virulence). For instance, we exemplified that this innovative and flexible approach can be extended at different resolution levels, like at genogroup level, as demonstrated by the comprehensive identification of 124 loci with alleles exclusive to the emergent L4 strain (genogroup LGV-C5-C4), which represent promising candidates for the development of L4-specific wet- or dry-lab assays. This need was further underscored by the observation that a 9 bp insertion in pmpH that has been used to specifically detect the epidemic L2b strain ¹⁰ would most likely yield positive results also for the emergent L4 strain, potentially jeopardizing their separate tracking in countries relying solely on rapid tests for LGV molecular surveillance. This, together with a currently low (and potentially declining) adoption of ompA genotyping and MLST, further corroborate that public health surveillance laboratories should invest in setting routine WGS as a critical asset for genomic epidemiology of this important pathogen.

In summary, this work underscores the added value of the developed cgMLST for C. trachomatis genomics, providing a portable and standardized schema that preserves deep lineage structure, while offering enhanced resolution to capture sublineages, recombination, and clinically relevant clonal expansions. By integrating seamlessly with tools, such as chewBBACA ⁵⁶, ReporTree ⁸² and the newly developed scripts, the schema enables automated clustering for prospective tracking, while promoting inter-laboratory comparability. More broadly, the allelic database and associated analytical outputs, including heatmaps and searchable tables, create a rich resource for rapid link of genomic loci to epidemiological traits, bridging research and public health applications. Ultimately, this study delivers both methodological innovation and actionable genomic resources, paving the way for more harmonized, high-resolution WGS-based surveillance of C. trachomatis worldwide.

Acknowledgements

The Portuguese National Reference Laboratory (NRL) for sexually transmitted infections want to thank all laboratories along the country that send C. trachomatis-positive specimens to INSA. The authors also thank the Technology and Innovation Unit of INSA for WGS support and Mário Ramirez and Rafael Mamede from Faculty of Medicine of the University of Lisbon (FMUL) and Gulbenkian Institute for Molecular Medicine (GIMM) for their kind support in releasing the novel cgMLST in Chewie-NS (https://chewbbaca.online/).

Funding

ZL was supported by a PhD fellowship from Fundação para a Ciência e a Tecnologia (FCT), Portugal (SFRH/BD/147446/2019). VM contribution was funded by national funds through FCT - Foundation for Science and Technology, I.P., in the frame of Individual CEEC 2022.00851.CEECIND/CP1748/CT0001 (doi: 10.54499/2022.00851.CEECIND/CP1748/CT0001). This study was also supported by the European Union project “Sustainable use and integration of enhanced infrastructure into routine genome-based surveillance and outbreak investigation activities in Portugal” - GENEO [101113460] on behalf of the EU4H programme [EU4H-2022-DGA-MS-IBA-01-02].

Conflict of Interest Statement

The authors have no conflict of interest to declare.

Data Availability Statement

All “Chlamydiales-classified” sequence reads (see Material and Methods) generated in this study are deposited in the European Nucleotide Archive (ENA) under the BioProject PRJEB32243. The novel cgMLST schema, and associated resources for immediate applicability for standardized, high-resolution WGS-based routine genomic surveillance, is publicly available at Zenodo repository (https://doi.org/10.5281/zenodo.17177579). An adapted version is also available at chewie-NS (https://chewbbaca.online/) ⁹⁴. The collection of scripts used to conduct these analyses are available at the github repository (https://github.com/insapathogenomics/allelic_diversity). For reproducibility, the version corresponding to this study is deposited in Zenodo (https://doi.org/10.5281/zenodo.17177356). All supplementary data and genome assemblies generated in this study are also available in Zenodo (https://doi.org/10.5281/zenodo.16814321).

Ethics Statement

Not applicable. The study was conducted exclusively on publicly available sequence data and on sequence data generated from anonymized samples, and all newly generated and released data was filtered to retain only reads classified as bacterial.

Transparency declarations

The authors declare that this manuscript represents an accurate, unbiased and honest study and that all data and materials are available within the article and/or supplementary files.

Authors contributions

VB designed the study. ZL, JI, RF, DC, CC and IJ performed molecular techniques and sequencing. ZL, VM and VB analysed the data and performed bioinformatics. ZL, VM and VB wrote the paper. VB and MJB coordinated the study. All authors revised the manuscript.

References

World Health Organization. The Diagnostics Landscape for Sexually Transmitted Infections. Geneva: World Health Organization, 2023 (2023)

Stephens RS, Tam MR, Kuo CC, Nowinski RC (1982) Monoclonal antibodies to Chlamydia trachomatis: antibody specificities and antigen characterization. J Immunol 128:1083–1089

Yuan Y, Zhang YX, Watkins NG, Caldwell HD (1989) Nucleotide and deduced amino acid sequences for the four variable domains of the major outer membrane proteins of the 15 Chlamydia trachomatis serovars. Infect Immun 57:1040–1049

Ending neglect attain sustainable development goals: road map neglected tropical diseases 2021–2030. World Health Organization

Harris SR et al (2012) Whole-genome analysis of diverse Chlamydia trachomatis strains identifies phylogenetic relationships masked by current clinical typing. Nat Genet 44:413–419

Workowski KA et al (2021) Sexually transmitted infections treatment guidelines, 2021. MMWR Recomm Rep 70:1–187

Thomson NR et al (2008) Chlamydia trachomatis: genome sequence analysis of lymphogranuloma venereum isolates. Genome Res 18:161–171

Savage EJ et al (2009) Lymphogranuloma venereum in Europe, 2003–2008. Euro Surveill 14

Dean D et al (2009) Predicting phenotype and emerging strains among Chlamydia trachomatis infections. Emerg Infect Dis 15:1385–1394

10.

Touati A, Peuchant O, Hénin N, Bébéar C, de Barbeyrac B (2016) The L2b real-time PCR targeting the pmpH gene of Chlamydia trachomatis used for the diagnosis of lymphogranuloma venereum is not specific to L2b strains. Clin. Microbiol. Infect. 22, 574.e7–9

11.

Manning C et al (2021) High-resolution genotyping of Lymphogranuloma Venereum (LGV) strains of Chlamydia trachomatis in London using multi-locus VNTR analysis-ompA genotyping (MLVA-ompA). PLoS ONE 16:e0254233

12.

Meyer T (2016) Diagnostic Procedures to Detect Chlamydia trachomatis Infections. Microorganisms 4:25

13.

Nieuwenhuis RF et al (2004) Resurgence of lymphogranuloma venereum in Western Europe: an outbreak of Chlamydia trachomatis serovar l2 proctitis in The Netherlands among men who have sex with men. Clin Infect Dis 39:996–1003

14.

Nieuwenhuis RF, Ossewaarde JM, van der Meijden WI, Neumann HA (2003) M. Unusual presentation of early lymphogranuloma venereum in an HIV-1 infected patient: effective treatment with 1 g azithromycin. Sex Transm Infect 79:453–455

15.

French P, Ison CA, Macdonald N (2005) Lymphogranuloma venereum in the United Kingdom. Sex Transm Infect 81:97–98

16.

Morré SA et al (2005) Real-time polymerase chain reaction to diagnose lymphogranuloma venereum. Emerg Infect Dis 11:1311–1312

17.

Schaeffer A, Henrich B (2008) Rapid detection of Chlamydia trachomatis and typing of the Lymphogranuloma venereum associated L-Serovars by TaqMan PCR. BMC Infect Dis 8:56

18.

Verweij SP et al (2011) Lymphogranuloma venereum variant L2b-specific polymerase chain reaction: insertion used to close an epidemiological gap. Clin Microbiol Infect 17:1727–1730

19.

Halse TA, Musser KA, Limberger RJ (2006) A multiplexed real-time PCR assay for rapid detection of Chlamydia trachomatis and identification of serovar L-2, the major cause of Lymphogranuloma venereum in New York. Mol Cell Probes 20:290–297

20.

Pannekoek Y et al (2010) Multi locus sequence typing of Chlamydia reveals an association between Chlamydia psittaci genotypes and host species. PLoS ONE 5:e14179

21.

Klint M et al (2007) High-resolution genotyping of Chlamydia trachomatis strains by multilocus sequence analysis. J Clin Microbiol 45:1410–1414

22.

Gravningen K et al (2012) Multilocus sequence typing of genital Chlamydia trachomatis in Norway reveals multiple new sequence types and a large genetic diversity. PLoS ONE 7:e34452

23.

Bom RJM et al (2013) Distinct transmission networks of Chlamydia trachomatis in men who have sex with men and heterosexual adults in Amsterdam, The Netherlands. PLoS ONE 8:e53869

24.

Stephens RS et al (1998) Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282:754–759

25.

Carlson JH, Porcella SF, McClarty G, Caldwell H (2005) D. Comparative genomic analysis of Chlamydia trachomatis oculotropic and genitotropic strains. Infect Immun 73:6407–6418

26.

Andersson SG, Kurland CG (1998) Reductive evolution of resident genomes. Trends Microbiol 6:263–268

27.

Andersson JO, Andersson SG (1999) Insights into the evolutionary process of genome degradation. Curr Opin Genet Dev 9:664–671

28.

Seth-Smith HMB et al (2009) Co-evolution of genomes and plasmids within Chlamydia trachomatis and the emergence in Sweden of a new variant strain. BMC Genomics 10:239

29.

Hadfield J et al (2017) Comprehensive global genome dynamics of show ancient diversification followed by contemporary mixing and recent lineage expansion. Genome Res 27:1220–1229

30.

Carlson JH et al (2008) The Chlamydia trachomatis plasmid is a transcriptional regulator of chromosomal genes and a virulence factor. Infect Immun 76:2273–2283

31.

Gong S, Yang Z, Lei L, Shen L, Zhong G (2013) Characterization of Chlamydia trachomatis plasmid-encoded open reading frames. J Bacteriol 195:3819–3826

32.

Song L et al (2013) Chlamydia trachomatis plasmid-encoded Pgp4 is a transcriptional regulator of virulence-associated genes. Infect Immun 81:636–644

33.

Wang Y et al (2011) Development of a transformation system for Chlamydia trachomatis: restoration of glycogen biosynthesis by acquisition of a plasmid shuttle vector. PLoS Pathog 7:e1002258

34.

Wang Y et al (2013) Transformation of a plasmid-free, genital tract isolate of Chlamydia trachomatis with a plasmid vector carrying a deletion in CDS6 revealed that this gene regulates inclusion phenotype. Pathog Dis 67:100–103

35.

Jones CA et al (2020) The Nature and Extent of Plasmid Variation in Chlamydia trachomatis. Microorganisms 8:373

36.

Kari L et al (2011) A live-attenuated chlamydial vaccine protects against trachoma in nonhuman primates. J Exp Med 208:2217–2223

37.

Zhong G (2017) Chlamydial Plasmid-dependent pathogenicity. Trends Microbiol 25:141–152

38.

Matsumoto A, Izutsu H, Miyashita N, Ohuchi M (1998) Plaque formation by and plaque cloning of Chlamydia trachomatis biovar trachoma. J Clin Microbiol 36:3013–3019

39.

Luu LDW, Kasimov V, Phillips S, Myers GSA, Jelocnik M (2023) Genome organization and genomics in Chlamydia: whole genome sequencing increases understanding of chlamydial virulence, evolution, and phylogeny. Front Cell Infect Microbiol 13:1178736

40.

Nunes A, Borrego MJ, Gomes JP (2013) Genomic features beyond Chlamydia trachomatis phenotypes: what do we think we know? Infect Genet Evol 16:392–400

41.

Joseph SJ et al (2012) Population genomics of Chlamydia trachomatis: insights on drift, selection, recombination, and population structure. Mol Biol Evol 29:3933–3946

42.

Borges V, Hyden P, Gomes JP, Rattei T (2020) Chlamydia Genomics. Chlamydia Biology: From Genome to Disease. Caister Academic

43.

Borges V et al (2019) Chlamydia trachomatis: when the virulence-associated genome backbone imports a prevalence-associated major antigen signature. Microb Genom 5

44.

Smelov V et al (2017) Chlamydia trachomatis strain types have diversified regionally and globally with evidence for recombination across geographic divides. Front Microbiol 8:2195

45.

Borges V et al (2021) Transcontinental dissemination of the L2b/D-Da recombinant Chlamydia trachomatis lymphogranuloma venereum (LGV) strain: Need of broad multi-country molecular surveillance. Clin Infect Dis 73:e1004–e1007

46.

Büttner KA et al (2025) Chlamydia trachomatis genomes from rectal samples: description of a new clade comprising ompA-genotype L4 from Argentina. Microb Genom 11

47.

Büttner KA et al (2024) OmpA sequencing and Multilocus Sequence Typing of Lymphogranuloma venereum cases in Buenos Aires reveal new chlamydia trachomatis genotypes. Microorganisms 12:587

48.

Lodhia Z et al (2025) Lymphogranuloma venereum (LGV) ompA-subvariants of the Portuguese collection of Chlamydia trachomatis, 2007–2023. Sex. Transm. Infect. sextrans–2024–056427

49.

Christiansen MT et al (2014) Whole-genome enrichment and sequencing of Chlamydia trachomatis directly from clinical samples. BMC Infect Dis 14:591

50.

Seth-Smith HMB et al (2013) Whole-genome sequences of Chlamydia trachomatis directly from clinical samples without culture. Genome Res 23:855–866

51.

Bowden KE et al (2021) Whole-genome enrichment and sequencing of chlamydia trachomatis directly from patient clinical vaginal and rectal swabs. mSphere 6

52.

Liu Y-Y et al (2022) A Listeria monocytogenes fast-tracing platform for global surveillance. PLoS ONE 17:e0267972LmTraceMap

53.

Leeper MM et al (2023) Evaluation of whole and core genome multilocus sequence typing allele schemes for Salmonella enterica outbreak detection in a national surveillance network, PulseNet USA. Front Microbiol 14:1254777

54.

Versteeg B et al (2018) Genomic analyses of the Chlamydia trachomatis core genome show an association between chromosomal genome, plasmid type and disease. BMC Genomics 19:130

55.

Jolley KA, Bray JE, Maiden MCJ (2018) Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3:124

56.

Silva M et al (2018) chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. Microb Genom 4

57.

Lodhia Z et al (2025) Distribution of Chlamydia trachomatis ompA-genotypes over three decades in Portugal. Sex Transm Infect 101:73–80

58.

Lodhia Z et al (2025) Surveying genetic markers of antibiotic resistance and genomic background in Chlamydia trachomatis: insights from a multiplex NGS-based approach in clinical strains from Portugal. J Antimicrob Chemother 80:1072–1079

59.

Borges V et al (2015) Chlamydia trachomatis in vivo to in vitro transition reveals mechanisms of phase variation and down-regulation of virulence factors. PLoS ONE 10:e0133420

60.

Pereira IS et al (2022) The type III secretion effector CteG mediates host cell lytic exit of chlamydia trachomatis. Front Cell Infect Microbiol 12:902210

61.

Gomes JP et al (2006) Correlating Chlamydia trachomatis infectious load with urogenital ecological success and disease pathogenesis. Microbes Infect 8:16–26

62.

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120

63.

Wood DE, Lu J, Langmead B (2019) Improved metagenomic analysis with Kraken 2. Genome Biol 20:257

64.

Lu J et al (2022) Metagenome analysis using the Kraken software suite. Nat Protoc 17:2815–2839

65.

Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A (2020) Using SPAdes De Novo Assembler. Curr Protoc Bioinf 70:e102

66.

Walker BJ et al (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9:e112963

67.

Borges V et al (2014) Complete Genome Sequence of Chlamydia trachomatis Ocular Serovar C Strain TW-3. Genome Announcements. 10.1128/genomea.01204-13

68.

Hunt M et al (2024) AllTheBacteria - all bacterial genomes assembled, available and searchable. bioRxiv 10.1101/2024.03.08.584059

69.

Borges V, Gomes JP (2015) Deep comparative genomics among Chlamydia trachomatis lymphogranuloma venereum isolates highlights genes potentially involved in pathoadaptation. Infect Genet Evol 32:74–88

70.

Olagoke O, Aziz A, Zhu LH, Read TD, Dean D (2025) Whole-genome automated assembly pipeline for strains from reference, and clinical samples using the integrated CtGAP pipeline. NAR Genom Bioinform 7:lqae187

71.

Rita, Ferreira (2013) Vítor Borges, Alexandra Nunes, Maria José Borrego, João Paulo Gomes. Assessment of the load and transcriptional dynamics of Chlamydia trachomatis plasmid according to strains’ tissue tropism. Microbiol Res 168:333–339

72.

Nunes A, Nogueira PJ, Borrego MJ, Gomes JP (2008) Chlamydia trachomatisdiversity viewed as a tissue-specific coevolutionary arms race. Genome Biol 9:1–13

73.

Borges V, Nunes A, Ferreira R, Borrego MJ, Gomes JP (2012) Directional Evolution of Chlamydia trachomatis towards Niche-Specific Adaptation. J Bacteriol. 10.1128/jb.01291-12

74.

Ferreira R et al (2014) In silico scrutiny of genes revealing phylogenetic congruence with clinical prevalence or tropism properties of Chlamydia trachomatis strains. G3 (Bethesda) 5:9–19

75.

Comanducci M, Ricci S, Cevenini R, Ratti G (1990) Diversity of the Chlamydia trachomatis common plasmid in biovars with different pathogenicity. Plasmid 23:149–154

76.

Pais SV et al (2019) CteG is a Chlamydia trachomatis effector protein that associates with the Golgi complex of infected host cells. Sci Rep 9:6133

77.

Pannekoek Y et al (2008) Multi locus sequence typing of Chlamydiales: clonal groupings within the obligate intracellular bacteria Chlamydia trachomatis. BMC Microbiol 8:42

78.

Cruz H, Pinheiro M, Borges V, ReporType: (2024) A Flexible Bioinformatics Tool for Targeted Loci Screening and Typing of Infectious Agents. Int J Mol Sci 25

79.

Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780

80.

Nunes A, Borrego MJ, Nunes B, Florindo C, Gomes JP (2009) Evolutionary dynamics of ompA, the gene encoding the Chlamydia trachomatis key antigen. J Bacteriol 191:7182–7192

81.

Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14:178–192

82.

Mixão V et al (2023) ReporTree: a surveillance-oriented tool to strengthen the linkage between pathogen genetic clusters and epidemiological data. Genome Med 15:43

83.

Zhou Z et al (2018) GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res 28:1395–1404

84.

Carriço JA et al (2006) Illustration of a common framework for relating multiple typing methods by application to macrolide-resistant Streptococcus pyogenes. J Clin Microbiol 44:2524–2532

85.

Severiano A, Pinto FR, Ramirez M, Carriço JA (2011) Adjusted Wallace coefficient as a measure of congruence between typing methods. J Clin Microbiol 49:3997–4000

86.

Barker DOR et al (2018) Rapid identification of stable clusters in bacterial populations using the adjusted Wallace coefficient. bioRxiv 10.1101/299347

87.

Mixão V et al (2025) Multi-country and intersectoral assessment of cluster congruence between pipelines for genomics surveillance of foodborne pathogens. Nat Commun 16:3961

88.

Darling ACE, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394–1403

89.

Kille B et al (2024) Parsnp 2.0: Scalable Core-Genome Alignment for Massive Microbial Datasets. bioRxiv 10.1101/2024.01.30.577458

90.

Minh BQ et al (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 37:1530–1534

91.

Uelze L et al (2020) Typing methods based on whole genome sequencing data. One Health Outlook 2:3

92.

Palma F et al (2024) Bacterial strain nomenclature in the genomic era: Life Identification Numbers using a gene-by-gene approach. bioRxiv 2024.03.11.584534 10.1101/2024.03.11.584534

93.

Zhou Z, Charlesworth J, Achtman M (2021) HierCC: a multi-level clustering scheme for population assignments based on core genome MLST. Bioinformatics 37:3645–3646

94.

Mamede R, Vila-Cerqueira P, Silva M, Carriço JA, Ramirez M (2021) Chewie Nomenclature Server (chewie-NS): a deployable nomenclature server for easy sharing of core and whole genome MLST schemas. Nucleic Acids Res 49:D660–D666

95.

Last AR et al (2018) Population-based analysis of ocular Chlamydia trachomatis in trachoma-endemic West African communities identifies genomic markers of disease severity. Genome Med 10:15

96.

Pickering H et al (2022) Genomics of ocular Chlamydia trachomatis after 5 years of SAFE interventions for trachoma in Amhara, Ethiopia. J Infect Dis 225:994–1004

97.

Spaargaren J et al (2005) Slow epidemic of lymphogranuloma venereum L2b strain. Emerg Infect Dis 11:1787–1788

98.

Bommana S, Somboonna N, Richards G, Tarazkar M, Dean D (2021) Tryptophan operon diversity reveals evolutionary trends among geographically disparate Chlamydia trachomatis ocular and urogenital strains affecting tryptophan repressor and synthase function. MBio 12

99.

Ghasemian E et al (2025) Evolutionary dynamics in the genome of ocular Chlamydia trachomatis strains from Northern Tanzania following mass drug administration. Microb Genom 11

100.

Seth-Smith HMB et al (2021) Ongoing evolution of Chlamydia trachomatis lymphogranuloma venereum: exploring the genomic diversity of circulating strains. Microb Genom 7

101.

Peuchant O et al (2016) Changing pattern of Chlamydia trachomatis strains in lymphogranuloma venereum outbreak, France, 2010–2015. Emerg Infect Dis 22:1945–1947

102.

Hayes LJ et al (1994) Evidence for naturally occurring recombination in the gene encoding the major outer membrane protein of lymphogranuloma venereum isolates of Chlamydia trachomatis. Infect Immun 62:5659–5663

103.

Joseph SJ et al (2023) Patterns of within-host spread of Chlamydia trachomatis between vagina, endocervix and rectum revealed by comparative genomic analysis. bioRxivorg 10.1101/2023.01.25.525576

104.

Gomes JP et al (2007) Evolution of Chlamydia trachomatis diversity occurs by widespread interstrain recombination involving hotspots. Genome Res 17:50–60

105.

Giles TN, Fisher DJ, Graham DE (2009) Independent inactivation of arginine decarboxylase genes by nonsense and missense mutations led to pseudogene formation in Chlamydia trachomatis serovar L2 and D strains. BMC Evol Biol 9:166

106.

da Cunha M et al (2014) Identification of type III secretion substrates of Chlamydia trachomatis using Yersinia enterocolitica as a heterologous system. BMC Microbiol 14:40

107.

Almeida F et al (2012) Polymorphisms in inc proteins and differential expression of inc genes among Chlamydia trachomatis strains correlate with invasiveness and tropism of lymphogranuloma venereum isolates. J Bacteriol 194:6574–6585

108.

Caldwell HD et al (2003) Polymorphisms in Chlamydia trachomatis tryptophan synthase genes differentiate between genital and ocular isolates. J Clin Invest 111:1757–1769

109.

Fehlner-Gardiner C et al (2002) Molecular basis defining human Chlamydia trachomatis tissue tropism. A possible role for tryptophan synthase. J Biol Chem 277:26893–26903

110.

Shaw AC, Christiansen G, Roepstorff P, Birkelund S (2000) Genetic differences in the Chlamydia trachomatis tryptophan synthase alpha-subunit can explain variations in serovar pathogenesis. Microbes Infect 2:581–592

111.

Putman TE, Suchland RJ, Ivanovitch JD, Rockey DD (2013) Culture-independent sequence analysis of Chlamydia trachomatis in urogenital specimens identifies regions of recombination and in-patient sequence mutations. Microbiology 159:2109–2117

112.

Brown AC, Christiansen MT (2017) Whole-genome enrichment using RNA probes and sequencing of Chlamydia trachomatis directly from clinical samples. Methods Mol Biol 1616:1–22

113.

Büttner KA et al (2025) Evaluating methods for genome sequencing of Chlamydia trachomatis and other sexually transmitted bacteria directly from clinical swabs. Microb Genom 11

114.

Argimón S et al (2016) Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb Genom 2:e000093

Tables

Fig. 4

C. trachomatis cgMLST loci found to be frequently disrupted or pseudogenized in lineage-specific contexts. The percentage (and color gradient) reflects the proportion of isolates within each lineage with pseudogene-like profile (i.e., lack of CDS prediction). The “not called” classification type of chewBBACA (https://chewbbaca.readthedocs.io/en/latest/user/modules/AlleleCall.html#outputs) varies across loci, with the majority corresponding to: “LNF” for CT_058, CT_101, CT_169, CT_170, CT_374 and CT_855; “ASM” for CT_105, CT_171, CT_373, CT_473, CT_832, CT_845. Note: for some loci, namely CT_832 and CT_845, the differential profile relies on the presence of alternative start codons across groups, without functional confirmation of them being true pseudogenes. The same occurs with other pseudogene-like loci previously reported ⁷⁴, such as CT_300 (a potential pseudogene in LGV), which in our cgMLST schema predicts CDS of different lengths within the ± 20% “size-threshold” tolerated by chewBBACA, thus leading to allele assignment (instead of the “not called” profile expected for a pseudogene). *Proportion of genomes from T2 ocular lineage that have at least one of the three loci of the trpRBA operon not called.

Supplementary files

Supplementary file S1.

List of DNA C. trachomatis-positive samples of INSA's collection (Portugal) evaluated for WGS selection (n = 850).

Supplementary file S2.

C. trachomatis genome dataset used in this study and available metadata (n = 1230).

Supplementary file S3.

Allelic matrix and loci statistics of the novel cgMLST schema for C. trachomatis.

Supplementary file S4.

ReporTree generated genome counts according to metadata categories.

Supplementary file S5.

Clustering information obtained at all possible threshold levels for the global dataset (n = 1230 genomes) with the novel schema using single-linkage hierarchical clustering (HC) or the GrapeTree MSTreeV2 (GT) algorithms, being already prepared to implement a nomenclature system with ReporTree, including a three-level hierarchical nomenclature code as used for genogroup definition in this study. This file also includes the clustering information at all possible levels obtained with the methods used for comparative benchmarking, as well as the inter-pipeline corresponding points (CS ≥ 2.85).

Supplementary file S6.

Identification of Stability Regions based on single linkage hierarchical clustering analysis.

Supplementary file S7.

Interactive heatmap of the cgMLST allelic variation across C. trachomatis isolates (n = 1230) displaying allele presence across loci (columns) and isolates (rows), ordered by cgMLST-based single-linkage hierarchical clustering (as in Fig. 2). Heatmap colors represent the dominant lineage in which each allele occurs: faint shades mark alleles exclusive to one lineage, while darker shades highlight alleles shared across lineages. Alleles that are dominant in at least 3 lineages (i.e., denoting fully or nearly conserved loci) are displayed in black. Loci that are absent in specific samples (denoted by allele “0”) are displayed in white. Hover tooltips display sample, locus, allele, and lineage. Static PNG/SVG figures can be downloaded.

Supplementary file S8.

Detailed summary of cgMLST alleles identified as “exclusive”, occurring only within a lineage, or “shared” between genomes of two or more lineages. A similar summary is also provided for alleles of the epidemic L4 strains.

Supplementary file S9.

Allele distribution across C.trachomatis lineages.

Supplementary figures

Supplementary figure S1.

Distribution of input parameters and sequencing outcomes in the targeted enrichment WGS of C. trachomatis. A) Distribution of C. trachomatis genome copies (left boxplot, n = 835) and dsDNA amount (right boxplot, n = 246) in the target enrichment WGS input volume (7 µL), with dots colored according to WGS selection status. WGS selected samples (n = 148) are represented in green and orange for genome copies and DNA amount, respectively, and non-selected samples in grey. The red dashed lines represent the thresholds of 10⁴ genome copies and 10 ng (both in 7 µL), used to prioritize samples for sequencing. B) Distribution of the number of C. trachomatis copies in the WGS input volume and the corresponding percentage of “on-target” reads in log₁₀ scale for the 148 isolates selected for WGS. Each point represents one sample submitted for targeted enrichment WGS, being colored in dark blue when passing the inclusion criteria for the final curated genome dataset and in light blue when not passing these criteria. “On-target” reads were assumed as QC-passed reads classified as Chlamydiales, i.e., reads used for C. trachomatis genome assembly (see Materials and Methods).

Supplementary figure S2. Evaluation and benchmarking of the novel C. trachomatis cgMLST schema. A)

Distribution of the number of unique alleles per locus of the cgMLST schema. B) Comparison of the distribution of the percentage of loci called per sample in the novel cgMLST schema (blue) and the local deployment of the PubMLST schema with chewBBACA allele caller (orange) using the global C. trachomatis dataset (n = 1230). C) Number of partitions (clusters and singletons) obtained at each possible distance threshold level in the novel cgMLST schema and the local deployment of the PubMLST schema with chewBBACA ⁵⁶, both using ReporTree single-linkage hierarchical clustering ⁸². D) Heatmap of the congruence score (0 - no congruence; 3 - absolute congruence) obtained for the pairwise comparisons performed between all possible threshold combinations of the two approaches analysed in C. E) Number of partitions at each possible distance threshold level in the novel cgMLST schema using ReporTree single-linkage hierarchical clustering and GrapeTree MSTreeV2 algorithm ^82,83. F) Heatmap of the congruence score obtained for the pairwise comparisons performed between all possible threshold combinations of the two approaches analysed in E. G) Number of partitions at each possible distance threshold level in the novel cgMLST schema and the curated core genome alignment of the 53 complete assemblies, both cases using ReporTree single-linkage hierarchical clustering ⁸². H) Heatmap of the congruence score (0 - no congruence; 3 - absolute congruence) obtained in the pairwise comparisons performed between all possible threshold combinations of the two approaches analysed in G. I) Number of partitions at each possible distance threshold level in the novel cgMLST schema and the Parsnp ⁸⁹ alignment of the global dataset, both using ReporTree single-linkage hierarchical clustering ⁸². J) Heatmap of the congruence score obtained for the pairwise comparisons performed between all possible threshold combinations of the two approaches analysed in I.

Supplementary figure S3. Neighbor-joining phylogenetic tree of C. trachomatis ompA alleles identified by cgMLST in the studied genome dataset.

Nodes are coloured by the ompA genotype. Supplementary file S4 provides a summary of ompA allele distributions across lineages, and Supplementary file S2 details the ompA alleles identified in individual genomes. Allele 112, which was identified in a single genome, was excluded, as a shorter CDS was inferred by cgMLST. Scale bar represents numbers of substitutions per site.

Supplementary figure S4. Core SNP maximum-likelihood phylogenetic tree of the epidemic L2b C. trachomatis strain.

The tree was reconstructed from a core alignment comprising 156 SNVs using IQ-TREE v2.1.4 (68) and rooted to the earliest available L2b sequences from the 1980s (ERR348840 and ERR348841). *All recombination-driven SNVs within the previously identified ompA-containing recombinant region of the L2b/D-Da strain (NC_010280; ~positions 55,221–59,461) (25, 27) were excluded prior to phylogenetic analysis. Edges are coloured by the ompA-genotype and nodes by LGV ompA subvariant (with ‘new’ indicating additional subvariants not included in a recent compilation) ⁴⁸. Scale bar represents numbers of SNPs.

Supplementary figure S5.

Tanglegram comparing cgMLST-based single-linkage hierarchical clustering built from the whole cgMLST scheme (left) and plasmid-only loci (right). Branches, as connectors linking the same isolates in both trees, are colored by lineage. Distances are based on allelic differences.

Supplementary figure S6. Sankey plot illustrating the relationships between Chlamydia trachomatis classifications at four levels

ompA class (historical “disease groups” inferred from the traditional ompA classification*), ompA-genotype, cgMLST lineage, and MLST-derived sequence type (ST). Each node represents a category within a given stage, and the width of the connecting flows is proportional to the number of isolates sharing the corresponding combination of categories. *Grouping in “ompA_class” was strictly based on ompA sequence, so the recombinant L2/L2b–D/Da strain ^43,45 was classified as “genital”, as most of its sequence (~ 75%) derives from a genital genotype (D/Da). Supplementary files S2 and S4 provide source data.

Yes