A robust classifier for the intrinsic consensus molecular subtypes in colorectal cancer.
A
Dr
PetrosTsantoulisMD, PhD
1,5✉YouraeHong2
PratyakshaWirapati3
TingPu2
AllysonM.Peddle2
ThomasMckee4
SabineTejpar2
1A
A
A
Service d’oncologie de précisionHôpitaux Universitaires de Genève 2Molecular Digestive Oncology, Department of OncologyKatholieke Universiteit LeuvenLeuvenBelgium
3Faculté de MédecineUniversité de Genève
4Service de pathologie cliniqueHôpitaux Universitaires de Genève
5Service d’Oncologie de Précision– Hopitaux Universitaires de GenèveRue Gabrielle- Perret-Gentil 41205GenèveSwitzerland
Petros Tsantoulis,1 Yourae Hong,2 Pratyaksha Wirapati,3 Ting Pu,2 Allyson M. Peddle, 2 Thomas Mckee,4 Sabine Tejpar2
Affiliations
1. Hôpitaux Universitaires de Genève, Service d’oncologie de précision
2. Molecular Digestive Oncology, Department of Oncology, Katholieke Universiteit Leuven, Leuven, Belgium.
3. Université de Genève, Faculté de Médecine
4. Hôpitaux Universitaires de Genève, Service de pathologie clinique
Corresponding author:
Dr Petros Tsantoulis, MD, PhD, Petros.Tsantoulis@hug.ch
Service d'Oncologie de Précision, – Hopitaux Universitaires de Genève– Rue Gabrielle-Perret-Gentil 4, 1205 Genève - Switzerland
Keywords
Colorectal cancer
intrinsic consensus molecular subtypes
iCMS
CMS
A
Abstract
Background
An analysis of colorectal cancer epithelial cells showed two intrinsic subtypes called iCMS2 and iCMS3, in addition to the bulk consensus subtypes (CMS1-4). The intrinsic subtypes can be prognostically important in some patient subsets and may prove predictive of response to treatment. We present a method for calculating the iCMS subtypes that is robust to technical variation and is designed for single-sample applications.
Results
A single-sample classifier (SSC) was developed based on non-parametric correlation similarity with gene expression centroids, synthetically created by resampling samples with known iCMS classes from public datasets that have been used in the derivation of the iCMS classification. We selected the subset of iCMS genes (N = 201) with the strongest epithelial expression in colorectal cancer, aiming to reduce unrelated, non-epithelial variation. The SSC calculates the most likely iCMS class based on the distribution of the classes of the nearest centroids with either an absolute cutoff or K-nearest-neighbors voting.
We used new data from the Geneva Tumor Registry cohort (GTR) for formal validation, without applying prior batch correction or denoising. Ground truth was established by exhaustively estimating the distance of the GTR samples from the reference iCMS dataset in the space of over 12000 genes, and assigning the most likely iCMS by proximity to a known reference sample. In the GTR cohort the SSC achieved a accuracy ranging from 86 to 88% with or without batch correction, outperforming the previously published nearest template predictor, which only reached 75.4% without batch correction and 89.8% with batch correction.
The SSC was more robust to the addition of noise, missing genes or simulated contamination with the opposite iCMS class. In addition, the SSC was applied to data from the VELOUR trial, for which reference iCMS calls were also available. The SSC iCMS was prognostic for OS in iCMS2 vs iCMS3 (P < 0.00001) and performed at least as well as the reference iCMS, which was also prognostic (P = 0.0001).
A
Conclusions: The SSC reproduces the iCMS classification and appears significantly more robust to batch effect. The SSC calls are prognostic in clinical trial data. An R package is available for download (
https://github.com/CRCrepository/iCMS.SSC).
A
Background
Colorectal cancer presents a complex, heterogeneous bulk gene expression profile that converges on four major molecular subtypes, previously described as CMS1 to CMS4.[1] The stroma-rich CMS4 subtype is typically associated with worse relapse-free survival, while the CMS1 subtype is enriched in tumors with microsatellite instability.[1] These subtypes arise from a mixture of multiple cell types with very different gene expression patterns.
The advent of single-cell RNA sequencing enabled the analysis of individual cell types. An analysis focused on epithelial cells from single-cell RNA sequencing of five colorectal cancer cohorts with 373’058 cells showed two main subtypes which are strongly correlated with bulk CMS2/3 and were accordingly labelled intrinsic CMS2/3 (iCMS).[2] The bulk CMS1 subtype is almost exclusively associated with epithelial iCMS3, while the stroma-rich bulk CMS4 subtype can be associated with either epithelial iCMS2 or iCMS3. Interestingly, the epithelial component can have a profound effect on prognosis. Bulk CMS4 with iCMS3 epithelial cells has worse survival than CMS4 with iCMS2 epithelial cells, demonstrating the importance of the iCMS classification.
Wide adoption of the iCMS classification and further research on its clinical utility would be greatly facilitated by a standardized algorithm implemented in an easily available software package. Clinical application of gene expression classifiers must be robust to technical variation and batch effect, as patients are classified individually, for example during clinical trial screening and are not always part of a batch cohort. The previously published method for iCMS classification was not specifically optimized for application to individual samples, when batch correction and normalization is not feasible[2]. Here we provide the implementation of a single-sample classifier that replicates the results of the reference algorithm with improved performance in a new, unseen, non-batch-corrected validation dataset.
Methods
Population
The study population has been previously described[3]. Briefly, the Geneva Cancer Registry database collected data and tissue for all colorectal cancer patients diagnosed between 1985 and 2013. Samples with available fresh frozen material were selected for gene expression profiling with RNAseq (N = 384), these ranged from 2002 until 2011.
RNA sequencing
Cryosections of the fresh frozen blocks were cut and the first and last sections stained using standard techniques and the tumor content verified by a trained pathologist. Only samples with a tumor cell content of more than 30% were used for further analysis. RNA was extracted from the intervening sections using the RNAeasy kit (Qiagen, Netherlands) and its’ quality and quantity controlled using a tape station (Agilent, Santa Clara, CA, USA).
RNA-seq libraries were prepared using the TruSeq Stranded Total RNA Library Prep Human/Mouse/Rat reagents (Illumina, San Diego, USA). Following quality assessment on the Fragment Analyzer (Agilent, Santa Clara, USA), libraries were 100nt paired-end sequenced to a target of 50 million clusters per library on the Illumina NovaSeq 6000 (Illumina, San Diego, USA).
The GTR cohort has a few outlier samples, based on both RNAseq quality metrics and PCA (Figure S1A). Use of the RUV algorithm for batch effect correction seems to attenuate the difference of these samples from the main population and also slightly reduces total variance (Figure S1B)[4]. We did not remove samples that were of potentially lower quality, given our goal to develop a classifier that will be useful in a real-life setting, where sample quality can vary, and the acquisition of a new sample can incur delays and risk for patients.
Single-sample classifier
For the single-sample classifier (SSC) we focused on a subset of the previously published iCMS genes[2] whose expression in epithelial cells from public single-cell datasets[5] was statistically significantly higher than in all other major cell types by a log-fold change of at least 1 and whose log-normalized average expression was also at least 1. These genes are conceptually less likely to be sensitive to biological noise from other cell types. The final gene subset contains 201 genes, shown in Table S1.
The SSC is based on non-parametric correlation distance (Kendall’s τ) from synthetic nearest neighbors, as shown in Fig. 1. We used three non-batch corrected datasets from TCGA[6], PETACC-3[7] (E-MTAB-990) and GSE39582[8] with a total of 1779 samples to generate synthetic centroids for each class. Importantly, these datasets were included during the initial iCMS derivation, and the reference iCMS class of each sample is known. The expression of the 201 genes was corrected by mean-centering for the purposes of visualization, since the three datasets are based on different assays (array and RNAseq). No other normalization or batch correction was applied, and significant batch effect remained (Fig. 1), a fact that was intended to capture real life conditions and ensure that the classifier would be robust to such variation.
A
Each training dataset was used to extract 32 exemplar synthetic centroids for each class, for 192 centroids in total, by calculating the mean expression of each gene in 8 randomly selected samples of the same class (iCMS2 or iCMS3). The most likely class for each input sample is estimated by comparing the 90th percentiles of the distance distributions for each class (distribution quantile [DQ] criterion) for each class with an absolute cutoff that can be set to zero if the nearest, most likely class is desired or higher if a confident call is necessary. During training, a minimum correlation distance of 0.1 or a minimum difference in correlation distance of 0.05 was used as a criterion to determine confident calls.
Alternatively, a K-nearest neighbor voting criterion (KNN) can also be used, with a default K = 10 for similar results. In this case, a confident call is made if at least 90% of neighbors (default = 9) are of the same class. A function wrapper is provided to enable parallel computation of multiple samples.
The previously published bulk RNAseq classification was based on nearest template prediction and the reference iCMS genes [2, 9]. A fast parallel implementation of this algorithm was developed for the purposes of performance comparisons and is provided in the accompanying R package (bulk NTP classifier).
Generating reference calls in the validation dataset
Ground truth in the unseen data (GTR cohort) was established by integrating the GTR cohort with the full previously published bulk dataset, that was used during the derivation of the iCMS[2]. Briefly, the reference dataset was clustered using hierarchical Ward's method, as previously. Based on the ordering in the tree, each new unseen sample was assigned an integer index indicating the position and its proximity to the nearest reference sample based on Pearson's correlation. The indices of the respective positions of the nearest references were assigned to the test samples, and used to sort the samples on the test heatmap. The genes of the test heatmap used the same subset and ordering as in the reference heatmap. This method is time and compute intensive since it uses a high number of samples (N = 2873) and genes (N ~ 12000) to establish the best matching iCMS.
Gene enrichment analysis
Gene set enrichment analysis was performed using the clusterProfiler R package (v4.8.3). Hallmark gene sets were obtained from MSigDB via the msigdbr package and formatted into a two-column TERM2GENE data frame (pathway name and Entrez gene ID). The previously published iCMS genes and the selected SSC genes were converted to Entrez IDs and input as separate clusters to compareCluster(fun="enricher", TERM2GENE = TERM2GENE,…), with Benjamini–Hochberg adjustment for multiple comparisons. Results were summarized and visualized using dot plots in enrichplot.
Statistical methods
The analyses were performed in the R environment for statistics. Survival analyses were performed with the survival packages using the logrank test. The alpha cutoff for statistical significance was set to 0.05.
Software availability
The method presented here is available in an R package that can be downloaded from github (https://github.com/CRCrepository/iCMS.SSC).
Results
Gene selection and training
As discussed above, the genes used for the classification were selected based on strong and specific epithelial expression. In the training set, use of the epithelial genes (N = 201) demonstrated a significant increase in the desirable distance variance explained by class (Fig. 2A), while reducing unrelated variance (Fig. 2B). In addition, gene set enrichment analysis of the C2 curated signatures, C4 cancer signatures, C8 cell type signatures from MSigDB and the Reactome signatures (Fig. 2C-F) showed that the epithelial gene subset was enriched for the same pathways as the larger iCMS gene set, with favorable gene ratios.
During training, the SSC confident calls achieved over 98% agreement with the reference calls in the for both the DQ and KNN criteria (Fig. 2G). A confident call could be made for the vast majority of training samples: 89.2% of samples with the DQ criteria, 87.3% of samples with the KNN criteria. Agreement was also high when using the nearest (not necessarily confident) calls, at 92.2% for DQ and 93.8% for KNN. One fourth (25.5%) of the samples had a significant similarity to both iCMS classes, defined as an absolute difference of less than 0.1 between the closest iCMS2 and iCMS3 neighbors. Although a “mixed” iCMS is not formally defined at the biological level, these samples did seem to have slightly worse RFS than the “pure” iCMS classes (HR = 1.25, 95% 1.05–1.48, P = 0.011).
The Geneva Tumor Registry colorectal cancer cohort
The gene expression data from the Geneva Tumor Registry colorectal cancer cohort (GTR) presented a typical colorectal cancer profile. As shown in Fig. 3A, right-sided tumors were more often MSI and CMS1, and almost all BRAF-mutated tumors were right-sided. The CMS1 group was infiltrated with immune cells (orange genes). The CMS4 group was stroma-rich (green genes), and was split into an iCMS2 and iCMS3 subgroup, as previously demonstrated.
The GTR cohort was not used during the development of the SSC but was solely used for validation. By integrating and re-clustering the data with the previously published reference dataset (N = 2873 samples, Fig. 3E in [2]) we were able to estimate the most likely iCMS class for each GTR sample, if it had been included in the initial iCMS derivation. This class was used as the reference iCMS. Using this time-consuming and resource intensive approach, 87.6% of samples could be confidently mapped to a most probable iCMS class.
The SSC achieved excellent agreement with reference calls with both the DQ and KNN criteria, surpassing 90% agreement for confident calls (Fig. 3B). The label swaps of iCMS3 to iCMS2 were more common than the opposite, especially with the DQ criteria, while most of the ambiguity concerned samples that could not be confidently classified by either the reference or the SSC (Fig. 3C-D). The fraction of calls that were considered confident was quite high for all bulk methods, over 85% without applying the RUV algorithm for batch correction (Fig. 3E).
As discussed, the SSC can always deliver a nearest sample call, even when not statistically confident. When the most likely, not necessarily confident, calls were considered, the SSC achieved a concordance ranging from 86 to 88% with or without batch correction (Fig. 3F), outperforming the previously published bulk NTP predictor, which only reached 75.4% without batch correction (89.8% with batch correction). Both the KNN and DQ criteria were robust to batch correction and achieved very comparable performance with, and without batch corrections (Fig. 3B, 3F). Interestingly, the proportion of confident calls with the KNN criterion somewhat increased with batch correction, but the agreement of confident calls slightly decreased.
Robustness to perturbation
We also tested the resistance of the SSC and the bulk NTP methods to the addition of noise in the gene expression data, simulating situations where the results are of poor quality. Specifically, we added random uniform noise in increasing ranges defined as multiples of the standard deviation of each gene in the validation GTR cohort. This resulted in a significant fraction of the total gene expression variance being random and therefore non-informative, from 10% to more than 80% on average, with obvious changes in the gene expression heatmap (Figure S3 shows examples for -/+1SD, 2SD and 3SD). All methods, including the previously published bulk NTP classifier, were quite robust to the addition of noise and maintained very stable performance for nearest iCMS calls, up to about 70% of noise in the total variance (Fig. 4A). Performance declined rapidly thereafter, but the SSC remained superior, with a slight advantage for the KNN criteria.
Ideally, the SSC should also be applied to assays for which the number of available genes may be reduced, for example when gene panels are used. We therefore tested what the performance would be if random iCMS genes, selected from the 201 classifier genes, were not available for analysis (Fig. 4B). Again, the performance was remarkably stable when most of the genes were removed and remained acceptable even when only 20–30 genes were available. The bulk NTP classifier is shown for reference, although it normally relies on the larger full set of iCMS genes. Its performance was on average relatively constant without batch correction, it presented much more variation depending on which genes were available for analysis.
Finally, some tumors may have similarity with both iCMS2 and iCMS3, possibly as a result of mixed tumor components, and it is not clear how the classification would be influenced by variable iCMS2/iCMS3 mixture. We therefore tested the behavior of different methods in synthetic (artificial) mixtures of iCMS2 and iCMS3 samples. Briefly, for each synthetic sample we picked three random samples from our validation cohort and calculated the mean gene expression. Thus, we generated four cohorts of synthetic samples with pure iCMS2 content, pure iCMS3 content and mixed content (1/3 or 2/3 iCMS2, the rest being iCMS3), as shown in Figure S4A. The profile of mixed samples was mostly aligned with the proportion of iCMS2/3 content, with samples containing a majority of iCMS2 aligning with pure iCMS2 and inversely samples with 2/3 iCMS3 aligning with pure iCMS3. The likeness of the sample to iCMS2 or iCMS3 as estimated by the DQ classifier mostly reflected the composition (Figure S4B), with a slightly higher sensitivity for iCMS2. The bulk NTP algorithm was very sensitive to the presence of iCMS2 content. All samples with pure iCMS2 or 2/3 iCMS2 content were labeled as iCMS2. However, only 14% of samples with 2/3 iCMS3 and only 64% of samples with pure iCMS3 content were labeled as iCMS3 (Fig. 4C).
Correlation with clinical endpoints
A
The SSC was applied to unseen data from the VELOUR trial to test for associations with relapse-free and overall survival. As expected, the bulk NTP classifier was prognostic with a hazard ratio (HR) of 0.778 (0.62–0.97, P = 0.025) for PFS and a HR of 0.653 (0.52–0.82, P = 0.0002), both in favor of iCMS2 (Fig. 
5). Similarly, the SSC iCMS was prognostic with the DQ criteria for PFS (HR 0.752, 0.60–0.94, P = 0.0112) and OS (HR = 0.601, 0.48–0.75, P < 0.0001), but also with the KNN criteria (PFS HR 0.702, 0.56–0.88, P = 0.0026, OS HR = 0.571, 0.45–0.72, P < 0.0001). Overall, the SSC seems to perform at least as well as the NTP classifier.
Compute time requirements
The bulk NTP algorithm is permutation-based and requires a relatively long computation time. The total CPU time required to execute the bulk NTP classifier on the GTR cohort was 124.5 seconds, against 12.5 seconds for the SSC(DQ) and 12.4 for SSC(KNN). The SSC and the bulk NTP algorithm were both implemented for parallel execution. With 8 parallel jobs the real elapsed time was shorter, at 19.8, 2.0 and 1.7 seconds respectively in a mainstream personal computer.
Discussion
We present a robust iCMS classifier (SSC) that is specifically tailored for analyses in a per sample setting, where batch correction methods such as RUV are not applicable. The SSC obtains results that are highly concordant with the iCMS classification when replicated by clustering a new, previously unseen validation cohort with the reference iCMS dataset. Further, it is robust to technical and biological perturbation and remains highly prognostic in a retrospective analysis of the VELOUR clinical trial.
The SSC was developed with minimal processing and quality control of the training datasets, a process that resulted in significant variation between the centroids that are used to infer similarity with iCMS2 and iCMS3. Instead of using two templates as in the NTP method, the SSC extends this to 192 centroids, which should better capture technical and biological variability. Further, the reduction of the reference genes to those that are predominantly and specifically expressed in epithelial cells seems to also increase the desirable variability between classes, without affecting the enriched pathways.
As shown in the initial iCMS publication, the iCMS prognostic value derives mostly from splitting the stroma-rich CMS4 samples by epithelial iCMS class, with iCMS3/CMS4 having the worst prognosis of all molecular subtype combinations. Here we show that the prognostic value of the SSC is comparable with that of the bulk NTP classifier in clinical trial data. Based on the analysis of synthetic “mixed” samples, the NTP classifier seems particularly sensitive to the presence of an iCMS2 component. It is therefore possible that for samples that are biologically heterogeneous the NTP will tend to classify them as iCMS2, a hypothesis that is compatible with the greater proportion of iCMS2 in the VELOUR trial. Further analyses in multiple trials are planned, especially in relation to clinicopathological parameters.
Multiregion sampling suggests that colorectal cancer can be highly heterogeneous between tumor regions, with prognostic consequences[10]. In that study, 40% of tumors were found to have CMS variability, typically mixtures of CMS1/3, CMS2/4 or CMS1/4. Although the iCMS was not developed with the intent of being robust to heterogeneity, other studies suggest that it is more stable than CMS between regions but also between metastases, especially for confident iCMS calls[10]. The data presented here do not evaluate the stability of the iCMS across regions or metastases, but the SSC DQ iCMS distance metric may allow some separation of iCMS mixtures, as shown in synthetic data (Figure S4). Interestingly, tumor with similarities to both iCMS2 and iCMS3 (“mixed”) seemed to have unfavorable prognosis, an observation that could point to either a complex ecosystem or significant plasticity[11]. We hope to that more evidence from multi-region and multi-sample cohorts and a deeper biological understanding will be available soon.
Conclusions
Further research in colorectal cancer will be facilitated by the existence of a standardized software package for the estimation epithelial intrinsic molecular subtypes. The iCMS was derived in a non-supervised manner and its prognostic and predictive associations are not fully explored. We hope that by providing a package that is specifically tailored for single-sample analyses, more applications will be made possible. Exciting new research should complement the epithelial iCMS axis with information from other tissue components, such as the tumor stroma and immune infiltrates.
A
Availability of data and materials
RNAseq data can be obtained from [IN THE PROCESS OF UPLOADING THE DATA].
A
Authors' contributions
Conceptualization: Sabine Tejpar, Petros Tsantoulis, Thomas McKee
Methodology: Petros Tsantoulis, Pratyaksha Wirapati
Investigation : Petros Tsantoulis, Yourae Hong, Ting Pu, Allyson Peddle
Draft preparation: Petros Tsantoulis, Yourae Hong
Draft review and editing : Thomas McKee
Acknowledgements
The authors acknowledge the support of Paolo Angelino (Swiss Bioinformatics Institute) and Melinda Charrier during the process of manuscript submission.
Electronic Supplementary Material
Below is the link to the electronic supplementary material
References
1.Guinney J, Dienstmann R, Wang X, de Reyniès A, Schlicker A, Soneson C, et al. The consensus molecular subtypes of colorectal cancer. Nat Med. 2015;21:1350–6.
 2.Joanito I, Wirapati P, Zhao N, Nawaz Z, Yeo G, Lee F, et al. Single-cell and bulk transcriptome sequencing identifies two epithelial tumor cell states and refines the consensus molecular classification of colorectal cancer. Nat Genet. 2022;54:963–75.
 3.Benhamou S, Fournier E, Puppa G, McKee T, Ris F, Rubbia-Brandt L, et al. Cohort profile: population-based cohorts of patients with colorectal cancer and of their relatives in Geneva, Switzerland. BMJ Open. 2022;12:e063914.
 4.Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–52.
 5.Lee H-O, Hong Y, Etlioglu HE, Cho YB, Pomella V, Van den Bosch B, et al. Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nat Genet. 2020;52:594–603.
 6.Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487:330–7.
 7.Van Cutsem E, Labianca R, Bodoky G, Barone C, Aranda E, Nordlinger B, et al. Randomized phase III trial comparing biweekly infusional fluorouracil/leucovorin alone or with irinotecan in the adjuvant treatment of stage III colon cancer: PETACC-3. J Clin Oncol. 2009;27:3117–25.
 8.Marisa L, de Reyniès A, Duval A, Selves J, Gaub MP, Vescovo L, et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med. 2013;10:e1001453.
 9.Hoshida Y. Nearest template prediction: a single-sample-based flexible class prediction with confidence assessment. PLoS ONE. 2010;5:e15543.
 10.Langerud J, Eilertsen IA, Moosavi SH, Klokkerud SMK, Reims HM, Backe IF, et al. Multiregional transcriptomics identifies congruent consensus subtypes with prognostic value beyond tumor heterogeneity of colorectal cancer. Nat Commun. 2024;15:4342.
 11.Cammareri P, Raponi M, Hong Y, Billard CV, Peckett N, Zhu Y et al. Loss of colonic fidelity enables multilineage plasticity and metastasis. Nature. 2025.
 Table S1. (xls)
Gene subset