Clifti-GPT: Privacy-preserving federated fine-tuning and transferable inference of foundation models on clinical single-cell data

Mohammad Bakhtiari¹⁺, Maria Louise Elkjaer^1,2, Ali Oğuz Can^3,4, Fabian Theis^3,4,5, Mhaned Oubounyt¹*, Jan Baumbach^1,6*

¹ Institute for Computational Systems Biology and Center for Data and Computing in Natural Sciences, Universität Hamburg, 22761 Hamburg, Germany

² Institute for Inflammation Research, Center for Rheumatology and Spine Diseases, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark

³ Institute of Computational Biology, Helmholtz Center, Munich, Germany

⁴ TUM, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany

⁵ TUM School of Life Sciences, Technical University of Munich, Germany

⁶ Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark

* joint last author

⁺ Corresponding author, e-mail: mohammad.bakhtiari@uni-hamburg.de

Abstract

Foundation models have demonstrated immense value for scRNA-seq analysis, but their fine-tuning or inference on heterogeneous, privacy-sensitive clinical cohorts is governed by strict data protection policies, which often prohibit centralization. We introduce clifti-GPT, a privacy-preserving federated solution that leverages secure multiparty computation to enable collaborative model training and transferable inference of local statistics in zero-shot applications across decentralized scRNA-seq clinical repositories, without sharing patient data or clinical-level statistics or models. Built upon the scGPT foundation model, clifti-GPT achieves performance within 4% of centralized baselines in accuracy, precision, recall, and macro-F1 for cell type classification and reference mapping across six datasets. Furthermore, it demonstrates high communication efficiency, reaching 99% of centralized performance in fewer than two rounds, and scales robustly to 30 clients with less than 2% accuracy loss. Thus, clifti-GPT makes it feasible to fine-tune and apply single-cell foundation models across distributed clinical datasets under real-world privacy and governance constraints.

Introduction

Single-cell RNA sequencing (scRNA-seq) has significantly advanced our understanding of cellular heterogeneity, cell lineage tracking, and disease mechanisms, enabling comprehensive data atlases like the Human Cell Atlas¹. The rapid expansion of scRNA-seq data calls for advanced methods to manage and analyze it so that meaningful biological insights can be extracted effectively. Foundation models, such as those built on the self-attention transformer architecture, offer a promising solution. These models, pretrained on large-scale, diverse datasets, have shown remarkable success in various fields, including biological applications such as Enformer², which predicts gene expression directly from DNA sequence by modeling long-range regulatory interactions. By leveraging generative pretraining, foundation models can outperform task-specific models and adapt to various downstream tasks, making them well suited to the complex and expansive datasets in single-cell research. This approach has the potential to unify and enhance the currently fragmented machine-learning methods in scRNA-seq (cell typing, trajectory analysis, cell-cell communication, perturbation effects) within one framework, enabling more comprehensive and scalable analyses.

Foundation models for scRNA-seq have advanced rapidly, opening new avenues for leveraging large-scale data in biological research. Early models like scBERT³ and Geneformer⁴ introduced the concept of pretraining on vast single-cell datasets, enabling the transfer of learned representations to specific downstream tasks such as cell type annotation and gene function prediction. Building on these foundations, the emergence of models like scGPT⁵, UCE⁶ and scFoundation⁷ has further solidified this approach. scGPT, a foundation model specifically designed for scRNA-seq data, leverages the self-attention transformer architecture to achieve state-of-the-art performance across various single-cell applications. scGPT is pre-trained on the extensive CELLxGENE scRNA-seq datasets (https://cellxgene.cziscience.com/), which encompass a wide range of cell types, tissues, species, and conditions, providing a comprehensive resource for learning intricate patterns across different biological contexts. This extensive pre-training enables scGPT to excel in tasks such as multi-omics data integration, cell type annotation, and gene function prediction. Similarly, scFoundation offers a scalable framework for querying and analyzing diverse cell atlases, while models like SCimilarity⁸ exemplify the utility of foundation models in identifying cellular commonalities across various conditions, such as cancer or immune disorders. Additionally, nicheFormer⁹ introduces a hybrid transformer-graph architecture tailored for sparse, high-dimensional single-cell data, enhancing representation learning in both dissociated and spatial contexts. Complementing this, scGPT-Spatial¹⁰ integrates spatial transcriptomics modalities by combining positional encoding with gene expression embeddings. Importantly, it retains full compatibility with dissociated scRNA-seq data, thereby unifying spatial and non-spatial analyses within a single foundation framework.
Collectively, these models represent a transformative shift in scRNA-seq analysis, offering unified and scalable solutions that transcend traditional task-specific methods. Additionally, models like scJoint¹¹ and scTab¹² have significantly influenced the field by integrating multimodal data and scaling cell type classification across tissues, paving the way for more robust, scalable foundation models in the single-cell field.

As scRNA-seq enters the age of large-scale foundation models, two main privacy challenges demand attention. First, raw single-cell count matrices are not truly “anonymized” data: they encode genotype–phenotype correlations, essentially genetic fingerprints that link individuals to their health conditions, which adversaries can exploit to re-identify donors or infer sensitive traits^13,14. Even after standard preprocessing, these matrices can unintentionally leak private information, such as individual identities or disease status, when compared against public or auxiliary datasets¹³. Linkage attacks using public eQTL maps¹³ or cross-dataset triangulation¹⁵ demonstrate that even de-identified profiles can reveal disease status or family relationships^14–16, exacerbating legal exposure under GDPR Article 4(1)¹⁷. Second, institutional data-governance frameworks impose stringent policies on who may access, analyze, or publish both raw data and derived discoveries, such as novel cell types or mutations, driven by intellectual-property, publication-priority, and regulatory-compliance considerations¹⁸. Conventional data sharing or centralized model training conflicts with these governance mandates, risking both patient privacy and institutional liability. These persistent vulnerabilities underscore the need for privacy-preserving, governance-compliant workflows. By keeping raw and minimally processed counts on-premises and exchanging only encrypted, aggregated model and statistics updates, researchers can mitigate re-identification risks and satisfy institutional governance policies without sacrificing analytical breadth or model performance.

Federated Learning (FL)¹⁹ offers a privacy-aware solution by enabling decentralized model training without sharing raw data, while also addressing heterogeneity^20–22 and data-imbalance²³ challenges. FL complies with privacy and legal mandates^18,24,25, and its legal implications are well documented²⁶. Moreover, the development of FL platforms like FeatureCloud²⁷ has facilitated the deployment of real-world applications in biomedicine. Despite these advances in federated infrastructure, deep-learning models trained on scRNA-seq remain vulnerable to “interrogation” attacks²⁸: model inversion can reconstruct individual gene-expression profiles from outputs²⁹, and gradient leakage³⁰ in federated settings can recover sensitive count data from shared weight updates. Consequently, formal privacy guarantees, such as those provided by secure multiparty computation (SMPC), are essential for truly privacy-preserving federated learning in omics^31,32. SMPC mitigates these threats by cryptographically splitting client updates into secret shares that are aggregated without ever reconstructing cleartext gradients, thus preventing inversion or membership inference while enabling collective model improvement^31,32. Recent studies have successfully applied federated learning to downstream scRNA-seq analyses, with some incorporating privacy-enhancing technologies (PETs)^33–36. In particular, Tabula³⁷ represents the only federated foundation model for scRNA-seq to date but lacks SMPC or other PETs and is therefore not privacy-preserving. Accordingly, privacy-preserving federated fine-tuning of established, benchmarked foundation models remains unexplored.

Building on the foundation of scGPT, a promising candidate for developing federated foundation models due to its state-of-the-art performance across various single-cell applications³⁸, we developed clifti-GPT, a privacy-preserving federated framework for fine-tuning and applying pretrained foundation models to downstream scRNA-seq data analysis. We extend clifti-GPT beyond prior work in two key ways. First, we enable secure multiparty computation (SMPC)–protected federated fine-tuning and application of foundation models across decentralized clinical cohorts. Second, we introduce transferable inference, where complementary statistics of zero-shot embeddings, such as neighborhood relations or similarity scores, are securely exchanged instead of raw data, embeddings, or local models. This allows institutions to benefit from cross-site knowledge in tasks such as reference mapping or annotation transfer while maintaining strict governance and privacy compliance. We use CrypTen³⁹ for secret-sharing and secure aggregation of both local model updates and locally inferred statistics, ensuring compliance with GDPR and institutional data governance requirements. In a comprehensive empirical evaluation, clifti-GPT matches centralized training performance for cell-type classification while demonstrating high communication efficiency across pathologically heterogeneous, imbalanced cohorts. To ensure consistent, GDPR-compliant preprocessing, we devise a federated binning pipeline that harmonizes count discretization across sites. Furthermore, in our transferable inference pipeline, we implemented an SMPC-protected k-nearest neighbors module for zero-shot reference mapping, complemented by an ensemble secure voting mechanism, achieving centralized-level performance in collaborative inference across heterogeneous cohorts.
The integration of foundation models with federated learning unlocks key advantages like expanding data availability without centralized pooling, sharing computational burden across stakeholders, avoiding single-vendor monopolies, and enabling robust multi‐task and multi‐modality workflows⁴⁰. clifti-GPT, the first privacy-preserving federated framework for scRNA-seq foundation models, embodies these benefits in the single-cell domain. As a generic solution, it demonstrates secure, SMPC-enabled fine-tuning and inference while respecting patient privacy and institutional governance, which can be extended to other benchmarked foundation models and a wide range of downstream analyses.

Results

We introduce clifti-GPT, a privacy-preserving federated framework for fine-tuning and applying the scGPT foundation model to downstream scRNA-seq analyses. Pre-trained on a cell atlas of over 33 million cells¹⁰ to capture the full complexity of single-cell data (Fig. 1a), scGPT delivers high performance but typically requires fine-tuning on large, heterogeneous cohorts that are often siloed across hospitals and cannot be centrally pooled for privacy or governance reasons. Although data sharing could potentially yield the most accurate results, it is frequently prohibited (Fig. 1b). Consequently, local clients are restricted to training on their individual datasets (Fig. 1c). To address these challenges and empower hospitals to enhance their downstream analysis, the clifti-GPT framework starts by distributing the pre-trained scGPT global model to each site, where clients train locally on private data and then secret-share their model updates with independent computational parties. These parties use SMPC to aggregate updates without ever reconstructing cleartext gradients; the resulting global update is redistributed for the next round (Fig. 1d), iteratively refining the model. To assess the effect of secret-sharing on performance, we also implemented a non-SMPC variant.
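The secret-sharing step described above can be illustrated with a minimal sketch of additive secret sharing over a finite ring. This is not the CrypTen-based implementation used in the paper: the ring size `RING`, the fixed-point factor `SCALE`, the function names, and the unweighted mean are all assumptions chosen for illustration. Each client splits its update into random shares that sum to the encoded update, each computational party adds up the shares it receives, and only the combined total is decoded.

```python
import numpy as np

RING = 2**32   # illustrative modulus (assumption); all arithmetic is mod RING
SCALE = 2**16  # illustrative fixed-point scaling factor (assumption)


def encode(x):
    """Map float values to fixed-point integers inside the ring."""
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.int64) % RING


def decode(v):
    """Map ring elements back to floats; the upper half of the ring is negative."""
    v = np.asarray(v, dtype=np.int64) % RING
    signed = np.where(v >= RING // 2, v - RING, v)
    return signed / SCALE


def make_shares(update, n_parties, rng):
    """Split one client's update into n_parties additive shares."""
    secret = encode(update)
    shares = [rng.integers(0, RING, size=secret.shape, dtype=np.int64)
              for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the secret (mod RING).
    last = (secret - sum(shares)) % RING
    return shares + [last]


def secure_mean(client_updates, n_parties=3, seed=0):
    """Average client updates; no party ever sees a single cleartext update."""
    rng = np.random.default_rng(seed)
    party_sums = [0] * n_parties
    for update in client_updates:
        for p, share in enumerate(make_shares(update, n_parties, rng)):
            # Party p only ever accumulates random-looking shares.
            party_sums[p] = (party_sums[p] + share) % RING
    total = sum(party_sums) % RING  # only the aggregate is reconstructed
    return decode(total) / len(client_updates)
```

In this toy setting the decoded result agrees with the plain mean up to fixed-point precision (about 1/`SCALE`), as long as the summed values stay well inside the ring.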

Fig. 1

Privacy-preserving federated training matches centralized scGPT performance and outperforms local client models in cell type classification across HP, MS, Lung-Kim, CL, and Myeloid datasets. a) Pre-trained scGPT on a comprehensive cell atlas serves as a versatile backbone but requires downstream fine-tuning to achieve optimal task-specific performance. b) Centralized fine-tuning of scGPT risks violating patient privacy and institutional data-governance policies. c) In the absence of data sharing, hospitals independently fine-tune the pre-trained model on their limited local cohorts, often resulting in suboptimal performance. d) In our privacy-preserving federated workflow, each hospital trains locally and splits its model updates into cryptographic secret shares, which are distributed to independent computational parties. These parties aggregate the shares using secure multiparty computation (SMPC) without reconstructing any client’s gradients, producing a global model that is redistributed for the next training round. e) The federated models achieve accuracy comparable to centralized fine-tuning and outperform local models in cell type classification across the MS, Lung-Kim, CL, and Myeloid datasets.

We refer to our federated models by the aggregation algorithm used, FedAvg¹⁹ or FedProx⁴¹, each of which can be accompanied by SMPC. All experiments (local, centralized, and federated) used the same set of hyperparameters for the scGPT model, including a learning rate of […] and a batch size of 32. Centralized and local models were trained for 20 epochs, while federated models were trained for one local epoch per client over 20 communication rounds (Fig. 1e). We evaluated cell type classification performance based on accuracy, precision, recall, and macro-F1 across five single-cell RNA-seq datasets with realistic federated scenarios: four clients for Multiple Sclerosis (MS)⁴², four for Human Pancreas (HP)^43–47, two for Cell Line (CL)⁴⁸, ten for Lung-Kim⁴⁹, and five for Myeloid⁵⁰ (Additional file 1: Tables 1–6). For the Myeloid dataset, we assigned the four largest batches (by cell count) to four individual clients and grouped the remaining batches into a fifth reference client. In all other datasets, each client was assigned a unique batch to reflect biological, technical, and disease-driven heterogeneity (Additional file 1: Tables 1–6, Additional file 2: S1-2).
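For readers less familiar with the two aggregation schemes named above, the following sketch shows their standard textbook forms on flat parameter vectors; it is not the clifti-GPT code. FedAvg averages client parameters weighted by local sample counts, and FedProx adds a proximal penalty (μ/2)·‖w − w_global‖² to each client's local loss. The function names and the use of plain NumPy vectors are assumptions made for illustration.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg)."""
    sizes = np.asarray(client_sizes, dtype=np.float64)
    stacked = np.stack([np.asarray(w, dtype=np.float64) for w in client_weights])
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()


def fedprox_penalty(local_w, global_w, mu):
    """Proximal term (mu / 2) * ||w - w_global||^2 added to the local loss."""
    diff = np.asarray(local_w, dtype=np.float64) - np.asarray(global_w, dtype=np.float64)
    return 0.5 * mu * float(np.dot(diff, diff))


def fedprox_gradient(task_grad, local_w, global_w, mu):
    """Gradient of the FedProx local objective: task gradient plus mu * (w - w_global)."""
    drift = np.asarray(local_w, dtype=np.float64) - np.asarray(global_w, dtype=np.float64)
    return np.asarray(task_grad, dtype=np.float64) + mu * drift
```

Setting `mu = 0` recovers the plain FedAvg local objective, which is why the two schemes can share one training loop.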

Fig. 2

Hyperparameter tuning enhances the robustness and communication efficiency of clifti-GPT in cell type classification across HP, CL, Myeloid, MS, and Lung datasets. a. Best accuracy achieved using FedAvg and FedProx aggregation strategies, with and without SMPC, compared to centralized scGPT performance. b. Corresponding training configurations (number of communication rounds, local training epochs, and proximal term μ) for each result shown in panel a, highlighting the diversity of optimal settings across datasets and aggregation strategies. c. Increasing the number of local training epochs in FedAvg-SMPC on MS, Myeloid, and HP datasets does not yield accuracy improvements, indicating limited benefit from additional client-side computation. d. Using FedAvg-SMPC and FedProx-SMPC, different margins (70% to 99%) of centralized accuracy can be reached under varying communication budgets, demonstrating flexible communication efficiency. “NR” indicates that the specified accuracy threshold was not reached within the evaluated training budget.

clifti-GPT demonstrates strong communication efficiency in federated cell type annotation. Across the CL, MS, Lung, and Myeloid datasets, the best-performing aggregation strategies match the accuracy of the centralized scGPT model (Fig. 2a). A similar trend holds for precision across datasets (Additional file 2: Fig. S6). However, using SMPC slightly reduces recall on the MS dataset and macro-F1 scores on both the MS and Myeloid datasets. In general, SMPC introduces a modest performance drop for both FedAvg and FedProx, with the exception of FedProx-SMPC precision, which remains nearly identical to its non-SMPC counterpart (Additional file 2: Fig. S6b). The results in Fig. 2a are achieved using varying hyperparameter settings, as detailed in Fig. 2b. Notably, one local training epoch consistently yields the highest accuracy across datasets. This pattern is consistent for other metrics as well (Additional file 2: Fig. S7). To assess the impact of increased local computation, we examined the effect of increasing the number of local epochs in FedAvg-SMPC across the MS, Myeloid, and HP datasets. As shown in Fig. 2c, additional local computation does not lead to notable improvements in accuracy. We investigated communication efficiency by measuring the number of rounds required to reach given fractions (70%–99%) of centralized accuracy across the HP, MS, and Myeloid datasets. As shown in Fig. 2d, both FedAvg-SMPC and FedProx-SMPC achieve 90% of centralized accuracy using just a single communication round and a varying number of local epochs. Furthermore, 99% of centralized accuracy on MS and Myeloid is reached in as few as two rounds, demonstrating high communication efficiency. While neither FedAvg-SMPC nor FedProx-SMPC on the HP dataset reaches 99% of centralized accuracy within the evaluated training budget, both strategies achieve 99% of the centralized model’s performance on other metrics (precision, recall, macro-F1) in fewer than four rounds (Additional file 2: Fig. S8). Although the federated model trails the centralized model in accuracy on the HP dataset, the difference is minimal at 0.01 (Additional file 2: Fig. S9).
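The rounds-to-threshold measurement described above can be expressed as a small helper. This is a sketch under the assumption that one metric value is recorded per communication round; the function name and the accuracy values in the usage note are hypothetical and not taken from the paper.

```python
def rounds_to_threshold(round_metrics, centralized_metric, fraction):
    """Return the first round (1-indexed) whose metric reaches
    fraction * centralized_metric, or "NR" if the budget is exhausted."""
    target = fraction * centralized_metric
    for round_idx, value in enumerate(round_metrics, start=1):
        if value >= target:
            return round_idx
    return "NR"  # threshold not reached within the evaluated budget
```

For a hypothetical accuracy trace `[0.60, 0.80, 0.85]` and a centralized accuracy of 0.86, the 90% threshold (0.774) is reached in round 2, while the 99% threshold (0.8514) is reported as "NR".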

clifti-GPT demonstrates competitive performance in rare cell type classification when compared to centralized models. We analyzed the per-class true positive rate (TPR) of the best federated model against centralized training, considering the rare cell types in the reference data. TPRs of the federated model are shown using […], where […] denotes […] and a positive sign indicates improvement. Using FedProx-SMPC, trained for two epochs and four rounds with […], the model achieved an overall recall of 0.79 on the MS dataset, closely matching the centralized recall of 0.81 (Fig. 3a). In general, per-class TPRs were well preserved, with slight improvements for SV2C interneurons […] and PVALB interneurons […]. For the rare cell types (fewer than 100 cells in the reference data), Phagocytes (47 cells) and Mixed glial cells (55 cells), the TPRs were […] and […], respectively, indicating only minor decreases compared to centralized training (Fig. 3a, Additional file 1: Table 1).

Using FedAvg-SMPC, trained for three epochs and 21 rounds, the model achieved an overall recall of 0.86 on the HP dataset. In general, per-class TPRs were preserved or improved for common cell types, e.g., Alpha, Beta, Delta, Ductal, and Endothelial (Fig. 3b). Among rare cell types, the federated model showed a notable gain with 0.71 TPR in Mast cells (25 cells) and matched the perfect TPR of the centralized model on Epsilon cells (21 cells). However, both the centralized and federated models misclassified the single MHC class II cell (1 cell in reference) as a Ductal cell (Fig. 3b, Additional file 1: Table 2).
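The per-class comparison used in this analysis can be sketched as follows. The helper names and the toy labels in the usage note are hypothetical, and the sign convention (federated minus centralized, so positive means improvement) is an assumption based on the description above.

```python
import numpy as np


def per_class_tpr(y_true, y_pred):
    """True positive rate (recall) for every class present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {str(c): float(np.mean(y_pred[y_true == c] == c))
            for c in np.unique(y_true)}


def tpr_delta(federated_tpr, centralized_tpr):
    """Per-class difference; positive values mean the federated model is better."""
    return {c: federated_tpr[c] - centralized_tpr[c] for c in centralized_tpr}


def macro_recall(tpr_by_class):
    """Unweighted mean of per-class TPRs, so rare classes count as much as common ones."""
    return float(np.mean(list(tpr_by_class.values())))
```

With toy labels `["Alpha", "Alpha", "Beta", "Beta", "Mast"]`, a model that gets the single Mast cell right gains a full 1.0 in that class's TPR, which shows why per-class rates for very rare types can swing strongly between models.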

Fig. 3

Confusion matrix (per-class TPRs) and UMAP visualization of the best federated models compared with centralized training, showing comparable performance across various cell types. a. On the MS dataset, using FedProx-SMPC ([…]), trained for two epochs and four rounds, the model closely matched centralized per-class TPRs for abundant cell types, with only minor decreases for rare cell types: Phagocytes and Mixed glial cells. b. On the HP dataset, FedAvg-SMPC, trained for three epochs and 21 rounds, preserved or improved centralized per-class TPRs for common cell types; among rare cell types, it achieved a large gain for Mast cells and matched the centralized model on Epsilon cells.

Privacy-preserving federated reference mapping matches or outperforms the centralized baseline on the HP, CL, Lung-Kim, and MS datasets. Beyond fine-tuning, clifti-GPT also enables transferable inference in zero-shot applications such as reference mapping, where only complementary statistics, e.g., distances or neighborhood relations, are exchanged instead of raw data or embeddings. In this setting, foundation models provide the embeddings, while transferable inference supplies the necessary computation; for example, reference mapping assigns query cells based on the majority of their closest reference neighbors. We evaluated privacy-preserving federated transferable inference for zero-shot reference mapping, with and without SMPC, against the centralized model using accuracy, precision, recall, and macro-F1 (Fig. 4). The federated SMPC model matches the centralized performance on the CL dataset across all metrics, matches or improves all metrics for HP and Lung-Kim, and consistently outperforms the centralized model by +0.01 across all metrics for MS (Additional file 2: Fig. S11).
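The majority-vote idea behind reference mapping can be sketched in plaintext as follows. This is only an illustration of the neighbor-voting logic, not the SMPC-protected k-nearest-neighbors module of the paper: here each client reports the distances and labels of its k closest local reference cells in the clear, whereas in clifti-GPT such statistics are exchanged under secret sharing. The function names, the Euclidean distance, and the candidate format are assumptions.

```python
from collections import Counter

import numpy as np


def local_top_k(query_emb, ref_emb, ref_labels, k):
    """Client side: (distance, label) of the k closest local reference cells
    for every query cell. Embeddings never leave the client."""
    dists = np.linalg.norm(query_emb[:, None, :] - ref_emb[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return [[(float(dists[q, j]), ref_labels[j]) for j in row]
            for q, row in enumerate(nearest)]


def federated_knn_vote(per_client_candidates, k):
    """Aggregator side: merge the clients' candidates, keep the k globally
    closest neighbors per query cell, and take the majority label."""
    n_query = len(per_client_candidates[0])
    predictions = []
    for q in range(n_query):
        pooled = [pair for client in per_client_candidates for pair in client[q]]
        closest = sorted(pooled, key=lambda pair: pair[0])[:k]
        votes = Counter(label for _, label in closest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

Because every global top-k neighbor is necessarily inside some client's local top-k, merging the local candidate lists yields the same neighbors as a search over the pooled reference, which is consistent with the federated results tracking the centralized baseline.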

Fig. 4

Privacy-preserving federated reference mapping on HP, Lung-Kim, CL, and MS datasets improves client performance and matches the centralized model across all metrics. From left to right, panels show Accuracy, Precision, Recall, and Macro-F1. The horizontal line represents centralized reference mapping performance. Diamonds and stars represent the federated model without and with SMPC, respectively. Circles depict the performance of each client’s local reference mapping using its limited private data.

Fig. 5

Privacy-preserving federated annotation and reference mapping match centralized performance and outperform local models, with scalability demonstrated on the distributed Myeloid dataset across 5, 10, 20, and 30 clients. a. Cell type annotation performance of FedProx-SMPC trained for one epoch and the optimal number of rounds (shown in the legend) for varying numbers of clients (Top5, Top10, Top20, and Top30). b. Convergence curves of FedProx-SMPC for cell type classification over 200 communication rounds on the Myeloid dataset distributed across 5, 10, 20, and 30 clients over various metrics. c. Reference mapping accuracy of federated models with and without SMPC across 5, 10, 20, and 30 clients on the Myeloid dataset.

We specifically examined federated performance on five rare cell types in the Myeloid dataset (< 100 cells): cDC2_CXCL9 (25 cells), cDC2_ISG15 (53), pDC_LILRA4 (59), cDC3_LAMP3 (94), and cDC1_CLEC9A (98) (Additional file 1: Tables 5–8). We compared per-class recall between federated and centralized models across four federated scenarios (Additional file 2: Figs. S13-14). In Top5, FedAvg–SMPC (three epochs, 17 rounds) achieved the same overall recall as the centralized model (0.44), with gains of 2–18% recall for cDC1_CLEC9A, cDC3_LAMP3, and cDC2_CXCL9, but a 17% decrease for cDC2_ISG15. In Top10, FedProx–SMPC ([…], one epoch, two rounds) yielded slightly lower overall recall (0.42 vs. 0.44); recall for pDC_LILRA4 and cDC1_CLEC9A was unchanged, increased by 3% for cDC3_LAMP3, and remained at 0 for cDC2_ISG15. In Top20, FedProx–SMPC ([…], one epoch, 38 rounds) slightly improved overall recall (0.45 vs. 0.44), with stable performance for cDC1_CLEC9A, cDC2_ISG15, and pDC_LILRA4, and gains for cDC3_LAMP3 and cDC2_CXCL9. In Top30, FedProx–SMPC ([…], one epoch, 180 rounds) further increased overall recall (0.46 vs. 0.44), with a 73% gain for cDC2_CXCL9, a 2% improvement for cDC3_LAMP3, a 2% decline for cDC1_CLEC9A, and unchanged recall for cDC2_ISG15. Across scenarios, federated training achieved competitive overall and per-class recall for all rare cell types except for cDC2_ISG15 in the Top5 and Top10 scenarios (Additional file 2: Fig. S14).

Privacy-preserving federated reference mapping matched the centralized baseline accuracy across all configurations, with the “rest” client outperforming centralized and federated models in Top5–Top20 but underperforming in Top30 (the most realistic federated scenario) (Fig. 5c). Overall, the federated model matched or exceeded the centralized model across all metrics for all splits, except for accuracy, recall, and macro-F1 in Top30, where the gap was only 0.01 (Additional file 2: Figs. S11, S15).

Privacy-preserving federated analysis exhibits heightened sensitivity to batch effects, yet attains performance comparable to the centralized baseline following batch effect correction. We constructed a six-client federated scenario on the Covid-19⁵¹ dataset, which contains nine batches in total. Each client received one batch (10X, Covid, HCL, Northwestern, Oetjen, Sanger), with three batches held out as query data (Freytag, Krasnow, and Sun) (Additional file 1: Table 9). To illustrate the batch effect, we generated UMAP visualizations of the 10 most abundant cell types in the Covid-19 dataset (Fig. 6a). With fewer displayed cell types and distinguishable colors, it becomes apparent that in the uncorrected embeddings, cells cluster by batch rather than by underlying biology. For example, the HCL, Krasnow, and Northwestern batches are separated within the Macrophage population; similarly, Northwestern and Krasnow separate within the AT2 population, and Oetjen separates from the mixed group of 10X, Freytag, Krasnow, and Sun batches within CD4+ T cells (Fig. 6a). A complete UMAP visualization including all cell types shows a similar batch effect for smaller populations (Additional file 2: Fig. S16). We applied scGen with default parameters (see Methods) to correct for batch effects, considering all nine batches in the dataset. This correction resulted in improved mixing of the HCL, Krasnow, and Northwestern batches, as well as better separation between Macrophages and Monocytes (Fig. 6b). Overall, while batch mixing was only partially improved, the distinguishability of cell types was enhanced across all abundant populations, including Dendritic cells, Monocytes, and Neutrophils.

Fig. 6

Batch effect correction of Covid-19 dataset improves centralized analysis and enables performance to match centralized baselines. a. UMAP visualizations of a subset of the Covid-19 dataset colored by batch (top) and cell type (bottom) prior to correction. b. UMAP after batch effect correction, demonstrating improved integration across batches and clearer cell type clustering. c. Cell type classification performance comparing centralized, client-level, and federated models (FedAvg, FedAvg-SMPC, FedProx, FedProx-SMPC) trained for one epoch over 20 rounds on uncorrected vs. corrected data, evaluated by accuracy, precision, recall, and macro-F1. d. Accuracy gain of centralized and federated (with and without SMPC) reference mapping after batch effect correction compared to client-level performance.

To evaluate the impact of batch effects on downstream analysis using the scGPT foundation model, we compared the performance of centralized, client-level, and federated models on both uncorrected and batch-corrected versions of the Covid-19 dataset. After training each aggregation scheme for one epoch over 20 rounds on uncorrected data, the federated models lagged substantially behind the centralized baseline (Fig. 6c). For instance, FedProx-SMPC achieved an accuracy of 0.69 compared to 0.85 for the centralized model, while some individual clients outperformed both the federated and centralized models for specific metrics (Fig. 6c). Following batch effect correction, centralized performance improved across all metrics, with gains of 7–21% in accuracy, precision, recall, and macro-F1. Notably, the best privacy-preserving federated model (FedAvg-SMPC trained for one epoch over 61 rounds) matched centralized performance by reaching 0.93 accuracy, and training for up to 83 rounds narrowed the difference in the other metrics to less than 4% (Supplementary Figure S9). Furthermore, batch effect correction improves the reference mapping performance for both centralized and federated analyses across various metrics, while the privacy-preserving federated method consistently matches the centralized baseline on both uncorrected and corrected data (Fig. 6d; Additional file 2: Fig. S17).

Discussion

We present the first privacy-preserving federated fine-tuning and inference framework for foundation models (FMs) in scRNA-seq, providing a robust and scalable solution for distributed transcriptomic analysis. While FMs, as largely pre-trained models, offer strong representational capabilities, task-specific fine-tuning, often through the addition of specialized layers such as a classification head for cell type annotation, remains crucial for optimal performance in downstream applications. Notably, FMs can also be applied in a zero-shot setting without additional training, leveraging their rich embeddings to mitigate representation issues that hinder inference. In such cases, FM embeddings can be used directly for tasks like reference mapping or perturbation prediction by identifying the most similar reference profiles to a query sample. In scenarios where scRNA-seq data is distributed across institutions, privacy-preserving federated learning enables zero-shot inference without centralizing sensitive genomic data. Given the stringent requirements of regulations such as GDPR¹⁷ and institutional governance frameworks, centralizing such data, despite its typical accuracy advantage, is often infeasible. Meanwhile, traditional client-level training on local datasets may yield suboptimal performance due to dataset size limitations and sampling biases (Fig. 1d).

Our proposed framework, clifti-GPT, bridges this gap by securely aggregating model updates from decentralized datasets using additive secret sharing, achieving accuracy levels comparable to centralized training while preserving data privacy. We evaluate multiple aggregation strategies, including FedAvg and FedProx, and examine the impact of SMPC on performance. Our findings highlight that optimal hyperparameter configurations depend on both the target metric (accuracy, precision, recall, macro-F1) and the dataset. In some cases, such as the Lung-Kim and CL datasets, the same hyperparameters yield competitive performance relative to centralized models across all metrics (Fig. 1e). However, for other datasets (e.g., HP and MS), optimal configurations vary substantially across metrics (Fig. 1e; Additional file 2: Fig. S3, S9). Through a systematic, metric-driven hyperparameter analysis, we found that increasing local computation at clients does not necessarily improve convergence (Fig. 2c), whereas fewer local epochs can be beneficial in highly heterogeneous scenarios, such as Myeloid-Top30 (Fig. 5b). We provide a consolidated summary of optimal hyperparameter settings across datasets in Additional file 2: Fig. S7.

In addition, the clifti-GPT workflow demonstrates strong communication efficiency across diverse datasets and levels of data heterogeneity (Fig. 2d). For the MS, HP, and Myeloid (Top5) datasets, the federated model achieves at least 95% of the centralized model performance within fewer than eight communication rounds across multiple evaluation metrics. Moreover, 99% of the centralized baseline is reached within 12 rounds for all dataset–metric pairs, except for HP–accuracy and MS–recall (Additional file 2: Fig. S8), highlighting the framework’s ability to balance high performance with low communication cost, an essential requirement for practical federated learning deployments. This makes clifti-GPT particularly well-suited for scenarios where minimizing communication overhead is critical without sacrificing performance. When hyperparameters are tuned via metric-driven optimization, the privacy-preserving clifti-GPT framework consistently matches centralized model performance across all metrics for the CL, Lung-Kim, Myeloid-Top5, and Myeloid-Top20 datasets. For the remaining datasets, clifti-GPT with SMPC achieves centralized-level performance for most metrics, with only minor deviations: up to a 2% accuracy drop on HP and Myeloid-Top30, a 1% precision drop for Covid-19, a 4% recall drop for Covid-19, MS, and Myeloid-Top10, and a 3% macro-F1 drop for Covid-19 and Myeloid-Top10. Overall, clifti-GPT with SMPC remains within 2% accuracy and within 4% for all other metrics relative to the corresponding centralized baselines across all evaluated datasets and federated scenarios (Additional file 2: Fig. S9), underscoring its robustness to diverse metrics, datasets, and federation scales.

We designed pathologically heterogeneous federated scenarios by distributing data on a per-batch basis, assigning each client all cells from one experimental batch (except the residual client in Myeloid Top5 to Top20). This mirrors real-world settings where institutional or experimental boundaries naturally produce non-independent and identically distributed (non-IID) partitions. As a result, cell type composition varied substantially across clients: some had a broad mix of common and rare types, while others were dominated by one or a few cell types. For instance, in the Covid-19 scenario, certain clients were dominated by abundant types, while others, such as Sanger, were enriched for low-abundance populations. Rare types in the reference were absent in multiple clients, further amplifying inter-client heterogeneity (Additional file 1: Table 9). Moreover, in Myeloid-Top5, the “rest” client was dominated by highly abundant types yet still contained all rare types from the reference (Additional file 1: Table 5). In Myeloid-Top10, abundant types were more evenly distributed, reducing extreme dominance, but rare type representation in the “rest” client became minimal (Additional file 1: Table 6). In Top20 and Top30, both the distribution of abundant types and the sparsity of rare types among the clients became more pronounced (Additional file 1: Tables 8–9). These imbalances created highly heterogeneous inter-client distributions, making rare type preservation during model aggregation particularly challenging. Clients with narrow or skewed distributions risk disproportionately influencing model updates, potentially slowing convergence or biasing predictions toward overrepresented types. Designing experiments under such conditions enables a rigorous evaluation of clifti-GPT’s robustness under realistic, high-skew data distributions.

Relative to centralized training, federated learning remained effective at classifying rare cell types under highly heterogeneous scenarios, with a worst-case macro-F1 and recall margin of −0.03 across the MS, HP, Myeloid, and Lung-Kim datasets. In terms of per-class true positive rates (TPRs), the privacy-preserving federated models achieved competitive recall for Mixed glial cells in the MS dataset, Mast and Epsilon cells in the HP dataset, and Dendritic cells in the Lung-Kim dataset (Fig. 3; Additional file 2: Fig. S10). Across diverse heterogeneity settings in the Myeloid dataset, competitive recall was also obtained for rare cell types (Additional file 2: Fig. S14). We observed a recall drop for certain rare classes, such as Phagocytes in the MS dataset (Fig. 3), highlighting that performance may vary depending on the rarity and distribution of specific cell types. In some cases, federated models surpass centralized baselines on certain metrics due to metric-driven adaptations, but often at the expense of others. Increasing epochs or rounds rarely resolves this, as extended training can overfit to client-specific distributions, improving accuracy and precision while degrading recall and macro-F1.

Scalability is a critical requirement for applying federated foundation models to scRNA-seq analysis in real-world settings, where data are often generated across numerous institutions, studies, or sequencing batches with highly variable sizes and cell-type compositions. In four scalability scenarios, we partitioned 30 reference batches from the Myeloid dataset into increasingly fine-grained client distributions (Top5, Top10, Top20, and Top30). In the most heterogeneous and realistic Top30 case, even though the local model of client 12 matched accuracy due to its similar cell distribution to the query (Additional file 1: Table 8), it failed to match the centralized model on other metrics (Additional file 2: Fig. S12). Overall, local client-level models failed to match the centralized baseline, whereas clifti-GPT with SMPC consistently maintained performance within a worst-case margin of −0.03 across all metrics (Additional file 2: Fig. S9). These results highlight clifti-GPT’s ability to scale effectively to complex, multi-institutional scRNA-seq environments, delivering robust performance despite the diversity and fragmentation inherent to real-world data.

Foundation models, despite their rich pretrained embeddings, are still prone to batch effects, and this limitation can be more impactful in federated analysis due to the isolation of data across clients. In our experiment on the Covid-19 dataset, we observed batch effects across multiple batches and cell types, for example HCL, Krasnow, and Northwestern within Macrophages (Fig. 6a; Additional file 2: Fig. S16). To mitigate this, we applied scGen with default parameters, correcting batches in the direction of the batch containing the majority of cells for each cell type. Accordingly, we selected three batches as queries that did not contain the maximum number of cells for any cell type, to avoid data leakage from query to reference. As a result, this correction improved mixing within HCL, Krasnow, and Northwestern and enhanced the separation of almost all cell types (Fig. 6b; Additional file 2: Fig. S16). Furthermore, batch effect correction improved centralized cell type classification performance by 7–21% across various metrics compared to uncorrected data (Fig. 6c). Notably, batch effects had a stronger impact on the federated analysis, where the best federated model performed 16% less accurately than the centralized baseline on uncorrected data, emphasizing the susceptibility of federated models to this issue (Fig. 6c). After correction, the privacy-preserving federated model closed the gap, achieving 0.93 accuracy (Additional file 2: Fig. S9). Likewise, reference mapping performance also improved for both centralized and federated models (Fig. 6d; Additional file 2: Fig. S17), although the margin of improvement was more limited compared to fine-tuning (Additional file 2: Fig. S17). Overall, these findings demonstrate that batch effects had a greater impact on fine-tuning than on zero-shot applications, though correction enhanced performance in both cases. 
Meanwhile, we did not assess the impact of alternative centralized or federated⁵² batch effect correction methods for scRNA-seq data, as the primary focus of this study was to benchmark and optimize the clifti-GPT workflow rather than evaluate correction strategies.

In the centralized setting, zero-shot application of foundation models produces rich embeddings that can be directly leveraged for downstream analyses without task-specific fine-tuning, for example reference mapping by inferring query annotations from neighboring reference cells. In this context, complementary statistics such as neighborhood relations, similarity measures, or distributional summaries can be readily computed and used. In the federated setting, however, raw data and embeddings remain local; instead, each clinical site computes these statistics on its embedded data in accordance with governance and patient-privacy requirements. The secure transfer and aggregation of such complementary quantities across sites is therefore a critical capability for federated analysis. We term this functionality transferable inference, denoting the privacy-preserving exchange of locally inferred statistics enabled by zero-shot applications of foundation models across decentralized clinical cohorts.

The zero-shot application of FMs in scRNA-seq analysis represents a major leap forward, enabled by advances in model architectures and increasingly powerful embedding representations. These embeddings capture rich transcriptomic features that can be directly leveraged for a range of downstream analyses without task-specific fine-tuning, including reference mapping for cell type annotation, perturbation prediction, and reverse perturbation inference enabled by distance-based inference of query annotations from neighboring reference cells⁵. As embedding quality continues to improve, the potential impact of such zero-shot approaches is substantial, promising faster, more generalizable, and resource-efficient workflows that can adapt to new datasets and analytical tasks. In fact, comparing the centralized performance of fine-tuning vs. reference mapping for inferring cell types in the CL, Covid-Corrected, Lung-Kim, and Myeloid (all scenarios) datasets shows that although reference mapping trails fine-tuning across all metrics, the worst-case margin is only 0.07, underscoring the potential of zero-shot foundation models as a viable alternative to fine-tuning (Additional file 2: Figs. S9, S11). A similar margin of difference between zero-shot and fine-tuning is observed across federated scenarios (Additional file 2: Figs. S9, S11). Realizing the full potential of zero-shot FM applications in real-world settings requires addressing several challenges. For instance, while scalability and heterogeneity can affect performance, our model maintained consistent accuracy in Myeloid scenarios, with at most a 1% drop from the centralized baseline across all metrics (Additional file 2: Fig. S11). 
Furthermore, while batch effects remain a major source of variation that can confound embeddings, these can be mitigated through correction, as shown by our federated model matching centralized accuracy and staying within a 5% margin on precision, recall, and macro-F1 for the Covid-19 dataset (Additional file 2: Fig. S11).

Our workflow uses additive secret sharing as an effective countermeasure against model inversion and reconstruction attacks, which are particularly relevant for foundation models that may memorize rare or unique samples and allow their retrieval through inference-time probing. In both fine-tuning and zero-shot settings, updates from each client are masked and split across multiple computational parties before aggregation, ensuring that no single party can access unprotected model parameters or intermediate representations. We use a default configuration of three computational parties, which can be increased to further reduce the risk of collusion. This design prevents exposure of raw data, learned embeddings, or sensitive cell-type distributions while maintaining performance comparable to non-encrypted training. Although secret sharing introduces randomization that can cause minor run-to-run variation, it does not add noise and preserves numerical accuracy. This makes it suitable for deploying large-scale foundation models in federated biomedical contexts, where both the richness of embeddings and the potential for memorization require robust, theoretically sound privacy protections that scale to heterogeneous, multi-institution datasets.

Our workflow ensures rigorous privacy preservation by maintaining all intermediate results in a secret-shared format throughout the reference mapping process, revealing values only in the final step. Clients’ shared data is utilized for reference mapping without disclosing the underlying cell-level information, including the actual expression profiles, the set of cell types available on each client, or the number of cells they hold. This is crucial from a data governance perspective, as it prevents the inadvertent disclosure of the presence of rare or novel cell types within specific institutions. The protocol further prevents leakage of which specific reference cells are closest to each query or the labels associated with those reference cells. Even during the voting phase, the design conceals which clients contributed decisive votes, ensuring that no party can infer another’s influence on the outcome. Similarly, query samples are also kept in secret-shared form, such that neither the reference holders nor other parties learn their contents or their mapping relationships during processing. This end-to-end secure aggregation ensures that the entire mapping, nearest neighbor search, and voting procedures are performed collaboratively without compromising the confidentiality of either the reference or query datasets.

This study underscores the potential of foundation models to transform single-cell transcriptomics by combining rich, pre-trained embeddings with adaptable fine-tuning strategies. As future FMs are trained on larger and more diverse biological corpora with improved embedding architectures, their ability to support accurate zero-shot inference and efficient domain-specific fine-tuning will only increase. At the same time, the sensitive and often regulated nature of single-cell data creates a growing demand for federated approaches that allow model improvement without centralizing patient-level data. Our framework demonstrates how such models can be securely integrated into distributed workflows using additive secret sharing, addressing both the technical challenges of scalability and data heterogeneity and the theoretical requirements of privacy, data governance, and compliance. By enabling robust, metric-driven performance in settings with extreme non-IID distributions, this work provides a blueprint for adopting next-generation FMs in a privacy-preserving and communication-efficient manner. In doing so, it sets the stage for scalable, collaborative analysis pipelines that respect institutional boundaries while benefiting from the full capabilities of emerging foundation models.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Declarations

Competing Interests

F.J.T. consults for Immunai Inc., CytoReason Ltd, Cellarity, BioTuring Inc., and Genbio.AI Inc., and has an ownership interest in Dermagnostix GmbH and Cellarity. No other author declares competing interests.

References

1. Regev A, et al. The Human Cell Atlas. eLife. 2017;6:e27041.

2. Avsec Ž, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18:1196–203.

3. Yang F, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell. 2022;4:852–66.

4. Theodoris CV, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616–24.

5. Cui H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;1–11. https://doi.org/10.1038/s41592-024-02201-0.

6. Rosen Y, et al. Universal Cell Embeddings: A Foundation Model for Cell Biology. Preprint at https://doi.org/10.1101/2023.11.28.568918 (2024).

7. Hao M, et al. Large-scale foundation model on single-cell transcriptomics. Nat Methods. 2024;1–11. https://doi.org/10.1038/s41592-024-02305-7.

8. Heimberg G, et al. Scalable querying of human cell atlases via a foundational model reveals commonalities across fibrosis-associated macrophages. Preprint at https://doi.org/10.1101/2023.07.18.549537 (2023).

9. Schaar AC, et al. Nicheformer: a foundation model for single-cell and spatial omics. Preprint at https://doi.org/10.1101/2024.04.15.589472 (2024).

10. Wang C, et al. scGPT-spatial: Continual Pretraining of Single-Cell Foundation Model for Spatial Transcriptomics. Preprint at https://doi.org/10.1101/2025.02.05.636714 (2025).

11. Lin Y, et al. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat Biotechnol. 2022;40:703–10.

12. Fischer F, et al. scTab: Scaling cross-tissue single-cell annotation models. Nat Commun. 2024;15:6611.

13. Walker CR, et al. Private information leakage from single-cell count matrices. Cell. 2024;187:6537–6549.e10.

14. Harmanci A, Gerstein M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat Methods. 2016;13:251–6.

15. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying Personal Genomes by Surname Inference. Science. 2013;339:321–4.

16. Sweeney L, et al. Re-identification Risks in HIPAA Safe Harbor Data: A study of data from one environmental health study. Technol Sci. 2017;2017082801.

17. Voigt P, von dem Bussche A. The EU General Data Protection Regulation (GDPR). Cham: Springer International Publishing; 2017. https://doi.org/10.1007/978-3-319-57959-7.

18. Rieke N, et al. The future of digital health with federated learning. Npj Digit Med. 2020;3:119.

19. McMahan B, Moore E, Ramage D, Hampson S, Agüera y Arcas B. Communication-Efficient Learning of Deep Networks from Decentralized Data. in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (eds Singh A, Zhu J) vol. 54 1273–1282 (PMLR, 2017).

20. Nasirigerdeh R, et al. Federated Multi-Mini-Batch: An Efficient Training Approach to Federated Learning in Non-IID Environments. Preprint at https://doi.org/10.48550/arXiv.2011.07006 (2020).

21. Li T, et al. Federated Optimization in Heterogeneous Networks. in Proceedings of Machine Learning and Systems (eds Dhillon I, Papailiopoulos D, Sze V) vol. 2 429–450 (2020).

22. Nasirigerdeh R, Rueckert D, Kaissis G. Utility-preserving Federated Learning. in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security 55–65 (ACM, Copenhagen, Denmark, 2023). https://doi.org/10.1145/3605764.3623908.

23. Wang J, Liu Q, Liang H, Joshi G, Poor HV. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. Preprint at https://doi.org/10.48550/arXiv.2007.07481 (2020).

24. Brauneck A, et al. Federated Machine Learning, Privacy-Enhancing Technologies, and Data Protection Laws in Medical Research: Scoping Review. J Med Internet Res. 2023;25:e41588.

25. Brauneck A, et al. Federated machine learning in data-protection-compliant research. Nat Mach Intell. 2023;5:2–4.

26. Woisetschläger H, Mertel S, Krönke C, Mayer R, Jacobsen H-A. Federated Learning and AI Regulation in the European Union: Who is Responsible? -- An Interdisciplinary Analysis. Preprint at https://doi.org/10.48550/arXiv.2407.08105 (2024).

27. Matschinske J, et al. The FeatureCloud Platform for Federated Learning in Biomedicine: Unified Approach. J Med Internet Res. 2023;25:e42621.

28. Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nat Rev Genet. 2014;15:409–21.

29. Fredrikson M, Jha S, Ristenpart T. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security 1322–1333 (Association for Computing Machinery, New York, NY, USA, 2015). https://doi.org/10.1145/2810103.2813677.

30. Zhu L, Liu Z, Han S. Deep Leakage from Gradients. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019).

31. Mohassel P, Zhang Y. SecureML: A System for Scalable Privacy-Preserving Machine Learning. in 2017 IEEE Symposium on Security and Privacy (SP) 19–38 (2017). https://doi.org/10.1109/SP.2017.12.

32. Bonawitz K, et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning. in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security 1175–1191 (Association for Computing Machinery, New York, NY, USA, 2017). https://doi.org/10.1145/3133956.3133982.

33. Zolotareva O, et al. Flimma: a federated and privacy-aware tool for differential gene expression analysis. Genome Biol. 2021;22:338.

34. Wang S, et al. scFed: federated learning for cell type classification with scRNA-seq. Brief Bioinform. 2023;25:bbad507.

35. Sav S, Bossuat J-P, Troncoso-Pastoriza JR, Claassen M, Hubaux J-P. Privacy-preserving federated neural network learning for disease-associated cell classification. Patterns. 2022;3:100487.

36. Bakhtiari M, Bonn S, Theis F, Zolotareva O, Baumbach J. FedscGen: privacy-aware federated batch effect correction of single-cell RNA sequencing data. Preprint at https://doi.org/10.21203/rs.3.rs-4807285/v1 (2024).

37. Ding J, et al. Toward a privacy-preserving predictive foundation model of single-cell transcriptomics with federated learning and tabular modeling. Preprint at https://doi.org/10.1101/2025.01.06.631427 (2025).

38. Szałata A, et al. Transformers in single-cell omics: a review and new perspectives. Nat Methods. 2024;21:1430–43.

39. Knott B, et al. CrypTen: Secure Multi-Party Computation Meets Machine Learning. in Advances in Neural Information Processing Systems vol. 34 4961–4973 (Curran Associates, Inc., 2021).

40. Zhuang W, et al. When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions. Preprint at https://doi.org/10.48550/arXiv.2306.15546 (2025).

41. Li T, et al. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2, 429–450 (2020).

42. Schirmer L, et al. Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature. 2019;573:75–82.

43. Baron M, et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst. 2016;3:346–360.e4.

44. Muraro MJ, et al. A Single-Cell Transcriptome Atlas of the Human Pancreas. Cell Syst. 2016;3:385–394.e3.

45. Wang YJ, et al. Single-Cell Transcriptomics of the Human Endocrine Pancreas. Diabetes. 2016;65:3028–38.

46. Xin Y, et al. RNA Sequencing of Single Human Islet Cells Reveals Type 2 Diabetes Genes. Cell Metab. 2016;24:608–15.

47. Segerstolpe Å, et al. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab. 2016;24:593–607.

48. Zheng GXY, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.

49. Kim N, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun. 2020;11:2285.

50. Cheng S, et al. A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells. Cell. 2021;184:792–809.e23.

51. Lotfollahi M, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol. 2022;40:121–30.

52. Bakhtiari M, Bonn S, Theis F, Zolotareva O, Baumbach J. FedscGen: privacy-preserving federated batch effect correction of single-cell RNA sequencing data. Genome Biol. 2025;26:216.

53. Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat Methods. 2019;16:715–21.

Methods

Leveraging recent advances in foundation models (FMs) for scRNA-seq data, which have demonstrated strong generalization capabilities across diverse datasets, we developed clifti-GPT, built upon the scGPT FM, as a privacy-preserving federated workflow that enables multiple institutions to collaboratively adapt a pretrained FM without sharing raw data. Our approach aims to unlock the full potential of FMs in clinical and multi-institutional settings, which are otherwise constrained by the privacy, governance, and heterogeneity challenges inherent to biomedical data. The framework supports both fine-tuning for downstream analysis and zero-shot inference, accompanied by complementary distance-based similarity inference. To ensure robust privacy guarantees, clifti-GPT integrates additive secret sharing, a type of secure multiparty computation (SMPC) protocol, ensuring that model updates, label mappings, and other sensitive information remain protected during both model aggregation and local statistics computation. This design not only preserves patient privacy but also supports scalability, accommodates heterogeneous data distributions, and provides a foundation for integrating future FMs with improved embedding architectures in secure multi-party environments.

We conducted three categories of experiments to evaluate the effectiveness of our privacy-preserving federated workflow for the scGPT model on downstream tasks. First, we implemented a centralized scenario, in which all client data was aggregated into a single repository without any GDPR restrictions. This approach maximized the available data for training but did not preserve privacy. Second, we performed local fine-tuning, where each client fine-tuned the pretrained foundation model on its own dataset without sharing any information, maintaining data privacy but risking overfitting due to limited local samples. Finally, we developed a federated scenario, where data remained fully decentralized and only model updates and local statistics were exchanged and aggregated across clients, enabling a GDPR-compliant and privacy-aware solution that overcomes data-sharing restrictions.

To ensure comparability across scenarios, we trained all models on the entire reference set and evaluated performance using a held-out query set. The gene tokens used for fine-tuning were aligned with those in the pretrained foundation model, and all gene expression values were normalized consistently. In the federated scenario, the common set of genes across clients was aggregated using a secure hashing mechanism, thereby concealing the presence or absence of specific genes at each site and preserving privacy. For cell type annotation tasks, gene expression values were log-transformed and binned prior to fine-tuning. We assume $C$ clients, each client $c$ holding $n_c$ training samples, which collaboratively fine-tune or apply the scGPT foundation model on their local data without ever exchanging raw data. Clients communicate their local updates using additive secret sharing among computational parties. Under this scheme, sensitive quantities are secret-shared and only the aggregated values are revealed in plaintext.
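As a minimal sketch of the additive secret sharing idea described above, the following example splits each client's value into random shares held by three computational parties and reveals only the aggregate. The ring size, the fixed-point scale, and all function names are assumptions chosen for illustration and are not taken from the clifti-GPT implementation.

```python
import secrets

MODULUS = 2**64   # ring size (assumed for this sketch)
SCALE = 2**16     # fixed-point scaling factor (assumed precision)

def encode(x: float) -> int:
    """Map a real number to a ring element via fixed-point encoding."""
    return int(round(x * SCALE)) % MODULUS

def decode(v: int) -> float:
    """Map a ring element back to a real number; the upper half of the ring is negative."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE

def share(x: float, n_parties: int = 3) -> list[int]:
    """Split x into n_parties additive shares that sum to encode(x) modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((encode(x) - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares: list[int]) -> float:
    """Reveal a value by summing all of its shares."""
    return decode(sum(shares) % MODULUS)

# Three clients secret-share one local value each (hypothetical numbers).
client_values = [0.25, -1.5, 3.0]
all_shares = [share(v) for v in client_values]

# Each computational party adds up the shares it received, one per client.
party_sums = [sum(column) % MODULUS for column in zip(*all_shares)]

# Only the aggregate is reconstructed; individual client values stay hidden.
aggregate = reconstruct(party_sums)
```

Each individual share is uniformly random, so a single party learns nothing about a client's value; the sum is exact because no noise is added, only masking that cancels on reconstruction.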

Federated Binning

For the federated scenario, we developed a federated binning workflow (Additional file 3: Algorithm 1) to transform continuous gene expression values into discrete bins that are standardized across multiple clients, ensuring consistency and alignment for downstream federated analyses while maintaining data privacy and enhancing scalability. Each client $c$ first calculates local bin edges $b_c$ on its gene expression data using quantile binning, which partitions the non-zero gene expression values into a predefined number of consistent discrete intervals. The local bin edges $b_c$, along with the corresponding sample size $n_c$, are then shared with the coordinator. The coordinator then aggregates these local bin edges:

$$\bar{b} = \sum_{c=1}^{C} \frac{n_c}{N}\, b_c, \qquad N = \sum_{c=1}^{C} n_c$$

Using $\bar{b}$, each client discretizes its local data by assigning a bin index to each non-zero value, ensuring a unified data representation across all participants.
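The binning steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the coordinator is assumed to combine local edges by a sample-size-weighted mean, bin edges are computed per client rather than per cell, and the function names (`local_bin_edges`, `aggregate_bin_edges`, `apply_bins`) are hypothetical.

```python
import numpy as np

def local_bin_edges(expr, n_bins):
    """Quantile bin edges over the non-zero expression values held by one client."""
    nonzero = expr[expr > 0]
    return np.quantile(nonzero, np.linspace(0.0, 1.0, n_bins - 1))

def aggregate_bin_edges(edges, sizes):
    """Coordinator step (assumed): sample-size-weighted mean of the local bin edges."""
    return np.average(np.stack(edges), axis=0, weights=np.asarray(sizes, dtype=float))

def apply_bins(expr, global_edges):
    """Assign each non-zero value a bin index >= 1 from the shared edges; zeros stay 0."""
    binned = np.zeros(expr.shape, dtype=np.int64)
    mask = expr > 0
    # Values below the smallest shared edge are clipped into the first bin.
    binned[mask] = np.clip(np.digitize(expr[mask], global_edges), 1, len(global_edges))
    return binned

# Two synthetic clients with sparse, non-negative expression matrices.
rng = np.random.default_rng(0)
n_bins = 6
clients = []
for n_cells in (40, 120):
    values = rng.exponential(1.0, size=(n_cells, 10))
    values[rng.random(values.shape) < 0.5] = 0.0
    clients.append(values)

edges = [local_bin_edges(x, n_bins) for x in clients]
sizes = [x.shape[0] for x in clients]
global_edges = aggregate_bin_edges(edges, sizes)
binned = [apply_bins(x, global_edges) for x in clients]
```

Because every client discretizes with the same `global_edges`, a given bin index refers to the same expression range at every site.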

Privacy-preserving Federated fine-tuning

In clifti-GPT, the coordinator initializes the pretrained foundation model weights and distributes the initial model to all participating clients. Each client trains its local model and secret-shares its weights. For model aggregation, we employed FedAvg, where the coordinator computes the weighted mean of local model parameters:

$$w^{t+1} = \sum_{c=1}^{C} \frac{n_c}{N}\, w_c^{t+1}$$

where $n_c$ is the number of samples at client $c$, $N = \sum_{c=1}^{C} n_c$ is the total number of samples across all clients, $w_c^{t+1}$ denotes the parameters of client $c$ after local training, and $w^{t}$ denotes the global model parameters at round $t$. This weighting gives larger datasets proportionally greater influence on the global model. We also experimented with FedProx to address data heterogeneity and improve convergence. In this approach, a proximal term is added to the local objective:

$$\min_{w}\; h_c(w; w^{t}) = F_c(w) + \frac{\mu}{2}\,\lVert w - w^{t} \rVert^{2}$$

where $F_c$ is the local loss of client $c$ and $\mu$ is the proximal coefficient that constrains local updates to remain closer to the global model. Aggregation then follows the same weighted averaging as FedAvg but benefits from improved stability when client data distributions differ. The aggregation process is repeated over multiple communication rounds until the global model converges. By combining SMPC with these federated optimization algorithms, clifti-GPT ensures both privacy preservation and robust model performance in multi-institutional scRNA-seq settings.
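To make the two update rules concrete, here is a small plaintext sketch of FedAvg aggregation and the FedProx local objective. It operates on flat parameter vectors and omits secret sharing; the function names and the numeric values are hypothetical.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted mean of the clients' parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

def fedprox_objective(local_loss, w, w_global, mu):
    """Local loss plus the proximal penalty (mu / 2) * ||w - w_global||^2."""
    return local_loss + 0.5 * mu * float(np.sum((w - w_global) ** 2))

# Two clients with 100 and 300 samples (toy parameter vectors).
w_clients = [np.array([1.0, 0.0]), np.array([3.0, 4.0])]
n_samples = [100, 300]

w_global = fedavg(w_clients, n_samples)
penalised = fedprox_objective(0.7, w_clients[0], w_global, mu=0.01)
```

With these numbers the larger client contributes three quarters of the average, and the proximal term grows with the squared distance between the local and global parameters.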

Privacy-preserving Federated reference mapping

We implement transferable inference for zero-shot reference mapping, enabling clients to exchange only complementary statistics such as nearest-neighbor distances and votes, rather than raw data or embeddings. This allows knowledge transfer across clinical cohorts while preserving privacy and governance. We developed a privacy-aware federated workflow for reference mapping (Additional file 3: Algorithm 2), where all parties agree on a shared embedding model $f_\theta$ and compute:

$$E_q = f_\theta(X_q), \qquad E_c = f_\theta(X_c)$$

where $X_q$ denotes the query cells and $X_c$ the reference cells of client $c$. Then clients calculate local distance matrices $D_c$ and index matrices $I_c$ via a nearest-neighbor search between $E_q$ and $E_c$, retaining only the top-$k$ neighbor entries. The server collects these local top-$k$ index lists, merges them to form a global top-$k$ neighbor set for each query, and redistributes this global index list back to the clients. Finally, each client casts votes by looking up its own cell-type labels for the selected neighbors and returns the vote counts to the server. The server then aggregates the client votes and selects the most frequent label as the final prediction for each query cell. This protocol ensures that raw reference data never leaves any client while still enabling accurate zero-shot nearest-neighbor classification.
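The message flow of this non-encrypted protocol can be sketched as below. This is a simplified single-process simulation, assuming squared Euclidean distance on the embeddings and integer-coded labels; `local_topk` and `federated_knn_predict` are hypothetical names, and the toy embeddings are made up.

```python
import numpy as np

def local_topk(query, reference, k):
    """Client step: squared Euclidean distances to local reference cells, k smallest per query."""
    d = ((query[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    return np.take_along_axis(d, idx, axis=1), idx

def federated_knn_predict(query, client_refs, client_labels, n_labels, k):
    dists, owners, local_idx = [], [], []
    for c, ref in enumerate(client_refs):
        d, i = local_topk(query, ref, min(k, len(ref)))
        dists.append(d)
        local_idx.append(i)
        owners.append(np.full_like(i, c))   # remember which client holds each candidate
    dists = np.concatenate(dists, axis=1)
    owners = np.concatenate(owners, axis=1)
    local_idx = np.concatenate(local_idx, axis=1)

    # Server step: merge the candidates and keep the global k nearest per query.
    sel = np.argsort(dists, axis=1)[:, :k]
    g_owner = np.take_along_axis(owners, sel, axis=1)
    g_idx = np.take_along_axis(local_idx, sel, axis=1)

    # Client step: each client votes with the labels of its selected reference cells.
    votes = np.zeros((len(query), n_labels), dtype=np.int64)
    for c, labels in enumerate(client_labels):
        for q in range(len(query)):
            mine = g_idx[q, g_owner[q] == c]
            np.add.at(votes[q], labels[mine], 1)

    # Server step: majority vote per query cell.
    return votes.argmax(axis=1)

refs = [np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
        np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [4.0, 4.0]])]
labels = [np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])]
query = np.array([[0.1, 0.1], [5.1, 4.9]])
predictions = federated_knn_predict(query, refs, labels, n_labels=2, k=3)
```

Only distances, indices, and vote counts cross the client boundary in this sketch; the reference embeddings themselves are used solely inside `local_topk`.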

To guarantee privacy preservation, we customize the transferable inference pipeline for reference mapping by applying additive secret sharing to both query and reference statistics, enabling local and global KNN searches followed by secure voting (Additional file 3: Algorithm 3). Importantly, we refrain from disclosing these statistics in cleartext; instead, they remain secret-shared and are only aggregated under SMPC, ensuring that neither raw data nor intermediate quantities are ever exposed outside the originating institution. First, we secret-share the query embeddings among parties. Then the coordinator collects client’s plaintext labels into the global label set and define a bijection that assigns each unique label string an integer index, assuming the labels are harmonized across clients. Next, the coordinator securely aggregate the clients’ local sample counts to compute the global offset, thereby defining disjoint index ranges for each client. Accordingly, clients map their cell-type labels to global indices and secret-share besides their embedded reference data and local offset index vector (Additional file 3: Algorithm 4). Each client then triggers the secret-shared squared Euclidean distance matrix calculation between reference cells and secret-shared target query cells:

where denotes element-wise multiplication and is an ‐vector of ones. Afterwards, each client secretly selects its local nearest neighbors for each query, producing per-client top- distance and index shares:

Here, the selection operator performs a secret-shared argmin in one-hot form to locate the k nearest-neighbor entries (Additional file 3: Algorithm 5). Finally, the computational parties concatenate all clients’ top-k shares along the neighbor axis to form the global secret-shared distance matrices:

and likewise for the indices. Next, the coordinator locates the global k nearest neighbors by iteratively applying the one-hot argmin to the concatenated distance matrix:

where, after each selection, we suppress the selected minima using a large constant so that they are not selected again in later iterations, and the iteration starts from the fully concatenated distance matrix (Additional file 3: Algorithm 5).
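The iterative selection can be illustrated with a plaintext analogue. In the sketch below, `one_hot_argmin` and `top_k_by_masking` are hypothetical helper names, `BIG` is an assumed suppression constant, and all values are ordinary Python numbers instead of secret shares. The point is only to show how masking lets the top-k entries be read out with multiplications and additions, which is the form an SMPC backend would evaluate.

```python
BIG = 1e9  # assumed suppression constant; must exceed every real distance

def one_hot_argmin(row):
    """One-hot vector marking the first minimum of row."""
    pos = row.index(min(row))
    return [1 if i == pos else 0 for i in range(len(row))]

def top_k_by_masking(distances, indices, k):
    """Select the k smallest distances and their indices via one-hot masks."""
    work = list(distances)
    top_d, top_i = [], []
    for _ in range(k):
        mask = one_hot_argmin(work)
        # Inner products with the mask read out the selected entry.
        top_d.append(sum(m * d for m, d in zip(mask, distances)))
        top_i.append(sum(m * i for m, i in zip(mask, indices)))
        # Suppress the selected minimum so it is not chosen again.
        work = [w + BIG * m for w, m in zip(work, mask)]
    return top_d, top_i
```

For example, `top_k_by_masking([4.0, 1.0, 3.0, 2.0], [10, 11, 12, 13], 2)` selects the entries with distances 1.0 and 2.0 and their indices 11 and 13.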

In the voting phase, each client matches global neighbor indices to its own reference indices to cast its vote:

where the equality test produces a one-hot vector marking exactly those positions where a given global neighbor index matches one of the client’s local reference indices. Multiplying this mask by the client’s secret-shared label-index vector yields the scalar vote, which is then expanded to a one-hot vector over the global labels (all zeros if the neighbor does not belong to that client). Next, clients accumulate their top-k votes into a per-client vote matrix, and the coordinator securely produces the global vote matrix. All operations triggered by either clients or the coordinator are conducted in secret-shared form across the computational parties. Finally, the coordinator extracts the most frequent vote for each query by applying a one-hot maximum mask on the secret-shared global vote matrix:

where the one-hot argmax produces a one-hot encoding of the index with the maximum vote count for each query, and the reveal operation denotes the SMPC aggregation of secret shares from the parties, whose result is revealed to the coordinator. The results are then mapped to labels using the global label-index vector, yielding the federated reference predictions.
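A minimal sketch of additive secret sharing applied to the vote aggregation is given below. It is not the clifti-GPT protocol: the modulus `PRIME`, the helper names, and the single-process simulation of all parties are assumptions. It also simplifies one step: here the summed counts are reconstructed and the argmax is taken in the clear, whereas the text describes computing the one-hot maximum under SMPC and revealing only that result.

```python
import secrets

PRIME = 2**61 - 1  # assumed modulus for the additive shares

def share(value, n_parties):
    """Split an integer into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME

def add_shared(a, b):
    """Add two shared values share-by-share, without reconstructing either."""
    return [(x + y) % PRIME for x, y in zip(a, b)]

def secure_vote_sum(client_votes, n_parties):
    """Sum per-client vote vectors (indexed by global label) in shared form."""
    n_labels = len(client_votes[0])
    total = [[0] * n_parties for _ in range(n_labels)]
    for votes in client_votes:
        for lbl, count in enumerate(votes):
            total[lbl] = add_shared(total[lbl], share(count, n_parties))
    return total  # one list of shares per label

def predict(client_votes, labels, n_parties=3):
    """Simplification: reveal the summed counts, then take the argmax in the clear."""
    counts = [reconstruct(s) for s in secure_vote_sum(client_votes, n_parties)]
    return labels[counts.index(max(counts))]
```

Each individual share is uniformly random, so a single party's view reveals nothing about a client's vote vector; only the sum over all parties is meaningful.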

Batch effect correction

We employed scGen⁵³ for batch effect removal, leveraging its variational autoencoder (VAE) framework to learn a latent representation that disentangles biological signals from technical variation. scGen optimizes a combined reconstruction and Kullback–Leibler (KL) divergence loss:
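For orientation, a generic VAE objective of this kind can be written as follows. The notation is assumed for illustration: $q_\phi$ is the encoder, $p_\theta$ the decoder, $p(z)$ a standard normal prior, and $\beta$ a KL weight whose value is not taken from this study.

```latex
\mathcal{L}(\theta, \phi; x) \;=\;
-\,\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
\;+\; \beta\,\mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)
```

The first term penalizes reconstruction error, and the second keeps the approximate posterior close to the prior.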

Following training, mean latent features for each shared cell type are computed in a reference batch, and the latent representations of the other batches are shifted toward these means. This correction aligns cell-type-specific embeddings across batches, aiming to mitigate technical variability while preserving biological heterogeneity.
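The shift step can be sketched as below. This is a simplified illustration of the idea, not scGen's actual code: `shift_to_reference` is a hypothetical helper, latent vectors are plain Python lists, and cell types absent from the reference batch are left unchanged by assumption.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def shift_to_reference(latents, batches, cell_types, ref_batch):
    """Shift each (batch, cell type) group so its mean matches the reference-batch mean."""
    groups = {}
    for i, key in enumerate(zip(batches, cell_types)):
        groups.setdefault(key, []).append(i)
    corrected = [list(z) for z in latents]
    for (batch, ctype), idxs in groups.items():
        ref_idxs = groups.get((ref_batch, ctype))
        if batch == ref_batch or ref_idxs is None:
            continue  # reference batch or unshared cell type: leave unchanged
        ref_mean = mean_vec([latents[i] for i in ref_idxs])
        own_mean = mean_vec([latents[i] for i in idxs])
        delta = [r - o for r, o in zip(ref_mean, own_mean)]
        for i in idxs:
            corrected[i] = [z + d for z, d in zip(latents[i], delta)]
    return corrected
```

Because every cell in a group receives the same offset, within-group variation is preserved while the group mean moves onto the reference mean.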

Datasets

In this study, we applied identical data preparation procedures for both reference mapping and cell type classification tasks. For each dataset, cells were partitioned into reference and query sets, ensuring that all cell types present in the query set were also represented in the reference set. Reference cells were assigned to clients based on batch origin to simulate realistic federated learning scenarios encompassing diverse technical, biological, and disease-driven heterogeneity. Highly variable genes (HVGs) were then selected from the combined reference and query sets for downstream analyses.

The Multiple Sclerosis (MS)⁴² dataset, comprising 21,312 cells across 18 distinct cell types, was collected from three brain regions (prefrontal cortex, cerebral cortex, and premotor cortex) in both healthy controls and MS patients. To simulate a realistic federated learning scenario with biological and disease-driven heterogeneity, we defined four reference clients: control-prefrontal, MS-prefrontal, control-cerebral, and MS-cerebral, encompassing a total of 18,739 cells. The remaining 2,573 cells, derived from the control and MS premotor cortex, were designated as the query dataset (see Additional file 1: Table 1). For downstream analysis, we selected 3,000 highly variable genes (HVGs).

The Covid-19⁵¹ dataset includes 19,922 cells spanning 36 immune and epithelial cell types, collected from nine distinct cohorts: 10X, Covid, HCL, Northwestern, Oetjen, Sanger, Freytag, Krasnow, and Sun. To simulate a realistic federated learning scenario with diverse technical and biological batch effects, we assigned six cohorts (10X, Covid, HCL, Northwestern, Oetjen, Sanger) as reference clients (16,977 cells) and designated three cohorts (Freytag, Krasnow, Sun) as the query dataset (2,945 cells). This split ensures label consistency without requiring any filtering, as all cell types present in the query set are also represented in the reference set. We selected 1,200 HVGs for downstream analysis.

The Human Pancreas (HP) dataset comprises 14,746 cells spanning 11 cell types, sourced from five studies: Baron⁴³, Muraro⁴⁴, Wang⁴⁵, Xin⁴⁶, and Segerstolpe⁴⁷. To simulate a federated learning scenario, we designated four studies (Baron, Muraro, Wang, and Xin) as reference clients, contributing a total of 12,684 cells, and assigned Segerstolpe (2,062 cells) as the query dataset. We selected 3,000 highly variable genes (HVGs) for downstream analysis.

The Lung-Kim dataset⁴⁹ includes 30,472 cells spanning 10 cell types, collected from 14 primary lung adenocarcinoma patient samples. We simulated a federated scenario with 10 samples (P0006, P0008, P0018, P0020, P0025, P0028, P0030, P0034, P1028, P1058) as reference clients (23,185 cells) and the remaining four samples (P0019, P0031, P1006, P1049) as the query set (7,287 cells). We selected 3,000 highly variable genes (HVGs) for downstream analysis.

The Cell Line (CL)⁴⁸ dataset comprises 9,531 cells spanning two cell types (293T and Jurkat), profiled across three batches. To simulate a federated learning scenario, we included two batches as reference clients (6,143 cells) and designated one batch containing a mix of both cell types as the query set (3,388 cells). A total of 1,126 highly variable genes (HVGs) were selected for downstream analysis.

The Myeloid dataset⁵⁰ comprises 13,178 cells spanning 16 annotated immune cell types, collected across 84 distinct batches. To simulate a large-scale federated learning scenario with considerable technical and biological variability, we designated 30 batches (10,555 cells) as reference clients and 54 batches (2,623 cells) as the query dataset. For downstream analysis, we selected 3,000 highly variable genes (HVGs). To evaluate scalability, we designed four federated scenarios based on the number of clients involved. To ensure that all scenarios use the same set of training samples, we defined the top-n setting, where n is the number of batches kept as individual clients, by assigning the n largest batches (based on cell count) as individual reference clients and grouping the remaining batches into a single reference client labeled “rest.” In the top-30 scenario, each reference client corresponds to a unique batch, thereby preserving the full diversity of the reference data.
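The top-n grouping can be sketched as follows. `assign_clients` is a hypothetical helper, and the input is assumed to be one batch identifier per reference cell. Ties in batch size are broken by first appearance, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def assign_clients(cell_batches, n):
    """Map each batch to a client: the n largest batches stay separate, the others join "rest"."""
    sizes = Counter(cell_batches)  # cells per batch
    largest = {b for b, _ in sizes.most_common(n)}
    return {b: (b if b in largest else "rest") for b in sizes}
```

With n equal to the number of reference batches, every batch maps to itself and no “rest” client is created, which matches the top-30 scenario described above.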

Code availability

All datasets used in this study were published in previously cited papers, and the preprocessed datasets are available for download. The source code for clifti-GPT is available at https://github.com/Mohammad-Bakhtiari/clifti-GPT, under the Apache 2.0 license.

Author Contribution

M.B. and J.B. conceived the study. M.B. implemented and conducted the experiments and wrote the manuscript. J.B. and M.O. supervised the project. J.B., M.O., M.E., A.C., and F.T. reviewed the experiments, analyzed the results, and contributed to the manuscript text.

Funding

Open Access funding enabled and organized by Projekt DEAL. This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101137278 (CVDLINK). The views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them. This work was also developed as part of the NetMap project and is funded by the German Federal Ministry of Research, Technology and Space (BMFTR) under grant number 031L0309B.

Competing interests

Yes