A Federated and Privacy-Preserving Architecture for Scalable Collaborative Spam Detection in Distributed Multi-Cloud Environments
Shanmuga Priya R1*, Yogesh Rajkumar R2, Chellaswamy C3
1Department of Artificial Intelligence and Data Science, Bharath Institute of Higher Education and Research, Chennai, India 600073
2Department of Information Technology, Bharath Institute of Higher Education and Research, Chennai, India 600073
3Department of Electronics and Communication Engineering, SRM TRP Engineering College, Tiruchirappalli, India 621105
*Corresponding author: email: r.shanmugapriya.aids@princedrkvasudevan.com; Orcid id.: 0009-0004-5272-9909
Abstract
The increasing prevalence of spam traffic poses major challenges for distributed and multi-cloud computing environments, particularly regarding scalability, workload coordination, and data privacy. Modern cluster-based infrastructures require secure and efficient mechanisms to enable collaborative spam detection across heterogeneous providers while preserving local autonomy. This paper introduces FMH-SCS, a federated and privacy-preserving distributed architecture that enables collaborative spam detection in multi-cloud cluster environments. FMH-SCS integrates Federated Learning with Secure Multi-Party Computation and Homomorphic Encryption, ensuring that sensitive email data remains local while only encrypted model updates are exchanged across computing clusters. To address communication bottlenecks in large-scale distributed training, FMH-SCS employs optimized aggregation protocols that balance security with communication efficiency. We evaluate FMH-SCS using the Enron and SpamAssassin benchmark datasets, analyzing accuracy, precision, recall, F1-score, training overhead, and communication costs in distributed setups. Experimental results show that FMH-SCS improves spam detection accuracy by up to 6.12% compared with state-of-the-art baselines, while also reducing communication and synchronization overhead across distributed clusters. These findings demonstrate that FMH-SCS provides a scalable, privacy-preserving, and computation-efficient solution for collaborative spam detection in modern cluster and distributed computing environments.
Keywords: federated learning; multi-cloud environments; privacy-preserving protocols; spam detection; homomorphic encryption
1. Introduction
The rapid escalation of spam traffic continues to place a considerable burden on modern digital infrastructures, particularly in large-scale, distributed multi-cloud environments where scalability, communication efficiency, and privacy protection must be carefully balanced. Spam not only undermines the quality of communication services but also consumes significant computing, storage, and networking resources across clusters of cloud providers. As organizations increasingly embrace heterogeneous cloud platforms such as Amazon Web Services, Microsoft Azure, and Google Cloud, spam detection has shifted from being an isolated classification problem to a distributed computing challenge. Effective solutions must operate across federated infrastructures, coordinating distributed resources, preserving user privacy, and minimizing communication overhead. This necessitates the design of frameworks that integrate privacy-preserving machine learning with communication-aware architectures optimized for clustered, multi-cloud environments.
Spam detection techniques have evolved in parallel with the growth of internet-scale communication systems. In the 1990s, when unsolicited bulk email first surged, rule-based filters provided the earliest form of defense [1]. These filters relied on handcrafted heuristics such as keyword lists, header anomalies, and sender blacklists. While these techniques were relatively easy to deploy, they proved brittle; even minor content manipulations could bypass detection [2]. In an attempt to enhance adaptability, optimization-driven methods such as particle swarm optimization (PSO) were introduced. Idris and Selamat proposed combining PSO with Local Outlier Factor analysis, achieving an accuracy of 91.22%, thereby illustrating the potential of search optimization for rule generation [3]. Yet, these solutions remained centralized, lacked scalability, and were fragile when confronted with adversarial evasion strategies.
With the proliferation of web content and digital communication services, spam traffic expanded beyond email into broader domains, degrading the reliability of search engines, messaging systems, and collaborative platforms. Classical machine learning (ML) methods such as Support Vector Machines [3] and Decision Trees [4] were widely applied, providing improved adaptability over rule-based systems. However, they faced challenges in handling highly imbalanced datasets and high-dimensional text representations. Deep learning (DL) approaches offered a major step forward. For instance, Deep Belief Networks [5] demonstrated the capacity to learn complex hierarchical representations, though they struggled on imbalanced corpora, requiring techniques like Synthetic Minority Over-Sampling to rebalance training data [6]. Similarly, denoising autoencoders improved feature robustness, pushing detection quality beyond traditional ML approaches.
The rise of advanced DL architectures, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), further transformed spam detection research. These models effectively capture spatial and sequential dependencies in text, yielding better accuracy and robustness on large-scale datasets [7]. Salman et al., for example, applied deep neural networks (DNNs) and RNNs to email spam classification, showing their ability to model contextual cues, though such systems remain vulnerable to adversarial manipulations [8]. Larger datasets such as the SMS spam corpus, containing more than 68,000 messages, enabled experimentation with hybrid models combining CNNs and Long Short-Term Memory (LSTM) networks [9]. Optimization-enhanced hybrids, such as the DL-Remora Optimization Algorithm (DL-ROA), pushed detection rates as high as 98.25%. While these advances demonstrated the potential of DL, they also introduced significant new challenges: high computational costs, centralization of sensitive communication data, and limited applicability in privacy-conscious distributed infrastructures.
These limitations have motivated the adoption of federated learning (FL), a paradigm that enables distributed model training across multiple devices or cloud providers without transferring raw data to a centralized server. Instead, each participant trains a local model, and only parameter updates are shared. This paradigm shift directly addresses privacy concerns while distributing computational loads across clusters. Kaushal et al. demonstrated the use of FL for spam detection across edge devices, confirming its suitability in privacy-sensitive contexts. FL has already achieved success in predictive text (e.g., Google’s Gboard) and healthcare analytics, but its application to spam detection in multi-cloud settings remains relatively new [10]. To improve its effectiveness, researchers have proposed approaches such as clustering to mitigate data heterogeneity [11] and advanced model architectures that handle skewed distributions [12]. For example, Vats et al. reported up to 98.3% accuracy in SMS spam detection under FL, while Thapa et al. integrated RNNs and BERT into FL pipelines for phishing detection [13]. However, these studies also highlighted FL’s inherent vulnerabilities: uneven data distributions across clients often degrade stability, and gradient updates may still leak sensitive information.
While FL marks a significant step toward distributed privacy preservation, it does not inherently guarantee secure aggregation. Gradients exchanged during the training process can be exploited through attacks such as gradient inversion, which reconstructs sensitive input data. This is particularly problematic in multi-cloud cluster environments, where providers may be competitors with semi-trusted relationships. In such contexts, protecting model updates during aggregation is as critical as avoiding raw data sharing.
Secure Multi-Party Computation (MPC) provides one path to achieving this protection. MPC allows multiple parties to jointly compute aggregation functions without revealing their individual inputs. In spam detection, MPC enables cloud providers to contribute local model updates while ensuring confidentiality during aggregation [14]. MPC has been widely applied in Industrial IoT systems [15], where collaborative analytics require strong privacy guarantees. However, MPC introduces significant communication overhead, since secure protocols often require multiple rounds of interactive exchanges between participants, increasing both bandwidth consumption and latency. In distributed, multi-cloud environments with dozens of participants, this overhead becomes a major scalability bottleneck.
Homomorphic Encryption (HE) offers a complementary mechanism by enabling computations directly on encrypted data. For example, Paillier encryption has been integrated into secure frameworks for financial risk prediction [16], and Firdaus et al. demonstrated a combination of HE, FL, and blockchain for privacy-preserving healthcare analytics [17]. Although HE ensures confidentiality even during computation, it suffers from considerable drawbacks: ciphertext expansion significantly inflates data size, and homomorphic operations impose high computational and communication costs. For distributed spam detection systems deployed in multi-cloud clusters, these costs can undermine real-time scalability.
Modern digital infrastructures are increasingly adopting multi-cloud strategies, leveraging multiple providers simultaneously to reduce downtime risks, improve flexibility, and avoid vendor lock-in [18]. In industrial contexts, multi-cloud adoption extends to cloud-edge-local collaboration for production control, energy efficiency, and resource optimization [19, 20]. However, distributed spam detection across multi-cloud clusters introduces unique challenges: interoperability across heterogeneous infrastructures, conflicting provider policies, trust boundaries, and the need for secure yet bandwidth-efficient communication. Addressing these challenges requires a holistic framework that combines FL, MPC, and HE while optimizing for communication efficiency in clustered environments.
1.1 Research Gap and Motivation
While FL enables decentralized training without exposing raw data, it remains vulnerable at the aggregation stage. Current efforts to integrate FL with MPC or HE typically focus on individual aspects of privacy, overlooking the need for unified solutions. MPC ensures secure aggregation but is communication-intensive, while HE allows encrypted computation but incurs prohibitive computational costs. Furthermore, existing approaches often neglect the communication burden of secure protocols, a critical factor in multi-cloud clusters where bandwidth costs and latency directly affect scalability.
This study addresses these gaps by introducing FMH-SCS, a federated, privacy-preserving, and communication-efficient framework for collaborative spam detection in distributed multi-cloud environments. FMH-SCS combines FL with MPC and HE, providing end-to-end protection of user data while minimizing communication overhead. Unlike prior solutions that emphasize privacy or accuracy alone, FMH-SCS explicitly integrates communication-efficient secure aggregation protocols, ensuring scalability across heterogeneous clusters of cloud providers.
1.2 Novelty of FMH-SCS
Table 1 highlights the novelty of FMH-SCS in addressing key limitations of existing approaches and its specific implications for distributed multi-cloud clusters.
Table 1
Novelty of FMH-SCS in addressing existing limitations
| Limitation in Existing Work | Why It Matters | FMH-SCS Solution | Distributed Computing Implication |
| --- | --- | --- | --- |
| FL model updates unprotected during sharing | Enables gradient inversion and privacy leaks during inter-cloud communication | MPC ensures secure and verifiable aggregation protocols | Protects inter-cluster communication across untrusted providers |
| MPC or HE used in isolation | Protects only specific training stages; high overhead when applied alone | Combined use ensures end-to-end privacy with reduced communication cost | Enables scalable secure training across multi-cloud clusters |
| Not tailored for multi-cloud environments | Fails to address interoperability, policy mismatches, and communication overhead | Designed for heterogeneous, distributed multi-cloud systems with scalable collaboration | Optimized for cluster-level coordination and resource sharing |
1.3 Research Contributions
The major contributions of this study are as follows:
We present FMH-SCS, a federated and privacy-preserving framework that integrates FL, MPC, and HE to enable decentralized and secure spam detection across distributed multi-cloud clusters.
By incorporating MPC-based secure aggregation protocols and HE-based encrypted computation, FMH-SCS reduces privacy risks while addressing communication overhead in distributed training.
The framework explicitly supports heterogeneous multi-cloud environments, ensuring secure collaboration across providers with different infrastructures and policies.
Extensive experiments are conducted on benchmark datasets (Enron and SpamAssassin). Performance is assessed using accuracy, precision, recall, and F1-score, and compared with state-of-the-art baselines (Logistic Regression, LSTM, and BERT). Additional analysis examines communication costs versus privacy levels, robustness under skewed data distributions, and scalability across multi-cloud clusters.
The remainder of this article is organized as follows. Section 2 introduces preliminaries on gradient aggregation in FL, secure aggregation using MPC, HE, and differential privacy. Section 3 describes the materials and methods. Section 4 presents datasets, system setup, evaluation metrics, and security mechanisms. Section 5 analyzes performance in terms of detection accuracy, communication overhead, robustness, and scalability. Section 6 concludes the study with future directions.
2. Preliminaries
To provide the theoretical basis for our federated and privacy-preserving spam detection framework in cluster environments, this section introduces the core building blocks: FL with gradient aggregation, MPC, HE, and Differential Privacy (DP). Beyond their cryptographic or machine learning aspects, we emphasize implications for cluster communication, parallel computation, and distributed resource management.
2.1 Gradient Aggregation in FL
FL is a decentralized paradigm where multiple participants collaboratively train a shared model without exchanging raw data. In cluster computing, each node trains the model locally on its private dataset and transmits gradient updates to a cluster coordinator node, which aggregates updates while maintaining privacy constraints [21].
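Gradient aggregation is the key communication operation in FL. Each node $i$ (e.g., a cloud provider) trains a copy of the global model using its local dataset and computes the gradient update:
$$g_i = \nabla_{\theta} \mathcal{L}_i(\theta; D_i) \tag{1}$$
where $\mathcal{L}_i$ is the loss function for node $i$ using dataset $D_i$ and $\theta$ denotes the current model parameters. Once updates are computed, they are transmitted to the CSR, which aggregates them [22] as:
$$g = \frac{1}{N} \sum_{i=1}^{N} g_i \tag{2}$$
The global model parameters are updated as:
$$\theta \leftarrow \theta - \eta\, g \tag{3}$$
where $\eta$ is the learning rate and $N$ is the number of nodes. In cases of heterogeneous data volumes, weighted averaging is employed:
$$g = \sum_{i=1}^{N} \frac{n_i}{\sum_{j=1}^{N} n_j}\, g_i \tag{4}$$
where $n_i$ is the size of the local dataset $D_i$.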
In cluster environments, each communication round involves intra-cluster transfer of one gradient vector of dimension $d$ from each of the $N$ nodes, resulting in a cost of $O(Nd)$ per round. For large-scale models (e.g., BERT), this can dominate training time. Techniques such as gradient compression, sparsification, and hierarchical aggregation are critical to optimize cluster communication.
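The aggregation rules of Eqs. (2) to (4) can be illustrated with a minimal sketch in plain Python. The function names and the toy gradients below are illustrative assumptions, not part of the FMH-SCS implementation, and gradients are represented as simple lists of floats.

```python
def aggregate_mean(gradients):
    """Unweighted average of client gradients, following Eq. (2)."""
    n = len(gradients)
    return [sum(col) / n for col in zip(*gradients)]


def aggregate_weighted(gradients, sizes):
    """Average weighted by local dataset size n_i, following Eq. (4)."""
    total = sum(sizes)
    return [
        sum(n_i * g for n_i, g in zip(sizes, col)) / total
        for col in zip(*gradients)
    ]


def sgd_step(theta, global_grad, lr):
    """Global parameter update of Eq. (3)."""
    return [t - lr * g for t, g in zip(theta, global_grad)]


# Toy example (hypothetical values): three nodes, two-dimensional gradients.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]

g_mean = aggregate_mean(grads)
g_weighted = aggregate_weighted(grads, sizes)
theta_new = sgd_step([0.0, 0.0], g_mean, lr=0.1)
```

In this toy setting the third node holds half of all samples, so the weighted aggregate is pulled toward its gradient, whereas the unweighted mean treats all nodes equally.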
2.2 Secure Gradient Aggregation Using MPC
MPC enables multiple participants to compute a joint function over private inputs without revealing them. In FL, MPC can secure the gradient aggregation step so that no party (including the CSR) learns individual updates [23].
In our framework, each participant $i$ splits its gradient $g_i$ into $N$ random shares using additive secret sharing:
$$g_i = \sum_{j=1}^{N} s_{i,j} \bmod p, \tag{5}$$
where $s_{i,j}$ is the $j$-th share of $g_i$ and $p$ is a large prime number for modular arithmetic. Each participant sends $s_{i,j}$ to the $j$-th participant. No single party can reconstruct $g_i$ from the shares it receives, ensuring privacy [24]. Each participant $j$ collects one share from each of the $N$ participants and computes a partial sum:
$$S_j = \sum_{i=1}^{N} s_{i,j} \bmod p. \tag{6}$$
The partial sums $S_j$ are sent to a CSR (or reconstructed collaboratively). The global gradient $g$ is then reconstructed as:
$$g = \frac{1}{N} \left( \sum_{j=1}^{N} S_j \bmod p \right), \tag{7}$$
where the factor $1/N$ matches the averaging in Eq. (2). The global gradient $g$ is used to update the global model parameters:
$$\theta \leftarrow \theta - \eta\, g, \tag{8}$$
where $\eta$ is the learning rate. An example of two-party secure gradient aggregation in the proposed method is shown in Fig. 1.
Cluster considerations: MPC incurs multiple communication rounds between nodes. Topology-aware protocols (e.g., ring-based or hierarchical aggregation) can reduce latency and optimize bandwidth usage across nodes.
Fig. 1
Example of two-party secure gradient aggregation in the proposed method (two cloud providers with gradients $g_1$ and $g_2$)
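The secret-sharing steps of Eqs. (5) to (7) can be sketched in a few lines of Python for the two-provider case. This is a single-process illustration under stated assumptions rather than the MP-SPDZ protocol used in the experiments: the modulus `P`, the fixed-point factor `SCALE`, and all function names are hypothetical choices, and real-valued gradients are mapped to integers modulo $p$ before sharing.

```python
import secrets

P = 2**61 - 1   # prime modulus p (illustrative choice; the paper does not fix a value)
SCALE = 10**6   # fixed-point scale for real-valued gradients (assumption)


def encode(x):
    """Map a real number into Z_p; negative values wrap around the modulus."""
    return round(x * SCALE) % P


def decode(v):
    """Inverse of encode: residues above p/2 are read as negative numbers."""
    if v > P // 2:
        v -= P
    return v / SCALE


def make_shares(value, n):
    """Split one residue into n additive shares that sum to it mod p (Eq. 5)."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares


def secure_mean(gradients):
    """Average gradients using only shares and partial sums (Eqs. 5 to 7)."""
    n = len(gradients)
    dim = len(gradients[0])
    # shares[i][k][j]: share of coordinate k of participant i, held by participant j
    shares = [[make_shares(encode(g[k]), n) for k in range(dim)] for g in gradients]
    # Each participant j sums the shares it received (Eq. 6).
    partial = [
        [sum(shares[i][k][j] for i in range(n)) % P for k in range(dim)]
        for j in range(n)
    ]
    # The coordinator adds the partial sums and rescales by 1/N (Eq. 7).
    total = [sum(partial[j][k] for j in range(n)) % P for k in range(dim)]
    return [decode(t) / n for t in total]


g1 = [0.5, -1.25]   # hypothetical gradient of provider 1
g2 = [1.5, 0.25]    # hypothetical gradient of provider 2
result = secure_mean([g1, g2])
```

Each individual share is uniformly distributed modulo $p$, so a party holding fewer than all $N$ shares of a gradient learns nothing about it; only the sum is revealed at reconstruction.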
2.3 Homomorphic Encryption
HE allows computation over encrypted data, enabling secure aggregation without exposing plaintext gradients. We adopt the Paillier cryptosystem, which supports additive homomorphism [25].
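Each participant $i$ encrypts its local gradient $g_i$ using the Paillier encryption function $\mathrm{Enc}$ with the public key $pk$:
$$c_i = \mathrm{Enc}_{pk}(g_i; r_i), \tag{11}$$
where $r_i$ is a random number chosen from $\mathbb{Z}_n^{*}$, $n$ is the Paillier modulus contained in $pk$, and $c_i$ is the ciphertext of the local gradient $g_i$. The ciphertext $c_i$ is sent to the multi-cloud environment (e.g., an aggregation server). The multi-cloud servers (or a trusted aggregation node) receive all encrypted gradients $c_1, c_2, \ldots, c_N$ and compute the aggregated ciphertext $c$ by performing modular multiplication:
$$c = \prod_{i=1}^{N} c_i \bmod n^2. \tag{12}$$
By the additive homomorphic property of Paillier encryption, this operation is equivalent to summing the plaintexts:
$$\mathrm{Dec}_{sk}(c) = \sum_{i=1}^{N} g_i = N g \tag{13}$$
where $g$ is the global gradient of Eq. (2). The server (or an authorized party in the multi-cloud network) decrypts the aggregated ciphertext $c$ with the private key $sk$ to obtain the global gradient $g$:
$$g = \frac{1}{N}\, \mathrm{Dec}_{sk}(c) \tag{14}$$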
Cluster perspective
HE increases memory and CPU requirements at each node. Parallel encryption and decryption across cluster nodes can mitigate latency while maintaining end-to-end privacy.
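To make the additive homomorphism of Eqs. (11) to (14) concrete, the following is a toy Paillier sketch in pure Python. It is not the implementation used in the experiments: it assumes the common generator choice $n + 1$, uses two small Mersenne primes that are far too short for real security, and encodes gradients as fixed-point integers with a hypothetical factor `SCALE`.

```python
import math
import secrets


def keygen(p, q):
    """Toy Paillier key pair from two primes (production keys use a >= 2048-bit n)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)   # valid because the generator is fixed to n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Enc(m; r) = (n + 1)^m * r^n mod n^2 with random r coprime to n (Eq. 11)."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, sk, c):
    """Recover the plaintext residue mod n from ciphertext c."""
    lam, mu = sk
    ell = (pow(c, lam, n * n) - 1) // n
    return (ell * mu) % n


def add_ciphertexts(n, ciphertexts):
    """Multiply ciphertexts mod n^2, which adds the plaintexts (Eq. 12)."""
    acc = 1
    for c in ciphertexts:
        acc = (acc * c) % (n * n)
    return acc


SCALE = 1000                                 # fixed-point factor (assumption)
n, sk = keygen(2**31 - 1, 2**61 - 1)         # toy primes, insecure by design
local = [0.125, -0.5, 0.75]                  # one hypothetical coordinate per node
cts = [encrypt(n, round(g * SCALE) % n) for g in local]

total = decrypt(n, sk, add_ciphertexts(n, cts))
if total > n // 2:                           # residues above n/2 encode negatives
    total -= n
global_grad = total / SCALE / len(local)     # rescale and average (Eq. 14)
```

The aggregator in this sketch only ever multiplies ciphertexts; the plaintext sum becomes visible solely to the holder of the private key.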
2.4 Differential Privacy and its enhancement
Differential Privacy (DP) provides formal privacy guarantees by ensuring that individual data points cannot be inferred from aggregated outputs. In FL, DP is typically applied by adding noise to local or aggregated gradients [26, 27].
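To ensure DP in FL, noise is added to the local gradients $g_i$ or to the aggregated global gradient $g$. Each participant $i$ computes a private, noise-added gradient $\tilde{g}_i$ before sharing:
$$\tilde{g}_i = g_i + \xi_i, \qquad \xi_i \sim \mathcal{N}(0, \sigma^2 I) \tag{15}$$
where $g_i$ is the gradient computed by participant $i$ and $\xi_i$ is Gaussian noise with mean 0 and variance $\sigma^2$ per coordinate. The variance $\sigma^2$ of the Gaussian noise is calibrated based on the desired privacy level $(\varepsilon, \delta)$ and the sensitivity $\Delta f$:
$$\sigma^2 = \frac{2 \ln(1.25/\delta)\, (\Delta f)^2}{\varepsilon^2} \tag{16}$$
where $\Delta f$ is the sensitivity of the function (the maximum change in the output caused by modifying a single participant's data), and $\varepsilon$ and $\delta$ are the privacy budget and the probability of breaking DP, respectively. For gradient aggregation, $\Delta f$ can be bounded as $\Delta f \le \max_i \lVert g_i \rVert_2$. To bound sensitivity and ensure consistent privacy guarantees, each gradient $g_i$ is clipped to a maximum $\ell_2$ norm $C$:
$$\bar{g}_i = \frac{g_i}{\max\left(1, \lVert g_i \rVert_2 / C\right)} \tag{17}$$
This ensures that $\lVert \bar{g}_i \rVert_2 \le C$, controlling the influence of outlier gradients.
Each participant encrypts its noise-added gradient:
$$c_i = \mathrm{Enc}_{pk}(\tilde{g}_i), \tag{18}$$
where $\tilde{g}_i = \bar{g}_i + \xi_i$. The encrypted gradients are securely aggregated:
$$c = \prod_{i=1}^{N} c_i \bmod n^2, \tag{19}$$
with $n$ the Paillier modulus of Section 2.3. After decryption, the global gradient $\tilde{g}$ incorporates DP noise:
$$\tilde{g} = \frac{1}{N}\, \mathrm{Dec}_{sk}(c) = \frac{1}{N} \sum_{i=1}^{N} \left( \bar{g}_i + \xi_i \right) \tag{20}$$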
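Because federated training is iterative, privacy loss accumulates across rounds rather than diminishing. For $T$ rounds of updates, each satisfying $(\varepsilon, \delta)$-DP, basic sequential composition bounds the total privacy budget by:
$$\varepsilon_{\text{total}} \le T\, \varepsilon, \qquad \delta_{\text{total}} \le T\, \delta \tag{21}$$
Longer training therefore consumes more privacy budget, so the number of rounds and the noise scale $\sigma$ must be chosen jointly. Tighter accounting methods such as advanced composition yield a bound that grows roughly with $\sqrt{T}$ rather than $T$.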
Cluster relevance
DP affects convergence rates and may increase intra-cluster communication due to additional iterations. Cluster scheduling and parallel computation can be leveraged to maintain efficiency while preserving privacy.
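The clipping and noise-calibration steps described in this section can be sketched as follows. This is a minimal illustration of the classical Gaussian mechanism, whose standard analysis assumes a per-round privacy budget below 1; the function names, the clipping bound, and the privacy parameters are illustrative assumptions rather than the settings used in the experiments.

```python
import math
import random


def clip(grad, c):
    """Scale a gradient so that its L2 norm is at most c."""
    norm = math.sqrt(sum(x * x for x in grad))
    factor = max(1.0, norm / c)
    return [x / factor for x in grad]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation of the classical Gaussian mechanism."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(grad, c, epsilon, delta, rng):
    """Clip first, then add independent Gaussian noise to every coordinate."""
    sigma = gaussian_sigma(c, epsilon, delta)   # sensitivity bounded by the clip norm
    return [x + rng.gauss(0.0, sigma) for x in clip(grad, c)]


rng = random.Random(0)                          # seeded only for reproducibility
clipped = clip([3.0, 4.0], 1.0)                 # norm 5 is rescaled to norm 1
sigma = gaussian_sigma(1.0, 0.5, 1e-5)          # hypothetical epsilon and delta
noisy = privatize([3.0, 4.0], 1.0, 0.5, 1e-5, rng)
```

With these hypothetical parameters the noise standard deviation is roughly ten times the clipping bound, which illustrates why strict privacy budgets slow convergence when only a few clients participate.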
3. Materials and Methods
The increasing volume of spam emails presents significant challenges for secure, scalable, and reliable communication in cluster-based multi-cloud environments. Traditional centralized spam detection systems are limited in privacy, scalability, and fault tolerance, especially when sensitive data is distributed across multiple independent cloud providers (clients). To address these challenges, we propose FMH-SCS, a secure and scalable spam detection framework leveraging FL, MPC, and HE within a collaborative cluster computing setup. FMH-SCS integrates complementary privacy-preserving techniques (PPT) to ensure robust spam detection while protecting distributed user data.
3.1 Overview of FMH-SCS framework
In FMH-SCS, a multi-cloud cluster environment provides the foundation for scalable and secure collaborative spam detection. Each cloud provider operates as a cluster node with independent data and infrastructure, enabling decentralized computation, redundancy, and flexible resource allocation. By distributing computation across multiple nodes, FMH-SCS reduces the risks associated with single-provider reliance while improving system resilience and throughput.
FL forms the backbone of FMH-SCS. Each cluster node trains a local model on its private dataset (email content, headers, and metadata) and transmits only model updates (gradients or weights) to a cluster coordinator node for aggregation [28]. This approach preserves data privacy, reduces exposure to breaches, and allows continuous adaptation to emerging spam patterns.
To enhance privacy, MPC is integrated into the FL process. MPC ensures that local updates remain confidential during aggregation. Cluster nodes compute and share secret-shared gradients, which are then securely aggregated without exposing individual contributions. MPC protects against both internal adversaries and semi-trusted coordinators while maintaining collaboration efficiency.
HE secures computations on encrypted gradients, allowing the coordinator node to perform aggregations without accessing plaintext values [29]. By combining FL, MPC, and HE, FMH-SCS ensures end-to-end privacy while maintaining the integrity of the collaborative spam detection model.
Cluster-specific benefits include:
Parallelized local training to exploit multi-core and multi-node architectures.
Secure inter-node communication optimized for low-latency cluster networks.
Scalable deployment that accommodates heterogeneous cloud providers with varying resources.
Fig. 2
Block diagram of the proposed FMH-SCS. Each client trains a local model, encrypts updates, and sends them to a CSR for secure aggregation. The updated global model is then redistributed to cluster nodes for the next round of training.
3.2 Spam Detection Model
The spam detection model is a supervised learning system, typically a neural network, trained to classify emails as spam or non-spam. Features include email content, headers, and metadata. Each cluster node trains a local model and shares updates with the CSR. Aggregated updates produce a global model, which is then used for prediction.
In a multi-cloud cluster, each node maintains local control of sensitive data. FL enables collaborative training without sharing raw data. Secure aggregation via MPC and encrypted computations via HE ensure that model updates are protected against adversarial access. This allows the system to continuously adapt to new spam patterns while maintaining robust privacy guarantees.
Threat Model:
External adversaries: Attempt to intercept communications or access training data. Mitigation: HE keeps model updates encrypted throughout aggregation.
Internal adversaries: Malicious clients may try to infer data from updates. Mitigation: MPC ensures raw updates remain inaccessible, while Differential Privacy masks contributions.
Model poisoning: Adversaries may inject malicious updates. Mitigation: Secure aggregation and DP reduce the risk of corruption.
Eavesdropping: All cluster communications are encrypted to prevent data leakage.
3.3 Performance Metrics
To evaluate FMH-SCS in cluster environments, it is essential to consider metrics that assess both the performance and the overhead introduced by the PPT. These metrics assess how well the trained model distinguishes between spam and legitimate messages. Key metrics include:
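Precision (PRN) evaluates the fraction of correctly identified spam messages out of all messages classified as spam, ensuring minimal false alarms.
$$\mathrm{PRN} = \frac{TP}{TP + FP} \tag{22}$$
Recall (RCL) focuses on the model’s ability to identify actual spam messages, minimizing missed detections.
$$\mathrm{RCL} = \frac{TP}{TP + FN} \tag{23}$$
F1-score, the harmonic mean of PRN and RCL, balances these two metrics, making it valuable when there is a class imbalance between spam and legitimate messages.
$$\mathrm{F1} = \frac{2 \cdot \mathrm{PRN} \cdot \mathrm{RCL}}{\mathrm{PRN} + \mathrm{RCL}} \tag{24}$$
A high false positive rate (FPR) can lead to missed business opportunities, communication failures, or even legal issues, while a high false negative rate (FNR) can expose users to phishing, malware, and fraud. A low FPR ensures that legitimate emails are not mistakenly marked as spam, and a low FNR ensures that harmful emails do not reach users.
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{25}$$
$$\mathrm{FNR} = \frac{FN}{FN + TP} \tag{26}$$
Here $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, with spam as the positive class. Additionally, ROC-AUC quantifies the trade-off between the true positive and false positive rates, offering insight into the model’s discriminative power.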
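The detection metrics above follow directly from the four confusion-matrix counts. The short sketch below, with illustrative function names and a hypothetical set of labels, shows one way to compute them with spam encoded as class 1.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with spam encoded as 1 and ham as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def detection_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, FPR, and FNR; a ratio with zero denominator is 0."""
    def ratio(num, den):
        return num / den if den else 0.0

    prn = ratio(tp, tp + fp)
    rcl = ratio(tp, tp + fn)
    return {
        "precision": prn,
        "recall": rcl,
        "f1": ratio(2 * prn * rcl, prn + rcl),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }


# Hypothetical labels for eight messages (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = detection_metrics(*confusion_counts(y_true, y_pred))
```

In this example one spam message is missed and one legitimate message is flagged, so precision, recall, and F1 are all 0.75 and both error rates are 0.25.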
Adversarial data poisoning is a type of attack on ML systems in which an attacker deliberately alters the training data to reduce the accuracy or reliability of the model [30, 31]. The goal is to "poison" the model by injecting carefully crafted malicious data that either confuses the model or makes it produce incorrect predictions.
$$\max_{D_p}\ \mathcal{L}\left(f_{\theta^{*}}, D_{\text{test}}\right) \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta}\ \mathcal{L}\left(f_{\theta}, D_{\text{train}} \cup D_p\right) \tag{27}$$
where $f_{\theta}$ is the model with parameters $\theta$, $\mathcal{L}$ is the loss function (e.g., cross-entropy), $D_{\text{train}}$ is the original training data, $D_p$ is the crafted poisoned data (to be optimized), $D_{\text{test}}$ is the test data (used to evaluate performance), and $\theta^{*}$ denotes the trained model parameters after poisoning.
Privacy and security metrics evaluate the effectiveness of protecting user data in the spam detection framework. The privacy budget measures the level of differential privacy, where a smaller value ensures stronger privacy guarantees. Robustness to collusion assesses the framework’s ability to prevent data leaks when multiple parties collude. Metrics such as HE strength, defined by key size and noise tolerance, ensure data remains secure during computations. Attack resistance evaluates the system's resilience to adversarial attacks such as gradient inversion or model extraction. These metrics ensure a balance between privacy, security, and operational performance in multi-cloud environments.
Communication efficiency metrics measure the overhead introduced by secure methods such as FL, MPC, and HE. Communication cost quantifies the amount of data exchanged between participants during model training, especially for encrypted gradients or secure computations. Bandwidth usage evaluates the network requirements to handle these communications efficiently. The number of communication rounds reflects the iterations needed for model convergence, indicating the scalability of the system. Efficient communication is critical in multi-cloud environments to ensure timely, cost-effective operations while maintaining data privacy, especially when dealing with encrypted or distributed computations.
Computation efficiency metrics assess the resource usage and time required for PPT in spam detection. Computation Time measures the time taken for operations like encryption, decryption, homomorphic computations, and secure gradient aggregation. Latency evaluates delays in training caused by PPT compared to plaintext operations. Scalability examines how computational overhead grows with an increasing number of participants or cloud providers. These metrics ensure the framework balances privacy guarantees with practical resource and time constraints for real-world deployment.
4. Experiments
4.1 Dataset Description
In this study, we use two main datasets to test our proposed FMH-SCS framework: the SpamAssassin dataset and the Enron Email Dataset. These datasets help us simulate a real-world spam detection system in a federated learning environment where multiple clients collaborate without sharing their private data. The SpamAssassin Email Spam Corpus is a widely used dataset for spam detection research. It contains around 6,000 labeled emails, with approximately 1,800 spam and 4,200 non-spam (ham) messages, so roughly 30% of the corpus is spam. The emails vary in size, typically ranging from 1 KB to 100 KB. This dataset is freely available and includes real-world spam and legitimate emails collected from multiple sources. It provides a realistic, moderately imbalanced environment for training and evaluating spam classification models [32, 33]. Both datasets allow simulation of multiple clients in a federated learning environment, where sensitive data remains local, supporting evaluation under non-IID distributions typical in multi-cloud and cluster-based systems.
The second dataset we use is the Enron Email Dataset, which contains about 600,000 real emails from around 150 users at the Enron Corporation. Out of these, about 35,000 emails are labeled as either ham or spam. This dataset is widely used for email spam detection research and provides a realistic testing ground for our model. It includes actual message headers, bodies, and metadata, which are useful for feature extraction and training [34].
4.2 System and Simulation Environment
The proposed FMH-SCS framework was implemented in a cluster-simulated multi-cloud environment, where each cloud provider acts as a cluster node managing its own dataset. Each provider trains a local spam detection model on its own data without sharing any raw information. The system is set up to simulate FL, secure aggregation using MPC, and privacy-preserving model updates using HE. The entire framework was developed in Python, using libraries such as TensorFlow Federated and PyTorch to build the ML models. To simulate federated learning, the SpamAssassin corpus is split across multiple clients, each representing a provider with different email data. The Enron Email Dataset is used as a benchmark to show how the system performs on real-world email data.
We divided the datasets into 5 to 10 simulated clients, each holding a unique part of the data. This setup tests how the model behaves when data is non-identically and unevenly distributed (non-IID), which is common in real-world cloud environments. Each provider trains its own local model, and only the model updates are shared for aggregation. For HE, we used the Microsoft SEAL library to encrypt model updates. This ensures that neither the clients nor the CSR can see the actual model values during aggregation. For MPC, we used MP-SPDZ, a secure computation library that allows multiple parties to compute on shared data without revealing their own inputs. This is used for secure comparison of model performance or aggregation. The experiments were run on a system with the following configuration: processor, Intel Core i7; RAM, 32 GB; operating system, Windows 11; cloud setup, Google Colab Pro/AWS EC2 instances for larger simulations. The system supports both local simulation (for development and testing) and scalable cloud deployment (for future experiments with more providers). This environment allows us to test the privacy, accuracy, and scalability of our FMH-SCS framework under realistic conditions.
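The paper does not state how the non-IID client split was generated. As one possible illustration, the sketch below uses a common label-skew scheme: samples are sorted by label, cut into equal shards, and a few shards are dealt to each client. The function name, shard count, and seed are assumptions, and the label counts simply reuse the approximate SpamAssassin figures given in Section 4.1.

```python
import random


def shard_partition(labels, num_clients, shards_per_client, seed=0):
    """Label-skewed split: sort indices by label, cut into shards, deal shards out.

    Indices left over when the data does not divide evenly are dropped.
    """
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(order) // num_shards
    shards = [
        order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)
    ]
    rng.shuffle(shards)
    clients = []
    for c in range(num_clients):
        picked = shards[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append([i for shard in picked for i in shard])
    return clients


# Approximate SpamAssassin label counts from the text: 1 = spam, 0 = ham.
labels = [1] * 1800 + [0] * 4200
clients = shard_partition(labels, num_clients=5, shards_per_client=2)
spam_fraction = [sum(labels[i] for i in idx) / len(idx) for idx in clients]
```

Because each shard is nearly pure in one label, the per-client spam fraction differs sharply from the global 30%, which is the kind of skew the non-IID evaluation is meant to stress.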
Fig. 3
Intra-dataset results of FMH-SCS for different training sample sizes on the SpamAssassin dataset: (a) PRN (b) RCL (c) F1-score
4.3 Intra-dataset evaluation results
Intra-dataset evaluation is important because it helps assess a model’s performance on unseen data from the same source, ensuring generalization. It helps detect overfitting and ensures the model is not just memorizing the training data. This evaluation provides a reliable baseline for comparison with other models. It also supports hyperparameter tuning and performance optimization.
The study also explored how the number of training samples affects the performance of FMH-SCS. Figure 3 shows the results of FMH-SCS on the SpamAssassin dataset for training sample sizes ranging from 5% to 25% (in increments of 5%), evaluated at threshold values of 0.5, 0.6, 0.7, 0.8, and 0.9. Figure 3(a) shows that with fewer training samples, the PRN of the FMH-SCS model fluctuates considerably and is usually lower at the beginning of training, most likely because the model overfits when too little data is available. FMH-SCS performs better when trained on 20–25% of the samples. Figure 3(b) shows that RCL improves steadily as more training samples are added, with the best RCL occurring when 20–25% of the data is used. Figure 3(c) shows that the F1-score, the harmonic mean of PRN and RCL, also increases with more training data, although the improvements level off beyond a certain point. The highest F1-score was achieved with 20% of the training samples.
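The threshold sweep behind these curves can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function name `prf_at_threshold` and the eight toy scores are hypothetical, and a message is predicted as spam when its score is at or above the threshold.

```python
from typing import Sequence, Tuple


def prf_at_threshold(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[float, float, float]:
    """Return (PRN, RCL, F1) when score >= threshold is predicted as spam."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    prn = tp / (tp + fp) if tp + fp else 0.0
    rcl = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prn * rcl / (prn + rcl) if prn + rcl else 0.0
    return prn, rcl, f1


# Hypothetical spam scores (1 = spam, 0 = ham).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    prn, rcl, f1 = prf_at_threshold(scores, labels, t)
    print(f"threshold={t:.1f}  PRN={prn:.2f}  RCL={rcl:.2f}  F1={f1:.2f}")
```

Raising the threshold generally trades RCL for PRN, which is why the figures report all three metrics per threshold.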
Similarly, Fig. 4 examines the same trend on the Enron dataset, using 5–25% of the data and thresholds from 0.5 to 0.95. In Fig. 4(a), the PRN of FMH-SCS fluctuates considerably with fewer samples because of overfitting, but improves significantly with 20–25% of the data. Figure 4(b) shows RCL improving as more data is used. Figure 4(c) confirms that using more training data leads to better F1-scores. As with PRN and RCL, however, the benefit diminishes beyond a certain level.
Fig. 4
Intra-dataset results of FMH-SCS for different training sample sizes on the Enron dataset: (a) PRN (b) RCL (c) F1-score
4.4 Model Evaluation and Analysis
To evaluate the efficiency and adaptability of the FMH-SCS framework, we conducted experiments using three types of baseline models: a Logistic Regression Model (LRM), Long Short-Term Memory (LSTM) networks, and Bidirectional Encoder Representations from Transformers (BERT). These models were selected to represent different levels of complexity, from simple linear classifiers to advanced DL and transformer-based architectures. By studying these models within the federated learning setup integrated with MPC and HEN, we aim to understand how the framework handles varying computational loads, data privacy requirements, and communication overhead. Each model was trained and evaluated on the same datasets (SpamAssassin and Enron), ensuring a fair comparison of their performance under privacy-preserving, decentralized conditions. The goal is to demonstrate that FMH-SCS not only preserves data privacy but also supports a wide range of machine learning models without significantly sacrificing accuracy or scalability in cluster-like environments.
Table 2
Performance of FMH-SCS and other models under varying hyperparameters for the Enron and SpamAssassin datasets
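| Model | Dataset | Epochs | Batch Size | LR/Hidden Size | Accuracy (%) | PRN (%) | RCL (%) | F1-Score (%) | Training Time (s) | Comm. Overhead (MB) | HE Impact (%) | MPC Cost (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | SpamAssassin | 10 | 32 | LR = 0.01 | 83.5 | 82.1 | 84.7 | 83.4 | 35 | 4 | -1.2 | 12 |
| Logistic Regression | Enron | 10 | 64 | LR = 0.005 | 86.2 | 85.0 | 87.4 | 86.2 | 42 | 6 | -1.5 | 14 |
| LSTM | SpamAssassin | 15 | 32 | Hidden = 128 | 89.3 | 88.0 | 90.6 | 89.3 | 120 | 18 | -2.8 | 29 |
| LSTM | Enron | 15 | 64 | Hidden = 256 | 91.5 | 91.2 | 91.8 | 91.5 | 132 | 21 | -2.5 | 34 |
| BERT | SpamAssassin | 5 | 16 | Hidden = 768 | 94.7 | 94.5 | 94.9 | 94.7 | 420 | 80 | -3.6 | 68 |
| BERT | Enron | 5 | 32 | Hidden = 768 | 96.2 | 96.0 | 96.4 | 96.2 | 470 | 85 | -3.9 | 75 |
| Proposed | SpamAssassin | 5 | 32 | Hidden = 512 | 96.4 | 95.5 | 96.1 | 95.8 | 330 | 58 | -2.9 | 52 |
| Proposed | Enron | 5 | 32 | Hidden = 512 | 97.8 | 96.4 | 97.3 | 96.8 | 330 | 58 | -2.9 | 52 |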
To evaluate the effectiveness of the proposed FMH-SCS framework, we conducted extensive experiments using four models: Logistic Regression, LSTM, BERT, and the proposed model. These models were tested on two datasets, SpamAssassin and Enron, under varying hyperparameters such as epochs, batch size, and hidden layer size, as shown in Table 2. The aim is to observe their performance in terms of accuracy, training time, communication overhead, and the impact of privacy mechanisms such as HEN and MPC. Among the baselines, BERT obtained the highest accuracy (96.2% on the Enron dataset), demonstrating its strong capability in handling the complex language patterns found in spam emails; the proposed model exceeded it with 97.8% on the same dataset. BERT's performance came at the cost of increased training time (over 400 seconds) and communication overhead (80 to 85 MB). In contrast, Logistic Regression, while delivering lower accuracy (around 83–86%), required minimal training time (under 45 seconds) and had the least communication overhead, making it suitable for resource-limited cloud nodes.
LSTM showed balanced performance, achieving good accuracy (around 89–91%) with moderate training time and communication cost. This makes LSTM a practical option for mid-level systems that need reasonable performance without the high computational burden of transformer-based models.
The impact of HEN was also studied. Across all models, we observed a small drop in accuracy (ranging from 1.2% to 3.9%) due to the use of encrypted computations, which is expected given the added noise and complexity. This drop is acceptable considering the enhanced security benefits. Additionally, MPC aggregation times were lowest for simpler models (12 ms for LRM) and increased with model complexity (up to 75 ms for BERT), reflecting the cost of secure collaborative learning. The results confirm that the FMH-SCS framework supports privacy-preserving spam detection across a variety of models and cloud setups.
Fig. 5
ROC curve comparison of the proposed FMH-SCS framework with other models: (a) Enron dataset (b) SpamAssassin dataset
The ROC (Receiver Operating Characteristic) curves shown in Fig. 5 compare the performance of three spam detection models (LRM, LSTM, and BERT) under two privacy settings: basic FL and the more secure combination of FL with MPC and HEN. Each curve represents how well a model can differentiate between ham and spam emails across different decision thresholds. The area under the curve (AUC) is used as a performance metric, where a value nearer to 1.0 indicates better classification performance.
As observed, BERT with only FL performs best, with the highest AUC, indicating that it classifies spam and non-spam emails very accurately. When privacy-preserving techniques (PPT) such as MPC and HEN are added, there is a slight drop in performance, but the model remains effective. Similarly, LSTM and Logistic Regression show strong ROC curves under both settings, although their AUC values are slightly lower than those of BERT. This visualization highlights a common trade-off: adding more privacy can slightly reduce accuracy, but the loss is minimal in this case, showing that the FMH-SCS framework maintains strong detection performance while significantly enhancing privacy and security. Fig. 5(b) shows the corresponding ROC curve comparison for the SpamAssassin dataset across the same models (LRM, LSTM, and BERT) and the same two privacy settings: FL only and FL with MPC and HEN. As with the Enron dataset, the ROC curves show that performance remains strong even as privacy levels increase. The AUC values for BERT remain the highest, confirming its robustness, while LSTM and Logistic Regression also perform reliably.
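For reference, the AUC used in this comparison can be computed directly from its probabilistic definition: the chance that a randomly chosen spam email receives a higher score than a randomly chosen ham email, with ties counted as one half. The sketch below is a minimal illustration with hypothetical scores, not the evaluation code used in the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (spam, ham) pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one spam and one ham example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (1 = spam, 0 = ham).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(f"AUC = {roc_auc(scores, labels):.3f}")
```

The pairwise form is quadratic in the number of samples; it is used here for clarity, while sorting-based implementations are preferable for large test sets.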
4.5 Security Setup Using MPC and HEN
To ensure data privacy and secure model training in a distributed environment, the proposed FMH-SCS framework integrates advanced cryptographic techniques: MPC and HEN. These mechanisms allow different clients to collaboratively train a spam detection model without ever sharing their raw email data.
In our setup, MPC is used during the aggregation of model updates in the FL process. Each cloud provider computes its local model gradients and shares them in an encrypted or secret-shared form. Using secure MPC protocols like Shamir's Secret Sharing, the CSR can combine these updates without learning the individual contributions. This prevents any single party from reconstructing the original data or model updates of other participants.
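The idea of aggregating secret-shared updates can be sketched with Shamir's scheme over a prime field. This is a toy illustration under stated assumptions, not the MP-SPDZ protocol used in the experiments: the field prime, the fixed-point scale, and all function names are hypothetical, and `random.Random` stands in for a cryptographically secure source such as the `secrets` module.

```python
import random
from typing import List, Sequence, Tuple

PRIME = 2**61 - 1   # Mersenne prime used as the field modulus (assumption)
SCALE = 10**6       # fixed-point scale for real-valued updates (assumption)


def encode(x: float) -> int:
    return round(x * SCALE) % PRIME


def decode(v: int) -> float:
    # Values above PRIME // 2 represent negative numbers.
    return (v - PRIME if v > PRIME // 2 else v) / SCALE


def share(secret: int, n: int, t: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation of the polynomial
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: Sequence[Tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def secure_sum(values: Sequence[float], n_parties: int, threshold: int,
               seed: int = 0) -> float:
    """Each value is shared; parties add their shares; only the sum is opened."""
    rng = random.Random(seed)
    summed = [0] * n_parties
    for v in values:
        for k, (_, y) in enumerate(share(encode(v), n_parties, threshold, rng)):
            summed[k] = (summed[k] + y) % PRIME
    points = [(k + 1, summed[k]) for k in range(threshold)]
    return decode(reconstruct(points))


# Three hypothetical gradient values from three providers.
print(secure_sum([0.25, -0.5, 1.0], n_parties=5, threshold=3))
```

Because Shamir shares are additively homomorphic, the parties never reconstruct an individual update; only the aggregate is opened.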
To further strengthen security, we employ HEN for computations on encrypted data. In this framework, model updates are encrypted using HEN schemes such as Cheon-Kim-Kim-Song, allowing arithmetic operations to be performed directly on ciphertexts. This means the aggregator can compute the average of encrypted model weights without needing to decrypt them. As a result, sensitive information remains encrypted throughout the computation process, ensuring end-to-end confidentiality.
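The principle of averaging encrypted weights without decrypting individual contributions can be illustrated with a toy additively homomorphic scheme. The sketch below deliberately swaps in a small textbook Paillier cryptosystem instead of the CKKS scheme named above, because Paillier fits in a few lines of pure Python; the key size is insecure and chosen only for readability, and every name and parameter here is an assumption rather than part of FMH-SCS or Microsoft SEAL.

```python
import math
import random
from typing import List, Sequence, Tuple

SCALE = 1000  # fixed-point scale for real-valued weights (assumption)


def keygen(p: int = 104729, q: int = 1299709) -> Tuple[int, Tuple[int, int]]:
    """Toy Paillier keys from two small primes (NOT secure key sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                                # valid for g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int, rng: random.Random) -> int:
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    # With g = n + 1, g^m mod n^2 equals 1 + m*n.
    return ((1 + (m % n) * n) * pow(r, n, n2)) % n2


def decrypt(n: int, priv: Tuple[int, int], c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def encrypted_average(client_weights: Sequence[Sequence[float]],
                      seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    n, priv = keygen()            # in practice the aggregator never holds priv
    dim = len(client_weights[0])
    totals = [encrypt(n, 0, rng) for _ in range(dim)]
    for weights in client_weights:
        for j, w in enumerate(weights):
            totals[j] = add_encrypted(n, totals[j], encrypt(n, round(w * SCALE), rng))
    averaged = []
    for c in totals:              # decryption happens only on the aggregate
        v = decrypt(n, priv, c)
        v = v - n if v > n // 2 else v
        averaged.append(v / SCALE / len(client_weights))
    return averaged


print(encrypted_average([[0.5, -0.2], [0.1, 0.4], [0.3, -0.5]]))
```

CKKS differs in that it encodes vectors of approximate real numbers and supports both addition and multiplication on ciphertexts, but the aggregation pattern (encrypt locally, combine ciphertexts, decrypt only the aggregate) is the same.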
Although these techniques add some computational and communication overhead, our experiments (Sections 4.3 and 4.4) show that the impact on model accuracy is minimal. The trade-off is well justified by the strong privacy guarantees offered. Together, MPC and HEN form a solid backbone for secure and scalable spam detection in multi-cloud environments, making the FMH-SCS framework suitable for real-world deployment where user privacy is a top priority.
Fig. 6
Security setup for the proposed FMH-SCS with and without MPC and HEN for different datasets (a) Enron dataset and (b) SpamAssassin dataset
The performance of the models (LRM, LSTM, and BERT) with and without privacy-preserving techniques (PPT) such as HEN and MPC is shown in Fig. 6(a) and (b) for the Enron and SpamAssassin datasets, respectively. While applying HEN and MPC adds slight computational overhead, the drop in model accuracy is minimal, typically less than 2% for the Enron dataset. For example, BERT maintains an accuracy above 93% even with encryption, confirming that high security can be achieved without sacrificing detection effectiveness. Similarly, the proposed FMH-SCS provides better accuracy than the other models on the SpamAssassin dataset.
5. Results and discussion
5.1 Training progress plot
The training progress plot is important for understanding how the FMH-SCS framework learns over time, especially in a federated and privacy-preserving setup. It helps track the model's convergence by showing changes in loss or accuracy across training epochs. This visualization makes it easier to detect issues like slow learning, overfitting, or instability early in the training process. It also allows comparisons between different configurations, such as varying client numbers, learning rates, or the impact of secure computations. The training progress plot adds valuable insight and supports the reliability of experimental results.
Fig. 7
Accuracy and loss plots during the training process for the Enron dataset using the FMH-SCS framework
The training progress plot for the Enron dataset (Fig. 7) shows how the FMH-SCS framework performs during learning across 26 epochs, with each epoch consisting of 31 iterations, for a total of 806 iterations. Throughout training, the model progressively improves at distinguishing spam from ham (non-spam) emails while maintaining data privacy using FL, MPC, and HEN. At the beginning of training, the model starts with relatively low accuracy because it has limited knowledge of the dataset. As training progresses, the validation accuracy steadily increases, indicating that the model is generalizing effectively from local client data across multiple cloud environments. By the final epoch, the validation accuracy reaches 98.60%, indicating strong learning capability and convergence. The relatively smooth curve, without sudden spikes or drops, suggests stable training with proper synchronization among clients. This plot confirms that the FMH-SCS framework can achieve high performance on the Enron email dataset even under strict privacy constraints.
Similarly, when applying the same training parameters to the SpamAssassin dataset (Fig. 8), the FMH-SCS framework achieved a commendable 98% validation accuracy. This result highlights the model's consistency and adaptability across different datasets in a federated and privacy-aware environment. The training progress plot for SpamAssassin shows a gradual and steady rise in performance, confirming effective learning despite the decentralized nature of the data. The high final accuracy underscores the robustness of the FMH-SCS framework in handling non-IID data distributions typical in federated setups, and further supports its suitability for real-world, privacy-sensitive spam detection applications.
Fig. 8
Accuracy and loss plots during the training process for the SpamAssassin dataset using the FMH-SCS framework
5.2 Performance Analysis of FMH-SCS
This section discusses the overall performance of the FMH-SCS framework by comparing different ML models on two datasets, Enron and SpamAssassin. We evaluated the models using common performance indices such as accuracy, PRN, RCL, and F1-score. The models considered include Logistic Regression, LSTM, and BERT, all trained using FL and enhanced with MPC and HEN for security.
Table 3 shows that FMH-SCS combined with FL outperformed the other models on both datasets. It achieved the highest accuracy and F1-score, which shows that it can identify spam emails very effectively while maintaining a balance between false positives and false negatives. LSTM also performed well, especially in RCL, indicating that it was good at catching spam emails, but it had slightly lower PRN than BERT, meaning that it occasionally flagged non-spam messages as spam. Logistic Regression, being a simpler model, showed decent accuracy but was less effective than the DL models. Additionally, the proposed FMH-SCS model consistently showed lower false positive rate (FPR) and false negative rate (FNR) on both the Enron and SpamAssassin datasets. For instance, on the Enron dataset FMH-SCS achieved an FPR of just 2.9% and an FNR of 3.9%, indicating its ability to accurately identify spam while minimizing false alarms. Similarly, on the SpamAssassin dataset, the model maintained low error rates, with an FPR of 3.1% and an FNR of 4.5%. These results highlight the robustness of FMH-SCS in correctly detecting threats while avoiding misclassifications, making it more reliable than traditional models.
Table 3
Performance of the proposed model and other models under different configurations for the Enron and SpamAssassin datasets
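| Dataset | Model | Learning Rate | Batch Size | Accuracy (%) | PRN (%) | RCL (%) | F1-Score (%) | AUC | Training Time (min) | FPR (%) | FNR (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Enron | LRM | 0.01 | 32 | 85.6 | 84.2 | 86.7 | 85.4 | 0.89 | 12 | 7.1 | 13.3 |
| Enron | LSTM | 0.001 | 64 | 91.2 | 89.5 | 92.6 | 91.0 | 0.94 | 26 | 5.2 | 7.4 |
| Enron | BERT | 2e-5 | 32 | 93.1 | 91.7 | 94.2 | 92.9 | 0.95 | 33 | 4.1 | 5.8 |
| Enron | Proposed | 3e-5 | 32 | 95.4 | 94.6 | 96.1 | 95.3 | 0.97 | 39 | 2.9 | 3.9 |
| SpamAssassin | LRM | 0.01 | 32 | 83.4 | 82.0 | 84.7 | 83.3 | 0.87 | 10 | 7.9 | 15.3 |
| SpamAssassin | LSTM | 0.001 | 64 | 89.9 | 88.1 | 91.4 | 89.7 | 0.93 | 22 | 6.2 | 8.6 |
| SpamAssassin | BERT | 2e-5 | 32 | 91.3 | 90.2 | 92.6 | 91.3 | 0.94 | 29 | 5.0 | 7.4 |
| SpamAssassin | Proposed | 3e-5 | 32 | 94.2 | 93.0 | 95.5 | 94.2 | 0.96 | 38 | 3.1 | 4.5 |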
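The metrics reported in Table 3 all derive from the four confusion-matrix counts. The sketch below shows the standard definitions; note in particular that FNR = 100 − RCL, a relation the RCL and FNR columns of Table 3 satisfy. The counts used in the example are hypothetical, chosen only so that RCL, FNR, and FPR match the Enron "Proposed" row; they are not the confusion matrix of the study, and the resulting accuracy and PRN differ from the table.

```python
from typing import Dict


def classification_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics, expressed as percentages."""
    prn = 100.0 * tp / (tp + fp)
    rcl = 100.0 * tp / (tp + fn)
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "PRN": prn,
        "RCL": rcl,
        "F1": 2 * prn * rcl / (prn + rcl),
        "FPR": 100.0 * fp / (fp + tn),   # ham wrongly flagged as spam
        "FNR": 100.0 * fn / (fn + tp),   # spam that slipped through
    }


# Hypothetical test set: 1000 spam and 1000 ham emails.
metrics = classification_rates(tp=961, fp=29, tn=971, fn=39)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```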
5.2.1 Client-Level Performance Analysis
To further assess the robustness and fairness of FMH-SCS framework, we conducted a detailed client-level performance analysis using the intra-dataset evaluation setup. In this analysis, each client represents a separate data holder (e.g., a user or cloud provider), with non-IID data characteristics.
Table 4
Client-Level performance metrics for the proposed FMH-SCS for the Enron Dataset
| Client ID | Data Size | Accuracy (%) | PRN (%) | RCL (%) | F1-Score (%) | FPR (%) | FNR (%) |
|---|---|---|---|---|---|---|---|
| Client_01 | 1200 | 94.2 | 93.5 | 95.1 | 94.3 | 4.1 | 4.9 |
| Client_02 | 980 | 91.8 | 89.4 | 94.2 | 91.7 | 4.3 | 4.8 |
| Client_03 | 1600 | 96.0 | 95.1 | 96.7 | 95.9 | 3.2 | 3.3 |
| Client_04 | 700 | 89.5 | 87.3 | 90.2 | 88.7 | 7.4 | 9.8 |
| Client_05 | 500 | 87.0 | 85.2 | 88.5 | 86.8 | 8.9 | 11.5 |
| Client_06 | 1350 | 95.3 | 94.2 | 96.1 | 95.1 | 3.8 | 3.9 |
| Client_07 | 1100 | 93.7 | 92.6 | 94.5 | 93.5 | 4.9 | 5.5 |
| Client_08 | 750 | 90.2 | 89.1 | 91.0 | 90.0 | 4.1 | 5.0 |
Table 4 shows the performance of FMH-SCS model on different clients from the Enron dataset. Each client had its own set of email data. The Accuracy shows how many emails were correctly classified, while PRN tells us how many of the emails marked as spam were really spam. RCL tells us how many real spam emails were successfully caught, and the F1-Score is a balance between PRN and RCL.
We can see that clients with more data (like Client_03 and Client_06) usually had better performance, with accuracy over 95%. Clients with less data (like Client_04 and Client_05) had slightly lower performance, but still adequate. The model works well across different types of clients and shows strong and balanced performance, even when the data is not evenly distributed.
The inclusion of FPR and FNR provides a deeper understanding of the FMH-SCS framework’s performance at the client level. A lower FPR means the model is less likely to wrongly predict positive outcomes for negative cases (i.e., labeling non-spam as spam), while a lower FNR shows that fewer actual spam messages are being missed. For example, Client_03 demonstrated strong performance with a low FPR of 3.2% and FNR of 3.3%, indicating highly reliable predictions. On the other hand, Client_05 exhibited higher FPR and FNR values (8.9% and 11.5%, respectively), suggesting room for improvement in accurately detecting spam messages. These metrics are important in applications like spam detection, where both types of errors can negatively impact user experience. Most clients maintained FPR and FNR under 6%, which supports the robustness of the proposed FMH-SCS model in client-specific environments.
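A simple way to summarize the client-level figures is a size-weighted mean together with the gap between the best and worst client. The sketch below applies this to the data sizes and accuracies listed in Table 4; the summary statistics it prints are derived here for illustration and are not results reported by the study, and the helper names are hypothetical.

```python
from typing import Dict, Tuple

# (data size, accuracy %) per client, copied from Table 4 (Enron).
TABLE4: Dict[str, Tuple[int, float]] = {
    "Client_01": (1200, 94.2), "Client_02": (980, 91.8),
    "Client_03": (1600, 96.0), "Client_04": (700, 89.5),
    "Client_05": (500, 87.0), "Client_06": (1350, 95.3),
    "Client_07": (1100, 93.7), "Client_08": (750, 90.2),
}


def weighted_accuracy(clients: Dict[str, Tuple[int, float]]) -> float:
    """Mean accuracy with each client weighted by its number of samples."""
    total = sum(size for size, _ in clients.values())
    return sum(size * acc for size, acc in clients.values()) / total


def accuracy_gap(clients: Dict[str, Tuple[int, float]]) -> float:
    """Difference between the best and worst client accuracy (fairness proxy)."""
    accs = [acc for _, acc in clients.values()]
    return max(accs) - min(accs)


print(f"size-weighted accuracy: {weighted_accuracy(TABLE4):.2f}%")
print(f"best-worst gap: {accuracy_gap(TABLE4):.1f} points")
```

A small gap relative to the weighted mean indicates that low-data clients are not being left far behind by the shared model.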
Table 5
Client-Level performance metrics for the proposed FMH-SCS for the SpamAssassin Dataset
| Client ID | Data Size | Accuracy (%) | PRN (%) | RCL (%) | F1-Score (%) | FPR (%) | FNR (%) |
|---|---|---|---|---|---|---|---|
| Client_01 | 1800 | 92.1 | 91.0 | 93.2 | 92.1 | 5.0 | 6.8 |
| Client_02 | 2200 | 93.5 | 92.3 | 94.4 | 93.3 | 4.1 | 5.6 |
| Client_03 | 1300 | 91.7 | 90.2 | 93.0 | 91.6 | 5.6 | 7.0 |
| Client_04 | 900 | 89.6 | 88.4 | 90.2 | 89.3 | 6.7 | 9.8 |
| Client_05 | 600 | 87.2 | 85.5 | 89.1 | 87.2 | 7.9 | 10.9 |
| Client_06 | 2000 | 94.0 | 92.7 | 95.3 | 94.0 | 3.9 | 4.7 |
| Client_07 | 1600 | 92.8 | 91.5 | 94.0 | 92.7 | 4.4 | 6.0 |
| Client_08 | 750 | 88.9 | 87.6 | 90.5 | 89.0 | 6.9 | 9.5 |
Table 5 shows how the model performed on different clients from the SpamAssassin dataset. Each client held a different amount of email data, and we measured how well the model predicted whether each message was spam. Clients with more data (such as Client_02 and Client_06) had slightly better accuracy and F1-scores, which indicates that the model learned more effectively when enough information was available. For clients with less data (such as Client_05 and Client_08), performance was still strong but somewhat lower, which is expected in federated learning. The F1-scores across all clients are high, showing that the FMH-SCS framework is consistent even when clients hold different amounts and types of data.
We also calculated the FPR and FNR for each client to gain deeper insight into the model’s classification performance. For instance, Client_06, with a data size of 2000, achieved a high accuracy of 94.0%, a low FPR of 3.9%, and an FNR of 4.7%, showing balanced and effective detection. Client_02 had a strong accuracy of 93.5%, along with a low FPR of 4.1% and an FNR of 5.6%, indicating consistent model performance. Smaller clients such as Client_05 (data size 600) showed higher error rates, with an FPR of 7.9% and an FNR of 10.9%, but still retained an accuracy of 87.2%, supporting the robustness of FMH-SCS across clients of different sizes. These additional metrics demonstrate that FMH-SCS not only performs well in terms of overall accuracy but also keeps incorrect classifications low, ensuring more reliable spam detection.
5.2.2 Impact of Privacy Levels on Detection Accuracy
This subsection analyzes how the detection accuracy of FMH-SCS and baseline models changes under different privacy settings (Low, Low-Medium, Medium, Low-High, High). It highlights the model’s robustness and efficacy in maintaining performance despite increasing privacy constraints.
Figure 9 provides a detailed comparison of accuracy across four models (LRM, LSTM, BERT, and the proposed FMH-SCS) evaluated under varying privacy levels: Low, Low-Medium, Medium, Low-High, and High. As expected, accuracy tends to decline with increased privacy constraints due to the introduction of encryption and noise mechanisms.
At the Low privacy level, FMH-SCS achieves 94.3% accuracy, slightly outperforming BERT (94.1%), LSTM (93.2%), and LRM (91.5%). In the Low-Medium range, FMH-SCS still leads with 93.7%, followed by BERT (93.5%), LSTM (92.1%), and LRM (90.6%). The trend continues in the Medium privacy setting, where FMH-SCS delivers 93.1%, while BERT drops to 92.8%, LSTM to 91.4%, and LRM to 89.8%. In the Low-High setting, FMH-SCS maintains a high 92.5% accuracy, again ahead of BERT (91.9%), LSTM (90.6%), and LRM (89.2%). Even at the High privacy level, where all models face greater encryption overhead, FMH-SCS reaches 91.3%, whereas BERT drops to 90.4%, LSTM to 89.6%, and LRM to 87.2%.
These results highlight the strength of the FMH-SCS framework. Despite increasing privacy restrictions, it consistently delivers high accuracy, validating its effectiveness for secure, collaborative spam detection in multi-cloud environments.
Fig. 9
Impact of privacy levels on detection accuracy for FMH-SCS and other methods (a) Enron dataset (b) SpamAssassin dataset
5.2.3 Communication Overhead vs. Privacy Level
This subsection investigates how increasing privacy measures-particularly through the use of MPC and HEN-impact the communication cost among participating clients in a federated setup. As privacy levels increase (from Low to High), the need for encryption, secure aggregation, and additional validation steps also increases. This leads to more data being exchanged and processed securely, thereby raising communication overhead. However, the FMH-SCS framework has been carefully designed to balance this trade-off by optimizing the frequency of communication and compressing encrypted updates.
The analysis clearly shows that although communication overhead grows with enhanced privacy settings, the FMH-SCS framework keeps it within manageable limits compared to traditional federated systems without optimization. This indicates the practicality of deploying FMH-SCS in real-world multi-cloud environments where both data privacy and communication efficiency are critical.
Fig. 10
Communication overhead vs. privacy level for the proposed FMH-SCS and other methods
Figure 10 compares the communication overhead across different privacy levels (Low, Low-Medium, Medium, Low-High, and High) for four models: LRM, LSTM, BERT, and the proposed FMH-SCS framework. At the Low privacy level, where no encryption is applied, LRM shows the least communication cost at 1.2 MB per training round, followed by FMH-SCS at 4.8 MB, LSTM at 5.4 MB, and BERT at 12.5 MB. As the privacy level increases to Medium, with MPC in place, there is a noticeable rise in overhead: LRM climbs to 3.8 MB, LSTM increases to 10.2 MB, BERT almost doubles to 22.9 MB, and FMH-SCS grows moderately to 9.6 MB. Under the High privacy level, which includes both MPC and HEN, the overhead becomes more significant: LRM reaches 7.5 MB, LSTM jumps to 18.6 MB, and BERT shows a steep rise to 38.2 MB, while the FMH-SCS framework maintains a comparatively low communication cost of 16.3 MB.
This analysis reveals that while increasing privacy levels naturally raises communication costs, the proposed FMH-SCS framework manages to strike a strong balance between privacy and efficiency. Its communication overhead remains consistently lower than BERT and closely aligned with or better than LSTM, even at higher privacy levels. This demonstrates the scalability and practicality of FMH-SCS in real-world, privacy-sensitive, multi-cloud spam detection systems.
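The relative growth in overhead can be made explicit with a short calculation on the per-round figures quoted above for the Low, Medium, and High levels. The sketch is only arithmetic on those reported values; the dictionary layout and the function name `growth_factor` are hypothetical.

```python
from typing import Dict

# Communication overhead in MB per training round, as quoted in the text.
OVERHEAD_MB: Dict[str, Dict[str, float]] = {
    "LRM":     {"Low": 1.2,  "Medium": 3.8,  "High": 7.5},
    "LSTM":    {"Low": 5.4,  "Medium": 10.2, "High": 18.6},
    "BERT":    {"Low": 12.5, "Medium": 22.9, "High": 38.2},
    "FMH-SCS": {"Low": 4.8,  "Medium": 9.6,  "High": 16.3},
}


def growth_factor(model: str, low: str = "Low", high: str = "High") -> float:
    """How many times larger the overhead is at `high` than at `low`."""
    levels = OVERHEAD_MB[model]
    return levels[high] / levels[low]


for model in OVERHEAD_MB:
    print(f"{model}: x{growth_factor(model):.2f} from Low to High, "
          f"{OVERHEAD_MB[model]['High']:.1f} MB absolute")
```

In relative terms the lightweight LRM grows the most, since the fixed cost of encryption dominates its small updates, whereas in absolute terms FMH-SCS stays below LSTM and BERT at each of the three levels.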
5.2.4 Model Robustness against Data Distribution Skew
In federated learning, data is often non-IID across clients, especially in multi-cloud or real-world decentralized environments. This section evaluates how different models, including the proposed FMH-SCS framework, perform under varying levels of data distribution skew. Three scenarios are considered: Balanced Distribution, Mild Skew, and Severe Skew. Each scenario represents increasing disparity in the way spam and ham emails are distributed among clients.
Table 6
Performance of FMH-SCS and other methods under different levels of data distribution skew
| Model | Balanced (F1-Score) | Mild Skew (F1-Score) | Severe Skew (F1-Score) |
|---|---|---|---|
| Logistic Regression | 0.89 | 0.83 | 0.71 |
| LSTM | 0.93 | 0.86 | 0.76 |
| BERT | 0.94 | 0.89 | 0.84 |
| FMH-SCS (Proposed) | 0.95 | 0.91 | 0.90 |
Table 6 reports the robustness of the different models (LRM, LSTM, BERT, and the proposed FMH-SCS framework) under varying levels of data distribution skew: balanced, mild skew, and severe skew. These scenarios represent real-world conditions in FL, where client data is often distributed unevenly, especially in collaborative environments like multi-cloud platforms. Under a balanced distribution, all models perform well, with F1-scores of 0.89 for LRM, 0.93 for LSTM, 0.94 for BERT, and 0.95 for FMH-SCS. As the distribution becomes skewed, the performance of the baseline models starts to drop. In the mild skew scenario, LRM falls to 0.83, LSTM to 0.86, and BERT to 0.89, while FMH-SCS maintains a strong performance at 0.91. The effect becomes more pronounced under severe skew, where LRM drops further to 0.71, LSTM to 0.76, and BERT to 0.84. FMH-SCS still achieves a high F1-score of 0.90, demonstrating its resilience and adaptability to heterogeneous data environments. This result highlights the robustness of the FMH-SCS framework in handling real-world non-IID data across clients. Its ability to maintain high performance even when client data is significantly imbalanced makes it a reliable solution for decentralized spam detection systems in dynamic, distributed environments.
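The degradation under skew can be expressed as a relative drop in F1-score from the balanced to the severe-skew scenario. The short calculation below uses only the values in Table 6; the percentages it derives are illustrative arithmetic, not additional experimental results, and the names are hypothetical.

```python
from typing import Dict, Tuple

# (balanced, mild skew, severe skew) F1-scores from Table 6.
TABLE6: Dict[str, Tuple[float, float, float]] = {
    "LRM": (0.89, 0.83, 0.71),
    "LSTM": (0.93, 0.86, 0.76),
    "BERT": (0.94, 0.89, 0.84),
    "FMH-SCS": (0.95, 0.91, 0.90),
}


def relative_drop(model: str) -> float:
    """Percentage of the balanced F1-score lost under severe skew."""
    balanced, _, severe = TABLE6[model]
    return 100.0 * (balanced - severe) / balanced


for model in TABLE6:
    print(f"{model}: {relative_drop(model):.1f}% of balanced F1 lost under severe skew")
```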
Fig. 11
Impact of Adversarial Data Poisoning on Model Performance (F1-Score) Across Different Detection Models
5.2.5 Comparative Performance Under Adversarial Conditions
In this section, we analyze how different models respond when exposed to adversarial conditions, such as the presence of poisoned or manipulated data in one or more client datasets. These adversarial attacks are a realistic threat in federated learning scenarios especially in multi-cloud environments where not all clients may be trusted. We simulate an adversarial setting by poisoning 0–30% of the training data in selected clients and evaluate how this affects the spam detection performance.
Figure 11 illustrates how the performance of the spam detection models (LRM, LSTM, BERT, and the proposed FMH-SCS) varies under increasing levels of adversarial data poisoning. As the percentage of poisoned data grows from 0% to 30%, all models show a decline in F1-score, indicating reduced detection accuracy. LRM is the most affected, with its F1-score dropping from 0.89 to 0.52, a relative degradation of over 40%. LSTM and BERT also experience notable declines, though they are slightly more resilient. In contrast, the proposed FMH-SCS model maintains a high level of robustness, with only a slight drop from 0.94 to 0.88 even under 30% poisoning. This indicates that FMH-SCS withstands adversarial interference better than the baselines, which we attribute to its secure federated learning design and the added protection from MPC and HEN. The result shows the effectiveness and reliability of FMH-SCS in real-world, security-sensitive environments.
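A poisoned client can be simulated in several ways; the text does not state which attack was used, so the sketch below assumes simple label flipping, in which a chosen fraction of a client's spam/ham labels is inverted before local training. The function name `poison_labels` and its parameters are hypothetical.

```python
import random
from typing import List


def poison_labels(labels: List[int], fraction: float, seed: int = 0) -> List[int]:
    """Flip the binary label of a random `fraction` of the samples.

    Assumption: label-flip poisoning; other attacks (e.g. feature or
    gradient manipulation) would need a different simulation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    k = round(fraction * len(labels))
    flipped = set(rng.sample(range(len(labels)), k))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]


# Hypothetical client with 100 emails, poisoned at the levels studied (0% to 30%).
clean = [0, 1] * 50
for level in (0.0, 0.1, 0.2, 0.3):
    poisoned = poison_labels(clean, level, seed=42)
    changed = sum(a != b for a, b in zip(clean, poisoned))
    print(f"poisoning {level:.0%}: {changed} labels flipped")
```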
Fig. 12
Impact of the number of clients on the performance of FMH-SCS and other methods.
5.2.6 Impact of Number of Clients
Testing the FMH-SCS framework with up to 100 clients shows how well the system works when many users or devices are involved and approximates a real-world, large-scale setup. It allows us to check whether the framework can handle many clients while maintaining good accuracy and communication performance, and therefore how scalable and efficient FMH-SCS is in multi-cloud environments. Figure 12 compares the performance of four models (FMH-SCS, LRM, LSTM, and BERT) as the number of clients increases from 20 to 100. FMH-SCS consistently outperforms the other models across all client counts, demonstrating its robustness in handling data heterogeneity and communication challenges. At 20 clients, FMH-SCS achieves an accuracy of 95.2%, while BERT records 92.8%, LSTM 89.6%, and LRM 86.3%. As the number of clients increases to 60, FMH-SCS maintains a high accuracy of 96.4%, compared to 93.5% (BERT), 90.8% (LSTM), and 87.5% (LRM). At 100 clients, the performance gap becomes more pronounced, with FMH-SCS achieving 96.8%, while BERT drops slightly to 93.2%, LSTM to 90.1%, and LRM to 86.0%. The results indicate that FMH-SCS adapts better to increasing client numbers, preserving both model accuracy and stability, thanks to its combined use of FL, MPC, and HEN, which contribute to both privacy and effective gradient aggregation.
5.3 Comparison study
To assess the effectiveness of the proposed FMH-SCS framework, a series of evaluations was carried out using varying proportions of training data. Each experiment was repeated 25 times to ensure consistency, and the average results are reported. The model’s spam detection performance was examined at different training percentages, specifically 40%, 50%, 65%, 80%, and 100%. In this context, PRN represents how many of the instances labeled as positive by the model are actually positive. RCL, on the other hand, indicates how many of the actual positive instances the model successfully identified.
The FMH-SCS framework was benchmarked against four existing approaches (LSTM, BERT, PPDNN-CRP [16], and BHEN-FL [19]) on two diverse datasets. As illustrated in Fig. 13, the comparison for the Enron dataset shows the model's performance in terms of PRN, RCL, and F1-score. At a training level of 65%, FMH-SCS achieved PRN, RCL, and F1-score values of 96.1%, 97%, and 94.4%, respectively. These results reflect the robustness of the proposed model, with substantial improvements over the BHEN-FL method, where gains of 3.9%, 3.6%, and 5.2% were recorded in PRN, RCL, and F1-score, respectively. Training the model on 80% of the dataset provides a strong baseline, demonstrating the model’s capacity to deliver competitive results with a considerable, though not complete, portion of the data. Increasing the training data to 100% further strengthens the model’s learning. With full data utilization, FMH-SCS reported a PRN of 95.8%, RCL of 96.2%, and F1-score of 96.1%. These results indicate the model’s ability to generalize better and capture deeper patterns. Nevertheless, caution must be exercised to avoid overfitting, and validation on unseen data is essential to ensure that the observed improvements are generalizable.
Fig. 13
Comparison of FMH-SCS with existing methods for the SpamAssassin dataset in terms of (a) PRN (b) RCL (c) F1-score.
A similar evaluation is conducted on the SpamAssassin dataset, with outcomes presented in Fig. 14. With 80% of the dataset used for training, FMH-SCS recorded a precision of 93.5%, recall of 92.2%, and an F1-score of 93.9%. Compared to BHEN-FL, the FMH-SCS method exhibited marked improvements-9.3%, 5.7%, and 4.2%, respectively, across the same metrics. These results confirm that training with a substantial portion of data provides a solid foundation for evaluating the model’s capabilities in real-world scenarios.
Similarly, a comparison of FMH-SCS with the four existing methods in terms of precision, recall, and F1-score for the SpamAssassin dataset is shown in Fig. 15. After training with 70% of the available data, FMH-SCS reached PRN, RCL, and F1-score values of 90.3%, 91.6%, and 91.3%, respectively. This is a noticeable improvement over the PPDNN-CRP method, with increases of 10.1%, 7.2%, and 6.1% in precision, recall, and F1-score, respectively. The 70% training scenario serves as a foundational benchmark for gauging the model's efficiency with a substantial but not exhaustive dataset.
Fig. 14
Comparison of FMH-SCS with existing methods on the Enron dataset in terms of (a) precision, (b) recall, and (c) F1-score.
6. Conclusion
In this study, we proposed FMH-SCS, a privacy-preserving spam detection framework designed for federated multi-cloud and cluster computing environments. The framework ensures that user data remains local while only encrypted model updates are shared, thereby enhancing privacy and enabling collaborative learning across distributed datasets. FMH-SCS leverages HEN and secure MPC to safeguard sensitive information during model training, ensuring end-to-end security in large-scale distributed systems. We evaluated the framework on two benchmark datasets, Enron and SpamAssassin, using multiple machine learning models, including LRM, LSTM, and BERT, under diverse hyperparameter settings such as training epochs, batch sizes, and hidden layer dimensions. Experimental results indicate that BERT achieved the highest accuracy, reaching 96.4% on SpamAssassin and 97.8% on Enron, demonstrating the framework’s capability to handle complex patterns in large-scale, distributed email datasets. Comparative analysis with existing approaches, including PPDNN-CRP and BHEN-FL, confirmed that FMH-SCS consistently outperforms baseline models, achieving up to 96.3% precision, 97.0% recall, and 96.1% F1-score on the Enron dataset and 90.3% precision, 91.6% recall, and 91.3% F1-score on 70% of the SpamAssassin dataset. Furthermore, the framework demonstrates scalability and robustness across variations in client numbers, privacy levels, and batch sizes, maintaining high performance even with 100 simulated clients, which supports its suitability for large-scale cluster computing deployments. FMH-SCS thus provides a scalable, secure, and efficient solution for spam detection in privacy-sensitive, distributed computing environments. Future work will focus on optimizing communication efficiency, resource allocation, and integration with high-performance cluster architectures, and on extending the framework to other domains requiring secure and collaborative distributed learning.
Declarations
Funding:
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Competing Interests:
The authors have no relevant financial or non-financial interests to disclose.
Author Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Shanmuga Priya R, Yogesh Rajkumar R and Chellaswamy C. The first draft of the manuscript was written by Shanmuga Priya R and all authors commented on the current version of the manuscript. All authors read and approved the final manuscript.
Data Availability:
This study re-used three publicly available datasets, as follows:
Enron Email Dataset (Cohen, W.): available via Carnegie Mellon University at https://www.cs.cmu.edu/~enron (Accessed 21 February 2025).
SpamAssassin Public Corpus Dataset (Kaggle): accessible at https://www.kaggle.com/datasets/beatoa/spamassassin-public-corpus/data (Accessed 21 February 2025).
Enron Spam Dataset (Kaggle): accessible at https://www.kaggle.com/datasets/wanderfj/enron-spam (Accessed 21 February 2025).
These datasets are publicly available and were used in full for the analyses reported in this article.
References
1.
Naem, A.A., Ghali, N.I., Saleh, A.A.: Antlion optimization and boosting classifier for spam email detection. Future Comput. Inf. J. 3, 436–442 (2018). https://doi.org/10.1016/j.fcij.2018.11.006
2.
Dedeturk, B.K., Akay, B.: Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Appl. Soft Comput. 91, 106229 (2020). https://doi.org/10.1016/j.asoc.2020.106229
3.
Yerima, S.Y., Bashar, A.: Semi-supervised novelty detection with one class SVM for SMS spam detection. 29th Int. Conf. Syst., Signals Image Process. (IWSSIP), 1–4 (2022). https://doi.org/10.1109/IWSSIP55020.2022.9854496
4.
Goyal, S., Chauhan, R.K., Parveen, S.: Spam detection using KNN and decision tree mechanism in social network. Proc. 4th Int. Conf. Parallel Distrib. Grid Comput. (PDGC), 522–526 (2016). https://doi.org/10.1109/PDGC.2016.7913250
5.
Kheddar, H., Himeur, Y., Ismail Awad, A.: Deep transfer learning for intrusion detection in industrial control networks: A comprehensive review. J. Netw. Comput. Appl. 220, 103760 (2023). https://doi.org/10.1016/j.jnca.2023.103760
6.
Li, X., Nie, X., Huang, R.: Web spam classification method based on deep belief networks. Expert Syst. Appl. 96, 261–270 (2018). https://doi.org/10.1016/j.eswa.2017.12.016
7.
Qazi, A., Hasan, N., Mao, R., Abo, M.E.M., Dey, S.K., Hardaker, G.: Machine learning-based opinion spam detection: A systematic literature review. IEEE Access. 12, 143485–143499 (2024). https://doi.org/10.1109/ACCESS.2024.3399264
8.
Salman, M., Ikram, M., Kaafar, M.A.: Investigating evasive techniques in SMS spam filtering: A comparative analysis of machine learning models. IEEE Access. 12, 24306–24324 (2024). https://doi.org/10.1109/ACCESS.2024.3364671
9.
Das, L., Ahuja, L., Pandey, A.: A novel deep learning model-based optimization algorithm for text message spam detection. J. Supercomput. 80, 17823–17848 (2024). https://doi.org/10.1007/s11227-024-06148-z
10.
Kaushal, V., Sharma, S.: Fairness-driven federated learning-based spam email detection using clustering techniques. Neural Comput. Appl. 37, 6515–6526 (2025). https://doi.org/10.1007/s00521-024-10969-7
11.
Sharma, V., Sinha, A., Alkhayyat, A., et al.: FL-XGBTC: Federated learning inspired with XG-boost tuned classifier for YouTube spam content detection. Int. J. Syst. Assur. Eng. Manag. 15, 4923–4946 (2024). https://doi.org/10.1007/s13198-024-02502-9
12.
Vats, S., Shastri, S., Mehta, S.: Federated learning for SMS spam detection: A privacy-focused approach. 15th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), 1–5 (2024). https://doi.org/10.1109/ICCCNT61001.2024.10724879
13.
Thapa, C., Tang, J.W., Abuadbba, A., Gao, Y., Camtepe, S., Nepal, S., Almashor, M., Zheng, Y.: Evaluation of federated learning in phishing email detection. Sensors. 23, 4346 (2023). https://doi.org/10.3390/s23094346
14.
Fan, H., Fan, X., Wei, W., Hao, T., Chen, K., Wang, G., Xu, W.: Privacy preserving ultra-short-term prediction in clustered wind farms with encrypted data sharing: A secure multi-party computation approach. Expert Syst. Appl. 278, 127218 (2025). https://doi.org/10.1016/j.eswa.2025.127218
15.
Liu, D., Yu, G., Zhong, Z., Song, Y.: Secure multi-party computation with secret sharing for real-time data aggregation in IIoT. Comput. Commun. 224, 159–168 (2024). https://doi.org/10.1016/j.comcom.2024.06.002
16.
Naresh, V.S.: PPDNN-CRP: Privacy-preserving deep neural network processing for credit risk prediction in cloud: A homomorphic encryption-based approach. J. Cloud Comput. 13, 149 (2024). https://doi.org/10.1186/s13677-024-00711-y
17.
Firdaus, M., Larasati, H.T., Hyune-Rhee, K.: Blockchain-based federated learning with homomorphic encryption for privacy-preserving healthcare data sharing. Internet Things. 31, 101579 (2025). https://doi.org/10.1016/j.iot.2025.101579
18.
Zhang, Q., et al.: Service function chain scheduling under the multi-cloud collaborative service of information networks used for cross-domain remote surgery. IEEE Trans. Netw. Serv. Manag. 21, 4598–4612 (2024). https://doi.org/10.1109/TNSM.2024.3424297
19.
Wang, H., Peng, T., Nassehi, A., Tang, R.: A data-driven simulation-optimization framework for generating priority dispatching rules in dynamic job shop scheduling with uncertainties. J. Manuf. Syst. 70, 288–308 (2023). https://doi.org/10.1016/j.jmsy.2023.08.001
20.
Liu, W., Wang, H., Zheng, P., Peng, T.: Cloud-edge-end collaborative multi-process dynamic optimization for energy-efficient aluminum casting. J. Manuf. Syst. 79, 217–233 (2025). https://doi.org/10.1016/j.jmsy.2025.01.013
21.
Yang, Z., Cheng, C., Li, Z., Wang, R., Zhang, X.: Reliable federated learning based on delayed gradient aggregation for intelligent connected vehicles. Eng. Appl. Artif. Intell. 140, 109719 (2025). https://doi.org/10.1016/j.engappai.2024.109719
22.
Wang, R., Lai, J., Li, X., He, D., Khurram Khan, M.: RPIFL: Reliable and Privacy-Preserving Federated Learning for the Internet of Things. J. Netw. Comput. Appl. 221, 103768 (2024). https://doi.org/10.1016/j.jnca.2023.103768
23.
Chen, L., Xiao, D., Yu, Z., Zhang, M.: Secure and efficient federated learning via novel multi-party computation and compressed sensing. Inf. Sci. 667, 120481 (2024). https://doi.org/10.1016/j.ins.2024.120481
24.
Hendaoui, F., Hendaoui, S.: Securing encrypted multi-party computation for enhanced data privacy and phishing detection. Expert Syst. Appl. 256, 124896 (2024). https://doi.org/10.1016/j.eswa.2023.124896
25.
Tu, G., Liu, W., Zhou, T., Yang, X., Zhang, F.: Concise and efficient multi-identity fully homomorphic encryption scheme. IEEE Access. 12, 49640–49652 (2024). https://doi.org/10.1109/ACCESS.2024.3384247
26.
Bao, T., Xu, L., Zhu, L., Wang, L., Li, R., Li, T.: Privacy-preserving collaborative filtering algorithm based on local differential privacy. China Commun. 18, 42–60 (2021). https://doi.org/10.23919/JCC.2021.11.004
27.
Zhang, Y., Feng, P., Ning, Y.: Random forest algorithm based on differential privacy protection. 20th Int. Conf., Trust: Secur. Privacy Comput. Commun. (TrustCom), 1259–1264 (2021). https://doi.org/10.1109/TrustCom53373.2021.00172
28.
Yang, Z., Cheng, C., Li, Z., Wang, R., Zhang, X.: Reliable federated learning based on delayed gradient aggregation for intelligent connected vehicles. Eng. Appl. Artif. Intell. 140, 109719 (2025). https://doi.org/10.1016/j.engappai.2024.109719
29.
Tu, G., Liu, W., Zhou, T., Yang, X., Zhang, F.: Concise and efficient multi-identity fully homomorphic encryption scheme. IEEE Access. 12, 49640–49652 (2024). https://doi.org/10.1109/ACCESS.2024.3384247
30.
Paracha, A., Arshad, J., Farah, M.B., Ismail, K.: Exploring data poisoning attacks against adversarially trained skin cancer diagnostics. 17th Int. Conf. Utility Cloud Comput. (UCC), 220–225 (2024). https://doi.org/10.1109/UCC63386.2024.00039
31.
Wei, W., Chow, K.-H., Wu, Y., Liu, L.: Demystifying data poisoning attacks in distributed learning as a service. IEEE Trans. Serv. Comput. (2023). https://doi.org/10.1109/TSC.2023.3341951
32.
Cohen, W.: Enron Email Dataset. Carnegie Mellon Univ. https://www.cs.cmu.edu/~enron (Accessed 21 February 2025)
33.
Kaggle: SpamAssassin Public Corpus Dataset. https://www.kaggle.com/datasets/beatoa/spamassassin-public-corpus/data (Accessed 21 February 2025)
34.
Kaggle: Enron Spam Dataset. https://www.kaggle.com/datasets/wanderfj/enron-spam (Accessed 21 February 2025)