1. Introduction
Artificial intelligence (AI) has become integral to modern cybersecurity, particularly in intrusion detection systems (IDS) designed to identify malicious activity within increasingly complex and dynamic network environments. Traditional rule-based IDS rely on static signatures or handcrafted heuristics, making them effective against known threats but limited in detecting zero-day attacks, polymorphic malware, and subtle behavioral anomalies [1]. These limitations have motivated the transition toward machine learning (ML)-based IDS, which learn statistical patterns from network flows and offer greater adaptability in detecting previously unseen intrusions [2,3]. Despite these advances, AI-driven IDS introduce new security challenges, most notably the susceptibility of ML models to adversarial manipulation.
Recent studies in adversarial machine learning (AML) demonstrate that even small, carefully crafted perturbations can cause ML models to misclassify malicious traffic as benign while preserving domain-level semantics [6]. In flow-based IDS, adversarial examples may involve subtle modifications to packet counts, byte volumes, inter-arrival times, and directionality, changes that remain plausible within real network conditions yet significantly degrade classifier performance. This vulnerability has raised concerns about the operational reliability of AI-based IDS, particularly when adversaries operate under realistic constraints and lack full access to model internals.
1.1. Importance of AI in Intrusion Detection
AI-based IDS have improved detection capabilities through supervised learning, anomaly detection, and automated feature extraction. Classical models such as Random Forests and Logistic Regression offer transparent and computationally efficient baselines [1,7], while deep architectures, including multilayer perceptrons and convolutional neural networks, enable richer nonlinear representations of traffic flows [2, 8]. These approaches, combined with datasets such as CICIDS2017 and UNSW-NB15, have established strong performance benchmarks across diverse attack categories [9,10].
However, the effectiveness of these systems remains tightly coupled to data quality and robustness. Prior analyses highlight labeling inconsistencies, feature artifacts, and sampling biases in common IDS datasets [10,11], all of which may distort classifier behavior and amplify vulnerability to adversarial manipulation. As AI becomes increasingly embedded in high-risk domains, ranging from cybersecurity to education [12], ensuring robustness and trustworthiness is critical to preventing harmful decision failures.
1.2. Adversarial Machine Learning Threat
Adversarial machine learning represents a significant threat to AI-based IDS. In white-box scenarios, adversaries with full model access leverage gradient-based attacks such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to produce highly effective perturbations [13,14]. In contrast, black-box adversaries infer decision boundaries through probing, statistical approximation, or surrogate modeling [15,16]. Because production IDS rarely expose architectural details or model outputs beyond final decisions, black-box threats are more representative of real-world attacker capabilities [17].
The broader cybersecurity landscape further amplifies these risks. Recent advances in generative AI enable adversaries to create realistic deepfake-based cyber threats [18], synthesize AI-generated malware with adaptive behaviors [19], and automate vulnerability discovery [20]. These developments illustrate the increasing sophistication of offensive AI tools and reinforce the need to rigorously evaluate IDS performance under adversarial conditions.
Emerging AML research also critiques the overreliance on gradient-dependent evaluations that assume unrealistic attacker capabilities. Consequently, transfer-based black-box attacks, where adversaries train local surrogate models to approximate the target IDS, have become a preferred methodology for assessing operationally relevant robustness [21]. Complementary approaches, such as GAN-based traffic generation [22], reinforcement learning-driven evasion [23], and protocol-aware semantic constraints, further emphasize realism and plausibility in adversarial perturbations.
1.3. Problem Statement
Despite widespread research on ML-based IDS, there remains a significant gap in understanding their adversarial robustness under realistic operational constraints. Many existing studies emphasize clean accuracy without examining model fragility under targeted perturbations or distribution shift. Furthermore, adversarial transferability across models and datasets remains underexplored, even though real IDS deployments routinely encounter heterogeneous traffic conditions. Addressing these gaps requires a systematic, semantically grounded framework for evaluating adversarial robustness under black-box assumptions.
1.4. Research Objectives
This study addresses the following research questions:
1.5. Contributions of the Paper
This study proposes a fully reproducible adversarial evaluation framework that integrates flow preprocessing, IDS training, surrogate-based attack generation, and semantic-validity constraints, establishing a unified pipeline for systematically stress-testing intrusion detection systems.
It conducts a multi-model adversarial robustness analysis across four representative IDS architectures (RF, LR, MLP, CNN-1D) and four attack families (FGSM, PGD, HSJA, ZOO), showing how architectural differences shape vulnerability patterns and revealing distinct robustness profiles across classical and deep models.
It combines cross-dataset evaluation on the CICIDS2018 Friday subset with deployment-oriented recommendations, providing insight into adversarial transferability under distribution shift and actionable guidance for integrating robustness assessment into real-world IDS development workflows.
1.6. Paper Organization
The remainder of this paper is organized as follows: Section II reviews prior work on intrusion detection, adversarial machine learning, and model robustness evaluation. Section III describes the dataset, preprocessing workflow, IDS models, surrogate-based attack generation, semantic constraints, evaluation metrics, and experimental setup. Section IV presents the results of clean and adversarial evaluation on CICIDS2017, as well as cross-dataset testing on CICIDS2018. Section V discusses the findings and their implications for operational IDS deployment. Section VI outlines limitations, future research directions, and concludes the paper.
2. Related Work
AI-driven intrusion detection has emerged as a prominent research topic for AI-based network intrusion detection systems (NIDS), especially in Internet of Things (IoT) contexts [32], driven by advances in machine learning (ML), deep learning (DL), and adversarial machine learning (AML). This section reviews three interrelated bodies of work: (1) ML and DL approaches for intrusion detection, (2) adversarial attacks targeting IDS models, and (3) robustness evaluation methodologies and dataset considerations. The goal is to position the present study within ongoing efforts to develop resilient AI-based IDS that remain reliable under adversarial pressure.
2.1. Machine Learning for Intrusion Detection
Traditional IDS relied heavily on static signatures or heuristic rules, limiting their ability to detect previously unseen threats or adapt to dynamic network behaviors [1]. ML-based IDS addressed these limitations by learning statistical patterns from flow and packet-level features. Classical models such as Random Forests, Support Vector Machines, Logistic Regression, and k-Nearest Neighbors have demonstrated strong performance across several benchmark datasets, with ensemble methods in particular showing robustness and interpretability advantages [24,25].
DL-based IDS extend these capabilities by automatically extracting high-level representations from raw traffic. Autoencoders, multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent architectures (LSTM/GRU) have been widely applied to detect complex attacks and temporal dependencies [2,8,26]. Recent hybrid and attention-based architectures further improve granularity by capturing spatial and temporal correlations across flows [27].
Benchmark datasets underpinning IDS evaluation include the CICIDS2017 dataset [9], UNSW-NB15 [28], NSL-KDD [29], and newer IoT-oriented datasets such as BoT-IoT [30] and TONIoT [31,32]. These datasets vary in attack coverage, network design, and labeling fidelity, which affects model generalizability and robustness. A summary of commonly used IDS datasets is provided in Table 1.
TABLE I
Dataset Comparison

| Dataset | Year | Traffic Source | Features | Attack Types | Notes |
|---|---|---|---|---|---|
| NSL-KDD (Tavallaee et al.) | 2009 | Synthetic | 41 | 4 categories | Removes duplicate records from KDD’99; outdated for modern threats. |
| UNSW-NB15 (Moustafa & Slay, 2015) | 2015 | IXIA PerfectStorm | 49 | 9 families | Realistic traffic generation; modern attack taxonomy. |
| CICIDS2017 (Sharafaldin et al., 2018) | 2018 | Simulated enterprise environment | 80+ | Multiple | Most widely used; captures diverse attack scenarios. |
| CICIDS2018 (Sharafaldin et al., 2018) | 2018 | Emulated daily traffic | 80+ | Multiple | Offers improved realism; used for cross-dataset generalization. |
| BoT-IoT (Koroniotis et al., 2019) | 2019 | IoT environment | 46 | 4 | Large-scale IoT traffic; high imbalance. |
| TON_IoT (Moustafa, 2021) | 2021 | IoT and Edge | Multimodal | Multiple | Integrates system logs, telemetry, and network flows. |
This expanding ecosystem of IDS datasets helps benchmark detection capabilities, but it also introduces inconsistencies, particularly in data quality, class imbalance, and feature engineering, which complicate robustness assessment [10,33]. These limitations motivate the need for methodologically rigorous evaluations such as those undertaken in the present study.
2.2. Adversarial Attacks on IDS and Cybersecurity Systems
Adversarial attacks against ML models have demonstrated that small, imperceptible perturbations can be sufficient to cause misclassification across computer vision, NLP, and increasingly, network security domains [13, 15, 6]. In the context of flow-based IDS, attackers may modify packet counts, byte volumes, durations, or timing features to evade detection while maintaining realistic communication semantics.
White-box attacks such as FGSM and PGD exploit knowledge of model gradients to produce highly effective perturbations [13,14]. However, real-world adversaries rarely possess full model access. As a result, black-box attacks, where adversaries rely on query probing, zeroth-order approximation, or surrogate modeling, have become a core focus in IDS research [13,16,17].
Transfer-based attacks are among the most operationally realistic: adversaries train a surrogate classifier using locally collected or publicly available data and then optimize perturbations that transfer to the target model. Prior studies show varying degrees of transferability across architectures, with nonlinear deep networks often more vulnerable than tree ensembles [21]. Decision-based attacks such as HopSkipJump [34] and score-based attacks such as ZOO [35] further mimic practical constraints where only final decisions or soft outputs are observable.
Beyond handcrafted perturbations, generative models have emerged as powerful tools for adversarial traffic synthesis. IDS-GAN and similar frameworks use GANs to mimic network traffic patterns, injecting adversarial flows capable of triggering misclassification [22]. Reinforcement learning-based attack agents have also demonstrated success in learning query-efficient evasion strategies within constrained environments [23]. These methods highlight the rapid evolution of offensive AI techniques and the need for robust defenses.
2.3. Robustness Evaluation Methods and Research Gaps
Despite substantial progress, existing evaluations of IDS robustness often rely on unrealistic assumptions, limited datasets, or narrow attack models. Many studies report high clean accuracy without examining how models behave under adversarial stress or distribution shift [10]. Gradient-based attacks frequently assume white-box access that adversaries do not possess in practice [17]. Meanwhile, most robustness research evaluates models on a single dataset, even though real network environments involve diverse traffic patterns and evolving attack behaviors.
Recent work has called for more realistic threat modeling, emphasizing black-box constraints, semantic feature validity, and cross-dataset evaluation [21,17]. These gaps motivate the approach used in this study, which integrates:
1. Surrogate-based black-box adversarial attacks reflecting operational attacker capabilities.
2. Semantically constrained feature perturbations ensuring realistic network flow modifications.
3. Cross-dataset evaluation using both CICIDS2017 and CICIDS2018 to assess generalization.
4. Comparative robustness analysis across classical and deep architectures.
By addressing these gaps, the present study contributes a rigorous and practically relevant assessment of adversarial robustness for AI-based IDS.
3. Methodology
This section outlines the methodological framework used to evaluate the adversarial robustness of machine learning-based intrusion detection systems (IDS). The framework integrates a modern flow-based dataset, multiple IDS model families, a realistic black-box adversarial threat model, semantically constrained perturbation mechanisms, and an evaluation protocol aligned with contemporary adversarial machine learning (AML) research. The design of this methodology draws upon foundational work in AML [36, 13, 6] and established practices in intrusion detection [1,2].
3.1. Dataset and Preprocessing
All experiments are conducted using the CICIDS2017 dataset, one of the most comprehensive and widely adopted corpora for evaluating network intrusion detection systems. Developed by the Canadian Institute for Cybersecurity, CICIDS2017 [43] was created to overcome the limitations of earlier datasets such as KDD’99 by offering realistic, heterogeneous traffic patterns, modern attack classes, and high-quality flow-based features [9,29]. Its broad adoption in benchmarking studies [10,7] underscores its suitability for evaluating robustness under adversarial manipulation.
This study utilizes the MachineLearningCVE CSV files, which contain bidirectional flow records represented by 78 numerical features extracted with CICFlowMeter [9]. The features capture statistical and temporal properties of network flows, including packet counts, byte volumes, inter-arrival times, and TCP flag dynamics. Standard preprocessing procedures are applied, including removal of invalid or infinite values, elimination of duplicates, and normalization of all features to the [0,1] range using Min-Max scaling [37]. This normalization is essential for ensuring meaningful and controlled adversarial perturbations.
To preserve the natural class imbalance characteristic of intrusion datasets, where benign flows vastly outnumber many attack classes, the data is partitioned using a stratified 70/10/20 train-validation-test split. Maintaining this imbalance is consistent with prior IDS recommendations [10,38], and it also reflects realistic operating conditions in which minority attack samples often have weaker and more brittle decision boundaries [39]. Table 2 summarizes the dataset distribution.
Table 2

| Split | Samples | % of Total |
|---|---|---|
| Training | 1,979,512 | 70% |
| Validation | 282,788 | 10% |
| Test | 565,576 | 20% |
This preprocessing pipeline ensures clean, normalized, and representative input data for both baseline and adversarial robustness evaluation.
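For concreteness, the following sketch illustrates the preprocessing just described (cleaning, stratified 70/10/20 splitting, and Min-Max scaling). It is a minimal sketch under stated assumptions: the label column name, the random seed, and the exact column handling are illustrative rather than the study's actual implementation.

```python
# Minimal preprocessing sketch, assuming the CICIDS2017 MachineLearningCVE CSVs
# have been concatenated into a single DataFrame `df` with a "Label" column.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame):
    # Drop invalid/infinite values and duplicate flows.
    df = df.replace([np.inf, -np.inf], np.nan).dropna().drop_duplicates()

    X = df.drop(columns=["Label"]).astype(np.float32).values
    y = df["Label"].values

    # Stratified 70/10/20 train/validation/test split, preserving class imbalance.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=2 / 3, stratify=y_tmp, random_state=42)

    # Min-Max scaling to [0, 1], fit on the training split only.
    scaler = MinMaxScaler().fit(X_train)
    return (scaler.transform(X_train), y_train,
            scaler.transform(X_val), y_val,
            scaler.transform(X_test), y_test, scaler)
```

Fitting the scaler on the training split only avoids leaking test statistics into the normalization, which matters because the same [0,1] range later bounds the adversarial perturbations.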
To assess cross-dataset generalization and adversarial transfer, the CICIDS2018 Friday 02-03-2018 slice was additionally preprocessed using identical normalization and feature constraints as CICIDS2017, enabling a controlled external evaluation under real distribution shift.
3.2. IDS Models and Surrogate Architecture
The IDS models used in this study represent diverse algorithmic paradigms commonly employed in intrusion detection. As emphasized in several surveys [1,2], IDS performance varies by model family due to differences in inductive bias, feature sensitivity, and resilience to noise. To reflect this diversity, two classical machine learning classifiers and two deep learning architectures are selected:
Random Forest (RF)
Logistic Regression (LR)
Multilayer Perceptron (MLP)
1D Convolutional Neural Network (CNN1D)
Random Forest (RF) serves as a strong nonlinear baseline well suited for tabular intrusion detection tasks. Its ensemble structure provides robustness to noise and offers interpretable feature importance insights, contributing to its wide use in security analytics [39,17]. Logistic Regression (LR) represents a transparent linear model commonly employed in anomaly detection pipelines [1]. Its interpretability and predictable behavior under adversarial manipulation make it an informative baseline.
Deep learning models have shown increasing promise in IDS research due to their capacity to model nonlinear feature relationships. The Multilayer Perceptron (MLP) architecture used in this study follows designs previously shown to perform well on flow-based datasets [2,8]. Complementing the MLP, a 1D Convolutional Neural Network (CNN1D) is included to explore local structural patterns in the flow features. Prior studies have demonstrated the effectiveness of CNNs for both payload and flow-level intrusion detection tasks [26].
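The sketch below illustrates how these four model families could be instantiated. The layer widths, kernel sizes, and hyperparameters are assumptions for illustration, not the exact configurations trained in this study.

```python
# Illustrative definitions of the four IDS model families (hyperparameters assumed).
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
lr = LogisticRegression(max_iter=1000)

class MLP(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes))

    def forward(self, x):
        return self.net(x)

class CNN1D(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                # x: (batch, n_features)
        z = self.conv(x.unsqueeze(1))    # treat the feature vector as a 1D signal
        return self.fc(z.squeeze(-1))
```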
To generate adversarial examples in a realistic black-box threat setting, a separate surrogate model is trained. Following the methodology of [13,21], the surrogate is an independently trained MLP designed to approximate the decision boundaries of the target models. All adversarial perturbations are computed using gradients from this surrogate and then transferred to the target IDS. This approach reflects operationally realistic adversarial conditions, wherein attackers typically do not possess detailed knowledge of deployed IDS architectures [29,17].
3.3. Threat Model
This study assumes a black-box, transfer-based adversarial threat model, which has become increasingly relevant in AML research due to its alignment with real-world attacker constraints [35,6]. The adversary is assumed to have no access to the target model’s architecture, gradients, training data, or hyperparameters. Instead, the adversary trains a surrogate classifier on data drawn from an equivalent distribution and uses it to craft perturbations.
To clearly illustrate the attacker’s capabilities and the flow of surrogate-based perturbation generation, the overall threat model is depicted in Fig. 1. The figure shows how the adversary trains a local surrogate model on data drawn from the same distribution as the target IDS and subsequently uses this surrogate to craft adversarial perturbations for transfer. This visual representation clarifies the separation between the adversary’s knowledge and the target IDS internals and highlights the operational constraints driving the choice of a black-box, transfer-based threat model.
The adversary’s goal is to produce a perturbed flow x' from an original input x such that the predicted label changes while keeping the perturbation within a bounded norm region:

$$f(x') \neq f(x), \qquad \lVert x' - x \rVert_{\infty} \leq \epsilon$$

The ℓ∞ constraint is widely used in adversarial research because it restricts the maximum allowable change to any individual feature [13,14]. For flow-based IDS, it is a particularly suitable metric given that small changes to individual flow statistics can mimic realistic and stealthy evasion behavior.
3.4. Adversarial Attack Formulation
Adversarial perturbations are generated using two widely studied gradient-based evasion techniques: the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). Both attacks operate on the surrogate classifier and are subsequently transferred to the target IDS models.
FGSM generates adversarial examples by taking a single gradient-aligned step that maximizes the surrogate’s loss [13]. The perturbation is computed as:

$$x' = \mathrm{clip}_{[0,1]}\left(x + \epsilon \cdot \mathrm{sign}\left(\nabla_{x}\,\mathcal{L}\!\left(f_{s}(x), y\right)\right)\right)$$

where $f_{s}$ is the surrogate model, $\mathcal{L}$ is its classification loss, and the clipping function ensures that the perturbed values remain within the [0,1] range.
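A minimal PyTorch sketch of this single-step attack is given below, assuming a differentiable surrogate `f_s` that returns class logits and Min-Max-scaled inputs in [0,1]; names and default values are illustrative.

```python
# Minimal FGSM sketch against the surrogate model f_s (PyTorch).
import torch
import torch.nn.functional as F

def fgsm(f_s, x, y, eps=0.03):
    """Single gradient-sign step on Min-Max-scaled flows (values in [0, 1])."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(f_s(x_adv), y)   # surrogate loss to maximize
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign() # move each feature by +/- eps
    return torch.clamp(x_adv, 0.0, 1.0).detach()
```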
PGD extends FGSM through iterative refinement, applying multiple perturbation steps while enforcing the same norm constraint [14]. Each iteration updates the input as:

$$x^{t+1} = \Pi_{\mathcal{B}_{\epsilon}(x)}\left(x^{t} + \alpha \cdot \mathrm{sign}\left(\nabla_{x}\,\mathcal{L}\!\left(f_{s}(x^{t}), y\right)\right)\right)$$

with the projection operator defined as element-wise re-projection into the ε-ball around the original input, followed by clipping to the valid [0,1] range:

$$\Pi_{\mathcal{B}_{\epsilon}(x)}(z) = \mathrm{clip}_{[0,1]}\left(\max\left(\min\left(z,\; x + \epsilon\right),\; x - \epsilon\right)\right)$$
PGD is widely regarded as one of the strongest first-order attacks and provides a rigorous measure of IDS robustness.
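The iterative procedure can be sketched as follows; the step size `alpha` and iteration count are illustrative choices rather than the exact settings used in the experiments.

```python
# Minimal PGD sketch on the surrogate f_s, projecting each step back into the
# epsilon-ball around the original input x and into the [0, 1] feature range.
import torch
import torch.nn.functional as F

def pgd(f_s, x, y, eps=0.03, alpha=0.01, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(f_s(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection onto the L-infinity ball of radius eps, then into [0, 1].
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```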
3.5. Semantic Validity Constraints
A common limitation of feature-space adversarial attacks is the possibility of producing flows that violate protocol-level semantics or physical plausibility. Addressing this issue is essential for ensuring that adversarial examples reflect realistic network behavior [32,17]. Accordingly, each adversarial example undergoes a semantic validation stage.
First, non-negativity constraints preserve basic flow properties: count-, volume-, and duration-based features must satisfy $x'_i \geq 0$. Second, monotonicity constraints derived from dataset characteristics ensure preservation of feature relationships (for example, a maximum packet-length statistic must not fall below the corresponding mean or minimum).
Third, statistical plausibility is maintained by projecting feature values outside empirical training-set ranges back into valid bounds. Finally, features encoding TCP flag states are validated to avoid impossible flag combinations. This constraint layer ensures that adversarial examples remain sufficiently realistic to represent plausible evasion attempts in operational networks.
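A simplified sketch of this constraint layer is shown below. The index sets for count-based and TCP-flag features are placeholders, and the monotonicity checks are omitted for brevity; the real constraint sets would be derived from the CICFlowMeter feature dictionary.

```python
# Sketch of the semantic-validity projection applied to raw adversarial perturbations.
import numpy as np

def enforce_semantics(x_adv, x_min, x_max, count_idx, flag_idx):
    x = x_adv.copy()
    # Non-negativity for packet counts, byte volumes, and durations.
    x[:, count_idx] = np.clip(x[:, count_idx], 0.0, None)
    # Statistical plausibility: project back into empirical training-set ranges.
    x = np.clip(x, x_min, x_max)
    # Keep TCP flag indicators binary to avoid impossible flag combinations.
    x[:, flag_idx] = np.round(np.clip(x[:, flag_idx], 0.0, 1.0))
    return x
```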
To provide a clear visual summary of how these constraints are applied during attack generation, the semantic constraint enforcement pipeline is illustrated in Fig. 2. The figure shows how raw adversarial perturbations produced by the surrogate model are projected onto valid feature ranges, adjusted to maintain protocol consistency, and filtered to preserve domain-specific semantics before being evaluated by the target IDS models. This ensures that adversarial examples remain both realistic and operationally meaningful.
3.6. Evaluation Metrics
Model robustness is evaluated using metrics that capture both baseline classification performance and degradation under adversarial conditions. Given the inherent class imbalance in CICIDS2017, macro-averaged precision, recall, and F1-score provide balanced insights across minority and majority classes. Accuracy is included for completeness.
Adversarial performance is evaluated using three additional metrics. The attack success rate (ASR) measures the fraction of originally correct predictions that become incorrect:

$$\mathrm{ASR} = \frac{\left|\{\, i : f(x_i) = y_i \;\wedge\; f(x'_i) \neq y_i \,\}\right|}{\left|\{\, i : f(x_i) = y_i \,\}\right|}$$

The robust accuracy quantifies accuracy on adversarially perturbed samples:

$$\mathrm{Acc}_{\mathrm{rob}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[f(x'_i) = y_i\right]$$

Finally, the accuracy drop provides a direct indicator of adversarial degradation:

$$\Delta\mathrm{Acc} = \mathrm{Acc}_{\mathrm{clean}} - \mathrm{Acc}_{\mathrm{rob}}$$
These metrics align with established recommendations for AML robustness evaluation (Yuan et al., 2019; Zhang et al., 2021).
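For clarity, the sketch below shows how these metrics can be computed from clean and adversarial predictions; function and variable names are illustrative.

```python
# Metric computation sketch matching the definitions above: ASR is measured
# only over samples that were classified correctly before perturbation.
import numpy as np

def robustness_metrics(y_true, y_pred_clean, y_pred_adv):
    clean_acc = np.mean(y_pred_clean == y_true)
    robust_acc = np.mean(y_pred_adv == y_true)
    correct = y_pred_clean == y_true
    asr = np.mean(y_pred_adv[correct] != y_true[correct]) if correct.any() else 0.0
    return {"clean_acc": clean_acc,
            "robust_acc": robust_acc,
            "acc_drop": clean_acc - robust_acc,
            "asr": asr}
```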
3.7. Cross-Dataset Evaluation
To evaluate robustness under realistic distribution shift, adversarial transferability is assessed using the CICIDS2018 Friday 02-03-2018 slice. This dataset provides traffic collected under different temporal, operational, and environmental conditions from CICIDS2017, making it suitable for examining whether perturbations crafted on one dataset remain effective when applied to another. Using identical preprocessing and semantic constraint enforcement ensures that differences in performance reflect dataset drift rather than preprocessing artifacts.
In this evaluation, IDS models are trained solely on CICIDS2017. Adversarial examples are generated using the surrogate model trained on CICIDS2017 and then applied directly to the CICIDS2018 flows without retraining or fine-tuning. This setup reflects realistic operational settings in which deployed IDS must confront new traffic distributions without full model retraining. Performance is measured using clean accuracy, adversarial accuracy, accuracy drop (Δ accuracy), and attack success rate (ASR), enabling a direct comparison with in-distribution robustness.
The cross-dataset evaluation pipeline is illustrated in Fig. 3, which depicts the sequence of model training on CICIDS2017, adversarial example generation, and external validation on CICIDS2018. This visual summary clarifies how transferability is measured and how adversarial behavior changes under distribution shift.
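A condensed sketch of this transfer evaluation is given below. It reuses the hypothetical helpers from the earlier sketches (pgd, enforce_semantics, robustness_metrics) and assumes a scikit-learn-style predict interface for the target model; these names are assumptions, not the study's actual code.

```python
# Cross-dataset evaluation sketch: the target model and surrogate are trained on
# CICIDS2017 only; adversarial flows crafted on the surrogate are applied
# unchanged to CICIDS2018 data (X18, y18).
import torch

def cross_dataset_eval(target_model, surrogate, X18, y18, eps=0.03, **sem_kwargs):
    x = torch.tensor(X18, dtype=torch.float32)
    y = torch.tensor(y18, dtype=torch.long)
    x_adv = pgd(surrogate, x, y, eps=eps).numpy()          # craft on the surrogate
    x_adv = enforce_semantics(x_adv, **sem_kwargs)          # keep flows plausible
    return robustness_metrics(y18,
                              target_model.predict(X18),    # clean predictions
                              target_model.predict(x_adv))  # transferred attack
```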
3.8. Experimental Setup
All experiments are implemented using Python, with machine learning models built using scikit-learn [36] and deep learning models built using PyTorch. Training and evaluation are conducted on a CPU-only workstation equipped with an Intel i7 processor and 32 GB of RAM, reflecting constraints commonly encountered in lightweight or distributed IDS deployments [2]. Deep learning models use early stopping based on validation loss, and fixed random seeds ensure reproducibility. All models share identical data partitions, and adversarial examples are generated on validation and test subsets to ensure a fair and controlled comparison across models. This experimental configuration provides a consistent and reproducible environment for assessing adversarial robustness under realistic assumptions and resource conditions.
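As a minimal illustration of the reproducibility measures described above, random seeds can be fixed across libraries as follows; the specific seed value is an assumption.

```python
# Reproducibility sketch: fix seeds for Python, NumPy, and PyTorch.
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```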
Figure 4 summarizes the end-to-end methodology used in this study, integrating dataset preprocessing, IDS model training, surrogate-based attack generation, semantic and protocol-aware constraint enforcement, and evaluation under both in-distribution (CICIDS2017) and cross-dataset (CICIDS2018) conditions. The architecture reflects the black-box, transfer-based threat model adopted in this work and illustrates how gradient-based (FGSM, PGD) and black-box (HSJA, ZOO) adversarial examples are generated from the surrogate and applied to the target IDS models. Evaluation includes clean and adversarial accuracy, Δ accuracy, ASR, per-class degradation, and cross-dataset transferability.
4. Experimental Results
This section presents the empirical evaluation of the four intrusion detection models, Random Forest (RF), Logistic Regression (LR), Multilayer Perceptron (MLP), and CNN1D, on the CICIDS2017 dataset under both clean and adversarial conditions. We first report training and baseline classification performance on the full validation and test splits, then analyze robustness under multiple adversarial threat models, including FGSM, PGD, HopSkipJump (HSJA), and Zeroth-Order Optimization (ZOO). The evaluation follows best practices in adversarial machine learning [13,14,6] and reflects the realistic black-box assumptions discussed earlier [16,17].
4.1. Training and Baseline Classification Performance
The first set of experiments evaluates all four models in a purely supervised setting, without adversarial perturbations, using the full CICIDS2017 validation and test splits. These results establish the clean-performance baseline against which adversarial degradation is later measured.
4.1.1. Classical Models
Table 3 summarizes the performance of the classical baselines, Random Forest and Logistic Regression, on the validation (282,788 flows) and test (565,576 flows) splits.
Table 3
Classical baseline performance on CICIDS2017

| Model | Split | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| Random Forest | Validation | 0.9984 | 0.8617 | 0.9984 |
| Random Forest | Test | 0.9985 | 0.8714 | 0.9985 |
| Logistic Regression | Validation | 0.9596 | 0.3997 | 0.9560 |
| Logistic Regression | Test | 0.9591 | 0.4003 | 0.9562 |
Random Forest attains near-perfect accuracy on both validation and test splits while maintaining comparatively high macro F1-scores above 0.86. This indicates that RF not only models the dominant benign and frequent attack classes well but also performs reasonably across several minority classes despite the pronounced class imbalance, echoing prior observations about the strength of ensemble tree methods in IDS [39,17].
Logistic Regression achieves high overall accuracy (~ 0.96) but substantially lower macro F1 (~ 0.40). This discrepancy reflects a well-known phenomenon in intrusion detection: linear decision boundaries can capture the benign versus frequent-attack separation but do not adequately represent numerous rare attack types, leading to strong weighted averages yet poor macro-level performance [1].
4.1.2. Deep Models
The MLP converges quickly, achieving its best validation accuracy (0.9829) at epoch 7 and generalizing well to the test set with accuracy 0.9804 and a small gap between validation and test loss. This positions the MLP as a strong deep baseline, only slightly below RF in clean accuracy. The CNN1D achieves a best validation accuracy of 0.9528 and test accuracy of 0.9368, with a more noticeable loss gap, suggesting a somewhat higher sensitivity to class imbalance and a greater tendency to overfit the validation distribution. Nonetheless, CNN1D provides valuable architectural diversity, leveraging local patterns in the feature dimension, which becomes important in the robustness analysis. Table 4 presents the training dynamics and test performance of the deep baselines: MLP and CNN1D.
Table 4
Deep baseline performance on CICIDS2017

| Model | Best Val Epoch | Best Val Accuracy | Best Val Loss | Test Accuracy | Test Loss |
|---|---|---|---|---|---|
| MLP | 7 | 0.9829 | 0.0418 | 0.9804 | 0.0468 |
| CNN1D | 17 | 0.9528 | 0.1699 | 0.9368 | 0.2212 |
4.1.3. Comparative View
From a clean-data perspective, RF is the strongest model, followed closely by the MLP, with CNN1D and LR trailing. This diversity of inductive biases (nonlinear ensembles, linear models, fully connected deep networks, and convolutional architectures) creates a rich setting for studying adversarial robustness, allowing us to compare how different model families trade off between clean performance and robustness [2,20].
4.2. Clean Performance on the Adversarial Evaluation Subset
To enable controlled adversarial evaluation, a 20,000-flow subset of the CICIDS2017 test split is used for FGSM and PGD experiments. Table 5 reports the clean accuracy of all models on this subset.
Table 5
Clean Accuracy on 20,000-Flow Adversarial Evaluation Subset

| Model | Clean Accuracy |
|---|---|
| Random Forest | 0.9987 |
| Logistic Regression | 0.9610 |
| MLP | 0.9817 |
| CNN1D | 0.9411 |
The ranking is consistent with full-test results: RF > MLP > CNN1D > LR. These values serve as the reference point for computing robust accuracy, accuracy drops, and attack success rates (ASR) under adversarial perturbations.
4.3. FGSM Robustness (Single-Step, Surrogate-Based Attack)
We first evaluate robustness against the Fast Gradient Sign Method (FGSM), a single-step gradient-based attack computed on the surrogate model and transferred to all target models [10,13]. Table 6 summarizes the robust accuracy, accuracy drop, and ASR for perturbation budgets ε ∈ {0.01, 0.03, 0.05} on the 20,000-flow subset.
Table 6
FGSM Robust Accuracy and ASR (ε ∈ {0.01, 0.03, 0.05})

| Model | ε | Adv Accuracy | Δ Accuracy | ASR |
|---|---|---|---|---|
| RF | 0.01 | 0.8049 | −0.1938 | 0.195 |
| RF | 0.03 | 0.8049 | −0.1938 | 0.195 |
| RF | 0.05 | 0.8049 | −0.1938 | 0.195 |
| LR | 0.01 | 0.8308 | −0.1302 | 0.144 |
| LR | 0.03 | 0.7437 | −0.2173 | 0.238 |
| LR | 0.05 | 0.6149 | −0.3461 | 0.371 |
| MLP | 0.01 | 0.8059 | −0.1758 | 0.181 |
| MLP | 0.03 | 0.7095 | −0.2721 | 0.287 |
| MLP | 0.05 | 0.6435 | −0.3382 | 0.356 |
| CNN1D | 0.01 | 0.7853 | −0.1558 | 0.203 |
| CNN1D | 0.03 | 0.7246 | −0.2165 | 0.268 |
| CNN1D | 0.05 | 0.6245 | −0.3166 | 0.373 |
All models exhibit noticeable degradation under FGSM, with accuracy drops ranging from roughly 13 to 35 percentage points at higher ε values. RF shows a characteristic “saturation” effect: its robust accuracy decreases to ~ 0.80 even at the smallest ε and then plateaus, suggesting that a subset of flows is consistently vulnerable while the rest remain robust. LR, MLP, and CNN1D, by contrast, display monotonic vulnerability: robust accuracy deteriorates steadily as ε increases, reflecting how FGSM exploits both linear and nonlinear decision boundaries (Goodfellow et al., 2014; Yuan et al., 2019).
To visualize these patterns across perturbation strengths, Figs. 5 and 6 summarize the FGSM robustness behavior of all models, showing the corresponding trends in robust accuracy and attack success rate (ASR) as ε increases.
(Plots show RF’s flat robustness curve versus sharper declines for LR, MLP, and CNN1D.)
4.4. PGD Robustness (Iterative, Surrogate-Based Attack)
We next evaluate robustness under Projected Gradient Descent (PGD), an iterative refinement of FGSM known to be among the strongest first-order adversarial attacks [14]. Table 7 reports the results for ε ∈ {0.01, 0.03, 0.05} on the same 20,000-flow subset.
Table 7
PGD Robust Accuracy and ASR (ε ∈ {0.01, 0.03, 0.05})

| Model | ε | Adv Accuracy | Δ Accuracy | ASR |
|---|---|---|---|---|
| RF | 0.01 | 0.8049 | −0.1938 | 0.195 |
| RF | 0.03 | 0.8049 | −0.1938 | 0.195 |
| RF | 0.05 | 0.8049 | −0.1938 | 0.195 |
| LR | 0.01 | 0.8222 | −0.1388 | 0.151 |
| LR | 0.03 | 0.7544 | −0.2066 | 0.223 |
| LR | 0.05 | 0.6997 | −0.2613 | 0.279 |
| MLP | 0.01 | 0.6927 | −0.2890 | 0.294 |
| MLP | 0.03 | 0.3495 | −0.6322 | 0.644 |
| MLP | 0.05 | 0.3075 | −0.6741 | 0.687 |
| CNN1D | 0.01 | 0.8016 | −0.1395 | 0.185 |
| CNN1D | 0.03 | 0.8047 | −0.1364 | 0.185 |
| CNN1D | 0.05 | 0.7989 | −0.1422 | 0.190 |
PGD reveals more severe vulnerabilities than FGSM, especially for the MLP. At ε = 0.03, the MLP’s accuracy collapses from 0.9817 to 0.3495, with ASR ≈ 0.64; at ε = 0.05, accuracy further declines to 0.3075 with ASR ≈ 0.69. This catastrophic behavior is consistent with prior findings that iterative attacks can exploit sharp decision boundaries in deep networks [14].
In contrast, RF and CNN1D remain remarkably stable under PGD transfer: robust accuracy is ~ 0.80 across all ε values, and PGD offers no additional damage beyond FGSM, suggesting weak cross-architecture transferability from the MLP surrogate. LR shows intermediate behavior, with progressive but not catastrophic deterioration. These PGD-induced degradation patterns are summarized in Fig. 7, which illustrates the robust accuracy curves for all models across increasing ε values.
4.5. HSJA Robustness (Decision-Based Black-Box Attack)
To model realistic black-box attackers who only observe final model decisions, we evaluate the HopSkipJump Attack (HSJA) on a 1,000-flow subset of the test data with ε = 0.03, 10 refinement iterations, and 10 binary search steps. HSJA operates purely on hard-label outputs and does not rely on gradients or confidence scores [34].
Table 8
HSJA Robustness (ε = 0.03, 1,000-Flow Subset)

| Model | Clean Accuracy | Adv Accuracy | Δ Accuracy | ASR |
|---|---|---|---|---|
| RF | 0.997 | 0.817 | −0.180 | 0.1815 |
| LR | 0.955 | 0.909 | −0.046 | 0.0660 |
| MLP | 0.979 | 0.921 | −0.058 | 0.0725 |
| CNN1D | 0.938 | 0.833 | −0.105 | 0.1535 |
HSJA induces moderate but non-catastrophic degradation. RF loses about 18 percentage points of accuracy, while LR and MLP suffer relatively small drops (4–6 percentage points). CNN1D occupies a middle ground, with a ~ 10.5 percentage-point decline and ASR ~ 0.15. These results (reported in Table 8) are consistent with expectations that decision-based attacks are weaker than gradient-based methods, particularly in high-dimensional tabular spaces and under limited query budgets [16,34]. These patterns are visualized in Fig. 8, which compares clean and adversarial accuracy for all models under HSJA.
4.6. ZOO Robustness (Score-Based Black-Box Attack)
We also evaluate a score-based black-box attack using a lightweight Zeroth-Order Optimization (ZOO) implementation on a 1,000-flow subset with ε = 0.03, 10 iterations, and four coordinates per step [35]. Because ZOO requires access to model confidence scores or probabilities, it is applied only to RF and LR.
Table 9
ZOO Robustness (ε = 0.03, 1,000-Flow Subset)

| Model | Clean Accuracy | Adv Accuracy | Δ Accuracy | ASR |
|---|---|---|---|---|
| RF | 0.997 | 0.949 | −0.048 | 0.0481 |
| LR | 0.955 | 0.955 | 0.000 | 0.0000 |
Under the constrained query budget, ZOO achieves only mild degradation for RF and fails to significantly affect LR. This outcome aligns with prior observations that score-based black-box attacks struggle in high-dimensional, structured feature spaces when query budgets and coordinate sampling are limited [16,6]. These effects are illustrated in Fig. 9, which shows the clean versus adversarial accuracy for RF and LR under the ZOO attack.
4.7. Overall Interpretation
Taken together, the results highlight a nuanced robustness landscape across model families and threat models. From a clean-data standpoint, RF and MLP are the strongest performers, with CNN1D and LR trailing. Under adversarial conditions, however, their behaviors diverge markedly. RF demonstrates consistently strong robustness across FGSM, PGD, HSJA, and ZOO, with robust accuracy stabilizing around 80% under gradient-based attacks and only moderate degradation under decision- and score-based attacks. These findings reinforce the value of ensemble methods for robust tabular intrusion detection [39].
The MLP, in contrast, is highly vulnerable to PGD transfer from the surrogate. While it maintains strong clean accuracy, its robust accuracy collapses below 35% for moderate ε, with ASR approaching 0.70, demonstrating the deleterious effect of sharp decision boundaries when exploited by iterative attacks [2,6]. CNN1D exhibits mixed behavior: it is noticeably affected by FGSM but remains relatively robust to PGD transfer and experiences moderate degradation under HSJA. LR behaves as a predictable linear baseline, sensitive but not catastrophically fragile.
The contrast between white-box-style PGD transfer and realistic black-box attacks (HSJA and ZOO) underscores an important conceptual point: models that appear catastrophically vulnerable in idealized white-box settings may exhibit more moderate degradation under operationally realistic black-box assumptions [17]. Finally, the differing transfer patterns across RF, LR, MLP, and CNN1D suggest that architectural diversity can be leveraged as a defensive asset. Heterogeneous ensembles combining models with distinct inductive biases may provide defense-in-depth against transfer-based attacks [32], especially when combined with semantic constraints and protocol-aware validation as described in the methodology.
4.8. Cross-Dataset Generalization: CICIDS2018
To evaluate the generalization of the CICIDS2017-trained models under distribution shift, we conducted an external validation using the CICIDS2018 Friday (02-03-2018) slice, processed into a binary Benign-Attack classification task. The slice contains 731,167 training flows, 156,679 validation flows, and 156,679 test flows, each represented by 78 CICFlowMeter-derived features, with a consistent benign-attack ratio of approximately 72.6% to 27.4% across all splits. This dataset differs meaningfully from CICIDS2017 in traffic composition, attack behaviors, and feature distributions, making it suitable for assessing out-of-distribution robustness.
Table 10 reports clean accuracy on a 20,000-flow subset of the CICIDS2018 Friday test split. Random Forest and MLP retain moderate cross-dataset performance (0.72 and 0.67), while Logistic Regression and CNN1D collapse to near-random behavior (0.08 and 0.04). These results indicate that CICIDS2018 represents a genuinely different distribution rather than a trivial variant of the training data, highlighting the importance of cross-dataset evaluation for IDS robustness studies.
Table 10
CICIDS2018 Cross-Dataset Clean Accuracy (20,000 Flows)

| Model | Clean Accuracy |
|---|---|
| RF | 0.72045 |
| LR | 0.07950 |
| MLP | 0.67445 |
| CNN1D | 0.03600 |
We then applied adversarial examples generated on the CICIDS2017 surrogate model (MLP) using FGSM and PGD with ε = 0.03. Surprisingly, Random Forest remained almost entirely unaffected under both attacks (Δ < 0.0001; ASR < 0.001), indicating that perturbations crafted on CICIDS2017 do not transfer destructively to RF on CICIDS2018. The MLP exhibited modest deterioration under FGSM (Δ = 0.022) and PGD (Δ = 0.037), far smaller than the steep within-dataset degradation reported in Sections 4.3 and 4.4. In contrast, LR and CNN1D displayed substantial improvements in accuracy when perturbed, driven by the fact that both models had initially learned miscalibrated decision boundaries under the shifted distribution. In such cases, small ℓ∞ perturbations behave more like benign data augmentation than intentional adversarial distortions. These patterns are summarized in Tables 11 and 12.
Table 11
CICIDS2018 FGSM Robustness (ε = 0.03)

| Model | Clean Acc | Adv Acc | Δ Acc | ASR |
|---|---|---|---|---|
| RF | 0.72045 | 0.72045 | 0.00000 | 0.00000 |
| LR | 0.07950 | 0.32740 | −0.24790 | 0.02767 |
| MLP | 0.67445 | 0.65220 | 0.02225 | 0.05983 |
| CNN1D | 0.03600 | 0.18495 | −0.14895 | 0.86250 |
Table 12
CICIDS2018 PGD Robustness (ε = 0.03)

| Model | Clean Acc | Adv Acc | Δ Acc | ASR |
|---|---|---|---|---|
| RF | 0.72045 | 0.72040 | 0.00005 | 0.00007 |
| LR | 0.07950 | 0.32835 | −0.24885 | 0.03082 |
| MLP | 0.67445 | 0.63780 | 0.03665 | 0.08021 |
| CNN1D | 0.03600 | 0.34655 | −0.31055 | 0.78194 |
These results collectively demonstrate that adversarial perturbations optimized on CICIDS2017 do not automatically transfer to a distinct dataset. Instead, transferability depends strongly on both model architecture and distribution alignment. For well-aligned models (RF, MLP), transferred adversarial examples induce only limited degradation; for poorly aligned models (LR, CNN1D), perturbations often act as random noise that occasionally improves accuracy. This cross-dataset analysis therefore provides a more realistic assessment of adversarial robustness for flow-based IDS.
5. Discussion of Findings
The findings of this study reveal a complex adversarial robustness landscape across machine learning and deep learning intrusion detection models. Although all models achieved high clean accuracy on CICIDS2017, their susceptibility to adversarial perturbations varied substantially once evaluated under realistic black-box conditions. These outcomes underscore the limitations of evaluating IDS performance solely on clean datasets and highlight the necessity of robustness assessments that align with the operational threat environment.
5.1. Interpreting Adversarial Robustness
A clear distinction emerges between models with inherently stable decision boundaries and those whose representations are highly sensitive to small adversarial shifts. Random Forest, for example, consistently maintained approximately 80% accuracy under both FGSM and PGD transfer-based attacks. This resilience aligns with prior observations that ensemble tree methods produce piecewise-constant decision surfaces that are difficult for gradient-driven perturbations to exploit [39,17].
In contrast, the Multilayer Perceptron exhibited pronounced fragility under PGD, with accuracy dropping from 0.98 to below 0.35. This sharp degradation is consistent with adversarial machine learning research showing that iterative attacks can exploit curvature around decision boundaries in deep models [14,6]. Logistic Regression demonstrated predictable linear behavior, experiencing moderate declines without catastrophic failure, while CNN1D showed a mixed profile: comparatively stable under PGD but more exposed to FGSM and HSJA. These differences illustrate that architectural properties, rather than clean accuracy alone, drive adversarial susceptibility.
5.2. Model-Specific Vulnerabilities
The model-specific responses observed across attacks reflect both architectural design and dataset characteristics. Deep networks are particularly exposed to high-curvature adversarial manipulation in feature-rich, imbalanced domains where minority attack classes lie near decision boundaries [38]. CNN1D’s convolutional structure captures local feature patterns that can limit PGD transfer but remain sensitive to directionally coherent perturbations such as FGSM. Logistic Regression, by contrast, demonstrated significant degradation on CICIDS2018, reinforcing the limitations of linear models when confronted with distribution shift and heterogeneous attack behaviors.
These results highlight that IDS designers must evaluate model robustness beyond static benchmarking metrics. Models that perform competitively under clean conditions may be fragile in dynamic adversarial settings, especially when trained on datasets with labeling inconsistencies or imbalanced class distributions [10,32].
5.3. Operational Implications for IDS Design
The divergence between clean and adversarial performance has direct implications for operational IDS deployment. First, high clean accuracy is not a reliable indicator of real-world robustness. SOC engineers and MLOps pipelines should incorporate adversarial stress testing as part of standard model validation, particularly for systems deployed in high-risk environments such as cloud infrastructures, financial systems, and industrial networks.
Second, the severe PGD vulnerability of the MLP suggests caution when deploying deep models for flow-based IDS unless complemented by robustness-enhancing methods such as adversarial training or uncertainty quantification [40]. Conversely, the resilience of Random Forest highlights the value of non-differentiable or non-smooth architectures in resisting gradient-based attacks.
Finally, broader trustworthy AI considerations extend beyond intrusion detection. As shown in fairness-aware educational assessment systems [12], high-impact AI systems must balance accuracy with stability, transparency, and robustness. These parallels reinforce the importance of designing IDS models that maintain consistent performance under adversarial pressure.
5.4. Realistic Threat Models and Cross-Dataset Insights
The transfer-based black-box threat model adopted in this study aligns with real-world attacker capabilities. As emphasized by [17], adversaries typically lack access to model gradients or parameters and instead rely on probing or surrogate approximations. Our results validate this perspective: while PGD induces catastrophic degradation in idealized white-box contexts, its transferability is substantially limited in realistic black-box scenarios and across heterogeneous models.
The CICIDS2018 cross-dataset evaluation reinforces this observation. Perturbations optimized on CICIDS2017 transferred only minimally to CICIDS2018, with Random Forest showing no measurable degradation and even vulnerable models exhibiting sharply reduced susceptibility. These findings demonstrate that adversarial risk must be interpreted within the context of actual dataset alignment and operational conditions rather than assumed to generalize uniformly across traffic distributions.
6. Conclusion and Future Work
This study proposed a rigorous and realistic framework for evaluating the adversarial robustness of AI-based intrusion detection systems using surrogate-based attack generation, semantic feature constraints, and cross-dataset validation. The experiments demonstrate that adversarial vulnerability is highly dependent on model architecture, attack strength, and dataset alignment. While deep networks achieved strong clean accuracy, they exhibited sharp fragility under iterative perturbations, whereas Random Forest maintained stable performance across all threat models. The limited transferability of adversarial examples across datasets emphasizes the need for evaluating IDS under diverse traffic conditions and realistic adversarial assumptions.
By integrating multiple attack types, semantically aligned constraints, and cross-dataset analysis, this work provides practical insights and methodological guidance for developing IDS models that are not only accurate but robust, deployable, and aligned with modern adversarial threat landscapes.
6.1. Future Directions
Several research avenues can strengthen IDS robustness and extend this work:
Adversarial training: Incorporating FGSM, PGD, randomized smoothing, and distributionally robust optimization may improve model resilience without sacrificing clean accuracy [14].
Temporal and sequential modeling: LSTM, GRU, and Transformer-based architectures may better capture temporal dependencies in network flows, potentially stabilizing decision boundaries under adversarial shift.
Explainable AI (XAI): Techniques such as LIME, SHAP, and integrated gradients [41] can expose which features contribute most to vulnerability, supporting finer-grained hardening strategies.
Federated and privacy-preserving IDS: Federated learning and privacy-preserving techniques [42] offer promising pathways for enabling collaborative detection while mitigating exposure to poisoning and evasion attacks.
These directions reflect a broader movement toward trustworthy, transparent, and distributionally robust AI systems for securing next-generation IoT infrastructures [32,45]. As IoT settings expand, incorporating these advancements will be vital for providing robust, adaptable, and future-proof intrusion detection.