Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (185442, 20) | Test shape: (16287, 20)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
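The preparation steps above (infinities treated as missing, missing values filled with train medians, scaling fit on train only) can be sketched in plain Python. A real pipeline would typically use pandas plus scikit-learn's StandardScaler; the tiny rows below are illustrative, not from the dataset.

```python
import math

def preprocess(train, test):
    """Sketch of the stated preparation: replace infinities with missing
    values, impute with *train* medians, then standardize both splits
    with mean/std fit on train only (as StandardScaler would)."""
    n_cols = len(train[0])

    def col(rows, j):
        return [r[j] for r in rows]

    def clean(x):  # treat +/-inf as missing
        return None if x is None or math.isinf(x) else x

    # 1) infinities -> missing
    train = [[clean(x) for x in r] for r in train]
    test = [[clean(x) for x in r] for r in test]

    # 2) per-column train medians for imputation
    def median(vals):
        s = sorted(vals)
        m = len(s) // 2
        return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

    med = [median([x for x in col(train, j) if x is not None])
           for j in range(n_cols)]

    def fill(rows):
        return [[med[j] if r[j] is None else r[j] for j in range(n_cols)]
                for r in rows]

    train, test = fill(train), fill(test)

    # 3) standardize with train statistics (population std, like StandardScaler)
    mean = [sum(col(train, j)) / len(train) for j in range(n_cols)]
    std = [math.sqrt(sum((x - mean[j]) ** 2 for x in col(train, j)) / len(train))
           or 1.0 for j in range(n_cols)]

    def scale(rows):
        return [[(r[j] - mean[j]) / std[j] for j in range(n_cols)] for r in rows]

    return scale(train), scale(test)

# toy data: one infinite cell, one missing cell
train_s, test_s = preprocess([[1.0, 2.0], [3.0, None], [float("inf"), 4.0]],
                             [[2.0, 3.0]])
```

The key design point is that medians, means, and standard deviations come from the train split only, so no information from the test split leaks into the transformation.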
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
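One of the suggested remedies, cost-sensitive learning, can be made concrete. The sketch below computes the "balanced" per-class weights (the scheme scikit-learn uses for `class_weight='balanced'`: w_c = n_samples / (n_classes * n_c)) for the train distribution reported above; the helper name is ours.

```python
def balanced_class_weights(counts):
    """Inverse-frequency class weights: rare classes get proportionally
    larger weights, so misclassifying them costs more during training."""
    n = sum(counts.values())
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# train distribution from the report: 184,585 benign vs 857 malicious
w = balanced_class_weights({0: 184_585, 1: 857})
# minority weight ~108, majority weight ~0.50: the minority class is
# weighted over 200x more heavily than the majority class
```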
Baseline (Most-Frequent) Accuracy: 0.9955
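The baseline is simply the majority-class share of the test set. A minimal check, using the test counts implied by the classification reports (16,213 vs 74):

```python
def most_frequent_accuracy(class_counts):
    """Accuracy of a dummy classifier that always predicts the most
    frequent class: majority_count / total."""
    return max(class_counts.values()) / sum(class_counts.values())

acc = most_frequent_accuracy({0: 16_213, 1: 74})  # ~0.9955
```

Any model must beat this number for its accuracy to mean anything, which is why the imbalance warning steers evaluation toward PR-AUC and F1 instead.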
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9626 | 0.5777 | 0.0249 | 0.1892 | 0.0440 | 0.6720 | 0.0265 |
| Random Forest (SMOTE) | 0.9955 | 0.5269 | 0.5000 | 0.0541 | 0.0976 | 0.6809 | 0.0728 |
| LightGBM | 0.8596 | 0.6335 | 0.0132 | 0.4054 | 0.0256 | 0.6343 | 0.0082 |
| Balanced RF | 0.9101 | 0.6791 | 0.0227 | 0.4459 | 0.0431 | 0.8530 | 0.0414 |
| SGD SVM | 0.9407 | 0.5532 | 0.0131 | 0.1622 | 0.0242 | n/a | n/a |
| IsolationForest | 0.9859 | 0.5490 | 0.0465 | 0.1081 | 0.0650 | n/a | n/a |

n/a: these models do not expose probability scores, so the rank-based AUC metrics were not computed.
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 15664 | 549 | 60 | 14 | 3.39% | 81.08% |
| Random Forest (SMOTE) | 16209 | 4 | 70 | 4 | 0.02% | 94.59% |
| LightGBM | 13970 | 2243 | 44 | 30 | 13.83% | 59.46% |
| Balanced RF | 14790 | 1423 | 41 | 33 | 8.78% | 55.41% |
| SGD SVM | 15309 | 904 | 62 | 12 | 5.58% | 83.78% |
| IsolationForest | 16049 | 164 | 66 | 8 | 1.01% | 89.19% |
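Every rate column above follows directly from the four confusion-matrix cells. A minimal sketch (function name is ours), checked against the Logistic Regression row (TN=15664, FP=549, FN=60, TP=14):

```python
def derived_metrics(tn, fp, fn, tp):
    """Point metrics derived from confusion-matrix counts:
    FP rate = FP/(FP+TN), miss rate = FN/(FN+TP), plus the headline
    precision/recall/F1 and balanced accuracy columns."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # = 1 - miss rate (true positive rate)
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)               # specificity (true negative rate)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_acc": (recall + tnr) / 2,
        "fp_rate": fp / (fp + tn),
        "miss_rate": fn / (fn + tp),
    }

m = derived_metrics(15664, 549, 60, 14)  # Logistic Regression row
```

This makes the imbalance problem visible: a model can post high accuracy while its miss rate (here 81%) stays unacceptable for detection.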
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | Random Forest (SMOTE) | 0.9955 |
| Balanced Acc | Balanced RF | 0.6791 |
| Precision | Random Forest (SMOTE) | 0.5000 |
| Recall | Balanced RF | 0.4459 |
| F1 | Random Forest (SMOTE) | 0.0976 |
| ROC-AUC | Balanced RF | 0.8530 |
| PR-AUC | Random Forest (SMOTE) | 0.0728 |
| Lowest False Positive Rate | Random Forest (SMOTE) | 0.02% |
| Lowest Miss Rate | Balanced RF | 55.41% |
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
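The threshold sweep panel recomputes the point metrics at each probability cutoff instead of the default 0.5, so an operating point can be chosen to trade false positives against misses. A sketch with toy labels and scores (illustrative, not the report's data):

```python
def threshold_sweep(y_true, scores, thresholds):
    """For each cutoff, binarize the scores and recompute
    precision/recall/F1, returning one (threshold, p, r, f1) row per cutoff."""
    rows = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p and y for p, y in zip(pred, y_true))
        fp = sum(p and not y for p, y in zip(pred, y_true))
        fn = sum(not p and y for p, y in zip(pred, y_true))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        rows.append((t, prec, rec, f1))
    return rows

y = [0, 0, 0, 0, 1, 1]               # toy ground truth
s = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]   # toy predicted probabilities
sweep = threshold_sweep(y, s, [0.25, 0.5, 0.75])
```

Lowering the threshold raises recall at the cost of more false positives; with extreme imbalance the F1-optimal cutoff is usually well below 0.5.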
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9962 | 0.9661 | 0.9809 | 16213 |
| 1 | 0.0249 | 0.1892 | 0.0440 | 74 |
| Accuracy |  |  | 0.9626 | 16287 |
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9957 | 0.9998 | 0.9977 | 16213 |
| 1 | 0.5000 | 0.0541 | 0.0976 | 74 |
| Accuracy |  |  | 0.9955 | 16287 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9969 | 0.8617 | 0.9243 | 16213 |
| 1 | 0.0132 | 0.4054 | 0.0256 | 74 |
| Accuracy |  |  | 0.8596 | 16287 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9972 | 0.9122 | 0.9528 | 16213 |
| 1 | 0.0227 | 0.4459 | 0.0431 | 74 |
| Accuracy |  |  | 0.9101 | 16287 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9960 | 0.9442 | 0.9694 | 16213 |
| 1 | 0.0131 | 0.1622 | 0.0242 | 74 |
| Accuracy |  |  | 0.9407 | 16287 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9959 | 0.9899 | 0.9929 | 16213 |
| 1 | 0.0465 | 0.1081 | 0.0650 | 74 |
| Accuracy |  |  | 0.9859 | 16287 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (185442, 34) | Test shape: (16287, 34)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority class represents only 0.4643% of the data
• This extreme imbalance may cause models to predict everything as majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8257 | 0.6636 | 0.0130 | 0.5000 | 0.0254 | 0.6585 | 0.0235 |
| Random Forest (SMOTE) | 0.9955 | 0.5269 | 0.5714 | 0.0541 | 0.0988 | 0.6588 | 0.0680 |
| LightGBM | 0.7120 | 0.7410 | 0.0121 | 0.7703 | 0.0237 | 0.7560 | 0.0115 |
| Balanced RF | 0.9225 | 0.6786 | 0.0256 | 0.4324 | 0.0483 | 0.8520 | 0.0250 |
| SGD SVM | 0.9570 | 0.5816 | 0.0229 | 0.2027 | 0.0411 | n/a | n/a |
| IsolationForest | 0.9939 | 0.5194 | 0.0938 | 0.0405 | 0.0566 | n/a | n/a |

n/a: these models do not expose probability scores, so the rank-based AUC metrics were not computed.
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 13411 | 2802 | 37 | 37 | 17.28% | 50.00% |
| Random Forest (SMOTE) | 16210 | 3 | 70 | 4 | 0.02% | 94.59% |
| LightGBM | 11540 | 4673 | 17 | 57 | 28.82% | 22.97% |
| Balanced RF | 14993 | 1220 | 42 | 32 | 7.52% | 56.76% |
| SGD SVM | 15572 | 641 | 59 | 15 | 3.95% | 79.73% |
| IsolationForest | 16184 | 29 | 71 | 3 | 0.18% | 95.95% |
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | Random Forest (SMOTE) | 0.9955 |
| Balanced Acc | LightGBM | 0.7410 |
| Precision | Random Forest (SMOTE) | 0.5714 |
| Recall | LightGBM | 0.7703 |
| F1 | Random Forest (SMOTE) | 0.0988 |
| ROC-AUC | Balanced RF | 0.8520 |
| PR-AUC | Random Forest (SMOTE) | 0.0680 |
| Lowest False Positive Rate | Random Forest (SMOTE) | 0.02% |
| Lowest Miss Rate | LightGBM | 22.97% |
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9972 | 0.8272 | 0.9043 | 16213 |
| 1 | 0.0130 | 0.5000 | 0.0254 | 74 |
| Accuracy |  |  | 0.8257 | 16287 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9957 | 0.9998 | 0.9978 | 16213 |
| 1 | 0.5714 | 0.0541 | 0.0988 | 74 |
| Accuracy |  |  | 0.9955 | 16287 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9985 | 0.7118 | 0.8311 | 16213 |
| 1 | 0.0121 | 0.7703 | 0.0237 | 74 |
| Accuracy |  |  | 0.7120 | 16287 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9972 | 0.9248 | 0.9596 | 16213 |
| 1 | 0.0256 | 0.4324 | 0.0483 | 74 |
| Accuracy |  |  | 0.9225 | 16287 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9962 | 0.9605 | 0.9780 | 16213 |
| 1 | 0.0229 | 0.2027 | 0.0411 | 74 |
| Accuracy |  |  | 0.9570 | 16287 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9956 | 0.9982 | 0.9969 | 16213 |
| 1 | 0.0938 | 0.0405 | 0.0566 | 74 |
| Accuracy |  |  | 0.9939 | 16287 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.