Machine Learning Analysis Report

Machine Learning Analysis Pipeline

EDR: Dataset Loading & Preprocessing

EDR – Train/Test Overview
• Train shape: (185442, 20) | Test shape: (16287, 20)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
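The class balance figure is a minority-to-majority ratio, which is slightly larger than the minority share of all training samples. A minimal sketch of both calculations from the counts above (variable names are illustrative and not taken from the pipeline):

```python
# Class counts from the train class distribution above.
majority = 184_585  # label 0
minority = 857      # label 1

# Ratio of minority to majority samples (the "class balance" figure).
ratio = minority / majority

# Share of minority samples among all training samples.
share = minority / (majority + minority)

print(f"minority/majority: {ratio:.4%}")
print(f"minority share:    {share:.4%}")
```

The two values differ only in the denominator, so they diverge more as the minority class grows.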
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
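The preparation steps above can be sketched for a single feature column using only the standard library. This is an illustration under stated assumptions, not the pipeline's code: it assumes infinite values are treated as missing before median imputation (the report only says they are "handled"), and the helper names `fit_feature` and `transform_feature` are hypothetical. The scaling mirrors the arithmetic of StandardScaler (mean and population standard deviation learned from train only).

```python
import math
import statistics


def fit_feature(train_values):
    """Learn imputation and scaling statistics from training data only."""
    # Assumption: inf and NaN are both treated as missing.
    finite = [v for v in train_values if math.isfinite(v)]
    median = statistics.median(finite)
    filled = [v if math.isfinite(v) else median for v in train_values]
    mean = statistics.fmean(filled)
    # Population standard deviation; fall back to 1.0 for a constant
    # feature so the division in transform_feature stays defined.
    std = statistics.pstdev(filled) or 1.0
    return median, mean, std


def transform_feature(values, median, mean, std):
    """Apply train-derived statistics to any split (train or test)."""
    filled = [v if math.isfinite(v) else median for v in values]
    return [(v - mean) / std for v in filled]


# Toy feature column, not data from the report.
train = [1.0, 2.0, math.inf, 3.0, math.nan]
test = [math.inf, 4.0]

stats = fit_feature(train)
scaled_train = transform_feature(train, *stats)
scaled_test = transform_feature(test, *stats)
print(stats)
print(scaled_test)
```

Fitting on train and reusing the same statistics on test keeps test-set information out of the preprocessing step.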
⚠️ Extreme Class Imbalance Detected
• Minority-to-majority ratio is only 0.4643% (the minority class is about 0.46% of training samples)
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955
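A short sketch of why the most-frequent baseline scores so high on accuracy while detecting nothing. It assumes the baseline is evaluated on the test set, which is consistent with the reported 0.9955; the class supports are those listed in the classification reports of this document.

```python
# Test-set class supports from the classification reports.
n_negative = 16_213  # label 0 (majority)
n_positive = 74      # label 1 (minority)

# A most-frequent classifier predicts label 0 for every sample.
tn, fp, fn, tp = n_negative, 0, n_positive, 0

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"recall:            {recall:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

Any model whose accuracy is near 0.9955 should therefore be judged on recall, balanced accuracy, or PR-AUC rather than accuracy alone.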

EDR: Model Performance Comparison

EDR – Model Performance Metrics

Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC
Logistic Regression | 0.9626 | 0.5777 | 0.0249 | 0.1892 | 0.0440 | 0.6720 | 0.0265
Random Forest (SMOTE) | 0.9955 | 0.5269 | 0.5000 | 0.0541 | 0.0976 | 0.6809 | 0.0728
LightGBM | 0.8596 | 0.6335 | 0.0132 | 0.4054 | 0.0256 | 0.6343 | 0.0082
Balanced RF | 0.9101 | 0.6791 | 0.0227 | 0.4459 | 0.0431 | 0.8530 | 0.0414
SGD SVM | 0.9407 | 0.5532 | 0.0131 | 0.1622 | 0.0242 | nan | nan
IsolationForest | 0.9859 | 0.5490 | 0.0465 | 0.1081 | 0.0650 | nan | nan

Confusion Matrix Analysis

Model | TN | FP | FN | TP | FP Rate | Miss Rate
Logistic Regression | 15664 | 549 | 60 | 14 | 3.39% | 81.08%
Random Forest (SMOTE) | 16209 | 4 | 70 | 4 | 0.02% | 94.59%
LightGBM | 13970 | 2243 | 44 | 30 | 13.83% | 59.46%
Balanced RF | 14790 | 1423 | 41 | 33 | 8.78% | 55.41%
SGD SVM | 15309 | 904 | 62 | 12 | 5.58% | 83.78%
IsolationForest | 16049 | 164 | 66 | 8 | 1.01% | 89.19%
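Every threshold-dependent metric in this report follows from the four confusion-matrix counts. The sketch below recomputes them for the EDR Logistic Regression row (TN=15664, FP=549, FN=60, TP=14); the function name `metrics_from_counts` is illustrative and not part of the pipeline.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive threshold-dependent metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # Guard against a model that never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_acc": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fp_rate": fp / (fp + tn),    # share of negatives flagged
        "miss_rate": fn / (fn + tp),  # share of positives missed
    }


# EDR Logistic Regression counts from the confusion matrix table.
m = metrics_from_counts(tn=15664, fp=549, fn=60, tp=14)
for name, value in m.items():
    print(f"{name:>12}: {value:.4f}")
```

ROC-AUC and PR-AUC cannot be recovered this way because they depend on the full score distribution rather than on a single threshold.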

Best Models by Metric

Metric | Best Model | Value
Accuracy | Random Forest (SMOTE) | 0.9955
Balanced Acc | Balanced RF | 0.6791
Precision | Random Forest (SMOTE) | 0.5000
Recall | Balanced RF | 0.4459
F1 | Random Forest (SMOTE) | 0.0976
ROC-AUC | Balanced RF | 0.8530
PR-AUC | Random Forest (SMOTE) | 0.0728
Lowest False Positive Rate | Random Forest (SMOTE) | 0.02%
Lowest Miss Rate | Balanced RF | 55.41%
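A best-by-metric summary like this one can be produced by taking the maximum of each metric across models, or the minimum for error rates. The sketch below uses three columns of the EDR tables; skipping `nan` entries is an assumption about how the pipeline treats models without AUC values, and `best_model` is a hypothetical helper.

```python
import math

nan = math.nan

# A few columns from the EDR tables (FP rate expressed as a fraction).
metrics = {
    "Logistic Regression":   {"Recall": 0.1892, "ROC-AUC": 0.6720, "FP Rate": 0.0339},
    "Random Forest (SMOTE)": {"Recall": 0.0541, "ROC-AUC": 0.6809, "FP Rate": 0.0002},
    "LightGBM":              {"Recall": 0.4054, "ROC-AUC": 0.6343, "FP Rate": 0.1383},
    "Balanced RF":           {"Recall": 0.4459, "ROC-AUC": 0.8530, "FP Rate": 0.0878},
    "SGD SVM":               {"Recall": 0.1622, "ROC-AUC": nan,    "FP Rate": 0.0558},
    "IsolationForest":       {"Recall": 0.1081, "ROC-AUC": nan,    "FP Rate": 0.0101},
}

# Metrics where a smaller value is better.
LOWER_IS_BETTER = {"FP Rate"}


def best_model(metric):
    """Return (model, value) for the best non-nan score on one metric."""
    scores = {
        name: row[metric]
        for name, row in metrics.items()
        if not math.isnan(row[metric])  # assumption: nan entries are skipped
    }
    pick = min if metric in LOWER_IS_BETTER else max
    name = pick(scores, key=scores.get)
    return name, scores[name]


for metric in ("Recall", "ROC-AUC", "FP Rate"):
    print(metric, best_model(metric))
```

Filtering out `nan` explicitly matters because comparisons against `nan` are always false, which would otherwise make the result depend on dictionary order.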

EDR – Metrics by Model

[Figure]

EDR – ROC Curves

[Figure]

EDR – Precision–Recall Curves

[Figure]

EDR – Predicted Probability Distributions

[Figure]

EDR – Threshold Sweep

[Figure]

EDR: Logistic Regression – Detailed Analysis

EDR – Logistic Regression: Confusion Matrix

[Figure]

EDR – Logistic Regression: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9962 | 0.9661 | 0.9809 | 16213
1 | 0.0249 | 0.1892 | 0.0440 | 74
accuracy | nan | nan | 0.9626 | 16287

EDR – Logistic Regression: Feature Importance

[Figure]

EDR: Random Forest (SMOTE) – Detailed Analysis

EDR – Random Forest (SMOTE): Confusion Matrix

[Figure]

EDR – Random Forest (SMOTE): Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9957 | 0.9998 | 0.9977 | 16213
1 | 0.5000 | 0.0541 | 0.0976 | 74
accuracy | nan | nan | 0.9955 | 16287

EDR – Random Forest (SMOTE): Feature Importance

[Figure]

EDR: LightGBM – Detailed Analysis

EDR – LightGBM: Confusion Matrix

[Figure]

EDR – LightGBM: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9969 | 0.8617 | 0.9243 | 16213
1 | 0.0132 | 0.4054 | 0.0256 | 74
accuracy | nan | nan | 0.8596 | 16287

EDR – LightGBM: Feature Importance

[Figure]

EDR: Balanced RF – Detailed Analysis

EDR – Balanced RF: Confusion Matrix

[Figure]

EDR – Balanced RF: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9972 | 0.9122 | 0.9528 | 16213
1 | 0.0227 | 0.4459 | 0.0431 | 74
accuracy | nan | nan | 0.9101 | 16287

EDR – Balanced RF: Feature Importance

[Figure]

EDR: SGD SVM – Detailed Analysis

EDR – SGD SVM: Confusion Matrix

[Figure]

EDR – SGD SVM: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9960 | 0.9442 | 0.9694 | 16213
1 | 0.0131 | 0.1622 | 0.0242 | 74
accuracy | nan | nan | 0.9407 | 16287

EDR – SGD SVM: Feature Importance

[Figure]

EDR: IsolationForest – Detailed Analysis

EDR – IsolationForest: Confusion Matrix

[Figure]

EDR – IsolationForest: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9959 | 0.9899 | 0.9929 | 16213
1 | 0.0465 | 0.1081 | 0.0650 | 74
accuracy | nan | nan | 0.9859 | 16287

EDR – IsolationForest: Feature Importance

Feature importance not available for this model type.

XDR: Dataset Loading & Preprocessing

XDR – Train/Test Overview
• Train shape: (185442, 34) | Test shape: (16287, 34)
• Total train samples: 185,442 | Total test samples: 16,287
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 184,585
• 1: 857
• Class balance (minority/majority): 0.4643%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
⚠️ Extreme Class Imbalance Detected
• Minority-to-majority ratio is only 0.4643% (the minority class is about 0.46% of training samples)
• This extreme imbalance may cause models to predict everything as the majority class
• Consider: more aggressive SMOTE ratios, cost-sensitive learning, or ensemble methods
• Metrics like Precision-Recall AUC and F1 are more meaningful than accuracy
Baseline (Most-Frequent) Accuracy: 0.9955
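For the cost-sensitive learning option mentioned in the warning above, one common starting point is the "balanced" weighting heuristic, n_samples / (n_classes * class_count), which scikit-learn documents for `class_weight="balanced"`. The sketch below applies it to the train class counts; whether this pipeline uses such weights is not stated in the report, so treat it as an illustration only.

```python
# Train class counts from the class distribution above.
counts = {0: 184_585, 1: 857}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: each class contributes equal total weight.
weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

With these counts a minority sample is weighted roughly 215 times more heavily than a majority sample, which typically raises recall at the cost of more false positives.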

XDR: Model Performance Comparison

XDR – Model Performance Metrics

Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC
Logistic Regression | 0.8257 | 0.6636 | 0.0130 | 0.5000 | 0.0254 | 0.6585 | 0.0235
Random Forest (SMOTE) | 0.9955 | 0.5269 | 0.5714 | 0.0541 | 0.0988 | 0.6588 | 0.0680
LightGBM | 0.7120 | 0.7410 | 0.0121 | 0.7703 | 0.0237 | 0.7560 | 0.0115
Balanced RF | 0.9225 | 0.6786 | 0.0256 | 0.4324 | 0.0483 | 0.8520 | 0.0250
SGD SVM | 0.9570 | 0.5816 | 0.0229 | 0.2027 | 0.0411 | nan | nan
IsolationForest | 0.9939 | 0.5194 | 0.0938 | 0.0405 | 0.0566 | nan | nan

Confusion Matrix Analysis

Model | TN | FP | FN | TP | FP Rate | Miss Rate
Logistic Regression | 13411 | 2802 | 37 | 37 | 17.28% | 50.00%
Random Forest (SMOTE) | 16210 | 3 | 70 | 4 | 0.02% | 94.59%
LightGBM | 11540 | 4673 | 17 | 57 | 28.82% | 22.97%
Balanced RF | 14993 | 1220 | 42 | 32 | 7.52% | 56.76%
SGD SVM | 15572 | 641 | 59 | 15 | 3.95% | 79.73%
IsolationForest | 16184 | 29 | 71 | 3 | 0.18% | 95.95%
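The XDR LightGBM row (TN=11540, FP=4673, FN=17, TP=57) illustrates how high recall can coexist with very low precision when positives are rare. The sketch below expresses precision in terms of recall, false positive rate, and test-set prevalence, and checks that it agrees with the direct count-based value; it is a worked identity, not part of the pipeline.

```python
# XDR LightGBM counts from the confusion matrix table.
tn, fp, fn, tp = 11540, 4673, 17, 57

total = tn + fp + fn + tp
prevalence = (tp + fn) / total  # share of positives in the test set
tpr = tp / (tp + fn)            # recall
fpr = fp / (fp + tn)            # false positive rate

# Precision as a function of TPR, FPR, and prevalence (Bayes' rule).
precision_from_rates = (tpr * prevalence) / (
    tpr * prevalence + fpr * (1 - prevalence)
)

# Precision computed directly from the counts.
precision_direct = tp / (tp + fp)

print(f"prevalence: {prevalence:.4%}")
print(f"precision (from rates): {precision_from_rates:.4f}")
print(f"precision (direct):     {precision_direct:.4f}")
```

Because prevalence is below 0.5%, even a moderate false positive rate produces far more false alarms than true detections, which keeps precision near 1% despite a recall of about 77%.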

Best Models by Metric

Metric | Best Model | Value
Accuracy | Random Forest (SMOTE) | 0.9955
Balanced Acc | LightGBM | 0.7410
Precision | Random Forest (SMOTE) | 0.5714
Recall | LightGBM | 0.7703
F1 | Random Forest (SMOTE) | 0.0988
ROC-AUC | Balanced RF | 0.8520
PR-AUC | Random Forest (SMOTE) | 0.0680
Lowest False Positive Rate | Random Forest (SMOTE) | 0.02%
Lowest Miss Rate | LightGBM | 22.97%

XDR – Metrics by Model

[Figure]

XDR – ROC Curves

[Figure]

XDR – Precision–Recall Curves

[Figure]

XDR – Predicted Probability Distributions

[Figure]

XDR – Threshold Sweep

[Figure]

XDR: Logistic Regression – Detailed Analysis

XDR – Logistic Regression: Confusion Matrix

[Figure]

XDR – Logistic Regression: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9972 | 0.8272 | 0.9043 | 16213
1 | 0.0130 | 0.5000 | 0.0254 | 74
accuracy | nan | nan | 0.8257 | 16287

XDR – Logistic Regression: Feature Importance

[Figure]

XDR: Random Forest (SMOTE) – Detailed Analysis

XDR – Random Forest (SMOTE): Confusion Matrix

[Figure]

XDR – Random Forest (SMOTE): Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9957 | 0.9998 | 0.9978 | 16213
1 | 0.5714 | 0.0541 | 0.0988 | 74
accuracy | nan | nan | 0.9955 | 16287

XDR – Random Forest (SMOTE): Feature Importance

[Figure]

XDR: LightGBM – Detailed Analysis

XDR – LightGBM: Confusion Matrix

[Figure]

XDR – LightGBM: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9985 | 0.7118 | 0.8311 | 16213
1 | 0.0121 | 0.7703 | 0.0237 | 74
accuracy | nan | nan | 0.7120 | 16287

XDR – LightGBM: Feature Importance

[Figure]

XDR: Balanced RF – Detailed Analysis

XDR – Balanced RF: Confusion Matrix

[Figure]

XDR – Balanced RF: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9972 | 0.9248 | 0.9596 | 16213
1 | 0.0256 | 0.4324 | 0.0483 | 74
accuracy | nan | nan | 0.9225 | 16287

XDR – Balanced RF: Feature Importance

[Figure]

XDR: SGD SVM – Detailed Analysis

XDR – SGD SVM: Confusion Matrix

[Figure]

XDR – SGD SVM: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9962 | 0.9605 | 0.9780 | 16213
1 | 0.0229 | 0.2027 | 0.0411 | 74
accuracy | nan | nan | 0.9570 | 16287

XDR – SGD SVM: Feature Importance

[Figure]

XDR: IsolationForest – Detailed Analysis

XDR – IsolationForest: Confusion Matrix

[Figure]

XDR – IsolationForest: Classification Report

Class | Precision | Recall | F1 | Support
0 | 0.9956 | 0.9982 | 0.9969 | 16213
1 | 0.0938 | 0.0405 | 0.0566 | 74
accuracy | nan | nan | 0.9939 | 16287

XDR – IsolationForest: Feature Importance

Feature importance not available for this model type.