We tried three main classical machine learning models, namely Logistic Regression, Support Vector Machines (SVMs), and Random Forests.
Logistic Regression
1. Introduction:
Since we’re on the topic of classical machine learning, we ought to include the good ol’ classic: logistic regression. Multinomial logistic regression was performed in sklearn on 8 prediction classes and 46 features: duration, protocol_type, rate, srate, drate, syn_flag_number, psh_flag_number, ack_flag_number, fin_flag_number, fin_count, ack_count, syn_count, rst_count, header_length, rst_flag_number, cwr_flag_number, ece_flag_number, number, http, https, ssh, telnet, smtp, irc, tcp, udp, dhcp, arp, icmp, ipv, llc, dns, flow_duration, urg_count, avg, max, std, tot_size, tot_sum, min, iat, magnitude, weight, radius, covariance, and variance.
2. Data Preprocessing:
Data was scaled, labels were encoded as integers 0-7, and 100,000 training datapoints and 100,000 test datapoints were sampled.
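A minimal sketch of that preprocessing, assuming the data lives in a pandas DataFrame with a "label" column (the file name, column name, and sampling details below are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("ciciot_sample.csv")          # assumed file; feature columns assumed numeric
X = df.drop(columns=["label"])
y = LabelEncoder().fit_transform(df["label"])  # encode the 8 labels as integers 0-7

# 100,000 training and 100,000 test datapoints sampled from the full data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100_000, test_size=100_000, stratify=y, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)              # scale the test set with training statistics only
```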
3. Model Training and Evaluation:
Overall performance metrics using logistic regression are not very good.
Even after tuning hyper-parameters like the solver, penalty, and "C" (the inverse of regularization strength), performance metrics remain under 0.85, and metrics other than accuracy are noticeably worse.
Increasing the number of iterations has the greatest impact on performance, but merely cranking up the iteration count is not an efficient fix.
Also, since the test set contains 100,000 data points, even these error rates translate into thousands of misclassified cases.
Confusion Matrix
Grid Search was used to find the optimal hyper-parameter “C” (inverse of regularization strength), which was C = 1.
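A hedged sketch of that grid search; the exact solver/penalty grid is not reproduced here, and the scoring metric is an assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}   # C is the inverse of regularization strength
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",   # assumed scoring metric
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)   # the search described above landed on C = 1
```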
We see considerable variation in the accuracy of predictions across classes.
DoS and DDoS are not well identified, and are commonly confused with each other and with other labels.
We also see that Web is never correctly identified, and is only ever misclassified as DDoS.
4. Conclusion:
Logistic Regression is often a good starting point for probing a dataset. However, it performs rather poorly on this one. Although we could improve the model by further optimizing hyper-parameters or increasing the size of the training dataset, it is clear that logistic regression is not the best model for the task at hand.
Support Vector Machines
1. Introduction:
This report presents an analysis of Support Vector Machines (SVM) performance in classifying network intrusion data. The dataset used comprises 200,000 randomly sampled data points with eight distinct labels: "DDoS," "DoS," "Mirai," "Benign," "Spoofing," "Recon," "Web," and "BruteForce." The dataset was evenly split into training and testing sets.
2. Data Preprocessing:
All 46 features were used for model training, with the categorical "protocol_type" feature either omitted, integer-encoded using LabelEncoder(), or one-hot encoded using pd.get_dummies().
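For illustration, the three treatments of "protocol_type" look roughly like this (the DataFrame variable name is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Option 1: drop the column entirely
X_dropped = df.drop(columns=["protocol_type"])

# Option 2: integer-encode it (imposes an arbitrary ordering)
X_label = df.copy()
X_label["protocol_type"] = LabelEncoder().fit_transform(X_label["protocol_type"])

# Option 3: one-hot encode it
X_onehot = pd.get_dummies(df, columns=["protocol_type"])
```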
3. Model Training and Evaluation:
3.1 Balanced SVC Model:


A balanced SVM model was trained with class_weight set to "balanced." Grid search was performed on the following parameters: {"C": [0, 0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3], "kernel": ["rbf", "sigmoid"], "gamma": ["scale", "auto"]}. The best parameters found were {C = 3, kernel = "rbf", gamma = "scale"}.
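A sketch of that grid search; note that sklearn's SVC requires C > 0, so the 0 from the grid above is omitted here, and the scoring metric is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 3],
    "kernel": ["rbf", "sigmoid"],
    "gamma": ["scale", "auto"],
}
grid = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid,
    scoring="f1_macro",   # assumed scoring metric
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)   # best found: C = 3, kernel = "rbf", gamma = "scale"
```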
However, the model's performance on the test dataset was suboptimal, especially for "DDoS" and "DoS" with Recall scores of 0.64 and 0.88, respectively. Mirai performed exceptionally well with only 28 misclassified out of 5657 datapoints.
3.2 Adjusted Class Weights SVC Model:


A second SVM model was trained with adjusted class weights calculated based on the log_10 counts of each label.
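One plausible reading of those log_10-based weights, sketched below: weight each class inversely to the log10 of its training count. The exact formula used is not reproduced here, so treat this as an assumption.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

# Weight each label inversely to the log10 of its sample count (assumed formula)
counts = Counter(y_train)
class_weight = {label: 1.0 / np.log10(count) for label, count in counts.items()}

svc = SVC(C=3, kernel="rbf", gamma="scale", class_weight=class_weight)
svc.fit(X_train, y_train)
```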
Despite adjusting class weights, the model's overall performance did not show significant improvement. Notably, the Recall score for "Spoofing" decreased from 0.78 to 0.66 compared to the balanced SVC model.
4. Analysis and Recommendations:
Insufficient Training Data: The models' performance could likely be improved with a training set larger than the 100,000 datapoints used here, especially for labels like "Web" and "BruteForce," which had very few training samples.
Random Forests
1. Introduction:
One of the first types of models that came to mind were Random Forests for their good balance of predictive and computational performance. They tend to have good out-of-the-box performance, and are generally resistant to over-fitting. Additionally, the question of computational performance is of particular importance given that this problem is formulated to involve running on IoT devices--RFs have the benefit of being embarrassingly parallel not only in training, but also in prediction.
2. Data Preprocessing:
The training dataset is 1% of the original dataset, randomly down-sampled with class distributions preserved, following this notebook.
Additional preprocessing includes filtering variables to the 22 identified as most important per this notebook.
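Roughly, the sampling and filtering look like this (the actual 22-feature list comes from the referenced notebook and is left as a placeholder; `df` is assumed to hold the original dataset):

```python
from sklearn.model_selection import train_test_split

top_22_features = [...]   # placeholder: the 22 features identified as most important

# Keep a 1% stratified sample so class proportions match the original dataset
df_small, _ = train_test_split(
    df, train_size=0.01, stratify=df["label"], random_state=0
)
X = df_small[top_22_features]
y = df_small["label"]
```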
3. Default Model Training & Performance
Random Forests are known for performing quite well right out of the box, so let's see how one does with all the default settings.
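As a baseline, the default model is just a stock RandomForestClassifier fit on the training split (variable names carried over from the preprocessing sketch above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_jobs=-1, random_state=0)  # all other settings left at defaults
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```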
3.1 Distributions & Statistics
As expected, that's actually not too terrible! We should be careful, however, not to rejoice too much in those 1.00 weighted averages, as they're somewhat misleading.
Note the supports (i.e. frequencies) in the classification report: this dataset is highly imbalanced, with roughly a 3.5 order-of-magnitude difference between the largest and smallest classes.
The way sklearn's weighted statistics work in multiclass problems is by taking a weighted average of each statistic, using each class's support as its weight. That means majority classes have a disproportionate effect, which compounds the majority bias most models take on in imbalanced classification. Although RF is somewhat resistant to class imbalance compared to most classical ML models, it's not completely immune.
Instead, let's look at the macro and per-class statistics, which weight each class's stats evenly regardless of how frequently it shows up in the data. There we have much more modest (but still respectable) macro averages, and a particularly high macro precision. Examining the per-class numbers, we can see the minority classes, specifically the two smallest, BruteForce and Web, dragging down the averages.
One last thing to note before moving on from this classification report is that this model has high precision and relatively low recall, especially in its minority classes. In other words, it can be said to be conservative in its predictions, which ostensibly sounds like a good thing. Tweaking where the model lies on the precision vs recall tradeoff curve is something to possibly revisit in the future.
Briefly looking at the overall class distribution, more things about Web and BruteForce jump out, namely just how few samples there are. The latter at least has more than 100, but the former only has 55 total. That's a worryingly small number of samples to work with, even before accounting for the imbalance.
In part, this scarcity is to be expected given the tiny slice of the real data we're working with in this exploration. Ideally, it's an issue that mostly goes away when training a "real" model on the full dataset, but let's put a pin in it for now and forge on as best we can. If we can squeeze reasonable performance out of this small 1% subset, then we should be in good shape for the full dataset.
3.2 Confusion Matrix
We'll look at the confusion matrices more later, but this is the one for our initial model. Of particular note are:
Web attacks (which the model performs worst on) are mostly confused with Spoofing attacks
Web attacks are never correctly identified as Web attacks specifically, but are still mostly recognized as malicious traffic
Of the incorrectly classified samples, Web, Spoofing, Recon, and BruteForce attacks are the most likely to be classified as Benign, i.e. they are the most likely to slip through a filter built on this model
4. Balancing the Dataset
There are lots of general methods to work with imbalanced datasets: under-sampling, oversampling, SMOTE, etc. All these general methods work on a principle of resampling the training data, and come with some pretty significant downsides. Before investigating the effects of those methods, let's first try some RF-specific balancing methods.
In particular, I'll look at the two methods outlined in this paper: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Weight-based balancing (Referred to as "Weighted Random Forest" in the paper)
Bagging-based balancing (Referred to as "Balanced Random Forest" in the paper)
4.1 Weight-Based Balancing
First let's look at weight-based balancing.
The primary idea is to modify misclassification penalties when training individual trees, with misclassified minority classes providing a higher penalty than majority classes. That way, the majority classes (ostensibly) do not "overwhelm" minority classes in scoring.
sklearn supports this method out of the box with the class_weight parameter.
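In sklearn this is just a constructor argument; whether we used "balanced" or the per-bootstrap "balanced_subsample" variant isn't spelled out above, so take the choice below as illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

# "balanced" weights each class inversely to its frequency in the training data;
# "balanced_subsample" recomputes those weights for every bootstrap sample.
rf_weighted = RandomForestClassifier(class_weight="balanced", n_jobs=-1, random_state=0)
rf_weighted.fit(X_train, y_train)
```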
4.2 Bagging-Based Balancing
Now let's look at bagging-based balancing.
In essence, it involves modifying the distribution of samples selected during the bagging stage of training. When sampling the dataset, it down-samples the majority classes and potentially up-samples the minority classes, resulting in a more balanced dataset for each tree to be trained on. No other steps of the process are affected.
While this may seem identical to simply resampling the base training set, as the paper shows, performance actually does differ in favor of resampling during bagging. The exact mechanisms of why are definitely out of the scope of this notebook, but it serves as a reminder that lots of aspects of ML can be very counterintuitive.
We could implement this ourselves with some effort, but thankfully the imblearn package's BalancedRandomForestClassifier implements this exact balancing method.
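A minimal sketch using imblearn; by default each tree is grown on a bootstrap sample in which the majority classes are randomly under-sampled:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Each tree sees a rebalanced bootstrap sample; everything else matches a standard RF
brf = BalancedRandomForestClassifier(n_jobs=-1, random_state=0)
brf.fit(X_train, y_train)
```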
4.3 Comparing Confusion Matrices
The performance here using a bagging-focused balancing method seems to be noticeably worse than the weight-focused method, except for recall for the two smallest classes (BruteForce and Web). However, let's examine the two confusion matrices and see if there's anything that jumps out.
You can see here more clearly the difference in performance between the two models--the top matrices corresponding to recall (i.e. the distribution of labeling among all the True labels from each class) and the bottom to precision (i.e. the distribution of labeling among all the predicted labels of each class).
Both models were similarly precise for predicting Benign labels, predicting it correctly ~88% of the time it predicts Benign. We will discuss the importance of this class' metrics more later, but for now consider that a misprediction from one malicious class (i.e. not Benign) to another malicious class would be substantially less harmful than wrongfully predicting a malicious connection as a benign one or vice versa.
The models perform almost identically when it comes to three classes (DDoS, DoS, and Mirai) and are on opposite ends of the precision-recall spectrum when it comes to the four weakest performing classes (BruteForce, Recon, Spoofing, and Web).
While the precision of predicting Benign labels is substantially higher for the weight-based model, giving it a definite edge, the two models are similar enough in performance that it merits examining both balancing methodologies going forward instead of discarding one in favor of the other.
5. Misclassification Analysis
OK, so why do the classifiers suck on BruteForce and Web? The pithy answer, of course, is "not enough samples". But let's look a little deeper at the misclassifications to see if there's anything we can use.
5.1 PCA
First, let's try doing some quick and dirty PCA to see if anything jumps out at us immediately. Maybe there'll be an easy win in finding an odd distribution or boundary we can use.
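Roughly how that plot was produced: a plain two-component PCA on standardized features, colored by class (the plotting details are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project the standardized features onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

for label in np.unique(y):
    mask = np.asarray(y) == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=2, alpha=0.3, label=str(label))
plt.legend(markerscale=5)
plt.show()
```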
Okay, so this plot is completely incomprehensible. We could fiddle some more with the opacity and labeling, but before that let's just graph the classes we're interested in digging into further (i.e. Web and BruteForce).
This graph is still pretty bad and unhelpful, and while we could fiddle with opacities and distributions, ultimately doing PCA doesn't seem like it will lead to much--the data's "true dimensionality" is likely high enough that it can't be represented graphically, and other teams have already done the work of determining feature importances and reducing dimensionality.
5.2 Per-Feature Histograms
Instead, let's look at the distributions of misclassified instances. Specifically, let's fix a class and observe the distribution of each feature from misclassified (false negative and false positive) samples. 8 labels is few enough that we can reasonably separate out what each misclassification is actually classified as.
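A sketch of how those histograms can be generated, assuming a fitted model rf, a DataFrame X_test, and string labels (all variable names here are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np

target = "Web"
y_true = np.asarray(y_test)
y_pred = np.asarray(rf.predict(X_test))

# False negatives and false positives with respect to the target class
mis = (y_true == target) != (y_pred == target)

for feature in X_test.columns:
    plt.figure(figsize=(4, 2))
    # Separate the misclassified samples by what they were actually predicted as
    for predicted_as in np.unique(y_pred[mis]):
        vals = X_test.loc[mis & (y_pred == predicted_as), feature]
        plt.hist(vals, bins=30, alpha=0.5, label=str(predicted_as))
    plt.title(f"{feature}: misclassified {target} samples")
    plt.legend(fontsize=6)
    plt.show()
```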
While there are a lot of graphs, they are much more readable than the earlier PCA graphs. We can repeat this process on each class, but looking at Web, the class with the worst performance overall, is a good starting point. Of particular note is the island of misclassifications around iat values of ~0.75, with no intermingled true positives.
This could be a result of individual trees within the forest running up against depth or leaf hyperparameter restrictions, but the models were trained with default settings, i.e. no such restrictions. A possible alternative explanation is an issue with the entropy loss, as the misclassifications on that island appear to be evenly split between DDoS and DoS labels. Additionally, the decision trees within random forests split on a single boundary value at each node rather than an interval, even though an interval is the true boundary here; capturing one requires at least two nodes, so the split (and thus the loss calculation at that split) cannot be atomic. Both issues could lead to an artificially high loss when splitting on iat values, and difficulty finding a clean boundary.
There are lots of directions to go with these histograms, and it’ll be a key focus when it comes to tuning the performance of these models.
Conclusions & Next Steps
1. Ensemble Models
Our decision to shift our focus towards ensemble models, specifically Random Forests, XGBoost, AdaBoosting, and GradientBoosting, is rooted in the superior performance observed compared to other algorithms such as Logistic Regression and Support Vector Machine (SVM). Despite the theoretical strength of SVM, its impractical training time became a substantial bottleneck, rendering it virtually impossible to train models efficiently, especially with large datasets like the 500,000 datapoints on Kaggle. In contrast, ensemble models demonstrated remarkable efficacy, with Random Forests standing out by providing excellent results in a fraction of the time. This shift aligns with our commitment to not only optimize for model accuracy but also consider computational efficiency, making ensemble methods a pragmatic choice for our next steps in the pursuit of effective and scalable machine learning solutions.
2. Tuning Fit
Comparing the performance of the three Random Forest models on test and training data, it’s safe to say that the default and weight-based models are heavily overfit.
This should be completely expected given the default hyperparameter settings of infinite depth and zero leaf-size restrictions. While the bagging-based model is quite balanced, as seen earlier its performance metrics on test data are slightly worse than the other two.
There are several knobs we can use to tune the fit of tree-based models, the most obvious of which are the hyperparameters used when training them.
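The usual suspects, sketched as a small grid search (the values shown are illustrative, not ones we have settled on):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Restrict tree depth and leaf size to rein in the over-fitting seen above
param_grid = {
    "max_depth": [10, 20, None],
    "min_samples_leaf": [1, 5, 20],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid,
    scoring="f1_macro",   # assumed scoring metric
    cv=3,
)
search.fit(X_train, y_train)
```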
Bagging (which we’ve already used implicitly) and boosting are other methods to help tune Random Forests. In general, bagging helps decrease variance and counter overfitting, and boosting helps decrease bias and counter underfitting. Reducing bias may seem almost counterproductive given the state of current models, but given the permissiveness of the default hyperparameters, it’s likely that tuning them is going to result in reducing overfitting (and thus increasing bias). So it’ll be good to keep a method to help counter underfitting in our back pocket.
Importantly, bagging and boosting are not mutually exclusive. It’s entirely possible to combine the two by adding bagging to sample selection when growing trees during the second stage of boosting. Of course, by the no-free-lunch (NFL) theorem this may not be best, but it’s worth a shot.
3. Feature Engineering
As seen in the histograms of misclassification feature distributions, there are some ranges of values containing purely incorrect classifications, such as iat in the Web misclassifications. Creating obvious derived features to split on from those ranges could result in better (i.e. more generalizable) splits, which will become even more important as we tune our models to decrease variance (and thus increase bias).
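As a toy example of what such a derived feature might look like, flagging the band where those Web misclassifications cluster (the feature name and the 0.7-0.8 bounds are placeholders around the ~0.75 value seen in the histogram, not measured cutoffs):

```python
# Hypothetical indicator feature over the suspect iat band; bounds are placeholders
X_train["iat_web_band"] = X_train["iat"].between(0.7, 0.8).astype(int)
X_test["iat_web_band"] = X_test["iat"].between(0.7, 0.8).astype(int)
```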
4. Chaining Models
Our next strategic move in training models involves the implementation of a chained model approach to enhance classification accuracy. By merging labels and grouping distinct attack types, we aim to augment the model's capacity to generalize and classify unseen data more precisely. Leveraging insights from Random Forests, where notable predictive performance was observed for DDoS, DoS, and Mirai attacks, we propose a hierarchical approach. Initially, the model predicts whether each datapoint represents an attack or benign activity. Subsequently, it further refines its predictions by classifying the specific attack type, creating a cascading structure of predictions. For instance, the model could be trained and tested on {"Benign", "Attack"} to predict "Attack," followed by predicting {"DDoS", "DoS", "Mirai", "pooled1"} and subsequently {"Spoofing", "Recon", "pooled2"} to achieve granular attack type classifications such as {"Web", "BruteForce"}. This hierarchical chaining aims to capitalize on the strengths observed in the model's performance for specific attack categories, ultimately optimizing its predictive accuracy across diverse threat scenarios.
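A rough sketch of what the first two stages of that chain could look like (label groupings, pooled-class names, and variable names are illustrative, assuming string labels in y_train):

```python
from sklearn.ensemble import RandomForestClassifier

# Stage 1: benign vs. attack
stage1 = RandomForestClassifier(n_jobs=-1, random_state=0)
stage1.fit(X_train, (y_train != "Benign").astype(int))

# Stage 2: among attacks, separate the well-identified classes (DDoS, DoS, Mirai)
# from a pooled remainder to be refined by later stages
attack_mask = (y_train != "Benign")
stage2_labels = y_train[attack_mask].where(
    y_train[attack_mask].isin(["DDoS", "DoS", "Mirai"]), other="pooled1"
)
stage2 = RandomForestClassifier(n_jobs=-1, random_state=0)
stage2.fit(X_train[attack_mask], stage2_labels)

# Further stages would split "pooled1" into Spoofing/Recon/pooled2, and so on.
```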
5. Determining Metrics & Goals
As much as we can squeeze tradeoff-free performance improvements out of the models with the aforementioned next steps, ultimately we’re going to hit a wall and will need to decide which performance metrics we want to prioritize. Is it better to improve performance on classifying Benign connections at the expense of malicious ones? Is it worse to misclassify a Benign packet or a malicious packet? Are certain malicious packets more important to capture than others?
These questions and more deserve a future in-depth post, as not only can they be expanded on in substantial depth, but they will also determine how we ultimately tune our models.