Sparse and Spurious
A Chonky Problem
As mentioned in a previous post, we have a CHONKY 14 GB dataset. This is too large to even load into the memory we have available, so we’ve been working on downsizing the dataset.
Earlier, we sampled 5% of the 46M datapoints to get the dataset down to 600 MB in size. Progress! 💪 But there’s still one problem… we have literally PAGES of features we could use and no idea which ones are best.
Worse still, the features are very sparse. 33 of them just specify which network protocol is being used. Since only one protocol can be used at a time, 32 of those 33 columns are zero in every row. We’re storing hundreds of megabytes of empty data! 😖
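To make the sparsity concrete, here’s a toy sketch (column names and row count are made up) showing how much memory 33 mutually-exclusive one-hot flags waste compared to a single categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: 33 one-hot protocol flags (exactly one is 1 per row)
# can be collapsed into a single categorical column. Names are made up.
n_rows = 100_000
protocols = [f"proto_{i}" for i in range(33)]
codes = np.random.default_rng(0).integers(0, 33, n_rows)
one_hot = pd.DataFrame(np.eye(33, dtype="int64")[codes], columns=protocols)

# idxmax(axis=1) finds the single column that is 1 in each row.
collapsed = one_hot.idxmax(axis=1).astype("category")

dense_mb = one_hot.memory_usage(deep=True).sum() / 1e6
cat_mb = collapsed.memory_usage(deep=True) / 1e6
print(f"one-hot: {dense_mb:.1f} MB vs categorical: {cat_mb:.1f} MB")
```

On real data the ratio depends on dtypes, but the one-hot version is wider by a factor of 33 no matter what.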
It’s not just the data storage which is the problem. If we try to train an ML model like a neural network with all features, we’ll need a lot of inputs. A lot of inputs = a lot of parameters = a lot of training time + a model that can’t run on a small IoT device.
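The “more inputs = more parameters” arithmetic is easy to check. Below is a quick parameter count for a hypothetical small MLP (two hidden layers of 64 units, 34 output classes; all sizes invented for illustration):

```python
# Sanity check on "more inputs = more parameters", assuming a hypothetical
# small MLP with two hidden layers of 64 units and 34 output classes.
def mlp_params(n_inputs, hidden=(64, 64), n_outputs=34):
    sizes = [n_inputs, *hidden, n_outputs]
    # Each layer contributes fan_in * fan_out weights plus fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

print(mlp_params(10), mlp_params(100))  # input width drives the first layer's size
```

The first layer scales linearly with input width, so trimming features directly shrinks the model.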
Simple conclusion: we have to find the best features before we start training models.
Approach A: Heatmaps?
This is the conventional approach. Find the correlations between each cyberattack and each feature in our data (like how long a network connection lasted).
Without much effort, we calculated the correlations and plotted them in heatmaps. 🤩 Each square in the grid shows the correlation between one pair of features. For example, there are no strong correlations between network protocols and cyberattack types.
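The correlation step itself is one pandas call. Here’s a minimal sketch with invented feature names (the real dataset has far more columns); the resulting matrix is what the heatmap visualises:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the correlation step. Feature and label names
# are made up; the real dataset has far more columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "flow_duration": rng.exponential(1.0, 5_000),
    "packet_count": rng.poisson(20, 5_000).astype(float),
    "is_ddos": rng.integers(0, 2, 5_000).astype(float),
})

corr = df.corr()  # pairwise Pearson correlations

# Each cell of `corr` is one square of the heatmap; a plotting library
# (e.g. seaborn's heatmap) turns the matrix into the coloured grid.
print(corr.round(2))
```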
Though it was easy to create these heatmaps, problems kept arising in their details:
At first, the largest correlation we found was a strong 0.74 between the network protocol used and DDoS attacks. Yet this feature seemed odd, since it just contained what looked like random integers. After way too much digging into preprocessing scripts, we saw that each network protocol had been assigned an arbitrary numerical code. So it was pure coincidence that larger codes correlated with DDoS attacks. The spurious correlation disappeared when we one-hot-encoded the network protocols 😭
Some of the correlations turned out to be artefacts of how the dataset was built. In the heatmap above, the GRE network protocol is highly correlated with Mirai attacks. Oversimplified, this indicates a specific malware is ‘mixing’ unsupported and supported network requests to bypass defences. Yet only a few strains of the Mirai malware use the GRE network protocol like this. The dataset probably only simulated those strains, so the correlation wouldn’t generalise to the real world.
The largest correlations were for “hand-engineered” features added by the dataset authors. Ex: the inter-arrival time (IAT) in the heatmap below is computed as the time difference between consecutive network packets. In other words, the raw network data is too sparse and low-level to be useful on its own: it needs either feature engineering or complex models capable of finding hidden relationships. 😕
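Both fixes mentioned above (one-hot encoding the protocols, and deriving the IAT feature) can be sketched on a toy table. The protocol names, timestamps, and column prefix here are all hypothetical:

```python
import pandas as pd

# Hypothetical sketch of the two fixes above:
# 1) one-hot encode protocol codes instead of correlating raw integer codes;
# 2) derive inter-arrival time (IAT) from raw packet timestamps.
df = pd.DataFrame({
    "protocol": ["TCP", "UDP", "GRE", "TCP", "ICMP"],   # made-up rows
    "timestamp": [0.00, 0.02, 0.05, 0.09, 0.14],        # seconds
})

# One-hot encoding: each protocol becomes its own indicator column, so no
# arbitrary integer ordering can leak into the correlations.
one_hot = pd.get_dummies(df["protocol"], prefix="proto")

# IAT: time difference between consecutive packets (first packet has none).
df["iat"] = df["timestamp"].diff()
```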
Due to the finicky nature of correlational analysis, we don’t trust this approach to select features. That leaves us with two options:
Use human expertise. Luckily, the team is co-led by a security analyst, so we could simply rely on her judgement about which features distinguish each cyberattack. Pros: quick. Cons: slow to scale / update as cyberattacks change.
Use evolutionary algorithms (smart versions of trial and error). Ex: we can copy the genetic approach of ‘mutations’ in the features used, followed by ‘survival of the fittest’. Pros: automated. Cons: computationally expensive.
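To give a flavour of option 2, here’s a minimal genetic-algorithm sketch for feature selection. This is not our implementation: the fitness function below is a stand-in that rewards a hidden “true” feature mask, whereas in practice it would be a model’s validation score on the chosen features (which is exactly why this approach is computationally expensive).

```python
import numpy as np

# Minimal genetic-algorithm sketch for feature selection. The fitness
# function is a stand-in; in practice it would train/score a model on
# the selected features, which makes each evaluation expensive.
rng = np.random.default_rng(7)
N_FEATURES = 20
TRUE_MASK = rng.integers(0, 2, N_FEATURES).astype(bool)  # hidden target

def fitness(mask):
    # Stand-in for "train a model on these features, return its score".
    return (mask == TRUE_MASK).mean()

def evolve(pop_size=30, generations=40, mutation_rate=0.05):
    # Each individual is a boolean mask: True = feature is used.
    pop = rng.integers(0, 2, (pop_size, N_FEATURES)).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Survival of the fittest: keep the top half unchanged...
        survivors = pop[np.argsort(scores)[-pop_size // 2:]]
        # ...then refill the population with mutated copies of survivors.
        children = survivors.copy()
        flips = rng.random(children.shape) < mutation_rate
        children ^= flips  # mutation: randomly toggle a few features
        pop = np.concatenate([survivors, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()]

best = evolve()
```

Keeping the unmutated survivors each generation (elitism) guarantees the best score never goes backwards, at the cost of slower exploration.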
There are a lot more details involved in both approaches, so stay tuned for some cool demos soon! 😉