We don’t like pointless suspense, so here’s the TL;DR: the dark secret is that AI in cybersecurity, by default, doesn’t work. To explain why, we’ll summarise the award-winning paper “Dos and Don'ts of Machine Learning in Computer Security.”

Why is the paper “award-winning”?
For decades, AI developers have been trying to improve cybersecurity tools. Spam filters, intrusion detection systems, anomaly detectors, malware classifiers, … every cybersecurity tool has an AI-enhanced equivalent by now.
At least, it seems that way on paper.
The good thing about AI models is that they can quickly learn with minimal human input. The catch is that it’s very easy for AI models to “memorise” instead of “learn.” Over the past decade, flashy results from academia keep turning out to be unreliable when tested in the real world. 😒 This paper examines why.
The underlying issue is that AI models are generally bad at making predictions on data that looks different from what they were trained on. AI developers describe this with fancy words like ‘robustness’ or ‘data distribution shift.’
In most industries, this problem can be small enough to not alter the entire AI development process. In cybersecurity, however, millions of hackers make a living from fooling cyber defences (including AI models). There are huge incentives to fool AI-enhanced malware classifiers by creating new/unseen malware.
This is why robustness issues matter much more in cybersecurity applications of AI than in other industries. Anything short of best practice results in unreliable technology that works on paper but not in real life. 😮
That’s where this award-winning paper comes in. It sets out to break this broken development cycle. Specifically, the paper reviews 30 of the top AI x cyber papers from the last decade and identifies best practices for avoiding their errors.
The 5 Most Common Errors
1/5 Inappropriate Comparisons
This issue is about reporting fabulous results without comparing your work against prior work or simpler baselines. It might sound fancy to create an ‘autoencoder ensemble for network intrusion detection’… until this paper’s authors got better performance by running a boxplot test that a high school student would know about. 😭

Easy takeaway: start with simple solutions. Added complexity must be justified by added performance.
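To make that concrete, here’s a minimal sketch (synthetic traffic numbers, NumPy assumed) of the kind of boxplot-style baseline the authors mean; a complex detector should have to beat something like this before its complexity is justified.

```python
# A boxplot (interquartile-range) anomaly detector: the sort of simple baseline
# that outperformed a far fancier autoencoder ensemble in the paper's example.
# The traffic numbers below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
benign = rng.normal(loc=100, scale=10, size=1000)   # e.g. packets/second on a normal day
attack = rng.normal(loc=200, scale=10, size=20)     # traffic during a flooding attack

q1, q3 = np.percentile(benign, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)                  # the classic boxplot "whisker" rule

def is_anomaly(x: float) -> bool:
    return x > upper_fence

print("false positive rate:", np.mean([is_anomaly(x) for x in benign]))
print("detection rate:     ", np.mean([is_anomaly(x) for x in attack]))
```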
2/5 Inappropriate Metrics
Let’s say your business gets 1000 emails a day, of which 10 are spam.
If your model predicts every email is safe, you’ve labelled 990/1000 = 99% of emails correctly! 🎉 Yet your model is useless.
If your model predicts every email is spam, you’ve caught 10/10 = 100% of the spam emails! 🎉 Yet your model is useless.
Key idea: it’s not good enough to just report the accuracy of a model. Precision, recall, true positive rate, false positive rate, ROC curves, and the number of examples per class in the dataset should all be reported.
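Here’s a quick sketch of the 1000-email example above, assuming scikit-learn is installed; the numbers match the scenario and the “model” is deliberately useless.

```python
# The 1000-email scenario: 990 legitimate, 10 spam, and a "model" that
# labels everything as safe. Accuracy looks great; the other metrics don't.
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

y_true = [0] * 990 + [1] * 10   # 0 = legitimate, 1 = spam
y_pred = [0] * 1000             # the useless model: everything is "safe"

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99 -- sounds impressive
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0  -- catches zero spam
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0  -- flags zero spam
print(confusion_matrix(y_true, y_pred))  # the class counts make the imbalance obvious
```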
3/5 Spurious Correlations
This is when the model isn’t really “learning” about the data. It’s just “memorising” coincidental IP addresses, data sources, etc. that correlate well with the prediction.
For example, this study uses AI to identify vulnerable code. It just so happened that the vulnerable code examples in its dataset often used arrays with 16+ items, so the model memorised this to get good performance on paper. Still, memorising “arrays over length 16 are vulnerable” isn’t helpful in the real world. 😁
Detecting spurious correlations is hard. Here’s an intro to the field of Explainable AI, which checks how models make decisions. The general question to ask is: “Are there any coincidental features in this dataset that happen to correlate with the labels?” Any such features have to be eliminated before the model’s performance can be trusted.
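As an illustration, here’s a hedged sketch of one such check, permutation importance on synthetic data (scikit-learn assumed): if a feature the model shouldn’t care about, like which feed a sample came from, dominates the importances, that’s a red flag for a spurious correlation.

```python
# Checking for spurious correlations with permutation importance.
# The data is synthetic: the labels accidentally track the data source
# rather than the signal we actually care about.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
real_signal = rng.normal(size=n)           # the feature we want the model to use
source_id = rng.integers(0, 2, size=n)     # coincidental feature (e.g. which feed the sample came from)
y = (source_id == 1).astype(int)           # labels correlate with the source, not the signal

X = np.column_stack([real_signal, source_id])
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["real_signal", "source_id"], result.importances_mean):
    print(f"{name:12s} importance: {imp:.3f}")   # source_id dominates -- a red flag
```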
4/5 Sampling Bias
This is when the dataset used to train the model doesn’t match the data distribution seen in real life. The model then performs well when tested in the lab but poorly in the real world.
Often, this problem occurs because researchers, unlike companies, have trouble getting data from real devices in use. 😢 So researchers mix whatever data they can find from different sources and end up with a biased sample.
For example, this study detects malware in Android apps from Google Play and Chinese app stores. Since the different sources had different likelihoods of an app being flagged by an antivirus, mixing these data sources led to a spurious correlation. The model could simply predict “app from a Chinese app store = malware” and usually be right, even though it isn’t examining the apps in the way we want.

A potential solution is to use data augmentation and transfer learning when challenged with small datasets. This avoids introducing human-made biases when mixing datasets.
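Here’s a small sketch (pandas assumed, malware rates made up) of a sanity check worth running before mixing data sources: compare the label base rate per source.

```python
# If the malware rate differs wildly per source, a model can "cheat" by
# predicting the source instead of analysing the app. All numbers are invented.
import pandas as pd

apps = pd.DataFrame({
    "source": ["google_play"] * 1000 + ["third_party_market"] * 1000,
    "is_malware": ([0] * 980 + [1] * 20        # 2% malware in Google Play (made-up)
                   + [0] * 700 + [1] * 300),   # 30% malware in the other market (made-up)
})

print(apps.groupby("source")["is_malware"].mean())
# If the rates look this different, rebalance per source or at least evaluate
# the model separately on each source to check it isn't just learning
# "which store did this come from?"
```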
5/5 Mislabelling
This is another consequence of how hard it is for researchers to find data. Although large quantities of data are available from public services like VirusTotal, researchers can’t verify whether every example is labelled properly. Here’s a human analogy showing why mislabelled examples will confuse AI models. 😵
This study recommends that researchers manually verify a random fraction of their dataset for quality control. It also recommends not removing data with uncertain labels, because doing so causes sampling bias.
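A minimal sketch of that spot-check idea, with a made-up dataset and an arbitrary 5% review fraction (pandas assumed):

```python
# Pull a random fraction of the dataset for manual label review instead of
# trusting noisy labels blindly. The tiny dataset below is invented.
import pandas as pd

df = pd.DataFrame({
    "sample_hash": [f"hash_{i:04d}" for i in range(1000)],
    "label": ["malware" if i % 10 == 0 else "benign" for i in range(1000)],
})

review_batch = df.sample(frac=0.05, random_state=42)          # 5% spot-check; the fraction is a judgment call
review_batch.to_csv("manual_review_queue.csv", index=False)   # hand this file to a human analyst
print(f"Queued {len(review_batch)} of {len(df)} samples for manual verification")
```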
For the sake of brevity, we’ll stop listing errors there (but here are the full paper and our notes). The key point is: it’s difficult to make AI models that actually do good rather than just seem good. Lots of subtle errors can render a model useless, even when they’re made by well-meaning researchers facing tough constraints.
We want to keep drawing attention to the best practices that this much-needed paper highlights! We also want to explain why we’re so particular about tiny methodological details on our team. 😄
P.S. If you want to get summaries of more valuable papers in cybersecurity and AI :-)