Our Team’s Problem
Over the past few posts, we’ve been trying to narrow down which features to use from our GIANT 15 GB dataset. The genetic algorithm is a bio-inspired AI algorithm that finally helped us solve the puzzle!
We’ll keep this post shorter and lighter and dive right into the details 🙂
How the Genetic Algorithm Works
The algorithm is inspired by evolution: lots of individuals have different genes, some are fitter than others, and the fittest reproduce. Over time, the descendants of fitter individuals stick around, so the entire population gets fitter.
How does this work for our context?
We want to find a set of features X that gives us high accuracy when predicting whether a network request is a cyberattack or normal traffic. We'll start with random guesses about which features will help: X1, X2, X3, and so on.
Each guess represents an individual, and each feature in a guess represents a gene. Let's say the features are "packet size in bytes", "request duration", and "HTTPS use." All individuals (guesses) then have 3 genes, one per feature. Each gene can have a value of 0 (exclude the feature) or 1 (include the feature).
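To make the encoding concrete, here's a minimal sketch of what a random starting population looks like. The sizes and seed are illustrative, not the post's actual settings:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_FEATURES = 3   # e.g. packet size, request duration, HTTPS use
POP_SIZE = 8     # small for illustration; the post uses 1000 individuals

# Each row is an individual (a guess). Each column is a gene:
# 1 = include the feature, 0 = exclude it.
population = rng.integers(0, 2, size=(POP_SIZE, N_FEATURES))
print(population.shape)  # (8, 3)
```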
Next, we need to determine the fitness of each individual. Mathematically, it's just a magic function f(X). In practice, the function takes an individual as input, selects the features marked with 1s, and trains a small neural network on those features for 100 iterations. The fitness is how 'optimisable' the neural network was with the selected features (the drop from the initial loss to the final loss).
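A fitness function of this shape can be sketched as follows. This is a simplified stand-in, not the post's actual code: it uses random toy data instead of the real traffic dataset, and a logistic regression trained by gradient descent in place of the small neural network, but the idea is the same (select the 1-marked features, train briefly, return initial loss minus final loss):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy stand-in data: in the real project this would be the
# network-traffic dataset; here it's random, just to run the sketch.
X_all = rng.normal(size=(200, 3))                   # 200 requests, 3 features
y = (X_all[:, 0] + X_all[:, 1] > 0).astype(float)   # fake attack labels

def fitness(individual, n_steps=100, lr=0.1):
    """How 'optimisable' a model is on the selected features:
    initial loss minus final loss after n_steps gradient steps."""
    cols = np.flatnonzero(individual)
    if cols.size == 0:
        return 0.0  # no features selected -> nothing to train
    X = X_all[:, cols]
    w = np.zeros(cols.size)

    def loss(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        eps = 1e-9  # avoid log(0)
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    initial = loss(w)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)  # one gradient step
    return initial - loss(w)

print(fitness(np.array([1, 1, 0])))  # informative genes -> positive fitness
```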

Now that we have different individuals and a way to check how fit each one is, let the hunger games begin! 😈
We'll choose the top 50% of individuals (solutions) to be parents; the rest die.
The parents reproduce and swap genomes, like chromosomes crossing over in cells.
The new children and the parents all pick up mutations to their genes (occasionally, 0s and 1s flip).
Now the population of individuals (solutions) should contain more individuals that resemble the fittest solutions, while mutations keep some diversity. Just keep repeating this process for, say, 100 iterations with 1000 individuals and 46 genes. That's the genetic algorithm! At the end, we get back the best set of features to use.
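One round of the loop above (select the top 50%, cross over, mutate) can be sketched like this. The population size, mutation rate, and toy fitness are placeholders, not the post's real settings:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
POP_SIZE, N_GENES = 10, 46
MUTATION_RATE = 0.02  # illustrative value

population = rng.integers(0, 2, size=(POP_SIZE, N_GENES))

def one_generation(population, fitnesses):
    """One round of selection, crossover, and mutation."""
    # Selection: keep the top 50% as parents; the rest die.
    order = np.argsort(fitnesses)[::-1]
    parents = population[order[: len(population) // 2]]

    # Crossover: each child takes the first part of one parent's
    # genome and the rest of another's (single-point crossover).
    children = []
    for _ in range(len(population) - len(parents)):
        a, b = parents[rng.choice(len(parents), size=2, replace=False)]
        cut = rng.integers(1, population.shape[1])
        children.append(np.concatenate([a[:cut], b[cut:]]))

    new_pop = np.vstack([parents, children])
    # Mutation: occasionally flip 0s and 1s.
    flips = rng.random(new_pop.shape) < MUTATION_RATE
    return np.where(flips, 1 - new_pop, new_pop)

# Toy fitness (count of included features), standing in for the real one.
fitnesses = population.sum(axis=1)
population = one_generation(population, fitnesses)
print(population.shape)  # still (10, 46)
```

Repeating `one_generation` for 100 iterations with the real fitness function gives the full algorithm.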
Interpreting the Results
The algorithm picked 21 features out of the 46. Shreya, a grad student on our team familiar with cybersecurity, confirmed that the features were relevant to the types of attacks we wanted to detect. On her end, she also ran a feature importance analysis by examining a logistic regression model's weights. Apart from 3, the same features were identified by both methods! 🎉
With all that done, the genetic algorithm seemed too easy compared to the pains of implementing the grey wolf optimiser :O So I decided to have some fun with matplotlib and networkx to make some cute visuals 😁 I’ll leave you with those to end the post.