Why bother with this algorithm?
Our team has an ongoing problem. We’re trying to classify cyberattacks in a dataset with 47.2 million CSV rows and 50+ features. Trying to work with a dataset this large is unecessarily long and expensive… and the end model wouldn’t work for the small smart devices we want to protect. 😑
The natural solution is to only pick the best features to train on: the ones that best ‘go along with’ cyberattacks. Unfortunately, quickly analysing correlations between features wasn’t giving us reliable results.
This is the predicament that led me to try the grey wolf optimiser algorithm, with the inspiration of M.D. Moizuddin and M.V. Jose, 2022. So TL;DR - this algorithm is going to help us ‘test-drive’ many features and find which ones work best. 🚗☁👍
But how does it work?
Let’s start with understanding how wolves hunt.
Wolves hunt in packs to find prey in a vast landscape. Specifically, wolves follow the leaders in a social hierarchy. The lead wolves (the ‘alpha’ wolf, ‘beta’ wolf, and ‘delta’ wolf) choose where the prey could be, while other wolves follow them.
The wolves start off exploring in many directions, but get closer together over time as they start getting close to their prey.
Once the wolf pack spots they prey, the wolves end the hunt by circling around the prey. Here’s what the process looks like visually:

Why would this help solve problems in general? We can use the encircling and following behaviour for any objective minimisation problem. That is, any problem where we change our inputs to get the smallest output possible. 📉
Let’s say you have two inputs x
and y
. Here’s a visual example of the grey wolf optimiser choosing different values for x,y
(represented by the different dots or ‘wolves’). Over time, the dots converge to the values of x,y
that get the smallest output (the green ‘alpha’, red ‘beta’, and blue ‘delta’ wolves).

How does this work in detail? Here are math details for those interested.
🎓 Le Math 🎓
A vector represents both the position of each wolf and the features (inputs) chosen. Each vector just has a list of 0s/1s that mean exclude/include each feature.
Let’s create a function that selects features from the dataset given the 0/1 vectors. Ex: Let’s say our features are “flavour”, “colour”, and “price.” If a wolf has the vector [1,0,1]
, the “flavour” and “price” features are chosen.
At first, each wolf has a random vector of 0s/1s. So each chooses random features from the dataset. The “alpha” wolf is the wolf whose features return the lowest output (“fitness function score”). 💪 The “beta” wolf and “delta” wolf are the ones with the second and third lowest output.
But what is the fitness function? Well, we’re selecting features to train some AI model. So let’s train a logistic regression classifier (SGD, 100 steps) for each wolf’s features. Then, we evaluate the final model error on a test dataset. This process doesn’t change for any wolf, so a lower final error means a wolf selected better features.
The fitness function could just be that error. Though we also care whether the model selected just a few features. It doesn’t help us if the model returns all 50 features in the dataset.🙄 So we’ll add both those goals to our function.
At this point, these equations just describe a ‘trial-and-error’ algorithm. We just randomly guess features and then check which guess are best. But how do we get the wolves to follow the leader and circle in on the best guesses?
First, recall that the alpha, beta, and delta wolves (solutions) are the three solutions which best minimise the fitness score. We want other wolves to follow these wolves. So for the ith
wolf, we’ll set its “position” (selected features) to the average of the lead wolves’ position at the last iteration.
Except this equation would send all non-lead wolves to the exact same solution. 😖 We need more ‘diversity’ / randomness so the wolves explore different solutions. Still, we do want wolves to circle around the same spot eventually. We need some parameter to encourage exploration at the start and exploitation at the end.
This is better, but wolves from all around the “landscape” (of features) will end up within a certain radius of each other, no matter if they were close to that spot or far. Since teleportation is frowned upon in nature, we’ll make one last adjustment. 🙃
Now, a wolf that’s very far away from a lead wolf won’t move right next to the lead wolf. Except near the end of the hunt, where a=0
.
That finally finishes all the equations! 😤
Key ideas: start with random features. On every iteration, the best ones are the alpha, beta, delta. The other wolves follow the average features of the alpha, beta, and delta. Still, some random noise is added and a wolf’s motion depends on its distance from the alpha, beta, and delta. After many iterations, a parameter reduces the magnitude of the random noise so the wolves will circle together.
So did it work in practice?
No 😁
Problem 1: Dataset size
Consider the function that returns the classification error (0%-100%). Since it has to train a model on a dataset of 2.3 million rows, it takes forever to run. Keep in mind, the 2.3 million row dataset was the reduced dataset (5% of the original)!
So I went back to the drawing board and adjusted our data downsampling script. Instead of a fixed 5% of the original data, I created many downsampled datasets. The one with reasonable performance had 0.1% of the original data. Though this almost certainly creates a biased dataset, at least the algorithm was working as a proof of concept, right? 😊🙂🤨😑😭
Problem 2: Hyperparameter tuning
The algorithm kept picking one feature out of 46. It’s hard to believe it predicts cyberattacks. The problem is that I’d set the hyperparameter tau to 0.5, creating an equal reward for reducing feature size and improving prediction accuracy.
So I set tau to 0.9, 0.95, 0.99, and eventually 1 (meaning the algorithm should minimise the error alone). Yet the algorithm still kept picking one feature. The only solution to avoid this was reducing the maximum iterations. Read: the algorithm wasn’t just unhelpful, it was actively making things worse! 😥
Problem 3: Random Noise
Here’s a graph to debug the problem.
As you can see, the percent error is fluctuating randomly as different wolves try their solution. This is because I initialised the test model with random parameters on every iteration (since the model’s input dimension changes based on selected features).
How can we fix this? Well, we don’t care what the final model error is after just a few updates. We care how ‘optimisable’ the model is with the selected features. So why not measure improvement instead of final performance? 🤔
Problem 4: Learning rate
Even with the above change, I still saw random fluctuations in the error for different solutions. So I decided to look at the test model. If it wasn’t training properly, then the effect of better vs. worse subsets would be negligible.
Sure enough, I found my learning rate was too low when I collected data on the loss over time. This meant any training run’s loss didn’t change much. What determined the final loss was roughly the starting loss. And what determined the starting loss was random parameter initialisation 🙄

Problem 5: Regularisation Parameters
Even after increasing the learning rate, I still saw fluctuations and low feature count.
Recall that the fitness function for the algorithm prioritises both reduced dataset size and the error of the selected features based on the parameter tau:
Except I set tau to 1, so only the error should be minimised? 😵 It seemed I needed to reinforce the importance of the error over the feature size even more. I might even need to reward a higher feature size, not a lower one! To achieve this, I replaced tau with more hyperparameters in the fitness function.
Where all three constants can be adjusted to emphasise the importance of keeping the error low, the feature size low, or the feature size high.
I implemented the above with the feature size constants set to 0 and the error constant set to 10. Despite 10x as much importance for finding features that reduce the error, the algorithm’s error was still mostly noise!
Problem 6: Realising the real issue
After wondering why higher error would follow lower error in the graph above, I thought of another bio-inspired algorithm: the genetic algorithm.
In that algorithm, random mutations cause lost solutions that were previously good. These mutations can be reduced by lowering a hyperparameter called the mutation rate. Drawing inspiration from that, I needed to stop the best solutions (the alpha, beta, and delta) from being lost due to random vectors in the update equation.
This is especially important since the wolf positions are just vectors of 0s/1s. Either an important feature is included or it isn’t. So it’s impossible to have a ‘minor’ change in the fitness of a wolf’s position. That’s what was leading to the spiky changes! 🧠
With this inspiration, I updated my code to keep the save wolf positions (no changes of any sort to them, as if reducing a ‘mutation rate’). And… voila?
At first, the graph still seems ‘spiky.’ Though if you look closer at the y-axis, the changes are within 1 point of each other. Previously, they were 2-3x larger. Now, the fluctuations exist since the same solution tested on different random parameter initialisations gives slightly (but not extremely) different losses.
If I don’t run the algorithm for too many iterations (>10-15), I can get some reasonable features selected.
That took soooo long 😦🥱🙄😆
Next steps
The modified algorithm I’ve built is a bad genetic algorithm. I say this since it tries to achieve a similar result, but with ‘hacks’ instead of refined hyperparameters. Generally, the grey wolf optimiser seems more suitable to me for continuious search spaces than highly discrete search spaces (0s to 1s, not 0s or 1s).
I’ll test my hypothesis by implementing a genetic optimisation algorithm for the same problem. If you’d like to play around with the grey wolf optimiser in the meantime (hopefully for continuous search spaces 😁), here’s my opensource implementation.