Evolution Strategies (ES) isn't new. It's been around since the 1960s - a way to optimise systems by mutating parameters, testing variants, and keeping what works. Black-box optimisation without gradients. For decades, it lived in the background of AI research, useful in niche cases but overshadowed by gradient-based methods. Now it's back, and it's competitive with reinforcement learning for fine-tuning large language models.
The reason? ES doesn't need perfect credit assignment. Reinforcement learning struggles when it's hard to tell which action caused which outcome - especially in long sequences where cause and effect are distant. ES sidesteps that problem entirely by treating the model as a black box. Perturb the weights, test performance, keep the good mutations. No gradients required.
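To make that loop concrete, here's a minimal sketch of one ES update in Python, run on a toy objective rather than an LLM. The names and hyperparameters (es_step, pop_size, sigma, lr) are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=64, sigma=0.05, lr=0.05, rng=None):
    # One basic ES update: sample a population of Gaussian mutations around the
    # current weights, score each variant with the black-box reward, and move
    # the weights toward the mutations that scored above average.
    # reward_fn is never differentiated.
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Subtract the mean reward as a baseline so only relative quality matters.
    advantages = rewards - rewards.mean()
    update = advantages @ noise / (pop_size * sigma)
    return theta + lr * update

# Toy usage: pull theta toward a target using only scalar scores.
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
for _ in range(200):
    theta = es_step(theta, lambda w: -np.sum((w - target) ** 2))
```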
Why This Matters for Post-Training
Fine-tuning LLMs after pre-training is messy. You're often optimising for fuzzy objectives - things like "generate more helpful responses" or "follow instructions better". These are hard to capture in a clean loss function. Reinforcement learning from human feedback (RLHF) is the standard approach, but it's complicated. You need reward models, policy gradients, and careful tuning to avoid instability.
Evolution Strategies offers a simpler path. EGGROLL, a recent implementation, makes ES GPU-efficient by using low-rank perturbations. Instead of mutating millions of parameters individually, it perturbs a small subspace and projects those changes across the model. This keeps memory overhead low and makes ES viable at the scale of modern LLMs.
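A rough sketch of that idea is below; the function name, the rank-4 default, and the 1/sqrt(rank) scaling are illustrative assumptions, not EGGROLL's exact formulation.

```python
import numpy as np

def low_rank_perturb(weight, rank=4, sigma=0.02, rng=None):
    # Perturb a weight matrix through the product of two thin random factors
    # instead of a full noise matrix with the same shape as the weights.
    rng = np.random.default_rng() if rng is None else rng
    out_dim, in_dim = weight.shape
    A = rng.standard_normal((out_dim, rank))   # rank * out_dim numbers
    B = rng.standard_normal((rank, in_dim))    # rank * in_dim numbers
    # Divide by sqrt(rank) so the perturbation's scale stays roughly constant
    # as the rank changes.
    return weight + sigma * (A @ B) / np.sqrt(rank)

# For a 4096 x 4096 weight matrix, a full-rank perturbation needs ~16.8M
# numbers per population member; the rank-4 version above needs ~33K.
```

In the sketch, only the two thin factors differ between population members, which is what keeps a large population cheap to store and evaluate in parallel.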
The trade-off is that ES is sample-inefficient. You need to test many variants to find good ones. But in post-training scenarios - where you're fine-tuning on specific tasks with clear evaluation metrics - that's often acceptable. You're not training from scratch. You're adjusting a pre-trained model, and ES can explore that adjustment space effectively without needing the infrastructure complexity of RLHF.
When to Use ES Over RL
Evolution Strategies works best when:
Credit assignment is hard. If your task involves long sequences where it's unclear which part of the output caused success or failure, gradients become noisy. ES doesn't care - it evaluates the whole output and adjusts accordingly.
Your reward function is simple but non-differentiable. Maybe you're optimising for human preference scores, or task completion rates, or some other metric that doesn't have clean gradients. ES treats the reward as a black box and optimises directly (see the sketch after this list).
You want to avoid RL infrastructure. RLHF requires reward models, policy networks, value functions, and careful hyperparameter tuning. ES is conceptually simpler - generate variants, test them, keep the best ones. Fewer moving parts.
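As a sketch of the second point, the reward below is a task completion rate over a tiny eval set - a quantity with no useful gradient. Everything here is hypothetical (generate_answer is a stand-in for however you decode from the perturbed model), but the shape is the point: ES only ever asks for the final scalar.

```python
# Hypothetical eval set and model call; generate_answer stands in for decoding
# from the model whose weights ES has perturbed.
eval_set = [
    ("Add 2 and 2.", "4"),
    ("What is the capital of France?", "Paris"),
    ("Spell 'cat' backwards.", "tac"),
]

def generate_answer(theta, prompt):
    # Stub so the sketch runs; a real version would sample from the model.
    return "placeholder"

def completion_rate(theta):
    # Non-differentiable, black-box score: the fraction of prompts answered
    # exactly right. Useless for backprop, fine as an ES fitness signal.
    correct = sum(
        generate_answer(theta, prompt).strip() == answer
        for prompt, answer in eval_set
    )
    return correct / len(eval_set)
```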
The downside is sample inefficiency. RL can learn from fewer examples when gradients are informative. ES needs more evaluations because it's exploring blindly. But for tasks where evaluation is cheap and gradients are messy, that trade-off works.
What This Unlocks
EGGROLL's low-rank perturbation approach makes ES practical for large models. Previously, sampling and applying a full perturbation over every parameter, for every member of the population, was prohibitively expensive in both memory and compute. By constraining mutations to a low-dimensional subspace, EGGROLL keeps costs manageable while still exploring effectively.
This opens up post-training workflows that don't depend on RLHF. You can fine-tune models for specific tasks using simpler infrastructure. You can optimise for objectives that are hard to express as differentiable loss functions. And you can do it without needing deep RL expertise on your team.
Evolution Strategies won't replace gradient-based methods entirely. But for a specific class of problems - post-training tasks with fuzzy objectives and hard credit assignment - it's proving competitive. And the simplicity matters. Less infrastructure complexity means more teams can experiment with fine-tuning without needing RL specialists.
Old methods don't die. They just wait for the right moment to be useful again.