Here's a finding that flips conventional wisdom on its head. When training AI agents to use tools, the variety of tasks matters far more than the volume of training data. A new study tested this across 373 different tools and found that diverse training examples outperformed massive datasets - even when using only a quarter of the data.
This is DIVE (Diversity in Agentic Task Synthesis), and it challenges the "more data equals better performance" assumption that has driven much of recent AI development.
The Experiment
The researchers wanted to understand what makes AI agents better at generalising their tool use to new situations. They trained models on tasks involving 373 different tools - things like calendar management, file operations, web searches - and then tested how well they handled completely new scenarios they'd never seen before.
The key variable? Not how many training examples, but how varied those examples were. They systematically changed three things: the types of tasks, the combinations of tools used together, and the patterns of interaction between tools.
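To make the three dimensions concrete, here is a minimal sketch of how such variation might be generated systematically. The dimension values, field names, and sampling scheme are illustrative assumptions, not details from the paper:

```python
import itertools
import random

# Hypothetical values for the three dimensions the researchers varied:
# task types, tool combinations, and interaction patterns. These specific
# entries are examples for illustration, not the study's actual taxonomy.
TASK_TYPES = ["calendar_management", "file_operations", "web_search"]
TOOL_COMBOS = [("calendar", "email"), ("files", "search"),
               ("calendar", "files", "email")]
INTERACTION_PATTERNS = ["sequential", "branching", "iterative_refinement"]

def synthesize_task_specs(n_tasks, seed=0):
    """Sample task specifications that cover the full cross-product of
    the variation dimensions before repeating any combination."""
    rng = random.Random(seed)
    grid = list(itertools.product(TASK_TYPES, TOOL_COMBOS,
                                  INTERACTION_PATTERNS))
    rng.shuffle(grid)  # avoid a fixed ordering bias across dimensions
    specs = []
    for i in range(n_tasks):
        task_type, tools, pattern = grid[i % len(grid)]
        specs.append({"task_type": task_type,
                      "tools": tools,
                      "pattern": pattern})
    return specs

specs = synthesize_task_specs(12)
```

The point of the cross-product is that every training example differs from its neighbours along at least one axis, so no single task shape dominates the dataset.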
The results were striking. Models trained on diverse examples showed improvements of 22 points or more on out-of-distribution benchmarks - tests designed to measure performance on genuinely new situations. Meanwhile, simply adding more training data of the same type showed diminishing returns.
What This Actually Means
Think of it like learning to cook. You could practice making the same pasta dish 1,000 times, or you could make 250 different dishes covering various techniques, ingredients, and cuisines. When someone asks you to improvise with what's in the fridge, which training approach serves you better?
The same principle applies to AI agents. An agent that has only ever seen calendar bookings done one way will struggle when the context changes slightly. But an agent trained on varied calendar scenarios - different time zones, conflicting appointments, multi-person scheduling, urgent changes - develops genuine understanding of what calendars are and how they work.
This matters because most real-world tool use isn't about following a script. It's about adapting to context, handling edge cases, and combining tools in ways that make sense for the specific situation.
The Practical Implications
For anyone building AI systems, this research suggests a different approach to training data. Instead of gathering massive amounts of similar examples, focus on systematic variation. Cover different scenarios, combine tools in unexpected ways, and expose the model to diverse patterns of interaction.
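One way to operationalise "systematic variation over volume" is a coverage-first curation pass that admits an example only if it introduces a combination the dataset hasn't seen yet. This is a simple sketch under assumed field names, not the paper's actual method:

```python
def curate_for_diversity(candidates, budget):
    """Greedily keep examples that introduce an unseen
    (task_type, tools, pattern) combination, then fill any remaining
    budget with duplicates. Field names are illustrative assumptions."""
    seen = set()
    chosen, leftovers = [], []
    for example in candidates:
        key = (example["task_type"], example["tools"], example["pattern"])
        if key not in seen and len(chosen) < budget:
            seen.add(key)        # first time we see this combination
            chosen.append(example)
        else:
            leftovers.append(example)  # duplicate or over budget
    # Only if unique combinations run out do repeats make the cut.
    chosen.extend(leftovers[: budget - len(chosen)])
    return chosen
```

A pass like this is why a smaller, curated set can match a much larger one: the thousandth near-duplicate of a task adds almost nothing, while the first example of a new combination adds a whole region of coverage.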
The efficiency gain is significant. Getting better results with a quarter of the data means faster training, lower computational costs, and more sustainable development. It also suggests that smaller teams with thoughtfully curated datasets might compete effectively against those with simply more raw data.
There's a deeper insight here too. The fact that diversity beats volume suggests these models are learning something closer to actual understanding rather than pattern memorisation. They're building internal representations of how tools work and relate to each other, not just mapping inputs to outputs.
The Bigger Picture
This research arrives at an interesting moment. As AI capabilities expand and more systems gain access to multiple tools and APIs, generalisation becomes crucial. An AI assistant that can only handle the exact scenarios it was trained on isn't particularly useful. You need systems that can reason about new situations and adapt their tool use accordingly.
The DIVE approach offers a path toward that adaptability. By deliberately varying the training examples across multiple dimensions - task types, tool combinations, interaction patterns - you create agents that develop more robust and transferable skills.
For the field as a whole, it's a reminder that throwing more compute and data at problems isn't always the answer. Sometimes the solution is about being more thoughtful with the data you already have. Quality and diversity over raw quantity.
That's good news for sustainable AI development. It suggests we can build more capable systems without always scaling up to the next order of magnitude in resources. The intelligence comes from the variety of experience, not just the volume.