After a week in which the US Treasury and the Fed summoned bank CEOs over AI-discovered vulnerabilities, a small agency in Wellingborough explains why they built their own AI review system. Not because it was trendy. Because single-model review nearly cost them a client.
The Meeting Nobody Expected
This week, US Treasury Secretary Bessent and Fed Chair Powell summoned the CEOs of every major US bank to an emergency meeting. Anthropic's new Mythos model had been finding decades-old vulnerabilities in systems everyone considered solid. Thousands of zero-day flaws. Some older than my coding career.
I run a small agency. We don't have bank-grade security teams. But we've been solving a version of this problem for months, and the answer turned out to be surprisingly simple in principle.
One Model Isn't Enough
Here's what happened to us. We asked an AI model to review a piece of code - a basic email validation regex. It scored 24 out of 25. Ship it, right?
We ran the same code past a second model from a different vendor. It scored 6 out of 25 and flagged a vulnerability the first model missed entirely.
One judge would have let broken code ship. Two judges caught it. We now use three.
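Here's roughly what that pattern looks like in code. This is an illustrative sketch, not our production system: the Reviewer type, panel_review, and the spread threshold are names I've invented for this post, and the stand-in judges just replay our regex incident.

```python
from dataclasses import dataclass
from typing import Callable

# A reviewer takes source code and returns a score out of 25.
# In production each one wraps a different vendor's API; the
# stand-ins below let the sketch run on its own.
Reviewer = Callable[[str], int]

@dataclass
class Verdict:
    scores: list[int]
    agreed: bool

def panel_review(code: str, reviewers: list[Reviewer], max_spread: int = 5) -> Verdict:
    """Score the code with every judge and flag disagreement.

    A wide spread between scores is the signal worth acting on:
    it means at least one model sees something the others don't.
    """
    scores = [review(code) for review in reviewers]
    agreed = max(scores) - min(scores) <= max_spread
    return Verdict(scores=scores, agreed=agreed)

# Stand-in judges replaying our incident: 24/25 versus 6/25.
vendor_a = lambda code: 24
vendor_b = lambda code: 6

verdict = panel_review("EMAIL_RE = r'^\\S+@\\S+$'", [vendor_a, vendor_b])
if not verdict.agreed:
    print(f"Judges disagree {verdict.scores} - escalate to a human")
```

The interesting output isn't the average score. It's the spread.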
That experience changed how we build everything. If you're relying on a single AI model to review its own output, or to review code written by the same vendor's model, you've got a blind spot. Every model has biases. The only way to find them is to use models that don't share them.
What We Learned From Building It
I'm not going to lay out the full architecture here - we're productising it. But there are a few things we learned that are worth sharing because they apply to anyone building with AI.
Trust nothing by default. Treat AI output like any other untrusted input. It needs review, logging, and audit trails. Not because AI is bad. Because anything that writes code at speed needs a second pair of eyes.
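At its simplest, that can be a few lines. Everything below is illustrative - the file path, the field names, the hashing scheme - but it shows the shape: record what the model said, with a stable identity, before anything downstream acts on it.

```python
import hashlib
import json
import time
from pathlib import Path

# Illustrative audit log: append-only JSONL, one record per AI output.
AUDIT_LOG = Path("audit/ai_output.jsonl")

def record_ai_output(model: str, prompt: str, output: str) -> str:
    """Log an AI response before anything downstream touches it.

    The hash gives each output a stable identity, so a review
    verdict or a shipped artefact can be traced back to exactly
    what the model produced.
    """
    digest = hashlib.sha256(output.encode()).hexdigest()
    entry = {
        "ts": time.time(),
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": digest,
        "output": output,
    }
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return digest
```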
Vendor diversity matters. Not for the sake of it. Because different training data, different architectures, and different fine-tuning create genuinely different perspectives. When your reviewers disagree, that's where the real value is.
Memory has to be yours. If your AI system's knowledge lives inside a vendor's persistent memory, you're one model swap away from losing it. We keep ours in files we control. Simple, portable, survives anything.
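To show how little is needed, here's a sketch of the idea, with hypothetical names throughout: a topic is a plain JSON file on disk, and nothing about it knows which vendor produced the notes.

```python
import json
from pathlib import Path

# Illustrative layout: one JSON file per topic, in a directory
# we own. No vendor API required to read or write it.
MEMORY_DIR = Path("memory")

def remember(topic: str, note: str) -> None:
    """Append a note under a topic (topics assumed to be simple slugs)."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{topic}.json"
    notes = json.loads(path.read_text()) if path.exists() else []
    notes.append(note)
    path.write_text(json.dumps(notes, indent=2))

def recall(topic: str) -> list[str]:
    """Read back everything under a topic, on any machine, with any model."""
    path = MEMORY_DIR / f"{topic}.json"
    return json.loads(path.read_text()) if path.exists() else []
```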
Engineering discipline doesn't change. Review before ship. Log everything. Deploy through CI, never locally. Recover gracefully. The fundamentals of good engineering are the fundamentals of safe AI use. There's no shortcut.
What This Means For The Rest Of Us
The banks are scrambling because they're finding flaws in systems that have been running for decades. Systems that passed human review. Systems that passed automated testing. Systems that were considered battle-tested.
If you're a small team building with AI right now, you're in a strange position. You don't have the legacy burden the banks have. But you also don't have their security teams or their budgets.
What you can do is build with the right assumptions from the start. Assume every model has blind spots. Assume vendor lock-in will hurt you. Assume you need audit trails for everything that ships.
These aren't expensive assumptions. They're architectural decisions that cost nothing if you make them early.
What We're Building Next
We're productising what we've built into something called Marbl Delphi. The name comes from the ancient Greek oracle - people travelled to Delphi for wisdom on decisions that mattered.
Our version has multiple voices instead of one. You come to it for answers, not lint reports.
"Is this architecture decision sound?" "What am I missing?" "Is this safe to ship?"
The waitlist opens this week at delphi.marbl.codes.
If you're a founder or technical lead building with AI and you want to ship responsibly without slowing down, have a look.
And if you want to understand the security landscape that's making this urgent, today's Luma digest covers it: luma.marbl.codes
The Question I'm Still Sitting With
Here's what I keep thinking about. If AI can find decades-old vulnerabilities in battle-tested banking systems, what's it finding in the code we're shipping today?
Not just the obvious stuff. The subtle things. The assumptions we're making because we've always made them. The patterns we think are safe because they've worked for years.
I don't have a complete answer yet. But I know that relying on a single perspective - human or AI - isn't going to cut it.