PayPal just deployed an optimisation that makes their language models up to 33% faster without buying new GPUs. No architectural overhaul. No model retraining. Just a smarter way to generate tokens.
The technique is called speculative decoding - specifically, a variant called EAGLE3 paired with NVIDIA's Nemotron models. Here's what it does: instead of producing one token per forward pass of the full model, a smaller, faster draft model proposes several tokens ahead, and the full model verifies them all in a single parallel pass. When the draft is good, you skip several expensive generation steps. When it's wrong, you've lost almost nothing - the verification pass still yields at least one correct token.
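The core loop can be sketched in a few lines. This is a minimal greedy-acceptance sketch with toy stand-in models, not PayPal's or EAGLE3's actual implementation; `draft_model` and `target_model` are hypothetical functions mapping a token context to the next token.

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_tokens=16):
    """Greedy speculative decoding sketch: draft k tokens cheaply,
    verify them against the target model, keep the accepted prefix."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. The small model drafts k tokens ahead (cheap, sequential).
        ctx = tokens[:]
        draft = []
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. The full model checks every drafted position; in a real system
        #    this is one batched forward pass, not a Python loop.
        accepted = 0
        for i, t in enumerate(draft):
            if target_model(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        # 3. Whatever happens, the target model contributes one token,
        #    so a bad draft costs little and progress is guaranteed.
        tokens.append(target_model(tokens))
    return tokens[len(prompt):len(prompt) + max_tokens]
```

Because every emitted token is either verified or produced by the target model, the output matches plain autoregressive decoding with the target - the draft model only changes speed, never results.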
PayPal fine-tuned Nemotron models on their own data and deployed EAGLE3 speculative decoding in production. The result: latency dropped between 18% and 33% depending on the task. More importantly, a single GPU now matches the performance of dual-GPU setups from before. Same output quality. Same accuracy. Half the hardware.
Why This Matters Outside PayPal
Inference cost is the new bottleneck. Training gets the headlines, but inference is where the bills pile up. Every API call, every chatbot interaction, every real-time recommendation - that's inference. Faster inference means lower cloud costs, better user experience, and the ability to run more complex models on constrained hardware.
What's interesting here isn't just the speed gain - it's that this works in production, at scale, on a system handling real transactions. PayPal isn't running toy benchmarks. They're processing payments, detecting fraud, answering customer queries. The optimisation had to be reliable, not just fast. The fact that it shipped tells you it's stable enough to trust with money.
Speculative decoding has been around in research for a while, but production deployments are rare. The method requires careful tuning: the draft model has to be cheap enough that its extra passes cost little, yet aligned closely enough with the target model that most of its guesses are accepted. PayPal's contribution isn't the algorithm itself - it's proving it works when the stakes are real.
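That trade-off can be made concrete with a simplified analytical model from the speculative-sampling literature (the numbers are illustrative, not PayPal's): if the target model accepts each drafted token independently with probability `alpha`, and each draft step costs `draft_cost` relative to one full-model pass, the expected gain falls out of a short geometric series.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target-model pass when each of gamma
    drafted tokens is accepted independently with probability alpha (< 1)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def expected_speedup(alpha, gamma, draft_cost):
    """Speedup over plain decoding; draft_cost is the cost of one
    draft-model step relative to one target-model pass."""
    return expected_tokens_per_pass(alpha, gamma) / (1 + gamma * draft_cost)
```

With a hypothetical 80% acceptance rate, a draft length of 4, and a draft model 20x cheaper than the target, this gives roughly a 2.8x speedup; drop the acceptance rate to 20% and the same setup barely breaks even - which is exactly why the tuning matters.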
The Practical Implications
If you're running LLMs in production, this changes the cost equation. Instead of scaling horizontally - adding more GPUs when traffic spikes - you can extract more performance from what you already have. That's a different kind of scaling. It's not limitless, but it's cheaper and faster to implement than provisioning new infrastructure.
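The arithmetic behind that claim is simple. Assuming per-GPU throughput scales roughly with the inverse of per-request latency (a simplification that approximately holds for sequential decoding), a back-of-envelope sketch with made-up traffic numbers looks like this:

```python
import math

def gpus_needed(target_qps, qps_per_gpu):
    """GPUs required to serve a given load, rounding up to whole devices."""
    return math.ceil(target_qps / qps_per_gpu)

# Hypothetical numbers for illustration - not PayPal's actual traffic.
base_qps_per_gpu = 10.0
latency_reduction = 0.33  # the upper end of PayPal's reported range
faster_qps_per_gpu = base_qps_per_gpu / (1 - latency_reduction)

before = gpus_needed(100, base_qps_per_gpu)   # 10 GPUs
after = gpus_needed(100, faster_qps_per_gpu)  # 7 GPUs
```

Three GPUs back from a fleet of ten, with no change to the model itself - that is the shift in the cost equation described above.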
For developers building on APIs like OpenAI or Anthropic, this won't change much directly - you're using their infrastructure, not managing your own. But if providers adopt techniques like this (and they likely will), it could mean lower API costs or faster response times without you doing anything. That makes more ambitious applications viable. Tasks that were too slow or expensive start to look practical.
The bigger pattern is this: optimisation is now the frontier. We've squeezed massive performance gains from better hardware - NVIDIA's H100s, AMD's MI300s, custom ASICs. But the next wave of improvement is algorithmic. Smarter scheduling. Better memory management. Techniques like speculative decoding that change how tokens are generated without rewriting the model itself.
PayPal's results suggest there's still headroom. A 33% latency reduction from a technique that adds no hardware cost once the engineering is done means most production systems are leaving performance on the table. The question is whether other companies have the engineering bandwidth to implement this kind of optimisation, or whether they'll wait for it to ship as a standard feature in frameworks like vLLM or TensorRT.
Either way, inference just got faster. And when inference gets faster, the applications that seemed too slow to build start looking possible again.