Most AI recommendation systems are confident and wrong. You ask for a game suggestion and get something you already own, or something completely mismatched to your taste, delivered with absolute certainty.
VoxGaming's VoxBot does something smarter: it checks what you actually own before recommending anything. Sounds obvious. Turns out it required building a custom architecture from scratch.
The technical approach
The team's detailed writeup walks through their stack: Llama 3.3 70B quantised to 4-bit, running on a single RTX 4090. Not cloud inference. Not an API call to OpenAI. A 70-billion parameter model running locally, fast enough to feel interactive.
The quantisation is doing heavy lifting here. A full-precision 70B model needs roughly 140 GB of VRAM just for weights in 16-bit, which means multiple high-end GPUs. Quantising to 4-bit compresses the model dramatically - each parameter uses 4 bits instead of 16 or 32. You lose some accuracy, but gain the ability to run locally on consumer hardware.
For context: 4-bit quantisation typically degrades performance by 2-5% on benchmarks, but makes the model 4-8x smaller in memory. The tradeoff matters less for conversational tasks than for precise mathematical reasoning. A gaming advisor can afford to be slightly less precise if it means running locally with low latency.
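The 4-8x figure falls straight out of the bit widths: 4-bit is 4x smaller than a 16-bit baseline and 8x smaller than 32-bit. A back-of-envelope sketch of weight memory (weights only - real deployments add overhead for quantisation scales, KV cache, and activations, and runtimes such as llama.cpp can offload layers to system RAM when weights exceed VRAM):

```python
# Rough weight-memory arithmetic for a 70B-parameter model at different
# bit widths. Illustrative only; ignores quantisation metadata and KV cache.

def weight_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB: params x bits / 8 bits-per-byte."""
    return n_params * bits / 8 / 1e9

params = 70e9  # Llama 3.3 70B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_gb(params, bits):.0f} GB of weights")
```

Running this prints ~280, ~140, ~70, and ~35 GB respectively - which makes concrete why 8-bit was out of reach for a single card and why 4-bit is the aggressive-but-workable choice.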
Tool calling as verification
The interesting architectural choice is how VoxBot handles recommendations. It doesn't just generate text. It uses tool calling to query the user's game library before suggesting anything.
Tool calling - sometimes called function calling - lets an LLM trigger external code during generation. Instead of hallucinating whether you own a game, the model calls a function that checks your actual library, gets a response, and incorporates that into its answer.
This is more robust than retrieval-augmented generation (RAG) for this use case. RAG would embed your library into a vector database and retrieve relevant entries. But "do I own this specific game?" is a lookup problem, not a semantic search problem. A direct database query is faster and more reliable.
The result: VoxBot can say "I see you already own Hades, so instead I'd recommend..." with actual confidence, because it checked.
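The pattern is simple enough to sketch. This is a minimal illustration of the tool-calling flow described above, not VoxBot's actual code - the names (`check_library`, `recommend`) and the in-memory set standing in for the library database are hypothetical; a real implementation would expose the tool schema to the model and parse its structured tool-call output:

```python
# Minimal sketch of tool-calling as verification: the model's answer is
# grounded in an exact library lookup, never a guess.

OWNED_GAMES = {"Hades", "Celeste", "Stardew Valley"}  # stand-in for a DB query

def check_library(title: str) -> bool:
    """Tool the model can call: an exact lookup, not a semantic search."""
    return title in OWNED_GAMES

def recommend(candidates: list[str]) -> str:
    # Run every candidate through the verification tool before phrasing
    # the answer, so "you already own X" is checked, not hallucinated.
    owned = [g for g in candidates if check_library(g)]
    new = [g for g in candidates if not check_library(g)]
    note = f"I see you already own {', '.join(owned)}, so instead " if owned else ""
    return note + f"I'd recommend: {', '.join(new)}"

print(recommend(["Hades", "Dead Cells"]))
# → I see you already own Hades, so instead I'd recommend: Dead Cells
```

Note that the lookup is a plain membership test - exactly the "lookup problem, not a semantic search problem" distinction that makes this more reliable than RAG here.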
Production deployment lessons
VoxGaming's writeup includes the kind of detail that matters when you're actually shipping something: how they handle streaming responses (so the user sees text appearing progressively, not all at once after a long wait), how they manage GPU memory to avoid crashes, and how they balance quantisation aggressiveness against response quality.
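Streaming is conceptually just forwarding tokens as the model emits them instead of buffering the full response. A sketch under assumed names - `generate_stream` here is a hypothetical placeholder for the local inference runtime's token iterator, not VoxBot's API:

```python
# Sketch of streaming token delivery: flush each token to the user as it
# arrives so text appears progressively rather than after a long wait.

import sys
import time

def generate_stream(prompt: str):
    # Placeholder for the model: yields tokens one at a time.
    for token in ["You ", "already ", "own ", "Hades."]:
        time.sleep(0.02)  # simulate per-token latency
        yield token

def stream_to_user(prompt: str) -> str:
    """Forward tokens to the terminal immediately; return the full text."""
    parts = []
    for token in generate_stream(prompt):
        sys.stdout.write(token)
        sys.stdout.flush()  # flush per token, not per line
        parts.append(token)
    return "".join(parts)
```

The same shape works over HTTP (chunked responses or server-sent events); the key is that nothing upstream buffers the whole generation.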
One standout detail: they tested multiple quantisation levels (2-bit, 3-bit, 4-bit, 8-bit) and settled on 4-bit as the sweet spot. 2-bit was too degraded - recommendations felt generic and missed nuance. 8-bit was noticeably better but required more VRAM than a single 4090 could provide reliably. 4-bit gave them "good enough" quality with consistent performance.
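The structure of that sweep is worth making explicit: fix the prompt set, vary only the quantisation level, and record both quality and memory. A sketch - the quality scores below are illustrative placeholders, not VoxGaming's measurements; only the VRAM column is weights-only arithmetic (70e9 x bits / 8):

```python
# Sketch of a quantisation sweep: one evaluation harness, one axis varied.

def run_sweep(levels, quality_fn, vram_fn):
    """Record a quality proxy and memory footprint for each bit width."""
    return {bits: {"quality": quality_fn(bits), "vram_gb": vram_fn(bits)}
            for bits in levels}

# Placeholder quality scores for illustration; VRAM is weights-only GB.
quality = {2: 0.61, 3: 0.74, 4: 0.82, 8: 0.85}
vram = {2: 18, 3: 26, 4: 35, 8: 70}

report = run_sweep([2, 3, 4, 8], quality.get, vram.get)
for bits, r in sorted(report.items()):
    print(f"{bits}-bit  quality={r['quality']:.2f}  vram~{r['vram_gb']} GB")
```

Laid out this way, the "sweet spot" reasoning is visible at a glance: quality climbs steeply from 2-bit to 4-bit, then flattens, while memory keeps doubling.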
This kind of empirical testing - not just accepting defaults, but actually measuring tradeoffs in your specific use case - separates projects that ship from projects that stay in development.
Why this architecture matters
VoxBot's approach points at a broader pattern: local inference + tool calling is becoming viable for production applications. Not everything needs to hit OpenAI's API. Not everything needs to run in the cloud.
For certain use cases - especially those with privacy concerns, latency requirements, or cost constraints at scale - running a quantised open-weight model locally with carefully designed tool calling can outperform cloud-based approaches.
The limitations are real. You need GPU hardware (though a 4090 is £1,500, not £15,000). You need to handle model updates manually. You're responsible for prompt engineering, error handling, and all the infrastructure a managed API would provide.
But the upsides are significant. No per-request API costs. No data leaving your infrastructure. Full control over model behaviour and updates. And - crucially - the ability to tightly integrate with your existing systems through tool calling.
The builder perspective
What makes VoxGaming's writeup valuable is that it's clearly written by someone who built and shipped something, not just experimented. The tradeoffs are specific. The deployment challenges are honest. The architecture choices are explained in terms of constraints, not ideals.
Gaming recommendations are a relatively low-stakes problem - if VoxBot suggests a game you don't like, the consequence is minor. But the underlying architecture (local quantised LLM + tool calling for verification + streaming responses) applies to much higher-stakes domains. Customer support. Medical information lookup. Financial advisory. Any domain where you need natural language interaction grounded in verifiable, domain-specific data.
The fact that this stack now runs on a single consumer GPU, with response times good enough to feel interactive, is a recent development. Twelve months ago, this architecture would have required a small cluster of expensive hardware or significant cloud API costs. Now it's viable for individual developers and small teams.
VoxBot is one implementation. The pattern it represents - local inference, tool-based verification, quantisation as a production strategy - is going to show up in a lot more places.