Builders & Makers Wednesday, 25 February 2026

VoxGaming built a game advisor that checks your library first


Most AI recommendation systems are confident and wrong. You ask for a game suggestion and get something you already own, or something completely mismatched to your taste, delivered with absolute certainty.

VoxGaming's VoxBot does something smarter: it checks what you actually own before recommending anything. Sounds obvious. Turns out it required building a custom architecture from scratch.

The technical approach

The team's detailed writeup walks through their stack: Llama 3.3 70B quantised to 4-bit, running on a single RTX 4090. Not cloud inference. Not an API call to OpenAI. A 70-billion parameter model running locally, fast enough to feel interactive.

The quantisation is doing heavy lifting here. A full-precision 70B model needs on the order of 140 GB just for weights at 16-bit, which means multiple high-end GPUs. Quantising to 4-bit compresses the model dramatically - each parameter uses 4 bits instead of 16 or 32. You lose some accuracy, but gain the ability to run locally on consumer hardware.

For context: 4-bit quantisation typically degrades performance by 2-5% on benchmarks, but makes the model 4-8x smaller in memory. The tradeoff matters less for conversational tasks than for precise mathematical reasoning. A gaming advisor can afford to be slightly less precise if it means running locally with low latency.
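The memory side of that tradeoff is simple arithmetic. A back-of-the-envelope sketch of weights-only footprint at the quantisation levels discussed here (this ignores KV cache, activations, and per-layer overhead, so real usage runs higher):

```python
# Rough weights-only memory estimate for a 70B-parameter model at
# different quantisation levels. Illustrative only - real deployments
# also need memory for KV cache and activations.
PARAMS = 70e9

def weights_gb(bits_per_param: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weights_gb(bits):.0f} GB")
```

Each halving of bits-per-parameter halves the weight footprint, which is why the jump from 16-bit to 4-bit is what makes consumer hardware plausible at all.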

Tool calling as verification

The interesting architectural choice is how VoxBot handles recommendations. It doesn't just generate text. It uses tool calling to query the user's game library before suggesting anything.

Tool calling - sometimes called function calling - lets an LLM trigger external code during generation. Instead of hallucinating whether you own a game, the model calls a function that checks your actual library, gets a response, and incorporates that into its answer.

This is more robust than retrieval-augmented generation (RAG) for this use case. RAG would embed your library into a vector database and retrieve relevant entries. But "do I own this specific game?" is a lookup problem, not a semantic search problem. A direct database query is faster and more reliable.
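To make the lookup-vs-search distinction concrete, here is a minimal sketch of the direct-query approach using an in-memory SQLite table. The schema and data are invented for the example, not VoxGaming's actual storage:

```python
import sqlite3

# A toy game library. "Do I own this game?" is answered by an
# exact-match query - no embeddings, no vector database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library (title TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO library VALUES (?)",
                 [("Hades",), ("Celeste",), ("Stardew Valley",)])

def owns(title: str) -> bool:
    """Case-insensitive exact lookup against the library table."""
    row = conn.execute(
        "SELECT 1 FROM library WHERE title = ? COLLATE NOCASE",
        (title,),
    ).fetchone()
    return row is not None

print(owns("hades"))   # owned, regardless of casing
print(owns("Portal"))  # not in the library
```

A query like this returns a definite yes or no in microseconds; a semantic retrieval step would return "similar" entries that the model still has to interpret.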

The result: VoxBot can say "I see you already own Hades, so instead I'd recommend..." with actual confidence, because it checked.
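The loop above can be sketched in code. This is a hedged mock-up, not VoxGaming's implementation: the tool name `check_ownership`, the library data, and the dispatcher are all invented, and the "model request" is simulated rather than coming from a real LLM. The tool schema follows the common function-calling shape most LLM runtimes accept:

```python
import json

# Mock game library; in production this would be the user's real data.
GAME_LIBRARY = {"hades", "celeste", "stardew valley"}

# Tool schema in the usual function-calling shape: the model sees this
# description and decides when to call the function.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_ownership",
        "description": "Return whether the user owns a given game.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}]

def check_ownership(title: str) -> dict:
    """The actual lookup: a membership test, not a guess."""
    return {"title": title, "owned": title.lower() in GAME_LIBRARY}

def dispatch(tool_call: dict) -> str:
    """Run the function the model requested and return a JSON string
    that gets fed back into the model's context."""
    if tool_call["name"] == "check_ownership":
        args = json.loads(tool_call["arguments"])
        return json.dumps(check_ownership(args["title"]))
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulated model request while composing a recommendation:
result = dispatch({"name": "check_ownership",
                   "arguments": json.dumps({"title": "Hades"})})
print(result)
```

The key property: by the time the model writes its recommendation, the ownership fact is in its context as verified data, not as a guess.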

Production deployment lessons

VoxGaming's writeup includes the kind of detail that matters when you're actually shipping something: how they handle streaming responses (so the user sees text appearing progressively, not all at once after a long wait), how they manage GPU memory to avoid crashes, and how they balance quantisation aggressiveness against response quality.
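The streaming pattern is worth illustrating. A minimal sketch, with a generator standing in for the token stream a local inference server would produce (the delay parameter simulates per-token latency; none of this is VoxGaming's code):

```python
import time
from typing import Iterator

def stream_tokens(text: str, delay: float = 0.0) -> Iterator[str]:
    """Yield a response word by word, the way an inference server
    streams tokens, so the UI can render text progressively."""
    for word in text.split():
        time.sleep(delay)  # stands in for per-token generation time
        yield word + " "

def render(chunks: Iterator[str]) -> str:
    """Consume the stream, displaying each chunk as it arrives."""
    out = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # progressive display
        out.append(chunk)
    return "".join(out)

reply = render(stream_tokens("I see you already own Hades, so instead ..."))
```

The user-facing difference is perceived latency: the first words appear as soon as the first tokens are generated, instead of after the whole completion finishes.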

One standout detail: they tested multiple quantisation levels (2-bit, 3-bit, 4-bit, 8-bit) and settled on 4-bit as the sweet spot. 2-bit was too degraded - recommendations felt generic and missed nuance. 8-bit was noticeably better but required more VRAM than a single 4090 could provide reliably. 4-bit gave them "good enough" quality with consistent performance.

This kind of empirical testing - not just accepting defaults, but actually measuring tradeoffs in your specific use case - separates projects that ship from projects that stay in development.

Why this architecture matters

VoxBot's approach points at a broader pattern: local inference + tool calling is becoming viable for production applications. Not everything needs to hit OpenAI's API. Not everything needs to run in the cloud.

For certain use cases - especially those with privacy concerns, latency requirements, or cost constraints at scale - running a quantised open-weight model locally with carefully designed tool calling can outperform cloud-based approaches.

The limitations are real. You need GPU hardware (though a 4090 is £1,500, not £15,000). You need to handle model updates manually. You're responsible for prompt engineering, error handling, and all the infrastructure a managed API would provide.

But the upsides are significant. No per-request API costs. No data leaving your infrastructure. Full control over model behaviour and updates. And - crucially - the ability to tightly integrate with your existing systems through tool calling.

The builder perspective

What makes VoxGaming's writeup valuable is that it's clearly written by someone who built and shipped something, not just experimented. The tradeoffs are specific. The deployment challenges are honest. The architecture choices are explained in terms of constraints, not ideals.

Gaming recommendations are a relatively low-stakes problem - if VoxBot suggests a game you don't like, the consequence is minor. But the underlying architecture (local quantised LLM + tool calling for verification + streaming responses) applies to much higher-stakes domains. Customer support. Medical information lookup. Financial advisory. Any domain where you need natural language interaction grounded in verifiable, domain-specific data.

The fact that this stack now runs on a single consumer GPU, with response times good enough to feel interactive, is a recent development. Twelve months ago, this architecture would have required a small cluster of expensive hardware or significant cloud API costs. Now it's viable for individual developers and small teams.

VoxBot is one implementation. The pattern it represents - local inference, tool-based verification, quantisation as a production strategy - is going to show up in a lot more places.

More Featured Insights

Robotics & Automation
ZaiNar turns WiFi into robot GPS - no new hardware required
Voices & Thought Leaders
Developers are closing the loop - agents that build, test, and ship

Video Sources

Theo (t3.gg)
My new app is really stupid (I wrote none of the code)
Dwarkesh Patel
Why AI Needs a Trillion Words to Do What Humans Do Easily - Dario Amodei

Today's Sources

DEV.to AI
I built an AI gaming advisor that really knows your collection
Replit Blog
Replit Pro Is Here - and Core Now Offers Even Better Value
DEV.to AI
OpenBB: Streamlining Financial Data for Developers
ML Mastery
How to Combine LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline
The Robot Report
ZaiNar raises $100M and launches physical AI platform
The Robot Report
AI2 Robotics raises Series B funding to advance AlphaBot, embodied AI
The Robot Report
Aramine AutoNav makes mining safer at Reward Gold Mine
ROS Discourse
Traversability_generator3d: 3D Traversability from MLS Maps
ROS Discourse
ROS Meetup Malmö Sweden
Latent Space
[AINews] The Unreasonable Effectiveness of Closing the Loop
Digital Native
10 Charts That Capture How the World Is Changing
Latent Space
Claude Code for Finance + The Global Memory Shortage
Ben Thompson Stratechery
Xbox Replaces Head of Gaming, Xbox History, Whither Xbox

About the Curator

Richard Bland
Founder, Marbl Codes

27+ years in software development, curating the tech news that matters.


© 2026 MEM Digital Ltd t/a Marbl Codes