Google just shipped an API that lets web developers run a language model directly in the browser - no server, no cloud, no external dependencies. The Prompt API is part of Chrome's built-in AI toolkit, and it changes the economics of building AI features into web apps.
The model runs locally on the user's device. For developers, that means zero API costs, no rate limits, and no network latency. For users, it means their data never leaves their machine. For privacy-sensitive applications, that's the entire pitch.
This isn't a research preview. It's shipping in Chrome's early-access channels now, with stable release expected in the coming months.
What the API Actually Does
The Prompt API is deliberately simple. You call window.ai.createTextSession() to get a session object, then send it prompts using session.prompt(). The API returns text responses. That's it. No model selection. No fine-tuning. No configuration beyond basic parameters like temperature and token limits.
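In practice, the whole flow fits in a few lines. This sketch uses the method names from the early-access documentation quoted above; the exact option names may shift before stable release, and it assumes an ES module context so top-level await works.

```js
// Minimal usage sketch. createTextSession() and prompt() are the calls named
// above; the options reflect the "basic parameters" the API exposes. Option
// names and shapes may differ across early-access builds.
const session = await window.ai.createTextSession({
  temperature: 0.7, // basic sampling parameters - exact names may vary
  topK: 3,
});

const reply = await session.prompt(
  "Rewrite this in a friendlier tone: Your request was denied."
);
console.log(reply); // a plain text string - nothing more
```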
The model behind the API is Gemini Nano, Google's smallest production language model. It's not GPT-4. It won't write you a novel or solve complex reasoning tasks. But it handles summarisation, simple classification, basic Q&A, and text generation well enough for most web use cases.
The key constraint is context length. Gemini Nano supports around 4,000 tokens of context - enough for a few pages of text, but not entire documents. If your use case needs more, you'll still need a cloud-based model.
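The API doesn't expose a tokenizer, so staying under the window means estimating. Here's a rough guard using the common four-characters-per-token heuristic - both the ratio and the exact limit are assumptions, not values the API reports:

```js
// Crude context-window guard. The ~4 chars/token ratio is a heuristic, not
// something the API provides; adjust it if your text skews code-heavy or
// non-English.
const MAX_CONTEXT_TOKENS = 4000;
const CHARS_PER_TOKEN = 4;

function fitToContext(text, tokensReservedForInstructions = 500) {
  const charBudget =
    (MAX_CONTEXT_TOKENS - tokensReservedForInstructions) * CHARS_PER_TOKEN;
  return text.length <= charBudget ? text : text.slice(0, charBudget);
}
```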
The Privacy Angle
For certain applications, local inference isn't a nice-to-have - it's a requirement. Medical records. Legal documents. Financial data. Anything subject to GDPR, HIPAA, or other privacy regulations. Sending that data to OpenAI's servers is often a non-starter.
The Prompt API solves that problem by keeping everything local. The model is downloaded once and cached on the user's device. Prompts and responses never touch a network. For developers building tools in regulated industries, this opens up AI features that were previously off-limits.
It also solves the offline problem. If your web app works offline, your AI features can now work offline too. That's valuable for progressive web apps, especially on mobile devices with unreliable connectivity.
The Cost Equation
Running inference locally shifts costs from the developer to the user. Instead of paying OpenAI per API call, you're using the user's CPU and battery. For high-traffic applications, that's a huge cost saving. For users on limited devices, it's a performance hit.
Google's documentation suggests the API uses on-device acceleration when available - GPUs, NPUs, or other specialised hardware. On a modern laptop, inference is fast enough for real-time interaction. On older devices or low-end hardware, it might not be.
The API includes capability detection. You can check if the model is available before trying to use it. If it's not - because the device doesn't meet minimum requirements or the user hasn't downloaded the model - you can fall back to a cloud-based alternative.
That fallback strategy is critical. Not every user will have Gemini Nano installed. Treating local inference as an enhancement, not a requirement, keeps your app functional across devices.
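Put together, detection plus fallback looks something like the sketch below. The canCreateTextSession() check and its "readily" return value follow early-access builds and may differ in your Chrome version; askCloudModel() is a hypothetical stand-in for whatever server-side proxy you already run.

```js
// Enhancement-first pattern: local model when available, cloud otherwise.
async function ask(promptText) {
  // Early-access builds reported "readily" | "after-download" | "no";
  // treat anything but "readily" as unavailable here.
  const availability = await window.ai?.canCreateTextSession?.();
  if (availability === "readily") {
    const session = await window.ai.createTextSession();
    return session.prompt(promptText);
  }
  return askCloudModel(promptText);
}

// Hypothetical cloud fallback: a proxy you host, so API keys stay server-side.
async function askCloudModel(promptText) {
  const res = await fetch("/api/llm", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: promptText }),
  });
  const { text } = await res.json();
  return text;
}
```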
What This Enables
The obvious use cases are the ones where latency and privacy matter most. Real-time text assistance: autocomplete, grammar checking, tone adjustment. Local document analysis: summarising emails, extracting key points from meeting notes. Chatbots that don't require a backend.
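The local-document-analysis case is nearly a one-liner once a session exists. A sketch, where the element id is illustrative rather than anything the API prescribes:

```js
// Summarising meeting notes entirely on-device. #meeting-notes is a
// hypothetical element id; method names follow the early-access docs.
const session = await window.ai.createTextSession();
const notes = document.querySelector("#meeting-notes")?.textContent ?? "";
const actionItems = await session.prompt(
  `List the three most important action items from these notes:\n\n${notes}`
);
console.log(actionItems);
```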
But the more interesting applications are the ones that weren't viable before. Browser extensions that add AI features without requiring API keys. Client-side moderation tools for user-generated content. Personalised recommendations that run entirely in the browser, informed by local data that never leaves the device.
The Prompt API also pairs well with Chrome's other on-device AI experiments. The Translation API, the Summarisation API, and the upcoming image generation APIs all follow the same pattern: run the model locally, keep data private, avoid cloud costs.
The Ecosystem Shift
If local LLMs in the browser become standard, it changes how developers think about AI features. Right now, adding AI to a web app means integrating with OpenAI, Anthropic, or another cloud provider. That comes with costs, rate limits, and vendor lock-in.
With local inference, AI becomes a browser capability - like geolocation or WebRTC. You use it when it's available. You don't pay per use. You don't worry about rate limits or API downtime. The browser handles the complexity.
That's the vision Google is pushing. Whether it happens depends on adoption. If developers build features that depend on the Prompt API, and those features work well, other browsers will follow. If the API stays Chrome-specific, it'll remain a niche tool.
The Limitations
Local models are smaller and less capable than cloud models. Gemini Nano won't replace GPT-4 for complex tasks. The 4,000-token context limit rules out many document-processing use cases. And device compatibility is still uneven - not every machine can run the model smoothly.
There's also a cold-start problem. The first time a user visits your site, the model might not be installed yet. Chrome downloads it in the background, but until that's complete, your AI features won't work. You need a fallback or a loading state.
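One way to handle that cold start, assuming the early-access "after-download" signal: show a loading state and poll until the background download completes. The polling interval and status string here are arbitrary choices, not part of the API.

```js
// Cold-start handling: resolves true once the model is usable, false if the
// device can't run it at all. "after-download" follows early-access builds.
async function waitForModel(onStatus) {
  for (;;) {
    const availability = await window.ai?.canCreateTextSession?.();
    if (availability === "readily") return true;
    if (availability !== "after-download") return false; // unsupported device
    onStatus("Downloading on-device model...");
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```

Call it before enabling the AI parts of your UI; if it resolves false, fall through to the cloud path from the earlier sketch.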
And finally, local inference is hard to monitor. If your AI feature produces bad output, you can't log the prompt and response for debugging - they never hit your servers. That makes quality control more difficult.
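The workable compromise is consent-gated reporting: nothing leaves the device unless the user explicitly flags a bad response. A sketch - /api/ai-feedback is a hypothetical endpoint on your own server, not part of the Prompt API:

```js
// Opt-in quality reporting. Silent logging would undo the privacy guarantee,
// so this only fires when the user explicitly flags an output.
async function reportBadOutput(promptText, responseText, userConsented) {
  if (!userConsented) return;
  await fetch("/api/ai-feedback", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: promptText, response: responseText }),
  });
}
```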
What Happens Next
The Prompt API is in early access now. The stable release will likely come with clearer performance benchmarks, better device compatibility, and more robust fallback mechanisms. If it gains traction, expect Mozilla and Apple to ship equivalents.
For developers, the play is to experiment now while the API is still flexible. Build a prototype. Test it on real devices. Figure out where local inference makes sense and where cloud models are still necessary. The patterns you learn now will matter when this becomes standard infrastructure.
Read the full technical documentation on Chrome's developer site.