A developer needed a screen-reading assistant for Windows. The obvious choice: call OpenAI's API. Instead, they built Clicky, a tool that runs local LLMs via Ollama. Their writeup explains why privacy, cost, and latency made local-first the right call, with practical model comparisons and API patterns for anyone considering the same trade-off.
The Problem: Screen Content is Sensitive
A screen-reading assistant sees everything. Emails, passwords, financial data, private messages. Sending that to a cloud API means trusting a third party with your entire digital life. For some use cases, that's fine. For screen reading, it's a risk many users won't accept.
Running the model locally means the data never leaves the machine. No API calls, no server logs, no possibility of a breach exposing your screen content. This isn't paranoia - it's a legitimate design constraint for any tool that handles sensitive input.
Cost and Latency
Cloud APIs charge per token. For a screen-reading assistant that might process multiple screenshots per minute, costs add up fast. A local model has an upfront compute cost - you need a machine capable of running inference - but after that, the marginal cost per request is effectively zero. For high-frequency use cases, the economics favour local inference.
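As a rough illustration of how the per-token arithmetic plays out, here is a back-of-envelope estimate. The screenshot rate, token counts, and pricing below are placeholder assumptions chosen for the comparison, not figures from the writeup:

```python
# Back-of-envelope cost estimate for a high-frequency screen-reading assistant.
# All numbers are illustrative assumptions, not measured or quoted figures.

screenshots_per_minute = 2
active_hours_per_day = 4
tokens_per_screenshot = 1_500        # image tokens + prompt + response (assumed)
price_per_million_tokens = 5.00      # assumed cloud vision-model rate, USD

tokens_per_day = screenshots_per_minute * 60 * active_hours_per_day * tokens_per_screenshot
monthly_cost = tokens_per_day * 30 * price_per_million_tokens / 1_000_000

print(f"Tokens per day: {tokens_per_day:,}")
print(f"Estimated monthly cloud cost: ${monthly_cost:.2f}")
# ~720,000 tokens/day -> roughly $108/month at these assumed rates,
# versus a one-off hardware cost and near-zero marginal cost for local inference.
```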
Latency matters too. Sending a screenshot to a cloud API, waiting for the response, and displaying the result introduces delay - network round trips and server-side queueing on top of inference time. Local inference removes that overhead, and for an assistive tool where responsiveness affects usability, the difference is noticeable.
Model Comparisons
The developer tested several models via Ollama: Llama 3.2 Vision, Mistral, and Qwen. Each had trade-offs. Llama 3.2 Vision handled complex layouts well but was slower. Mistral was faster but missed nuance in dense UIs. Qwen struck a balance - good enough accuracy, acceptable speed, reasonable hardware requirements.
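One way to reproduce that kind of comparison on your own hardware is to send the same prompt to each model through Ollama's local HTTP API and time the responses. The sketch below assumes Ollama is running on its default port (11434) and that the listed model tags have already been pulled; the tags are placeholders, not necessarily the exact builds the developer used:

```python
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Model tags are assumptions; use whatever you have pulled with `ollama pull <name>`.
MODELS = ["llama3.2-vision", "mistral", "qwen2.5"]

PROMPT = "List the interactive elements described in this screen text: ..."

def run_once(model: str, prompt: str) -> tuple[float, str]:
    """Send one non-streaming generate request to a local Ollama model and time it."""
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start, resp.json()["response"]

for model in MODELS:
    elapsed, answer = run_once(model, PROMPT)
    print(f"{model}: {elapsed:.1f}s")
    print(answer[:200], "\n")
```

Because switching models is just a different string in the request, this kind of side-by-side testing is cheap to run.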
This is the reality of local LLMs in 2025. You don't get GPT-4 level performance. You get models that are good enough for specific tasks, with constraints you can work around. The question isn't whether local models match cloud APIs. It's whether they're sufficient for your use case, and whether the trade-offs - privacy, cost, latency - justify the capability gap.
API Patterns: Ollama vs. Cloud
Ollama's API is simpler than you'd expect. You load a model, send it input, get a response. No authentication, no rate limits, no usage tracking. For a local tool, that simplicity is a feature. The code is cleaner. The failure modes are predictable.
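To make the contrast concrete, here is a minimal sketch of a chat-style request against a local Ollama server - note the localhost URL and the absence of an API key. The model tag is an assumption:

```python
import requests

# Local Ollama server: no API key, no rate-limit headers, no usage tracking.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

resp = requests.post(
    OLLAMA_CHAT_URL,
    json={
        "model": "qwen2.5",  # assumed tag; any locally pulled model works
        "messages": [
            {"role": "user", "content": "Summarise the text visible in this window: ..."},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The request shape mirrors the familiar cloud chat APIs, so prompts port back and forth easily; what disappears is the key management and quota handling.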
The developer shares patterns for handling screenshots, batching requests, and managing model switching in the full writeup. These aren't abstractions - they're working code from a shipped product. If you're building something similar, the patterns transfer directly.
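The full writeup has the developer's actual code; as a rough stand-in, the sketch below shows one way a screenshot-to-model call can look with Ollama's REST API - capture the screen, base64-encode the PNG, and pass it in the images field of a vision-capable model. The capture method (Pillow's ImageGrab) and the model tag are assumptions, not details from Clicky:

```python
import base64
import io

import requests
from PIL import ImageGrab  # Pillow; ImageGrab works on Windows

OLLAMA_URL = "http://localhost:11434/api/generate"

def describe_screen(model: str = "llama3.2-vision") -> str:
    """Grab the current screen, base64-encode it, and ask a local vision model about it."""
    shot = ImageGrab.grab()                      # full-screen capture
    buf = io.BytesIO()
    shot.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,                      # model switching is just a string change
            "prompt": "Describe the main UI elements visible in this screenshot.",
            "images": [image_b64],               # base64 images, accepted by vision-capable models
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(describe_screen())
```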
When Local Makes Sense
Not every application should run local models. Cloud APIs have better accuracy, more capabilities, and zero infrastructure burden. But for use cases where privacy is non-negotiable, usage is high-frequency, or latency matters, local inference is a serious option.
Clicky proves the approach works. A functional screen-reading assistant, running on consumer hardware, with no cloud dependency. The model isn't perfect. The developer documents its limitations clearly. But it's good enough to ship, and the trade-offs made the product possible.
This is what local LLM tooling looks like in practice. Not a replacement for cloud APIs, but a viable alternative when the constraints favour it. Privacy-first tools, offline functionality, cost-predictable deployments - these are problems local models solve better than API calls.
If you're building something that handles sensitive data, runs frequently, or needs to work offline, Ollama and models like Qwen are worth testing. The capability gap is narrowing. The tooling is maturing. And for some products, local-first isn't just a nice-to-have. It's the only option that works.