The monthly AWS bill for a voice assistant running in the cloud: £2,400. The same system running on a microcontroller: £5 upfront, then nothing.
That economic reality is driving an architectural shift in how developers deploy AI systems. This practical guide from DEV.to maps the technical trade-offs - and, more importantly, the cost structures - that are pushing inference workloads to the edge.
The pattern is consistent across industries. Smart home devices, industrial sensors, medical wearables, retail point-of-sale systems - anything that needs real-time AI inference is moving away from cloud APIs and toward local processing. Not because the cloud doesn't work, but because the economics and architecture don't make sense for always-on systems.
Four Reasons Edge Wins
Bandwidth: Sending raw sensor data to the cloud for processing means constant network usage. A security camera doing object detection locally processes 30 frames per second without ever touching the network. The same system sending frames to a cloud API needs 5-10 Mbps sustained upload. At scale, that's prohibitive: a building with 50 cameras would need 250-500 Mbps of sustained upload - dedicated business fibre just for the video feeds.
Latency: Round-trip time to a cloud API is 100-300ms under good conditions. That's acceptable for a chatbot. It's unusable for real-time control systems. A robot arm doing visual inspection needs sub-10ms response times. A drone maintaining altitude needs sub-5ms. Physics doesn't care about your API rate limits.
Privacy: Healthcare and financial services have regulatory requirements that make cloud processing expensive or impossible. GDPR, HIPAA, PCI-DSS - all of them prefer or require local processing of sensitive data. A medical device that never sends patient data to the cloud simplifies compliance enormously.
Operating costs: This is the big one. Cloud inference pricing is per-request. That works for batch processing or occasional queries. It breaks down for continuous operation. A device making 100 inferences per second costs pennies on-device and hundreds of pounds per month in the cloud. Over a product's 5-year lifespan, cloud costs exceed device costs by 50-100x.
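A back-of-envelope calculation makes the gap concrete. The per-request price and the £80 all-in device cost below are illustrative assumptions, not quoted rates, but the shape of the result holds across realistic pricing:

```python
# Back-of-envelope: continuous cloud inference vs a one-off edge device.
# Prices are illustrative assumptions, not quoted rates from any provider.

INFERENCES_PER_SECOND = 100
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
CLOUD_PRICE_PER_1K_REQUESTS_GBP = 0.0004   # assumed per-request pricing tier
DEVICE_COST_GBP = 80                        # assumed all-in device BOM (chip, camera, power)
LIFESPAN_MONTHS = 5 * 12

monthly_requests = INFERENCES_PER_SECOND * SECONDS_PER_MONTH
monthly_cloud_cost = monthly_requests / 1_000 * CLOUD_PRICE_PER_1K_REQUESTS_GBP
lifetime_cloud_cost = monthly_cloud_cost * LIFESPAN_MONTHS

print(f"Requests per month:    {monthly_requests:,}")
print(f"Cloud cost per month:  £{monthly_cloud_cost:,.2f}")
print(f"Cloud cost, 5 years:   £{lifetime_cloud_cost:,.2f}")
print(f"Ratio vs £{DEVICE_COST_GBP} device: {lifetime_cloud_cost / DEVICE_COST_GBP:,.0f}x")
```

Under those assumptions the cloud bill lands around £100 per month and a few thousand pounds over five years - the 50-100x gap the guide describes, before counting bandwidth.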
What Edge Processing Actually Looks Like
Modern microcontrollers can run surprisingly capable models. The guide walks through deploying a quantised neural network on an ESP32 - a £4 chip with 520KB of RAM. The model does basic image classification at 10 frames per second. Not spectacular, but entirely sufficient for detecting whether a package is damaged or a door is open.
The constraint is model size and complexity. You're not running GPT-4 on a microcontroller. But you don't need to. Most edge use cases need narrow classification tasks - is this an anomaly? Has this threshold been crossed? Does this image contain a face? These are solved problems that fit in kilobytes, not gigabytes.
Model quantisation is the key enabler. A model trained in 32-bit floating-point precision can often be quantised to 8-bit integers with minimal accuracy loss. That 4x compression is the difference between a model that needs cloud processing and one that runs locally.
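A minimal sketch of that step with TensorFlow Lite's post-training full-integer quantisation is below. The tiny model and random calibration data are stand-ins - in practice you would swap in your trained network and a few hundred real training samples:

```python
# Post-training full-integer (int8) quantisation with TensorFlow Lite.
# The model and calibration data here are placeholders for your own.
import numpy as np
import tensorflow as tf

# Stand-in for a trained float32 model (e.g. a small image classifier).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])

def representative_dataset():
    # The converter runs these samples through the model to choose int8 scales.
    for _ in range(200):
        yield [np.random.rand(1, 96, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force integer-only ops so the model runs on MCUs without an FPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantised model: {len(tflite_model) / 1024:.1f} KB")
```

The resulting .tflite file is what gets compiled into the device firmware and executed by the on-device runtime.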
What Still Needs The Cloud
Edge processing handles inference. The cloud remains essential for training, fleet management, and model updates. This is the hybrid architecture that's emerging - thousands of devices doing local inference, all reporting aggregate statistics back to a central system that improves the model over time.
The feedback loop works like this: devices run a quantised model locally and log inference results. If confidence drops below a threshold, they flag the case. The cloud system collects these edge cases, retrains the model, and pushes an updated version to the fleet. Each device gets smarter without ever sending raw data to the cloud.
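The device-side half of that loop is simple. The sketch below is illustrative rather than the guide's exact code - the inference and stats helpers are placeholders, and the confidence threshold is an assumed value:

```python
# On-device half of the feedback loop - a minimal sketch, not the guide's code.
# run_local_inference() and record_stat() are placeholder helpers.
import random
import time

CONFIDENCE_THRESHOLD = 0.70   # assumed cut-off for "flag this for retraining"
review_queue = []             # metadata waiting to be reported to the cloud

def run_local_inference(frame):
    # Placeholder: in practice this invokes the quantised .tflite model.
    return "door_open", random.random()

def record_stat(label, confidence):
    # Placeholder: devices report only aggregate counters, never raw frames.
    pass

def handle_frame(frame):
    label, confidence = run_local_inference(frame)
    record_stat(label, confidence)

    # Low confidence marks an edge case: keep the metadata for the cloud
    # retraining pipeline, while the raw frame itself stays on the device.
    if confidence < CONFIDENCE_THRESHOLD:
        review_queue.append({
            "label": label,
            "confidence": round(confidence, 3),
            "timestamp": time.time(),
        })
    return label

handle_frame(frame=None)   # stand-in call; a real device would pass camera frames
```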
This architecture also handles model updates cleanly. A device running firmware version 1.2 can be updated to version 1.3 over-the-air, including a completely new neural network. The operational model becomes: deploy locally, monitor centrally, improve continuously.
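On the device, the update check can be as small as polling a manifest and staging the new model. This is a sketch with an assumed manifest format and a placeholder URL - the guide's actual update mechanism may differ:

```python
# Over-the-air model update check - assumed manifest format, placeholder URL.
import json
import urllib.request

MANIFEST_URL = "https://updates.example.com/fleet/model-manifest.json"  # placeholder
CURRENT_MODEL_VERSION = "1.2"

def check_for_model_update():
    """Return the new version string if an update was staged, else None."""
    with urllib.request.urlopen(MANIFEST_URL, timeout=10) as resp:
        manifest = json.load(resp)          # e.g. {"version": "1.3", "model_url": "..."}

    if manifest["version"] == CURRENT_MODEL_VERSION:
        return None                         # already up to date

    # Fetch the new quantised model and stage it; the device swaps it in on
    # the next restart, alongside (or as part of) a firmware OTA update.
    with urllib.request.urlopen(manifest["model_url"], timeout=60) as resp:
        staged = resp.read()
    with open("model_staged.tflite", "wb") as f:
        f.write(staged)
    return manifest["version"]
```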
The Developer Experience
Tooling has caught up with the architecture shift. TensorFlow Lite, PyTorch Mobile, and Apache TVM all support exporting models for edge deployment. The workflow is straightforward - train in the cloud using full-precision models and large datasets, then quantise and deploy to microcontrollers for inference.
The guide notes that the biggest friction point isn't the technical implementation - it's understanding which tasks actually need cloud processing versus which can run locally. Developers trained on cloud-first architectures default to API calls even when local inference would be faster and cheaper.
For builders, the decision tree is simple. If you need real-time response, operate continuously, handle sensitive data, or deploy at scale, edge processing probably makes sense. If you need massive compute, frequent model updates, or complex reasoning, cloud APIs are still the right choice.
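Rendered as code, that decision tree fits in a few lines. This is purely illustrative - the boolean flags simply mirror the prose above, they are not an API from the guide:

```python
# The deployment decision tree from the paragraph above, as a small helper.
def choose_deployment(realtime: bool, always_on: bool, sensitive_data: bool,
                      large_fleet: bool, heavy_compute: bool,
                      frequent_updates: bool, complex_reasoning: bool) -> str:
    edge_fit = realtime or always_on or sensitive_data or large_fleet
    cloud_fit = heavy_compute or frequent_updates or complex_reasoning

    if edge_fit and cloud_fit:
        return "hybrid"   # lightweight inference at the edge, heavy lifting in the cloud
    if edge_fit:
        return "edge"
    return "cloud"

# A security camera fleet: real-time, always on, deployed at scale.
print(choose_deployment(realtime=True, always_on=True, sensitive_data=False,
                        large_fleet=True, heavy_compute=False,
                        frequent_updates=False, complex_reasoning=False))  # -> edge
```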
The economics push toward a hybrid model - lightweight inference at the edge, heavyweight processing in the cloud, with clear boundaries between them. That's not every use case, but it's enough use cases to shift where development effort gets focused. The edge just became a first-class deployment target.