Today's Overview
Good morning. Three stories are worth your attention today, and they all circle around the same tension: the real cost of running AI at scale, and what happens when things go wrong.
The Economics of Free ChatGPT
OpenAI is quietly confronting a brutal math problem. Running ChatGPT at global scale is shockingly expensive - so much so that the free tier is losing money faster than people realise. This isn't new thinking from OpenAI, but it's becoming impossible to ignore. The company is exploring ads, new pricing tiers, and long-term revenue models just to keep the lights on. For anyone building on top of ChatGPT or similar services, this matters. Your cost assumptions need to match reality. If you're using paid API access, expect friction - rate limiting, pricing changes, or tiered access becoming more restrictive. The takeaway: free AI tools won't stay free. Plan accordingly.
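Part of planning accordingly is coding for friction before it arrives. Here's a minimal sketch of exponential backoff on HTTP 429 responses - the endpoint, key, and retry budget are all illustrative, not any provider's documented behavior:

```python
import random
import time

import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."                                       # placeholder, not a real key

def call_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    """POST to the API, backing off exponentially when we get rate limited."""
    for attempt in range(max_retries):
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
            timeout=30,
        )
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-rate-limit errors immediately
            return resp.json()
        # Prefer the server's Retry-After hint; otherwise exponential backoff with jitter.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt + random.random()))
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries - slow down or upgrade your tier")
```

When a provider does send a Retry-After header, let it win over your own schedule - that's the provider telling you exactly how long to wait.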
AWS and the AI Outage Question
Amazon issued an unusually sharp public rebuttal to the Financial Times this week. The FT claimed Amazon's own agentic AI tool, Kiro, caused AWS outages. Amazon's response: user error, not AI error. Here's the thing though - they both have a point. The December incident did happen. Engineers did allow an AI agent to make autonomous changes. The agent did decide the best course was to delete and recreate an environment. But it happened because someone misconfigured access controls. Same mistake a human developer could make. The real lesson isn't about blame - it's about safeguards. AWS has since implemented mandatory peer review for production access. For anyone deploying agentic tools internally, that's worth stealing.
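What does "mandatory peer review for production access" look like as code? Here's a hedged sketch of the pattern - not AWS's actual mechanism - where destructive agent actions are blocked until a named human signs off. The action names, exception, and sign-off flow are all hypothetical:

```python
# Illustrative only: the action names and approval flow below are hypothetical,
# not AWS's actual review mechanism.
DESTRUCTIVE = {"delete_environment", "recreate_environment", "drop_database"}

class ApprovalRequired(Exception):
    """Raised when an agent proposes a destructive change without human sign-off."""

def execute_agent_action(action: str, params: dict, approved_by: str | None = None) -> None:
    """Run an agent-proposed action; destructive ones need a named human approver."""
    if action in DESTRUCTIVE and approved_by is None:
        raise ApprovalRequired(f"'{action}' needs sign-off before it touches production")
    print(f"executing {action} with {params} (approved_by={approved_by})")

# The agent can propose, but nothing destructive runs until a reviewer signs off:
try:
    execute_agent_action("delete_environment", {"env": "prod"})
except ApprovalRequired as blocked:
    print(f"queued for review: {blocked}")

execute_agent_action("delete_environment", {"env": "prod"}, approved_by="alice")
```

The specific check matters less than the shape: the agent proposes, a human disposes.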
Building with Local AI Models - The Hard Lessons
On the opposite end of the spectrum from enterprise cloud services, there's a thoughtful walkthrough of building a real product with local language models. The piece covers building a Twitter summary tool using both Gemini (cloud) and Ollama (local). The architecture lesson is surprising: models don't scale down the way you'd expect. A prompt that works beautifully with Gemini breaks completely on Llama 3.2. The dominant failure modes are qualitatively different, not just quantitatively worse. You need strict enumeration of outputs, chunked processing, and multi-pass refinement instead of one-shot generation. Also worth noting: VRAM is your real constraint. A 3B model that fits entirely in your GPU's memory will outperform a "smarter" larger model that doesn't, because layers that spill into system RAM drag inference to a crawl. The economics are clear - local inference costs next to nothing after the initial hardware spend, but you pay in complexity and latency.
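To make those lessons concrete, here's a rough sketch of the chunk-then-refine pattern with strict output enumeration, assuming the `ollama` Python package and a pulled llama3.2 model. The prompts, chunk size, and label set are illustrative - the article's actual tool will differ:

```python
# Sketch only: assumes the `ollama` Python package and a pulled llama3.2 model.
# Prompts, chunk size, and the label set are illustrative, not the article's values.
import ollama

LABELS = {"news", "opinion", "meme", "spam"}  # strict enumeration of allowed outputs

def ask(prompt: str) -> str:
    resp = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"].strip()

def classify(tweet: str) -> str:
    """Force the model to pick from a fixed label set; retry once if it strays."""
    prompt = f"Answer with exactly one word from {sorted(LABELS)}.\nTweet: {tweet}"
    for _ in range(2):
        answer = ask(prompt).lower()
        if answer in LABELS:
            return answer
    return "unknown"  # small models still wander sometimes; fail closed

def summarize(tweets: list[str], chunk_size: int = 10) -> str:
    """Pass 1: summarize small chunks. Pass 2: refine the partials into one summary."""
    partials = []
    for i in range(0, len(tweets), chunk_size):
        chunk = "\n".join(tweets[i : i + chunk_size])
        partials.append(ask(f"Summarize these tweets in two sentences:\n{chunk}"))
    return ask("Combine these partial summaries into one short paragraph:\n" + "\n".join(partials))
```

Each pass stays small enough for a 3B model's context to handle reliably; the second pass stitches the partials back into one coherent summary.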
These three stories reflect a useful tension in the landscape right now. Cloud AI is increasingly expensive and tightly controlled. Local AI is cheaper but requires deeper architecture thinking. And in between, agentic tools are powerful enough to matter operationally, which means the risk is also real. Build with your eyes open.
Start Every Morning Smarter
Luma curates the most important AI, quantum, and tech developments into a 5-minute morning briefing. Free, daily, no spam.
- 8:00 AM: Morning digest, ready to listen
- 1:00 PM: Afternoon edition catches what you missed
- 8:00 PM: Daily roundup lands in your inbox