Testing Google's Agent API for $0.37 - Every Bug You'll Hit

Stephen Sebastian ran Google's Antigravity managed agents against 14 services and spent $0.37 doing it. Then he wrote up every bug, every cost surprise, and every production readiness gap he found.

This is what good builder content looks like. Not "here's why agents are significant". More like "here's what broke when I tried to use them for dependency audits".

What Antigravity Actually Does

Antigravity is Google's managed runtime for AI agents. You define a task - in this case, auditing dependencies across multiple services - and the agent handles the execution. It's meant to abstract away the infrastructure complexity of running agents at scale.

The promise: you focus on what the agent should do, Google handles how it runs. The reality, as Sebastian found, is messier than the marketing.

He tested it on 14 different services, checking for outdated dependencies, security vulnerabilities, and configuration drift. Most runs cost between $0.02 and $0.04 in tokens, depending on the size of the codebase and the depth of the audit.

Total spend: $0.37. That's the interesting bit. We're at the point where you can test an AI agent system across a realistic workload for less than a coffee. The barrier to experimentation is essentially zero.

The Token Economics Breakdown

Spread across the 14 services, the $0.37 total averages out to roughly $0.026 per agent run. That's not compute cost - that's token cost. The agent is making multiple LLM calls per run: parsing the dependency file, cross-referencing versions, checking for known vulnerabilities, generating a report.

Sebastian breaks down the token usage per run. Most of the cost is in the analysis phase, not the output. The agent reads more than it writes. For a typical Node.js service with 50-70 dependencies, the agent used around 8,000 input tokens and 1,200 output tokens per audit.

At current pricing, that works out to about $0.04 per service. For a company with 100 microservices, each weekly dependency audit would cost about $4, or roughly $16 over a four-week month. That's cheaper than paying someone to do it manually, and the agent runs the same checks every time.
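The arithmetic is simple enough to sanity-check in a few lines. This is a rough sketch using only the figures quoted above (about $0.04 per audit, 100 services, weekly runs); the function name is made up for illustration, and real bills will vary with codebase size.

```python
# Figures quoted in the article: about $0.04 per audited service, 100 services.
COST_PER_AUDIT_USD = 0.04
SERVICES = 100


def projected_cost(cost_per_audit: float, services: int, runs: int) -> float:
    """Total cost in USD for `runs` full audit passes over `services` services."""
    return round(cost_per_audit * services * runs, 2)


weekly = projected_cost(COST_PER_AUDIT_USD, SERVICES, runs=1)
four_weeks = projected_cost(COST_PER_AUDIT_USD, SERVICES, runs=4)
yearly = projected_cost(COST_PER_AUDIT_USD, SERVICES, runs=52)

print(f"weekly ${weekly:.2f}, four weeks ${four_weeks:.2f}, yearly ${yearly:.2f}")
```

Swapping in $0.08 per audit, the figure reported for large monorepos, doubles every number, which is still small but worth knowing before you schedule the job.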

But here's the catch Sebastian found: the cost scales non-linearly with complexity. Small services with well-structured dependency files cost $0.02. Large monorepos with nested dependencies and multiple package managers cost $0.08. The difference is in how many LLM calls the agent needs to make to understand the structure.

The Bugs He Hit

Sebastian's writeup is valuable because he documents the failures, not just the successes. Antigravity isn't production-ready for every use case. Here's what broke:

Rate limiting: Running 14 audits in parallel hit Google's API rate limits immediately. The agent doesn't handle backoff gracefully - it just fails and returns an error. You need to implement your own retry logic.

Timeout handling: Large codebases take longer to analyze than the default timeout allows. The agent gets cut off mid-analysis and returns incomplete results. There's no way to extend the timeout or resume from where it stopped.

Dependency resolution: The agent struggled with monorepos that use multiple package managers. It could handle npm or Yarn, but not both in the same run. It also missed workspace dependencies in Yarn 2+ configurations.

Output formatting: The agent returns results as unstructured text, not JSON. If you want to pipe the results into another system, you need to parse the text output yourself. That's fine for human review, less useful for automation.

Version comparison logic: The agent occasionally flagged dependencies as outdated when they weren't. It compared semantic versions incorrectly - treating 1.10.0 as older than 1.9.0 because it compared them character by character as strings rather than component by component as numbers.
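The version comparison bug is easy to reproduce outside the agent. Here is a minimal sketch of the failure and a numeric fix; the helper names are hypothetical, and it only handles plain MAJOR.MINOR.PATCH strings. Pre-release tags and build metadata need a proper semver parser.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so each part compares as a number."""
    return tuple(int(part) for part in version.split("."))


def is_outdated(installed: str, latest: str) -> bool:
    """True only if `installed` is numerically older than `latest`."""
    return parse_version(installed) < parse_version(latest)


# String comparison goes character by character: '1' sorts before '9',
# so "1.10.0" looks older than "1.9.0".
print("1.10.0" < "1.9.0")               # True (the bug)

# Numeric comparison sees 10 > 9 in the minor position.
print(is_outdated("1.10.0", "1.9.0"))   # False (correct)
```

If you post-process the agent's report, re-checking every "outdated" flag with a comparison like this is a cheap way to filter out the false positives.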

The Production Readiness Checklist

Sebastian's conclusion: Antigravity works for one-off audits and manual reviews. It's not ready for automated, production-scale workflows. His checklist for what's missing:

Structured output: Agents need to return JSON, not prose. If the output is meant for another system, unstructured text doesn't cut it.

Resumability: Long-running tasks need checkpoints. If an agent times out, it should resume from where it stopped, not restart from scratch.

Error handling: Rate limits, timeouts, and API failures need graceful degradation. Right now, the agent just fails. It should retry with backoff or return partial results.

Cost visibility: You don't know the token cost until after the run completes. For production workflows, you need cost estimation upfront so you can set budgets and alert on overruns.

Observability: There's no way to see what the agent is doing while it runs. You get the final result, but not the intermediate steps. For debugging, that's a problem.
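Until the runtime handles rate limits itself, the retry-with-backoff item on this checklist can be approximated on the client side. The sketch below is not Antigravity's API: `RateLimitError` and `run_audit` are hypothetical stand-ins for whatever error your client raises and whatever function starts an agent run. It also caps concurrency, since launching all 14 audits at once is what triggered the rate limit in the first place.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on a rate limit."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on RateLimitError wait exponentially longer, with jitter, and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


def audit_all(services, run_audit, max_workers=3):
    """Audit a list of services through a small worker pool instead of all at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda s: with_backoff(lambda: run_audit(s)), services)
        return dict(zip(services, results))
```

Calling `audit_all(["billing", "auth"], run_audit)` with your own `run_audit` function returns a dict of results keyed by service name. The worker count and delay values here are arbitrary starting points, not documented limits.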

Why Managed Runtimes Change Workflows

The bigger point Sebastian makes is about managed runtimes in general. When you give an agent to Google to run, you lose control over execution. You can't inspect the runtime, debug the process, or optimize the infrastructure. You trade control for convenience.

For some use cases, that trade-off works. Running a dependency audit once a week? Fine. The cost is low, the task is simple, and you don't need custom infrastructure. But for complex, business-critical workflows, managed runtimes introduce risks. You're dependent on the provider's reliability, their pricing, and their feature roadmap.

Sebastian's advice: use managed agents for low-stakes automation and one-off tasks. For anything production-critical, run your own infrastructure. The cost might be higher upfront, but you control the failure modes.

What This Tells Us About Agent Maturity

The fact that you can test an agent system for $0.37 is remarkable. The fact that it still has these bugs is expected. We're early. The infrastructure is improving, but it's not polished yet.

What's useful about Sebastian's writeup is the specificity. He doesn't say "agents aren't ready" - he says "here's exactly what breaks and under what conditions". That's the kind of feedback that moves the ecosystem forward.

If you're building with agents, read his full breakdown. It's the best $0.37 someone else has spent for you this week.