Web Development · Saturday, 16 May 2026

GitHub Built an Accessibility Agent. Then Learned When Not to Use It.

GitHub deployed an AI agent to review pull requests for accessibility issues. It processed 3,535 PRs with a 68% resolution rate. The numbers sound good. The lessons are better. The team discovered that knowing when NOT to deploy an agent is more important than knowing how to build one.

The project, detailed on the GitHub Blog, started with a straightforward goal: catch accessibility problems before they ship. Screen reader compatibility, keyboard navigation, colour contrast, ARIA labels - the unglamorous work that gets skipped when deadlines compress. An AI agent seemed like a perfect fit. Review every PR, flag issues, suggest fixes. Simple.

What they learned: agents fail predictably, and managing those failures is most of the work.

Complexity Thresholds Are Real

The first lesson: not every task is agent-appropriate. The GitHub team built a complexity classifier that evaluates each PR before the agent touches it. Simple issues - missing alt text, incorrect ARIA roles - go to the agent. Complex issues - redesigning navigation patterns, refactoring component hierarchies - go to humans.

The classifier isn't sophisticated. It looks at file count, lines changed, number of components affected, and dependency depth. If any metric crosses a threshold, the PR gets flagged for human review. The agent never sees it. This isn't about the agent's capability. It's about the cost of failure. When an agent mishandles a simple issue, you waste time. When it mishandles a complex refactor, you ship broken code.
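
As described, the gate reduces to a handful of metrics and an any-crosses rule. A minimal sketch in TypeScript, with illustrative metric names and threshold values (GitHub hasn't published its actual numbers):

```typescript
// Illustrative complexity gate. Metric names and thresholds are assumptions
// for the sketch, not GitHub's published values.
interface PrMetrics {
  filesChanged: number;
  linesChanged: number;
  componentsTouched: number;
  dependencyDepth: number;
}

const THRESHOLDS: PrMetrics = {
  filesChanged: 5,
  linesChanged: 200,
  componentsTouched: 3,
  dependencyDepth: 2,
};

type Route = "agent" | "human-review";

function routePullRequest(pr: PrMetrics): Route {
  // If any single metric crosses its threshold, the agent never sees the PR.
  const metrics = Object.keys(THRESHOLDS) as Array<keyof PrMetrics>;
  return metrics.some((m) => pr[m] > THRESHOLDS[m]) ? "human-review" : "agent";
}
```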

The threshold system reduced false positives by 40%. More importantly, it reduced the number of PRs where developers had to override or ignore the agent's suggestions. When developers ignore an agent too often, they stop trusting it entirely. The classifier preserves trust by keeping the agent in its competence zone.

Sub-Agent Architecture Beats Monoliths

The team's second finding: one agent trying to handle everything fails more than specialised sub-agents handling specific tasks. They split the system into focused agents - one for screen reader compatibility, one for keyboard navigation, one for colour contrast, one for ARIA markup. Each sub-agent has its own prompt, its own evaluation criteria, its own pass/fail thresholds.
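
A rough TypeScript sketch of that separation; the SubAgent shape and its fields are assumptions for illustration, not GitHub's internal API:

```typescript
// Hypothetical shape of a specialised sub-agent: each one owns its prompt,
// its evaluation criteria, and its pass/fail threshold.
interface Finding {
  file: string;
  message: string;
  suggestedFix?: string; // present only when the agent can fix it safely
}

interface SubAgent {
  name: string;          // e.g. "screen-reader", "keyboard-nav", "colour-contrast"
  prompt: string;        // domain-specific review instructions
  passThreshold: number; // minimum confidence before a fix is auto-applied
  review(diff: string): Promise<{ findings: Finding[]; confidence: number }>;
}
```

Tuning one domain then means editing one prompt and one threshold, without touching the others.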

The sub-agent approach improved accuracy by 25%. But the real win was debuggability. When a monolithic agent fails, it's hard to know why. The failure could be anywhere in its reasoning chain. When a sub-agent fails, you know exactly which domain broke down. You can tune that agent without affecting the others.

This mirrors what works in human review. You don't ask one person to check design, code quality, security, and accessibility. You split the work. The agent architecture should match the human workflow it's replacing.

Linear Execution Order Prevents Cascading Failures

The third lesson: agents should run in a fixed sequence, not in parallel. The GitHub team initially ran all sub-agents simultaneously, thinking it would be faster. It wasn't. When multiple agents modify the same file, their suggestions conflict. Resolving conflicts manually erases any time saved.

They switched to linear execution. The screen reader agent runs first. If it makes changes, those changes are committed before the keyboard navigation agent runs. If that agent makes changes, they're committed before the colour contrast agent runs. Each agent works on a stable baseline. No conflicts. No overwrites. Slower per-PR, but faster overall because developers don't spend time untangling agent suggestions.
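
A sketch of that pipeline, reusing the SubAgent shape above. getDiff and commitFixes are hypothetical placeholders standing in for whatever Git plumbing the real system uses:

```typescript
// Hypothetical helpers: fetch the PR's current diff, and apply + commit fixes.
async function getDiff(prNumber: number): Promise<string> {
  return ""; // placeholder: fetch the PR branch's current diff
}

async function commitFixes(prNumber: number, agent: string, fixes: Finding[]): Promise<void> {
  // placeholder: apply the fixes and commit them to the PR branch
}

async function runPipeline(prNumber: number, agents: SubAgent[]): Promise<void> {
  for (const agent of agents) {             // fixed order, never parallel
    const diff = await getDiff(prNumber);   // stable baseline for this agent
    const { findings } = await agent.review(diff);
    const fixes = findings.filter((f) => f.suggestedFix !== undefined);
    if (fixes.length > 0) {
      await commitFixes(prNumber, agent.name, fixes); // commit before the next agent runs
    }
  }
}
```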

Explicit Limits Preserve Developer Trust

The fourth finding: agents need hard limits on what they'll attempt. The GitHub agent is explicitly prohibited from refactoring component structure, changing design patterns, or altering user-facing behaviour. It can suggest those changes in comments. It cannot make them automatically.

This constraint seems obvious, but it's easy to skip. When an agent is working well on small fixes, the temptation is to expand its scope. "It's already fixing ARIA labels. Why not let it restructure the navigation while it's there?" Because restructuring navigation requires context the agent doesn't have. User research. Design intent. Business requirements. An agent that oversteps its authority ships subtle bugs that take weeks to surface.

The hard limits aren't technical. They're enforced in prompts and in the task classification system. If a PR requires work outside the agent's defined scope, it gets flagged for human review. The agent's job is to make safe, narrow fixes. Everything else is out of bounds.
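
One way that boundary might look in code. The action categories here are invented for the sketch; per the writeup, the real system enforces scope through prompts and task classification:

```typescript
// Illustrative scope gate: what the agent may apply automatically versus
// what it may only suggest in a comment.
const AUTO_FIX = new Set([
  "add-alt-text",
  "fix-aria-attribute",
  "adjust-contrast-token",
]);

const SUGGEST_ONLY = new Set([
  "refactor-component-structure",
  "change-design-pattern",
  "alter-user-facing-behaviour",
]);

type Decision = "auto-apply" | "comment-only" | "human-review";

function decide(action: string): Decision {
  if (AUTO_FIX.has(action)) return "auto-apply";
  if (SUGGEST_ONLY.has(action)) return "comment-only"; // suggest, never apply
  return "human-review"; // anything unclassified is out of bounds by default
}
```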

What This Means for Production AI Agents

GitHub's accessibility agent isn't significant for its technology. The techniques aren't novel. But the discipline is rare. Most teams deploy agents optimistically and scale back when things break. GitHub did the opposite. They defined constraints first, then built the agent inside those constraints.

The result: 68% of flagged accessibility issues are resolved automatically, with a false positive rate low enough that developers don't disable it. That's not a demo. That's production infrastructure doing real work.

The broader lesson: effective agents aren't about capability. They're about knowing when NOT to act. Complexity thresholds, sub-agent architecture, linear execution, explicit limits - these aren't limitations. They're the design principles that make agents reliable enough to trust.

The full writeup includes implementation details and prompt engineering notes. The code isn't open source, but the lessons are transferable. If you're building agents for production, the constraints matter more than the model.


