A FastAPI backend with over 1,000 tests was taking 20 minutes to run in CI. The obvious culprit: not enough parallelism. The actual culprit: a 20-second cold import at the start of every test shard.
This is the detailed breakdown of how a developer cut CI time in half by optimising the thing nobody looks at first - the startup overhead before a single test runs.
Why Parallelism Didn't Help
The standard advice for slow test suites is to parallelise. Run tests across multiple workers using pytest-xdist or similar. If you have 1,000 tests and 10 workers, you should get close to a 10x speedup. In theory.
In practice, this backend saw almost no improvement from xdist. Adding more workers didn't reduce wall-clock time. The reason: every worker had to import the entire application before running its first test. That import took 20 seconds. With xdist, you're paying that 20-second tax on every worker, all at once. Ten workers meant ten 20-second imports running in parallel, but the overall pipeline couldn't proceed until the slowest worker finished importing.
The breakthrough came from switching to serial sharding instead of parallel workers. Instead of spinning up ten workers at once, the CI pipeline was split into four separate jobs. Each job ran its subset of the tests sequentially, so it paid the 20-second import cost exactly once. Total import overhead: 4 × 20 = 80 seconds across four shards. Compare that to xdist: 10 × 20 = 200 seconds of import work across ten workers (20 seconds each, with every worker waiting for the slowest to finish).
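The overhead figures here are aggregate seconds of import work, not wall-clock time. A minimal sketch of the arithmetic, using only the 20-second import and the process counts quoted in this article (the function name is illustrative):

```python
IMPORT_SECONDS = 20  # cold import time per process, as quoted in this article


def total_import_overhead(process_count: int, import_seconds: int = IMPORT_SECONDS) -> int:
    """Aggregate seconds spent importing the application across all processes."""
    return process_count * import_seconds


serial_shards = total_import_overhead(4)   # four CI jobs, one import each
xdist_workers = total_import_overhead(10)  # ten xdist workers, one import each

print(f"4 shards:   {serial_shards} s of import work")
print(f"10 workers: {xdist_workers} s of import work")
```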
What Was Taking 20 Seconds?
The import bottleneck was traced to a combination of heavy dependencies and eager initialisation at import time. FastAPI applications often import Pydantic models, database ORM layers, and third-party SDKs at startup. Each of those can take a few seconds to load, and they compound. In this case, the application was importing:
- SQLAlchemy models with complex relationships
- AWS SDK clients (boto3) that initialise service endpoints
- A large Pydantic schema registry used for validation
- Several cryptographic libraries with native extensions
None of these imports were unnecessary. They're all used in production. But in a test environment, most tests don't need all of them. A test that validates API input schemas doesn't need the database layer. A test that checks business logic doesn't need AWS clients. But because everything was imported at the module level, every test paid for everything.
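One way to find out which imports dominate is CPython's `-X importtime` flag (available since Python 3.7), which prints per-module import timings to stderr. The helper below is a sketch, not the tooling used in this project: it runs the import in a fresh interpreter so the measurement is cold, then ranks modules by cumulative time. The function name and the `app.main` module in the usage note are hypothetical.

```python
import subprocess
import sys


def slowest_imports(module: str, top: int = 10) -> list[tuple[str, int]]:
    """Return (module_name, cumulative_microseconds) pairs, slowest first."""
    # A fresh interpreter guarantees nothing is already cached in sys.modules.
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    timings = []
    for line in result.stderr.splitlines():
        # Data lines look like: "import time:   123 |   456 |   package.module"
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative_us = int(parts[1].strip())
        except ValueError:
            continue  # the header row has column names instead of numbers
        timings.append((parts[2].strip(), cumulative_us))
    timings.sort(key=lambda pair: pair[1], reverse=True)
    return timings[:top]


if __name__ == "__main__":
    for name, micros in slowest_imports("json"):
        print(f"{micros / 1000:8.1f} ms  {name}")
```

Pointing it at the application's entry module, for example `slowest_imports("app.main")`, should surface the kind of heavy dependencies listed above.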
The Optimisations That Worked
The developer applied three changes that brought the pipeline from 20 minutes to 10 minutes:
1. Lazy imports for heavy dependencies. Instead of importing boto3 at the module level, the code now imports it inside the functions that actually use it. This defers the import cost until a test actually needs AWS clients. Most tests don't, so they skip that 3-second initialisation entirely.
2. Mocked database connections in the test fixture. The ORM layer was initialising a connection pool at import time, even in tests that used an in-memory database. The connection pool setup took 5 seconds and was entirely redundant. Mocking the connection in the test fixture removed that overhead.
3. Four serial shards instead of xdist parallelism. The CI pipeline was split into four jobs, each running a quarter of the tests. Each job ran its tests sequentially, so the total import overhead was 4 × 20 seconds = 80 seconds. This beat xdist because xdist's parallelism only helps if the bottleneck is test execution time, not startup time.
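The first change can be sketched as follows. This is an illustration of the lazy-import pattern, not the project's actual code: the function names, the key layout, and the bucket parameter are all hypothetical. The point is that `import boto3` only executes when `get_s3_client()` is first called, so tests that exercise pure logic never pay for it.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_s3_client():
    """Create the S3 client on first use; boto3 is imported only here."""
    import boto3  # deferred import: costs nothing until this function runs

    return boto3.client("s3")


def build_report_key(user_id: int, report_name: str) -> str:
    """Pure business logic: needs no AWS client, so it triggers no boto3 import."""
    return f"reports/{user_id}/{report_name}.json"


def upload_report(user_id: int, report_name: str, body: bytes, bucket: str) -> str:
    """Only this code path pays the boto3 import cost."""
    key = build_report_key(user_id, report_name)
    get_s3_client().put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

A test for `build_report_key` imports this module without touching boto3 at all. The cached accessor also gives tests a single place to substitute a fake client.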
Why This Pattern Shows Up Everywhere
FastAPI, Flask, Django - any Python web framework hits this problem once the codebase reaches a certain size. You add a new dependency, import it at the module level, and the test suite gets 2 seconds slower. Do that ten times and you've added 20 seconds to every CI run. Nobody notices the individual additions. Everyone notices when the pipeline takes 20 minutes.
The same pattern appears in JavaScript with Webpack or Vite, in Ruby with Rails, in Java with Spring Boot. Any framework that does heavy initialisation at import/startup time will penalise test suites that import the world for every test. The solution is always the same: lazy imports, dependency injection, and strategic sharding.
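Dependency injection in this context can be as simple as passing in a connection factory instead of creating a connection at import time. The sketch below uses the standard library's `sqlite3` as a stand-in for a real ORM layer; `UserRepository` and `in_memory_connection` are hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Callable, Optional

ConnectionFactory = Callable[[], sqlite3.Connection]


class UserRepository:
    """Opens its connection on first use, via whatever factory it was given."""

    def __init__(self, connect: ConnectionFactory) -> None:
        self._connect = connect
        self._conn: Optional[sqlite3.Connection] = None

    @property
    def conn(self) -> sqlite3.Connection:
        if self._conn is None:  # nothing is opened at import or construction time
            self._conn = self._connect()
        return self._conn

    def add_user(self, name: str) -> int:
        cursor = self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        self.conn.commit()
        return cursor.lastrowid

    def count_users(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def in_memory_connection() -> sqlite3.Connection:
    """Test-only factory: a throwaway in-memory database with the schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    return conn


repo = UserRepository(in_memory_connection)
repo.add_user("ada")
print(repo.count_users())
```

Production code would pass a factory that returns a pooled connection. Tests pass the in-memory one, and neither pays for a pool it does not use.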
When to Optimise Startup vs Execution
Most developers optimise test execution time first. They parallelise, they mock slow dependencies, they cache database fixtures. Those optimisations help, but they target the wrong bottleneck when a 20-second startup is paid by every process and no amount of parallelism shrinks it. With 10 minutes of test execution, halving execution time saves 5 minutes once. Halving startup time saves 10 seconds in every shard and every worker, on every run.
The heuristic: if adding more parallelism doesn't reduce wall-clock time, your bottleneck is startup. Profile the import chain, find the heavy dependencies, and make them lazy. If parallelism does help but you're still hitting resource limits (CPU, memory, I/O), strategic sharding beats infinite parallelism.
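Sharding itself needs very little machinery. The sketch below is one possible `conftest.py` approach, not necessarily how this pipeline did it: each CI job sets two environment variables (the names `CI_SHARD_INDEX` and `CI_SHARD_COUNT` are assumptions), and pytest's `pytest_collection_modifyitems` hook keeps only the tests whose ID hashes to that shard. A stable hash (`zlib.crc32`) is used because Python's built-in `hash()` for strings varies between processes.

```python
import os
import zlib


def shard_of(test_id: str, shard_count: int) -> int:
    """Map a test ID to a shard number that is stable across processes."""
    return zlib.crc32(test_id.encode("utf-8")) % shard_count


def select_shard(test_ids: list[str], shard_index: int, shard_count: int) -> list[str]:
    """Return the test IDs that belong to the given shard."""
    return [t for t in test_ids if shard_of(t, shard_count) == shard_index]


def pytest_collection_modifyitems(config, items):
    """pytest hook: deselect every collected test that belongs to another shard."""
    shard_count = int(os.environ.get("CI_SHARD_COUNT", "1"))
    shard_index = int(os.environ.get("CI_SHARD_INDEX", "0"))
    if shard_count <= 1:
        return  # sharding disabled: run everything
    keep, drop = [], []
    for item in items:
        target = keep if shard_of(item.nodeid, shard_count) == shard_index else drop
        target.append(item)
    if drop:
        config.hook.pytest_deselected(items=drop)  # report them as deselected
    items[:] = keep  # pytest expects the list to be modified in place
```

Hash-based assignment keeps shards roughly even by test count, not by duration. If a few slow tests cluster in one shard, assigning tests by recorded timings would balance better.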
What This Means for CI Budgets
CI time is expensive. GitHub Actions, CircleCI, GitLab CI - they all charge per minute. A 20-minute CI run costs twice as much as a 10-minute run. If your team merges ten PRs a day and each one triggers a single run, that's 100 minutes saved per day, 500 minutes per working week, 26,000 minutes per year. At per-minute CI pricing, that is a recurring saving that grows with runner size and with every extra run a PR triggers, just from optimising imports.
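The saving is the difference between the two run lengths multiplied by the number of runs. A small sketch of that arithmetic, assuming one CI run per merged PR, five working days a week, and 52 weeks a year (all three are assumptions; real pipelines often run several times per PR):

```python
def minutes_saved_per_day(before_min: int, after_min: int, runs_per_day: int) -> int:
    """Minutes of CI time saved per day when each run gets shorter."""
    return (before_min - after_min) * runs_per_day


per_day = minutes_saved_per_day(before_min=20, after_min=10, runs_per_day=10)
per_week = per_day * 5    # assumed five working days
per_year = per_week * 52  # assumed 52 working weeks

print(per_day, per_week, per_year)
```

Multiplying the yearly figure by your provider's per-minute rate gives the cost saving for your own setup.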
It's also a developer experience problem. A 20-minute CI run means developers wait 20 minutes for feedback. A 10-minute run means they get feedback while the context is still in their head. That's the difference between staying in flow and switching tasks. Faster CI isn't just cheaper - it's more humane.