Ten calls to one: rebuilding feature flags from the inside
The cheapest network call is the one you don't make. A note on how a small architectural shift collapsed an entire class of latency, and what it taught me about reading systems backwards.
Most of the load-bearing decisions in our platform pass through a feature flag at some point. We recently identified that our feature flag (aka feature toggle) architecture, while fully functional, was becoming a bottleneck at scale.
Under high traffic, multiple feature flags each performing their own check over HTTP add up quickly. Even if a single API call takes ~30ms, ten flags checked sequentially mean ~300ms added to every request, and the cost compounds under heavy load. Not ideal when performance is critical.
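For a concrete picture, the old path looked something like the sketch below. This is illustrative, not our exact code; `checkFlag` stands in for whatever remote call the provider SDK makes.
// Before: every flag check is its own HTTP round trip.
async function handleRequest(ctx) {
  if (await client.checkFlag('new-checkout', ctx)) { /* ~30ms */ }
  if (await client.checkFlag('dark-mode', ctx)) { /* ~30ms */ }
  // ...eight more awaits like these, and the request is ~300ms
  // slower before any real work has started.
}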
§The shape of the problem
I want to be careful not to make this sound like a heroic rewrite. It wasn't. The honest summary is: we had been treating flag evaluation as a remote operation when, for our access patterns, it could safely be a local one.
We did not rely on gut feel: observability data showed exactly how many provider API calls each request was making, which made the problem visible and the solution undeniable.
Once that re-framing landed, the rest followed almost mechanically. Pull the rule set down once at startup. Inject it into the dependency container. Evaluate in-process. Re-fetch on a schedule. Fall back gracefully. None of those moves is clever on its own. What was clever, if anything was, was noticing that the network was doing work the language could do faster.
§What the fix looked like
The new path, simplified considerably:
// One call at boot, refreshed on a timer (see below).
let rules = await client.fetchRules({ env, project });
function isOn(name, ctx) {
  const rule = rules[name];
  if (!rule) return false; // unknown flag: default off
  return evaluate(rule, ctx); // pure, in-process
}
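The refresh half is nearly as small. A sketch, not our exact code: `startRefreshLoop` and `REFRESH_MS` are illustrative names, and `rules` is the mutable binding from the snippet above.
// Re-fetch on a schedule; on failure, keep serving the last good rule set.
const REFRESH_MS = 60_000; // illustrative interval: tune to your staleness budget
function startRefreshLoop() {
  setInterval(async () => {
    try {
      rules = await client.fetchRules({ env, project }); // swap the whole set at once
    } catch (err) {
      // Fall back gracefully: slightly stale rules beat a failed request.
      console.warn('flag refresh failed; keeping previous rules', err);
    }
  }, REFRESH_MS);
}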
That's it. No per-check network calls. No cascading latency. And no need for a caching layer that risks going stale. The interesting work was everywhere else: convincing ourselves it was safe, building the trust scaffolding, writing the tests that would catch a regression nobody would otherwise notice.
§The trust scaffolding
A change like this only works if the next person doesn't have to take it on faith. So most of the calendar time went into things that didn't move the metric:
- A drift detector that compares local decisions against remote decisions on a sampled slice of traffic, and shouts if they disagree (see the sketch after this list).
- A flag-by-flag rollout, gated on its own flag. Yes, feature-flagged feature flags.
- A migration log written in plain English, not just commit messages, so a teammate joining six months later can read the why.
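To make the first item concrete: a minimal sketch of the drift detector, assuming the `checkFlag` remote call from earlier and a hypothetical `reportDrift` hook into whatever alerting you already run.
// On a sampled slice of traffic, evaluate both paths and shout on disagreement.
const SAMPLE_RATE = 0.01; // compare on roughly 1% of checks
function isOnWithDriftCheck(name, ctx) {
  const local = isOn(name, ctx);
  if (Math.random() < SAMPLE_RATE) {
    // Fire-and-forget, so the comparison never adds latency to the hot path.
    client
      .checkFlag(name, ctx)
      .then((remote) => {
        if (remote !== local) reportDrift({ flag: name, local, remote });
      })
      .catch(() => {}); // drift checking must never break a request
  }
  return local;
}
The fire-and-forget shape matters: the whole point was getting the round trip out of the request path, so the comparison rides alongside the request rather than in front of it.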
I think this is the part most writeups skip. The diff is small. The confidence to ship the diff is the project.
§What I took from it
Two things, mostly. The first is a willingness to ask, of any remote call, whether it has to be remote. The answer is usually yes. Sometimes it isn't, and when it isn't the savings compound in interesting ways.
The second is a healthier respect for the soft infrastructure around a change: the dashboards, the migration notes, the people you talk to before opening a PR. The code is the smallest part. The change is the bigger thing the code is wrapped in.
Architecture is mostly the choice of where to put the seams. Everything else is taste.
Feature flags are essential for continuous deployment, but how you implement them matters. If your flags sit in the request path, remote evaluation is rarely the right choice. Move the logic closer to the application and performance improves immediately.