Vol. IV · No. 04 Monday · 29 June 2026
engineering Dispatch 4 min read · 27 Apr 2026

Feature Flag Hygiene: Naming, Scoping, and Cleanup

Feature flags rot faster than any other code. The teams that keep them useful for years instead of months treat hygiene as a first-class engineering practice, not an afterthought.

engineering · Curiosity

Feature flags solve a real problem (decoupling deploy from release) and create a new one: every flag is a fork in your codebase that someone eventually has to merge or delete. Six months in, a busy project can easily carry 200 flags, half of which have been "100% on" since launch, and nobody is sure which half. The teams that keep flagging useful past year one are not smarter; they are more disciplined about a handful of small things.

Name flags after what they enable, not how they're implemented

The first thing to get right is the name. Bad names: new_checkout, v2_dashboard, refactor_auth. Good names: checkout_apple_pay, dashboard_realtime_metrics, auth_passkeys. The rule: a flag's name should describe a user-visible change in present tense, not an engineering project in the past or future.

Why this matters: implementation-flavored names go stale almost immediately. new_checkout is meaningful for a week and meaningless for the next two years; checkout_apple_pay stays accurate for as long as the flag exists. When a developer runs grep -r checkout_apple_pay six months later, they find every reference and know exactly which feature they're touching. With new_checkout, they have to read the surrounding code just to guess what "new" referred to.

A useful convention: prefix by surface area, such as billing_*, auth_*, or onboarding_*. This makes flag dashboards browsable and makes it obvious when one feature has spawned five flags, which is usually a sign that the feature itself is too coarse.
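One way to enforce the naming rule is a small lint that runs when a flag is created or in CI. The sketch below assumes lowercase snake_case names with a surface-area prefix; the prefix list, the set of "stale" words (new, v2, refactor, and so on), and the helper name naming_problems are illustrative assumptions, not a standard.

```python
import re

# Assumed convention: a surface-area prefix followed by a snake_case feature name.
# The prefix list is illustrative; replace it with your own product surfaces.
ALLOWED_PREFIXES = ("billing", "auth", "onboarding", "checkout", "dashboard")

# Words that usually describe an engineering project, not a user-visible change.
STALE_WORDS = {"new", "old", "v2", "v3", "refactor", "temp", "legacy"}

NAME_PATTERN = re.compile(r"^([a-z0-9]+)_([a-z0-9]+(?:_[a-z0-9]+)*)$")


def naming_problems(flag_name: str) -> list[str]:
    """Return human-readable problems with a flag name; an empty list means it passes."""
    match = NAME_PATTERN.match(flag_name)
    if match is None:
        return ["not lowercase snake_case with at least one underscore"]
    problems = []
    prefix = match.group(1)
    if prefix not in ALLOWED_PREFIXES:
        problems.append(f"unknown surface prefix {prefix!r}")
    stale = sorted(set(flag_name.split("_")) & STALE_WORDS)
    if stale:
        problems.append(f"implementation-flavored words: {', '.join(stale)}")
    return problems


for name in ("checkout_apple_pay", "auth_passkeys", "new_checkout", "v2_dashboard"):
    print(name, naming_problems(name))
```

A lint like this cannot judge whether a name describes a user-visible change, but it reliably catches the most common stale patterns before they ship.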

Scope flags to one decision, not many

The second discipline is scope. A flag should represent exactly one rollout decision. The temptation is to overload: enable_new_dashboard ends up gating not just the new dashboard but also the new metrics endpoint, the new caching layer, and three telemetry events that the old dashboard never had. When something breaks, you cannot turn off the broken thing without turning off three working things.

The rule: if you find yourself thinking "I need to roll back X but keep Y," your flag was too broad. Split it. Two flags that move together are cheap; one flag that moves five things together makes a surgical rollback impossible. FlagBit makes this cheap by design, since flags are small and cost nothing to add, but the discipline applies regardless of which provider you use.
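To make the scoping rule concrete, here is a hypothetical version of the dashboard example with one flag per decision instead of a single enable_new_dashboard. The flag names, the build_dashboard function, and the response fields are invented for illustration; the point is only that each optional piece checks its own flag.

```python
from typing import Callable

IsEnabled = Callable[[str], bool]


def build_dashboard(is_enabled: IsEnabled) -> dict:
    """Assemble a dashboard response; each optional piece checks its own flag."""
    response: dict = {"widgets": ["summary"]}  # pre-existing behavior, always present
    if is_enabled("dashboard_realtime_metrics"):
        response["widgets"].append("realtime_metrics")
    if is_enabled("dashboard_metrics_endpoint"):
        response["metrics_url"] = "/api/metrics"  # hypothetical endpoint path
    if is_enabled("dashboard_response_cache"):
        response["cache_ttl_seconds"] = 30
    if is_enabled("dashboard_usage_telemetry"):
        response["telemetry"] = ["dashboard_viewed"]
    return response


# Usage: the caching layer misbehaves, so only its flag is turned off.
flag_state = {
    "dashboard_realtime_metrics": True,
    "dashboard_metrics_endpoint": True,
    "dashboard_response_cache": False,
    "dashboard_usage_telemetry": True,
}
print(build_dashboard(lambda key: flag_state.get(key, False)))
```

Turning off dashboard_response_cache removes only the cache; the realtime widget, the metrics endpoint, and the telemetry events keep running.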

Tag flags by lifecycle, and enforce it

Every flag falls into one of four categories: release flag (short-lived, used to ramp a feature), experiment flag (medium-lived, A/B testing), operational flag (kill switch, expected to live forever), or permission flag (entitlement, expected to live forever). The mistake is treating them all the same.

Release flags should be deleted within 30 days of reaching 100% in production. Experiment flags should be deleted when the experiment closes. Operational and permission flags are not expected to be removed, but they should be reviewed quarterly to confirm they are still needed.

The enforcement mechanism is small, but it has to exist. Tag every flag with its category at creation time. Run a weekly script that lists release flags that have been at 100% for more than 30 days; these are deletion candidates. Send the list to the engineer who created each flag. Without this loop, your "release" flags become permanent and your codebase rots.
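The weekly loop can be a short script over an export of your flag config. This is a minimal sketch that assumes each flag record carries its lifecycle tag, its owner, its current rollout percentage, and the time it reached 100%; the Flag fields and deletion_candidates are hypothetical names, and delivering the list (email, chat, ticket) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class FlagKind(Enum):
    RELEASE = "release"          # short-lived: delete within 30 days of reaching 100%
    EXPERIMENT = "experiment"    # delete when the experiment closes
    OPERATIONAL = "operational"  # kill switch: review quarterly
    PERMISSION = "permission"    # entitlement: review quarterly


@dataclass
class Flag:
    key: str
    kind: FlagKind
    owner: str                                  # engineer who created the flag
    rollout_percent: int
    full_rollout_at: Optional[datetime] = None  # when it reached 100%, if ever


RELEASE_GRACE = timedelta(days=30)


def deletion_candidates(flags: list[Flag], now: datetime) -> dict[str, list[str]]:
    """Group release flags that have sat at 100% for more than 30 days by owner."""
    by_owner: dict[str, list[str]] = {}
    for flag in flags:
        if (
            flag.kind is FlagKind.RELEASE
            and flag.rollout_percent == 100
            and flag.full_rollout_at is not None
            and now - flag.full_rollout_at > RELEASE_GRACE
        ):
            by_owner.setdefault(flag.owner, []).append(flag.key)
    return by_owner
```

Keying the result by owner turns "send the list to the engineer who created it" into a simple loop over the returned dictionary.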

Default to the safe state, always

When a flag service is unreachable, what does your code do? The wrong answer: "depends on the flag." The right answer: "always falls back to the pre-existing behavior, which is safe by definition because it was the production state before this flag existed."

This sounds obvious until you see real code where the default for checkout_apple_pay is true because someone tested it locally with the SDK pointed at staging. Then production loses its connection to the flag service, the SDK falls back to that in-code default, and Apple Pay is enabled for users in markets where it isn't supported. Always read the default twice. The default is the answer your code gives when everything else has failed; test it the way you would test a fire alarm.
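A sketch of one way to keep defaults honest, assuming your SDK client exposes a boolean lookup that raises when the service is unreachable. FlagClient.get_bool, FlagServiceUnavailable, SAFE_DEFAULTS, is_enabled, and fire_drill are all hypothetical stand-ins, not any particular vendor's API. Declaring every release flag's fallback in one registry gives reviewers a single place to read the defaults twice, and the fire drill simulates an outage and reports any flag that would not fall back to the pre-existing behavior.

```python
from typing import Protocol


class FlagClient(Protocol):
    """Shape assumed for the flag SDK client; real SDKs name this differently."""

    def get_bool(self, key: str) -> bool: ...


class FlagServiceUnavailable(Exception):
    """Stand-in for whatever error your SDK raises when it cannot reach the service."""


# Hypothetical registry of release flags. Release flags gate new behavior, so the
# pre-existing (safe) behavior is always off.
SAFE_DEFAULTS: dict[str, bool] = {
    "checkout_apple_pay": False,          # before the flag: no Apple Pay
    "dashboard_realtime_metrics": False,  # before the flag: static metrics
}


def is_enabled(client: FlagClient, key: str) -> bool:
    """Ask the flag service; if it is unreachable, fall back to the safe default."""
    default = SAFE_DEFAULTS[key]  # an unregistered flag raises KeyError on purpose
    try:
        return bool(client.get_bool(key))
    except FlagServiceUnavailable:
        return default


class UnreachableClient:
    """Fire-drill client that behaves as if the flag service were down."""

    def get_bool(self, key: str) -> bool:
        raise FlagServiceUnavailable(key)


def fire_drill() -> list[str]:
    """Return release flags that do NOT fall back to off when the service is down."""
    client = UnreachableClient()
    return [key for key in SAFE_DEFAULTS if is_enabled(client, key) is not False]
```

Running fire_drill() in CI makes the fire alarm test something that fails a build, rather than something people have to remember to do.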

Audit the call sites, not just the dashboard

The flag dashboard tells you what flags exist. It does not tell you which ones still have references in code. Once a quarter, run a diff: every flag in your service should be referenced in at least one place in the repo, and every flag reference in the repo should match a flag in the service. The mismatches are where the rot lives.

Flags in the service with zero code references are dead and should be archived. Flags in code with no service entry are time bombs: they evaluate to the default forever, so one code path is unreachable, and a developer who thinks they can still turn it on cannot. Both directions matter, and both are easy to find with a script and a JSON dump of your flag config.
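The quarterly diff is two set differences. This sketch assumes the flag config can be dumped to JSON shaped like {"flags": [{"key": ...}]} and that code evaluates flags through a call such as is_enabled("key") or is_enabled(client, "key"). Both the dump format and the call pattern are assumptions to adapt to your setup, and a regex scan will miss flag keys that are built dynamically at runtime.

```python
import json
import re
from pathlib import Path

# Assumed call-site shape: is_enabled("key") or is_enabled(client, "key").
CALL_PATTERN = re.compile(r"""is_enabled\(\s*(?:\w+\s*,\s*)?["']([a-z0-9_]+)["']""")


def flags_in_service(config_dump: Path) -> set[str]:
    """Read flag keys from a JSON dump shaped like {"flags": [{"key": ...}, ...]}."""
    data = json.loads(config_dump.read_text())
    return {entry["key"] for entry in data["flags"]}


def flags_in_code(repo_root: Path, suffixes: tuple[str, ...] = (".py",)) -> set[str]:
    """Collect every flag key referenced at a call site under repo_root."""
    found: set[str] = set()
    for path in repo_root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            found.update(CALL_PATTERN.findall(path.read_text(errors="ignore")))
    return found


def audit(config_dump: Path, repo_root: Path) -> dict[str, list[str]]:
    """Compare the service's flags with the repo's call sites, in both directions."""
    service = flags_in_service(config_dump)
    code = flags_in_code(repo_root)
    return {
        "dead_in_service": sorted(service - code),       # archive these
        "missing_from_service": sorted(code - service),  # time bombs: always the default
    }
```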

Two-stage retirement

The final discipline is the deletion ritual. When a release flag has been at 100% for 30 days, do not delete it yet. First archive it in the dashboard: the flag still evaluates to its current value, but it is greyed out and warns on access. Wait two weeks. If nothing breaks and no one complains, delete the flag from the dashboard and open a PR that removes its call sites. Both halves matter: a deleted dashboard entry with code that still calls the SDK gets the default response forever, and deleted code with a live dashboard entry confuses the next developer who looks.
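The two-stage ritual is a small state machine. A minimal sketch, assuming you track when a release flag reached 100% and when it was archived; the Stage enum, next_retirement_step, and the complaint counter are hypothetical, while the 30-day and two-week waits come from the process described above.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Stage(Enum):
    ACTIVE = "active"      # evaluating normally
    ARCHIVED = "archived"  # still evaluates, but greyed out and warns on access
    DELETED = "deleted"    # gone from the dashboard and the code


FULL_ROLLOUT_WAIT = timedelta(days=30)  # time at 100% before archiving
ARCHIVE_WAIT = timedelta(weeks=2)       # time archived before deleting

ARCHIVE = "archive in dashboard"
DELETE = "delete from dashboard and open PR removing call sites"


def next_retirement_step(
    stage: Stage,
    now: datetime,
    full_rollout_at: Optional[datetime] = None,
    archived_at: Optional[datetime] = None,
    complaints: int = 0,
) -> Optional[str]:
    """Return the next action for a release flag, or None if it should wait."""
    if stage is Stage.ACTIVE:
        if full_rollout_at is not None and now - full_rollout_at >= FULL_ROLLOUT_WAIT:
            return ARCHIVE
        return None
    if stage is Stage.ARCHIVED:
        if complaints > 0:
            return None  # something broke or someone objected: hold and investigate
        if archived_at is not None and now - archived_at >= ARCHIVE_WAIT:
            return DELETE  # both halves in one step: dashboard entry and call sites
        return None
    return None  # already deleted; nothing left to do
```

Bundling the dashboard deletion and the call-site PR into a single DELETE step is a deliberate choice: it keeps either half from being done without the other.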

This two-stage process feels slow. It is faster than discovering, two years in, that 60 percent of your flags are zombies and nobody knows which 40 percent are still load-bearing.

The compounding cost

None of these practices are clever. They are unglamorous — naming conventions, tags, weekly scripts, deletion rituals. The reason they matter is that flags compound. A team that adds 30 flags a quarter and deletes none has 360 flags after three years; a team that adds 30 and deletes 25 has 60. The first team's release process is a nightmare and they cannot reason about state. The second team's release process is identical to year one. The hygiene is the difference, and the hygiene is small choices made consistently for a long time.
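The arithmetic behind that comparison is a constant net rate multiplied by the number of quarters. A tiny sketch, assuming both teams start from zero flags and add and delete at steady rates (flags_after is a hypothetical helper):

```python
def flags_after(quarters: int, added_per_quarter: int, deleted_per_quarter: int) -> int:
    """Live flags after `quarters`, starting from zero with constant add/delete rates."""
    if deleted_per_quarter > added_per_quarter:
        raise ValueError("this simple model assumes deletions never outpace additions")
    return (added_per_quarter - deleted_per_quarter) * quarters


# Year-by-year comparison of the two teams described above.
for year in (1, 2, 3):
    quarters = 4 * year
    print(year, flags_after(quarters, 30, 0), flags_after(quarters, 30, 25))
# prints:
# 1 120 20
# 2 240 40
# 3 360 60
```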

Written by

Vera

Engineering researcher. APIs, databases, infrastructure, systems design.
