A/B testing frameworks for backend features
When most people hear "A/B testing," they picture two different button colors or a rearranged landing page. But some of the most impactful experiments I've run had zero UI changes. They were entirely on the backend — different algorithms, new database queries, refactored service calls. And the tooling for that kind of testing is a different beast.
It's not just feature flags (but it starts there)
Feature flags are the foundation. At the simplest level, you're toggling a code path on or off for a subset of users. But backend A/B testing goes further — you need percentage rollouts, user segmentation, and metric collection baked into the same system.
I've used a few approaches over the years:
- LaunchDarkly — The most polished option. Great SDK support, targeting rules, and built-in analytics hooks. The downside is cost. Once you're past the free tier, it adds up fast for a growing team.
- Unleash — Open-source alternative that covers most of what you need. Self-hosted, so you own your data. The trade-off is operational overhead — you're running another service.
- Homegrown solutions — I've built these too. A database table of experiments, a simple evaluation engine, and some middleware. It works for small teams, but it gets messy fast. You end up reinventing targeting logic, audit trails, and rollback mechanisms that the dedicated tools already handle.
The best framework is the one your team will actually use consistently. A fancy setup that only one person understands is worse than a simple config file everyone can read.
Measuring success on the backend
Frontend A/B tests measure clicks and conversions. Backend tests measure different things entirely:
- Latency — Did the new query plan make things faster or slower under load?
- Error rates — Are we seeing more 5xx responses in the experiment group?
- Business metrics — Settlement times, transaction throughput, processing costs. These are the numbers that actually matter.
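Whatever metrics backend you use, the key is that every observation carries its variant tag so the groups can be compared later. A toy in-memory sketch of that idea (a real system would emit to Prometheus, StatsD, or similar, not hold samples in a dict):

```python
from collections import defaultdict

class VariantMetrics:
    """Toy in-memory recorder that tags every observation with its
    experiment variant, standing in for a real metrics client."""

    def __init__(self):
        self.latencies = defaultdict(list)  # variant -> latency samples
        self.errors = defaultdict(int)      # variant -> 5xx count

    def record_request(self, variant: str, latency_ms: float, status: int):
        self.latencies[variant].append(latency_ms)
        if status >= 500:
            self.errors[variant] += 1

    def p50_latency(self, variant: str) -> float:
        samples = sorted(self.latencies[variant])
        return samples[len(samples) // 2]
```

The point is the shape of the data, not the storage: latency and error rate are only comparable if they are bucketed by variant at record time.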
The tricky part is attribution. When a request flows through multiple services, you need to propagate the experiment context. I typically pass experiment assignments as headers or metadata through the call chain so downstream services know which variant a request belongs to.
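One way to sketch that propagation; the header name and encoding here are illustrative, not any standard:

```python
# Hypothetical header carrying {experiment: variant} pairs down the call chain.
EXPERIMENT_HEADER = "x-experiment-assignments"

def encode_assignments(assignments: dict[str, str]) -> str:
    """Serialize assignments into a single header value, sorted for stability."""
    return ",".join(f"{exp}={var}" for exp, var in sorted(assignments.items()))

def decode_assignments(header_value: str) -> dict[str, str]:
    """Parse the header back into a dict; tolerate an empty or missing value."""
    if not header_value:
        return {}
    return dict(pair.split("=", 1) for pair in header_value.split(","))

def outgoing_headers(incoming: dict[str, str], local: dict[str, str]) -> dict[str, str]:
    """Merge assignments made upstream with any made by this service,
    so downstream services see the full experiment context."""
    merged = decode_assignments(incoming.get(EXPERIMENT_HEADER, ""))
    merged.update(local)
    return {EXPERIMENT_HEADER: encode_assignments(merged)}
```

If you already run distributed tracing, baggage-style trace metadata is a natural place for the same information.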
Canary deployments vs. feature flags
These are related but different. A canary deployment routes a percentage of traffic to a new version of an entire service. A feature flag toggles a specific code path within the same deployment.
I use canaries for infrastructure-level changes — new runtime versions, dependency upgrades, major refactors. Feature flags are for logic changes — a new pricing algorithm, a different fraud scoring model, an alternative retry strategy.
Combining both gives you the most control. Deploy the new code behind a flag, canary the deployment to 5% of traffic, then gradually enable the flag for more users within that canary group.
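The layered gating can be sketched as two small functions; the experiment name and hash scheme below are illustrative choices, not a prescribed design:

```python
import hashlib

def in_rollout(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) by hashing
    (experiment, user_id), so each experiment buckets independently."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < percent

def variant_for(user_id: str, deployment_is_canary: bool, flag_percent: float) -> str:
    """The new code path runs only inside the canary deployment, and only
    for users inside the flag's rollout percentage."""
    if deployment_is_canary and in_rollout(user_id, "new-pricing", flag_percent):
        return "new"
    return "control"
```

Both gates are deterministic, so a given user sees a consistent variant across requests as long as they keep hitting the canary.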
The gotchas nobody warns you about
Stateful services are the biggest headache. If your service maintains in-memory state or sticky sessions, you can't just flip a flag mid-request. You need to evaluate the experiment assignment once and carry it through the entire session lifecycle.
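A minimal sketch of pinning assignments to the session, assuming some underlying evaluation function supplied by your flag system (the `Session` shape here is invented for illustration):

```python
class Session:
    """Evaluate each experiment assignment once, then reuse it for every
    request in the session, so a mid-session flag change can't flip behavior."""

    def __init__(self, user_id: str, assign_variant):
        self.user_id = user_id
        self._assign = assign_variant       # e.g. your flag SDK's evaluator
        self._assignments: dict[str, str] = {}

    def variant(self, experiment: str) -> str:
        # First lookup pins the variant for the session's lifetime.
        if experiment not in self._assignments:
            self._assignments[experiment] = self._assign(self.user_id, experiment)
        return self._assignments[experiment]
```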
Database migrations during experiments are painful. If variant B requires a new column or table, you need that schema in place for everyone, even though only a fraction of traffic uses it. I've learned to keep schema changes decoupled from experiment logic — migrate first, flag second.
Cache invalidation will bite you. If variant A and variant B produce different responses for the same cache key, you'll serve stale or incorrect data. The fix is to include the variant in the cache key, but that effectively splits your cache and reduces hit rates. Plan for the capacity impact.
Interaction effects between experiments are real. Running two backend experiments simultaneously on overlapping user segments can produce confounding results. Most mature frameworks support mutual exclusion groups — use them.
When to roll your own
Honestly? Almost never. But if your constraints are unusual — air-gapped environments, extreme latency requirements, regulatory restrictions on third-party services — a lightweight homegrown solution might be justified. Just keep it simple. A config file, a hash-based assignment function, and structured logging will get you surprisingly far.
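Putting those three pieces together, a minimal homegrown sketch might look like this. The experiment names and percentages are invented, and the config dict would normally live in a file:

```python
import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("experiments")

# In a real setup this would be loaded from a config file; inlined for the sketch.
EXPERIMENTS = {
    "new-retry-strategy": {"percent": 10.0},
    "fraud-model-v2": {"percent": 50.0},
}

def assignment(user_id: str, experiment: str) -> str:
    """Deterministic hash-based assignment: the same user always lands in
    the same variant, with no assignment storage required."""
    config = EXPERIMENTS[experiment]
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    variant = "treatment" if bucket < config["percent"] else "control"
    # Structured log line so assignments can be joined against metrics later.
    log.info(json.dumps({"event": "assignment", "experiment": experiment,
                         "user": user_id, "variant": variant}))
    return variant
```

That's the whole system: deterministic bucketing means no assignment database, and the structured log is what lets you tie variants back to the latency and error metrics discussed earlier.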
The important thing is to start experimenting on the backend at all. Too many teams treat their server-side code as a monolith that changes only through big-bang releases. Small, measured, reversible changes are just as valuable behind the API as they are in front of it.