Retry mechanisms for transaction failures

Retry logic sounds simple. Request fails, try again. But in payment systems, a naive retry can charge someone twice, trigger fraud alerts, or create orphaned transactions that take days to reconcile. I've debugged enough of these to know that the edge cases are the whole story.

The basics: backoff, jitter, and circuit breakers

If you're not already using exponential backoff with jitter, start there. A fixed retry interval means all your failed requests retry simultaneously, creating a thundering herd that makes the problem worse. Exponential backoff spreads the retries over time, and jitter randomizes them so they don't all land at the same moment.

A simple formula: delay = min(base * 2^attempt + random(0, jitter), max_delay)
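
In code, that's only a few lines. A minimal sketch in Python (the base, jitter, and max_delay defaults are illustrative, not recommendations):

    import random

    def backoff_delay(attempt: int, base: float = 0.5, jitter: float = 0.5,
                      max_delay: float = 30.0) -> float:
        # delay = min(base * 2^attempt + random(0, jitter), max_delay)
        return min(base * 2 ** attempt + random.uniform(0, jitter), max_delay)

    # Delays for the first five attempts, in seconds:
    print([round(backoff_delay(n), 2) for n in range(5)])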

Circuit breakers sit on top of retry logic. If a downstream service fails repeatedly, the circuit breaker "opens" and short-circuits requests for a cooldown period instead of hammering a service that's already down. This protects both you and the downstream system.
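
Here's a minimal sketch, assuming a single-threaded caller (a production breaker also needs a half-open trial state and thread safety; the threshold and cooldown values are arbitrary):

    import time

    class CircuitBreaker:
        def __init__(self, threshold: int = 5, cooldown: float = 30.0):
            self.threshold = threshold  # consecutive failures before opening
            self.cooldown = cooldown    # seconds to short-circuit while open
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.cooldown:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None   # cooldown over: let one request probe
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0           # any success closes the circuit
            return result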

The goal of retry logic isn't to make every request succeed. It's to recover from transient failures without making permanent failures worse.

Idempotency keys: the non-negotiable

Every payment request your system sends should include an idempotency key — a unique identifier that tells the downstream system "if you've already processed this request, return the original result instead of processing it again."

Without idempotency keys, a timeout becomes a nightmare. Did the charge go through? You don't know. If you retry, you might double-charge. If you don't retry, you might record as failed a payment that actually succeeded.

Implementation isn't complicated, but the details matter (a sketch follows the list):

  • Generate the key client-side, tied to the intent (e.g., hash of order ID + amount + timestamp). Don't use random UUIDs — if the client crashes and retries, it needs to produce the same key.
  • Store the key and result server-side for a reasonable TTL (24-72 hours for payment operations).
  • Return the cached result for duplicate keys, including the original status code. A retry should be indistinguishable from the original request.
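
A minimal sketch of all three points, assuming an in-memory store (a real system would back this with a database and enforce the TTL with real expiry):

    import hashlib
    import time

    def idempotency_key(order_id: str, amount_cents: int, created_at: str) -> str:
        # Deterministic: the same intent always produces the same key,
        # even if the client crashes and rebuilds the request.
        payload = f"{order_id}:{amount_cents}:{created_at}"
        return hashlib.sha256(payload.encode()).hexdigest()

    _store = {}                  # key -> (timestamp, status_code, body)
    TTL_SECONDS = 48 * 3600      # somewhere in the 24-72 hour window

    def process_once(key, charge_fn):
        now = time.time()
        hit = _store.get(key)
        if hit and now - hit[0] < TTL_SECONDS:
            return hit[1], hit[2]   # replay the original response, status code included
        status, body = charge_fn()
        _store[key] = (now, status, body)
        return status, body

    key = idempotency_key("order-123", 4999, "2024-05-01T12:00:00Z")
    print(process_once(key, lambda: (201, {"charge": "ch_1"})))
    print(process_once(key, lambda: (500, {"error": "never runs"})))

The second call replays the original 201 without invoking the charge function at all, which is exactly the "indistinguishable from the original request" property.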

The edge cases that break everything

Timeout ambiguity. Your request to the payment processor times out after 30 seconds. Did it succeed? The processor might have received and processed it but the response was lost. This is the hardest problem in retry logic. My approach: record the attempt with a "pending" status, query the processor for the transaction status before retrying, and only retry if you can confirm the original didn't succeed.
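
Sketched in code, with lookup_status, retry_charge, and record as hypothetical callables standing in for your processor client and transaction store:

    def resolve_timeout(txn_id: str, lookup_status, retry_charge, record) -> str:
        record(txn_id, "pending")           # record the attempt first
        status = lookup_status(txn_id)      # ask the processor what it saw
        if status == "succeeded":
            record(txn_id, "succeeded")     # the "timeout" actually went through
            return "recovered"
        if status in ("failed", "not_found"):
            retry_charge(txn_id)            # confirmed dead: safe to retry
            record(txn_id, "retried")
            return "retried"
        record(txn_id, "unknown")           # still in flight: check again later
        return "deferred"

    # Example: the processor received the charge but the response was lost.
    log = {}
    print(resolve_timeout("txn-42", lambda t: "succeeded",
                          lambda t: None, lambda t, s: log.update({t: s})))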

Partial failures. A multi-step transaction — authorize, capture, settle — can fail partway through. You authorized the card, but the capture timed out. Now you have a hold on the customer's card with no corresponding charge. You need compensating actions: release the authorization if capture fails, and track the state of each step independently.
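
One way to structure that, as a sketch (authorize, capture, and release stand in for hypothetical processor calls):

    class CaptureFailed(Exception):
        pass

    def charge_with_compensation(authorize, capture, release, record):
        auth_id = authorize()               # step 1: place the hold
        record("authorized", auth_id)
        try:
            capture_id = capture(auth_id)   # step 2: take the money
            record("captured", capture_id)
            return capture_id
        except CaptureFailed:
            release(auth_id)                # compensating action for step 1
            record("released", auth_id)
            raise

    # Demo: capture times out, so the hold is released instead of stranded.
    def failing_capture(auth_id):
        raise CaptureFailed("capture timed out")

    states = []
    try:
        charge_with_compensation(lambda: "auth_9", failing_capture,
                                 lambda a: None, lambda s, i: states.append(s))
    except CaptureFailed:
        pass
    print(states)   # ['authorized', 'released']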

Duplicate charges. Even with idempotency keys, duplicates happen. Maybe the key TTL expired. Maybe you're integrating with a processor that doesn't support idempotency natively. Build reconciliation into your system from day one. A nightly job that compares your records against the processor's records catches duplicates before they become customer complaints.
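
The comparison itself can be as simple as a set difference over (key, amount) pairs; the record shape here is hypothetical:

    def reconcile(our_records, processor_records):
        ours = {(r["key"], r["amount_cents"]) for r in our_records}
        theirs = {(r["key"], r["amount_cents"]) for r in processor_records}
        return {
            "missing_on_processor": ours - theirs,  # we recorded it; they didn't
            "unknown_to_us": theirs - ours,         # potential duplicates or orphans
        }

    print(reconcile(
        [{"key": "a1", "amount_cents": 500}],
        [{"key": "a1", "amount_cents": 500}, {"key": "a1-dup", "amount_cents": 500}],
    ))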

Race conditions in concurrent retries. If your retry logic runs on multiple instances (which it does in any distributed system), two instances might retry the same failed transaction simultaneously. Both get through before the idempotency check kicks in. Use distributed locks or optimistic concurrency on the transaction record to prevent this.
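
Optimistic concurrency here is a compare-and-swap on a version column. An in-memory sketch (the threading lock stands in for the row-level atomicity a database gives you):

    import threading

    _lock = threading.Lock()
    transactions = {"txn-42": {"status": "failed", "version": 3}}

    def claim_for_retry(txn_id: str, seen_version: int) -> bool:
        with _lock:   # in a real system, one atomic conditional UPDATE
            row = transactions[txn_id]
            if row["version"] != seen_version or row["status"] != "failed":
                return False              # another instance already claimed it
            row["status"] = "retrying"
            row["version"] += 1
            return True

    print(claim_for_retry("txn-42", 3))   # True: this instance owns the retry
    print(claim_for_retry("txn-42", 3))   # False: stale version, back off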

When NOT to retry

Not every failure is retryable. Knowing when to stop is as important as knowing when to try again.

  • Hard declines. Expired card, stolen card, or insufficient funds. Retrying won't help and will annoy the issuer.
  • Fraud declines. The processor's fraud system flagged the transaction. Retrying is actively harmful — it looks like a fraud pattern.
  • Validation errors. Invalid card number, bad CVV, malformed request. Fix the input, don't retry the same bad data.
  • 4xx errors in general. These are client errors, and retrying the same request will produce the same result. The one exception is 429, which is a rate limit rather than a permanent failure.

Only retry on transient failures: network timeouts, 502/503/504 responses, connection resets, rate limits (with appropriate backoff). Everything else should fail fast and surface the error to the caller.
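
In code, this whole section collapses into a small classifier. A sketch, with an illustrative error taxonomy (map your processor's actual decline codes into these buckets):

    RETRYABLE_STATUS = {429, 502, 503, 504}
    RETRYABLE_ERRORS = {"timeout", "connection_reset"}
    HARD_DECLINES = {"expired_card", "stolen_card", "insufficient_funds",
                     "fraud_decline", "invalid_card", "bad_cvv"}

    def should_retry(status_code: int | None, error_code: str | None) -> bool:
        if error_code in HARD_DECLINES:
            return False                 # retrying declines helps no one
        if error_code in RETRYABLE_ERRORS:
            return True                  # network-level transients
        if status_code in RETRYABLE_STATUS:
            return True                  # 429 with backoff; 5xx gateway errors
        return False                     # everything else: fail fast

    assert should_retry(503, None)
    assert not should_retry(402, "insufficient_funds")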

What I've learned the hard way

The retry mechanism itself needs monitoring. Track retry rates by endpoint, by error type, by time of day. A spike in retries is often the first signal of a downstream degradation — before the alerts fire, before the dashboards turn red.
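
Even a simple counter, keyed the way you'll want to slice it, is a start (a sketch; in production these would be metrics emitted to your monitoring system):

    from collections import Counter
    from datetime import datetime, timezone

    retry_counts = Counter()

    def record_retry(endpoint: str, error_type: str) -> None:
        # Key by (endpoint, error_type, hour) so spikes are easy to localize.
        hour = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")
        retry_counts[(endpoint, error_type, hour)] += 1

    record_retry("/v1/charges", "timeout")
    print(retry_counts.most_common(3))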

Log every retry attempt with the original request ID, the retry count, the error that triggered it, and the delay before the next attempt. When something goes wrong (and it will), this log is the difference between a 10-minute investigation and a 4-hour one.
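
Structured logging makes those fields queryable. A sketch with illustrative field names:

    import json
    import logging

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("retries")

    def log_retry(request_id: str, attempt: int, error: str, next_delay: float) -> None:
        logger.info(json.dumps({
            "event": "retry_attempt",
            "request_id": request_id,    # ties the retry to the original request
            "attempt": attempt,
            "error": error,
            "next_delay_s": round(next_delay, 2),
        }))

    log_retry("req-7f3a", 2, "connection_reset", 4.37)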

And test your retry logic under failure conditions, not just in happy-path integration tests. Kill the downstream service mid-request. Inject random timeouts. Simulate a processor that accepts the charge but drops the response. These scenarios aren't hypothetical — they're Tuesday.
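
The "accepts the charge but drops the response" case is easy to fake with a test double (FlakyProcessor is hypothetical; the shape works with any client you can stub):

    class DroppedResponse(Exception):
        pass

    class FlakyProcessor:
        def __init__(self):
            self.charged = set()

        def charge(self, key: str):
            self.charged.add(key)       # the charge lands...
            raise DroppedResponse()     # ...but the caller never hears about it

    proc = FlakyProcessor()
    try:
        proc.charge("order-1")
    except DroppedResponse:
        pass
    assert "order-1" in proc.charged    # a naive retry here would double-charge

Point your real retry path at a double like this and assert that it queries status before re-sending.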