Fikri Firman Fadilah
Reliability

Designing Idempotent Operations: Making Retries Safe by Default in Distributed Systems

Practical patterns for request deduplication and idempotency keys that prevent duplicate charges, data corruption, and cascading failures when networks fail.

Retries are non-negotiable in distributed systems. Networks fail. Timeouts fire. Services crash mid-operation. But retries without idempotency are a loaded gun pointed at your data.

The question isn't whether your system will retry—it's whether those retries will silently corrupt state, charge customers twice, or create orphaned records that haunt your database for months. Idempotency is the defensive wall between "we recovered from a timeout" and "we now have two orders for one payment."

The Real Cost of Non-Idempotent Operations

Let me ground this in actual failure modes I've seen in production:

Scenario 1: The Double Charge

A payment processing service receives a charge request. It successfully deducts money from the customer's account and initiates a transfer to the merchant. But the response packet is lost on the network. The client never receives confirmation. After 30 seconds, the retry logic kicks in and sends the identical request again.

Without idempotency checks, the system processes it as a new transaction. The customer is charged twice. The merchant receives two deposits. By the time monitoring alerts fire, 47 customers have been double-charged, and your support team is fielding angry calls while you scramble to reverse transactions manually.

Scenario 2: The Inventory Ghost

An e-commerce system receives an order creation request. It decrements inventory, creates an order record, and publishes an event to the fulfillment system. The database commit succeeds, but the event publishing times out. The client retries.

The second attempt finds the inventory already decremented and skips that step. But it creates a second order record with a different ID. Now you have duplicate orders in your system, one of which was never meant to exist. The customer receives two shipments. Inventory counts become unreliable. Your reconciliation jobs start failing.

Scenario 3: The Subscription Limbo

A user upgrades their subscription tier. The system charges their card, updates the subscription record, and sends a confirmation email. The charge succeeds. The database update succeeds. But the email service times out. The retry logic fires and sends the same request again.

Without idempotency, the system processes the upgrade twice, charging the card again. Or worse: it attempts to update the subscription record twice, and due to a race condition, the second update reverts the first one, leaving the customer on their old tier but charged at the new rate.

These aren't edge cases. They're the inevitable result of distributed system failure modes colliding with non-idempotent operations. And they're preventable.

What Idempotency Actually Means

Idempotency means that applying the same operation multiple times produces the same result as applying it once. Mathematically, f(f(x)) = f(x).

In the context of API operations, it means:

  • First request: executes the operation, returns result
  • Second identical request: returns the same result without re-executing the operation
  • Nth identical request: same as the second

The key word is identical. You need a way to recognize that two requests are the same operation, not two separate operations that happen to look similar.
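To make the definition concrete, here is a minimal sketch contrasting an idempotent operation with a non-idempotent one. The function names and the dictionary-based order are illustrative assumptions, not part of any real API.

```python
def set_status(order: dict, status: str) -> dict:
    """Idempotent: applying it a second time changes nothing."""
    return {**order, "status": status}


def add_item(order: dict, item: str) -> dict:
    """Not idempotent: every application appends another copy."""
    return {**order, "items": order["items"] + [item]}


order = {"status": "new", "items": []}

paid_once = set_status(order, "paid")
paid_twice = set_status(paid_once, "paid")
print(paid_once == paid_twice)    # True: a retry is harmless

added_once = add_item(order, "book")
added_twice = add_item(added_once, "book")
print(added_once == added_twice)  # False: a retry duplicates the item
```

Setting a value converges; appending does not. Most of the patterns below are ways of turning the second kind of operation into the first.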

Pattern 1: Idempotency Keys

The most practical and widely deployed pattern is the idempotency key: a unique token provided by the client that identifies the logical operation, not the network request.

How It Works

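POST /api/v1/charges
{
  "idempotency_key": "charge-2024-01-15-user-4521",
  "amount_cents": 9999,
  "currency": "usd",
  "customer_id": "cust_4521"
}

Note that the key carries no attempt counter. It names the logical charge, so every retry of that charge must send exactly the same key.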

The server does this:

  1. Receive the request and extract the idempotency key
  2. Check if a record with this key already exists in a deduplication store
  3. If yes: return the cached response without re-executing
  4. If no: execute the operation, store the result keyed by the idempotency key, return the result
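The four steps above can be sketched in a few lines. This is a simplified illustration, not a production design: the in-memory dictionary stands in for a durable store, `handle_request` and `charge_card` are hypothetical names, and concurrency is ignored entirely.

```python
from typing import Callable

# In-memory stand-in for the deduplication store; a real one must be durable.
_responses: dict[str, dict] = {}


def handle_request(idempotency_key: str, operation: Callable[[], dict]) -> dict:
    """Run `operation` at most once per idempotency key."""
    if idempotency_key in _responses:
        # Steps 2-3: key seen before, replay the stored response
        return _responses[idempotency_key]
    # Step 4: first time we see this key, execute and remember the result
    response = operation()
    _responses[idempotency_key] = response
    return response


charges_made = []


def charge_card() -> dict:
    charges_made.append(9999)
    return {"status": "succeeded", "amount_cents": 9999}


first = handle_request("key-123", charge_card)
retry = handle_request("key-123", charge_card)
print(first == retry, len(charges_made))  # True 1
```

The retry gets the same response as the first call, and the card is charged once.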

Implementation Details That Matter

Storage: The deduplication store must be durable and fast. A distributed cache (Redis) is common, but it must be backed by a database for durability. If your Redis cluster fails and loses the deduplication cache, you'll process duplicate requests.

Better pattern: store the idempotency key and result in your primary database, in a dedicated table:

sql
CREATE TABLE idempotency_records (
  idempotency_key VARCHAR(255) PRIMARY KEY,
  request_body JSONB,
  response_body JSONB,
  operation_status VARCHAR(50),
  created_at TIMESTAMP,
  expires_at TIMESTAMP
);

Expiration: Don't keep idempotency records forever. After 24 hours (or whatever your retry window is), you can safely delete them. This prevents unbounded growth and reduces lookup time.
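A cleanup job for expired records can be as small as a single DELETE. The sketch below uses Python's built-in sqlite3 module so it runs anywhere; the reduced table, the Unix-timestamp `expires_at` column, and the `purge_expired` helper are simplifying assumptions for illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_records ("
    " idempotency_key TEXT PRIMARY KEY,"
    " response_body TEXT,"
    " expires_at REAL)"  # Unix timestamp, a simplification of TIMESTAMP
)


def purge_expired(conn: sqlite3.Connection, now: float) -> int:
    """Delete records whose retry window has passed; return the number removed."""
    cursor = conn.execute(
        "DELETE FROM idempotency_records WHERE expires_at <= ?", (now,)
    )
    conn.commit()
    return cursor.rowcount


now = time.time()
insert = "INSERT INTO idempotency_records VALUES (?, ?, ?)"
conn.execute(insert, ("old-key", "{}", now - 1))        # already expired
conn.execute(insert, ("fresh-key", "{}", now + 86400))  # expires in 24 hours

removed = purge_expired(conn, now)
remaining = [
    row[0] for row in conn.execute("SELECT idempotency_key FROM idempotency_records")
]
print(removed, remaining)  # 1 ['fresh-key']
```

Run something like this on a schedule. An index on `expires_at` keeps the delete cheap as the table grows.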

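Atomicity: The check-then-execute must be atomic. A SELECT followed by an INSERT is not enough at the default isolation level: two concurrent requests can both find no row, and both execute the operation. Instead, claim the key with an INSERT first and let the primary key constraint reject the second request:

sql
BEGIN;
  -- Claim the key first; the primary key makes this atomic across requests
  INSERT INTO idempotency_records (idempotency_key, request_body, operation_status)
    VALUES ($1, $2, 'in_progress')
    ON CONFLICT (idempotency_key) DO NOTHING;
  -- 0 rows inserted: the key already exists, so return its stored response
  -- 1 row inserted: this request owns the key; execute the operation, then:
  UPDATE idempotency_records
    SET response_body = $3, operation_status = 'completed'
    WHERE idempotency_key = $1;
COMMIT;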
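One way to get an atomic check-and-claim is to let the primary key arbitrate: try to insert the key, and treat "nothing inserted" as "someone else already owns it". Here is a minimal runnable sketch using sqlite3, where `INSERT OR IGNORE` plays that role. The `claim_key` and `process_once` helpers and the reduced table are assumptions for illustration.

```python
import sqlite3
from typing import Callable, Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_records ("
    " idempotency_key TEXT PRIMARY KEY,"
    " response_body TEXT)"
)


def claim_key(conn: sqlite3.Connection, key: str) -> bool:
    """Return True only for the one caller that inserts the key."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO idempotency_records (idempotency_key) VALUES (?)",
        (key,),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 means the key already existed


def process_once(conn: sqlite3.Connection, key: str,
                 operation: Callable[[], str]) -> Optional[str]:
    if claim_key(conn, key):
        response = operation()
        conn.execute(
            "UPDATE idempotency_records SET response_body = ? WHERE idempotency_key = ?",
            (response, key),
        )
        conn.commit()
        return response
    row = conn.execute(
        "SELECT response_body FROM idempotency_records WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    return row[0]  # None while the owning request is still in progress


calls = []


def operation() -> str:
    calls.append(1)
    return '{"status": "succeeded"}'


first = process_once(conn, "key-123", operation)
second = process_once(conn, "key-123", operation)
print(first == second, len(calls))  # True 1
```

This sketch deliberately leaves one case open: if the operation crashes after the claim, the key stays claimed with no response. A real implementation needs a policy for that, such as a status column and a timeout.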

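Client Responsibility: Clients must generate the idempotency key and send the same key on every retry of the same operation. This is a contract. Document it clearly. Provide SDKs that generate a UUID-based key once per logical operation and reuse it across retries when the caller doesn't supply one. Payment APIs such as Stripe and Square accept client-supplied idempotency keys for exactly this purpose. A key generated server-side for each incoming request cannot deduplicate anything, because every retry would receive a fresh key.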
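On the client side, the important detail is where the key is generated: once per logical operation, outside the retry loop. The sketch below illustrates that with a hypothetical `FlakyServer` that applies the charge but loses the first response; none of these names come from a real SDK.

```python
import uuid


class FlakyServer:
    """Hypothetical server: deduplicates by key, loses the first response."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}
        self.calls = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        self.calls += 1
        # The charge is applied at most once per key, even if the reply is lost
        self.charges.setdefault(idempotency_key, amount_cents)
        if self.calls == 1:
            raise TimeoutError("response lost on the network")
        return {"status": "succeeded", "amount_cents": self.charges[idempotency_key]}


def charge_with_retries(server: FlakyServer, amount_cents: int,
                        max_attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # generated once per operation, NOT once per attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return server.charge(key, amount_cents)
        except TimeoutError:
            if attempt == max_attempts:
                raise


server = FlakyServer()
result = charge_with_retries(server, 9999)
print(result["status"], server.calls, len(server.charges))  # succeeded 2 1
```

Two requests reached the server, but only one charge exists. Moving the `uuid.uuid4()` call inside the loop would silently break that guarantee.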

Pattern 2: Detecting Duplicate Effects

Sometimes you can't use an idempotency key because the client doesn't support it, or the operation is internal (service-to-service). In these cases, detect duplicate effects by inspecting state.

The Deduplication Window

When a request arrives, check if the effect has already been applied:

python
from datetime import datetime, timedelta, timezone

def create_order(customer_id, items, total_cents):
    # Check if an order with these exact properties already exists
    # from this customer in the last 60 seconds.
    # `Order` is the application's own model; it is not defined here.
    existing_order = Order.find_recent(
        customer_id=customer_id,
        items=items,
        total_cents=total_cents,
        created_after=datetime.now(timezone.utc) - timedelta(seconds=60)
    )

    if existing_order:
        # This is a retry; return the existing order
        return existing_order

    # New operation; execute it
    order = Order.create(customer_id, items, total_cents)
    return order

This is weaker than idempotency keys because it relies on heuristics (time windows, exact field matching), but it works when you can't change the client contract.

Limitations

  • Requires identifying what constitutes a "duplicate"
  • Sensitive to timing (what if the window is too short?)
  • Doesn't work well for operations that change state incrementally (incrementing a counter)

Use this pattern as a fallback, not a primary defense.
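The timing sensitivity is easy to demonstrate. In this small sketch (an in-memory list and a simplified `create_order` taking an explicit clock value, both assumptions for illustration), a 60-second window gets it wrong in both directions:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=60)
orders = []  # each entry: (customer_id, total_cents, created_at)


def create_order(customer_id: str, total_cents: int, now: datetime) -> tuple:
    for existing in orders:
        same_fields = existing[0] == customer_id and existing[1] == total_cents
        if same_fields and now - existing[2] <= WINDOW:
            return existing  # treated as a retry
    order = (customer_id, total_cents, now)
    orders.append(order)
    return order


t0 = datetime(2024, 1, 15, 12, 0, 0)
create_order("cust_4521", 9999, t0)

# False positive: a genuine second purchase 10 seconds later is swallowed
create_order("cust_4521", 9999, t0 + timedelta(seconds=10))
count_after_second_purchase = len(orders)

# False negative: a real retry arriving after 90 seconds creates a duplicate
create_order("cust_4521", 9999, t0 + timedelta(seconds=90))
count_after_late_retry = len(orders)

print(count_after_second_purchase, count_after_late_retry)  # 1 2
```

No window size fixes both failures at once. Widening it swallows more legitimate orders, and narrowing it lets more late retries through.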

Pattern 3: State Machine Design

The most robust approach is to encode idempotency into your state machine. Design operations so that retries naturally converge to the same end state.

Example: Payment Processing

Instead of a single "charge" operation, model it as a state machine:

PENDING → PROCESSING → CHARGED → SETTLED
              ↓
            FAILED

When a charge request arrives:

  1. Create a charge record in PENDING state
  2. Attempt to charge the card
  3. If successful, transition to CHARGED
  4. If the call times out, move the charge to PROCESSING (outcome unknown) and retry

On retry:

  • Check if a charge record exists for this idempotency key (or, if there is no key, for this customer/amount/time window)
  • If in PROCESSING state, resume from where we left off
  • If in CHARGED state, the operation already succeeded; return success
  • If in FAILED state, don't retry; return the failure

This design makes retries naturally idempotent because the state machine ensures we don't execute the same transition twice.
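A small transition table is one way to enforce this. In the sketch below, the `ALLOWED` map is an assumption read off the diagram and the steps in this section, and the `Charge` class is an in-memory stand-in for a persisted record.

```python
# Allowed transitions; an assumption based on the states described above.
ALLOWED = {
    "PENDING": {"PROCESSING", "CHARGED", "FAILED"},
    "PROCESSING": {"CHARGED", "FAILED"},
    "CHARGED": {"SETTLED"},
    "SETTLED": set(),
    "FAILED": set(),
}


class Charge:
    """In-memory stand-in for a persisted charge record."""

    def __init__(self) -> None:
        self.status = "PENDING"

    def transition_to(self, new_status: str) -> None:
        if new_status == self.status:
            return  # repeating a transition is a harmless no-op
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status


charge = Charge()
charge.transition_to("PROCESSING")
charge.transition_to("CHARGED")
charge.transition_to("CHARGED")  # a retry: nothing happens
print(charge.status)  # CHARGED

try:
    charge.transition_to("PENDING")  # would allow the card to be charged again
except ValueError as error:
    print(error)  # illegal transition CHARGED -> PENDING
```

Because a charged record can never return to PENDING, a retry has no path back to the code that calls the payment gateway.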

python
def charge_customer(customer_id, amount_cents, idempotency_key):
    # Find or create the charge record for this logical operation
    charge = Charge.find_or_create(
        idempotency_key=idempotency_key,
        customer_id=customer_id,
        amount_cents=amount_cents
    )

    if charge.status == "CHARGED":
        return charge  # Already succeeded

    if charge.status == "FAILED":
        return charge  # Already failed; don't retry

    if charge.status == "PROCESSING":
        # The earlier attempt timed out, so no gateway ID was ever stored.
        # Assumption: the gateway can look a charge up by the idempotency
        # key we sent with the original request.
        result = payment_gateway.check_charge_status(idempotency_key)
        if result.status == "success":
            charge.transition_to("CHARGED")
        elif result.status == "failed":
            charge.transition_to("FAILED")
        # Any other status: outcome still unknown, so stay in PROCESSING
        return charge

    # charge.status == "PENDING"
    try:
        # Forward the key so the gateway can deduplicate on its side too
        gateway_response = payment_gateway.charge(
            amount_cents, customer_id, idempotency_key=idempotency_key
        )
        charge.gateway_id = gateway_response.id
        charge.transition_to("CHARGED")
    except TimeoutError:
        # Outcome unknown: the charge may or may not have gone through
        charge.transition_to("PROCESSING")
        raise
    except PaymentDeclinedError as e:
        charge.failure_reason = str(e)
        charge.transition_to("FAILED")
        raise

    return charge

The state machine approach is powerful because:

  • It's self-documenting (the states are explicit)
  • It handles timeouts gracefully (PROCESSING state means "resume from here")
  • It prevents impossible