Processing Webhooks Reliably: Idempotency, Retries, and Dead Letters

Webhooks are the connective tissue of modern software: Stripe tells you a payment succeeded, GitHub tells you a branch was pushed, a provider tells you a job finished. The catch is that webhooks are an inherently unreliable medium. They arrive out of order, sometimes twice, sometimes not at all, and the sender will happily retry until you return a 2xx. A reliable webhook pipeline is built around that reality, not against it.

This post walks through the patterns we use at CodeAustral to make webhook processing boring and correct: signature verification, idempotency keys, async processing, retries with backoff, dead-letter handling, and replay. The examples lean on Stripe and GitHub because they are the two providers most teams integrate first, and they make different but instructive choices.

Treat the Endpoint as a Hostile, Unordered Channel

Before any code, internalize the contract the sender actually offers. It is weaker than people assume:

At-least-once delivery. The same event can arrive multiple times. Networks drop ACKs, senders retry, load balancers replay. Stripe explicitly documents that you may receive duplicates.
No ordering guarantee. A customer.subscription.updated can land before the customer.subscription.created that logically precedes it.
Tight response budgets. Senders expect a fast 2xx. Stripe times out at around 20 seconds; do real work synchronously and you will start collecting timeouts and retries.
The payload is not authoritative. A webhook is a notification, not a source of truth. The amount, status, or object state in the body may be stale by the time you read it.

Every design decision below follows from these four facts. The goal is a handler that is safe to call zero, one, or fifty times with the same event and always converges to the correct state.
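One way to cope with out-of-order delivery is a "newer wins" guard on each object. The following is a minimal sketch, not a prescribed method: `StoredState`, `IncomingEvent`, and `applyIfNewer` are hypothetical names, and the in-memory `Map` stands in for a database row. It assumes the provider stamps each event with a creation time.

```typescript
// Hypothetical shapes for illustration; not part of any provider SDK.
interface StoredState {
  status: string;
  lastEventAt: number; // creation time of the last event we applied
}

interface IncomingEvent {
  objectId: string;
  status: string;
  createdAt: number; // creation time reported by the provider
}

// Apply the event only if it is newer than what we already hold.
// Returns true when state changed, false when the event was stale.
function applyIfNewer(
  store: Map<string, StoredState>,
  ev: IncomingEvent
): boolean {
  const current = store.get(ev.objectId);
  if (current && current.lastEventAt >= ev.createdAt) {
    return false; // older (or same-age) event arrived late: ignore it
  }
  store.set(ev.objectId, { status: ev.status, lastEventAt: ev.createdAt });
  return true;
}
```

Coarse timestamps can tie, so this guard is only an approximation. Because the payload is not authoritative, re-reading the object from the provider's API before acting is the more robust option when the state really matters.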

Verify the Signature Before You Trust a Byte

Your webhook URL is public. Anyone who finds it can POST arbitrary JSON claiming a customer just paid you. Signature verification is the only thing standing between that and a fraudulent fulfillment.

Both Stripe and GitHub sign the raw request body with a shared secret. The single most common bug here is verifying against a re-serialized body. If your framework parses JSON and you re-stringify it, whitespace and key order change and the signature no longer matches. You must capture the raw bytes.

import express from "express";
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const app = express();

// Capture the RAW body for this route only.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
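  async (req, res) => {
    const sig = req.headers["stripe-signature"] as string;
    let event: Stripe.Event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,                       // Buffer, not parsed JSON
        sig,
        process.env.STRIPE_WEBHOOK_SECRET!
      );
    } catch (err) {
      // Bad signature, malformed payload, or stale timestamp.
      return res.status(400).send("invalid signature");
    }

    try {
      // Persist durably BEFORE acknowledging; enqueue is your own helper
      // (for example, the idempotent insert shown later in this post).
      await enqueue(event);
    } catch (err) {
      // Could not record the event: return non-2xx so the provider retries.
      return res.status(500).send("could not persist event");
    }
    return res.status(200).json({ received: true });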
  }
);

Stripe's signature scheme includes a timestamp, and constructEvent rejects events outside a tolerance window (default five minutes) to defeat replay attacks. GitHub uses an HMAC-SHA256 over the body in the X-Hub-Signature-256 header; verify it with a constant-time comparison such as crypto.timingSafeEqual so an attacker cannot recover the expected signature byte by byte through timing differences. Never roll a plain === string comparison for HMACs.
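For the GitHub side, a minimal sketch of that check using Node's built-in crypto module might look like the following. The function name `verifyGitHubSignature` is hypothetical; the header format (`sha256=` followed by a hex digest) is the one GitHub uses for X-Hub-Signature-256.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only if `header` is a valid X-Hub-Signature-256 value
// for `rawBody` under `secret`. `rawBody` must be the unparsed bytes.
function verifyGitHubSignature(
  rawBody: Buffer,
  header: string | undefined,
  secret: string
): boolean {
  const prefix = "sha256=";
  if (!header || !header.startsWith(prefix)) return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(header.slice(prefix.length), "hex");

  // timingSafeEqual throws when lengths differ, so check length first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

Comparing decoded bytes rather than hex strings also means an upper-case or otherwise malformed digest simply fails the length or equality check instead of throwing.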

A few non-negotiables:

Store the signing secret in your secrets manager, not in code. Stripe and GitHub both let you rotate it; build for rotation by accepting two valid secrets during a cutover window.
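Reject early. Signature failure is a 400 (or 401), never a 500, so it is recorded as a rejected request rather than a fault in your service. Be aware that some providers, Stripe included, retry any non-2xx response, so a 4xx does not by itself stop redelivery.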
Log the event id and type on rejection, but never log the secret or the full signature header.
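Accepting two secrets during a cutover can be sketched as a thin wrapper around whatever single-secret check you already have. The names `Verifier` and `verifyWithRotation` are hypothetical, and the wrapper assumes the underlying verifier returns a boolean rather than throwing.

```typescript
// Checks one (body, signature, secret) combination.
// Assumption: returns false on mismatch instead of throwing.
type Verifier = (
  rawBody: Uint8Array,
  signature: string,
  secret: string
) => boolean;

// Accept the delivery if ANY currently valid secret verifies it.
// During rotation `secrets` holds the new and the old secret;
// afterwards it shrinks back to one entry.
function verifyWithRotation(
  rawBody: Uint8Array,
  signature: string,
  secrets: readonly string[],
  verify: Verifier
): boolean {
  return secrets
    .filter((secret) => secret.length > 0) // ignore unset slots
    .some((secret) => verify(rawBody, signature, secret));
}
```

Filtering out empty strings guards against an unset environment variable silently becoming a valid "secret".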

Idempotency Is the Heart of the System

Because delivery is at-least-once, idempotency is not optional. The cleanest design separates *recording* an event from *acting* on it.

Persist the event id, let the database enforce uniqueness

Every Stripe event has a stable evt_... id. GitHub sends an X-GitHub-Delivery UUID per delivery. Use that id as a unique key and let the database be the arbiter of "have I seen this before."

CREATE TABLE webhook_events (
  id            TEXT PRIMARY KEY,          -- provider event id
  provider      TEXT NOT NULL,
  type          TEXT NOT NULL,
  payload       JSONB NOT NULL,
  status        TEXT NOT NULL DEFAULT 'pending',  -- pending|processing|done|dead
  attempts      INT  NOT NULL DEFAULT 0,
  received_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  processed_at  TIMESTAMPTZ,
  last_error    TEXT
);

On receipt, after signature verification, do an idempotent insert:

INSERT INTO webhook_events (id, provider, type, payload)
VALUES ($1, 'stripe', $2, $3)
ON CONFLICT (id) DO NOTHING;

If the insert affects zero rows, you have already seen this event. Acknowledge with 200 and stop. This single constraint absorbs the entire class of duplicate-delivery bugs at the front door.
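In application code, the "zero rows means duplicate" check can be wrapped in a small helper. This is a sketch under assumptions: `Queryable` is a hypothetical structural type meant to resemble a node-postgres client or pool, and `recordEvent` is an illustrative name.

```typescript
// Assumed to match the shape of a Postgres client: a `query` method
// that resolves to an object exposing the affected row count.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rowCount: number | null }>;
}

interface IncomingEvent {
  id: string;
  type: string;
  payload: unknown;
}

// Returns true if this is the first time the event id has been seen,
// false if the unique constraint swallowed a duplicate.
async function recordEvent(
  db: Queryable,
  provider: string,
  ev: IncomingEvent
): Promise<boolean> {
  const result = await db.query(
    `INSERT INTO webhook_events (id, provider, type, payload)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (id) DO NOTHING`,
    [ev.id, provider, ev.type, JSON.stringify(ev.payload)]
  );
  return (result.rowCount ?? 0) === 1;
}
```

The caller returns 200 in both cases; the boolean only decides whether there is new work to schedule.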

Make the side effects idempotent too

Recording the event is half the job. The handler still has to perform work, and that work must be safe to retry, because your own processing can crash after a side effect but before marking the event done. Practical tactics:

Natural keys. When provisioning an order, key it on the Stripe checkout.session id or payment_intent id with a unique constraint, so a second attempt updates rather than duplicates.
Pass idempotency keys downstream. When your handler calls another API that supports them (Stripe's own Idempotency-Key, for instance), derive the key deterministically from the event id so retries collapse.
Upsert, do not insert. Prefer INSERT ... ON CONFLICT DO UPDATE over blind inserts for any record derived from an event.

The mental model: the event id guards "did I receive this," and a natural business key guards "did I apply this." You need both.
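A toy illustration of the second guard, with an in-memory `Map` standing in for a table that has a unique constraint on the business key. `provisionOrder` and `downstreamKey` are hypothetical helpers, and the key format is just one reasonable choice.

```typescript
interface Order {
  paymentIntentId: string;
  status: string;
}

// Stand-in for a table with a unique constraint on paymentIntentId.
const orders = new Map<string, Order>();

// Upsert keyed on the business id: a retry overwrites the same row
// instead of creating a second order.
function provisionOrder(paymentIntentId: string, status: string): void {
  orders.set(paymentIntentId, { paymentIntentId, status });
}

// Deterministic key for downstream calls: the same event and action
// always yield the same key, so the downstream API can collapse retries.
function downstreamKey(eventId: string, action: string): string {
  return `webhook:${eventId}:${action}`;
}

// Simulate a crash-and-retry: the handler runs twice for one event.
provisionOrder("pi_example", "paid");
provisionOrder("pi_example", "paid");
```

After both calls `orders` still holds a single entry, which is the convergence property the handler relies on.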

Acknowledge Fast, Process Async

Doing the real work inside the HTTP handler is the most common reliability mistake. If fulfillment takes eight seconds, you have eight seconds of exposure to a timeout, after which the sender retries and you process twice (or, worse, a slow database makes every delivery time out and the provider eventually disables your endpoint).

The fix is to split receipt from processing:

Verify signature.
Persist the event (idempotent insert).
Return 200 immediately.
Process from a worker reading the webhook_events table or a queue.
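The first three steps can be sketched independently of any web framework. Everything here is illustrative: `ReceiptDeps` and `receive` are hypothetical names, and the function returns the HTTP status the endpoint should send.

```typescript
interface ReceiptDeps<E> {
  // Returns the parsed event, or throws if the signature is invalid.
  verify(rawBody: Uint8Array, signature: string): E;
  // Durable, idempotent write; resolves once the event is stored.
  persist(event: E): Promise<void>;
}

// Verify, persist, then acknowledge. No business logic runs here.
async function receive<E>(
  deps: ReceiptDeps<E>,
  rawBody: Uint8Array,
  signature: string
): Promise<number> {
  let event: E;
  try {
    event = deps.verify(rawBody, signature);
  } catch {
    return 400; // cannot trust the request
  }

  try {
    await deps.persist(event);
  } catch {
    return 500; // not stored, so let the provider retry
  }

  return 200; // stored: we now own the event
}
```

Keeping the function free of framework types makes the three outcomes easy to unit test with fakes.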

For many teams the database table *is* the queue. A worker polls for status = 'pending', claims rows with SELECT ... FOR UPDATE SKIP LOCKED, processes them, and marks them done. This avoids adding infrastructure and keeps the audit trail in one place.

-- Worker claims a batch without colliding with peers.
UPDATE webhook_events
SET status = 'processing', attempts = attempts + 1
WHERE id IN (
  SELECT id FROM webhook_events
  WHERE status = 'pending'
  ORDER BY received_at
  LIMIT 20
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
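Around that claim query, the worker loop itself can stay small. The sketch below injects its dependencies so it runs without a database; `WorkerDeps`, `ClaimedEvent`, and `processBatch` are hypothetical names, and `claimBatch` is assumed to run the claim query above.

```typescript
interface ClaimedEvent {
  id: string;
  type: string;
  payload: unknown;
  attempts: number;
}

interface WorkerDeps {
  claimBatch(limit: number): Promise<ClaimedEvent[]>; // runs the claim query
  handle(ev: ClaimedEvent): Promise<void>; // idempotent business logic
  markDone(id: string): Promise<void>;
  markFailed(id: string, error: string): Promise<void>; // retry policy lives here
}

// Process one claimed batch; one bad event never blocks the others.
async function processBatch(
  deps: WorkerDeps,
  limit = 20
): Promise<{ done: number; failed: number }> {
  const batch = await deps.claimBatch(limit);
  let done = 0;
  let failed = 0;

  for (const ev of batch) {
    try {
      await deps.handle(ev);
      await deps.markDone(ev.id);
      done++;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      await deps.markFailed(ev.id, message);
      failed++;
    }
  }
  return { done, failed };
}
```

One gap this sketch does not cover: a worker that crashes mid-batch leaves rows in processing, so a periodic sweep that returns long-stuck rows to pending is usually needed as well.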

When throughput grows, graduate to a real broker (SQS, Redis Streams, RabbitMQ) but keep the same shape: the endpoint's only job is to durably record and ACK. We have run the table-as-queue pattern comfortably into the thousands of events per minute before reaching for a dedicated broker.

One caveat: returning 200 means *you have taken responsibility* for the event. If you ACK and then lose it because the insert was not actually durable, the sender will not help you. Persist before you ACK, always.

Retries and Backoff: Yours and Theirs

There are two retry loops in play, and conflating them causes pain.

The provider's retry loop fires when you do not return 2xx. Stripe retries with exponential backoff for up to about three days. GitHub does not automatically redeliver failed deliveries; you redeliver them yourself from the webhook settings page or through the REST API. You do not control either policy, so make your endpoint's failures meaningful: return non-2xx only for transient problems you genuinely want retried, and 2xx-with-record for anything you have safely captured.

Your own retry loop runs in the worker when processing fails. This is where backoff belongs. Use exponential backoff with jitter to avoid thundering herds against a recovering downstream:

function nextDelayMs(attempt: number): number {
  const base = Math.min(30_000, 2 ** attempt * 1000); // cap the base delay at 30s
  const jitter = Math.random() * base * 0.3;           // add 0-30% on top of base
  return base + jitter;
}

Distinguish error classes before retrying:

Transient (network blip, 503 from a downstream, lock timeout): retry with backoff.
Permanent (validation error, missing referenced object, malformed payload): do not retry. No amount of backoff fixes a 422. Send it straight to the dead-letter path.

Blindly retrying permanent failures is how a single bad event burns thousands of worker cycles and masks the real problem.
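Putting the error class and the attempt count together, the retry decision can be written as one pure function. This is a sketch with hypothetical names (`FailureKind`, `NextStep`, `decide`); the cap of eight matches the upper end of the range used in this post, and the random source is injectable only so the function is testable.

```typescript
type FailureKind = "transient" | "permanent";

type NextStep =
  | { action: "retry"; delayMs: number }
  | { action: "dead" };

const MAX_ATTEMPTS = 8; // assumption: upper end of the five-to-eight range

// Exponential backoff, base capped at 30s, plus 0-30% jitter.
function backoffMs(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(30_000, 2 ** attempt * 1000);
  return base + random() * base * 0.3;
}

// Permanent failures and exhausted budgets go to the dead-letter path;
// everything else is retried after a jittered delay.
function decide(
  kind: FailureKind,
  attempts: number,
  random: () => number = Math.random
): NextStep {
  if (kind === "permanent" || attempts >= MAX_ATTEMPTS) {
    return { action: "dead" };
  }
  return { action: "retry", delayMs: backoffMs(attempts, random) };
}
```

Acting on `delayMs` needs somewhere to store the next eligible time, for example a hypothetical next_attempt_at column that the claim query filters on; the schema in this post does not include one.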

Dead Letters: Where Bad Events Go to Be Found

A retry budget that never ends is just a slow infinite loop. After a bounded number of attempts (we typically use five to eight), mark the event dead and stop. The point of a dead-letter store is not to hide failures but to make them visible and actionable.

UPDATE webhook_events
SET status = 'dead', last_error = $2
WHERE id = $1 AND attempts >= 8;

A healthy dead-letter setup includes:

An alert. A row entering dead should page or post to a channel. Silent dead letters are just data loss with extra steps.
The full context. Keep the raw payload, every error message, and attempt count. You will debug from this, often days later.
A runbook. Common causes (a downstream that was down, a deploy that shipped a bug, an event type you do not handle yet) should map to known responses.

Crucially, dead-lettering should be *rare*. If your dead-letter queue is filling steadily, that is a code or dependency problem, not a queue you should drain on a schedule.

Replay: The Payoff for Doing the Rest Right

The reward for storing every verified event and making processing idempotent is that replay becomes trivial and safe. Fixed a bug in your fulfillment logic? Re-run the dead-lettered events. Need to backfill after an outage? Reset status to pending and let the worker pick them up again.

-- Replay a window of dead events after shipping a fix.
UPDATE webhook_events
SET status = 'pending', attempts = 0, last_error = NULL
WHERE status = 'dead'
  AND type = 'invoice.payment_succeeded'
  AND received_at > now() - interval '24 hours';

Because every side effect is keyed and idempotent, replaying an event that *did* partially succeed simply converges to the correct state instead of double-charging or double-provisioning. Both Stripe and GitHub also let you resend events from their dashboards, which is useful for one-offs, but having your own replay control means you are not dependent on a provider UI during an incident.
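If replay is something operators trigger from a script or admin endpoint, building the statement with bound parameters avoids hand-editing SQL during an incident. A minimal sketch, assuming Postgres and a hypothetical `buildReplayQuery` helper whose output you pass to your database client:

```typescript
interface ReplayFilter {
  type: string; // event type to replay
  sinceHours: number; // how far back to look
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds a parameterized version of the replay statement.
function buildReplayQuery(filter: ReplayFilter): SqlQuery {
  if (!Number.isInteger(filter.sinceHours) || filter.sinceHours <= 0) {
    throw new RangeError("sinceHours must be a positive integer");
  }
  return {
    text: `UPDATE webhook_events
           SET status = 'pending', attempts = 0, last_error = NULL
           WHERE status = 'dead'
             AND type = $1
             AND received_at > now() - make_interval(hours => $2::int)
           RETURNING id`,
    values: [filter.type, filter.sinceHours],
  };
}
```

Returning the ids gives you a record of exactly which events were put back in play, which is worth logging alongside who ran the replay.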

A Pragmatic Build vs. Buy Decision

You do not always need to hand-roll this. A quick decision guide:

One or two providers, modest volume, existing Postgres: build the table-as-queue pipeline above. It is a day of work and you own every line.
Many providers, or you want a normalized event schema and built-in retries: an ingestion gateway (Hookdeck, Svix, or a cloud equivalent) handles signature verification, retries, and replay for you, at the cost of another dependency and bill.
High volume, multiple consumers of the same event: put a real broker behind the endpoint and treat the HTTP layer as a thin durable buffer.

Whatever you choose, the invariants do not change: verify signatures, dedupe on event id, ACK fast, retry transient failures with backoff, dead-letter the rest, and keep events replayable.

Frequently Asked Questions

How do I prevent processing the same webhook twice?

Persist a unique identifier from the provider, such as Stripe's event id or GitHub's X-GitHub-Delivery header, with a database unique constraint and an INSERT ... ON CONFLICT DO NOTHING. If the insert affects no rows, you have seen the event before. Pair this with idempotent side effects keyed on a natural business id so partial retries converge safely.

Should I process webhooks synchronously in the HTTP handler?

No. Senders expect a fast 2xx, and Stripe times out around 20 seconds. Do only signature verification and a durable write in the handler, then return 200 immediately. Perform the real work in a background worker reading from your events table or a queue, which isolates slow downstreams from delivery timeouts.

What status code should I return when a webhook fails?

Return 2xx as soon as you have durably stored the event, even if downstream processing has not finished. Reserve non-2xx for cases where you want the provider to retry, such as a failure to persist. Return 400 or 401 for signature failures so the rejection is recorded as a client error rather than a server fault; note that some providers, Stripe included, retry any non-2xx response regardless.

How many times should I retry a failed webhook?

Retry only transient failures, using exponential backoff with jitter, capped at a bounded number of attempts (five to eight is typical). Do not retry permanent errors like validation failures. After the cap, move the event to a dead-letter status, alert a human, and keep the payload and error history so it can be debugged and replayed.

What is a dead-letter queue and when should I use one?

A dead-letter store holds events that exhausted their retry budget so they are not silently lost. Use it once you have automated retries: failures land there with full context, trigger an alert, and can be replayed after a fix. It should be rare in a healthy system; a steadily filling dead-letter queue signals a code or dependency problem.

Working with CodeAustral

We build payment, integration, and event-driven systems where correctness is not negotiable, and webhook reliability is usually where those systems live or die. If you are wiring up Stripe, GitHub, or a fleet of providers and want a pipeline that handles duplicates, retries, and replay without surprising you in production, send us a short brief at https://codeaustral.com/contact and we will tell you how we would approach it.