Background Jobs and Queues for Next.js Apps

Next.js is exceptional at serving requests and rendering pages, but it was never designed to be a job runner. The moment your app needs to send an email after checkout, resize an upload, call a slow third-party API, or generate a report, you hit a wall: serverless functions time out, requests block, and retries become a guessing game. The answer is almost always a background job system, and choosing the right one is an architecture decision worth getting right early.

Why a Next.js App Needs a Queue

A web request should do one thing: return a response quickly. Anything that is slow, flaky, or non-essential to the response belongs somewhere else. When you try to do real work inside a request handler, three problems show up immediately.

First, timeouts. On Vercel and most serverless platforms, function execution is capped (anywhere from about 10 seconds to several minutes, depending on platform, plan, and runtime). A long API call or a batch of database writes can exceed that and fail mid-flight, leaving partial state. Second, coupling. If your checkout route also sends the receipt email, calls the analytics API, and notifies Slack, then a slow email provider makes checkout slow, and an email outage can take down payments. Third, no retries. A failed fetch inside a route is just gone. There is no built-in mechanism to try again in five minutes when the downstream service recovers.

A queue solves all three by separating *accepting* work from *doing* work:

The request enqueues a job and returns immediately.
A separate worker process pulls jobs and executes them.
Failed jobs are retried on a schedule, with backoff, until they succeed or exhaust their attempts.
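
The split between accepting and doing work can be sketched in a few lines. This is an in-memory illustration only, not any library's API: `TinyQueue`, `enqueue`, and `processNext` are hypothetical names, retries here are immediate rather than scheduled, and a real queue would persist jobs in Postgres or Redis.

```typescript
// In-memory sketch only: a real queue persists jobs in Postgres or Redis.
type Job<T> = { id: number; data: T; attempts: number };

class TinyQueue<T> {
  private pending: Job<T>[] = [];
  private nextId = 1;
  private readonly maxAttempts: number;
  readonly failed: Job<T>[] = [];

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Request path: record the work and return at once.
  enqueue(data: T): number {
    const id = this.nextId++;
    this.pending.push({ id, data, attempts: 0 });
    return id;
  }

  // Worker path: take one job, run it, and requeue it if the handler throws.
  processNext(handler: (data: T) => void): boolean {
    const job = this.pending.shift();
    if (job === undefined) return false; // queue is empty
    job.attempts += 1;
    try {
      handler(job.data);
    } catch {
      // Retry until attempts run out, then park the job for inspection.
      if (job.attempts < this.maxAttempts) this.pending.push(job);
      else this.failed.push(job);
    }
    return true;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

The request handler only ever calls `enqueue`; everything slow or flaky happens inside `processNext`, which a separate process would call in a loop.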

This is the single most important reliability upgrade most growing Next.js apps can make.

The Options, From Simplest to Most Capable

There is no single right answer. The correct choice depends on your infrastructure, throughput, and how much you want to operate. Here is how we evaluate the main options at CodeAustral.

Platform cron and scheduled functions

The lightest option. Vercel Cron, GitHub Actions schedules, or a system cron hitting an API route handle *time-based* work well: nightly digests, cache warming, cleanup jobs. They do not handle *event-driven* work (do this right after an order) and have no per-job retry semantics. Good for periodic batches, wrong for "process this thing now, reliably."

pg-boss (Postgres-backed queue)

If you already run Postgres, pg-boss gives you a real durable queue without adding infrastructure. Jobs live in your database, so they are transactional, backed up with everything else, and easy to inspect with SQL. Throughput is fine for the tens-to-low-thousands of jobs per minute most products see. This is our default recommendation for the majority of Next.js apps.

BullMQ (Redis-backed queue)

When you need high throughput, rate limiting, priorities, repeatable jobs, and flow/parent-child workflows, BullMQ on Redis is the mature, battle-tested choice. It is faster and richer than a Postgres queue, at the cost of running and persisting Redis. Reach for it when job volume is high or you need its advanced scheduling features.

Durable execution engines

Tools like Inngest, Trigger.dev, Temporal, and Cloudflare Workflows (usually paired with Cloudflare Queues) shift the model from "queue + worker you operate" to "managed durable functions." They excel at multi-step workflows with sleeps, fan-out/fan-in, and human-in-the-loop steps, and they remove most of the operational burden. The tradeoff is a vendor dependency and a programming model you have to adopt. Excellent fit for serverless-first teams and complex orchestration.

Decision guidance

Periodic, time-based only → platform cron hitting a protected route.
You run Postgres, moderate volume → pg-boss. Least new infra, transactional, easy to debug.
High volume, rate limits, priorities, job graphs → BullMQ + Redis.
Serverless-first, complex multi-step workflows, want it managed → Inngest / Trigger.dev / Temporal.
All-in on Cloudflare → Cloudflare Queues + Workflows.
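
The guidance above can be written down as a small function, which is sometimes useful in an architecture review. The `Workload` fields and the order in which the rules are checked are assumptions made for this sketch; the list itself does not rank the criteria.

```typescript
type Workload = {
  timeBasedOnly: boolean;
  allInOnCloudflare: boolean;
  serverlessFirst: boolean;
  complexWorkflows: boolean;
  highVolume: boolean; // also stands for rate limits, priorities, job graphs
  runsPostgres: boolean;
};

type Recommendation =
  | "platform cron"
  | "Cloudflare Queues + Workflows"
  | "Inngest / Trigger.dev / Temporal"
  | "BullMQ + Redis"
  | "pg-boss"
  | "no clear match";

// Checks the most specific situations first. The precedence is an assumption.
function recommendJobSystem(w: Workload): Recommendation {
  if (w.timeBasedOnly) return "platform cron";
  if (w.allInOnCloudflare) return "Cloudflare Queues + Workflows";
  if (w.serverlessFirst && w.complexWorkflows) return "Inngest / Trigger.dev / Temporal";
  if (w.highVolume) return "BullMQ + Redis";
  if (w.runsPostgres) return "pg-boss";
  return "no clear match";
}
```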

Where the Worker Actually Runs

This is the detail people miss. A queue needs a long-lived process to consume jobs, and a serverless function is the opposite of long-lived. You have two paths.

Run a dedicated worker process. On a VPS or container, run your Next.js app and a separate Node worker (node worker.js) side by side, managed by PM2, systemd, or your container orchestrator. The worker imports the same business logic but runs outside the request lifecycle. This is the simplest mental model and what we use across most of our portfolio.

Use push-based delivery. Managed platforms such as Inngest and Upstash QStash invoke an HTTP endpoint in your Next.js app when a job is ready; Cloudflare Queues does the equivalent by pushing batches to a consumer Worker. There is no worker process to operate, but you inherit per-invocation timeouts and must design steps to fit inside them.

If you deploy to a single server or container, run a worker. If you are purely serverless, prefer a push-based managed platform over trying to keep a worker alive.
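
For the push-based path, the receiving endpoint has three jobs: reject callers that are not the platform, validate the payload, and report failure with a non-2xx status so the platform can redeliver. The sketch below is framework-free and hypothetical throughout (`handlePush`, `PushRequest`, the bearer-secret check). Real platforms sign their requests and ship SDK helpers for verification, so treat the shared secret as a stand-in.

```typescript
type PushRequest = { authorization: string | undefined; body: string };
type PushResponse = { status: number };
type ReceiptPayload = { orderId: string };

function handlePush(
  req: PushRequest,
  secret: string,
  run: (payload: ReceiptPayload) => void,
): PushResponse {
  // Stand-in for signature verification. A plain comparison is not timing-safe.
  if (req.authorization !== `Bearer ${secret}`) return { status: 401 };

  let payload: ReceiptPayload;
  try {
    const parsed: unknown = JSON.parse(req.body);
    if (
      typeof parsed !== "object" ||
      parsed === null ||
      typeof (parsed as { orderId?: unknown }).orderId !== "string"
    ) {
      return { status: 400 }; // malformed job: retrying will not help
    }
    payload = parsed as ReceiptPayload;
  } catch {
    return { status: 400 }; // body was not valid JSON
  }

  try {
    run(payload); // must finish inside the platform's per-invocation timeout
    return { status: 200 };
  } catch {
    return { status: 500 }; // non-2xx typically tells the platform to redeliver
  }
}
```

In a Next.js route handler you would read the header and body from the incoming request, call a function like this, and map its result onto the response.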

Retries, Backoff, and Idempotency

A queue without retries is just a slower function call. The real value is in surviving transient failures, and that survival depends on two things working together.

Retries with exponential backoff

When a job fails, do not retry instantly in a tight loop. Use exponential backoff with jitter so a struggling downstream service is not hammered. Cap attempts, and route exhausted jobs to a dead-letter queue (or a failed state) for inspection rather than silently dropping them.
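
As a rough sketch of that policy, here is exponential backoff with "full jitter" (the delay is drawn uniformly between zero and the exponential ceiling) plus an attempt cap. The function names, the one-second base, and the sixty-second cap are assumptions for illustration; pg-boss and BullMQ expose their own retry options, which you would normally configure instead of hand-rolling this.

```typescript
type NextStep = { action: "retry"; delayMs: number } | { action: "dead-letter" };

// attempt is 1-based: the first failure is attempt 1.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  // Ceiling doubles with each attempt, but never exceeds the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  // Full jitter: pick a point in [0, ceiling) so retries do not synchronize.
  return Math.floor(random() * ceiling);
}

function onFailure(
  attempt: number,
  maxAttempts: number,
  random: () => number = Math.random,
): NextStep {
  // Out of attempts: hand the job to a dead-letter queue instead of dropping it.
  if (attempt >= maxAttempts) return { action: "dead-letter" };
  return { action: "retry", delayMs: backoffDelayMs(attempt, 1000, 60000, random) };
}
```

Passing `random` as a parameter keeps the function deterministic in tests while defaulting to `Math.random` in production.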

Idempotency is non-negotiable

Because jobs retry, you have to assume any job can run more than once. At-least-once delivery is the norm; exactly-once is mostly a myth. Your handlers must be safe to run twice. The standard technique is an idempotency key plus a uniqueness constraint:

CREATE TABLE processed_jobs (
  idempotency_key text PRIMARY KEY,
  result          jsonb,
  created_at      timestamptz NOT NULL DEFAULT now()
);

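// worker.ts: pg-boss handler that is safe to retry
import PgBoss from "pg-boss";
import { Pool } from "pg";
import { sendEmail } from "./email"; // hypothetical app module, not part of pg-boss

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();
await boss.createQueue("send-receipt"); // pg-boss v10+ expects the queue to exist

// pg-boss v10+ passes the handler an array of jobs; we take the first.
await boss.work("send-receipt", async ([job]) => {
  const { orderId, email } = job.data as { orderId: string; email: string };
  const key = `receipt:${orderId}`;

  // Claim the work atomically. If this row already exists, we already did it.
  const inserted = await db.query(
    `INSERT INTO processed_jobs (idempotency_key)
     VALUES ($1) ON CONFLICT DO NOTHING RETURNING idempotency_key`,
    [key],
  );

  if (inserted.rowCount === 0) {
    return; // Already processed: ack and move on.
  }

  try {
    await sendEmail({ to: email, template: "receipt", orderId });
  } catch (err) {
    // Release the claim, or the retry would see the key and skip the email.
    await db.query(`DELETE FROM processed_jobs WHERE idempotency_key = $1`, [key]);
    throw err; // Rethrow so pg-boss marks the job failed and retries it.
  }
});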

Two more rules that prevent the classic dual-write bug:

Enqueue inside the same transaction as your state change when you can. With pg-boss, inserting the job and committing the order in one Postgres transaction means a job never exists for an order that rolled back, and an order never commits without its job.
Pass identifiers, not payloads. Enqueue orderId, then load fresh data inside the worker. Stale snapshots in job payloads are a common source of "why did it email the old address?" bugs.
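
Both rules can be seen in a small simulation. Everything here is a stand-in: `InMemoryDb` and its `transaction` method imitate a Postgres transaction with staged copies, and `placeOrder` and `receiptRecipient` are hypothetical names. With pg-boss the job row would live in the same database as the order; the sketch only shows the all-or-nothing shape and the load-fresh-data habit.

```typescript
type Order = { id: string; email: string };
type ReceiptJob = { name: "send-receipt"; data: { orderId: string } };
type Tx = { orders: Map<string, Order>; jobs: ReceiptJob[] };

class InMemoryDb {
  orders = new Map<string, Order>();
  jobs: ReceiptJob[] = [];

  // Runs fn against staged copies and applies them only if fn does not throw.
  transaction(fn: (tx: Tx) => void): void {
    const tx: Tx = { orders: new Map(this.orders), jobs: [...this.jobs] };
    fn(tx); // a throw here leaves this.orders and this.jobs untouched
    this.orders = tx.orders;
    this.jobs = tx.jobs;
  }
}

// Rule 1: the state change and the job commit together or not at all.
function placeOrder(db: InMemoryDb, order: Order, failBeforeCommit = false): void {
  db.transaction((tx) => {
    tx.orders.set(order.id, order);
    // Rule 2: the job carries only the identifier, never a snapshot.
    tx.jobs.push({ name: "send-receipt", data: { orderId: order.id } });
    if (failBeforeCommit) throw new Error("simulated failure before commit");
  });
}

// The worker loads current data at run time, so it sees later edits.
function receiptRecipient(db: InMemoryDb, job: ReceiptJob): string {
  const order = db.orders.get(job.data.orderId);
  if (order === undefined) throw new Error(`order ${job.data.orderId} not found`);
  return order.email;
}
```

If the customer corrects their address between checkout and the worker run, `receiptRecipient` returns the corrected one, because nothing about the address was frozen into the job.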

Observability: You Cannot Operate What You Cannot See

Background jobs fail quietly by design: they run out of the user's sight, and that is the point. So you need to make their behavior visible, or you will only learn about problems from angry customers.

Metrics. Track queue depth, jobs completed per minute, failure rate, and processing latency. A growing backlog is the earliest signal that workers cannot keep up.
Structured logs. Log job name, job id, attempt number, and outcome on every run. Correlate with a request id so you can trace a job back to the action that created it.
Dead-letter alerting. Alert when jobs land in the failed/dead-letter state. These are the jobs that exhausted retries and need a human.
Dashboards. BullMQ has Bull Board and Taskforce; pg-boss state lives in queryable tables. Managed platforms ship dashboards out of the box. Use them.

A simple guardrail we apply: alert if queue depth exceeds a threshold for more than a few minutes, and alert on *any* dead-letter arrival. Those two alerts catch the majority of job-system incidents before users notice.
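
Those two alerts are simple enough to express as a pure function over recent metric samples. The `Sample` shape, the alert names, and the idea of sampling at fixed intervals are assumptions for this sketch; in practice the same rule usually lives in your monitoring tool's alert configuration.

```typescript
type Sample = { atMs: number; queueDepth: number; newDeadLetters: number };
type Alert = "queue-depth-sustained" | "dead-letter-arrival";

// samples must be ordered oldest to newest.
function evaluateAlerts(
  samples: Sample[],
  depthThreshold: number,
  sustainedMs: number,
): Alert[] {
  const alerts: Alert[] = [];
  const latest = samples[samples.length - 1];
  if (latest === undefined) return alerts; // no data yet

  // Walk backwards to find where the current unbroken breach began.
  let breachStartMs: number | undefined;
  for (let i = samples.length - 1; i >= 0; i--) {
    const sample = samples[i];
    if (sample === undefined || sample.queueDepth <= depthThreshold) break;
    breachStartMs = sample.atMs;
  }
  if (breachStartMs !== undefined && latest.atMs - breachStartMs >= sustainedMs) {
    alerts.push("queue-depth-sustained");
  }

  // Any dead-letter arrival in the latest sample pages a human.
  if (latest.newDeadLetters > 0) alerts.push("dead-letter-arrival");
  return alerts;
}
```

A brief spike that recovers before `sustainedMs` elapses raises nothing, which keeps the depth alert from firing on normal bursts.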

A Pragmatic Reference Architecture

For a typical Next.js product running on a server or container with Postgres, this stack has served us well:

Route handler validates input, writes domain state and enqueues the job in one transaction, returns 200.
A separate PM2/systemd-managed worker process consumes jobs with pg-boss.
Handlers are idempotent via an idempotency-key table.
Retries use exponential backoff; exhausted jobs go to a failed state with alerting.
Time-based jobs (digests, cleanup) are scheduled with pg-boss cron or platform cron.
Metrics and structured logs flow to your existing observability tool.

Start here. Move to BullMQ when volume demands it, or to a durable execution platform when your workflows grow into multi-step processes that need sleeps and orchestration. Do not adopt the heaviest tool on day one: the cost of operating it rarely pays off until you actually need its features.

Common Mistakes We See

Doing work in the request and calling it "async" because you didn't `await`. On serverless, a floating promise can be frozen or killed as soon as the response is sent. That is not a background job; it is a coin flip.
No idempotency. Works in testing, double-charges or double-emails in production the first time a retry fires.
Putting whole objects in the payload. They go stale, and large payloads bloat the queue.
Treating cron as a queue. Cron is for schedules, not for reliable per-event processing with retries.
No backpressure plan. When a downstream API is down, jobs pile up. Decide whether to pause, slow, or shed load before it happens.
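
For the backpressure point, one common way to "pause" is a circuit breaker that the worker consults before pulling the next job. The sketch below is a minimal version under stated assumptions: the class and method names are hypothetical, the clock is passed in as `nowMs` to keep it testable, and the thresholds are yours to pick.

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | undefined;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // The worker checks this before taking a job that calls the downstream API.
  canProcess(nowMs: number): boolean {
    if (this.openedAtMs === undefined) return true; // closed: normal operation
    // Open: stay paused until the cooldown passes, then allow a probe.
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = undefined; // downstream recovered: resume
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAtMs = nowMs; // open, or re-open after a failed probe
    }
  }
}
```

While the breaker is open, jobs simply wait in the queue instead of burning their retry attempts against a service that is known to be down.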

Frequently Asked Questions

Do I really need a queue, or can I just use a serverless function?

You need one as soon as work must survive failure or outlive a request. If a task can fail and must be retried, takes longer than your function timeout, or shouldn't block the user's response, put it in a queue. For purely synchronous, fast, all-or-nothing work, a plain function is fine.

Should I use pg-boss or BullMQ?

Choose pg-boss if you already run Postgres and have moderate volume — it adds no new infrastructure and is transactional and easy to debug with SQL. Choose BullMQ when you need high throughput, priorities, rate limiting, or parent-child job flows, and you are willing to run and persist Redis to get them.

How do I make a job safe to retry?

Make it idempotent. Derive a stable idempotency key from the work (for example the order id), and before doing the side effect, atomically claim that key using a unique constraint. If the key already exists, acknowledge the job and stop. Pass identifiers rather than data snapshots so retries always act on current state.

Where does the worker run on Vercel?

Vercel has no long-lived process to consume a queue, so either run a worker elsewhere (a VPS, container, or Render/Fly service) or use a push-based platform such as Inngest, Trigger.dev, or Upstash QStash that invokes a Next.js route when a job is ready. Keeping a worker "alive" on pure serverless is the wrong tool.

How is durable execution different from a normal queue?

A queue delivers a single unit of work to a worker. Durable execution engines like Temporal, Inngest, and Trigger.dev orchestrate multi-step workflows, persisting progress between steps so a function can sleep for days, fan out and back in, or wait on a human, and resume exactly where it left off after a crash. It is queueing plus stateful orchestration.
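
The core trick behind "resume exactly where it left off" is checkpointing each step's result and replaying saved results on the next run. The sketch below shows only that idea. `makeStep` and `fulfilOrder` are hypothetical, the `Map` stands in for durable storage, and real engines such as Temporal, Inngest, and Trigger.dev have their own APIs and add timers, retries, and concurrency control on top.

```typescript
type Step = <T>(name: string, fn: () => T) => T;

// Returns a step runner backed by a store that would survive restarts.
function makeStep(store: Map<string, unknown>): Step {
  return function step<T>(name: string, fn: () => T): T {
    // Replay: a step that already completed returns its saved result.
    if (store.has(name)) return store.get(name) as T;
    const result = fn();
    store.set(name, result); // checkpoint before moving to the next step
    return result;
  };
}

// A two-step workflow. log records which side effects actually ran.
function fulfilOrder(step: Step, log: string[], emailProviderDown: boolean): string {
  const chargeId = step("charge", () => {
    log.push("charge");
    return "ch_1";
  });
  step("email", () => {
    if (emailProviderDown) throw new Error("email provider down");
    log.push("email");
    return true;
  });
  return chargeId;
}
```

If the first run fails at the email step, a second run with the same store skips the charge and goes straight to the email, so the customer is charged once.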

Working with CodeAustral

We design and build the unglamorous reliability layer — queues, idempotent workers, retries, and observability — into the web platforms and AI products we ship for clients worldwide. If you are deciding how background work should run in your Next.js stack, or untangling a job system that is dropping or duplicating work, send us a short brief at https://codeaustral.com/contact and we'll help you choose the right architecture for your scale.