Streaming AI Responses in Next.js: UX, Edge, and Backpressure

Streaming is the difference between an AI feature that feels alive and one that feels broken. When a model takes eight seconds to produce a paragraph, a spinner reads as failure; the same eight seconds, streamed token by token, reads as thinking. But getting streaming right in the Next.js App Router involves more than piping a ReadableStream to the browser. You have to pick a runtime, handle errors that arrive halfway through a response, and respect backpressure so a slow client does not quietly exhaust your server's memory. This is how we build streaming AI features at CodeAustral, and the trade-offs we have learned to weigh.

Why Streaming Is a UX Decision First

The instinct is to treat streaming as a performance optimization. It is not. Streaming barely changes total latency, and it can slightly increase it. What it changes is *perceived* latency and the user's mental model of the system.

Three things happen when you stream:

Time to first token (TTFT) replaces time to full response as the metric that matters. A 300ms TTFT with a 6-second total feels faster than a 2-second blocking response, even though it is objectively slower end to end.
The interface becomes interruptible. Users can read, judge, and stop a bad answer before it finishes. That is a feature, not a side effect, and your UI should make stopping cheap.
Failure becomes partial. A blocking request either succeeds or fails. A stream can deliver three good sentences and then die, and your code has to decide what that means.
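
To budget TTFT you first have to measure it separately from total time. The following is a minimal sketch, assuming a byte stream such as a fetch response body; measureStream and StreamTiming are hypothetical names, not part of any library, and the clock is injectable only so the helper can be tested deterministically.

```typescript
// Hypothetical helper: the names and result shape are illustrative only.
interface StreamTiming {
  ttftMs: number | null; // null if the stream ended without producing a chunk
  totalMs: number;
  chunks: number;
}

async function measureStream(
  stream: ReadableStream<Uint8Array>,
  now: () => number = () => performance.now(),
): Promise<StreamTiming> {
  const startedAt = now();
  const reader = stream.getReader();
  let ttftMs: number | null = null;
  let chunks = 0;

  while (true) {
    const { done } = await reader.read();
    if (done) break;
    // Record the clock once, when the first chunk arrives
    if (ttftMs === null) ttftMs = now() - startedAt;
    chunks += 1;
  }

  return { ttftMs, totalMs: now() - startedAt, chunks };
}
```

In a browser or a test script you might call it as measureStream(response.body) after a fetch to the route, then track ttftMs and totalMs as two separate metrics.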

If you internalize only one idea, make it this: streaming is a contract with the user that says *I will show you my work as I do it*. Everything technical below exists to keep that contract even when the network, the model provider, or the runtime misbehaves.

The AI SDK Pattern in the App Router

The AI SDK (Vercel's ai package, v5 as of 2026) removes most of the boilerplate around server-sent streaming. The canonical setup is a Route Handler that returns a streaming response and a client component that consumes it. Keep the model call on the server; the client never sees your provider key.

// app/api/chat/route.ts
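import { convertToModelMessages, streamText, type UIMessage } from 'ai';
import { openai } from '@ai-sdk/openai';

export const runtime = 'edge';
export const maxDuration = 30;

export async function POST(req: Request) {
  // useChat posts UI messages; convert them to model messages for streamText
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: openai('gpt-4o'),
    messages: convertToModelMessages(messages),
    abortSignal: req.signal, // propagate client disconnects
  });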

  return result.toUIMessageStreamResponse({
    onError: (error) => {
      // Returned to the client as a stream part, not a 500
      console.error('stream error', error);
      return 'The model failed mid-response. Please retry.';
    },
  });
}

The provider in the import is interchangeable. The same streamText call works against Anthropic, Google, Mistral, or a self-hosted model behind an OpenAI-compatible gateway by swapping the model adapter. We deliberately keep that line as the only provider-specific code in the route so switching providers is a one-line change, which matters when you are negotiating rate limits or comparing cost per token across vendors.

On the client, the useChat hook consumes the stream and re-renders as parts arrive. The key detail is that you do not parse the protocol yourself; the hook reconstructs message parts, tool calls, and errors from the stream wire format.

Edge vs Node Runtime: A Decision, Not a Default

This is the choice teams get wrong most often. The Edge runtime is tempting because it advertises low cold-start latency and global distribution, which sounds ideal for streaming. But Edge is a constrained environment, and the constraints bite exactly the workloads that LLM features tend to have.

Use this as a decision list:

Choose Edge when your route only does fetch-based work (calling a model API over HTTP), you need low TTFT from many regions, and your responses can finish well within the platform's streaming duration cap. Edge starts fast and streams cleanly.
Choose Node when you use any package that needs native modules or the full Node API (most database drivers, sharp, some auth libraries, the AWS SDK), you do long tool-calling loops that may run minutes, or you need streaming durations beyond what Edge allows on your host.
Choose Node, reluctantly, when your observability stack (OpenTelemetry exporters, certain logging transports) only works under Node.
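
The decision list above can be written down as a small pure function, which is sometimes useful in a review checklist or a lint script. This is only a sketch: RouteProfile, its fields, and chooseRuntime are hypothetical, and the duration cap is whatever your host documents, not a value assumed here. The returned strings match the values Next.js accepts for a route's runtime export.

```typescript
// Hypothetical shape: these field names are illustrative, not a Next.js API.
interface RouteProfile {
  needsNodeApis: boolean;          // native modules or the full Node API
  expectedMaxDurationSec: number;  // longest stream or tool loop you expect
  edgeDurationCapSec: number;      // your host's Edge streaming cap (look it up)
  nodeOnlyObservability: boolean;  // exporters or transports that only run on Node
}

interface RuntimeChoice {
  runtime: 'edge' | 'nodejs';
  reason: string;
}

function chooseRuntime(p: RouteProfile): RuntimeChoice {
  if (p.needsNodeApis) {
    return { runtime: 'nodejs', reason: 'dependencies need native modules or the full Node API' };
  }
  if (p.expectedMaxDurationSec >= p.edgeDurationCapSec) {
    return { runtime: 'nodejs', reason: 'responses may outlast the Edge streaming cap' };
  }
  if (p.nodeOnlyObservability) {
    return { runtime: 'nodejs', reason: 'observability stack only works under Node' };
  }
  return { runtime: 'edge', reason: 'fetch-only work that finishes within the Edge cap' };
}
```

The order of the checks mirrors the list: hard dependency constraints first, duration second, and observability last because it is the reluctant case.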

A common trap: putting your Postgres or Prisma call in an Edge route to "read the user's history before the model call." Most database clients do not run on Edge, or run only through a separate serverless driver. Either move the data fetch to a preceding server action under Node, or use a driver explicitly built for Edge (for example, an HTTP-based Postgres client). Do not fight the runtime; pick the one that matches your dependencies.

One more point on regions: Edge functions run near the *user*, but your model provider's API runs in a fixed region. If your Edge function in São Paulo calls a model API in Virginia, you have added a cross-continent hop on every token-generating request. Sometimes a Node function colocated with the model provider beats a globally distributed Edge function. Measure TTFT from real geographies before committing.

Backpressure: The Part Everyone Skips

Backpressure is the signal that tells a producer to slow down when the model produces tokens faster than the client can consume them. A ReadableStream has a built-in mechanism for this: when the consumer is slow, the stream's internal queue fills, and a well-behaved producer pauses until there is room. The AI SDK and the runtime handle this for you in the common path.

The problems appear when you write a custom stream and forget that the producer must respect the controller.

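// A custom stream source that respects backpressure.
// modelTokens is the provider's async iterable, started with abortController.signal.
const encoder = new TextEncoder();
const abortController = new AbortController();

const stream = new ReadableStream<Uint8Array>({
  async start(controller) {
    try {
      for await (const chunk of modelTokens) {
        // enqueue returns nothing, but desiredSize tells you when to slow down
        controller.enqueue(encoder.encode(chunk));
        // Keep yielding until the consumer drains the queue; a single yield is
        // not enough, and an unbounded queue means unbounded memory growth
        while (
          !abortController.signal.aborted &&
          controller.desiredSize !== null &&
          controller.desiredSize <= 0
        ) {
          await new Promise((r) => setTimeout(r, 0));
        }
        if (abortController.signal.aborted) return; // consumer went away
      }
      controller.close();
    } catch (err) {
      controller.error(err);
    }
  },
  cancel() {
    // Client disconnected: stop the upstream model call
    abortController.abort();
  },
});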

The two lines that matter most are cancel() and the upstream abort(). Without them, a user who closes the tab leaves your server consuming (and paying for) tokens from the model provider for the full length of the response. At scale, abandoned streams are a real cost line. Wire req.signal through to the provider call so a client disconnect cancels the model request immediately.

If you are using the AI SDK's helpers, this is handled, provided you pass abortSignal: req.signal into streamText. The single most common backpressure bug we see is *not* a queue overflow; it is orphaned upstream requests because nobody propagated the abort.
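
The example above drives the loop from start() and checks desiredSize itself. An alternative worth knowing is to let the stream ask for data through pull(), which the Streams machinery only calls while the internal queue has room. The sketch below assumes the provider exposes tokens as an AsyncIterable of strings; streamFromTokens is a hypothetical helper name, and the AbortController is assumed to be the one whose signal was passed to the model call.

```typescript
// Hypothetical helper; `tokens` stands in for a provider's token stream.
function streamFromTokens(
  tokens: AsyncIterable<string>,
  abortController: AbortController,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();

  return new ReadableStream<Uint8Array>({
    // pull() runs only while the queue is below its high-water mark,
    // so a slow consumer pauses the producer without any polling
    async pull(controller) {
      try {
        const { value, done } = await iterator.next();
        if (done) {
          controller.close();
          return;
        }
        controller.enqueue(encoder.encode(value));
      } catch (err) {
        controller.error(err);
      }
    },
    async cancel() {
      // Client disconnected: stop the upstream model call and release the iterator
      abortController.abort();
      await iterator.return?.();
    },
  });
}
```

The trade-off is that each pull() produces one chunk, so throughput depends on how quickly the consumer reads. For token streams that is usually the behavior you want.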

Handling Errors Mid-Stream

Once you have sent a 200 OK and the first byte, you cannot change the status code. The HTTP headers are gone. This breaks the usual error-handling instinct of "return a 500." Mid-stream errors have to travel *inside* the stream.

There are three failure windows, and each needs a different response:

Before the first token. The provider rejected the request (rate limit, bad input, auth). Here you still control the status code, so return a proper 4xx/5xx with a JSON body.
During the stream. The connection to the provider dropped, the model hit a content filter, or a tool call threw. You must emit an error *part* in the stream so the client can show it inline, then close cleanly.
After the last token, during finalization. Your onFinish callback that persists the message to the database failed. The user already saw the answer; do not surface this as a user-facing error. Log it, enqueue a retry, and reconcile asynchronously.
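
For a hand-rolled stream, the second window can be sketched as follows. The wire format here (one JSON object per line, with a text or error type) is a hypothetical format invented for illustration and is not the AI SDK's protocol; toPartStream and readParts are likewise made-up names. The point is the shape: the server reports the failure in-band and closes cleanly, and the client keeps whatever text already arrived.

```typescript
// Hypothetical wire format: one JSON object per line. NOT the AI SDK's protocol.
type Part =
  | { type: 'text'; text: string }
  | { type: 'error'; message: string };

function toPartStream(tokens: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();
  const send = (c: ReadableStreamDefaultController<Uint8Array>, part: Part): void =>
    c.enqueue(encoder.encode(JSON.stringify(part) + '\n'));

  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      try {
        const { value, done } = await iterator.next();
        if (done) {
          controller.close();
          return;
        }
        send(controller, { type: 'text', text: value });
      } catch {
        // The status line is already sent, so report the failure inside the stream
        send(controller, { type: 'error', message: 'The model failed mid-response. Please retry.' });
        controller.close(); // close cleanly instead of erroring the stream
      }
    },
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Client side: accumulate text parts and remember the error, if any.
async function readParts(
  stream: ReadableStream<Uint8Array>,
): Promise<{ text: string; error: string | null }> {
  const decoder = new TextDecoder();
  const reader = stream.getReader();
  let buffer = '';
  let text = '';
  let error: string | null = null;

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep an incomplete trailing line for the next chunk
    for (const line of lines) {
      if (!line) continue;
      const part = JSON.parse(line) as Part;
      if (part.type === 'text') text += part.text;
      else error = part.message; // partial text is preserved, not discarded
    }
  }
  return { text, error };
}
```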

The AI SDK exposes onError on the stream response (shown earlier) precisely so window-two errors become a typed stream part rather than a dangling, half-finished response. On the client, useChat surfaces an error object you render below the partial message. Always preserve the partial text: deleting three good sentences because the fourth failed is a worse experience than showing what arrived plus a retry affordance.

For persistence, separate *display* from *durability*. Stream to the user first, and persist in onFinish using the fully assembled message. If persistence fails, the user is unaffected; reconcile from logs.
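
One way to keep durability from leaking into display is to wrap the write in a callback that cannot throw. The sketch below is framework-free and every name in it is hypothetical (makeOnFinish, AssistantMessage, the save function, and the in-memory retry queue, which in production would more likely be a durable job queue). You would call the returned function from your onFinish handler with the assembled message.

```typescript
// Hypothetical names throughout; the message shape is illustrative only.
interface AssistantMessage {
  id: string;
  text: string;
}

type Save = (message: AssistantMessage) => Promise<void>;
type Log = (msg: string, err: unknown) => void;

function makeOnFinish(
  save: Save,
  retryQueue: AssistantMessage[],
  log: Log,
): (message: AssistantMessage) => Promise<void> {
  // The returned callback never rejects, so a failed write cannot break the stream
  return async (message) => {
    try {
      await save(message);
    } catch (err) {
      log('persist failed', err);
      retryQueue.push(message); // reconcile asynchronously, outside the request
    }
  };
}
```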

Optimistic UI and the Stop Affordance

Optimistic UI for chat is mostly about ordering and reversibility. When a user sends a message:

Append their message immediately to local state before the request resolves. The input should clear instantly. This is the cheapest perceived-speed win available.
Render an assistant placeholder that fills as tokens arrive, rather than appearing all at once. An empty bubble with a subtle pulse beats a delayed full bubble.
Make Stop a first-class button. It should call the hook's stop() (which aborts the fetch and triggers your cancel()), and it should leave the partial response in place, clearly marked as stopped. Do not throw away partial work.
Reconcile on settle. When the stream finishes or errors, replace the optimistic placeholder with the canonical state from the server's onFinish, so what the user keeps matches what you stored.

The subtle rule: optimistic updates must be reversible. If the send fails before the first token, you need to either re-enable resending that exact message or roll it back with a visible, non-destructive error. Silently dropping a user's message is the fastest way to lose trust in an AI product.
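
The ordering and reversibility rules above can be captured in a small reducer. This is a sketch of one possible state model, not how useChat works internally; ChatMessage, Action, and chatReducer are hypothetical names. It appends the user message and an empty assistant placeholder on send, ignores tokens that arrive after Stop, and on a failed send removes the placeholder while keeping the user's text so it can be resent.

```typescript
// Hypothetical state model; names are illustrative and not tied to useChat.
type Status = 'pending' | 'streaming' | 'done' | 'stopped' | 'failed';

interface ChatMessage {
  id: string;
  role: 'user' | 'assistant';
  text: string;
  status: Status;
}

type Action =
  | { type: 'send'; userId: string; assistantId: string; text: string }
  | { type: 'token'; assistantId: string; delta: string }
  | { type: 'stop'; assistantId: string }
  | { type: 'settle'; assistantId: string; canonicalText: string }
  | { type: 'sendFailed'; userId: string; assistantId: string };

function chatReducer(state: ChatMessage[], action: Action): ChatMessage[] {
  switch (action.type) {
    case 'send':
      // Optimistic: show the user's message and an empty placeholder immediately
      return [
        ...state,
        { id: action.userId, role: 'user', text: action.text, status: 'done' },
        { id: action.assistantId, role: 'assistant', text: '', status: 'pending' },
      ];
    case 'token':
      // Tokens that arrive after Stop are ignored so the partial text stays stable
      return state.map((m): ChatMessage =>
        m.id === action.assistantId && m.status !== 'stopped'
          ? { ...m, text: m.text + action.delta, status: 'streaming' }
          : m,
      );
    case 'stop':
      // Keep the partial response, clearly marked as stopped
      return state.map((m): ChatMessage =>
        m.id === action.assistantId ? { ...m, status: 'stopped' } : m,
      );
    case 'settle':
      // Replace the placeholder with the stored text unless the user stopped it
      return state.map((m): ChatMessage =>
        m.id === action.assistantId && m.status !== 'stopped'
          ? { ...m, text: action.canonicalText, status: 'done' }
          : m,
      );
    case 'sendFailed':
      // Reversible: drop the placeholder, keep the user's text flagged for resend
      return state
        .filter((m) => m.id !== action.assistantId)
        .map((m): ChatMessage => (m.id === action.userId ? { ...m, status: 'failed' } : m));
  }
}
```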

A Pragmatic Checklist for Production

Before you ship a streaming AI route, confirm:

TTFT is measured and budgeted, not just total latency.
req.signal is propagated to the provider so disconnects cancel upstream work.
Runtime choice matches your dependencies (database drivers, native modules) rather than defaulting to Edge.
Mid-stream errors emit a stream part and preserve partial text.
Persistence happens in onFinish and never blocks or breaks the user-facing stream.
The client has a working Stop button that keeps partial output.
Streaming duration limits on your host are known and your maxDuration is set accordingly.

Frequently Asked Questions

Should I always use the Edge runtime for AI streaming?

No. Edge is excellent for fetch-only routes that need fast cold starts and global distribution, but it cannot run most database drivers, native modules, or very long tool-calling loops. Choose Node when your route has those dependencies or needs longer streaming durations. Match the runtime to your dependencies rather than defaulting to Edge for every AI endpoint.

How do I return an error after streaming has already started?

You cannot change the HTTP status once the first byte is sent. Instead, emit the error inside the stream as a dedicated error part. The AI SDK's onError callback handles this, turning mid-stream failures into a typed part the client renders inline. Always preserve any partial text the user already received and offer a retry.

What is backpressure and do I need to handle it manually?

Backpressure is the mechanism that pauses a producer when the consumer reads slowly, preventing unbounded memory growth. The AI SDK and the runtime handle it automatically in the common path. You only manage it manually in custom ReadableStream code, where you check controller.desiredSize and implement cancel() to abort the upstream model call on disconnect.

How do I stop paying for tokens when a user closes the tab?

Propagate req.signal into your model call (for example, abortSignal: req.signal in streamText) and implement cancel() on custom streams to call abortController.abort(). When the client disconnects, this cancels the upstream provider request immediately instead of letting it generate the full response that nobody will read.

Where should I persist the assistant message?

Persist in the onFinish callback using the fully assembled message, after the user has already seen the streamed output. Keep display and durability separate: a database write failure should never break or block the user-facing stream. Log persistence errors and reconcile asynchronously rather than surfacing them as user-facing failures.

Working with CodeAustral

We build AI features that hold up under real traffic, from streaming chat to tool-calling agents wired into production data. If you are shipping an LLM feature in Next.js and want it to feel fast, fail gracefully, and not quietly burn tokens, send us a brief at codeaustral.com/contact and we will tell you where the sharp edges are.