Streaming Responses

Deliver AI-generated tokens to users in real time with Server-Sent Events streaming. Implement responsive, interactive applications that feel immediate.

Real-Time Data Delivery

Streaming is not an optimization — it is a user experience requirement for any application where someone is waiting for a response. Non-streaming delivery forces users to stare at a loading indicator while the model generates the complete response server-side. Streaming changes the interaction model: the first tokens appear within a fraction of a second, and the user reads along as the model writes. This guide covers every aspect of streaming implementation on OpenRouter, from basic SSE consumption to production-grade connection management and fallback strategies.

Why Streaming Changes the User Experience Equation

Consider the difference between these two experiences. In the non-streaming case, a user submits a question and waits. A loading spinner appears. Ten seconds pass — during which the model is generating a 500-token response at 50 tokens per second. At the ten-second mark, the entire response appears at once. The user perceives a ten-second delay and, during that time, has no indication that anything useful is happening. In the streaming case, the first token appears at approximately 0.8 seconds. By the 2-second mark, the user has read the first sentence. By the 10-second mark, the entire response has arrived — but the user felt engaged and informed throughout the delivery, not waiting for an opaque process to complete.

This difference is not cosmetic. User research consistently shows that progressive content delivery substantially reduces abandonment rates in interactive AI applications. When users see content arriving token by token, they interpret the system as actively working rather than potentially stuck. The psychological difference between watching a response build in real time and staring at an indeterminate progress bar is the difference between a product that feels intelligent and responsive and one that feels slow and uncertain. The NIST guidelines on human-AI interaction emphasize that systems should provide continuous feedback about their processing state — a design principle that streaming inherently satisfies.

Streaming also enables interaction patterns that non-streaming delivery cannot support. A coding assistant can begin displaying suggested code while the model continues reasoning about implementation details. A content editor can surface the first paragraph while subsequent paragraphs are still being generated, letting the user begin reading or editing immediately. A chatbot can stream thinking steps alongside final output, making the model's reasoning visible and auditable. These patterns depend on streaming architecture; without it, applications are limited to the request-wait-display cycle that characterized first-generation AI interfaces.

Streaming Modes and Protocol Reference

OpenRouter supports multiple streaming delivery modes, each optimized for different application requirements. Server-Sent Events is the recommended default for most applications due to its broad client library support and compatibility with standard HTTP infrastructure including load balancers, proxies, and CDNs. The table below catalogs the available modes and their appropriate use cases.

Mode | Protocol | Use Case
--- | --- | ---
Server-Sent Events (SSE) | HTTP/1.1 or HTTP/2 | Chat interfaces, coding assistants, real-time content generation
Chunked Transfer Encoding | HTTP/1.1 | Long-form text generation, document processing
WebSocket Streaming | WebSocket (RFC 6455) | Bidirectional interactions, multi-turn conversations
Polling with Delta | HTTP request/response | Environments where persistent connections are restricted
Batch Completion | HTTP request/response | Non-interactive workloads, scheduled processing jobs

Most production applications use SSE as the primary streaming protocol with a non-streaming fallback for environments that cannot maintain persistent connections. WebSocket streaming is available for applications requiring bidirectional communication during generation — for instance, when users can interrupt or redirect model output mid-generation. The Consumer Financial Protection Bureau's technology guidance notes that interactive digital services should provide clear indicators of system processing status, a requirement that streaming response delivery directly addresses.

Implementing SSE Streaming: Client-Side Patterns

Streaming implementation begins with a single parameter change: set stream to true in your chat completion request. The response format changes from a single JSON object containing the complete message to a stream of JSON objects, each containing a delta — the token or tokens generated since the previous event. Your client consumes this stream by parsing each JSON event and appending the delta content to the displayed message.
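
For illustration, each event on the wire carries a small JSON chunk whose delta holds the newly generated text. The shape below is a trimmed example of the OpenAI-compatible chunk format (fields such as id, model, and finish_reason are omitted), and streams conventionally end with a data: [DONE] sentinel:

```
data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"choices":[{"index":0,"delta":{"content":"lo, world."}}]}

data: [DONE]
```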

The fundamental client implementation pattern is consistent across languages. Open an HTTP connection with stream set to true. Read the response body as a continuous stream rather than buffering the complete response. Parse each SSE event — identified by the data: prefix — as it arrives. Extract the delta content from the choices array and append it to your display buffer. Continue processing events until you receive a done event or the stream closes.
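
A minimal sketch of that loop in TypeScript using the Fetch API follows. The model ID and environment variable name are placeholders, and error handling is omitted for brevity:

```typescript
// Open a streaming chat completion and invoke onDelta for each token delta received.
async function streamCompletion(prompt: string, onDelta: (text: string) => void): Promise<void> {
  const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini",          // placeholder: any available model ID
      messages: [{ role: "user", content: prompt }],
      stream: true,                          // switch from a single JSON body to SSE
    }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are newline-delimited; keep any incomplete trailing line in the buffer.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;   // skip comments and blank separator lines
      const payload = line.slice("data: ".length).trim();
      if (payload === "[DONE]") return;           // stream finished
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) onDelta(delta);                  // append to the display buffer
    }
  }
}
```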

Connection Resilience and Error Recovery

Streaming connections are long-lived compared to standard HTTP requests — a single streaming response can persist for tens of seconds or longer during extended generation. This makes connection management a first-class concern. Network interruptions, provider-side timeouts, and mobile connectivity changes can all terminate a streaming connection before the model finishes generating. Production applications must handle these interruptions gracefully.

The recommended resilience pattern includes three layers. First, client-side reconnection with exponential backoff: if the stream disconnects unexpectedly, wait an increasing interval before retrying, up to a maximum retry count. Second, token tracking: record the last successfully received token sequence so that a reconnection does not cause duplicate content or gaps in the displayed output. Third, a non-streaming fallback: if streaming reconnection fails after the maximum retry count, fall back to a standard non-streaming completion request so the user receives their response without an error message. This layered approach ensures that streaming failures degrade gracefully rather than producing hard failures visible to end users.
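
A minimal sketch of these three layers is shown below, reusing the streamCompletion helper from the earlier sketch plus a hypothetical completeNonStreaming helper for the fallback path (neither is an OpenRouter SDK function):

```typescript
// Layered resilience: backoff + retry, token tracking, then a non-streaming fallback.
async function resilientCompletion(
  prompt: string,
  onDelta: (text: string) => void,    // append streamed text to the UI
  onReplace: (text: string) => void,  // replace the UI contents
  maxRetries = 3,
): Promise<void> {
  let received = "";                  // layer 2: record what the user has already seen

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await streamCompletion(prompt, (delta) => {
        received += delta;
        onDelta(delta);
      });
      return;                         // stream completed normally
    } catch {
      if (attempt === maxRetries) break;
      // Layer 1: exponential backoff before reconnecting (1s, 2s, 4s, ...).
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
      // Layer 2: this simple strategy regenerates from the beginning, so clear what was
      // already displayed to avoid duplicates; a provider that supports resuming could
      // instead skip the prefix recorded in `received`.
      received = "";
      onReplace("");
    }
  }

  // Layer 3: streaming keeps failing, so fall back to a standard completion
  // and replace any partial output so the user still receives a full answer.
  onReplace(await completeNonStreaming(prompt));
}
```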

Latency Optimization Strategies

Streaming reduces perceived latency but does not eliminate the underlying generation latency. Time-to-first-token (TTFT) remains the critical metric — even with streaming, the user waits approximately 0.5 to 1.5 seconds before seeing any response content. Several optimization strategies can reduce TTFT. Model selection matters: lighter models like Gemini Flash and Llama 3.3 consistently deliver lower TTFT than frontier models due to smaller parameter counts and less complex inference paths. Geographic routing matters: requests to providers with inference endpoints in your region will have lower network latency than requests routed across continents. Prompt size matters too: bundling multiple prompt inputs into a single completion request increases the context the model must process before the first token can be generated, so latency-sensitive requests should keep prompts lean rather than batched.
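
To track how these choices affect TTFT in practice, you can time the interval between sending the request and receiving the first delta. The sketch below reuses the streamCompletion helper defined earlier:

```typescript
// Measure time-to-first-token (in milliseconds) for a single streaming request.
async function measureTTFT(prompt: string): Promise<number> {
  const start = performance.now();
  let ttft = -1;

  await streamCompletion(prompt, () => {
    if (ttft < 0) ttft = performance.now() - start;   // first token observed
  });

  return ttft;   // -1 indicates no tokens were received
}
```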

For applications where sub-200ms TTFT is a hard requirement, the most effective strategy is speculative display: begin rendering a placeholder response immediately upon receiving the user's prompt, then progressively replace placeholder content with model-generated tokens as they arrive. This technique, borrowed from progressive web rendering patterns, creates the perception of instant response even before the model has generated its first token.
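
A small browser-side sketch of this pattern, assuming an element with an illustrative id of "response" and the streamCompletion helper from earlier:

```typescript
// Speculative display: show a placeholder instantly, then swap in real tokens as they arrive.
async function answerWithSpeculativeDisplay(prompt: string): Promise<void> {
  const el = document.getElementById("response")!;
  el.textContent = "Drafting a response...";   // rendered immediately, before any token arrives

  let firstToken = true;
  await streamCompletion(prompt, (delta) => {
    if (firstToken) {
      el.textContent = "";                     // replace the placeholder with real output
      firstToken = false;
    }
    el.textContent += delta;
  });
}
```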

Frequently Asked Questions About Streaming Responses

How do I enable streaming in my API requests?

Add stream: true to your chat completion request body. The response format changes from a single JSON object to a stream of SSE events. Your client consumption code must handle the stream format — most OpenAI-compatible SDKs support streaming natively when the stream parameter is set to true, requiring only the addition of an event handler or async iterator to process token deltas as they arrive.
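
For example, with the OpenAI-compatible Node SDK pointed at OpenRouter (the model ID is a placeholder), the async iterator pattern looks like this:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

async function main() {
  const stream = await client.chat.completions.create({
    model: "openai/gpt-4o-mini",
    messages: [{ role: "user", content: "Explain Server-Sent Events in one paragraph." }],
    stream: true,                        // the only change needed to enable streaming
  });

  // With stream: true the SDK returns an async iterable of chunks.
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
}

main();
```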

Does streaming cost more than non-streaming requests?

No, streaming does not affect pricing. You are billed for the tokens the model generates regardless of delivery mode. Streaming requests consume the same number of output tokens as non-streaming requests for equivalent prompts. The only cost difference — negligible in practice — is the slight increase in network transfer bytes from the SSE event wrapper around each token delta.

How do I display streaming responses in a web application?

Web applications typically consume streaming responses through the Fetch API with a ReadableStream or through EventSource for SSE. The consumer loop reads each event, extracts the delta content, and appends it to the DOM element displaying the response. Progressive rendering with requestAnimationFrame or a debounced update interval prevents excessive DOM manipulation from impacting browser performance during high-throughput token delivery.
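
One way to implement the buffered rendering described above is to accumulate deltas in a string and flush them to the DOM at most once per animation frame. This sketch reuses the streamCompletion helper from the earlier example:

```typescript
// Coalesce high-frequency token deltas into a single DOM write per animation frame.
function renderStreamingResponse(prompt: string, target: HTMLElement): void {
  let pending = "";
  let frameScheduled = false;

  const flush = () => {
    target.textContent += pending;   // one DOM mutation per frame
    pending = "";
    frameScheduled = false;
  };

  void streamCompletion(prompt, (delta) => {
    pending += delta;
    if (!frameScheduled) {
      frameScheduled = true;
      requestAnimationFrame(flush);  // defer the write to the next paint
    }
  });
}
```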

What happens if the user navigates away during a streaming response?

When a client disconnects from a streaming response, the server-side generation continues until the configured stop conditions are met or the server detects the disconnection. Token generation costs accrue for content that the disconnected client never receives. For applications where users frequently navigate away mid-response, consider setting a conservative max_tokens limit or implementing client-side abort signals that propagate through the API to terminate generation when the user leaves.
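
A sketch of the client-side abort approach using AbortController with the Fetch API (the API key and model ID are placeholders; the read loop is the same as in the earlier example):

```typescript
const controller = new AbortController();

// Abort the in-flight stream when the user navigates away; as noted above, generation
// may continue briefly until the server detects the disconnect.
window.addEventListener("beforeunload", () => controller.abort());

async function startStream(apiKey: string, prompt: string): Promise<Response> {
  return fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      stream: true,
      max_tokens: 1024,          // conservative cap bounds cost if the client disconnects
    }),
    signal: controller.signal,   // abort() tears down the streaming connection
  });
}
```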

Can I stream structured JSON output in real time?

Structured JSON streaming presents a challenge because valid JSON cannot be incrementally rendered — a partial JSON object is syntactically incomplete until the closing bracket arrives. OpenRouter supports a partial JSON streaming mode that delivers structurally complete fragments as they become available, allowing clients to render portions of a JSON response before the full object is generated. This mode is designed for applications that display structured data progressively, such as dashboards or data extraction interfaces.
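
When progressive structured rendering is not needed, a simpler client-side strategy (not specific to OpenRouter's partial mode) is to buffer the streamed text and parse it only once it becomes syntactically complete, as in this sketch:

```typescript
// Accumulate streamed JSON text and render it the first time it parses successfully.
let jsonBuffer = "";

function onJsonDelta(delta: string, render: (value: unknown) => void): void {
  jsonBuffer += delta;
  try {
    render(JSON.parse(jsonBuffer));   // succeeds only once the object is complete
  } catch {
    // Still an incomplete JSON fragment; keep buffering.
  }
}
```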

Start Streaming in Minutes

Set stream: true in your API request and deliver tokens to your users in real time. No infrastructure changes required.

Get API Access