ICLR 2026 · Project Page

StreamingThinker: Large Language Models Can Think While Reading

A streaming reasoning framework that aligns reading and thinking, enabling large language models to start reasoning before the full input has arrived.

Junlong Tong1,2,3 Yingqi Fan2 Anhao Zhao2,4 Yunpu Ma5 Xiaoyu Shen2,3*
1Shanghai Jiao Tong University 2Eastern Institute of Technology, Ningbo 3Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin 4Hong Kong Polytechnic University 5Munich Center for Machine Learning, LMU *Corresponding author

Teaser

StreamingThinker departs from batch-style reasoning by letting the model read evidence incrementally, carry out lightweight local reasoning over each arriving segment, and deepen toward the final answer once enough context has arrived.

  • Order-preserving reasoning
  • Separate source and reasoning states
  • Latency-aware reasoning depth
Batch thinking versus streaming thinking
Demo: batch thinking in the original LLM waits for the full input, while streaming thinking unfolds alongside the arriving context.

Abstract

Large language models have become strong chain-of-thought reasoners, but standard inference still assumes a batch setting: the model waits until the entire input is available before starting to think. StreamingThinker introduces a streaming thinking paradigm in which reasoning follows the order of incoming evidence and can deepen once the full input has been observed. The framework combines streaming CoT generation, streaming-constrained training, and streaming parallel inference so that reading and reasoning can proceed concurrently while preserving order alignment. On the Qwen3 model family, the paper reports reasoning quality comparable to batch thinking together with substantial latency gains: roughly 80% less waiting before reasoning begins and more than 60% lower time-level latency for the final answer across math, logical reasoning, and context-based QA tasks.

Highlights

Streaming thinking paradigm

Reasoning is synchronized with arriving text instead of being deferred until the full context is complete.

Three aligned mechanisms

Streaming CoT supervision, streaming-constrained training, and parallel inference work together as a single stack.

Comparable quality

The framework is designed to keep reasoning performance close to batch thinking while changing when reasoning happens.

Latency reduction

The paper reports around 80% less waiting before reasoning onset and more than 60% lower final-answer latency.

Method

01

Streaming CoT generation

StreamingThinker constructs supervision that follows sentence-level boundaries. The model learns to summarize key information, explain ambiguities, extend implications, and skip unnecessary local reasoning when the current stream segment is not useful.

Streaming CoT generation diagram
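The sentence-level supervision above can be sketched as a small pipeline that segments the stream and attaches a local reasoning action to each segment. This is a minimal illustrative sketch, not the paper's actual data-construction procedure: the action names follow the paper, but the selection heuristic here (a real pipeline would query an LLM) and the helper names `split_sentences`, `local_action`, and `build_streaming_cot` are assumptions.

```python
import re

# Local reasoning actions named in the paper; the policy below that
# chooses among them is purely illustrative.
ACTIONS = ("summarize", "explain", "extend", "skip")

def split_sentences(text: str) -> list[str]:
    """Split the incoming stream into sentence-level segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def local_action(segment: str) -> str:
    """Toy policy for picking a local reasoning step; a real pipeline
    would decide this with an LLM rather than surface heuristics."""
    if "?" in segment:
        return "explain"       # ambiguity or question -> explain it
    if len(segment.split()) < 4:
        return "skip"          # too little content to reason about
    return "summarize"         # default: summarize key information

def build_streaming_cot(stream: str) -> list[tuple[str, str]]:
    """Pair each arriving segment with a local reasoning action,
    preserving the order of the input stream."""
    return [(seg, local_action(seg)) for seg in split_sentences(stream)]
```

Because each pair is produced as its segment arrives, the resulting CoT supervision is order-aligned with the input by construction.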

02

Streaming-constrained training

Training enforces order-preserving reasoning with streaming attention masks and decoupled position encoding. This keeps each reasoning step tied to the evidence already seen and avoids positional contention between source tokens and reasoning tokens.

Streaming-constrained training diagram
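The order-preserving constraint can be made concrete with a toy attention mask: each reasoning token attends causally to earlier reasoning tokens and only to source tokens that have already arrived when it is generated, while the two streams keep independent position indices. This is a minimal sketch under assumed interfaces; `read_upto` (how many source tokens each reasoning token has seen) is an illustrative input, not the paper's exact formulation.

```python
def streaming_attention_mask(n_src: int, read_upto: list[int]) -> list[list[bool]]:
    """Boolean mask of shape (n_tgt, n_src + n_tgt); True = may attend.
    Columns [0, n_src) are source tokens, the rest are reasoning tokens."""
    n_tgt = len(read_upto)
    mask = [[False] * (n_src + n_tgt) for _ in range(n_tgt)]
    for t, k in enumerate(read_upto):
        for j in range(k):                     # only source tokens that
            mask[t][j] = True                  # have already arrived
        for j in range(n_src, n_src + t + 1):  # causal attention over
            mask[t][j] = True                  # earlier reasoning tokens
    return mask

def decoupled_positions(n_src: int, n_tgt: int) -> tuple[list[int], list[int]]:
    """Source and reasoning streams get independent position indices,
    avoiding positional contention between the two token types."""
    return list(range(n_src)), list(range(n_tgt))
```

With the mask applied during training, no reasoning step can condition on evidence the stream has not yet delivered, which is exactly the order alignment the framework requires at inference time.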

03

Streaming parallel inference

At inference time, separate KV caches are maintained for source-side reading and target-side reasoning. This decoupling enables true concurrency: the model can continue ingesting new input while generating reasoning tokens instead of alternating between the two in a fully serial loop.

Streaming parallel inference diagram
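The dual-cache loop can be sketched as follows. For clarity this toy version interleaves reading and decoding serially (a real implementation would run them concurrently, e.g. with threads or async I/O), and `decode_step` is a stand-in for the model's forward pass; the class and method names are illustrative assumptions, not the paper's API.

```python
class StreamingInference:
    """Toy engine holding separate caches for the two token streams."""

    def __init__(self) -> None:
        self.src_cache: list[str] = []  # KV entries for source tokens read so far
        self.tgt_cache: list[str] = []  # KV entries for reasoning tokens emitted

    def read(self, chunk: list[str]) -> None:
        """Ingest newly arrived source tokens into the source-side cache."""
        self.src_cache.extend(chunk)

    def decode_step(self) -> str:
        """Emit one reasoning token conditioned on both caches (here: a
        toy token recording how much context was visible at this step)."""
        token = f"think[{len(self.src_cache)}]"
        self.tgt_cache.append(token)
        return token

def run_stream(chunks: list[list[str]]) -> list[str]:
    """Interleave reading and reasoning, one decode step per arriving
    chunk, instead of waiting for the whole input before thinking."""
    engine = StreamingInference()
    return [engine.decode_step() for chunk in chunks if engine.read(chunk) is None]
```

Because the two caches never share positions or entries, growing one does not invalidate the other, which is what lets reading and reasoning proceed in parallel without re-encoding.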

Results

~80%

less token waiting before reasoning begins

>60%

lower time-level latency for the final answer

Qwen3

validated across math, logic, and context-based QA

Comparable

reasoning performance relative to batch thinking

Below are direct screenshots from the paper for the three most relevant streaming result tables: real streaming evaluation, context-first streaming evaluation, and efficiency breakdown.

Table · Real streaming setting

The main streaming comparison: batch thinking, naive interleaving, and parallel StreamingThinker.

Table 2 from the paper showing real streaming results

Table · Context-first streaming setting

The context-first order is harder because the question arrives later. The paper reports that the model still preserves strong early-response behavior in this setting.

Table 3 from the paper showing context-first streaming results

Table · Streaming efficiency breakdown

The efficiency breakdown shows where the latency reduction comes from in the parallel streaming pipeline.

Table 4 from the paper showing efficiency analysis

BibTeX

@misc{https://doi.org/10.48550/arxiv.2510.17238,
  doi       = {10.48550/ARXIV.2510.17238},
  url       = {https://arxiv.org/abs/2510.17238},
  author    = {Tong, Junlong and Fan, Yingqi and Zhao, Anhao and Ma, Yunpu and Shen, Xiaoyu},
  title     = {StreamingThinker: Large Language Models Can Think While Reading},
  publisher = {arXiv},
  year      = {2025}
}
        

Contact

For questions about the project, please contact jl-tong@sjtu.edu.cn or xyshen@eitech.edu.cn.