ICLR 2026 · Project Page

StreamingThinker: Large Language Models Can Think While Reading

A streaming reasoning framework that aligns reading and thinking, enabling large language models to start reasoning before the full input has arrived.

Junlong Tong1,2,3 Yingqi Fan2 Anhao Zhao2,4 Yunpu Ma5 Xiaoyu Shen2,3*
1Shanghai Jiao Tong University 2Eastern Institute of Technology, Ningbo 3Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin 4Hong Kong Polytechnic University 5Munich Center for Machine Learning, LMU *Corresponding author

Teaser

StreamingThinker departs from batch-style reasoning by letting the model read evidence incrementally, carry out lightweight local reasoning over each arriving segment, and deepen toward the final answer once enough context has arrived.

  • Order-preserving reasoning
  • Separate source and reasoning states
  • Latency-aware reasoning depth
Batch thinking versus streaming thinking
Demo: batch thinking in the original LLM waits for the full input, while streaming thinking unfolds alongside the arriving context.

Abstract

Large language models have become strong chain-of-thought reasoners, but standard inference still assumes a batch setting: the model waits until the entire input is available before starting to think. StreamingThinker introduces a streaming thinking paradigm in which reasoning follows the order of incoming evidence and can deepen once the full input has been observed. The framework combines streaming CoT generation, streaming-constrained training, and streaming parallel inference so that reading and reasoning can proceed concurrently while preserving order alignment. On the Qwen3 model family, the paper reports reasoning quality comparable to batch thinking together with substantial latency gains: roughly 80% less waiting before reasoning begins and more than 60% lower time-level latency for the final answer across math, logical reasoning, and context-based QA tasks.

Highlights

Streaming thinking paradigm

Reasoning is synchronized with arriving text instead of being deferred until the full context is complete.

Three aligned mechanisms

Streaming CoT supervision, streaming-constrained training, and parallel inference work together as a single stack.

Comparable quality

The framework is designed to keep reasoning performance close to batch thinking while changing when reasoning happens.

Latency reduction

The paper reports around 80% less waiting before reasoning onset and more than 60% lower final-answer latency.

Method

01

Streaming CoT generation

StreamingThinker constructs supervision that follows sentence-level boundaries. The model learns to summarize key information, explain ambiguities, extend implications, and skip unnecessary local reasoning when the current stream segment is not useful.

Streaming CoT generation diagram
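The sentence-level supervision above can be sketched as a small pipeline that segments the stream and attaches a local reasoning action to each segment. This is a minimal illustrative sketch, not the paper's actual data-construction procedure: the action names follow the paper, but the selection heuristic here (a real pipeline would query an LLM) and the helper names `split_sentences`, `local_action`, and `build_streaming_cot` are assumptions.

```python
import re

# Local reasoning actions named in the paper; the policy below that
# chooses among them is purely illustrative.
ACTIONS = ("summarize", "explain", "extend", "skip")

def split_sentences(text: str) -> list[str]:
    """Split the incoming stream into sentence-level segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def local_action(segment: str) -> str:
    """Toy policy for picking a local reasoning step; a real pipeline
    would decide this with an LLM rather than surface heuristics."""
    if "?" in segment:
        return "explain"       # ambiguity or question -> explain it
    if len(segment.split()) < 4:
        return "skip"          # too little content to reason about
    return "summarize"         # default: summarize key information

def build_streaming_cot(stream: str) -> list[tuple[str, str]]:
    """Pair each arriving segment with a local reasoning action,
    preserving the order of the input stream."""
    return [(seg, local_action(seg)) for seg in split_sentences(stream)]
```

Because each pair is produced as its segment arrives, the resulting CoT supervision is order-aligned with the input by construction.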

02

Streaming-constrained training

Training enforces order-preserving reasoning with streaming attention masks and decoupled position encoding. This keeps each reasoning step tied to the evidence already seen and avoids positional contention between source tokens and reasoning tokens.

Streaming-constrained training diagram
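The order-preserving constraint can be made concrete with a toy attention mask: each reasoning token attends causally to earlier reasoning tokens and only to source tokens that have already arrived when it is generated, while the two streams keep independent position indices. This is a minimal sketch under assumed interfaces; `read_upto` (how many source tokens each reasoning token has seen) is an illustrative input, not the paper's exact formulation.

```python
def streaming_attention_mask(n_src: int, read_upto: list[int]) -> list[list[bool]]:
    """Boolean mask of shape (n_tgt, n_src + n_tgt); True = may attend.
    Columns [0, n_src) are source tokens, the rest are reasoning tokens."""
    n_tgt = len(read_upto)
    mask = [[False] * (n_src + n_tgt) for _ in range(n_tgt)]
    for t, k in enumerate(read_upto):
        for j in range(k):                     # only source tokens that
            mask[t][j] = True                  # have already arrived
        for j in range(n_src, n_src + t + 1):  # causal attention over
            mask[t][j] = True                  # earlier reasoning tokens
    return mask

def decoupled_positions(n_src: int, n_tgt: int) -> tuple[list[int], list[int]]:
    """Source and reasoning streams get independent position indices,
    avoiding positional contention between the two token types."""
    return list(range(n_src)), list(range(n_tgt))
```

With the mask applied during training, no reasoning step can condition on evidence the stream has not yet delivered, which is exactly the order alignment the framework requires at inference time.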

03

Streaming parallel inference

At inference time, separate KV caches are maintained for source-side reading and target-side reasoning. This decoupling enables true concurrency: the model can continue ingesting new input while generating reasoning tokens instead of alternating between the two in a fully serial loop.

Streaming parallel inference diagram
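The dual-cache loop can be sketched as follows. For clarity this toy version interleaves reading and decoding serially (a real implementation would run them concurrently, e.g. with threads or async I/O), and `decode_step` is a stand-in for the model's forward pass; the class and method names are illustrative assumptions, not the paper's API.

```python
class StreamingInference:
    """Toy engine holding separate caches for the two token streams."""

    def __init__(self) -> None:
        self.src_cache: list[str] = []  # KV entries for source tokens read so far
        self.tgt_cache: list[str] = []  # KV entries for reasoning tokens emitted

    def read(self, chunk: list[str]) -> None:
        """Ingest newly arrived source tokens into the source-side cache."""
        self.src_cache.extend(chunk)

    def decode_step(self) -> str:
        """Emit one reasoning token conditioned on both caches (here: a
        toy token recording how much context was visible at this step)."""
        token = f"think[{len(self.src_cache)}]"
        self.tgt_cache.append(token)
        return token

def run_stream(chunks: list[list[str]]) -> list[str]:
    """Interleave reading and reasoning, one decode step per arriving
    chunk, instead of waiting for the whole input before thinking."""
    engine = StreamingInference()
    return [engine.decode_step() for chunk in chunks if engine.read(chunk) is None]
```

Because the two caches never share positions or entries, growing one does not invalidate the other, which is what lets reading and reasoning proceed in parallel without re-encoding.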

Results

~80%

less token waiting before reasoning begins

>60%

lower time-level latency for the final answer

Qwen3

validated across math, logic, and context-based QA

Comparable

reasoning performance relative to batch thinking

Below are direct screenshots from the paper for the three most relevant streaming result tables: real streaming evaluation, context-first streaming evaluation, and efficiency breakdown.

Table · Real streaming setting

The main streaming comparison: batch thinking, naive interleaving, and parallel StreamingThinker.

Table 2 from the paper showing real streaming results

Table · Context-first streaming setting

The context-first order is harder because the question arrives later. The paper reports that the model still preserves strong early-response behavior in this setting.

Table 3 from the paper showing context-first streaming results

Table · Streaming efficiency breakdown

The efficiency breakdown shows where the latency reduction comes from in the parallel streaming pipeline.

Table 4 from the paper showing efficiency analysis

BibTeX

@misc{https://doi.org/10.48550/arxiv.2510.17238,
  doi       = {10.48550/ARXIV.2510.17238},
  url       = {https://arxiv.org/abs/2510.17238},
  author    = {Tong, Junlong and Fan, Yingqi and Zhao, Anhao and Ma, Yunpu and Shen, Xiaoyu},
  title     = {StreamingThinker: Large Language Models Can Think While Reading},
  publisher = {arXiv},
  year      = {2025}
}
        

Contact

For questions about the project, please contact jl-tong@sjtu.edu.cn or xyshen@eitech.edu.cn.