
S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Ligong Han

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.

Submitted: March 27, 2026
Subjects: NLP; Computational Linguistics

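As a rough illustration of the decoding loop described in the abstract (not the authors' implementation), the following Python sketch shows one self-speculative block step: the diffusion mode drafts a block of tokens in parallel, a lightweight routing check decides whether verification is worth its cost, and the same model run with block size one verifies the draft autoregressively. The helper names (diffusion_draft_block, ar_next_token, should_verify) and the greedy acceptance rule are assumptions made for illustration.

```python
# Minimal sketch of an S2D2-style self-speculative block step. The callables
# `diffusion_draft_block`, `ar_next_token`, and `should_verify` are
# hypothetical stand-ins for one pretrained block-diffusion model used in two
# modes (parallel denoising vs. block size one); the greedy acceptance rule
# below is an illustrative choice, not necessarily the paper's exact criterion.

from typing import Callable, List


def speculative_block_step(
    prefix: List[int],
    block_size: int,
    diffusion_draft_block: Callable[[List[int], int], List[int]],
    ar_next_token: Callable[[List[int]], int],
    should_verify: Callable[[List[int], List[int]], bool],
) -> List[int]:
    """Decode one block: draft tokens in parallel, optionally verify them."""
    # 1) Drafter: a few parallel denoising steps propose the whole block.
    draft = diffusion_draft_block(prefix, block_size)

    # 2) Routing policy: skip verification when the draft looks confident
    #    enough that re-checking it would not pay for its extra compute.
    if not should_verify(prefix, draft):
        return prefix + draft

    # 3) Verifier: the same model with block size one re-decodes the block
    #    autoregressively; keep the longest agreeing prefix and replace the
    #    first disagreeing token with the verifier's own choice.
    accepted: List[int] = []
    for token in draft:
        verified = ar_next_token(prefix + accepted)
        accepted.append(verified)
        if verified != token:
            break
    return prefix + accepted


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end (illustration only).
    draft = lambda prefix, n: [7] * n
    ar = lambda prefix: 7 if len(prefix) < 5 else 3
    route = lambda prefix, block: True
    print(speculative_block_step([1, 2], 4, draft, ar, route))
```

Under this toy acceptance rule the verifier keeps the longest drafted prefix it agrees with, so a confident draft costs only parallel denoising plus a short verification pass, while a rejected token falls back to the autoregressive choice.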


Source: arXiv:2603.25702v1 - http://arxiv.org/abs/2603.25702v1
PDF: https://arxiv.org/pdf/2603.25702v1
