
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Lifeng Chen

Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.

Submitted: April 14, 2026 · Subjects: Engineering; Biomedical Engineering
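To make the decoding scheme concrete, here is a minimal toy sketch of one-step-per-block generation. It is not the authors' implementation: the `denoiser` stand-in, the mask id, and the block/length values are all hypothetical. The point it illustrates is that blocks are produced left to right (preserving inter-block conditioning), while every masked token inside a block is filled in a single forward pass instead of many denoising iterations.

```python
import numpy as np

MASK, BLOCK, LEN = -1, 8, 32  # hypothetical mask id, block size, report length


def denoiser(tokens, image_feat):
    """Stand-in for a distilled one-step denoiser: returns a token
    prediction for every position in ONE forward pass. Here it just
    draws dummy token ids for illustration."""
    rng = np.random.default_rng(0)
    return rng.integers(0, 100, size=tokens.shape)


def one_step_block_decode(image_feat, length=LEN, block=BLOCK):
    """Semi-autoregressive decoding: blocks are generated left to right,
    but all tokens within a block are unmasked in a single step."""
    tokens = np.full(length, MASK)
    for start in range(0, length, block):
        pred = denoiser(tokens, image_feat)  # one forward pass per block
        tokens[start:start + block] = pred[start:start + block]
    return tokens


report = one_step_block_decode(image_feat=None)
assert (report != MASK).all()  # filled in length/block passes, not length passes
```

Under this scheme the number of forward passes scales with the number of blocks rather than the number of tokens, which is the source of the latency reduction the abstract reports; DCD's role (not modeled here) is to keep the jointly sampled tokens within a block coherent despite the single step.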


Source: arXiv:2604.09450v1 (http://arxiv.org/abs/2604.09450v1)
PDF: https://arxiv.org/pdf/2604.09450v1

