SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
Abstract
Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment further aggravates this through a communication bottleneck: each worker must transmit a full token probability distribution for every draft token, which dominates end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
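To make the pipeline the abstract describes concrete, below is a minimal sketch of top-K compressed transmission with two plausible server-side reconstruction strategies, weighted-average aggregation of the reconstructed worker distributions, and the standard speculative-decoding acceptance test. All function names, the specific reconstruction variants (uniform-tail vs. renormalizing), and the numerical details are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def topk_compress(probs, k):
    """Worker side: keep only the k largest probabilities; the rest
    of the mass is dropped and must be reconstructed by the server."""
    idx = np.argsort(probs)[-k:]          # indices of the top-k tokens
    return idx, probs[idx]

def reconstruct_uniform_tail(idx, vals, vocab_size):
    """Reconstruction strategy A (assumed): spread the missing mass
    uniformly over the unreported tokens. Assumes k < vocab_size."""
    p = np.zeros(vocab_size)
    p[idx] = vals
    mask = np.ones(vocab_size, dtype=bool)
    mask[idx] = False
    p[mask] = max(1.0 - vals.sum(), 0.0) / mask.sum()
    return p

def reconstruct_renormalize(idx, vals, vocab_size):
    """Reconstruction strategy B (assumed): renormalize the reported
    mass to 1, treating unreported tokens as zero-probability."""
    p = np.zeros(vocab_size)
    p[idx] = vals / vals.sum()
    return p

def aggregate(worker_dists, weights):
    """Server side: federated weighted average of the per-worker
    reconstructed token distributions."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(worker_dists), axes=1)

def accept_draft_token(token, q_draft, p_target, rng):
    """Standard speculative-decoding acceptance test: accept the draft
    token with probability min(1, p_target[token] / q_draft[token])."""
    return rng.random() < min(1.0, p_target[token] / max(q_draft[token], 1e-12))

# Toy usage with random distributions standing in for model outputs.
rng = np.random.default_rng(0)
V, K = 32, 4
workers = [rng.dirichlet(np.ones(V)) for _ in range(3)]
recon = [reconstruct_uniform_tail(*topk_compress(p, K), V) for p in workers]
target = aggregate(recon, weights=[0.5, 0.3, 0.2])
draft = rng.dirichlet(np.ones(V))         # draft model's distribution
tok = rng.choice(V, p=draft)              # sampled draft token
print(accept_draft_token(tok, draft, target, rng))
```

The two variants trade off differently: the uniform-tail reconstruction preserves total probability mass at the cost of overweighting unlikely tokens, while renormalization concentrates mass on the reported tokens and so can inflate acceptance on them. Which of these corresponds to the paper's two strategies, and how its reconstruction-error and acceptance-rate bounds apply, is not specified here.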
Source: arXiv:2604.25777v1 (http://arxiv.org/abs/2604.25777v1) | PDF: https://arxiv.org/pdf/2604.25777v1