Research Paper | Researchia:202602.25040 | Artificial Intelligence > AI

StyleStream: Real-Time Zero-Shot Voice Style Conversion

Yisi Liu

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
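The two-stage, non-autoregressive design described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: `destylize`, `stylize`, `stream_convert`, the scalar "frames", and the 4-entry codebook are hypothetical stand-ins. The real Destylizer is trained with text supervision and the real Stylizer is a diffusion transformer. The sketch shows only the data flow: content is squeezed through a narrow discrete bottleneck, style is reintroduced from a reference, and chunks are processed independently so output can be emitted as audio arrives.

```python
from typing import Iterator, List, Sequence

Frame = List[float]


def destylize(frames: Sequence[Frame], codebook_size: int = 4) -> List[int]:
    """Hypothetical Destylizer stand-in: one small integer token per frame.

    The tiny codebook plays the role of a constrained information
    bottleneck; it is NOT the mechanism used in the paper.
    """
    tokens = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        clamped = min(max(mean, 0.0), 0.999999)  # keep the bucket index in range
        tokens.append(int(clamped * codebook_size))
    return tokens


def stylize(tokens: Sequence[int], style_embedding: Sequence[float]) -> List[Frame]:
    """Hypothetical Stylizer stand-in: combine content tokens with a style vector."""
    return [[token + s for s in style_embedding] for token in tokens]


def stream_convert(
    frames: Sequence[Frame], style_embedding: Sequence[float], chunk_size: int
) -> Iterator[List[Frame]]:
    """Convert fixed-size chunks independently (no dependence on past outputs)."""
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        yield stylize(destylize(chunk), style_embedding)


# Toy input: three 2-dimensional "frames" and a 1-dimensional style vector.
source = [[0.1, 0.1], [0.6, 0.6], [0.9, 0.9]]
style = [0.5]

streamed = [f for chunk in stream_convert(source, style, chunk_size=2) for f in chunk]
offline = stylize(destylize(source), style)
```

In this toy, `streamed` and `offline` are identical because each frame is handled independently. A real system would need look-ahead or overlapping context at chunk boundaries, which is one plausible source of the end-to-end latency the abstract reports.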


Source: arXiv:2602.20113v1 (http://arxiv.org/abs/2602.20113v1)
PDF: https://arxiv.org/pdf/2602.20113v1

Submission: 2/25/2026
Subjects: AI; Artificial Intelligence
