Research Paper
Researchia:202606.19061

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

Nityanand Mathur

Submitted: June 19, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations, formed by 120 style captions that each condition the generation of 30 text transcripts, to reveal how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.
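
To make the quantities named in the abstract concrete, the following is a minimal sketch of how per-token heatmaps, temporal variance, and per-layer attention entropy could be computed from a stack of cross-attention maps. It is not the paper's implementation: the tensor layout `(steps, layers, heads, frames, tokens)`, the function names, the head count, and the random stand-in for real attention weights are all assumptions made for illustration. DAAM-style aggregation is approximated here by a plain mean over steps, layers, and heads.

```python
import numpy as np

def token_heatmaps(attn):
    """Aggregate cross-attention into one heatmap per caption token.

    attn: array of shape (steps, layers, heads, frames, tokens), assumed
    to be softmax-normalized over the token axis.
    Returns an array of shape (tokens, frames).
    """
    return attn.mean(axis=(0, 1, 2)).T

def temporal_variance(heatmaps):
    """Variance over frames of each token's heatmap.

    Each heatmap is first normalized to sum to 1 over frames, so a token
    that attends uniformly in time (global conditioning) scores 0.
    """
    norm = heatmaps / heatmaps.sum(axis=1, keepdims=True)
    return norm.var(axis=1)

def layer_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of the token distribution, per layer."""
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # (steps, layers, heads, frames)
    return ent.mean(axis=(0, 2, 3))

# Synthetic stand-in for hooked attention weights (hypothetical sizes;
# only the 24 steps and 25 layers come from the abstract).
rng = np.random.default_rng(0)
steps, layers, heads, frames, tokens = 24, 25, 2, 50, 6
logits = rng.normal(size=(steps, layers, heads, frames, tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

heat = token_heatmaps(attn)      # one row per caption token
tvar = temporal_variance(heat)   # lower value = more uniform in time
ent = layer_entropy(attn)        # lower value = more selective layer
```

With real attention maps, comparing `tvar` between style tokens and content/function tokens would correspond to finding (1), and `ent.argmin()` would give the most selective layer as in finding (4). The correlation with F0 and energy in finding (2) needs acoustic features extracted from the generated audio and is left out of this sketch.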


Source: arXiv:2606.20532v1 - http://arxiv.org/abs/2606.20532v1
PDF: https://arxiv.org/pdf/2606.20532v1
