
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul

Abstract

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, which limits general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning 3×10^18 to 3×10^20 FLOPs, finding that the optimal data budget grows 1.6× faster than the optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters trained on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation using the same unified architecture.
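The "interleaved semantic, acoustic, and text tokens" in the title can be pictured as a single next-token-prediction stream drawn from disjoint id ranges. The sketch below illustrates this idea only; the vocabulary sizes, offsets, and per-frame (semantic, acoustic) pairing are illustrative assumptions, not the paper's actual tokenizer configuration.

```python
# Hypothetical interleaving of text, semantic, and acoustic tokens into
# one flat id stream for next-token prediction. All sizes are assumed.

TEXT_VOCAB = 32_000        # assumed text vocabulary size
SEMANTIC_VOCAB = 1_024     # assumed semantic codebook size
ACOUSTIC_VOCAB = 1_024     # assumed acoustic codebook size

# Give each token type a disjoint id range via offsets.
SEM_OFFSET = TEXT_VOCAB
ACO_OFFSET = TEXT_VOCAB + SEMANTIC_VOCAB

def interleave(text_ids, semantic_ids, acoustic_ids):
    """Build one flat stream: text tokens first, then per-frame
    (semantic, acoustic) pairs, all in a shared id space."""
    stream = list(text_ids)
    for s, a in zip(semantic_ids, acoustic_ids):
        stream.append(SEM_OFFSET + s)
        stream.append(ACO_OFFSET + a)
    return stream

seq = interleave([5, 17], [3, 9], [100, 7])
print(seq)  # [5, 17, 32003, 33124, 32009, 33031]
```

A shared id space like this lets one decoder-only model emit any mixture of modalities, which is what makes a single backbone usable for both audio generation and cross-modal tasks.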

Submitted: February 19, 2026 | Subjects: NLP; Computational Linguistics

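The headline scaling finding (optimal data grows 1.6× faster than optimal model size) can be sketched numerically. Under the common C ≈ 6·N·D compute approximation, if N* ∝ C^a and D* ∝ C^b then a + b = 1, and b = 1.6a fixes both exponents. The code below derives those exponents from the stated ratio alone; they are an illustration of the relationship, not the paper's fitted values.

```python
# Illustrative compute-optimal allocation implied by the abstract's
# finding that optimal data D* scales ~1.6x faster with compute C than
# optimal model size N*. Exponents follow from the C = 6*N*D constraint;
# absolute magnitudes here are a toy anchoring, not fitted results.

def optimal_allocation(flops, ratio=1.6):
    a = 1.0 / (1.0 + ratio)      # model-size exponent: N* ~ C^a
    b = ratio / (1.0 + ratio)    # data exponent:       D* ~ C^b
    # Anchor the power laws so that C = 6*N*D holds exactly.
    n = (flops / 6.0) ** a
    d = (flops / 6.0) ** b
    return n, d

for c in (3e18, 3e19, 3e20):  # the IsoFLOP range studied in the paper
    n, d = optimal_allocation(c)
    print(f"C={c:.0e}: N*~{n:.2e} params, D*~{d:.2e} tokens")
```

The practical consequence of b > a is that, as the compute budget grows, a larger share of it should go to training tokens rather than parameters, i.e. audio models at this token granularity are comparatively data-hungry.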


Source: arXiv:2602.16687v1 - http://arxiv.org/abs/2602.16687v1
PDF: https://arxiv.org/pdf/2602.16687v1
Original Link: http://arxiv.org/abs/2602.16687v1

