Research Paper (Researchia:202604.15007)

Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Lena S. Oberkircher

Submitted: April 15, 2026
Subjects: NLP; Computational Linguistics

Description / Details

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text from digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we analyzed both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations. It serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.


Source: arXiv:2604.11803v1 (http://arxiv.org/abs/2604.11803v1)
PDF: https://arxiv.org/pdf/2604.11803v1
