Beyond task performance: Decoding bioacoustic embeddings with speech features
Abstract
Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode or which of those features are useful for a given task. This gap hinders transparency and limits extension to rare species and data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups, we apply linear and nonlinear regression probes to quantify which acoustic properties each model captures. Results confirm a "no free lunch" pattern: no single model captures the full feature space. A concatenated embedding achieves the highest performance, suggesting that the models cover complementary regions of acoustic space. Loudness features are best encoded, while fundamental frequency (F0) is hardest to recover. By cross-referencing recoverability with per-species feature salience, measured by normalized mutual information (NMI), we derive data-driven model selection guidance for bioacoustics.
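The probing idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes a closed-form ridge regression as the linear probe, uses synthetic data in place of real embeddings and eGeMAPS targets, and every name in it (`ridge_probe_r2`, `emb_a`, `emb_b`, `scores`) is hypothetical. It shows only why a concatenated embedding can score higher than either model alone when each model encodes a different part of the target feature.

```python
import numpy as np


def ridge_probe_r2(X_train, y_train, X_test, y_test, alpha=1.0):
    """Fit a closed-form ridge regression probe and return held-out R^2."""
    # Standardise embedding dimensions using training statistics only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    Xtr = (X_train - mu) / sigma
    Xte = (X_test - mu) / sigma

    # Centre the target so no explicit intercept column is needed.
    y_mean = y_train.mean()

    # Solve (X^T X + alpha * I) w = X^T (y - mean(y)).
    d = Xtr.shape[1]
    w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(d),
                        Xtr.T @ (y_train - y_mean))

    pred = Xte @ w + y_mean
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins (assumption): the target acoustic feature depends on
# two latent factors, and each "model" embeds only one of them.
rng = np.random.default_rng(0)
n, dim = 400, 16
a = rng.normal(size=n)
b = rng.normal(size=n)
target = a + b

emb_a = np.outer(a, rng.normal(size=dim)) + 0.1 * rng.normal(size=(n, dim))
emb_b = np.outer(b, rng.normal(size=dim)) + 0.1 * rng.normal(size=(n, dim))
emb_cat = np.hstack([emb_a, emb_b])  # concatenated embedding

# Simple train/test split: first 300 clips train, last 100 test.
split = 300
scores = {}
for name, emb in [("model_a", emb_a), ("model_b", emb_b), ("concat", emb_cat)]:
    scores[name] = ridge_probe_r2(emb[:split], target[:split],
                                  emb[split:], target[split:])

for name, r2 in scores.items():
    print(f"{name}: held-out R^2 = {r2:.3f}")
```

In a real study the synthetic arrays would be replaced by per-clip model embeddings and one of the 88 eGeMAPS features as the target, and a nonlinear probe would be run alongside the linear one for comparison.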
Source: arXiv:2606.14662v1 - http://arxiv.org/abs/2606.14662v1 PDF: https://arxiv.org/pdf/2606.14662v1
Jun 15, 2026
Data Science
Machine Learning