Research Paper | Researchia:202601.24004 [Image Processing > Engineering]

ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video

Davide Berghi

Abstract

Sound event localization and detection with distance estimation (3D SELD) in video involves identifying active sound events at each time frame while estimating their spatial coordinates. This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions, a challenge that single models often struggle to address effectively. To tackle this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a unique pair of dimensions, contributing distinct insights to the final prediction, akin to a collaborative team with diverse expertise. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, consistently outperforming existing methods across key metrics. Future work will extend this proof of concept by strengthening the specialists with appropriate tasks, training, and pre-training curricula.
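
As a rough illustration of how an ensemble of specialists might combine its outputs, the sketch below fuses per-frame predictions from three hypothetical sub-networks. The abstract does not state the fusion rule, so the averaging of activity probabilities, the activity-weighted mean of spatial estimates, the function name `fuse_specialists`, the array layout, and the 0.5 threshold are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def fuse_specialists(activities, positions, threshold=0.5):
    """Fuse per-frame outputs of several specialist models (hypothetical scheme).

    activities: list of arrays, each (frames, classes), probabilities in [0, 1]
    positions:  list of arrays, each (frames, classes, dims), spatial estimates
    Returns (active, fused_pos): a boolean (frames, classes) mask and a
    (frames, classes, dims) array of activity-weighted spatial estimates.
    """
    act = np.stack(activities)   # (models, frames, classes)
    pos = np.stack(positions)    # (models, frames, classes, dims)

    # Assumption: a class is active when the mean probability passes the threshold.
    mean_act = act.mean(axis=0)

    # Assumption: each model's spatial estimate is weighted by its own activity
    # probability; the clip avoids division by zero when all models output 0.
    weights = act / np.clip(act.sum(axis=0, keepdims=True), 1e-8, None)
    fused_pos = (weights[..., None] * pos).sum(axis=0)

    return mean_act >= threshold, fused_pos
```

A linear weighted mean is only reasonable for coordinates that do not wrap around, such as a bounded left-right angle and a distance; a full-circle azimuth would need a circular mean instead.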


Source: arXiv:2601.17611v1 (http://arxiv.org/abs/2601.17611v1)
PDF: https://arxiv.org/pdf/2601.17611v1

Submission: 1/24/2026
Subjects: Engineering; Image Processing
arXiv: This paper is hosted on arXiv, an open-access repository