Research Paper (Researchia:202601.24004)

ToS: A Team of Specialists ensemble framework for Stereo Sound Event Localization and Detection with distance estimation in Video

Davide Berghi

Submitted: January 24, 2026
Subjects: Engineering; Image Processing

Description / Details

Sound event localization and detection with distance estimation (3D SELD) in video involves identifying active sound events at each time frame while estimating their spatial coordinates. This multimodal task requires joint reasoning across semantic, spatial, and temporal dimensions, a challenge that single models often struggle to address effectively. To tackle this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a unique pair of dimensions, contributing distinct insights to the final prediction, akin to a collaborative team with diverse expertise. ToS has been benchmarked against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, consistently outperforming existing methods across key metrics. Future work will extend this proof of concept by strengthening the specialists with appropriate tasks, training, and pre-training curricula.
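The abstract does not say how the three specialists' outputs are combined into the final prediction. As an illustration only, the sketch below assumes the simplest option: each specialist emits, per frame and per sound class, an activity probability, an azimuth, and a distance, and the ensemble averages them. The names `FramePrediction` and `fuse_specialists`, the 0.5 threshold, and the activity-weighted averaging are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FramePrediction:
    """One specialist's output for a single class in a single time frame."""
    activity: float     # probability that the class is active
    azimuth_deg: float  # estimated direction of arrival, in degrees
    distance_m: float   # estimated source distance, in metres


def fuse_specialists(
    preds: List[FramePrediction], threshold: float = 0.5
) -> Optional[FramePrediction]:
    """Fuse specialist outputs by averaging (an assumed fusion rule).

    Activity is the plain mean across specialists. Azimuth and distance are
    activity-weighted means, so more confident specialists count for more.
    Returns None when the fused activity falls below the threshold.
    """
    if not preds:
        raise ValueError("need at least one specialist prediction")

    mean_activity = sum(p.activity for p in preds) / len(preds)
    total_weight = sum(p.activity for p in preds)
    if mean_activity < threshold or total_weight == 0:
        return None

    # Linear averaging of angles is only reasonable when the azimuths do not
    # wrap around (assumed here, e.g. a frontal range such as -90 to 90).
    azimuth = sum(p.activity * p.azimuth_deg for p in preds) / total_weight
    distance = sum(p.activity * p.distance_m for p in preds) / total_weight
    return FramePrediction(mean_activity, azimuth, distance)


if __name__ == "__main__":
    # Hypothetical outputs of the three specialists for one class and frame.
    specialists = [
        FramePrediction(activity=0.9, azimuth_deg=10.0, distance_m=2.0),
        FramePrediction(activity=0.6, azimuth_deg=20.0, distance_m=3.0),
        FramePrediction(activity=0.3, azimuth_deg=30.0, distance_m=4.0),
    ]
    print(fuse_specialists(specialists))
```

Weighting the location estimates by activity is one design choice among many; a learned fusion layer or per-metric voting would fit the same description equally well.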


Source: arXiv:2601.17611v1 (http://arxiv.org/abs/2601.17611v1)
PDF: https://arxiv.org/pdf/2601.17611v1
