Research Paper
Researchia:202606.23086

Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery

Seyed Hamid Reza Roodabeh

Submitted: June 23, 2026
Subjects: Robotics

Description / Details

Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for patient safety. Current video-based methods for executional error detection often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures, and they under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input: video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language for gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. Integrating kinematic data with the video and textual modalities improves error detection performance: the framework achieves F1 score improvements of up to 5% and 16.6% over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively. These results demonstrate the value of combining curated textual prompts with multimodal data for accurate error detection.
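To make the ingredients of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows one plausible way to compose an activity prompt from a gesture, an instrument-object interaction, and an error definition; a naive concatenation of per-modality embeddings; and the frame-level F1 score used as the reported metric. The names `ActivityPrompt`, `concat_fusion`, and `f1_score`, the prompt template, and the example strings are all assumptions introduced for illustration; the abstract does not specify the prompt wording or the fusion mechanism.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ActivityPrompt:
    """Hypothetical container for the three textual ingredients named in the abstract."""
    gesture: str            # gesture-level activity (illustrative wording)
    interaction: str        # instrument-object interaction (illustrative wording)
    error_definition: str   # definition of the executional error (illustrative wording)

    def render(self) -> str:
        # Assumed template: the abstract does not give the actual prompt format.
        return (
            f"Gesture: {self.gesture}. "
            f"Interaction: {self.interaction}. "
            f"Error: {self.error_definition}."
        )


def concat_fusion(video: Sequence[float],
                  kinematics: Sequence[float],
                  text: Sequence[float]) -> list[float]:
    """Naive late fusion by concatenation (an assumption, not the paper's method)."""
    return [*video, *kinematics, *text]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for error (1) vs. normal (0) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing is detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


prompt = ActivityPrompt(
    gesture="pushing needle through tissue",
    interaction="needle driver holds needle",
    error_definition="multiple attempts are needed to pass the needle",
)
fused = concat_fusion([0.1, 0.2], [0.3], [0.4, 0.5])
score = f1_score([1, 0, 1, 1], [1, 1, 0, 1])
```

In this toy example, `score` is computed from two true positives, one false positive, and one false negative. A real system would replace the hand-written vectors with encoder outputs and would likely use a learned fusion module rather than plain concatenation.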


Source: arXiv:2606.23593v1 (http://arxiv.org/abs/2606.23593v1)
PDF: https://arxiv.org/pdf/2606.23593v1
