Online Safety Monitoring for LLMs
Abstract
Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
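To make the thresholding idea concrete, here is a minimal sketch of such a monitor. It is not the paper's implementation, and several details are assumptions made for illustration: verifier scores are taken to be per-step numbers where higher means more likely unsafe, the controlled risk is taken to be the missed-alarm rate on unsafe traces, and the calibration step uses a conformal-risk-control style finite-sample correction. The function names `calibrate_threshold` and `first_alarm` are hypothetical.

```python
import math
from typing import Iterable, Optional, Sequence


def calibrate_threshold(unsafe_traces: Sequence[Sequence[float]], alpha: float) -> float:
    """Pick the largest threshold whose corrected miss rate stays <= alpha.

    Assumptions (not from the source): each trace is a non-empty list of
    verifier scores for one unsafe calibration example, and a trace is
    "missed" when no score reaches the threshold.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # A trace triggers an alarm iff its peak score reaches the threshold.
    peaks = sorted(max(trace) for trace in unsafe_traces)
    n = len(peaks)
    # Require (misses + 1) / (n + 1) <= alpha, i.e. misses <= k.
    k = math.floor(alpha * (n + 1)) - 1
    if k < 0:
        # Too little calibration data for this alpha: alarm on everything.
        return -math.inf
    # At threshold peaks[k], at most k peaks are strictly below it.
    return peaks[k]


def first_alarm(scores: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose score reaches the threshold."""
    for step, score in enumerate(scores):
        if score >= threshold:
            return step
    return None  # no alarm raised on this trace


# Toy calibration set: verifier scores for seven unsafe traces (made-up numbers).
calibration = [
    [0.1, 0.9], [0.2, 0.6], [0.3, 0.8], [0.05, 0.4],
    [0.7], [0.2, 0.95], [0.5, 0.55],
]
threshold = calibrate_threshold(calibration, alpha=0.25)
print(threshold)                                    # 0.55
print(first_alarm([0.1, 0.3, 0.6, 0.9], threshold)) # 2
print(first_alarm([0.1, 0.2], threshold))           # None
```

Under this sketch's assumptions, and if calibration and deployment traces are exchangeable, the expected fraction of unsafe traces that never trigger an alarm is at most `alpha`. A different risk, such as the false-alarm rate on safe traces, would be calibrated in the same way with the roles of the traces swapped.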
Source: arXiv:2607.02510v1 - http://arxiv.org/abs/2607.02510v1
PDF: https://arxiv.org/pdf/2607.02510v1
Jul 3, 2026
Data Science
Machine Learning