Research Paper
Researchia:202605.29015

Gram: Assessing sabotage propensities via automated alignment auditing

David Lindner


Submitted: May 29, 2026
Subjects: AI; Artificial Intelligence

Description / Details

We introduce Gram, an automated alignment auditing framework for assessing the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage, and find that they misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models, which results in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed specifically to evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline that enables fine-grained, targeted experiments to identify the drivers of misbehavior. We find that increasing the realism of environments and removing nudges to misbehave tend to reduce sabotage rates to close to zero.
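A headline figure such as "about 2-3% of trajectories" is a pooled proportion over labelled trajectories, and it is usually read alongside an uncertainty interval. The following is a minimal sketch of that arithmetic and is not taken from the paper: the function names `wilson_interval` and `pooled_rate`, the per-scenario boolean labels, and the choice of a Wilson score interval are all assumptions made for illustration.

```python
from __future__ import annotations

import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def pooled_rate(flags_by_scenario: dict[str, list[bool]]) -> tuple[int, int, float]:
    """Pool hypothetical per-scenario misbehavior flags into (count, total, rate)."""
    k = sum(sum(flags) for flags in flags_by_scenario.values())
    n = sum(len(flags) for flags in flags_by_scenario.values())
    if n == 0:
        raise ValueError("no trajectories provided")
    return k, n, k / n


if __name__ == "__main__":
    # Hypothetical labels, NOT data from the paper: True marks a flagged trajectory.
    example = {
        "scenario_a": [False] * 39 + [True],
        "scenario_b": [False] * 40,
    }
    k, n, rate = pooled_rate(example)
    low, high = wilson_interval(k, n)
    print(f"{k}/{n} flagged, rate={rate:.3f}, interval=({low:.3f}, {high:.3f})")
```

With small per-scenario sample sizes and rates near zero, an interval of this kind stays well-behaved where the simpler normal approximation does not, which is why it is used in this sketch.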


Source: arXiv:2605.30322v1 (http://arxiv.org/abs/2605.30322v1)
PDF: https://arxiv.org/pdf/2605.30322v1
