Research Paper, Researchia:202606.19066

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Sihui Dai

Submitted: June 19, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.
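To make the experimental design concrete, the following is a minimal sketch of how a grid of demonstration-composition and ordering conditions might be enumerated, and how a harmful-compliance rate might be computed per condition. It is not the authors' code: the abstract does not give the number of demonstrations, the mixing ratios, or the orderings tested, so `Demo`, `build_context`, `conditions`, `compliance_rate`, and the three ordering labels are all hypothetical. Demonstrations are represented only by a label and an index, with no request or response text.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator

# Hypothetical ordering labels; the paper's actual orderings are not stated in the abstract.
ORDERS = ("benign_first", "harmful_first", "interleaved")


@dataclass(frozen=True)
class Demo:
    kind: str   # "benign" or "harmful" compliance demonstration (label only)
    index: int  # position within its own pool


def build_context(n_benign: int, n_harmful: int, order: str) -> list[Demo]:
    """Arrange labelled demonstrations in the requested order."""
    benign = [Demo("benign", i) for i in range(n_benign)]
    harmful = [Demo("harmful", i) for i in range(n_harmful)]
    if order == "benign_first":
        return benign + harmful   # harmful demonstrations are most recent
    if order == "harmful_first":
        return harmful + benign   # benign demonstrations are most recent
    if order == "interleaved":
        mixed: list[Demo] = []
        for i in range(max(n_benign, n_harmful)):
            if i < n_benign:
                mixed.append(benign[i])
            if i < n_harmful:
                mixed.append(harmful[i])
        return mixed
    raise ValueError(f"unknown order: {order!r}")


def conditions(total: int, orders: tuple[str, ...] = ORDERS) -> Iterator[tuple[int, str, list[Demo]]]:
    """Yield every (n_harmful, order, context) cell for a fixed demonstration budget."""
    for n_harmful, order in product(range(total + 1), orders):
        yield n_harmful, order, build_context(total - n_harmful, n_harmful, order)


def compliance_rate(outcomes: list[bool]) -> float:
    """Fraction of test requests judged as complied with (True = complied)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under these assumptions, comparing `compliance_rate` across cells that share a composition but differ in `order` would be one way to probe the recency effect the abstract describes, while comparing cells that share an order but differ in `n_harmful` would probe whether benign and harmful demonstrations are interchangeable.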


Source: arXiv:2606.20508v1 (http://arxiv.org/abs/2606.20508v1)
PDF: https://arxiv.org/pdf/2606.20508v1
