Research Paper (Researchia: 202604.17061)

π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Yaocheng Zhang

Abstract


Submitted: April 17, 2026Subjects: Machine Learning; Data Science


Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play (π-Play), a multi-agent self-evolution framework. In π-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages the QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2–3× over conventional self-play.
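The examiner/teacher/student loop described in the abstract can be sketched in a few lines of plain Python. All names and data structures below (`examiner`, `teacher`, `student_update`, the string-based QCP) are illustrative assumptions for exposition, not the paper's actual implementation:

```python
# Hypothetical sketch of one π-Play iteration: the examiner builds a task
# in reverse and keeps the question construction path (QCP) as a by-product;
# the teacher, given the QCP as privileged context, emits a dense step-by-step
# target trace; the student is distilled on every step instead of receiving
# a single sparse outcome reward.

def examiner(seed_fact: str):
    """Construct a task in reverse from a known answer, recording the QCP."""
    qcp = [
        f"start from the answer: {seed_fact}",
        "obscure the answer behind an intermediate lookup",
        "compose the final multi-hop question",
    ]
    question = f"Which entity satisfies the chain built from '{seed_fact}'?"
    return question, seed_fact, qcp

def teacher(question: str, qcp: list[str]) -> list[str]:
    """Use the QCP as privileged context to produce a dense target trace
    (reversing the construction path approximates the forward solution)."""
    return [f"step {i}: invert '{s}'" for i, s in enumerate(reversed(qcp))]

def student_update(policy: dict, question: str, trace: list[str]) -> dict:
    """Self-distillation stand-in: supervise the student on every teacher
    step, yielding dense feedback rather than one outcome reward."""
    policy.setdefault(question, []).extend(trace)
    return policy

# One loop iteration with a toy seed fact.
student: dict = {}
question, answer, qcp = examiner("Marie Curie")
trace = teacher(question, qcp)
student = student_update(student, question, trace)
```

The key point the sketch illustrates is structural: the QCP is free supervision that self-play already generates, so the teacher's dense trace costs nothing beyond the task-generation step itself.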


Source: arXiv:2604.14054v1 (http://arxiv.org/abs/2604.14054v1)
PDF: https://arxiv.org/pdf/2604.14054v1

