Research Paper | Researchia:202606.25073

Autodata: An agentic data scientist to create high quality synthetic data

Ilia Kulikov

Submitted: June 25, 2026
Subjects: AI; Artificial Intelligence

Description / Details

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high-quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent so that it learns to create even stronger data. We describe the overall formulation and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks, and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher-quality model training. Overall, we believe this direction has the potential to change the way we build AI data.


Source: arXiv:2606.25996v1 (http://arxiv.org/abs/2606.25996v1)
PDF: https://arxiv.org/pdf/2606.25996v1
