
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to synthesize reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
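
The abstract only describes the architecture at a high level. The following is a purely illustrative PyTorch sketch, not the authors' released implementation, of one plausible way "learnable queries" could extract reference semantics from latent visual features: a small bank of query tokens cross-attends to reference-image latents and yields compact conditioning tokens. All names here (ReferenceQueryFusion, num_queries, dim) are assumptions made for illustration only.

# Illustrative sketch only; the actual Kiwi-Edit design may differ.
import torch
import torch.nn as nn

class ReferenceQueryFusion(nn.Module):
    """Hypothetical module: a bank of learnable queries cross-attends to
    reference-image latents, producing compact reference tokens that could
    condition a video editing backbone alongside instruction embeddings."""
    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_latents: torch.Tensor) -> torch.Tensor:
        # ref_latents: (batch, num_patches, dim) latent features of the
        # reference image, e.g. from a frozen vision or VAE encoder.
        b = ref_latents.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(q, ref_latents, ref_latents)
        # Residual + norm yields reference tokens for conditioning.
        return self.norm(fused + q)

# Usage sketch: 16 reference tokens per sample.
tokens = ReferenceQueryFusion()(torch.randn(2, 256, 768))
print(tokens.shape)  # torch.Size([2, 16, 768])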


Source: arXiv:2603.02175v1 (http://arxiv.org/abs/2603.02175v1)
PDF: https://arxiv.org/pdf/2603.02175v1

Submission: 3/4/2026
Subjects: Artificial Intelligence (AI)
