Research Paper (Researchia:202602.17002)

CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar

Submitted: February 17, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
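
To make the idea of codec primitives concrete, the following is a minimal sketch of generic block-based motion compensation, the mechanism by which a video codec rebuilds a predicted frame from a reference frame, per-block motion vectors, and a residual. It is not the paper's encoder or pipeline. The function name `motion_compensate`, the 4x4 block size, and the toy 16x16 frames are all assumptions chosen for illustration. The point it shows is that when little changes between frames, the motion vectors and residual are mostly zero, which is the redundancy and sparsity the abstract refers to.

```python
import numpy as np

def motion_compensate(reference, motion_vectors, residual, block=4):
    """Rebuild a predicted frame from shifted reference blocks plus a residual.

    Hypothetical helper for illustration; real codecs use variable block
    sizes, sub-pixel motion, and transform-coded residuals.
    """
    h, w = reference.shape
    predicted = np.zeros_like(reference)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Locate the source block in the reference frame, clamped to the image.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            predicted[by:by + block, bx:bx + block] = reference[sy:sy + block, sx:sx + block]
    return predicted + residual

# Toy reference frame: a bright 4x4 square on a dark background.
reference = np.zeros((16, 16), dtype=np.int16)
reference[4:8, 4:8] = 100

# Toy target frame: the square moved 4 pixels right, with one pixel brightened.
target = np.zeros((16, 16), dtype=np.int16)
target[4:8, 8:12] = 100
target[5, 9] = 120

# One (dy, dx) vector per 4x4 block; only two blocks need non-zero motion.
motion_vectors = np.zeros((4, 4, 2), dtype=np.int64)
motion_vectors[1, 2] = (0, -4)  # new square location copies the old square
motion_vectors[1, 1] = (0, -4)  # old square location copies background

# The residual is whatever motion compensation alone fails to explain.
prediction_only = motion_compensate(reference, motion_vectors, np.zeros_like(reference))
residual = target - prediction_only

reconstructed = motion_compensate(reference, motion_vectors, residual)
moving_blocks = int(np.count_nonzero(motion_vectors.any(axis=-1)))
nonzero_residual = int(np.count_nonzero(residual))
print("moving blocks:", moving_blocks, "of", motion_vectors.shape[0] * motion_vectors.shape[1])
print("non-zero residual pixels:", nonzero_residual, "of", residual.size)
```

In this toy case the full target frame is recovered exactly from the reference frame plus a handful of non-zero values, which suggests why encoding motion vectors and residuals can be much cheaper than encoding every frame as a full image.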


Source: arXiv:2602.13191v1 - http://arxiv.org/abs/2602.13191v1
PDF: https://arxiv.org/pdf/2602.13191v1
