AI Video Papers (2026)

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom performs training-free visual token compression early inside the vision encoder to cut time-to-first-token (TTFT) and FLOPs for Video-LLMs. It introduces a decoupled spatial token selection strategy and reports up to a 2.65× TTFT reduction and up to 61% FLOPs savings on LLaVA-OneVision-7B (single NVIDIA A100) while keeping accuracy comparable to the full-token baseline. It is aimed at latency-sensitive video understanding.

Introduction

Video-LLMs deliver strong multimodal understanding but are often bottlenecked by the vision encoder: processing dense visual tokens inflates time-to-first-token (TTFT) before any language decoding begins. EarlyTom's core insight is counterintuitive but practical: compress visual tokens early inside the encoder, without extra training, so that the costly encoder work itself processes fewer tokens, not just the downstream layers.
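To see why the compression point matters, here is a minimal sketch of a token-layer cost count. It is an illustration under stated assumptions, not the paper's cost model: the function name, the linear cost unit (one token passing through one layer, ignoring the quadratic attention term), and the example numbers (24 layers, 1000 tokens, keep ratio 0.5) are all hypothetical.

```python
def encoder_token_layer_cost(num_layers, num_tokens, compress_after=None, keep_ratio=1.0):
    """Count token-layer units: one unit is one token passing through one layer.

    Simplified linear model (assumption): it ignores the quadratic attention
    term and any overhead of the compression step itself.
    """
    if compress_after is None:
        # No compression: every layer sees every token.
        return num_layers * num_tokens
    if not 0 <= compress_after <= num_layers:
        raise ValueError("compress_after must lie between 0 and num_layers")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = max(1, round(num_tokens * keep_ratio))
    # Layers before the compression point see all tokens; later layers see only the kept ones.
    return compress_after * num_tokens + (num_layers - compress_after) * kept


# Hypothetical encoder: 24 layers, 1000 visual tokens, half the tokens kept.
full = encoder_token_layer_cost(24, 1000)            # 24000
early = encoder_token_layer_cost(24, 1000, 6, 0.5)   # 15000
late = encoder_token_layer_cost(24, 1000, 24, 0.5)   # 24000
print(full, early, late)
```

In this toy count, compressing only after the last encoder layer saves nothing inside the encoder, while compressing after layer 6 removes 37.5% of the token-layer units. Real savings depend on the architecture and on the cost of the selection step.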

Key Findings

Early-stage, training-free token compression reduces TTFT by up to 2.65× on a single NVIDIA A100 for the LLaVA-OneVision-7B setup. So what: first-token latency, which is critical for interactive and video-streaming use cases, drops materially, improving responsiveness.
FLOPs drop by up to 61% while maintaining accuracy comparable to full-token baselines. So what: substantial inference-cost reduction enables higher throughput or lower cloud/GPU spend for the same model.
Decoupled spatial token selection (separating selection mechanics from downstream compression) yields better compression effectiveness than late-stage-only approaches. So what: it preserves task-relevant spatial information while discarding redundancy earlier in the pipeline.
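As a rough worked example of what the headline figures imply, the sketch below converts them into concrete numbers. The baseline values (1000 ms TTFT, 100 TFLOPs) are hypothetical placeholders, not measurements from the paper, and the code assumes that a "2.65× reduction" means the new TTFT is the baseline divided by 2.65. Both figures are reported as "up to", so they are best-case bounds.

```python
def ttft_after_reduction(baseline_ms, reduction_factor):
    """New TTFT, assuming an N-fold reduction means baseline / N."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return baseline_ms / reduction_factor


def flops_after_savings(baseline_flops, savings_fraction):
    """Remaining compute after saving the given fraction of FLOPs."""
    if not 0 <= savings_fraction < 1:
        raise ValueError("savings_fraction must be in [0, 1)")
    return baseline_flops * (1.0 - savings_fraction)


BASELINE_TTFT_MS = 1000.0  # hypothetical baseline, not from the paper
BASELINE_TFLOPS = 100.0    # hypothetical baseline, not from the paper

print(f"{ttft_after_reduction(BASELINE_TTFT_MS, 2.65):.1f} ms")    # 377.4 ms
print(f"{flops_after_savings(BASELINE_TFLOPS, 0.61):.1f} TFLOPs")  # 39.0 TFLOPs
```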

Who it's for and tradeoffs

Great fit if you operate latency-sensitive Video-LLM services (live captioning, quick video Q&A, multimodal agents) and need a low-effort, deployment-friendly way to cut encoder cost without retraining. Look elsewhere if your primary constraint is maximal accuracy on extremely fine-grained visual tasks (where any token pruning risks missing tiny cues), or if you need compression policies that are learned through end-to-end retraining.

Method notes

The approach is training-free: it integrates a token selection mechanism inside the vision encoder and pairs it with spatially decoupled selection to avoid over-pruning important regions. Because no model weights change, the design delivers its TTFT and FLOPs reductions on existing Video-LLM stacks immediately at deployment.
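The summary does not spell out how tokens are scored or what "decoupled" selection involves, so the following is only a generic sketch of training-free token pruning, not EarlyTom's algorithm. It assumes a hypothetical saliency proxy (distance from the frame's mean token) and prunes each frame independently with plain top-k; all function names are illustrative.

```python
import math


def token_scores(frame):
    """Score each token by its Euclidean distance from the frame's mean token.

    Hypothetical saliency proxy (assumption): tokens far from the mean are
    treated as more informative. This is not the paper's scoring rule.
    """
    dim = len(frame[0])
    mean = [sum(tok[d] for tok in frame) / len(frame) for d in range(dim)]
    return [math.sqrt(sum((tok[d] - mean[d]) ** 2 for d in range(dim))) for tok in frame]


def select_tokens(frame, keep):
    """Return indices of the `keep` highest-scoring tokens, in original order."""
    if not 1 <= keep <= len(frame):
        raise ValueError("keep must be between 1 and the number of tokens")
    scores = token_scores(frame)
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors so their spatial order is preserved.
    return sorted(ranked[:keep])


def compress_video(frames, keep):
    """Prune every frame independently so that each keeps `keep` tokens."""
    return [[frame[i] for i in select_tokens(frame, keep)] for frame in frames]


# Toy frame: four 2-D tokens; no weights are trained or modified.
frame = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [0.0, 2.0]]
print(select_tokens(frame, 2))            # [0, 2]
print(compress_video([frame, frame], 2))  # two frames, two tokens each
```

Because selection here needs only the token values themselves, it can run between encoder layers of a frozen model, which is the property that makes a training-free approach deployable without retraining.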

Information

Website: arxiv.org
Authors: Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang
Published date: 2026/05/28

More Items

AI Video Papers (2026)

HOMIE: Human-object Centric Video Personalization via Multimodal Intelligent Enchancement

Yiyang Cai, Nan Chen +9

Personalizes subject-driven videos to preserve human identity and accurate human–object interactions by integrating multimodal references and MLLM-derived semantics. Introduces global multimodal guidance in self-attention and modality-reference embeddings to align MLLM features with VAE tokens, supporting both inter- and intra-subject inputs (e.g., OCR, multi-view).

video multimodal LLM ocr paper (+2 more)

AI Video Papers (2026)

Apple-π: Benchmarking Thinking with Video Towards Law-Grounded Physical Intelligence

Runmao Yao, Kairui Hu +12

Evaluates whether video models reason according to physical laws by treating generated videos as visible reasoning traces and using a three-stage Perception–Formulation–Deduction protocol. Includes Orchard (400 mechanics videos), chain-of-frames prompting on annotated first frames, and a hybrid MLLM-plus-objective scoring suite for stage-resolved diagnostics.

video ai-video physics benchmark evaluation (+4 more)

AI Video Papers (2026)

TimeLens2: Generalist Video Temporal Grounding with Multimodal LLMs

Yuhan Zhu, Changlian Ma +13

Predicts variable-cardinality sets of evidence intervals in videos to temporally ground queries using multimodal large language models. Combines caption-derived multi-span supervision, a temporal Wasserstein matching-free reward, and temporal IoU, yielding strong mIoU gains across multiple benchmarks.

video multimodal LLM qwen paper (+3 more)
