LogoAIAny
Icon for item

EarlyTom: Early Token Compression Completes Fast Video Understanding

Performs training-free early-stage visual token compression inside the vision encoder to cut time-to-first-token (TTFT) and FLOPs for Video-LLMs. Introduces a decoupled spatial token selection strategy and reports up to 2.65× TTFT reduction and 61% FLOPs savings on LLaVA-OneVision-7B (NVIDIA A100) while preserving full-token accuracy — aimed at latency-sensitive video understanding.

Introduction

Video LLMs deliver strong multimodal understanding but are often bottlenecked by the vision encoder: processing dense visual tokens inflates time-to-first-token (TTFT) before any language decoding begins. EarlyTom's core insight is counterintuitive but practical — compress visual tokens inside the encoder early (without extra training) so the costly encoder work itself processes fewer tokens, not just the downstream layers.

Key Findings
  • Early-stage, training-free token compression reduces TTFT by up to 2.65× on a single NVIDIA A100 for the LLaVA-OneVision-7B setup. So what: first-token latency — critical for interactive/video-streaming use cases — is materially lowered, improving responsiveness.
  • FLOPs drop by up to 61% while maintaining accuracy comparable to full-token baselines. So what: substantial inference-cost reduction enables higher throughput or lower cloud/GPU spend for the same model.
  • Decoupled spatial token selection (separating selection mechanics from downstream compression) yields better compression effectiveness than late-stage-only approaches. So what: it preserves task-relevant spatial information while discarding redundancy earlier in the pipeline.
Who it's for and tradeoffs

Great fit if you operate latency-sensitive Video-LLM services (live captioning, quick video Q&A, multimodal agents) and need a low-effort, deployment-friendly way to cut encoder cost without retraining. Look elsewhere if your primary constraint is maximal accuracy on extremely fine-grained visual tasks (where any token pruning risks missing tiny cues), or if you require end-to-end retraining for adapted compression policies.

Method notes

The approach is training-free and integrates a token selection mechanism inside the vision encoder, combined with spatially decoupled selection to avoid over-pruning important regions. That design yields immediate deployment gains (reduced TTFT and FLOPs) on existing Video-LLM stacks without revising the model weights.

Information

  • Websitearxiv.org
  • AuthorsHesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang
  • Published date2026/05/28

Categories