Video LLMs deliver strong multimodal understanding but are often bottlenecked by the vision encoder: processing dense visual tokens inflates time-to-first-token (TTFT) before any language decoding begins. EarlyTom's core insight is counterintuitive but practical: compress visual tokens early, inside the encoder and without extra training, so that the costly encoder itself processes fewer tokens rather than only the downstream layers.
Key findings
- Early-stage, training-free token compression reduces TTFT by up to 2.65× on a single NVIDIA A100 for the LLaVA-OneVision-7B setup. So what: first-token latency, which is critical for interactive and video-streaming use cases, is materially lowered, improving responsiveness.
- FLOPs drop by up to 61% while maintaining accuracy comparable to full-token baselines. So what: substantial inference-cost reduction enables higher throughput or lower cloud/GPU spend for the same model.
- Decoupled spatial token selection (separating the selection mechanics from the downstream compression) compresses more effectively than approaches that prune only at late stages. So what: it preserves task-relevant spatial information while discarding redundancy earlier in the pipeline.
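To make the headline numbers concrete, the short sketch below converts the reported "up to 2.65×" TTFT speedup and "up to 61%" FLOPs reduction into comparable terms. The helper names are hypothetical, and the second conversion assumes that runtime scales linearly with FLOPs, which real hardware only approximates.

```python
def remaining_fraction_from_speedup(speedup: float) -> float:
    """Fraction of the baseline latency that remains after an N-times speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def ideal_speedup_from_reduction(reduction: float) -> float:
    """Idealized speedup if cost drops by `reduction` (0..1) and time tracks cost.

    Assumption: runtime is proportional to FLOPs. Memory traffic and kernel
    launch overhead usually make the real speedup differ from this figure.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Figures quoted in the findings above (both are "up to" values).
ttft_left = remaining_fraction_from_speedup(2.65)
flops_speedup = ideal_speedup_from_reduction(0.61)

print(f"TTFT remaining at 2.65x speedup: {ttft_left:.1%}")
print(f"Ideal speedup from 61% fewer FLOPs: {flops_speedup:.2f}x")
```

Under these assumptions, a 2.65× speedup leaves roughly 38% of the baseline TTFT, and a 61% FLOPs cut corresponds to an idealized speedup of about 2.56×, so the two reported figures are of similar magnitude.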
Who it's for and tradeoffs
Great fit if you operate latency-sensitive Video-LLM services (live captioning, quick video Q&A, multimodal agents) and need a low-effort, deployment-friendly way to cut encoder cost without retraining. Look elsewhere if your primary constraint is maximal accuracy on extremely fine-grained visual tasks (where any token pruning risks missing tiny cues), or if you need compression policies adapted through end-to-end retraining, which a training-free method does not provide.
Method notes
The approach is training-free: it integrates a token selection mechanism inside the vision encoder and combines it with spatially decoupled selection to avoid over-pruning important regions. That design yields immediate deployment gains (reduced TTFT and FLOPs) on existing Video-LLM stacks without modifying the model weights.
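The general idea of pruning tokens partway through the encoder can be illustrated with a minimal, dependency-free sketch. It is not EarlyTom's actual algorithm: the scoring rule (L2 norm), the per-window top-k policy used here as one possible reading of "spatially decoupled", and all function and parameter names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def l2_norm(tok: Token) -> float:
    """Hypothetical importance score: the token's L2 norm."""
    return math.sqrt(sum(x * x for x in tok))


def select_per_window(tokens: List[Token], grid_h: int, grid_w: int,
                      window: int, keep_per_window: int,
                      score: Callable[[Token], float] = l2_norm) -> List[int]:
    """Keep the top-scoring tokens inside each window x window cell.

    Selecting per cell (an assumed policy) guarantees every spatial region
    keeps some tokens, so no area of the frame is pruned away entirely.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    kept: List[int] = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            cell = [r * grid_w + c
                    for r in range(top, min(top + window, grid_h))
                    for c in range(left, min(left + window, grid_w))]
            cell.sort(key=lambda i: score(tokens[i]), reverse=True)
            kept.extend(cell[:keep_per_window])
    return sorted(kept)


def encode_with_early_selection(tokens: List[Token], layers: Sequence[Layer],
                                select_after: int, **selection_args
                                ) -> Tuple[List[Token], List[int]]:
    """Run the layers in order, pruning once after layer `select_after`.

    Every later layer then sees only the kept tokens, which is where the
    encoder-side savings come from. No weights are changed.
    """
    kept = list(range(len(tokens)))
    for i, layer in enumerate(layers):
        tokens = layer(tokens)
        if i == select_after:
            kept = select_per_window(tokens, **selection_args)
            tokens = [tokens[j] for j in kept]
    return tokens, kept


# Toy demo: a 4x4 grid of 2-d tokens and four stand-in "layers" that only
# record how many tokens they were asked to process.
tokens_seen: List[int] = []


def counting_layer(toks: List[Token]) -> List[Token]:
    tokens_seen.append(len(toks))
    return toks


grid = [[float(i), 0.0] for i in range(16)]
out, kept = encode_with_early_selection(
    grid, [counting_layer] * 4, select_after=0,
    grid_h=4, grid_w=4, window=2, keep_per_window=1)
print(kept)         # indices of surviving tokens, one per 2x2 window
print(tokens_seen)  # tokens processed by each layer
```

In this toy run only the first layer processes all 16 tokens; the remaining three each process 4. Moving `select_after` later shifts work back onto the full token set, which is the cost that late-stage-only pruning pays.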
