Keye-VL-2.0-30B-A3B

Keye-VL-2.0-30B-A3B performs hour-scale video understanding and fine-grained temporal localization, and exposes agent-style code, tool, and search abilities. It is built on a sparse-attention long-context architecture (DSA) and a specialized inference stack, and is best suited to GPU-backed research or production evaluation.

Introduction

Long-video understanding breaks assumptions most multimodal models are built on: relevance drifts over minutes, temporal grounding needs frame-level precision, and attention cost grows quadratically when context is scaled naively. Keye-VL-2.0-30B-A3B tackles this by combining a DSA-native long-context backbone with targeted feature aggregation and post-training merges that preserve reasoning quality across hour-scale inputs. The aim is temporal localization and cross-modal reasoning without the quadratic cost growth of dense attention.
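
To make the cost argument above concrete, here is a rough sketch comparing how many query-key pairs dense causal attention scores against a generic top-k sparse scheme. It is not the model's actual DSA implementation: the frame rate, tokens per frame, and `top_k` value are hypothetical numbers chosen for illustration, and the sketch ignores the cost of selecting which tokens to keep.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by dense causal attention: L * (L + 1) / 2."""
    return seq_len * (seq_len + 1) // 2


def sparse_attention_pairs(seq_len: int, top_k: int) -> int:
    """Pairs scored when each query attends to at most top_k earlier tokens.

    The first top_k positions see their whole prefix; every later position
    is capped at top_k, so growth is linear in seq_len past that point.
    """
    if seq_len <= top_k:
        return dense_attention_pairs(seq_len)
    return dense_attention_pairs(top_k) + (seq_len - top_k) * top_k


if __name__ == "__main__":
    # Hypothetical setup: one hour of video sampled at 1 frame per second,
    # 64 visual tokens per frame, and a top-k budget of 2048 tokens.
    frames = 60 * 60
    tokens_per_frame = 64
    top_k = 2048

    seq_len = frames * tokens_per_frame
    dense = dense_attention_pairs(seq_len)
    sparse = sparse_attention_pairs(seq_len, top_k)
    print(f"sequence length: {seq_len}")
    print(f"dense pairs:     {dense}")
    print(f"sparse pairs:    {sparse}")
    print(f"dense / sparse:  {dense / sparse:.1f}x")
```

Doubling `frames` roughly quadruples the dense count but only doubles the sparse one, which is the scaling difference the paragraph above describes.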

Key Capabilities
  • Fine-grained temporal localization: the release reports top-tier mIoU (mean intersection-over-union) scores on several TimeLens-style benchmarks. In practice this means better frame- or segment-level answers for “when did X happen?” queries across long videos.
  • Scales with context instead of collapsing: its sparse-attention and aggregation mechanisms are designed so accuracy can improve as more frames are provided, which helps tasks that need cross-segment evidence (summarization, long-term QA, and temporal retrieval).
  • Agent-ready multimodal primitives: built-in Code, Tool, and Search agent workflows let the model orchestrate repository tasks, API-style tools, and web-grounded lookups—useful for automated video analysis pipelines and tool-augmented reasoning.
  • Production-oriented efficiency stack: the release emphasizes custom kernels, ExtraIO, and heterogeneous ViT-LM parallelism to reduce long-sequence prefill cost. As a result, reproducing the benchmarks usually requires the recommended runtime and optimized kernels.
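
For readers unfamiliar with the mIoU metric mentioned in the first bullet, the sketch below shows the standard temporal intersection-over-union between a predicted and a ground-truth time span, averaged over queries. It assumes one predicted segment per query, expressed in seconds; the benchmarks' exact protocol may differ, and the function names are illustrative only.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection-over-union of two time spans on the video timeline."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], truths: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)


if __name__ == "__main__":
    # A prediction of 10s-20s against a ground truth of 15s-25s overlaps
    # for 5 seconds out of a 15-second union.
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

A score of 1.0 means the predicted span matches the annotation exactly, and 0.0 means the two spans do not overlap at all.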

Who it's for and trade-offs

Great fit if you need accurate temporal grounding and long-video reasoning at model scales you can run on GPU clusters, and if you can adopt the project's runtime (SGLang/Docker/custom kernels) to reproduce performance. It’s also suitable for teams building visual agents that must blend tool use, search, and visual perception.

Look elsewhere if you need tiny/edge/CPU-first models, zero-infrastructure single-GPU drop-in usage, or minimal-dependency deployments—the 30B model and its optimized kernels assume significant GPU and infra investment. Also treat benchmark claims as starting points: reproduce them on your hardware and datasets before relying on them for product decisions.
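
As a back-of-the-envelope check on the GPU requirement, the sketch below estimates the memory needed just to hold the weights. It assumes the "30B" in the model name is the total parameter count (the exact figure is not stated here) and that weights are stored at a uniform precision. Activations, the KV cache, and runtime overhead are not included, so real requirements will be higher.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Memory in GiB needed to hold the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Assumption: roughly 30 billion total parameters, read off the model name.
    total_params = 30e9
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(total_params, dtype):.1f} GiB")
```

Comparing these figures against the memory of the GPUs you have available gives a first indication of whether multi-GPU serving is needed before you invest in the recommended runtime.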

Where it fits

Compared to generic multimodal 30B models, Keye-VL-2.0 emphasizes hour-scale temporal reasoning and engineered inference optimizations. If your primary need is simple image VQA or small on-device inference, smaller vision-language model families remain more practical. If you need long-video temporal grounding or agentic toolchains, Keye-VL-2.0 is purpose-built for that slot.

Implementation notes

The model is published on Hugging Face under the Apache-2.0 license, with links to the project's GitHub repository and runtime. Expect nontrivial infrastructure work (optimized kernels, recommended launch configs) to match the release benchmarks and to handle hour-scale inputs reliably.