VLMEvalKit — Detailed Introduction
VLMEvalKit (Python package name: `vlmeval`) is an open-source evaluation toolkit specifically designed for large vision-language models (VLMs / LVLMs). Its main goal is to make multi-benchmark evaluation easy, reproducible, and automated for both researchers and model developers.
Core Features
- Broad model & benchmark coverage: supports 200+ LMMs and over 70 image/video benchmarks, with new models and benchmarks continually contributed by the community.
- One-command evaluation: handles data downloading, preprocessing, inference, prediction saving, and metric computation, so users can evaluate models with minimal setup (see the command sketch after this list).
- Generation-based evaluation: uses generation for all models and optionally applies LLM-based answer extraction to improve evaluation for benchmarks with free-form answers.
- Leaderboards & records: official OpenVLM leaderboard and downloadable detailed result files make cross-model comparisons straightforward; evaluation records are provided (e.g., via Hugging Face spaces and datasets).
- Robust handling of edge cases: features for models that use “thinking mode” (customizable `split_thinking`) and support for very long responses (saving predictions in TSV to avoid XLSX cell limits).
- Distributed & accelerated inference: supports multi-node distributed inference through integrations such as LMDeploy and vLLM to speed up large-scale evaluations.
- Compatibility notes: provides recommended transformers/torchvision/flash-attn versions per model family to improve reproducibility.
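As a rough sketch of what the one-command workflow looks like (the dataset and model names below are examples, and the exact `run.py` flags may differ across versions; follow the repo's QuickStart for the authoritative invocation):

```bash
# Evaluate one model on one benchmark; the toolkit saves prediction files
# and computed metrics automatically.
python run.py --data MMBench_DEV_EN --model idefics_9b_instruct --verbose

# Data-parallel inference across multiple GPUs (assumed torchrun-based launch,
# as described in the QuickStart for many open-source models).
torchrun --nproc-per-node=4 run.py --data MMBench_DEV_EN --model idefics_9b_instruct --verbose
```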
Typical Workflow
- Install `vlmeval` and the required dependencies following the QuickStart in the repo.
- Select a supported model, or implement a small adapter (`generate_inner()`) for custom models; see the sketch after this list.
- Run a single command to evaluate the model on the chosen benchmarks; the toolkit will produce prediction files and evaluation metrics.
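For a model that is not yet supported, the adapter can stay small. The sketch below is a minimal illustration assuming the `BaseModel` interface described in the repo's development guide, where `generate_inner()` receives an interleaved list of `{'type': 'image' | 'text', 'value': ...}` messages; the exact import path and signatures may differ between versions, and `MyToyVLM` is a hypothetical name:

```python
from vlmeval.vlm.base import BaseModel  # import path per the development guide; may vary by version


class MyToyVLM(BaseModel):
    """Hypothetical adapter that lets VLMEvalKit drive a custom model."""

    INTERLEAVE = True  # the model accepts interleaved image/text inputs

    def generate_inner(self, message, dataset=None):
        # `message` is expected to be a list of dicts such as:
        # [{'type': 'image', 'value': '/path/to/img.jpg'},
        #  {'type': 'text', 'value': 'What is in this image?'}]
        images = [m['value'] for m in message if m['type'] == 'image']
        prompt = '\n'.join(m['value'] for m in message if m['type'] == 'text')
        # Replace this stub with the model's real inference call;
        # the method should return the answer as a plain string.
        return f"[stub] received {len(images)} image(s) for prompt: {prompt}"
```

A new model also typically needs to be registered (e.g., added to `supported_VLM` in `vlmeval/config.py`) so the evaluation command can find it; the development guide documents the current procedure.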
Example of running a supported model through the Python API (from the README):
```python
from vlmeval.config import supported_VLM

model = supported_VLM['idefics_9b_instruct']()
# Forward Single Image
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)
```
Recent Improvements (high-level)
- Improved handling of models with thinking-mode outputs (customizable `split_thinking` and the `SPLIT_THINK=True` environment variable).
- Support for saving long outputs in TSV (`PRED_FORMAT=tsv`) to prevent truncation when responses exceed spreadsheet cell limits; see the sketch after this list.
- Routing and extractor refinements to better handle multiple-choice and text-inference tasks.
- New benchmark and model additions continually contributed by the community (physics reasoning benchmarks, video benchmarks, many new model families).
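A rough illustration of how these switches can be combined (the environment variables are the ones named above, while the `run.py` flags, dataset, and model names are only examples; check the current documentation for exact behavior):

```bash
# Strip the "thinking" segment from model outputs and save predictions as TSV
# so that very long responses are not truncated by XLSX cell limits.
SPLIT_THINK=True PRED_FORMAT=tsv \
  python run.py --data MMBench_DEV_EN --model idefics_9b_instruct --verbose
```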
Who should use it
- Researchers who need reproducible cross-model comparisons across many multimodal benchmarks.
- Model developers who want quick feedback on where their VLM performs well or poorly.
- Teams running large-scale evaluation pipelines that benefit from distributed inference and standardized evaluation protocols.
Citation & Community
VLMEvalKit provides a citation entry (conference/workshop paper) for academic use, and the project encourages contributions via GitHub. The repo links to leaderboards and evaluation records hosted on Hugging Face and offers community channels such as Discord for discussion and contributor coordination.
