Overview
LLaVA (Large Language and Vision Assistant) is an open-source project and research codebase that develops multimodal assistants through "visual instruction tuning", a method for fine-tuning large language models to follow multimodal image+text instructions. The project demonstrates how to connect a pretrained vision encoder (e.g., a CLIP variant) to a pretrained language model (e.g., the Vicuna/LLaMA families) via a lightweight projection layer, and then teach the combined model to perform image-grounded instruction-following using curated multimodal training mixtures.
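As a rough sketch of that connector design, the snippet below projects patch features from a frozen vision encoder into the LLM's embedding space and prepends them to the text token embeddings. The class name and dimensions are illustrative stand-ins (roughly a CLIP ViT-L/14-336 encoder feeding a 7B Vicuna-class model), not the repo's actual modules.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative two-layer MLP projector in the spirit of LLaVA-1.5
    (the original LLaVA used a single linear layer)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen vision encoder
        return self.proj(patch_features)

# Toy tensors standing in for real encoder outputs and prompt embeddings.
vision_feats = torch.randn(1, 576, 1024)   # 576 patch tokens, e.g. a 336x336 image with 14x14 patches
text_embeds = torch.randn(1, 32, 4096)     # token embeddings of the text prompt
connector = VisionLanguageConnector()
multimodal_embeds = torch.cat([connector(vision_feats), text_embeds], dim=1)
print(multimodal_embeds.shape)  # torch.Size([1, 608, 4096])
```

The language model then attends over this combined sequence exactly as it would over ordinary text tokens.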
Key features and components
- Visual instruction tuning methodology: a two-stage procedure (feature alignment, then multimodal instruction fine-tuning) used to produce models that can answer questions about images, describe scenes, perform OCR-style reasoning, and follow complex multimodal instructions (a parameter-freezing sketch follows this list).
- Model Zoo & checkpoints: public releases include LLaVA v1.x checkpoints, LLaVA-1.5 variants, and the LLaVA-NeXT family (stronger models with Llama-3 and Qwen backbones, plus video-capable variants). Checkpoints, their base LLMs, and the recommended prompt formats are documented in the repo.
- Training & evaluation pipelines: end-to-end scripts for pretraining (feature alignment) and finetuning (instruction tuning), as well as an evaluation suite (including GPT-assisted evaluation) used to benchmark on diverse multimodal datasets.
- Serving & demos: Gradio-based demo, CLI inference, SGLang/worker-based serving examples, and guidance for running quantized (4-bit/8-bit) inference to reduce GPU memory requirements. The project also documents how to run model workers, controllers, and web servers for comparisons across checkpoints.
- Practical support: LoRA training recipes, DeepSpeed integration, support notes for macOS/Windows/Intel dGPU, and community-maintained ports (e.g., llama.cpp, Hugging Face Spaces, Replicate).
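The two-stage recipe behind these pipelines boils down to a parameter-freezing schedule: feature alignment trains only the projector on image-caption data, while instruction fine-tuning also updates the LLM and keeps the vision encoder frozen. The toy sketch below shows only that schedule; component names and shapes are stand-ins, and real runs go through the repo's pretraining/finetuning scripts.

```python
import torch.nn as nn

# Toy stand-ins for the three components (not the repo's classes).
vision_tower = nn.Linear(1024, 1024)      # frozen CLIP-like encoder
mm_projector = nn.Linear(1024, 4096)      # vision-language connector
language_model = nn.Linear(4096, 32000)   # stand-in for the Vicuna/LLaMA decoder

def configure_stage(stage: int) -> None:
    """Stage 1: train only the projector. Stage 2: train projector + LLM; the vision tower stays frozen."""
    for p in vision_tower.parameters():
        p.requires_grad = False
    for p in mm_projector.parameters():
        p.requires_grad = True
    for p in language_model.parameters():
        p.requires_grad = (stage == 2)

configure_stage(1)  # feature alignment on image-caption pairs
configure_stage(2)  # multimodal instruction fine-tuning
```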
Releases and evolution
Since its April 2023 release, LLaVA has been actively developed. Notable milestones include a NeurIPS 2023 oral presentation for the original Visual Instruction Tuning paper, LLaVA-1.5 (improved baselines plus efficient fine-tuning with LoRA and quantization), and LLaVA-NeXT (stronger LMMs, video-capable variants, and larger-scale checkpoints). The repo documents release notes, demos, blog posts, and model performance summaries.
Typical usage
Developers and researchers can use LLaVA for:
- Training and evaluating multimodal assistants with provided datasets and recipes.
- Running demos locally or hosting model workers for image-chat and multimodal QA.
- Fine-tuning LLaVA variants with LoRA or running quantized inference on limited GPU setups (a hedged LoRA sketch follows this list).
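One way to approximate the LoRA route outside the repo's own training scripts is the PEFT library applied to a community-converted llava-hf checkpoint from the Hugging Face Hub; the checkpoint name, target-module regex, and hyperparameters below are assumptions for illustration, not the project's documented settings.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Community-converted checkpoint on the Hugging Face Hub (name assumed; verify on the Hub).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

# Attach low-rank adapters to the language model's attention projections only;
# the regex deliberately skips the vision tower and the projector.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights become trainable
```

Training then proceeds with any standard causal-LM fine-tuning loop; the repo's LoRA scripts remain the reference recipe.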
The README includes quick-start examples for Hugging Face integration, CLI inference, and launching Gradio demos.
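For orientation, here is a minimal inference sketch through the Transformers-side integration with optional 4-bit quantization; the checkpoint name, prompt template, and quantization settings are assumptions to check against the README and model card rather than the project's canonical quick start.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # community-converted checkpoint (name assumed)

# Optional: 4-bit quantization via bitsandbytes to fit the model on a smaller GPU.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

image_url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"  # LLaVA-1.5-style template

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The repo's own CLI and Gradio entry points serve the original checkpoints directly and document the corresponding 4-bit/8-bit options.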
License & usage notes
The code is Apache-2.0 licensed, but some datasets and base checkpoints used by LLaVA are covered by their own licenses (e.g., the LLaMA/Llama community licenses, OpenAI terms of use for GPT-generated data). Users must comply with the original licenses and terms of any datasets or base models they use.
Who should use it
Researchers, ML engineers, and practitioners interested in building or extending vision-language models, exploring multimodal instruction tuning, or deploying multimodal chat assistants locally or in production can use LLaVA as a well-documented reference and starting point.
