Overview
llama.cpp enables LLM inference on commodity CPUs, with no GPU required, using quantized GGUF weights, and ships an HTTP server (llama-server) that exposes an OpenAI-compatible API; streaming is delivered over server-sent events rather than WebSockets.
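A minimal sketch of calling the server's OpenAI-compatible chat endpoint, assuming a local llama-server instance on its default port 8080; the model name and prompt are placeholders, and only the Python standard library is used:

```python
import json
import urllib.request

# Assumes: llama-server --model model.gguf  (listening on http://localhost:8080)
url = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "local-model",  # llama-server serves one model; the name is informational
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Standard OpenAI-style response shape: choices[0].message.content
print(body["choices"][0]["message"]["content"])
```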
Key Capabilities
- Quantized matrix-multiplication kernels (int8, int4, and other low-bit formats) with SIMD paths for AVX2, AVX-VNNI, and ARM NEON; a toy quantization sketch follows this list
- GPU offload of some or all model layers (CUDA, Metal, Vulkan, and OpenCL backends, among others); see the loading sketch after this list
- LoRA adapter loading at inference time, plus experimental fine-tuning utilities
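To make the quantization bullet concrete, here is a toy NumPy sketch of block-wise int8 quantization in the spirit of GGUF's Q8_0 format (blocks of 32 values sharing one scale). This is an illustrative simplification, not llama.cpp's actual kernel code, and the function names are hypothetical:

```python
import numpy as np

BLOCK = 32  # Q8_0 groups weights into blocks of 32 values

def quantize_q8_0(x: np.ndarray):
    """Toy block-wise int8 quantization: one scale per 32-value block."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scale = np.where(scale == 0.0, 1.0, scale)            # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)  # (real Q8_0 stores the scale as fp16)

def dequantize_q8_0(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int8 values and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

weights = np.random.randn(4 * BLOCK).astype(np.float32)
q, s = quantize_q8_0(weights)
approx = dequantize_q8_0(q, s)
print("max abs error:", np.abs(weights - approx).max())  # small for int8
```

And a short sketch of the GPU-offload and LoRA bullets using the third-party llama-cpp-python bindings (not part of llama.cpp itself); the model and adapter paths are placeholders:

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers are offloaded to whatever
# GPU backend llama.cpp was built with (CUDA, Metal, ...); -1 offloads all.
llm = Llama(
    model_path="./model.Q4_K_M.gguf",  # placeholder path to a quantized GGUF
    n_gpu_layers=-1,                   # full offload; use 0 for pure CPU inference
    lora_path="./adapter.gguf",        # placeholder LoRA adapter, applied at load
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```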
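The split between CPU quantization kernels and optional layer offload is what lets the same GGUF file run anywhere: layers that fit in VRAM run on the GPU backend, and the rest fall back to the SIMD CPU paths.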