KTransformers: A Flexible Framework for LLM Inference and Fine-Tuning Optimizations
Overview
KTransformers is a research-oriented project dedicated to enhancing the efficiency of large language models (LLMs) through advanced CPU-GPU heterogeneous computing techniques. Initially developed as an integrated framework, it has since evolved into two modular components: kt-kernel for optimized inference kernels and kt-sft for fine-tuning capabilities. This separation improves modularity, maintainability, and ease of integration with other tools. The project targets key challenges in deploying and training massive models, particularly Mixture-of-Experts (MoE) architectures, by leveraging heterogeneous hardware to minimize resource demands and maximize performance.
Purpose and Key Use Cases
The core purpose of KTransformers is to democratize access to cutting-edge LLM optimizations, allowing researchers and developers to experiment with high-performance inference and fine-tuning without requiring excessive computational resources. It excels in scenarios such as:
- Hybrid Inference: Placing 'hot' experts on GPUs for speed and 'cold' experts on CPUs for cost-efficiency in MoE models (see the placement sketch after this list).
- Resource-Constrained Training: Fine-tuning ultra-large models (e.g., 671B parameters) using just 70GB GPU memory combined with 1.3TB RAM.
- Production Deployment: Integrating with serving frameworks like SGLang for scalable, multi-concurrency inference.
- Model Compatibility: Supporting a wide range of models including DeepSeek-R1/V3/V2, Kimi-K2 variants, Qwen3-Next, GLM4-MoE, Mixtral 8x7B/8x22B, and experimental LLaMA 4 support.
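To illustrate the hybrid-placement idea, the sketch below routes each token to one of several experts, keeping a few "hot" experts on the GPU and the remaining "cold" experts in CPU memory. This is a conceptual PyTorch toy: the sizes, top-1 routing, and placement policy are illustrative assumptions, not KTransformers' actual implementation.

```python
# Conceptual sketch of hot/cold expert placement for MoE inference.
# Sizes, routing, and placement policy are illustrative only.
import torch
import torch.nn as nn

class HybridMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, n_hot=2):
        super().__init__()
        self.gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Router and "hot" experts live on the GPU; "cold" experts stay in CPU RAM.
        self.router = nn.Linear(d_model, n_experts).to(self.gpu)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model).to(self.gpu if i < n_hot else "cpu")
             for i in range(n_experts)]
        )

    def forward(self, x):                          # x: [tokens, d_model] on the GPU
        top1 = self.router(x).argmax(dim=-1)       # one expert per token (top-1 routing)
        out = torch.zeros_like(x)
        for idx in top1.unique().tolist():
            expert = self.experts[idx]
            dev = next(expert.parameters()).device
            mask = top1 == idx
            # Ship only the routed tokens to the expert's device, then bring results back.
            out[mask] = expert(x[mask].to(dev)).to(x.device)
        return out

moe = HybridMoE()
tokens = torch.randn(16, 256, device=moe.gpu)
print(moe(tokens).shape)                           # torch.Size([16, 256])
```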
By addressing VRAM bottlenecks and enabling longer context lengths (up to 139K tokens in 24GB VRAM), KTransformers significantly reduces barriers to advanced AI experimentation.
Core Modules
kt-kernel: High-Performance Inference Kernels
This module provides CPU-optimized operations for heterogeneous LLM inference, emphasizing acceleration on diverse hardware.
- Acceleration Techniques: Utilizes Intel AMX and AVX512/AVX2 instructions for INT4/INT8 quantized inference, alongside MoE-specific optimizations with NUMA-aware memory management (a quantization sketch follows this list).
- Quantization and Backend Support: Handles CPU-side INT4/INT8 weights, GPU-side GPTQ, Unsloth 1.58/2.51-bit weights, IQ1_S/FP8 hybrids, and on-GPU dequantization (q2k/q3k/q5k). Also supports FP8 GPU kernels and llamafile as a linear backend.
- Hardware Compatibility: Includes Intel Arc GPUs, AMD GPUs via ROCm, Ascend NPUs, multi-GPU setups, and Windows native support.
- Integration: Offers a clean Python API for seamless incorporation into frameworks like SGLang.
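To make the CPU-side INT8 path concrete, the sketch below shows weight-only, symmetric per-row INT8 quantization with on-the-fly dequantization during the matrix multiply. It is an illustrative NumPy toy under assumed layouts; kt-kernel's real kernels use AMX/AVX intrinsics and different data formats.

```python
# Illustrative weight-only symmetric INT8 quantization; not kt-kernel's actual kernels.
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-output-row quantization: scale = max(|row|) / 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequant_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    # Weight-only path: dequantize rows on the fly, then do a float matmul.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)   # weight: [out_features, in_features]
x = rng.standard_normal((4, 128)).astype(np.float32)    # activations: [batch, in_features]
q, scale = quantize_int8(w)
err = np.abs(x @ w.T - dequant_matmul(x, q, scale)).max()
print(f"max abs error vs. fp32: {err:.4f}")              # small quantization error
```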
Performance Highlights:
| Model | Hardware | Total Throughput | Output Throughput |
|---|---|---|---|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |
Quick installation: run pip install . inside the kt-kernel directory. For multi-GPU setups and injection tutorials, refer to the documentation.
kt-sft: Fine-Tuning Framework
Integrated with LLaMA-Factory, this module enables efficient fine-tuning of massive MoE models.
- Efficiency Features: Supports LoRA fine-tuning with heterogeneous acceleration, dramatically reducing GPU memory requirements (a minimal LoRA sketch follows this list).
- Workflow Support: Includes chat, batch inference, and metrics evaluation for production readiness.
- Advanced Capabilities: Handles 3-layer prefix cache reuse (GPU-CPU-Disk) and multi-concurrency via balance-serve.
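For readers unfamiliar with what the LoRA path trains, the sketch below wraps a frozen linear layer with a low-rank adapter so that only the small A and B matrices receive gradients. The rank, scaling, and layer sizes are illustrative assumptions, unrelated to kt-sft's actual defaults.

```python
# Minimal LoRA adapter around a frozen linear layer; dimensions are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = base(x) + (alpha/r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")   # only the A and B matrices train
```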
Performance Highlights:
| Model | Configuration | Throughput | GPU Memory |
|---|---|---|---|
| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-GPU) |
| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |
Quick start: Follow kt-sft/README.md for environment setup, then run USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml.
Development History and Updates
Launched around August 2024, KTransformers has seen rapid iterations:
- August 2024: Initial Windows support, multi-GPU support, Mixtral models, and VRAM reductions for DeepSeek-V2 (from 21GB to 11GB).
- February 2025: DeepSeek-R1/V3 support with 3-28x speedups, FP8 kernels, extended contexts (4K to 139K tokens).
- March-April 2025: ROCm/AMD GPU, multi-concurrency, LLaMA 4, AMX quantizations, Intel Arc.
- May-July 2025: Prefix caching, Kimi-K2, SmallThinker/GLM4-MoE.
- September-November 2025: Qwen3-Next, SGLang integration, Ascend NPU, Kimi-K2-Thinking, LLaMA-Factory enhancements.
The project has over 16,000 GitHub stars, and its CPU/GPU hybrid inference approach for MoE models is described in a 2025 ACM SIGOPS paper. A 2025 Q4 roadmap outlines planned expansions.
Community and Contributions
Developed by the MADSys Lab at Tsinghua University and Approaching.AI, with 90+ contributors. Engage via GitHub issues, pull requests, or WeChat groups. The original integrated framework is archived for legacy reference, and comprehensive documentation and online books are available.
