
KTransformers

KTransformers is a flexible framework for experimenting with cutting-edge optimizations in LLM inference and fine-tuning, with a focus on CPU-GPU heterogeneous computing. It consists of two core modules: kt-kernel for high-performance inference kernels and kt-sft for fine-tuning. The project supports a wide range of hardware and models, including the DeepSeek series and Kimi-K2, and achieves significant resource savings and speedups, such as running a 671B model in 70GB of GPU memory and delivering up to 28x acceleration.

Introduction

KTransformers: A Flexible Framework for LLM Inference and Fine-Tuning Optimizations

Overview

KTransformers is a research-oriented project dedicated to enhancing the efficiency of large language models (LLMs) through advanced CPU-GPU heterogeneous computing techniques. Initially developed as an integrated framework, it has since evolved into two modular components: kt-kernel for optimized inference kernels and kt-sft for fine-tuning capabilities. This separation improves modularity, maintainability, and ease of integration with other tools. The project targets key challenges in deploying and training massive models, particularly Mixture-of-Experts (MoE) architectures, by leveraging heterogeneous hardware to minimize resource demands and maximize performance.

Purpose and Key Use Cases

The core purpose of KTransformers is to democratize access to cutting-edge LLM optimizations, allowing researchers and developers to experiment with high-performance inference and fine-tuning without requiring excessive computational resources. It excels in scenarios such as:

  • Hybrid Inference: Placing 'hot' experts on GPUs for speed and 'cold' experts on CPUs for cost-efficiency in MoE models (see the placement sketch after this list).
  • Resource-Constrained Training: Fine-tuning ultra-large models (e.g., 671B parameters) using just 70GB GPU memory combined with 1.3TB RAM.
  • Production Deployment: Integrating with serving frameworks like SGLang for scalable, multi-concurrency inference.
  • Model Compatibility: Supporting a wide range of models including DeepSeek-R1/V3/V2, Kimi-K2 variants, Qwen3-Next, GLM4-MoE, Mixtral 8x7B/8x22B, and experimental LLaMA 4 support.
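
To make the hybrid-inference idea concrete, here is a minimal placement sketch: experts that are routed to most often ('hot') stay on the GPU, while the rest ('cold') remain in CPU memory. The frequency heuristic, function name, and example counts are illustrative assumptions, not KTransformers' actual placement policy.

```python
from collections import Counter

def place_experts(activation_counts: Counter, num_gpu_slots: int) -> dict:
    """Toy placement policy: keep the most frequently routed ('hot') experts
    on the GPU and spill the remaining ('cold') experts to CPU memory.
    The heuristic is an illustrative assumption, not KTransformers' policy."""
    ranked = [eid for eid, _ in activation_counts.most_common()]
    hot = set(ranked[:num_gpu_slots])
    return {eid: ("gpu" if eid in hot else "cpu") for eid in ranked}

# Example: 8 experts, 2 GPU slots; counts could come from a profiling run.
counts = Counter({0: 950, 1: 40, 2: 870, 3: 15, 4: 5, 5: 620, 6: 30, 7: 10})
placement = place_experts(counts, num_gpu_slots=2)
print(placement)  # experts 0 and 2 land on the GPU, the rest stay on CPU
```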

By addressing VRAM bottlenecks and enabling longer context lengths (up to 139K tokens in 24GB VRAM), KTransformers significantly reduces barriers to advanced AI experimentation.

Core Modules
kt-kernel: High-Performance Inference Kernels

This module provides CPU-optimized operations for heterogeneous LLM inference, emphasizing acceleration on diverse hardware.

  • Acceleration Techniques: Utilizes Intel AMX and AVX512/AVX2 for INT4/INT8 quantized inference, alongside MoE-specific optimizations with NUMA-aware memory management.
  • Quantization and Backend Support: Handles CPU-side INT4/INT8 weights, GPU-side GPTQ, unsloth 1.58/2.51-bit weights, IQ1_S/FP8 hybrids, and on-GPU dequantization (q2k/q3k/q5k). Also supports FP8 GPU kernels and llamafile as a linear backend (a numerical sketch of the INT8 path follows this list).
  • Hardware Compatibility: Includes Intel Arc GPUs, AMD GPUs via ROCm, Ascend NPUs, multi-GPU setups, and Windows native support.
  • Integration: Offers a clean Python API for seamless incorporation into frameworks like SGLang.
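
As a rough numerical illustration of the CPU-side INT8 path, the sketch below quantizes a weight matrix with a symmetric per-output-channel scheme and runs a dequantize-on-the-fly matmul in NumPy. It only shows the arithmetic involved; the actual packing, scaling, and INT32 accumulation in kt-kernel's AMX/AVX512 kernels may differ.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization (illustrative scheme;
    the real kt-kernel weight format and scales may differ)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize-on-the-fly matmul: y = x @ (q * scale)^T.
    Optimized kernels would instead accumulate in INT32 using AMX/AVX512 tiles."""
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 256)).astype(np.float32)   # [out, in] weight
x = rng.standard_normal((4, 256)).astype(np.float32)     # small input batch
q, s = quantize_int8(w)
err = np.abs(int8_linear(x, q, s) - x @ w.T).max()
print(f"max abs error vs FP32: {err:.4f}")                # small quantization error
```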

Performance Highlights:

Model                   | Hardware                     | Total Throughput | Output Throughput
DeepSeek-R1-0528 (FP8)  | 8×L20 GPU + Xeon Gold 6454S  | 227.85 tokens/s  | 87.58 tokens/s (8-way)

Quick installation: run cd kt-kernel && pip install . (the trailing dot is part of the command). For multi-GPU and injection tutorials, refer to the documentation.

kt-sft: Fine-Tuning Framework

Integrated with LLaMA-Factory, this module enables efficient fine-tuning of massive MoE models.

  • Efficiency Features: Supports full LoRA fine-tuning with heterogeneous acceleration, reducing GPU memory needs dramatically.
  • Workflow Support: Includes chat, batch inference, and metrics evaluation for production readiness.
  • Advanced Capabilities: Handles 3-layer prefix cache reuse (GPU-CPU-Disk) and multi-concurrency via balance-serve (see the tiered-cache sketch below).
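
A minimal sketch of the 3-tier (GPU-CPU-Disk) prefix-cache idea is shown below. The tier sizes, promotion policy, and data layout are illustrative assumptions and not the balance-serve implementation; real tiers would hold KV tensors rather than pickled blobs.

```python
import os
import pickle
import tempfile

class TieredPrefixCache:
    """Toy 3-tier prefix cache (GPU -> CPU -> disk). Illustrative only."""

    def __init__(self, disk_dir: str, gpu_slots: int = 2):
        self.gpu, self.cpu = {}, {}          # prefix key -> cached KV blob
        self.gpu_slots = gpu_slots
        self.disk_dir = disk_dir

    def _disk_path(self, key: str) -> str:
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def put(self, key: str, kv_blob) -> None:
        if len(self.gpu) < self.gpu_slots:
            self.gpu[key] = kv_blob          # hottest tier first
        else:
            self.cpu[key] = kv_blob
        with open(self._disk_path(key), "wb") as f:
            pickle.dump(kv_blob, f)          # always persist to the cold tier

    def get(self, key: str):
        if key in self.gpu:
            return self.gpu[key]
        if key in self.cpu:                  # CPU hit: promote toward the GPU tier
            self.gpu[key] = self.cpu.pop(key)
            return self.gpu[key]
        path = self._disk_path(key)
        if os.path.exists(path):             # disk hit: reload into the CPU tier
            with open(path, "rb") as f:
                self.cpu[key] = pickle.load(f)
            return self.cpu[key]
        return None                          # miss: the prefix must be recomputed

cache = TieredPrefixCache(tempfile.mkdtemp(), gpu_slots=1)
cache.put("system-prompt", {"kv": [1, 2, 3]})
print(cache.get("system-prompt") is not None)  # True: served from a cache tier
```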

Performance Highlights:

Model                   | Configuration | Throughput     | GPU Memory
DeepSeek-V3 (671B)      | LoRA + AMX    | ~40 tokens/s   | 70GB (multi-GPU)
DeepSeek-V2-Lite (14B)  | LoRA + AMX    | ~530 tokens/s  | 6GB

Quick start: follow kt-sft/README.md for environment setup, then run USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml.
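
For scripted runs, the same quick-start command can be wrapped in Python as sketched below. This only reproduces the documented CLI invocation; it assumes the kt-sft environment from kt-sft/README.md is already set up and that the script is launched from the kt-sft directory so the example config path resolves.

```python
import os
import subprocess

# Launch the documented quick-start command with KTransformers acceleration
# enabled via the USE_KT=1 environment variable (from the quick start above).
env = dict(os.environ, USE_KT="1")
subprocess.run(
    ["llamafactory-cli", "train",
     "examples/train_lora/deepseek3_lora_sft_kt.yaml"],
    env=env,
    check=True,  # raise if training exits with a non-zero status
)
```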

Development History and Updates

Launched around August 2024, KTransformers has seen rapid iterations:

  • August 2024: Initial Windows support, multi-GPU, Mixtral models, and VRAM reductions for DeepSeek-V2 (21GB to 11GB).
  • February 2025: DeepSeek-R1/V3 support with 3-28x speedups, FP8 kernels, extended contexts (4K to 139K tokens).
  • March-April 2025: ROCm/AMD GPU, multi-concurrency, LLaMA 4, AMX quantizations, Intel Arc.
  • May-July 2025: Prefix caching, Kimi-K2, SmallThinker/GLM4-MoE.
  • September-November 2025: Qwen3-Next, SGLang integration, Ascend NPU, Kimi-K2-Thinking, LLaMA-Factory enhancements.

The project boasts over 16,000 GitHub stars and is backed by a 2025 ACM SIGOPS paper on CPU/GPU hybrid inference for MoE models. A 2025 Q4 roadmap outlines future expansions.

Community and Contributions

Developed by MADSys Lab at Tsinghua University and Approaching.AI, with 90+ contributors. Engage via GitHub issues, pull requests, or WeChat groups. The original framework is archived for legacy reference, with comprehensive docs and online books available.

Information

  • Website: github.com
  • Authors: MADSys Lab, Tsinghua University, Approaching.AI, Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, Mingxing Zhang
  • Published date: 2024/08/09
