KTransformers: A Flexible Framework for LLM Inference and Fine-Tuning Optimizations
Overview
KTransformers is a research-oriented project dedicated to enhancing the efficiency of large language models (LLMs) through advanced CPU-GPU heterogeneous computing techniques. Initially developed as an integrated framework, it has since evolved into two modular components: kt-kernel for optimized inference kernels and kt-sft for fine-tuning capabilities. This separation improves modularity, maintainability, and ease of integration with other tools. The project targets key challenges in deploying and training massive models, particularly Mixture-of-Experts (MoE) architectures, by leveraging heterogeneous hardware to minimize resource demands and maximize performance.
Purpose and Key Use Cases
The core purpose of KTransformers is to democratize access to cutting-edge LLM optimizations, allowing researchers and developers to experiment with high-performance inference and fine-tuning without requiring excessive computational resources. It excels in scenarios such as:
- Hybrid Inference: Placing 'hot' experts on GPUs for speed and 'cold' experts on CPUs for cost-efficiency in MoE models (see the placement sketch after this list).
- Resource-Constrained Training: Fine-tuning ultra-large models (e.g., 671B parameters) using just 70GB GPU memory combined with 1.3TB RAM.
- Production Deployment: Integrating with serving frameworks like SGLang for scalable, multi-concurrency inference.
- Model Compatibility: Supporting a wide range of models including DeepSeek-R1/V3/V2, Kimi-K2 variants, Qwen3-Next, GLM4-MoE, Mixtral 8x7B/8x22B, and experimental LLaMA 4 support.
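To illustrate the hybrid-placement idea, the sketch below routes each token to one of several experts, keeping a few "hot" experts on the GPU and the remaining "cold" experts in CPU memory. This is a conceptual PyTorch toy: the sizes, top-1 routing, and placement policy are illustrative assumptions, not KTransformers' actual implementation.

```python
# Conceptual sketch of hot/cold expert placement for MoE inference.
# Sizes, routing, and placement policy are illustrative only.
import torch
import torch.nn as nn

class HybridMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=8, n_hot=2):
        super().__init__()
        self.gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Router and "hot" experts live on the GPU; "cold" experts stay in CPU RAM.
        self.router = nn.Linear(d_model, n_experts).to(self.gpu)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model).to(self.gpu if i < n_hot else "cpu")
             for i in range(n_experts)]
        )

    def forward(self, x):                          # x: [tokens, d_model] on the GPU
        top1 = self.router(x).argmax(dim=-1)       # one expert per token (top-1 routing)
        out = torch.zeros_like(x)
        for idx in top1.unique().tolist():
            expert = self.experts[idx]
            dev = next(expert.parameters()).device
            mask = top1 == idx
            # Ship only the routed tokens to the expert's device, then bring results back.
            out[mask] = expert(x[mask].to(dev)).to(x.device)
        return out

moe = HybridMoE()
tokens = torch.randn(16, 256, device=moe.gpu)
print(moe(tokens).shape)                           # torch.Size([16, 256])
```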
By addressing VRAM bottlenecks and enabling longer context lengths (up to 139K tokens in 24GB VRAM), KTransformers significantly reduces barriers to advanced AI experimentation.
Core Modules
kt-kernel: High-Performance Inference Kernels
This module provides CPU-optimized operations for heterogeneous LLM inference, emphasizing acceleration on diverse hardware.
- Acceleration Techniques: Utilizes Intel AMX and AVX512/AVX2 instructions for INT4/INT8 quantized inference, alongside MoE-specific optimizations with NUMA-aware memory management (a quantization sketch follows this list).
- Quantization and Backend Support: Handles CPU-side INT4/INT8 weights, GPU-side GPTQ, Unsloth 1.58/2.51-bit weights, IQ1_S/FP8 hybrids, and on-GPU dequantization (q2k/q3k/q5k). Also supports FP8 GPU kernels and llamafile as a linear backend.
- Hardware Compatibility: Includes Intel Arc GPUs, AMD GPUs via ROCm, Ascend NPUs, multi-GPU setups, and Windows native support.
- Integration: Offers a clean Python API for seamless incorporation into frameworks like SGLang.
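To make the CPU-side INT8 path concrete, the sketch below shows weight-only, symmetric per-row INT8 quantization with on-the-fly dequantization during the matrix multiply. It is an illustrative NumPy toy under assumed layouts; kt-kernel's real kernels use AMX/AVX intrinsics and different data formats.

```python
# Illustrative weight-only symmetric INT8 quantization; not kt-kernel's actual kernels.
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-output-row quantization: scale = max(|row|) / 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequant_matmul(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    # Weight-only path: dequantize rows on the fly, then do a float matmul.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)   # weight: [out_features, in_features]
x = rng.standard_normal((4, 128)).astype(np.float32)    # activations: [batch, in_features]
q, scale = quantize_int8(w)
err = np.abs(x @ w.T - dequant_matmul(x, q, scale)).max()
print(f"max abs error vs. fp32: {err:.4f}")              # small quantization error
```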
Performance Highlights:
| Model | Hardware | Total Throughput | Output Throughput |
|---|---|---|---|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |
Quick installation: run pip install . inside the kt-kernel directory. For multi-GPU setups and injection tutorials, refer to the documentation.
kt-sft: Fine-Tuning Framework
Integrated with LLaMA-Factory, this module enables efficient fine-tuning of massive MoE models.
- Efficiency Features: Supports LoRA fine-tuning with heterogeneous acceleration, dramatically reducing GPU memory requirements (a minimal LoRA sketch follows this list).
- Workflow Support: Includes chat, batch inference, and metrics evaluation for production readiness.
- Advanced Capabilities: Handles 3-layer prefix cache reuse (GPU-CPU-Disk) and multi-concurrency via balance-serve.
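For readers unfamiliar with what the LoRA path trains, the sketch below wraps a frozen linear layer with a low-rank adapter so that only the small A and B matrices receive gradients. The rank, scaling, and layer sizes are illustrative assumptions, unrelated to kt-sft's actual defaults.

```python
# Minimal LoRA adapter around a frozen linear layer; dimensions are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = base(x) + (alpha/r) * x A^T B^T."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")   # only the A and B matrices train
```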
Performance Highlights:
| Model | Configuration | Throughput | GPU Memory |
|---|---|---|---|
| DeepSeek-V3 (671B) | LoRA + AMX | ~40 tokens/s | 70GB (multi-GPU) |
| DeepSeek-V2-Lite (14B) | LoRA + AMX | ~530 tokens/s | 6GB |
Quick start: Follow kt-sft/README.md for environment setup, then run USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml.
Development History and Updates
Launched around August 2024, KTransformers has seen rapid iterations:
- August 2024: Initial Windows support, multi-GPU support, Mixtral models, and VRAM reductions for DeepSeek-V2 (from 21GB to 11GB).
- February 2025: DeepSeek-R1/V3 support with 3-28x speedups, FP8 kernels, extended contexts (4K to 139K tokens).
- March-April 2025: ROCm/AMD GPU, multi-concurrency, LLaMA 4, AMX quantizations, Intel Arc.
- May-July 2025: Prefix caching, Kimi-K2, SmallThinker/GLM4-MoE.
- September-November 2025: Qwen3-Next, SGLang integration, Ascend NPU, Kimi-K2-Thinking, LLaMA-Factory enhancements.
The project has over 16,000 GitHub stars, and its CPU/GPU hybrid inference approach for MoE models is described in a 2025 ACM SIGOPS paper. A 2025 Q4 roadmap outlines planned expansions.
Community and Contributions
Developed by the MADSys Lab at Tsinghua University and Approaching.AI, with 90+ contributors. Engage via GitHub issues, pull requests, or WeChat groups. The original integrated framework is archived for legacy reference, and comprehensive documentation and online books are available.
