DeepEP

Provides high-throughput, low-latency GPU communication kernels for Mixture-of-Experts (MoE) and expert-parallel workloads, with NVLink↔RDMA-aware forwarding, FP8/BF16 support, and low-latency RDMA hooks for inference decoding.

Introduction

Most large-scale MoE deployments are bottlenecked by all-to-all communication latency and by the bandwidth gap between domains (NVLink within a node, RDMA across nodes). DeepEP attacks this bottleneck with domain-aware kernels and receive hooks that let RDMA traffic proceed without occupying GPU SMs. The cost is stricter cluster requirements; the payoff is lower latency and higher throughput in both training and inference.

What Sets It Apart
  • NVLink↔RDMA-aware dispatch/combine kernels: kernels are explicitly optimized for asymmetric-domain forwarding (intranode NVLink and internode RDMA), so cross-node MoE traffic is forwarded between the two fabrics instead of being handled as one flat all-to-all.
  • Low-precision and SM controls: FP8 dispatch and BF16 combining shrink the payload moved per token (and with it memory/PCIe pressure), while SM-number control lets you cap how many SMs the communication kernels use. Both help when overlapping GPU compute and communication.
  • Low-latency RDMA mode with hooks: a pure-RDMA low-latency path exposes a receive-hook that lets background NIC transfers complete without occupying SMs, enabling sub-200µs dispatch latencies in small-batch decoding setups (measured on H800+CX7 testbeds).
  • Practical engineering tradeoffs: the repo exposes experimental branches (zero-copy, SM-free, TMA/ROCm efforts) and documents undefined-behavior PTX usage and NVSHMEM coupling, so teams can evaluate stability vs performance.
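
To make the dispatch/combine terminology above concrete, here is a minimal conceptual sketch in plain Python. It is not the DeepEP API: the function names, the scalar "tokens", and the stand-in expert are all hypothetical, chosen only to show what the two steps compute. Dispatch groups each token copy by the rank that owns its selected expert; combine sums the weighted expert outputs back into one value per token.

```python
from collections import defaultdict

def dispatch(tokens, topk_experts, experts_per_rank):
    """Group (token_id, expert_id, value) entries by the rank owning each expert."""
    per_rank = defaultdict(list)
    for token_id, (value, experts) in enumerate(zip(tokens, topk_experts)):
        for expert_id in experts:
            rank = expert_id // experts_per_rank  # assumed contiguous expert placement
            per_rank[rank].append((token_id, expert_id, value))
    return dict(per_rank)

def combine(expert_outputs, topk_weights, topk_experts, num_tokens):
    """Sum weighted expert outputs back into one value per token."""
    combined = [0.0] * num_tokens
    for token_id, expert_id, out in expert_outputs:
        slot = topk_experts[token_id].index(expert_id)
        combined[token_id] += topk_weights[token_id][slot] * out
    return combined

# Toy setup (hypothetical): 3 scalar tokens, 4 experts on 2 ranks, top-2 routing.
tokens = [1.0, 2.0, 3.0]
topk_experts = [[0, 3], [1, 2], [2, 3]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

buckets = dispatch(tokens, topk_experts, experts_per_rank=2)

# Stand-in expert: expert e multiplies its input by (e + 1).
outputs = [(t, e, v * (e + 1)) for entries in buckets.values() for t, e, v in entries]
result = combine(outputs, topk_weights, topk_experts, num_tokens=len(tokens))
print(result)
```

In a real deployment each bucket would travel over NVLink or RDMA to another GPU; the sketch only shows the bookkeeping that the communication kernels have to preserve.
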
Who It's For

Great fit if you run large MoE models across multi-node clusters and are constrained by all-to-all or NVLink/RDMA cross-domain bottlenecks, for example production inference engines that want lower decode tail latency, or training and inference-prefill workloads that need high all-to-all throughput. Look elsewhere if you need a plug-and-play, PCIe-only solution or cannot install NVSHMEM or configure RDMA/NVLink on your cluster; DeepEP assumes specific hardware (SM80/SM90 GPUs, NVLink, RDMA) and involves low-level kernel behavior that may require cluster tuning.

Where It Fits

DeepEP sits below model frameworks (like PyTorch) as a specialized communication layer for expert parallelism. It complements MoE routing and model libraries by reducing end-to-end communication cost; consider it when communication becomes the dominant factor in scaling MoE beyond single-node setups.
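
One rough way to judge whether communication is becoming dominant is a back-of-envelope estimate of the dispatch payload. The sketch below is an assumption-laden upper bound, not a DeepEP measurement: the helper name is hypothetical, the batch shape (4096 tokens, hidden size 7168, top-8 routing) is an illustrative choice, every top-k copy is assumed to travel separately, and FP8 scaling factors are ignored.

```python
def dispatch_bytes(num_tokens, hidden, top_k, bytes_per_elem):
    """Upper-bound payload if every top-k copy of every token is sent separately."""
    return num_tokens * top_k * hidden * bytes_per_elem

MIB = 1024 * 1024

# Illustrative shape only; not taken from this page.
bf16_payload = dispatch_bytes(num_tokens=4096, hidden=7168, top_k=8, bytes_per_elem=2)
fp8_payload = dispatch_bytes(num_tokens=4096, hidden=7168, top_k=8, bytes_per_elem=1)

print(f"BF16 dispatch: {bf16_payload / MIB:.0f} MiB")
print(f"FP8 dispatch:  {fp8_payload / MIB:.0f} MiB")
```

Under these assumptions FP8 dispatch halves the bytes on the wire relative to BF16, which is why low-precision dispatch matters once all-to-all traffic competes with compute.
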

Implementation notes

The project depends on NVSHMEM and targets Ampere/Hopper architectures with recommended CUDA versions (CUDA 11+ for SM80, CUDA 12.3+ for SM90). Benchmarks in the repository report intranode throughput over NVLink and internode throughput over CX7 InfiniBand RDMA, measured on H800 GPUs. Community PRs (e.g., performance improvements) and experimental branches provide additional optimizations but may affect stability and compatibility.
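
The version constraints above can be encoded as a small pre-flight check. This is a sketch under stated assumptions: the helper and table names are hypothetical (not part of DeepEP), the minimums are copied from the paragraph above, and the check covers only the architecture/CUDA pairing, not NVSHMEM, NVLink, or RDMA availability.

```python
# Minimum CUDA (major, minor) per compute capability, as stated above.
MIN_CUDA = {
    80: (11, 0),  # SM80 (Ampere): CUDA 11+
    90: (12, 3),  # SM90 (Hopper): CUDA 12.3+
}

def meets_stated_requirements(sm, cuda_version):
    """Return True if the SM architecture is listed and CUDA is new enough."""
    minimum = MIN_CUDA.get(sm)
    if minimum is None:
        return False  # architecture not listed as a target
    return tuple(cuda_version) >= minimum

print(meets_stated_requirements(90, (12, 3)))
print(meets_stated_requirements(90, (12, 1)))
```

Tuples compare element by element in Python, so `(12, 1) >= (12, 3)` is false while `(12, 4) >= (12, 3)` is true, which matches how major/minor versions are ordered.
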

Information

  • Website: github.com
  • Authors: Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, Liang Zhao
  • Published date: 2025/02/17

Categories