DeepGEMM addresses a narrow but critical layer of LLM stacks: the low-level matrix-multiply primitives and fused expert kernels where most compute and communication costs concentrate. It provides a small set of well-engineered tensor-core GEMM kernels (FP8/FP4/BF16), a fused Mega MoE kernel, and MQA scoring kernels, all compiled at runtime, so practitioners can prototype and deploy highly optimized GEMMs without a heavyweight template/algebra dependency.
What sets it apart
- Unified, focused kernel set: instead of a sprawling template library, DeepGEMM exposes a limited number of core kernels that cover dense GEMMs, grouped/masked GEMMs for MoE, and indexer MQA scoring, which makes it easier to understand and extend for research or production tuning.
- Runtime JIT workflow with low CPU overhead: kernels are compiled at runtime by a lightweight JIT module (optionally using NVRTC), which avoids heavy install-time builds and keeps iteration fast for shape-specific tuning.
- Cross-architecture optimizations: supports NVIDIA SM90 and SM100 with architecture-specific scaling-factor layouts (e.g., FP32 scaling factors on SM90, packed UE8M0 scaling factors on SM100). The project reports performance on this hardware that is comparable to or better than expert-tuned libraries.
- Fused Mega MoE with overlapped communication: provides a mega-kernel that fuses the dispatch, FP8xFP4 linear layers, SwiGLU, and combine steps while overlapping NVLink communication, which reduces end-to-end MoE latency in multi-process setups.
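As a rough illustration of what the fused Mega MoE kernel replaces, the sketch below spells out the unfused steps (dispatch, first linear layer, SwiGLU, second linear layer, combine) in plain Python. It is not DeepGEMM's API: `moe_reference`, the weight layouts, and the gate-then-up column ordering are assumptions made for this example, and it ignores low-precision formats and inter-GPU communication entirely.

```python
import math


def matmul(a, b):
    """Plain matrix multiply on row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate_up):
    """Assumed layout: first half of each row is the gate, second half is 'up'."""
    half = len(gate_up[0]) // 2
    return [[silu(r[i]) * r[half + i] for i in range(half)] for r in gate_up]


def moe_reference(tokens, topk_ids, topk_weights, w1, w2):
    """Unfused MoE forward pass (hypothetical reference, not a DeepGEMM call).

    tokens:       [num_tokens][hidden]
    topk_ids:     [num_tokens][k]   expert index chosen for each slot
    topk_weights: [num_tokens][k]   routing weight for each slot
    w1[e]:        [hidden][2 * inter]
    w2[e]:        [inter][hidden]
    """
    hidden = len(tokens[0])
    out = [[0.0] * hidden for _ in tokens]
    for e in range(len(w1)):
        # Dispatch: collect (token, slot) pairs routed to expert e.
        routed = [(t, k)
                  for t, ids in enumerate(topk_ids)
                  for k, eid in enumerate(ids) if eid == e]
        if not routed:
            continue
        x = [tokens[t] for t, _ in routed]
        h = swiglu(matmul(x, w1[e]))   # linear 1, then SwiGLU -> [n][inter]
        y = matmul(h, w2[e])           # linear 2 -> [n][hidden]
        # Combine: weighted scatter-add back to each token's output row.
        for (t, k), row in zip(routed, y):
            weight = topk_weights[t][k]
            for j in range(hidden):
                out[t][j] += weight * row[j]
    return out


# Tiny example: two tokens, two experts, top-1 routing.
identity = [[1.0, 0.0], [0.0, 1.0]]           # hidden=2 -> 2*inter=2
down = [[1.0, 0.0]]                           # inter=1 -> hidden=2
result = moe_reference(
    tokens=[[1.0, 2.0], [0.5, -1.0]],
    topk_ids=[[0], [1]],
    topk_weights=[[1.0], [1.0]],
    w1=[identity, identity],
    w2=[down, down],
)
print(result)
```

Each step here is a separate pass over the data. The fused kernel described above performs the same sequence in one launch, so the dispatch and combine traffic can overlap with the compute.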
Who it's for & tradeoffs
A great fit for researchers, infrastructure engineers, and model optimization teams who need to build or tune low-level GPU primitives for LLMs and MoE models. It's also useful for teams building custom indexer scoring (MQA) or experimenting with FP8/FP4 arithmetic and alignment strategies.
Look elsewhere if you need a high-level, drop-in transformer library, support for non-NVIDIA GPUs, or a turnkey model-serving product. DeepGEMM assumes NVIDIA GPUs (SM90/SM100), PyTorch 2.1+, a recent CUDA toolkit (12.3+, with 12.9+ recommended), and some familiarity with scaling-factor layouts and TMA alignment constraints.
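To make the scaling-factor and alignment prerequisites more concrete, here is a minimal sketch in plain Python. UE8M0 is an 8-bit, exponent-only encoding (bias 127), so every scale is a power of two. The helper names are hypothetical. Rounding scales up, packing four codes into one little-endian 32-bit word, and padding a dimension to a 16-byte multiple are assumptions chosen for illustration, not a description of DeepGEMM's actual layout code.

```python
import math

UE8M0_BIAS = 127      # exponent bias of the 8-bit, exponent-only format
UE8M0_MAX_CODE = 254  # 255 is reserved for NaN in the OCP E8M0 encoding


def ue8m0_encode(scale: float) -> int:
    """Round a positive scale UP to a power of two; return its biased exponent."""
    if not (scale > 0.0) or math.isinf(scale):
        raise ValueError("scale must be positive and finite")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    if mantissa == 0.5:
        exponent -= 1                       # already an exact power of two
    return min(max(exponent + UE8M0_BIAS, 0), UE8M0_MAX_CODE)


def ue8m0_decode(code: int) -> float:
    """Recover the power-of-two scale from its biased exponent byte."""
    return math.ldexp(1.0, code - UE8M0_BIAS)


def pack_ue8m0(codes) -> int:
    """Pack four exponent bytes into one 32-bit word (byte order is an assumption)."""
    if len(codes) != 4:
        raise ValueError("expected exactly four codes")
    return int.from_bytes(bytes(codes), "little")


def aligned_elems(n: int, elem_bytes: int, align_bytes: int = 16) -> int:
    """Pad an element count so the row size in bytes is a multiple of align_bytes."""
    if align_bytes % elem_bytes != 0:
        raise ValueError("element size must divide the alignment")
    per_unit = align_bytes // elem_bytes
    return -(-n // per_unit) * per_unit     # ceiling division, then scale back


fp32_scales = [1.0, 0.3, 3.0, 0.5]          # what an FP32 layout would store directly
codes = [ue8m0_encode(s) for s in fp32_scales]
print(codes, [ue8m0_decode(c) for c in codes], hex(pack_ue8m0(codes)))
print(aligned_elems(5, elem_bytes=4))       # FP32 scales padded to a 16-byte multiple
```

Rounding up means the decoded scale is never smaller than the original, which keeps scaled values in range at the cost of some precision. Rounding to nearest is an equally valid choice in other settings.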
Where it sits compared to adjacent projects
DeepGEMM draws inspiration from CUTLASS/CuTe but intentionally avoids heavy template/algebra dependence to remain lightweight and readable. Compared with general-purpose libraries (cuBLAS, CUTLASS), it trades a broader API surface for a tighter, LLM-focused kernel set and fused MoE primitives. That makes it more approachable for engineers who want to learn or customize tensor-core kernel optimizations.
Overall, DeepGEMM is a practical choice when you care about squeezing the last percent of GEMM performance for LLM workloads and need a small, maintainable CUDA codebase that supports modern low-precision formats and fused MoE execution patterns.
