Bonsai Image · Ternary 4B (gemlite 2-bit)

A ternary-weight (~1.58-bit) 4B text-to-image diffusion transformer optimized for NVIDIA GPUs using Gemlite INT2 and HQQ; it reduces the transformer to ~1.21 GB (4.55 GB CUDA payload) and targets 1024×1024 generation with a 4-step FlowMatch-Euler sampler.

Introduction

Ternary Bonsai Image 4B shows that careful low-bit design can bring modern diffusion-transformer quality into the memory range of commodity GPUs, without cloud-scale hardware. By applying a ternary {-1, 0, +1} weight representation with per-group FP16 scales and pairing it with low-bit GEMM kernels (Gemlite) and an HQQ-compressed text encoder, Prism ML compresses the transformer trunk to ~1.21 GB while keeping practical generation throughput on NVIDIA cards.
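To make the idea of ternary weights with per-group FP16 scales concrete, here is a minimal sketch in plain Python. It assumes an absmean rule (scale = mean absolute value of the group, code = rounded and clipped ratio), which is a common choice in ~1.58-bit schemes but is not confirmed by the source as the method Prism ML uses. The function names and the group size of 4 are hypothetical and chosen only for illustration.

```python
import struct
from typing import List, Tuple


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def ternarize(weights: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Quantize weights to codes in {-1, 0, +1} with one FP16 scale per group.

    Assumption: absmean scaling. The real model may pick scales differently.
    """
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = to_fp16(sum(abs(w) for w in group) / group_size)
        scales.append(scale)
        for w in group:
            q = round(w / scale) if scale > 0.0 else 0
            codes.append(max(-1, min(1, q)))  # clip to the ternary range
    return codes, scales


def dequantize(codes: List[int], scales: List[float], group_size: int) -> List[float]:
    """Reconstruct approximate weights: code times the scale of its group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.9, -0.05, -1.1, 0.4, 0.0, 0.0, 2.0, -2.0]
codes, scales = ternarize(weights, group_size=4)
approx = dequantize(codes, scales, group_size=4)
```

Small weights such as -0.05 collapse to the zero code, while large ones saturate at plus or minus one scale. That zero state is what separates a ternary scheme from a purely binary one.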

Key Capabilities
  • Compact ternary transformer with FP16 group-wise scaling: the transformer trunk shrinks from a ~7.75 GB FP16 baseline to ~1.21 GB, which materially lowers memory pressure for local inference and serving.
  • Practical CUDA deployment stack (Gemlite INT2 + HQQ + FP16 VAE): the end-to-end CUDA payload is ~4.55 GB with the text encoder offloaded after prompt encoding, enabling 1024×1024 generation on many consumer and datacenter NVIDIA GPUs.
  • Low-step, quality-oriented sampler: the model is designed around a 4-step FlowMatch-Euler sampler (guidance=1.0, shift=3.0), so per-image latency is low (for example, ~2.8 s on A100 and ~4.5 s on RTX 3080 at 1024×1024) while prompt fidelity is preserved.
  • Cross-platform deployment paths: native Linux and Windows CUDA execution, with companion MLX 2-bit variants for Apple Silicon, simplifies multi-platform experimentation and private/edge inference.
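The 4-step FlowMatch-Euler sampler named in the list above can be sketched as follows. This is a toy illustration, not the model's pipeline: it assumes the common static time-shift form sigma' = shift * sigma / (1 + (shift - 1) * sigma) applied to evenly spaced sigmas, and it reads guidance=1.0 as one model call per step with no classifier-free-guidance mixing. Both are assumptions. `toy_velocity` is a hypothetical stand-in for the diffusion transformer.

```python
from typing import Callable, List

Vector = List[float]


def shifted_sigmas(num_steps: int = 4, shift: float = 3.0) -> List[float]:
    """Noise levels from 1.0 down to 0.0, warped by a static time shift."""
    base = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in base]


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    sigmas: List[float],
) -> Vector:
    """Integrate dx/dsigma = v(x, sigma) with one Euler step per interval."""
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)  # one model evaluation per step
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the model: the exact velocity of a straight
# path between a fixed target and the starting noise.
target = [0.25, -0.5, 1.0]


def toy_velocity(x: Vector, sigma: float) -> Vector:
    return [(xi - ti) / sigma for xi, ti in zip(x, target)]


sigmas = shifted_sigmas(num_steps=4, shift=3.0)
sample = euler_sample(toy_velocity, [1.0, 1.0, -1.0], sigmas)
```

With shift=3.0 the schedule spends its first steps at high noise levels (1.0, 0.9, 0.75) and takes one large final step from 0.5 to 0.0. Because the toy velocity field follows a straight path, the four Euler steps land on the target up to floating-point error. A real model only approximates such a field, which is one reason changing the step count can alter output quality.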

Who it's for & trade-offs

Great fit if you need to run modern diffusion-transformer image generation locally or on commodity NVIDIA GPUs and want a clear quality/footprint trade-off: it cuts transformer memory from ~7.75 GB to ~1.21 GB while retaining image quality comparable to larger FP16 models in many benchmarks. It’s also suitable for private-serving scenarios where the smaller model lowers GPU memory (HBM) requirements and cost.

Look elsewhere if exact bit-for-bit parity with the FP16 FLUX.2 Klein 4B is required (ternary is not bit-identical), or if your workloads demand the absolute highest benchmark scores for fine-grained text, very small printed text in images, or extreme compositional fidelity. The deployment relies on non-standard low-bit kernels (Gemlite, HQQ) and current runtime packing, which may impose platform-specific operational constraints and occasional artifacts if sampler settings are changed away from the recommended 4-step configuration.
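For readers unfamiliar with what "runtime packing" involves, the sketch below stores ternary codes in 2-bit fields, four per byte. The bit layout (lowest bits first, codes offset by one) is hypothetical. The source does not describe the layout Gemlite actually uses, so treat this only as an illustration of why a ~1.58-bit weight occupies 2 bits in practice.

```python
from typing import List


def pack_ternary(codes: List[int]) -> bytes:
    """Pack codes in {-1, 0, +1} into bytes, four 2-bit fields per byte.

    Hypothetical layout: code c is stored as c + 1 (0, 1 or 2), with the
    first code of each group of four in the lowest two bits.
    """
    if len(codes) % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[start:start + 4]):
            if c not in (-1, 0, 1):
                raise ValueError(f"not a ternary code: {c}")
            byte |= (c + 1) << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed: bytes) -> List[int]:
    """Inverse of pack_ternary."""
    codes: List[int] = []
    for byte in packed:
        for j in range(4):
            codes.append(((byte >> (2 * j)) & 0b11) - 1)
    return codes


codes = [1, 0, -1, 1, 0, 0, 1, -1]
packed = pack_ternary(codes)
restored = unpack_ternary(packed)
```

Eight codes fit in two bytes, and the fourth possible field value (3) is never used. That unused state is the gap between the ideal ~1.58 bits per ternary weight and the 2 bits an INT2 kernel reads.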

Where it fits

Compared with a full FP16 FLUX.2 Klein 4B, Bonsai’s ternary variant trades raw parameter precision for a 6.4× transformer footprint reduction and a much smaller working set for inference loops. Against footprint-focused binary variants, the ternary design reintroduces a zero state to improve visual fidelity while remaining in the low-bit efficiency regime. Practically, it’s a middle ground for teams wanting near-FP16 quality with the cost and latency characteristics of compact models.
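The 6.4× figure follows directly from the two sizes quoted in this listing, and the same numbers give a rough effective bits-per-weight estimate. The short calculation below assumes that every weight in the FP16 baseline takes 16 bits and that both sizes cover the same tensors. The group size of 32 in the last line is a hypothetical value chosen because it matches the estimate. The source does not state the group size.

```python
import math

FP16_TRUNK_GB = 7.75     # FP16 baseline size quoted in the listing
TERNARY_TRUNK_GB = 1.21  # ternary trunk size quoted in the listing

# Footprint reduction factor.
ratio = FP16_TRUNK_GB / TERNARY_TRUNK_GB

# Effective storage per weight, assuming the baseline uses 16 bits each.
effective_bits = 16 * TERNARY_TRUNK_GB / FP16_TRUNK_GB

# Information-theoretic cost of one ternary symbol.
ideal_ternary_bits = math.log2(3)


def bits_per_weight(code_bits: int, scale_bits: int, group_size: int) -> float:
    """Storage per weight for fixed-width codes plus one scale per group."""
    return code_bits + scale_bits / group_size


# Hypothetical configuration: 2-bit codes, FP16 scale, groups of 32.
estimate = bits_per_weight(code_bits=2, scale_bits=16, group_size=32)

print(f"reduction: {ratio:.1f}x")
print(f"effective bits/weight: {effective_bits:.2f}")
print(f"ideal ternary bits: {ideal_ternary_bits:.2f}")
print(f"2-bit codes + FP16 scale per 32 weights: {estimate:.2f}")
```

The effective figure of about 2.5 bits per weight sits above the ideal ~1.58 bits because the codes are stored in 2-bit fields and each group carries its own scale. Any layers kept at higher precision would also raise it.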

Information

  • Website: huggingface.co
  • Authors: Prism ML (prism-ml)
  • Published date: 2026/05/21