AIAny - Bonsai Image · Ternary 4B (gemlite 2-bit)

Ternary Bonsai Image 4B shows that careful low-bit design can push modern diffusion-transformer quality into the memory range of commodity GPUs without requiring cloud-scale hardware. By applying a ternary 1 representation with per-group FP16 scales and pairing it with low-bit GEMM kernels (Gemlite) and an HQQ-compressed text encoder, Prism ML compresses the transformer trunk to ~1.21 GB while keeping practical generation throughput on NVIDIA cards.

Key Capabilities

Compact ternary transformer with FP16 group-wise scaling — the transformer trunk is reduced from a ~7.75 GB FP16 baseline to ~1.21 GB, which materially lowers memory pressure for local inference and serving.
Practical CUDA deployment stack (Gemlite INT2 + HQQ + FP16 VAE) — the end-to-end CUDA payload is ~4.55 GB with the text encoder offloaded after prompt encoding, enabling 1024×1024 generation on many consumer and datacenter NVIDIA GPUs.
Low-step, quality-oriented sampler — designed around a 4-step FlowMatch-Euler sampler (guidance=1.0, shift=3.0), so throughput is high (examples: ~2.8s on A100, ~4.5s on RTX 3080 at 1024²) while preserving prompt fidelity.
Cross-platform deployment paths — native Linux and Windows CUDA execution with companion MLX 2-bit variants for Apple Silicon, simplifying multi-platform experimentation and private/edge inference.

Who it's for & trade-offs

Great fit if you need to run modern diffusion-transformer image generation locally or on commodity NVIDIA GPUs and want a clear quality/footprint trade-off: it significantly reduces transformer memory while retaining diffusion-quality comparable to larger FP16 models in many benchmarks. It’s also suitable for private-serving scenarios where reduced model size lowers HBM requirements and cost.

Look elsewhere if exact bit-for-bit parity with the FP16 FLUX.2 Klein 4B is required (ternary is not bit-identical), or if your workloads demand the absolute highest benchmark scores for fine-grained text, very small printed text in images, or extreme compositional fidelity. The deployment relies on non-standard low-bit kernels (Gemlite, HQQ) and current runtime packing, which may impose platform-specific operational constraints and occasional artifacts if sampler settings are changed away from the recommended 4-step configuration.

Where it fits

Compared with a full FP16 FLUX.2 Klein 4B, Bonsai’s ternary variant trades raw parameter precision for a 6.4× transformer footprint reduction and a much smaller working set for inference loops. Against footprint-focused binary variants, the ternary design reintroduces a zero state to improve visual fidelity while remaining in the low-bit efficiency regime. Practically, it’s a middle ground for teams wanting near-FP16 quality with the cost and latency characteristics of compact models.

Bonsai Image · Ternary 4B (gemlite 2-bit)

Introduction

Key Capabilities

Who it's for & trade-offs

Where it fits

Information

Categories

Tags

More Items

ideogram-4-nf4

ideogram-ai/ideogram-4-fp8

unsloth/gemma-4-12b-it-GGUF