Why this matters
Generating temporally consistent, minute-scale video from a single image is still rare outside very large proprietary systems. SANA‑WM shows that a carefully designed 2.6B image-to-video diffusion transformer plus a staged refiner can reach one-minute, 720p outputs with explicit per-frame 6‑DoF camera control — making controllable long-horizon synthesis more accessible to research and open-source workflows.
Key Capabilities
- Hybrid long-context modelling: a frame-wise Gated DeltaNet combined with periodic softmax attention reduces memory while preserving long-range temporal structure — so you can model hundreds of frames without blowing up memory.
- Precise camera control: a dual-branch architecture (main + camera branches) lets the model follow per-frame trajectories (6‑DoF) provided as camera matrices or a compact WASD/IJKL action DSL — so generated motion aligns closely to intended camera paths.
- Two-stage pipeline for quality: Stage‑1 generates the latents for the long video; an optional LTX‑2 sink-bidirectional Euler refiner then decodes them and improves fidelity and temporal consistency, at the cost of a large refiner checkpoint (≈41 GB).
- Practical engineering: the repo pairs a lightweight Stage‑1 model (2.6B params) with a heavier Stage‑2 refiner and fetches the Gemma text encoder on demand. Offline runs are possible when you override the default checkpoints, but the storage requirements are significant.
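To make the camera-control bullet above concrete, here is a minimal sketch of how a WASD/IJKL action string could be expanded into per-frame camera matrices. The key bindings, step sizes, axis conventions, and the names `actions_to_poses` and `rotation` are all assumptions made for illustration; the actual SANA‑WM DSL and its matrix conventions may differ. This toy version also covers only four of the six degrees of freedom (two translations, yaw, and pitch).

```python
import numpy as np

# ASSUMED key bindings, for illustration only; the real SANA-WM DSL may differ.
TRANSLATE = {"W": (0.0, 0.0, 1.0), "S": (0.0, 0.0, -1.0),
             "A": (-1.0, 0.0, 0.0), "D": (1.0, 0.0, 0.0)}
ROTATE = {"J": ("yaw", 1.0), "L": ("yaw", -1.0),
          "I": ("pitch", 1.0), "K": ("pitch", -1.0)}


def rotation(axis, angle):
    """3x3 rotation about the local y axis (yaw) or the local x axis (pitch)."""
    c, s = np.cos(angle), np.sin(angle)
    if axis == "yaw":
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def actions_to_poses(actions, step=0.1, angle_deg=2.0):
    """Expand one action key per frame into 4x4 camera-to-world matrices.

    Returns an array of shape (len(actions) + 1, 4, 4); the first pose is
    the identity. `step` and `angle_deg` are hypothetical per-frame increments.
    """
    angle = np.deg2rad(angle_deg)
    pose = np.eye(4)
    poses = [pose.copy()]
    for key in actions.upper():
        delta = np.eye(4)
        if key in TRANSLATE:
            delta[:3, 3] = step * np.array(TRANSLATE[key])
        elif key in ROTATE:
            axis, sign = ROTATE[key]
            delta[:3, :3] = rotation(axis, sign * angle)
        else:
            raise ValueError(f"unknown action key: {key!r}")
        # Right-multiplying applies the motion in the camera's local frame.
        pose = pose @ delta
        poses.append(pose.copy())
    return np.stack(poses)
```

For example, `actions_to_poses("WWD")` returns four poses whose final translation is `(0.1, 0.0, 0.2)` under the assumed bindings. Composing each step in the camera's local frame means that "forward" follows the current viewing direction after a turn.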
Who It's For and Tradeoffs
Great fit if you want research‑grade, controllable image→video outputs at minute scale and can provision GPU memory and storage (the refiner and encoder files are large). It is useful for camera-path-driven demos, world-model research, and experiments in long-horizon temporal consistency.
Look elsewhere if you need tiny on-device models, variable output resolutions (this pipeline targets a fixed 704×1280 frame size), or storage-light deployments: the refiner and Gemma encoder together add tens of gigabytes, and the high-fidelity decode is computationally heavy.
Where It Fits
Compared with single-stage diffusion or latent-video models, SANA‑WM’s main trade-off is to split the work between a memory‑efficient long-context Stage‑1 and a heavyweight Stage‑2 refiner: this lowers Stage‑1 resource needs while still reaching high-quality minute-scale outputs after refinement. For teams that cannot host large checkpoints, running with --no_refiner (decode with the Sana VAE) gives a lighter but lower-fidelity alternative.
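Why a DeltaNet-style Stage‑1 can stay memory-efficient over long contexts can be illustrated with a toy NumPy version of the gated delta rule, S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ. This is a sketch of the general recurrence from the Gated DeltaNet literature, not SANA‑WM's implementation: it runs per token in a plain loop, and how the model applies the rule frame-wise or interleaves softmax attention layers is not shown. The function name `gated_delta_scan` and its argument layout are hypothetical.

```python
import numpy as np


def gated_delta_scan(q, k, v, alpha, beta):
    """Toy gated delta rule over a sequence (illustrative, not optimized).

    q, k: (T, d_k) queries and keys; v: (T, d_v) values;
    alpha, beta: (T,) decay and write-strength gates in [0, 1].
    Returns the (T, d_v) outputs and the final (d_v, d_k) state.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # fixed-size memory, independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-8)   # unit-norm key
        # Decay the state and erase what was stored under key kt.
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt))
        # Write the new value under key kt.
        S = S + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out, S
```

The state `S` has shape `(d_v, d_k)` regardless of sequence length, whereas a softmax attention layer has to keep keys and values for every past token. That is the intuition behind mixing many recurrent layers with only occasional softmax layers.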
Notes
- Paper: arXiv:2605.15178 (authors and implementation details in the repository).
- Released under Apache‑2.0; bundled LTX‑2 artifacts inherit their upstream license.
- Output resolution is fixed at 704×1280; input images are resized with their aspect ratio preserved, then center-cropped.
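The resize and center-crop step in the last note can be sketched as plain geometry. This assumes 704×1280 means height 704 and width 1280, which the notes do not state explicitly, and the helper name `resize_then_center_crop_box` is hypothetical; the repository's own preprocessing may round or interpolate differently.

```python
def resize_then_center_crop_box(src_w, src_h, dst_w=1280, dst_h=704):
    """Compute the resized size and the crop box for a cover-style fit.

    The image is scaled uniformly until it covers the target in both
    dimensions, then the centered dst_w x dst_h window is cropped.
    Returns ((new_w, new_h), (left, top, right, bottom)).
    """
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    # Use the larger ratio so neither side ends up smaller than the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


print(resize_then_center_crop_box(1920, 1080))
# ((1280, 720), (0, 8, 1280, 712))
```

Under these assumptions, a 1920×1080 input is resized to 1280×720 and loses 8 rows at the top and bottom, while a square input loses a much larger band. The returned values can be passed to any image library's resize and crop calls.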
