AIAny - SANA-WM (Bidirectional)

Why this matters

Generating temporally consistent, minute-scale video from a single image is still rare outside very large proprietary systems. SANA‑WM shows that a 2.6B-parameter image-to-video diffusion transformer paired with a staged refiner can reach one-minute, 720p outputs with explicit per-frame 6‑DoF camera control, which makes controllable long-horizon synthesis more accessible to research and open-source workflows.

Key Capabilities

Hybrid long-context modelling: a frame-wise Gated DeltaNet combined with periodic softmax attention reduces memory use while preserving long-range temporal structure, so the model can cover hundreds of frames without memory growing out of control.
Precise camera control: a dual-branch architecture (main + camera branches) lets the model follow per-frame 6‑DoF trajectories, supplied either as camera matrices or through a compact WASD/IJKL action DSL, so generated motion aligns closely with the intended camera path.
Two-stage pipeline for quality: Stage‑1 produces the latents for the long video; an optional LTX‑2 sink-bidirectional Euler refiner then decodes them and improves fidelity and temporal consistency. The quality gain is large, but it comes at the cost of a large refiner checkpoint (≈41 GB).
Practical engineering: the repo pairs a lightweight Stage‑1 model (2.6B params) with a heavier Stage‑2 refiner and fetches the Gemma text encoder on demand. Offline runs are possible when you override the checkpoints, but the storage requirements are significant.
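
To illustrate why a DeltaNet-style layer keeps memory flat as the sequence grows, here is a minimal pure-Python sketch of a single gated delta rule update as it is commonly written (S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T). It is not taken from the SANA‑WM code: the function names, the tiny 2×2 state, and the fixed alpha/beta values are illustrative assumptions, and how the model applies the rule frame-wise or interleaves it with softmax attention is not shown here.

```python
def gated_delta_step(state, key, value, alpha, beta):
    """One gated delta rule update on a d_v x d_k state matrix (list of rows).

    S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    The key is assumed to be unit-norm (an assumption of this sketch).
    """
    new_state = []
    for row, v_i in zip(state, value):
        # What the memory currently returns for this key, for output dim i.
        recalled = sum(s * k_j for s, k_j in zip(row, key))
        new_state.append([
            alpha * (s - beta * recalled * k_j) + beta * v_i * k_j
            for s, k_j in zip(row, key)
        ])
    return new_state


def read(state, query):
    """Read from the memory: o = S q."""
    return [sum(s * q_j for s, q_j in zip(row, query)) for row in state]


# Hypothetical toy run: write two different values under the same key.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state = gated_delta_step(state, key, [3.0, 4.0], alpha=1.0, beta=1.0)
first = read(state, key)
state = gated_delta_step(state, key, [5.0, 6.0], alpha=1.0, beta=1.0)
second = read(state, key)
print(first, second)
```

The state stays a fixed d_v × d_k matrix regardless of how many steps are processed, whereas a softmax-attention cache grows with sequence length. With beta = 1 the second write replaces the first value stored under the same key rather than adding to it, which is the "delta" behaviour.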

Who It's For and Tradeoffs

Great fit if you want research‑grade, controllable image→video outputs at minute scale and can provision GPU memory and storage (the refiner and encoder files are large). It is useful for camera-path-driven demos, world-model research, and experiments in long-horizon temporal consistency.

Look elsewhere if you need tiny on-device models, variable output resolutions (this pipeline targets a fixed 704×1280 frame size), or deployments with minimal storage: the refiner and Gemma encoder together add tens of gigabytes, and the high-fidelity decode is computationally heavy.

Where It Fits

Compared with single-stage diffusion or latent-video models, SANA‑WM’s main trade-off is splitting the work between a memory‑efficient long-context Stage‑1 and a heavyweight Stage‑2 refiner: this lowers Stage‑1 resource needs while still achieving high-quality minute-scale outputs after refinement. For teams that cannot host large checkpoints, running with --no_refiner (decoding with the Sana VAE instead) gives a lighter but lower-fidelity alternative.

Notes

Paper: arXiv:2605.15178 (authors and implementation details in the repository).
Released under Apache‑2.0; bundled LTX‑2 artifacts inherit their upstream license.
Output resolution is fixed at 704×1280; input images are resized with their aspect ratio preserved and then center-cropped to that size.
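
The resize-and-crop step can be sketched as plain geometry. The helper below is hypothetical (it is not the repository's preprocessing code), and it assumes that 704×1280 means height × width, that the image is scaled until it covers the target, and that sizes are rounded to the nearest pixel.

```python
def resize_and_center_crop_box(src_w, src_h, dst_w=1280, dst_h=704):
    """Return the resized size and the center-crop box for a source image.

    Assumes a "cover" resize: scale uniformly until both target dimensions
    are met or exceeded, then crop the overflow equally from both sides.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    # Guard against rounding leaving the image one pixel short of the target.
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


size_hd, box_hd = resize_and_center_crop_box(1920, 1080)
size_sq, box_sq = resize_and_center_crop_box(1000, 1000)
print(size_hd, box_hd)
print(size_sq, box_sq)
```

Under these assumptions a 1920×1080 input loses only a few rows at the top and bottom, while a square input loses a large share of its height, so framing the subject near the vertical center of the source image matters.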
