Most coding-focused LLMs are trained purely for single-token autoregression; this model explores a different trade-off by training for Multi-Token Prediction (MTP). That objective nudges the network toward longer-horizon planning and lets draft heads propose several candidate tokens in parallel. The main model then verifies those candidates and can emit more than one token per step, which raises throughput and is intended to improve coherence on multi-step coding and reasoning tasks.
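The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the model's MTP implementation: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and decoding is assumed to be greedy. A real runtime would verify all drafted positions in one batched forward pass rather than one call per position.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# A greedy "model" in this sketch: maps a context to its single next token.
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(prefix: Sequence[Token], target_next: NextToken,
                     draft_next: NextToken, draft_len: int = 2) -> List[Token]:
    """Run one draft-and-verify step and return the tokens to emit."""
    # 1) The draft proposes draft_len tokens autoregressively.
    ctx = list(prefix)
    proposed: List[Token] = []
    for _ in range(draft_len):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks each proposal in order.
    ctx = list(prefix)
    emitted: List[Token] = []
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: emit the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3) Every proposal was accepted: the target adds one more token.
    emitted.append(target_next(ctx))
    return emitted


def generate(prefix: Sequence[Token], target_next: NextToken,
             draft_next: NextToken, max_new: int,
             draft_len: int = 2) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verify steps used."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, target_next, draft_next, draft_len))
        steps += 1
    return out[len(prefix):len(prefix) + max_new], steps


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: Sequence[Token]) -> Token:
    return len(ctx) % 5


def toy_draft(ctx: Sequence[Token]) -> Token:
    # Agrees with the target except at some positions, to force rejections.
    return 9 if len(ctx) % 4 == 1 else len(ctx) % 5


tokens, steps = generate([], toy_target, toy_draft, max_new=10)
```

Under greedy decoding the emitted sequence matches what the target alone would produce; the draft only changes how many verify steps are needed. Each step emits between one and `draft_len + 1` tokens.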
Key Capabilities
- Multi-Token Prediction (MTP): Trained to predict multiple future tokens (configured with draft=2), enabling speculative decoding and a reported ~35% throughput gain in internal benchmarks while reducing truncation/repetition in coding tasks.
- Coding & Agent Focus: Fine-tuned with trace-inversion synthetic CoT and large collections of agent trajectories to improve tool-calling stability, stepwise debugging, and repository-level code tasks.
- Local-friendly GGUF bundle: Packaged as GGUF for local inference, with explicit notes on adding mmproj.gguf to enable optional vision/tooling support. Suited to resource-constrained setups (8-bit workflows) and offline use cases.
- Multilingual & practical datasets: Trained and fine-tuned on curated agent traces and proprietary inversion datasets to boost long-form reasoning and tool interactions.
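Before pointing a local runtime at the bundle, it can help to confirm that the expected files are present and are actually GGUF. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`. The function names and the model file name passed in are hypothetical; only `mmproj.gguf` comes from the notes above, and the check assumes the projector file sits in the same directory as the model.

```python
from pathlib import Path
from typing import Dict, Union

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def is_gguf(path: Path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    try:
        with path.open("rb") as fh:
            return fh.read(4) == GGUF_MAGIC
    except OSError:
        return False


def check_bundle(model_dir: Union[str, Path], model_name: str,
                 projector_name: str = "mmproj.gguf") -> Dict[str, bool]:
    """Report whether the main model and the optional projector look usable.

    model_name is whatever the downloaded weights file is called (an
    assumption here); projector_name defaults to the mmproj.gguf file.
    """
    root = Path(model_dir)
    return {
        "model_ok": is_gguf(root / model_name),
        "vision_ok": is_gguf(root / projector_name),
    }
```

A result with `model_ok` true and `vision_ok` false means text-only inference should still be possible, while the optional vision support is unavailable until the projector file is added.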
Who it's for and trade-offs
Great fit if you need a locally runnable coder/agent model that favors multi-step planning and throughput (for example, local code generation, agent tool pipelines, or running small-scale agents). Look elsewhere if you need a fully vetted general-purpose assistant or production-critical safety guarantees: this release is community/experimental, and some long, edge-case explanations still show truncation or repetition. Also expect deployment details (prompt format, system prompt, token limits) to materially affect agent behavior.
Where it fits
Best used in local inference stacks, developer workflows, and research experiments comparing speculative decoding strategies. It complements larger cloud-hosted models when you need lower latency per token on commodity hardware and when you control the full prompt/tooling environment.
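When comparing speculative decoding strategies, a useful first-order quantity is the expected number of tokens emitted per verification step. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification, and that the verifier always contributes one token of its own. The function name and parameters are hypothetical, and the result is tokens per step, not wall-clock throughput, since drafting and verification have their own costs.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int = 2) -> float:
    """Expected tokens emitted per verify step under i.i.d. acceptance.

    The i-th drafted token is kept only if all earlier ones were accepted,
    so it contributes accept_rate**i; the i = 0 term is the token the
    verifier always emits itself (a correction or a bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))
```

With `draft_len=2` the value ranges from 1 token per step (nothing accepted) to 3 tokens per step (everything accepted), so measured acceptance rates on your own prompts give a quick upper bound on what a draft configuration can deliver.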
How it was built (concise)
The model is a 9B dense base model fine-tuned with a curriculum that mixes trace-inversion synthetic chain-of-thought data with high-quality agent trajectory traces. Reported benchmarks show improved accuracy on coding and math tasks and higher reasoning-efficiency indices than the base model, though some edge-case tasks became less stable when outputs grew very long.
