Speculative decoding can multiply LLM throughput by having a fast draft model propose tokens that a target model verifies in parallel. However, draft models trained with supervised fine-tuning (SFT) tend to plateau early, because they learn from fixed, target-generated trajectories rather than from the states they themselves produce at inference time. Draft-OPD reframes drafter training as an on-policy distillation problem: it closes the offline-to-inference gap by exposing the drafter to the states it actually induces, while still relying on the target model to produce stable continuations.
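To make the draft-then-verify loop concrete, here is a minimal toy sketch of greedy speculative decoding. It is not code from Draft-OPD: the two "models" (`target_next`, `draft_next`) are hypothetical deterministic functions, verification uses exact-match greedy acceptance, and the target is called sequentially here, whereas a real system would score all drafted positions in one parallel forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical toy target model: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical toy drafter: wrong whenever the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token chosen by the target itself."""
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    accepted: List[Token] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the rejected token
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all accepted: target adds a bonus token
    return accepted


def speculative_decode(prompt: List[Token], draft: NextToken, target: NextToken,
                       k: int, max_new: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new]


def greedy_decode(prompt: List[Token], target: NextToken, max_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

Because every emitted token is either confirmed or supplied by the target, `speculative_decode` returns the same sequence as `greedy_decode`; the drafter only changes how many target passes are needed, which is the sense in which the acceleration is lossless.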
Key Findings
- Draft-OPD uses target-assisted rollouts and replay of verification-exposed error positions so the drafter receives supervision on the exact draft-induced states that limit acceptance. This directs training to the true failure modes rather than stable but unrepresentative target trajectories.
- Empirically, Draft-OPD yields over 5× lossless acceleration for “thinking” models across diverse tasks, and improves speculative acceptance and decoding performance over EAGLE-3 and DFlash by about 23% and 13%, respectively. In the evaluated settings, this means faster inference with no loss in downstream quality.
- The method balances stability and on-policy signal: target-assisted continuations prevent drift while replaying rejected proposals highlights actionable mistakes, so the drafter learns corrective behaviors that translate to longer accepted runs in parallel verification.
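To see why longer accepted runs translate into speed, a common back-of-the-envelope model helps. The sketch below is not from the Draft-OPD source: it assumes each drafted token is accepted independently with the same probability `alpha` (a simplification), with `k` drafted tokens per verification pass. Under that assumption the expected number of tokens emitted per target pass is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes (hypothetically) that each of the k drafted tokens is accepted
    independently with probability alpha, and that the target always
    contributes one token per pass (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under these assumptions, `alpha = 0.8` with `k = 4` gives about 3.36 tokens per target pass, while `alpha = 0.9` gives about 4.10, so even modest gains in per-token acceptance compound over the drafted run.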
Who It's For and Trade-offs
Great fit if you: need to scale LLM inference throughput without degrading output quality, are already using or experimenting with speculative decoding, and can train a lightweight draft model alongside a heavyweight verifier. Draft-OPD is especially useful when acceptance length is the limiting factor for speed.
Look elsewhere if you: cannot run parallel verification (e.g., strict single-model pipelines), lack the resources to train a separate draft model, or can tolerate occasional quality loss in exchange for simpler approaches such as caching or quantization.
Where It Fits
Draft-OPD sits between SFT-based drafting (which is simple but plateaus) and fully actor-critic-style on-policy learning (which can be unstable). It trades some training complexity for more targeted gains in speculative acceptance and practical acceleration, making it attractive for production systems that can afford a draft+verify runtime.
How It Works (brief)
The core mechanism is: 1) let the drafter propose tokens under its own policy; 2) use the target model to produce stable continuations (target-assisted rollouts) to avoid catastrophic drift; 3) replay the positions that the verifier exposed as errors, so the drafter receives corrective supervision on the states it actually creates. This focused replay amplifies the learning signal for the mistakes that matter most for speculative acceptance.
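The three steps above can be sketched as a toy data-collection loop. This is one possible reading of the description, not the authors' implementation: the source does not specify the replay format, the loss, or the verification rule, so the greedy exact-match check, the `ReplaySample` tuple, and the toy models `target_next` and `draft_next` are all hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]
# (draft-induced state, token the drafter proposed, token the target wanted)
ReplaySample = Tuple[Tuple[Token, ...], Token, Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical toy target model."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical toy drafter: wrong whenever the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def collect_rollout(prompt: List[Token], draft: NextToken, target: NextToken,
                    k: int, min_new: int) -> Tuple[List[Token], List[ReplaySample]]:
    """Run draft/verify rounds until at least min_new tokens are generated,
    recording every position where the verifier rejected the drafter."""
    out = list(prompt)
    replay: List[ReplaySample] = []
    while len(out) - len(prompt) < min_new:
        # 1) The drafter proposes k tokens under its own policy.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 3) Verification exposes the first error position, which is stored for replay.
        n_ok = 0
        for tok in proposal:
            state = out + proposal[:n_ok]
            expected = target(state)
            if tok != expected:
                replay.append((tuple(state), tok, expected))
                break
            n_ok += 1

        # 2) The rollout continues from the accepted prefix plus the target's own
        #    token, so the trajectory never drifts away from the target's path.
        out.extend(proposal[:n_ok])
        out.append(target(out))
    return out, replay
```

Each replay sample pairs a state the drafter actually reached with the token the target preferred there, so a training step could, for example, apply a cross-entropy or distillation loss at exactly those positions; which loss Draft-OPD uses is not stated in this summary.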
