
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Trains a GPT-style causal Transformer on a 2-billion-frame retargeted motion corpus to enable zero-shot whole-body motion tracking and control. By scaling both data and model capacity, it tracks highly dynamic behaviors while generalizing to unseen motions; accepted to CVPR 2026.

Introduction

Why this matters

Most existing whole-body trackers struggle to combine agility with generalization: small MLP-based trackers can reproduce specific motions but fail to generalize to unseen or highly dynamic behaviors. Humanoid-GPT targets both at once by scaling the training corpus to ~2 billion retargeted frames and using a causal, GPT-style Transformer architecture, producing a single generative model that tracks complex dynamics and generalizes zero-shot to new motions and control tasks.

Key Findings
  • Billion-scale pretraining (≈2B frames) unifies major mocap datasets plus large in-house recordings, so the model sees far more motion variety during pretraining. This reduces overfitting to specific datasets and improves zero-shot performance.
  • A causal Transformer tracker replaces shallow MLPs, enabling autoregressive generation of whole-body trajectories. The model captures longer temporal context and more complex dynamics, which improves fidelity on highly dynamic behaviors.
  • Scaling data and model capacity together yields consistent gains in zero-shot generalization across unseen tasks and motion styles. As a result, a single pretrained model can be applied to new tracking and control tasks without task-specific finetuning.
  • Extensive experiments and scaling analyses in the paper report state-of-the-art performance on a range of tracking benchmarks and robustness to unseen, high-agility motions.
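
To make the "causal" part of the second finding concrete, here is a minimal sketch of causal attention weighting in plain Python. It is not the paper's implementation: the function name `causal_attention_weights` and the toy score matrix are assumptions made for illustration. It only shows the masking rule that lets frame t attend to frames 0..t and never to future frames.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s], with future positions (s > t) masked out.

    `scores` is a square list of lists of raw attention scores (hypothetical input).
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # frame t may only see frames 0..t
        m = max(visible)                            # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in visible]
        total = sum(exps)
        # Future frames receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Uniform scores over 4 frames: each frame spreads its attention evenly over its past.
scores = [[0.0] * 4 for _ in range(4)]
w = causal_attention_weights(scores)
for row in w:
    print(row)
```

With uniform scores, the first frame attends only to itself and the last frame attends equally to all four frames, which is the lower-triangular pattern a causal Transformer relies on.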

Who it's for & trade-offs

Great fit if you need a single generative tracker that can handle wide-ranging, high-dynamics human motion without task-specific finetuning, for example if you are building motion controllers, whole-body imitation systems, or downstream robotics controllers that benefit from zero-shot generalization. Look elsewhere if you require lightweight, on-device trackers for severely constrained compute or if you need interpretable, rule-based tracking: the approach relies on large-scale pretraining and Transformer-level compute and may be too heavy for embedded deployment.

Method & positioning

The paper's main technical move is treating motion tracking as a generative, autoregressive modeling problem and investing in scale: (1) create a unified, retargeted corpus of ~2B frames across mocap sources; (2) train a causal GPT-style Transformer for trajectory generation and tracking; (3) evaluate scaling behavior and zero-shot transfer across control tasks. In terms of positioning, this follows the recent trend of applying foundation-model scaling to embodied and motion domains, prioritizing broad generalization over small specialized trackers.
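
The autoregressive framing above can be sketched as a rollout loop: each predicted state is fed back into a sliding window of history, and the next prediction is conditioned on that window plus the current reference frame. This is a toy illustration under stated assumptions, not the paper's method. The names `rollout`, `toy_predict_next`, and `context_len` are hypothetical, states are single floats rather than whole-body poses, and the predictor is a stand-in for the trained Transformer.

```python
from collections import deque

def toy_predict_next(history, target):
    """Hypothetical stand-in for the model: step halfway from the latest state to the target."""
    return history[-1] + 0.5 * (target - history[-1])

def rollout(reference, initial_state, context_len=4, predict_next=toy_predict_next):
    """Track `reference` one frame at a time, feeding each prediction back as context."""
    history = deque([initial_state], maxlen=context_len)  # sliding window of past states
    trajectory = []
    for target in reference:
        # The prediction depends only on past states and the current reference frame.
        nxt = predict_next(list(history), target)
        history.append(nxt)        # autoregression: the output becomes part of the input
        trajectory.append(nxt)
    return trajectory

print(rollout([1.0, 1.0, 1.0], initial_state=0.0))  # [0.5, 0.75, 0.875]
```

Swapping `toy_predict_next` for a learned sequence model keeps the loop unchanged; `context_len` bounds how much temporal context the predictor sees at each step.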

Information

  • Website: arxiv.org
  • Authors: Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi
  • Published date: 2026/06/02