A 12B-parameter unified, encoder-free multimodal model that ingests text, images, and audio directly and returns text. It supports long contexts (up to 256K tokens), native function calling and thinking modes, and small-model deployment for local or on-device use.