# Qwen3-Omni: Multilingual Omni-Modal Foundation Models by Alibaba Cloud

## Overview

Qwen3-Omni is a natively end-to-end, multilingual, omni-modal large language model (LLM). It understands and processes text, audio, images, and video, and it generates real-time speech and text responses. Developed by Alibaba Cloud's Qwen team, it introduces architectural innovations that improve multimodal performance, efficiency, and flexibility. Available on GitHub: QwenLM/Qwen3-Omni.

---

## Key Features

- **State-of-the-art performance across modalities**: Text-first pretraining followed by mixed multimodal training. Reaches state-of-the-art results on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36. Speech recognition and audio understanding performance is comparable to Gemini 2.5 Pro.
- **Multilingual capability**: Supports 119 text languages. Speech input covers 19 languages, including English, Chinese, Korean, Japanese, German, and Russian. Speech output covers 10 languages, such as English, Chinese, French, German, Russian, Japanese, and Korean.
- **Novel architecture**: Mixture-of-Experts (MoE)-based Thinker-Talker design, AuT pretraining for strong generalization, and a multi-codebook design for minimal latency.
- **Real-time audio/video interaction**: Low-latency streaming, natural conversational turn-taking, and immediate text or speech responses.
- **Flexible control**: Behavior can be adapted through customized system prompts.
- **Detailed audio captioner**: Qwen3-Omni-30B-A3B-Captioner is an open-source, detailed, low-hallucination audio captioning model.

---

## Model Variants and Downloads

| Model Name | Description |
| --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | Full model (thinker + talker); supports audio, video, and text input/output. |
| Qwen3-Omni-30B-A3B-Thinking | Thinker component only; chain-of-thought reasoning with text output. |
| Qwen3-Omni-30B-A3B-Captioner | Fine-tuned audio captioner; detailed audio-to-text with low hallucination. |

Model weights are downloaded automatically via Hugging Face Transformers or vLLM. Manual download commands are available via ModelScope (recommended in Mainland China) or the Hugging Face CLI; a Python download sketch is included at the end of this document.

---

## Usage

### Transformers Usage

- Install Transformers from source (a fresh environment or Docker is recommended to avoid dependency conflicts).
- Install qwen-omni-utils for multimodal input handling.
- Installing FlashAttention 2 is recommended to reduce GPU memory usage.
- Sample Python usage is available for loading the model, preparing multimodal inputs (text/images/audio), and generating text/audio outputs; see the illustrative sketch at the end of this document.

### vLLM Usage

- Highly recommended for fast, scalable inference.
- Installation requires building vLLM from source on the Qwen3-Omni branch.
- Supports batch inference over mixed modalities.
- Configurable for multi-GPU execution, maximum tokens, and parallel processing.
- Example Python code is provided for setup, input processing, and generation; see the illustrative sketch at the end of this document.

### DashScope API

- Provides offline and real-time APIs for seamless usage.
- APIs are available for the Instruct, Thinking, and Captioner models.
- Links to detailed API documentation for Mainland China and International users are provided.

---

## Interaction Options

- **Online demo**: Accessible via Hugging Face Spaces and ModelScope Studios; experience Qwen3-Omni and its real-time capabilities instantly.
- **Real-time interaction**: Available on the official chat platform Qwen Chat; supports voice and video calls.
- **Local web UI demo**: Instructions and commands to launch local demos with either the Transformers or vLLM backend. Dependencies: gradio,
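---

## Illustrative Code Sketches

The snippets below are non-authoritative sketches that flesh out the usage notes above; consult the QwenLM/Qwen3-Omni repository for the canonical examples.

**Manual weight download.** The README points to ModelScope and Hugging Face CLI commands; the sketch below uses the equivalent Python download APIs (`huggingface_hub.snapshot_download` and `modelscope.snapshot_download`) instead, with a placeholder local path.

```python
# Sketch: fetching weights manually instead of relying on auto-download.
# Both hub routes are shown for completeness; in practice use one of them.
from huggingface_hub import snapshot_download as hf_snapshot_download
from modelscope import snapshot_download as ms_snapshot_download

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# Hugging Face Hub (international users); downloads into the given directory.
hf_dir = hf_snapshot_download(MODEL_ID, local_dir="./Qwen3-Omni-30B-A3B-Instruct")

# ModelScope (recommended in Mainland China); returns the local snapshot directory.
ms_dir = ms_snapshot_download(MODEL_ID)

print("Hugging Face copy:", hf_dir)
print("ModelScope copy:", ms_dir)
```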
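**Transformers inference.** A minimal sketch of the Transformers workflow: load the Instruct checkpoint, build a multimodal conversation, and generate text (and, for the Instruct model, audio). The class names `Qwen3OmniMoeForConditionalGeneration` / `Qwen3OmniMoeProcessor`, the `process_mm_info` helper from qwen-omni-utils, and the two-value return of `generate` follow the Qwen omni-model usage pattern and are assumptions here; verify them against the official sample code.

```python
# Sketch: Qwen3-Omni inference with Hugging Face Transformers.
# Class and argument names follow the Qwen omni-model usage pattern; verify against the repo.
import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # extracts audio/image/video payloads from the chat

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional; reduces GPU memory use
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# One user turn mixing an image, an audio clip, and text (URLs are placeholders).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/scene.jpg"},
            {"type": "audio", "audio": "https://example.com/question.wav"},
            {"type": "text", "text": "Describe the image and answer the spoken question."},
        ],
    }
]

# Render the chat template, then pack text plus multimodal payloads into model inputs.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
)
inputs = inputs.to(model.device).to(model.dtype)

# The Instruct model (thinker + talker) can return generated text ids plus an audio waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)

print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
if audio is not None:
    sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```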
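**vLLM inference.** A sketch of batched, mixed-modality generation through vLLM built from the Qwen3-Omni branch. The engine arguments (`tensor_parallel_size`, `limit_mm_per_prompt`, `max_model_len`) and the prompt dictionary with `multi_modal_data` follow vLLM's standard multimodal interface; the concrete values are placeholders, and the Qwen3-Omni branch may expose additional options.

```python
# Sketch: batched multimodal inference with vLLM (built from the Qwen3-Omni branch).
# Engine arguments and the prompt format follow vLLM's multimodal interface; values are illustrative.
import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=torch.cuda.device_count(),      # multi-GPU execution
    limit_mm_per_prompt={"image": 3, "video": 3, "audio": 3},
    max_model_len=32768,
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# The Transformers processor is reused only to render the chat template.
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://example.com/clip.wav"},
            {"type": "text", "text": "Transcribe the audio, then summarize it in one sentence."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

request = {
    "prompt": prompt,
    "multi_modal_data": {},
    "mm_processor_kwargs": {"use_audio_in_video": True},
}
if audios:
    request["multi_modal_data"]["audio"] = audios
if images:
    request["multi_modal_data"]["image"] = images
if videos:
    request["multi_modal_data"]["video"] = videos

# Batch inference: pass a list of such request dicts to generate().
outputs = llm.generate([request], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```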