Qwen Omni
Alibaba's Qwen Omni family handles what most models can't: text, images, audio, and video as input — with real-time text and speech as output. One model, one inference pass, no stitching together separate ASR, LLM, and TTS pipelines. As of March 30, 2026, there are two generations to choose from, and the decision between them comes down to one critical factor: open-source access vs raw capability.
Qwen3.5-Omni is the new flagship. It processes 256K tokens of context, understands 113 languages for audio recognition, generates speech in 36 languages, and claims SOTA across 215 audio and audio-visual subtasks. It ships in three variants — Plus, Flash, and Light — and beats Gemini 3.1 Pro on audio understanding, reasoning, and translation benchmarks. The catch? It's proprietary. No weights to download, no self-hosting. API and Qwen Chat only.
Qwen3-Omni (September 2025) remains the open-source option under Apache 2.0. A 30B-total, 3B-active MoE model that you can download, quantize, and run on your own GPU. It won't match 3.5-Omni's benchmark numbers, but it gives you something 3.5 never will: full control over your inference stack.
Navigate this guide:
- Qwen3.5-Omni: The New Flagship (March 2026)
- Qwen3.5-Omni Variants: Plus, Flash, and Light
- Qwen3.5-Omni Specs and Capabilities
- Qwen3-Omni: The Open-Source Alternative
- Qwen3.5-Omni vs Qwen3-Omni: Which One Should You Use?
- Thinker-Talker Architecture Explained
- API Access and How to Get Started
- Running Qwen3-Omni Locally
- Real-World Use Cases for Qwen Omni
- FAQ
Qwen3.5-Omni: The New Flagship (March 2026)
Released on March 30, 2026, Qwen3.5-Omni is Alibaba's most capable omnimodal model to date. It builds on the Thinker-Talker foundation from Qwen3-Omni but scales everything up: longer context, more languages, better audio understanding, and — most importantly — a new Hybrid-Attention MoE architecture that pushes quality across the board.
The headline result: SOTA on 215 subtasks spanning audio understanding, audio-visual reasoning, and cross-modal interaction. That's not a cherry-picked leaderboard — it covers a wide range of real-world audio and video tasks where 3.5-Omni outperforms every model tested, including Gemini 3.1 Pro.
But here's what you need to know upfront: Qwen3.5-Omni is not open-source. Unlike its predecessor, which shipped under Apache 2.0 with full weights on Hugging Face, the 3.5 generation is proprietary. You can use it through the DashScope API or Qwen Chat, but you can't download it, self-host it, or fine-tune it. Alibaba hasn't disclosed the parameter count either. This is a significant departure from the Qwen team's usual open-weight approach, and it's worth factoring into your decision.
Qwen3.5-Omni Variants: Plus, Flash, and Light
Alibaba split 3.5-Omni into three tiers. Each targets a different tradeoff between quality, speed, and deployment constraints.
| Variant | Target | Best For | Availability |
|---|---|---|---|
| Plus | Maximum quality | Complex reasoning, professional audio/video tasks, highest accuracy across all 215 SOTA subtasks | API (DashScope), Qwen Chat |
| Flash | Low latency | Real-time voice assistants, live video analysis, conversational AI where response speed matters | API (DashScope), Qwen Chat |
| Light | Edge / on-device | Mobile apps, embedded systems, scenarios with bandwidth or compute constraints | Possibly Hugging Face (TBD) |
Quick verdict: Pick Plus when accuracy is non-negotiable — it's the variant behind those 215 SOTA claims. Go with Flash for anything interactive where users are waiting for a response. Light is the wildcard: if Alibaba releases it on Hugging Face, it could become the most practical option for developers who want some 3.5-Omni capability without full API dependency. As of early April 2026, Light's exact availability remains unclear.
Qwen3.5-Omni Specs and Capabilities
The spec sheet is where 3.5-Omni separates itself from everything else in the omnimodal space. Some of these numbers represent genuine leaps over the previous generation.
| Capability | Qwen3.5-Omni | Notes |
|---|---|---|
| Context window | 256K tokens | 8x longer than Qwen3-Omni's 32K |
| Audio input capacity | 10+ hours | Up from ~40 minutes in the previous gen |
| Audio recognition languages | 113 languages/dialects | Qwen3-Omni supported 19 |
| Speech generation languages | 36 | Up from 10 in Qwen3-Omni |
| Video input | 400+ seconds at 720p (1 FPS) | Native video understanding, not frame extraction |
| Architecture | Thinker-Talker with Hybrid-Attention MoE | Evolution of Qwen3-Omni's MoE design |
| Parameters | Not disclosed | Alibaba hasn't published the count |
| License | Proprietary | API only — no downloadable weights |
The jump from 40 minutes to 10+ hours of audio input is the most practically significant upgrade. That turns Omni from a short-conversation tool into something you can point at entire podcasts, lectures, or meetings without chunking. The 256K context window supports this — you're no longer hitting a wall at 32K tokens when working with long-form content.
Language coverage tells a similar story. Going from 19 speech input languages to 113 isn't an incremental improvement — it transforms Omni from "works in major languages" to "works in most languages on Earth." Speech output expanding from 10 to 36 languages means real-time translation scenarios that were impossible with the previous generation are now viable. For specialized speech recognition needs, Qwen ASR remains the dedicated tool, but 3.5-Omni's audio capabilities are now competitive for most use cases.
On benchmarks, Alibaba claims 3.5-Omni surpasses Gemini 3.1 Pro in audio understanding, audio-visual reasoning, and translation quality. The SOTA across 215 subtasks is an impressive breadth claim. That said, these are Alibaba's own reported numbers — independent community benchmarks haven't had time to verify them yet, given the model launched days ago. We'll update this section as third-party testing rolls in.
The Proprietary Elephant in the Room
We need to be direct about this: Qwen3.5-Omni being closed-source is a meaningful shift for the Qwen ecosystem. The Qwen team has built its reputation on open weights — Qwen 3, Qwen 3.5, and Qwen3-Omni all shipped under Apache 2.0. The 3.5-Omni generation breaks that pattern.
What this means in practice: you can't run it offline, you can't fine-tune it for your domain, you can't audit the weights, and you're dependent on Alibaba's API availability and pricing decisions. For some teams, especially those in regulated industries or with data sovereignty requirements, that's a dealbreaker regardless of how good the benchmarks look. For others, the performance gains justify the tradeoff.
Qwen3-Omni: The Open-Source Alternative
Released in September 2025, Qwen3-Omni is the previous generation — and it's still the only Omni model you can actually download. If local deployment, fine-tuning, or data privacy matter to your project, this is your model. It ships under Apache 2.0 with full weights on Hugging Face.
The architecture is a 30B-total, 3B-active MoE with the original Thinker-Talker design. Despite the 30B parameter count, only 3B fire per token — so inference feels more like running a 3B dense model than a 30B one. That makes it surprisingly practical on consumer hardware.
Qwen3-Omni Variants
| Variant | What It Does | Output |
|---|---|---|
| Instruct | Full Thinker + Talker pipeline. The general-purpose variant for multimodal conversations with speech. | Text + streaming speech |
| Thinking | Thinker only. Chain-of-thought reasoning traces for complex math, logic, and coding tasks. | Text only (no speech) |
| Captioner | Specialized for audio and video captioning. Tuned for transcription accuracy over conversation. | Text captions |
Qwen3-Omni Specs
| Specification | Value |
|---|---|
| Total parameters | 30B (Thinker) + 3B (Talker) |
| Active per token | 3B (Thinker) + 0.3B (Talker) |
| Context length | 32,768 tokens |
| Max audio input | ~40 minutes |
| Text languages | 119 |
| Speech input languages | 19 |
| Speech output languages | 10 |
| First-packet latency | 234ms (audio), 547ms (video) |
| License | Apache 2.0 |
On benchmarks, Qwen3-Omni punches well above its weight class for a 3B-active model. It hits 73.7 on AIME25 (Thinking variant), obliterating GPT-4o's 26.7 on math competitions — though Gemini 2.5 Pro still leads at 81.5. On logic puzzles (ZebraLogic), it scores 76.0, demolishing both GPT-4o (52.6) and Gemini 2.5 Pro (37.7). Audio is another strength: 1.22% word error rate on LibriSpeech clean, roughly half of GPT-4o's error rate.
Where does it fall short? Knowledge recall. On MMLU-Redux, the Instruct variant scores 80.6 — significantly behind GPT-4o (91.3) and Gemini 2.5 Pro (92.7). The Thinking variant closes the gap to 88.8, but it's still trailing. This is a model built for reasoning and multimodal understanding, not encyclopedic breadth. If your workload leans heavily on factual knowledge, a text-focused model like Qwen-Max will serve you better.
A December 2025 update to the qwen3-omni-flash API variant brought 49 voices (up from 17), system prompt customization for speech, and benchmark gains of +9.3 on LiveCodeBench and +5.6 on ZebraLogic. If you tried the API at launch and were unimpressed, the December version is a different experience.
Qwen3.5-Omni vs Qwen3-Omni: Which One Should You Use?
This is the core decision for anyone considering a Qwen omnimodal model right now. The two generations serve fundamentally different needs, and the right choice depends on your constraints — not just your quality expectations.
| Feature | Qwen3.5-Omni (Mar 2026) | Qwen3-Omni (Sep 2025) |
|---|---|---|
| License | Proprietary (closed-source) | Apache 2.0 (open-source) |
| Variants | Plus / Flash / Light | Instruct / Thinking / Captioner |
| Context window | 256K tokens | 32K tokens |
| Audio input capacity | 10+ hours | ~40 minutes |
| Audio recognition languages | 113 | 19 |
| Speech output languages | 36 | 10 |
| Video input | 400+ seconds at 720p | Supported (shorter clips) |
| Parameters | Not disclosed | 30B total / 3B active |
| Local deployment | No (API only) | Yes (vLLM, Ollama, etc.) |
| Fine-tuning | No | Yes (full weights available) |
| Availability | DashScope API, Qwen Chat | Hugging Face, DashScope API |
Choose Qwen3.5-Omni when: You need the best possible quality and don't mind API dependency. If your application involves long audio (podcasts, meetings, lectures), multilingual audio covering less common languages, or the highest accuracy on audio-visual tasks, 3.5-Omni is the clear winner. The 256K context and 10+ hour audio capacity alone enable workflows that simply aren't possible within 3-Omni's 32K-token, 40-minute limits.
Choose Qwen3-Omni when: You need to self-host, fine-tune, or keep data on-premises. Regulated industries, air-gapped environments, teams with strict data governance — these are all valid reasons to pick the open-source model despite its lower capability ceiling. The Apache 2.0 license means zero usage restrictions, and the 30B MoE architecture runs efficiently on consumer-grade hardware.
There's also a middle-ground scenario worth considering: use 3.5-Omni's API for development and prototyping where quality matters most, then evaluate whether Qwen3-Omni can handle the workload in production if self-hosting is a requirement. For many use cases — especially those involving shorter audio clips and major languages — the open-source model is perfectly adequate.
Thinker-Talker Architecture Explained
Both Omni generations share the same foundational design: a Thinker-Talker architecture that separates "understanding" from "speaking" into two tightly coupled modules. Most omnimodal models bolt a TTS engine onto a language model and call it done. The Thinker-Talker approach is architecturally cleaner and produces more natural results.
How It Works
The Thinker is the brain: a Mixture-of-Experts decoder that merges text tokens, vision embeddings, and audio features into a unified token stream, reasons over all modalities simultaneously, and produces hidden states. In Qwen3-Omni, this is a 30B-total MoE with 3B active per token. In Qwen3.5-Omni, Alibaba upgraded this to a Hybrid-Attention MoE design (exact parameter count undisclosed).
The Talker converts those hidden states into streaming audio tokens using multi-codebook speech synthesis. It doesn't wait for the Thinker to finish — it starts generating speech as hidden states arrive. In Qwen3-Omni, this hits 234ms first-packet latency, faster than GPT-4o's roughly 300ms.
The critical piece tying it all together is TMRoPE (Time-aligned Multimodal RoPE) — a positional encoding scheme that synchronizes video frames with their corresponding audio at the timestamp level. Without it, you'd get the AI equivalent of a poorly dubbed movie. It's one of those under-the-hood engineering decisions that makes multimodal conversation feel natural rather than stitched together.
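The alignment idea can be sketched in a few lines. This is illustrative only: real TMRoPE assigns time-aligned rotary position components rather than reordering tokens, but the principle it encodes is the same, namely that tokens from different modalities are positioned by their source timestamp so a video frame sits next to the audio recorded at the same moment.

```python
# Illustrative sketch only: real TMRoPE works on rotary positional
# encodings, not token order, but the alignment principle is the same.
# Each token carries the timestamp of the media it came from; merging
# by timestamp puts a video frame next to the audio from that moment.
def time_align(video_tokens, audio_tokens):
    """Merge (timestamp, token) pairs into one time-ordered stream."""
    merged = sorted(video_tokens + audio_tokens, key=lambda pair: pair[0])
    return [token for _, token in merged]

frames = [(0.0, "frame0"), (1.0, "frame1"), (2.0, "frame2")]  # 1 FPS video
speech = [(0.5, "audio0"), (1.5, "audio1")]                   # audio chunks

print(time_align(frames, speech))
# ['frame0', 'audio0', 'frame1', 'audio1', 'frame2']
```

Without that interleaving, the model would see all frames, then all audio, and lose the correspondence between what was said and what was on screen.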
Qwen3-Omni's Thinker is fed by a custom AuT audio encoder (~650M parameters, trained on 20 million hours of audio) that replaced the Whisper-based encoder from Qwen2.5-Omni. That training volume shows up directly in the 1.22% LibriSpeech word error rate cited earlier. Within the broader Qwen 3.5 family, the omnimodal branch follows the same MoE philosophy: large total parameter count, small active footprint per token.
API Access and How to Get Started
Both Omni generations are available through DashScope (Alibaba's API platform, OpenAI-compatible format). Qwen3.5-Omni is API-exclusive, while Qwen3-Omni can be accessed via API or run locally.
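A minimal Python sketch of a text-in, text-out call through the compatible endpoint. The base URL, the model identifier, and the modalities parameter are assumptions drawn from DashScope's OpenAI-compatible conventions; verify them against the current documentation before relying on them.

```python
# Sketch of calling an Omni model via DashScope's OpenAI-compatible
# endpoint. Base URL, model name, and the "modalities" field are
# assumptions -- verify them against the DashScope docs.
def build_request(prompt: str) -> dict:
    """Assemble kwargs for client.chat.completions.create()."""
    return {
        "model": "qwen3-omni-flash",                  # assumed identifier
        "messages": [{"role": "user", "content": prompt}],
        "modalities": ["text"],                       # text-only output
        "stream": True,                               # Omni endpoints stream
    }

def ask_omni(prompt: str) -> str:
    from openai import OpenAI  # pip install openai
    client = OpenAI(
        api_key="YOUR_DASHSCOPE_API_KEY",
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    chunks = client.chat.completions.create(**build_request(prompt))
    return "".join(c.choices[0].delta.content or "" for c in chunks)
```

Because the format is OpenAI-compatible, existing tooling built on the OpenAI SDK usually needs only a base URL and model name swap.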
Qwen3-Omni API Pricing
| Model | Text Input | Audio Input | Text Output | Text + Audio Output |
|---|---|---|---|---|
| qwen3-omni-flash | $0.43/M tokens | $3.81/M tokens | $1.66/M tokens | $15.11/M tokens |
| qwen-omni-turbo | $0.07/M tokens | $4.44/M tokens | $0.27/M tokens | $8.89/M tokens |
The pricing split by modality is worth studying. Turbo's text is dramatically cheaper ($0.07/M vs $0.43/M for Flash), but audio input actually costs more on Turbo ($4.44 vs $3.81). If your workload is mostly text with occasional speech, Turbo saves money. If audio is central to your pipeline, Flash gives you 49 voices, better benchmarks, and slightly cheaper audio input.
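To make that tradeoff concrete, here is a quick cost model using the table's prices. It covers text output only; the text + audio output tier prices differently.

```python
# Per-request cost estimate from the per-million-token prices above.
PRICES = {  # USD per 1M tokens: (text in, audio in, text out)
    "qwen3-omni-flash": (0.43, 3.81, 1.66),
    "qwen-omni-turbo":  (0.07, 4.44, 0.27),
}

def request_cost(model, text_in, audio_in, text_out):
    """Estimated USD for one request with text-only output."""
    t_in, a_in, t_out = PRICES[model]
    return (text_in * t_in + audio_in * a_in + text_out * t_out) / 1e6

# Text-heavy: 50K text tokens in, 2K out -> Turbo is roughly 6x cheaper.
# Audio-heavy: 1M audio tokens in -> Flash wins ($3.81 vs $4.44).
```

Running both scenarios through the function is the fastest way to see where the crossover falls for your own traffic mix.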
Qwen3.5-Omni pricing hasn't been fully disclosed at the time of writing — it launched days ago. We expect it to come at a premium over the 3-Omni API tiers. For updated pricing across all Qwen API models, check our pricing comparison page.
Running Qwen3-Omni Locally
This section applies to Qwen3-Omni only — the 3.5 generation can't be self-hosted. All three open-weight variants (Instruct, Thinking, Captioner) are available on Hugging Face under Apache 2.0.
vLLM (Recommended)
vLLM handles MoE expert routing efficiently and gives the best throughput for Qwen3-Omni:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct
Ollama
Fastest way to get running with minimal configuration. Verify the exact tag in the Ollama model library first, since qwen3:30b-a3b may resolve to the text-only Qwen3 MoE rather than an Omni build:
ollama run qwen3:30b-a3b
Hardware Requirements
The 30B MoE activates only 3B parameters per token, so inference compute feels like a 3B dense model. The catch: you still need enough VRAM to load all 30B parameters into memory. With Q4 quantization, expect 16-20GB VRAM. An RTX 4090 (24GB) handles it comfortably; an RTX 3090 works with smaller batch sizes.
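The arithmetic behind those numbers, as a rough weights-only estimate (KV cache and activations add several GB on top):

```python
# Weight-memory estimate: all 30B parameters must be resident even
# though only 3B are active per token. Weights only -- KV cache and
# activations add overhead on top of these figures.
def weight_vram_gb(params_billion, bits_per_weight):
    # params * bits / 8 bytes, with the 1e9 factors cancelling out
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(30, 16))  # 60.0 -> fp16 needs datacenter GPUs
print(weight_vram_gb(30, 4))   # 15.0 -> Q4 fits a 24GB RTX 4090
```

The 15GB Q4 figure plus runtime overhead is where the 16-20GB guidance above comes from.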
Avoid HuggingFace Transformers for anything beyond quick testing — MoE routing overhead makes it significantly slower than vLLM.
Want to check your GPU before downloading? Use our Can I Run Qwen? tool for instant hardware compatibility checks, or read the full local deployment guide for quantization and optimization tips.
Known Issues (Qwen3-Omni)
- Mixed image + video inputs crash: Sending both in the same request causes a shape mismatch. Process them in separate requests.
- Modality switching breaks context: Starting with audio then switching to image-only degrades results. Start a fresh session when changing modalities.
- HuggingFace Transformers is slow: MoE routing isn't optimized. Use vLLM for production workloads.
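For the first issue, the workaround is mechanical: build two single-modality requests instead of one mixed one. A sketch follows; the image_url and video_url content-part shapes are assumptions, so match them to whatever message format your client actually uses.

```python
# Workaround sketch for the mixed image + video crash: send two
# single-modality requests. The "image_url"/"video_url" part shapes
# are assumptions -- adapt them to your client's message format.
def split_mixed_request(prompt, image_url, video_url):
    """Return two message lists, one per modality, sharing the prompt."""
    def messages(media_part):
        return [{
            "role": "user",
            "content": [media_part, {"type": "text", "text": prompt}],
        }]
    image_msgs = messages({"type": "image_url", "image_url": {"url": image_url}})
    video_msgs = messages({"type": "video_url", "video_url": {"url": video_url}})
    return image_msgs, video_msgs  # send as two separate API calls
```

You then merge the two text responses yourself, which also sidesteps the modality-switching degradation since each request stays single-modality.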
Real-World Use Cases for Qwen Omni
The single-model approach eliminates the glue code and accumulated latency of chaining ASR + LLM + TTS pipelines. Here's where that matters most — and which Omni generation fits each scenario.
Real-time voice assistants. Point a camera at something, ask a question by voice, get a spoken answer. The model sees, hears, reasons, and speaks — one inference pass. Qwen3.5-Omni Flash is built for this: low latency, 36 output languages. For offline or privacy-sensitive deployments, Qwen3-Omni Instruct handles the same workflow with 10 output languages.
Long-form audio analysis. This is where 3.5-Omni's 10+ hour input capacity and 256K context transform what's possible. Analyze entire podcasts, multi-hour meetings, or lecture series without chunking. Qwen3-Omni tops out at 40 minutes — still useful for shorter content, but limiting for serious audio workloads.
Multilingual spoken translation. Speak in one language, get a response in another with natural prosody. Qwen3.5-Omni recognizes 113 languages and generates speech in 36 — covering the vast majority of global language pairs. Qwen3-Omni's 19-input/10-output coverage works for major languages but falls short for less common ones.
Video understanding and summarization. Feed a video and get both written analysis and spoken narration. Qwen3.5-Omni handles 400+ seconds of 720p video natively. For dedicated text-and-image analysis without audio, Qwen 3.5 offers longer context and stronger text reasoning.
Accessibility tools. Describe visual content aloud for visually impaired users. Transcribe and explain audio for hearing-impaired users. The unified architecture means these are native capabilities, not bolted-on afterthoughts. For specialized speech recognition tasks across many languages, Qwen ASR handles 52 languages with dedicated accuracy. For high-quality voice synthesis, Qwen TTS offers more control over voice characteristics.
FAQ
Can I run Qwen Omni locally?
Qwen3-Omni: yes. It's Apache 2.0, available on Hugging Face, and runs on consumer GPUs with 16-20GB VRAM (quantized). Qwen3.5-Omni: no. It's proprietary and only accessible through the DashScope API or Qwen Chat. If local deployment is a hard requirement, Qwen3-Omni is your only option. Check your GPU compatibility here.
Which Qwen3.5-Omni variant should I pick — Plus, Flash, or Light?
Plus for maximum quality on complex tasks — it's the flagship behind the 215 SOTA results. Flash for real-time applications where latency matters more than peak accuracy. Light for edge and mobile deployment, though its exact availability is still being clarified. When in doubt, start with Flash — it offers the best balance of speed and quality for most interactive use cases.
Is Qwen3.5-Omni better than Qwen3-Omni?
On capability, yes — across the board. Longer context, more languages, better benchmarks, more audio input capacity. But 3.5-Omni is proprietary, while 3-Omni is fully open-source. If you need to self-host, fine-tune, audit model weights, or work in regulated environments with data sovereignty requirements, Qwen3-Omni is the better choice despite its lower benchmark scores. Capability isn't the only axis that matters.
How does Qwen Omni compare to a separate ASR + LLM + TTS pipeline?
Omni gives you one model that processes all modalities in a single pass — no glue code, no accumulated latency between components. That's ideal for real-time conversations and cross-modal reasoning. But specialist models can still outperform Omni on individual tasks: Qwen ASR covers 52 speech recognition languages with dedicated accuracy, against Qwen3-Omni's 19 (3.5-Omni's 113 narrows that advantage), and Qwen TTS offers finer voice control. The tradeoff is integration simplicity vs per-task optimization.
What's the difference between Qwen Omni and Qwen 3.5?
Qwen 3.5 handles text and images — no audio input, no speech output. It has longer context (262K tokens) and stronger text-only reasoning. Qwen Omni is for when you need audio, video, or real-time speech generation. If your task is purely text and images, Qwen 3.5 is more capable and more efficient. If you need to hear or speak, you need Omni.