QwQ-32B

QwQ-32B is Alibaba's dedicated reasoning model — and pound for pound, it might be the most impressive open-weight release of early 2025. With just 32.5 billion parameters, it matches DeepSeek-R1 (a 671B model) on key reasoning benchmarks. That's the same performance at 1/21st the size. Released in March 2025 under Apache 2.0, QwQ-32B fits on a single 24GB GPU with quantization — no multi-GPU rig required.

There's a catch, though. Qwen 3 arrived a month later and baked reasoning directly into every model in the family. So the question isn't whether QwQ is good — it is — but whether you still need it. This guide covers the real benchmarks, honest limitations, how QwQ compares to Qwen 3's thinking mode, and exactly how to run it locally.

QwQ-32B at a Glance

Before diving into benchmarks: the essentials.

| Spec | QwQ-32B |
|---|---|
| Parameters | 32.5B total (31.0B non-embedding) |
| Context window | 131,072 tokens (131K) |
| Architecture | Dense transformer, based on Qwen2.5-32B |
| License | Apache 2.0 (full commercial use) |
| Release date | March 5, 2025 |
| Reasoning style | Chain-of-thought via `<think>` blocks |
| Training | RL-enhanced post-training on math, code, and reasoning tasks |
| Min VRAM (Q4) | ~20GB (fits a single RTX 3090/4090) |

Two things stand out here. First, the 131K context window — a massive jump from the 32K that the earlier QwQ-32B-Preview offered back in November 2024. That preview model was a proof of concept. The March 2025 release is the real deal: better benchmarks, 4x the context, and full Apache 2.0 licensing from day one.

Second, the architecture. QwQ-32B isn't a new model from scratch — it's Qwen2.5-32B with targeted reinforcement learning to unlock deep chain-of-thought reasoning. The RL post-training teaches the model to generate explicit <think> blocks where it works through problems step by step before producing a final answer. This approach keeps the model size manageable while boosting reasoning performance to a level that competes with models 20x larger.
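In practice, this means every response arrives with the reasoning trace and the final answer in one string. A minimal sketch of how you might separate the two (the helper name `split_think` is ours, not part of any Qwen tooling):

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer).

    Assumes the reasoning is wrapped in a single <think>...</think> block,
    as QwQ emits by default; if no block is found, the whole text is
    treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_think("<think>2 + 2 is 4.</think>The answer is 4.")
```

Stripping the `<think>` block like this is useful when you only want to show users the final answer while logging the reasoning separately.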

Benchmarks: How QwQ-32B Matches a 671B Model

The headline number is real. QwQ-32B goes head-to-head with DeepSeek-R1 — a 671B mixture-of-experts model — and holds its own across the board. Here's the full picture:

| Benchmark | QwQ-32B | DeepSeek-R1 (671B) | o1-mini |
|---|---|---|---|
| AIME '24 (competition math) | 79.5 | 79.8 | — |
| LiveBench (general reasoning) | 73.1 | 71.6 | 59.1 |
| LiveCodeBench (code gen) | 63.4 | 65.9 | 53.8 |
| BFCL (tool/function calling) | 66.4 | 60.3 | — |
| IFEval (instruction following) | 83.9 | 83.8 | — |
QwQ-32B vs DeepSeek-R1: near-parity on reasoning benchmarks despite a 21x size difference.

On AIME '24, the gap is 0.3 points. Essentially a tie — except QwQ gets there with 32B parameters while R1 needs 671B. On LiveBench, QwQ actually wins outright (73.1 vs 71.6). And on tool calling (BFCL), QwQ beats R1 by 6 points, which matters if you're building agentic workflows.

Where does R1 pull ahead? LiveCodeBench. The 65.9 vs 63.4 gap isn't dramatic, but it's consistent — R1's sheer scale gives it an edge on complex multi-file code generation. For pure coding tasks, you might also want to look at Qwen Coder, which is purpose-built for that workload.

The honest caveat: these are self-reported benchmarks from Alibaba. Independent community testing broadly confirms the math and reasoning strength, but as Nathan Lambert has noted about Chinese AI labs generally, benchmark scores should be read with a grain of salt. The models are legitimately strong — but "matches R1" is the headline claim, and real-world performance can vary by task.

QwQ-32B vs Qwen 3 Thinking Mode — Which One Do You Actually Need?

This is the section that matters most. When Qwen 3 launched in April 2025, it introduced a native "thinking mode" across the entire model family. Every Qwen 3 model — from the 0.6B edge model to the 235B-A22B flagship — can now generate chain-of-thought reasoning on demand. That directly overlaps with what QwQ was built to do.

So is QwQ obsolete? Not exactly. But the answer depends on what you're optimizing for.

| Factor | QwQ-32B | Qwen 3 (thinking mode) |
|---|---|---|
| AIME '24 score | 79.5 | 85.7 (Qwen3-235B) |
| Model size | 32.5B dense | 0.6B (dense) to 235B-A22B (MoE) |
| Reasoning approach | Always-on CoT via `<think>` blocks | Toggle on/off per request |
| Flexibility | Reasoning only | General-purpose + reasoning |
| VRAM (Q4) | ~20GB | Varies: 1GB to 140GB+ |
| Status | Stable, no further updates expected | Active development |
| License | Apache 2.0 | Apache 2.0 |

Pick QwQ-32B when you need a dedicated, lightweight reasoning model. If you're running a single 24GB GPU and want the best possible chain-of-thought performance without loading a 235B model, QwQ is hard to beat. It's also simpler to deploy — there's no mode switching, no thinking budget to configure. It reasons by default, every time.

Pick Qwen 3 when you want a general-purpose model that can also reason deeply. The Qwen3-235B-A22B in thinking mode scores 85.7 on AIME '24 versus QwQ's 79.5 — a significant gap. And you get that reasoning ability alongside everyday chat, coding, multilingual support, and everything else Qwen 3 offers. If you can afford the compute, it's the stronger choice.

There's also a middle path. Qwen3-32B with thinking mode enabled gives you a model at the same parameter count as QwQ, but with general-purpose capabilities on top. Community reports suggest Qwen3-32B in thinking mode is roughly comparable to QwQ on math tasks, while being better at everything else. For most new projects starting today, that's where we'd point you.

Bottom line: QwQ-32B is a specialist. If you already have it deployed and it's working well for your reasoning pipeline, there's no urgent reason to switch. But if you're starting fresh, Qwen 3 gives you more versatility at the same or better reasoning performance.

API Access and Pricing

You don't need to run QwQ locally to use it. Several providers host it:

| Provider | Model Name | Input Price | Output Price | Notes |
|---|---|---|---|---|
| Alibaba DashScope | qwq-plus / qwq-plus-latest | Varies by plan | Varies by plan | Official API, most up-to-date |
| OpenRouter | qwen/qwq-32b | $0.15/M tokens | $0.58/M tokens | Easy integration, pay-per-use |
| Groq | qwq-32b | Free tier available | Free tier available | Extremely fast inference |

Groq deserves a special mention. Their LPU hardware makes QwQ-32B fast — we're talking hundreds of tokens per second, which transforms the chain-of-thought experience from "waiting for the model to think" to something nearly instantaneous. If you want to test QwQ's reasoning capabilities without any setup, Groq's free tier is the quickest path.

OpenRouter at $0.15 per million input tokens is remarkably cheap for a model at this performance level. For comparison, OpenAI's o1-mini costs significantly more, and QwQ outperforms it on LiveBench and LiveCodeBench. If you're building a reasoning-heavy application on a budget, QwQ via OpenRouter is worth serious consideration.
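Getting started with OpenRouter takes a few lines, since it exposes a standard OpenAI-compatible chat endpoint. A hedged sketch (the payload-building helper is ours; the model slug comes from the table above, and `OPENROUTER_API_KEY` is assumed to be set in your environment):

```python
import json

def build_request(prompt: str, model: str = "qwen/qwq-32b") -> dict:
    """Build a standard chat-completions payload for OpenRouter."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("How many primes are there below 30?")
print(json.dumps(payload, indent=2))

# To actually send it (requires the `requests` package and an OpenRouter key):
# import os, requests
# r = requests.post(
#     "https://openrouter.ai/api/v1/chat/completions",
#     headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
#     json=payload,
# )
# print(r.json()["choices"][0]["message"]["content"])
```

Remember that the response will include the full `<think>` reasoning trace, so budget output tokens (and cost) accordingly.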

For the full picture on Qwen's API ecosystem — including Qwen Max, Qwen Plus, and how DashScope works — see our Qwen API guide.

Running QwQ-32B on Your Own Hardware

This is where QwQ's size advantage really shines. A 671B model like DeepSeek-R1 needs enterprise-grade hardware. QwQ-32B? You can run it on a gaming GPU.

Quick Start with Ollama

The fastest way to get QwQ running locally:

```shell
ollama run qwq:32b
```

That's it. Ollama handles the download, quantization, and serving. You'll get a Q4_K_M quantized version by default, which needs roughly 20GB VRAM. An RTX 3090, RTX 4090, or any card with 24GB will handle it comfortably.
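Once the model is pulled, Ollama also serves a local REST API on port 11434, which is handy for scripting. A minimal sketch using only the Python standard library (the prompt is illustrative; the request body follows Ollama's `/api/chat` schema):

```python
import json
import urllib.request

def chat_body(prompt: str) -> bytes:
    """Encode a non-streaming chat request for the qwq:32b model."""
    return json.dumps({
        "model": "qwq:32b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")

# Sending the request requires `ollama serve` to be running locally:
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=chat_body("What is 17 * 24?"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```

Setting `"stream": False` returns one complete JSON response instead of newline-delimited chunks, which keeps the example simple.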

Other Local Options

QwQ-32B works with every major local inference tool, including llama.cpp, LM Studio, and vLLM.

For step-by-step instructions on any of these tools, our local deployment guide covers the full setup. And if you're not sure whether your hardware can handle it, check our Can I Run Qwen tool — it tells you exactly which Qwen models fit your GPU.

Hardware Requirements

| Quantization | VRAM Needed | Recommended GPU | Expected Speed |
|---|---|---|---|
| Q4_K_M | ~20GB | RTX 3090 / 4090 (24GB) | 15-30 tok/s |
| Q5_K_M | ~24GB | RTX 4090 (24GB) | 12-25 tok/s |
| Q8_0 | ~36GB | RTX A6000 / dual GPU | 10-20 tok/s |
| FP16 | ~65GB | 2x RTX 4090 / A100 | 8-15 tok/s |

Can you run it on 16GB VRAM? Technically yes, with aggressive Q4 quantization and reduced context length. But expect slower speeds and some quality degradation. The sweet spot is 24GB — that's where QwQ lives most comfortably.

Known Issues and Honest Limitations

QwQ-32B is impressive for its size, but it's not without problems. These are documented by both Alibaba and the community:

Language mixing. QwQ sometimes switches languages mid-response, especially in long reasoning chains. You'll be reading English and suddenly hit a paragraph in Chinese. It doesn't happen constantly, but it's a known quirk. System prompts that explicitly request English output help, but don't eliminate it entirely.

Circular reasoning loops. This is the most frustrating issue. On certain complex problems, the model gets stuck in a loop — re-examining the same reasoning path over and over without reaching a conclusion. It burns through tokens and context window while going nowhere. If you see the <think> block growing past several thousand tokens without resolution, it's usually better to rephrase your prompt and try again.
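If you stream responses, you can detect and cut off a runaway reasoning loop automatically. A sketch of one way to do it; the 4,000-token budget is an arbitrary assumption to tune per task, and `count_tokens` is a hypothetical helper standing in for your tokenizer:

```python
THINK_BUDGET = 4000  # assumed cutoff; tune for your workload

def should_abort(think_tokens: int, saw_close_tag: bool,
                 budget: int = THINK_BUDGET) -> bool:
    """Abort when the reasoning block exceeds the budget without closing."""
    return think_tokens > budget and not saw_close_tag

# In a streaming loop, count tokens as chunks arrive and stop the request
# (then rephrase the prompt) once should_abort returns True:
#
# think_tokens, saw_close_tag = 0, False
# for chunk in stream:
#     think_tokens += count_tokens(chunk)  # count_tokens is hypothetical
#     if "</think>" in chunk:
#         saw_close_tag = True
#     if should_abort(think_tokens, saw_close_tag):
#         stream.close()
#         break
```

Aborting early like this saves tokens and cost compared to letting a stuck chain run to the context limit.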

Common-sense reasoning gaps. QwQ was trained with heavy emphasis on math, code, and logical reasoning. On everyday common-sense questions — the kind that require world knowledge rather than deduction — it can stumble. Don't expect it to replace a general-purpose model for broad Q&A tasks. That's not what it's built for.

Think token errors. Early deployments hit issues with malformed <think> tokens that broke output parsing. The Unsloth community identified and fixed this in their quantized versions. If you're grabbing GGUF files, prefer Unsloth-patched versions from HuggingFace to avoid this.

Long-context hallucinations. Despite the 131K context window, reasoning quality degrades on very long inputs. Complex multi-step problems that require tracking many variables across a long context can produce confident-sounding but wrong answers. For critical applications, keep your reasoning prompts focused and concise rather than dumping entire documents into context.

Frequently Asked Questions

Should I use QwQ-32B or Qwen 3 for reasoning tasks?
If you can run Qwen 3 (any size) with thinking mode enabled, start there. Qwen3-235B in thinking mode scores 85.7 on AIME '24 versus QwQ's 79.5. Even Qwen3-32B with thinking mode is a more versatile option. QwQ still makes sense if you want a dedicated reasoning model that's simple to deploy on 24GB VRAM with no configuration overhead.

Can I run QwQ-32B on 16GB VRAM?
Yes, with Q4 quantization and a reduced context window. It'll work, but you'll sacrifice speed and some quality. 24GB is the comfortable minimum. Check your specific GPU with our Can I Run Qwen tool.

Is QwQ-32B still being updated?
No further updates are expected. Alibaba's reasoning efforts have moved to Qwen 3's native thinking mode. QwQ-32B is stable and works well — it's just feature-frozen. Think of it as a finished product rather than an active project.

What's the difference between QwQ-32B and QwQ-32B-Preview?
The Preview (November 2024) was an early proof of concept with 32K context. The final QwQ-32B (March 2025) expanded context to 131K, improved benchmark scores across the board, and released under a cleaner Apache 2.0 license. Always use the March 2025 version.

Is QwQ free for commercial use?
Yes. Apache 2.0 allows full commercial use, modification, and redistribution. No royalties, no restricted clauses. You own what you build with it.

How does QwQ compare to OpenAI's o1-mini?
QwQ beats o1-mini on both LiveBench (73.1 vs 59.1) and LiveCodeBench (63.4 vs 53.8). It's also open-weight and free to run locally, while o1-mini is proprietary and API-only. The comparison isn't close.