Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents
Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal site.
In a talk titled ‘Qwen: Towards a Generalist Model / Agent,‘ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into an detailed post as an independent researcher. This article reads the talk and the detailed post together.
What Lin’s Talk Actually Covers
The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.
The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.
The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The last key mention is the “training models -> training agents.”
Qwen3 Architecture, As Shown in the Talk
The talk includes the Qwen3 architecture tables, reproduced below.
The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.
Hybrid Thinking, and Why Merging is Hard
Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.
A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.
Qwen3 tried the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames this as a data problem more than a model problem.
Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace does not make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.
Interactive Explainer
From ‘Reasoning’ Thinking to ‘Agentic’ Thinking
Lin draws a line between two eras. The first was reasoning thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needs deterministic, verifiable rewards, so math, code, and logic became central. It also turned RL into a systems problem of large-scale rollouts and verification.
The next era, in his framing, is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises. It is defined by closed-loop interaction with the world, not by a long internal monologue.
Lin lists what agentic thinking must handle that pure reasoning can avoid:
Deciding when to stop thinking and take an action
Choosing which tool to invoke, and in what order
Incorporating noisy or partial observations from the environment
Revising plans after failures
Maintaining coherence across many turns and many tool calls
The optimization target changes with the era. The table below summarizes the contrast Lin draws.
Use Cases, With Examples
The distinction changes how you build:
Coding agents: A reasoning model emits one patch from a stack trace. An agentic system runs the test harness, reads the real error, revises, and re-runs until the suite passes. Thinking here should help with codebase navigation, error recovery, and tool orchestration.
Deep research: A reasoning model writes a long answer from memory. An agentic system breaks the question into sub-queries, calls search, drops weak sources, and returns grounded citations. Qwen’s own Deep Research demo sits in this category.
Multi-agent orchestration: Lin expects ‘harness engineering’ to matter more. An orchestrator plans and routes work. Specialized sub-agents execute narrower tasks and help control context pollution.
A Concrete Hook: Qwen3 Thinking Toggle
Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.
name = “Qwen/Qwen3-8B”
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
name, torch_dtype=”auto”, device_map=”auto”
)
messages = [{“role”: “user”, “content”: “Refactor this function and explain the change.”}]
# enable_thinking=True -> step-by-step thinking mode
# enable_thinking=False -> near-instant, non-thinking mode
text = tok.apply_chat_template(
messages, tokenize=False,
add_generation_prompt=True, enable_thinking=True,
)
inputs = tok(text, return_tensors=”pt”).to(model.device)
# Qwen’s recommended sampling for thinking mode
out = model.generate(
**inputs, max_new_tokens=2048,
temperature=0.6, top_p=0.95, top_k=20,
)
enable_thinking=True is the default, and the output wraps reasoning in a <think>…</think> block. Qwen3 also accepts soft switches. Appending /think or /no_think to a user turn flips the mode per message. That per-turn control is what dynamic thinking budgets build on.
Why Agentic RL Infrastructure is Harder
The presentation’s core engineering point is about infrastructure. In reasoning RL, rollouts are mostly self-contained trajectories with clean evaluators. In agentic RL, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.
That harness forces a new requirement: training and inference must be cleanly decoupled. Without it, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training. GPU utilization drops well below what reasoning RL achieves.
Lin also reframes what to obsess over. In the SFT era, teams optimized data diversity. In the agent era, he argues teams should optimize environment quality: stability, realism, coverage, and exploit resistance. He names reward hacking as the hardest problem, because tool access enlarges the attack surface for spurious optimization.
Key Takeaways
Junyang Lin left Qwen on March 3, 2026, and now publishes as an independent researcher.
His talk ends on one thesis: the field is moving from training models to training agents.
Agentic thinking is judged by sustained action in an environment, not by internal deliberation.
Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.
Reward hacking is the central risk once models gain real tool access.
Sources:
Primary source — the talk
Primary source — Junyang Lin’s Blog
“From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/
His homepage (independent-researcher status): https://justinlin610.github.io/
Qwen3 technical details (architecture, 119 languages, hybrid thinking)
Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1
Code verification (enable_thinking, /think /no_think, sampling)
Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html
Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B
Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B
Departure facts (cited in the article)
TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/
Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down
VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in
Supporting departure/context coverage (used for cross-checking, not all cited inline)
RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down
Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/
Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens
OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/
MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/
GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw
Two X posts
https://x.com/h100envy/status/2068987470960623783
https://x.com/h100envy/status/2073433806254624930
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
