Qwen's Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents

Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal site.

In a talk titled ‘Qwen: Towards a Generalist Model / Agent,‘ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into an detailed post as an independent researcher. This article reads the talk and the detailed post together.

What Lin’s Talk Actually Covers

The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.

The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.

The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The last key mention is the “training models -> training agents.”

Qwen3 Architecture, As Shown in the Talk

The talk includes the Qwen3 architecture tables, reproduced below.

ModelLayersHeads (Q/KV)Tie Embedding / Experts (Total/Act.)ContextQwen3-0.6B2816 / 8Tie: Yes32KQwen3-1.7B2816 / 8Tie: Yes32KQwen3-4B3632 / 8Tie: Yes32KQwen3-8B3632 / 8Tie: No128KQwen3-14B4040 / 8Tie: No128KQwen3-32B6464 / 8Tie: No128KQwen3-30B-A3B4832 / 4Experts: 128 / 8128KQwen3-235B-A22B9464 / 4Experts: 128 / 8128K

The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.

Hybrid Thinking, and Why Merging is Hard

Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.

A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.

Qwen3 tried the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames this as a data problem more than a model problem.

Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace does not make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.

Interactive Explainer

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Lin draws a line between two eras. The first was reasoning thinking, defined by o1 and DeepSeek-R1. It taught the field that RL needs deterministic, verifiable rewards, so math, code, and logic became central. It also turned RL into a systems problem of large-scale rollouts and verification.

The next era, in his framing, is agentic thinking: thinking in order to act. An agent formulates plans, decides when to act, uses tools, reads environment feedback, and revises. It is defined by closed-loop interaction with the world, not by a long internal monologue.

Lin lists what agentic thinking must handle that pure reasoning can avoid:

Deciding when to stop thinking and take an action

Choosing which tool to invoke, and in what order

Incorporating noisy or partial observations from the environment

Revising plans after failures

Maintaining coherence across many turns and many tool calls

The optimization target changes with the era. The table below summarizes the contrast Lin draws.

DimensionReasoning thinkingAgentic thinkingJudged byQuality of internal deliberation before an answerWhether progress is sustained while actingReward signalVerifiable answers (math, code, logic)Task success in an interactive environmentCore object of trainingThe modelThe model plus its environment (the harness)Infra bottleneckRollouts, verification, stable policy updatesTool servers, sandboxes, train-serve decouplingMain failure modeVerbose, low-value reasoning tracesReward hacking through tool access and env leaks

Use Cases, With Examples

The distinction changes how you build:

Coding agents: A reasoning model emits one patch from a stack trace. An agentic system runs the test harness, reads the real error, revises, and re-runs until the suite passes. Thinking here should help with codebase navigation, error recovery, and tool orchestration.

Deep research: A reasoning model writes a long answer from memory. An agentic system breaks the question into sub-queries, calls search, drops weak sources, and returns grounded citations. Qwen’s own Deep Research demo sits in this category.

Multi-agent orchestration: Lin expects ‘harness engineering’ to matter more. An orchestrator plans and routes work. Specialized sub-agents execute narrower tasks and help control context pollution.

A Concrete Hook: Qwen3 Thinking Toggle

Hybrid thinking is exposed directly in code. The enable_thinking flag switches modes in the chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = “Qwen/Qwen3-8B”
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
name, torch_dtype=”auto”, device_map=”auto”
)

messages = [{“role”: “user”, “content”: “Refactor this function and explain the change.”}]

# enable_thinking=True -> step-by-step thinking mode
# enable_thinking=False -> near-instant, non-thinking mode
text = tok.apply_chat_template(
messages, tokenize=False,
add_generation_prompt=True, enable_thinking=True,
)
inputs = tok(text, return_tensors=”pt”).to(model.device)

# Qwen’s recommended sampling for thinking mode
out = model.generate(
**inputs, max_new_tokens=2048,
temperature=0.6, top_p=0.95, top_k=20,
)

enable_thinking=True is the default, and the output wraps reasoning in a <think>…</think> block. Qwen3 also accepts soft switches. Appending /think or /no_think to a user turn flips the mode per message. That per-turn control is what dynamic thinking budgets build on.

Why Agentic RL Infrastructure is Harder

The presentation’s core engineering point is about infrastructure. In reasoning RL, rollouts are mostly self-contained trajectories with clean evaluators. In agentic RL, the policy lives inside a harness of tool servers, browsers, terminals, and sandboxes.

That harness forces a new requirement: training and inference must be cleanly decoupled. Without it, rollout throughput collapses. A coding agent waiting on live test execution stalls inference and starves training. GPU utilization drops well below what reasoning RL achieves.

Lin also reframes what to obsess over. In the SFT era, teams optimized data diversity. In the agent era, he argues teams should optimize environment quality: stability, realism, coverage, and exploit resistance. He names reward hacking as the hardest problem, because tool access enlarges the attack surface for spurious optimization.

Key Takeaways

Junyang Lin left Qwen on March 3, 2026, and now publishes as an independent researcher.

His talk ends on one thesis: the field is moving from training models to training agents.

Agentic thinking is judged by sustained action in an environment, not by internal deliberation.

Agentic RL needs decoupled train-serve infra and high-quality environments, not just verifiable rewards.

Reward hacking is the central risk once models gain real tool access.

Sources:

Primary source — the talk

Primary source — Junyang Lin’s Blog

“From ‘Reasoning’ Thinking to ‘Agentic’ Thinking”: https://justinlin610.github.io/blog/from-reasoning-to-agentic-thinking/

His homepage (independent-researcher status): https://justinlin610.github.io/

Qwen3 technical details (architecture, 119 languages, hybrid thinking)

Qwen3 Technical Report (arXiv:2505.09388): https://arxiv.org/abs/2505.09388 · HTML: https://arxiv.org/html/2505.09388v1

Code verification (enable_thinking, /think /no_think, sampling)

Qwen docs Quickstart: https://qwen.readthedocs.io/en/latest/getting_started/quickstart.html

Qwen3-8B model card: https://huggingface.co/Qwen/Qwen3-8B

Qwen3-32B model card: https://huggingface.co/Qwen/Qwen3-32B

Departure facts (cited in the article)

TechCrunch: https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/

Bloomberg: https://www.bloomberg.com/news/articles/2026-03-04/alibaba-qwen-head-who-warned-of-openai-gap-steps-down

VentureBeat: https://venturebeat.com/technology/did-alibaba-just-kneecap-its-powerful-qwen-ai-team-key-figures-depart-in

Supporting departure/context coverage (used for cross-checking, not all cited inline)

RecodeChinaAI (LatePost translation): https://www.recodechinaai.com/p/alibabas-qwen-lead-just-stepped-down

Simon Willison: https://simonwillison.net/2026/Mar/4/qwen/

Geopolitechs: https://www.geopolitechs.org/p/inside-the-stepping-down-of-qwens

OfficeChai: https://officechai.com/ai/alibaba-qwens-tech-lead-junyang-lin-steps-down/

MLQ News: https://mlq.ai/news/key-researcher-steps-down-from-alibabas-qwen-ai-project/

GenAI Assembling (essay analysis, used to first locate the essay): https://genaiassembling.substack.com/p/what-junyang-lin-saw

Two X posts

https://x.com/h100envy/status/2068987470960623783

https://x.com/h100envy/status/2073433806254624930

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

Source link

Qwen’s Former Lead on What Hybrid Thinking Got Wrong — and Why He Now Backs Agents

What Lin’s Talk Actually Covers

Qwen3 Architecture, As Shown in the Talk

Hybrid Thinking, and Why Merging is Hard

Interactive Explainer

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Use Cases, With Examples

A Concrete Hook: Qwen3 Thinking Toggle

Why Agentic RL Infrastructure is Harder

Key Takeaways

Sources:

LlamaIndex ‘legal-kb’: Agentic Retrieval over Index v2 with retrieve, find, read, and grep Tools

Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026

How America's 250th birthday became a test of AI-powered collective intelligence

NVIDIA HORIZON: A Hands-Free Agent that Evolves Git Worktrees and Hits 100% RTL Benchmark Completion

Anthropic Launches Claude Science Beta: A Multi-Agent AI Workbench for Reproducible Genomics, Proteomics, and Cheminformatics Pipelines

NVIDIA AI Introduces ASPIRE: A Self-Improving Robotics Framework Reaching 31% Zero-Shot on LIBERO-Pro Long Tasks

Leave a Reply Cancel reply

You may have missed

‘Something Is Brewing’ for Dogecoin (DOGE) as Network Activity Explodes

LlamaIndex ‘legal-kb’: Agentic Retrieval over Index v2 with retrieve, find, read, and grep Tools

DOGE Price Prediction: Smart Money Is Loading at $0.076, But July’s Trap Door Needs to Hold $0.072 First

Brazilian Federal Police Dismantle $2 Billion Crypto Money Laundering Ring Linked to the PCC Cartel

Sitemap

Legal Information

Pin It on Pinterest

What Lin’s Talk Actually Covers

Qwen3 Architecture, As Shown in the Talk

Hybrid Thinking, and Why Merging is Hard

Interactive Explainer

From ‘Reasoning’ Thinking to ‘Agentic’ Thinking

Use Cases, With Examples

A Concrete Hook: Qwen3 Thinking Toggle

Why Agentic RL Infrastructure is Harder

Key Takeaways

Sources:

More Stories

Leave a Reply Cancel reply

You may have missed

Sitemap

Legal Information

Categories

Pin It on Pinterest