AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE-Bench Pro

AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE-Bench Pro



Is China picking back up the open source AI baton?

Z.ai, also known as Zhupai AI, a Chinese AI startup best known for its powerful, open source GLM family of models, has unveiled GLM-5.1 today under a permissive MIT License, allowing for enterprises to download, customize and use it for commercial purposes. They can do so on Hugging Face.

This follows its release of GLM-5 Turbo, a faster version, under only proprietary license last month.

The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering.

The release represents a pivotal moment in the evolution of artificial intelligence. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons.

GLM-5.1 is a 754-billion parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls.

"agents could do about 20 steps by the end of last year," wrote z.ai leader Lou on X. "glm-5.1 can do 1,700 rn. autonomous work time may be the most important curve after scaling laws. glm-5.1 will be the first point on that curve that the open-source community can verify with their own hands. hope y'all like it^^"

In a market increasingly crowded with fast models, Z.ai is betting on the marathon runner. The company, which listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, is using this release to cement its position as the leading independent developer of large language models in the region.

Technology: the staircase pattern of optimization

GLM-5.1s core technological breakthrough isn't just its scale, though its 754 billion parameters and 202,752 token context window are formidable, but its ability to avoid the plateau effect seen in previous models.

In traditional agentic workflows, a model typically applies a few familiar techniques for quick initial gains and then stalls. Giving it more time or more tool calls usually results in diminishing returns or strategy drift.

Z.ai research demonstrates that GLM-5.1 operates via what they call a staircase pattern, characterized by periods of incremental tuning within a fixed strategy punctuated by structural changes that shift the performance frontier.

In Scenario 1 of their technical report, the model was tasked with optimizing a high-performance vector database, a challenge known as VectorDBBench.

The model is provided with a Rust skeleton and empty implementation stubs, then uses tool-call-based agents to edit code, compile, test, and profile. While previous state-of-the-art results from models like Claude Opus 4.6 reached a performance ceiling of 3,547 queries per second, GLM-5.1 ran through 655 iterations and over 6,000 tool calls. The optimization trajectory was not linear but punctuated by structural breakthroughs.

At iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, which reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 queries per second.

By iteration 240, it autonomously introduced a two-stage pipeline involving u8 prescoring and f16 reranking, reaching 13,400 queries per second. Ultimately, the model identified and cleared six structural bottlenecks, including hierarchical routing via super-clusters and quantized routing using centroid scoring via VNNI. These efforts culminated in a final result of 21,500 queries per second, roughly six times the best result achieved in a single 50-turn session.

This demonstrates a model that functions as its own research and development department, breaking complex problems down and running experiments with real precision.

The model also managed complex execution tightening, lowering scheduling overhead and improving cache locality. During the optimization of the Approximate Nearest Neighbor search, the model proactively removed nested parallelism in favor of a redesign using per-query single-threading and outer concurrency.

When the model encountered iterations where recall fell below the 95 percent threshold, it diagnosed the failure, adjusted its parameters, and implemented parameter compensation to recover the necessary accuracy. This level of autonomous correction is what separates GLM-5.1 from models that simply generate code without testing it in a live environment.

Kernelbench: pushing the machine learning frontier

The model's endurance was further tested in KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba.

In this setting, the goal is to produce a faster GPU kernel than the reference PyTorch implementation while maintaining identical outputs. Each of the 50 problems runs in an isolated Docker container with one H100 GPU and is limited to 1,200 tool-use turns. Correctness and performance are evaluated against a PyTorch eager baseline in separate CUDA contexts.

The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer. It eventually delivered a 3.6x geometric mean speedup across 50 problems, continuing to make useful progress well past 1,000 tool-use turns.

Although Claude Opus 4.6 remains the leader in this specific benchmark at 4.2x, GLM-5.1 has meaningfully extended the productive horizon for open-source models.

This capability is not simply about having a longer context window; it requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error. One of the key breakthroughs is the ability to form an autonomous experiment, analyze, and optimize loop, where the model can proactively run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement.

All solutions generated during this process were independently audited for benchmark exploitation, ensuring the optimizations did not rely on specific benchmark behaviors but worked with arbitrary new inputs while keeping computation on the default CUDA stream.

Product strategy: subscription and subsidies

GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. To support this, Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools.

The product offering is divided into three subscription tiers, all of which include free Model Context Protocol tools for vision analysis, web search, web reader, and document reading.

The Lite tier at $27 USD per quarter is positioned for lightweight workloads and offers three times the usage of a comparable Claude Pro plan. The Pro tier at $81 per quarter is designed for complex workloads, offering five times the Lite plan usage and 40 to 60 percent faster execution.

The Max tier at $216 per quarter is aimed at advanced developers with high-volume needs, ensuring guaranteed performance during peak hours.

For those using the API directly or through platforms like OpenRouter or Requesty, Z.ai has priced GLM-5.1 at $1.40 per one million input tokens and $4.40 per million output tokens. There's also a cache discount available for $0.26 per million input tokens.

Model

Input

Output

Total Cost

Source

Grok 4.1 Fast

$0.20

$0.50

$0.70

xAI

MiniMax M2.7

$0.30

$1.20

$1.50

MiniMax

Gemini 3 Flash

$0.50

$3.00

$3.50

Google

Kimi-K2.5

$0.60

$3.00

$3.60

Moonshot

MiMo-V2-Pro (≤256K)

$1.00

$3.00

$4.00

Xiaomi MiMo

GLM-5

$1.00

$3.20

$4.20

Z.ai

GLM-5-Turbo

$1.20

$4.00

$5.20

Z.ai

GLM-5.1

$1.40

$4.40

$5.80

Z.ai

Claude Haiku 4.5

$1.00

$5.00

$6.00

Anthropic

Qwen3-Max

$1.20

$6.00

$7.20

Alibaba Cloud

Gemini 3 Pro

$2.00

$12.00

$14.00

Google

GPT-5.2

$1.75

$14.00

$15.75

OpenAI

GPT-5.4

$2.50

$15.00

$17.50

OpenAI

Claude Sonnet 4.5

$3.00

$15.00

$18.00

Anthropic

Claude Opus 4.6

$5.00

$25.00

$30.00

Anthropic

GPT-5.4 Pro

$30.00

$180.00

$210.00

OpenAI

Notably, the model consumes quota at three times the standard rate during peak hours, which are defined as 14:00 to 18:00 Beijing Time daily, though a limited-time promotion through April 2026 allows off-peak usage to be billed at a standard 1x rate. Complementing the flagship is the recently debuted GLM-5 Turbo.

While 5.1 is the marathon runner, Turbo is the sprinter, proprietary and optimized for fast inference and tasks like tool use and persistent automation.

At a cost of $1.20 per million input / $4 per million output, it is more expensive than the base GLM-5 but comes in at more affordable than the new GLM-5.1, positioning it as a commercially attractive option for high-speed, supervised agent runs.

The model is also packaged for local deployment, supporting inference frameworks including vLLM, SGLang, and xLLM. Comprehensive deployment instructions are available at the official GitHub repository, allowing developers to run the 754 billion parameter MoE model on their own infrastructure.

For enterprise teams, the model includes advanced reasoning capabilities that can be accessed via a thinking parameter in API requests, allowing the model to show its step-by-step internal reasoning process before providing a final answer.

Benchmarks: a new global standard

The performance data for GLM-5.1 suggests it has leapfrogged several established Western models in coding and engineering tasks.

On SWE-Bench Pro, which evaluates a model's ability to resolve real-world GitHub issues using an instruction prompt and a 200,000 token context window, GLM-5.1 achieved a score of 58.4. For context, this outperforms GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2.

Beyond standardized coding tests, the model showed significant gains in reasoning and agentic benchmarks. It scored 63.5 on Terminal-Bench 2.0 when evaluated with the Terminus-2 framework and reached 66.5 when paired with the Claude Code harness.

On CyberGym, it achieved a 68.7 score based on a single-run pass over 1,507 tasks, demonstrating a nearly 20-point lead over the previous GLM-5 model. The model also performed strongly on the MCP-Atlas public set with a score of 71.8 and achieved a 70.6 on the T3-Bench.

In the reasoning domain, it scored 31.0 on Humanitys Last Exam, which jumped to 52.3 when the model was allowed to use external tools. On the AIME 2026 math competition benchmark, it reached 95.3, while scoring 86.2 on GPQA-Diamond for expert-level science reasoning.

The most impressive anecdotal benchmark was the Scenario 3 test: building a Linux-style desktop environment from scratch in eight hours.

Unlike previous models that might produce a basic taskbar and a placeholder window before declaring the task complete, GLM-5.1 autonomously filled out a file browser, terminal, text editor, system monitor, and even functional games.

It iteratively polished the styling and interaction logic until it had delivered a visually consistent, functional web application. This serves as a concrete example of what becomes possible when a model is given the time and the capability to keep refining its own work.

Licensing and the open segue

The licensing of these two models tells a larger story about the current state of the global AI market. GLM-5.1 has been released under the MIT License, with its model weights made publicly available on Hugging Face and ModelScope.

This follows the Z.ai historical strategy of using open-source releases to build developer goodwill and ecosystem reach. However, GLM-5 Turbo remains proprietary and closed-source. This reflects a growing trend among leading AI labs toward a hybrid model: using open-source models for broad distribution while keeping execution-optimized variants behind a paywall.

Industry analysts note that this shift arrives amidst a rebalancing in the Chinese market, where heavyweights like Alibaba are also beginning to segment their proprietary work from their open releases.

Z.ai CEO Zhang Peng appears to be navigating this by ensuring that while the flagship's core intelligence is open to the community, the high-speed execution infrastructure remains a revenue-driving asset.

The company is not explicitly promising to open-source GLM-5 Turbo itself, but says the findings will be folded into future open releases. This segmented strategy helps drive adoption while allowing the company to build a sustainable business model around its most commercially relevant work.

Community and user reactions: crushing a week's work

The developer community response to the GLM-5.1 release has been overwhelmingly focused on the model's reliability in production-grade environments.

User reviews suggest a high degree of trust in the model's autonomy.

One developer noted that GLM-5.1 shocked them with how good it is, stating it seems to do what they want more reliably than other models with less reworking of prompts needed. Another developer mentioned that the model's overall workflow from planning to project execution performs excellently, allowing them to confidently entrust it with complex tasks.

Specific case studies from users highlight significant efficiency gains.

A user from Crypto Economy News reported that a task involving preprocessing code, feature selection logic, and hyperparameter tuning solutions, which originally would have taken a week, was completed in just two days. Since getting the GLM Coding plan, other developers have noted being able to operate more freely and focus on core development without worrying about resource shortages hindering progress.

On social media, the launch announcement generated over 46,000 views in its first hour, with users captivated by the eight-hour autonomous claim. The sentiment among early adopters is that Z.ai has successfully moved past the hallucination-heavy era of AI into a period where models can be trusted to optimize themselves through repeated iteration.

The ability to build four applications rapidly through correct prompting and structured planning has been cited by multiple users as a game-changing development for individual developers.

The implications of long-horizon work

The release of GLM-5.1 suggests that the next frontier of AI competition will not be measured in tokens per second, but in autonomous duration.

If a model can work for eight hours without human intervention, it fundamentally changes the software development lifecycle.

However, Z.ai acknowledges that this is only the beginning. Significant challenges remain, such as developing reliable self-evaluation for tasks where no numeric metric exists to optimize against.

Escaping local optima earlier when incremental tuning stops paying off is another major hurdle, as is maintaining coherence over execution traces that span thousands of tool calls.

For now, Z.ai has placed a marker in the sand. With GLM-5.1, they have delivered a model that doesn't just answer questions, but finishes projects. The model is already compatible with a wide range of developer tools including Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid.

For developers and enterprises, the question is no longer, "what can I ask this AI?" but "what can I assign to it for the next eight hours?"

The focus of the industry is clearly shifting toward systems that can reliably execute multi-step work with less supervision. This transition to agentic engineering marks a new phase in the deployment of artificial intelligence within the global economy.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *

Pin It on Pinterest