We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com:
Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.
But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.
I suspect two issues are keeping the model from fully realizing its potential in agentic harnesses:
- Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site.
- The model was most likely overtrained on standardized toolsets and benchmarks, and is less adaptive when using arbitrary tooling in our custom harness simulations. We've committed to measuring intelligence as the ability to use custom, changing tools, rather than proficiency with specific trained-on tools (while still always providing local bash and other common tools). There are arguments for either approach, but the former is more indicative of general intelligence. It's a subtle difference, and GLM 5.1 still performs well with the tooling in our environments.
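To make the "custom, changing tools" idea concrete, here's a minimal sketch (not our actual harness code; all names are invented) of one way to reward schema-reading over memorization: regenerate tool and argument names every episode, so a model can only succeed by using the schema it is shown, not names it saw during training.

```python
import random

# Hypothetical sketch: the same underlying capabilities are re-exposed
# under fresh, arbitrary surface names each evaluation episode.
BASE_TOOLS = {
    "read_file": {"path": "string"},
    "search": {"query": "string", "limit": "integer"},
}

def randomized_toolset(seed: int) -> dict:
    """Return BASE_TOOLS with randomized tool and argument names.

    Deterministic per seed, so an episode can be replayed exactly.
    """
    rng = random.Random(seed)
    toolset = {}
    for name, params in BASE_TOOLS.items():
        alias = f"tool_{rng.randrange(10_000):04d}"
        toolset[alias] = {
            f"arg_{rng.randrange(100):02d}_{p}": t for p, t in params.items()
        }
    return toolset
```

A model that scores the same across seeds is reading the schema; a model that only scores well when the canonical names reappear is leaning on its training set.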
Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.
If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.
Interesting idea. The metric I'd intuitively want to see is low variance between harnesses for a smarter model. But if a large sample of models statistically outperformed with a certain harness, that's indeed a valuable signal for a developer.
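The low-variance metric in the parent is cheap to compute; a minimal sketch, with scores invented purely for illustration (not gertlabs data):

```python
from statistics import pvariance

# Hypothetical scores (0-100) for two models across three harnesses.
scores = {
    "model_a": [78, 76, 79],  # low spread: robust to harness choice
    "model_b": [81, 62, 70],  # high spread: harness-sensitive
}

def harness_sensitivity(runs: list[float]) -> float:
    """Population variance of one model's scores across harnesses.

    Lower values suggest the model adapts to whatever harness it is
    given, rather than depending on a particular one.
    """
    return pvariance(runs)

# Rank models from most to least harness-robust.
ranked = sorted(scores, key=lambda m: harness_sensitivity(scores[m]))
```

The statistical-outperformance signal mentioned above would be the complementary view: fix the harness and look at the distribution of model scores instead.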
Unsloth quantizations are available on release as well. [0] At 754B parameters, even the IQ4_XS quant is a massive 361 GB. This is definitely not a model your average local LLM enthusiast will be able to run, even with high-end hardware.
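As a sanity check on that size, the average bits stored per weight can be backed out from the file size and parameter count. Only the 361 GB and 754B figures come from the post; the GB-vs-GiB ambiguity (file listings often report GiB) is my assumption.

```python
def effective_bpw(file_bytes: float, n_params: float) -> float:
    """Average bits stored per parameter in a quantized model file."""
    return file_bytes * 8 / n_params

params = 754e9

bpw_gb = effective_bpw(361e9, params)        # reading 361 as decimal GB
bpw_gib = effective_bpw(361 * 2**30, params)  # reading 361 as binary GiB
```

Either reading lands around 3.8 to 4.1 bits per weight, which is plausible for an IQ4_XS-class mixed quantization (nominally about 4.25 bpw, with some tensors kept at lower precision).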
My impression is that the choice of harness matters a lot.
[0] https://huggingface.co/unsloth/GLM-5.1-GGUF