We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com:
Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.
But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.
I suspect 2 issues with the model are keeping it from fully realizing its potential in agentic harnesses:
- Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site.
- The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive in using arbitrary tooling in our custom harness simulations. We've decided to commit to measuring intelligence as the ability to use custom, changing tools, instead of being trained to use specific tools (while still always providing a way to run local bash and other common tools). There are arguments to be made for either, but the former is more indicative of general intelligence. Regardless, it's a subtle difference and GLM 5.1 still performs well with tooling in our environments.
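As a rough illustration of the "custom, changing tools" idea, here is a toy sketch (the harness, tool names, and schema format are all invented for illustration, not our actual setup): each eval run renames and reshuffles the tool schema so a model cannot lean on a memorized toolset, while a bash escape hatch is always preserved.

```python
import json
import random

random.seed(7)

# Base capabilities; names are hypothetical stand-ins.
BASE_TOOLS = {
    "read_file": ["path"],
    "search":    ["query", "max_results"],
    "run_bash":  ["command"],  # the bash escape hatch is always kept verbatim
}

def randomized_schema():
    """Return the tool list with fresh aliases and a shuffled order."""
    tools = []
    for name, params in BASE_TOOLS.items():
        # run_bash keeps its standard name; everything else is aliased
        alias = name if name == "run_bash" else f"{name}_{random.randint(100, 999)}"
        tools.append({"name": alias, "parameters": params})
    random.shuffle(tools)
    return tools

schema = randomized_schema()
print(json.dumps(schema, indent=2))
```

A model that has genuinely learned tool *use* should handle the aliased schema as well as the canonical one; a model overtrained on standard toolsets will degrade.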
Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.
If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.
Interesting idea. The metric I'd intuitively want to see is low variance between harnesses for a smarter model. But if a large sample of models statistically outperformed with a certain harness, that's indeed a valuable signal for a developer.
Unsloth quantizations are available on release as well. [0] At 754B parameters, even the IQ4_XS quant is a massive 361 GB. This is definitely a model your average local LLM enthusiast is not going to be able to run, even with high-end hardware.
SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.
Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.
For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.
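A toy sketch of that batching idea (this is not a real inference engine; the shapes and layer math are invented): each layer's weights are "read" from slow storage once per forward pass, and every batched task is advanced through that layer before the weights are evicted, so N tasks cost one pass over the SSD data instead of N.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
NUM_LAYERS = 4

# Stand-in for weights resident on SSD: one matrix per layer.
ssd_weights = [rng.standard_normal((DIM, DIM)).astype(np.float32)
               for _ in range(NUM_LAYERS)]

def run_batched(tasks):
    """Advance all tasks through the model with one pass over storage."""
    states = np.stack(tasks)                 # (batch, DIM)
    for layer_id in range(NUM_LAYERS):
        w = ssd_weights[layer_id]            # single "read" from SSD
        states = np.tanh(states @ w)         # apply the layer to every task
    return states

batch = [rng.standard_normal(DIM).astype(np.float32) for _ in range(3)]
out = run_batched(batch)
print(out.shape)  # (3, 8): three tasks finished with 4 weight reads, not 12
```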
Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.
I remember that time, where compiling Linux kernels was measured in hours. Then multi-core computing arrived, and after a few years it was down to 10 minutes.
With LLMs it feels more like the old punchcards, though.
True, but this is not only a trade-off between opex and capex.
Local inference using open weight models provides guaranteed performance which will remain stable over time, and be available at any moment.
As many current HN threads show, depending on external AI inference providers is extremely risky, as their performance can be degraded unpredictably at any time or their prices can be raised at any time, equally unpredictably.
Being dependent on a subscription for your programming workflow is a huge bet that you will gain more from the slightly higher quality of proprietary models than you will lose if the service is degraded in the future.
As recent history has shown, many have already lost this bet.
I am not a gambler, so I have made my choice, which is local AI inference, using a variety of models depending on the task: small models completely executable on relatively cheap GPUs (like the new Intel GPUs), medium models that need e.g. 128 GB on a CPU, and huge models that must be stored on fast SSDs (e.g. interleaved across multiple PCIe 5.0 SSDs).
Such a strategy is achievable with a modest capex, in the lower half of the 4-digit range.
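The tiered setup described above could be sketched as a simple dispatcher. Everything here is illustrative: the model names, backends, and difficulty thresholds are hypothetical, not a recommendation.

```python
# Hypothetical local-inference router: pick a model tier by task difficulty.
TIERS = [
    # (max difficulty handled, model, where it runs)
    (1, "small-4b-model",  "cheap GPU"),
    (2, "medium-moe",      "128 GB CPU RAM"),
    (3, "huge-754b-moe",   "PCIe 5.0 SSD offload"),
]

def pick_model(difficulty: int):
    """Return the cheapest tier that can handle the task."""
    for max_diff, model, backend in TIERS:
        if difficulty <= max_diff:
            return model, backend
    return TIERS[-1][1], TIERS[-1][2]  # fall back to the biggest tier

print(pick_model(1))   # easy task stays on the cheap GPU
print(pick_model(3))   # hard task goes to the SSD-offloaded giant
```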
I agree in principle that more democratized compute is better, and that third parties introduce additional risk outside of your control. That said, I just don't see it working economically: either you have an underpowered GPU (4-digit range), at which point you get a weak model, a slow model, or probably both; or you have an expensive GPU cluster, but then you also need to consider utilization, since you are probably not streaming tokens 24/7, and the TCO of self-hosting becomes drastically more expensive.
Personally I hope we see a third way: strong open-weight models hosted by a variety of companies actually competing on price and nines of availability. That way the capex-heavy GPUs are fully utilized and users can rent intelligence as a commodity.
There is a very apt analogy to virtual server hosting: VPS/shared web hosting is a commodity, and it does not make financial sense for most users to host their websites on their own physical servers in their basements.
Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each task in RAM. In an emergency you could dump some of that KV-cache to storage (this is how prompt caching works too, AIUI) and serve reads for it from there as well, but that adds a lot more overhead than just offloading sparsely-used experts, since the KV-cache is far more heavily accessed.
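A minimal sketch of spilling a per-request KV-cache to storage and restoring it on a later decode step, using a NumPy memmap as a stand-in for real paged-cache machinery (actual engines are far more sophisticated about this):

```python
import os
import tempfile
import numpy as np

DIM, SEQ = 4, 16
# Toy KV-cache for one request: (sequence positions, head dim).
kv = np.arange(SEQ * DIM, dtype=np.float32).reshape(SEQ, DIM)

path = os.path.join(tempfile.mkdtemp(), "kv.bin")

def spill(cache, path):
    """Write the cache to disk and return what's needed to restore it."""
    m = np.memmap(path, dtype=cache.dtype, mode="w+", shape=cache.shape)
    m[:] = cache
    m.flush()
    return cache.shape, cache.dtype

def restore(path, shape, dtype):
    """Read the cache back into RAM as a regular array."""
    return np.array(np.memmap(path, dtype=dtype, mode="r", shape=shape))

shape, dtype = spill(kv, path)
kv2 = restore(path, shape, dtype)
print(np.array_equal(kv, kv2))  # True: the cache round-trips through storage
```

The overhead point in the comment above is visible here: every decode step would touch this entire cache, so paying disk latency for it hurts far more than for an expert that is only occasionally routed to.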
To be honest I am a bit sad, as GLM 5.1 is producing much better TypeScript than Opus or Codex imo, but no matter what, it sometimes goes into schizo mode at some point over longer contexts. Not always though; I have had multiple sessions go over 200k and be fine.
I just set the context window to 100k and manage it actively (e.g. I compact it regularly or make it write out documentation of its current state and start a new session).
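That kind of active context management can be sketched as a simple loop. The token estimate and the summarize() stand-in here are assumptions for illustration, not any real agent's API:

```python
TOKEN_BUDGET = 100_000

def estimate_tokens(messages):
    # crude heuristic: roughly 4 characters per token
    return sum(len(m) for m in messages) // 4

def summarize(messages):
    # stand-in for an LLM call that compacts the history into a summary
    return f"[summary of {len(messages)} earlier messages]"

def add_message(history, msg):
    """Append a turn; compact once the budget is exceeded."""
    history.append(msg)
    if estimate_tokens(history) > TOKEN_BUDGET:
        # replace everything but the latest turn with a summary
        history[:] = [summarize(history), history[-1]]
    return history

history = []
for i in range(100):
    add_message(history, "x" * 8000)   # ~2k tokens per message
print(len(history) < 100)  # True: the history was compacted along the way
```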
For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.
When it works and it's not slow, it can impress. Like yesterday it solved something that Kimi K2.5 could not, and Kimi was the best open source model for me. But it's still slow sometimes. I have z.ai and Kimi subscriptions for when I run out of tokens for Claude (Max) and Codex (Plus).
I have a feeling it's nearing Opus 4.5 level, if they could fix it getting crazy after ~100k tokens.
I honestly still hold onto habits from earlier days of Claude & Codex usage and tend to wipe / compact my context frequently. I don't trust the era of big giant contexts, frankly, even on the frontier models.
Every single day, three things are becoming more and more clear:
(1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
(2) Local/private inference is the future of AI
(3) There's *still* no killer product yet (so get to work!)
GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.
The focus on the speed of agent-generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality"), and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.
It's a fun way to quantify the real-world performance between models that's more practical and actionable.
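The benchmark-first loop described above boils down to: record a baseline, try a candidate optimization, and accept it only if the output is identical and the speedup clears the target. A toy sketch, where the two workloads are invented stand-ins for "before" and "after" code:

```python
import time

def bench(fn, arg, reps=5):
    """Return (best wall-clock time, result) over a few repetitions."""
    best = float("inf")
    result = None
    for _ in range(reps):
        t0 = time.perf_counter()
        result = fn(arg)
        best = min(best, time.perf_counter() - t0)
    return best, result

def baseline(n):
    # naive sum of squares of 0..n-1
    total = 0
    for i in range(n):
        total += i * i
    return total

def candidate(n):
    # same result via the closed form for the sum of squares
    return (n - 1) * n * (2 * n - 1) // 6

t_base, r_base = bench(baseline, 200_000)
t_new, r_new = bench(candidate, 200_000)
assert r_new == r_base, "regression: output changed"
print(t_base / t_new >= 1.4)  # speedup gate; the closed form wins easily here
```

The "without cheating the benchmarks" clause matters: the output-equality assert is what keeps an agent from optimizing by breaking correctness.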
Comments here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?
I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.
I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.
I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now: obvious quantization issues, going in circles, flipping from X to !X, injecting Chinese characters. It is useless now for any serious coding work.
I can’t wait to try it. I set up a new system this morning with OpenClaw and GLM-5, and I like GLM-5 as the backend for Claude Code. Excellent results.
It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.
For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.
Just saw the Claude Mythos post. Not sure when it's going public, but this feels like a real jump, not just incremental progress. Also waiting for the next GLM release, because the specs are looking kind of insane.
A bit off-topic, but even though I don't use LLMs for my job, my hobbies, or daily life very often (and when I do, it's mostly some kind of "rubber duck brainstorm"), whenever I see open-weight releases like this one or the recent Gemma 4 (which is very good for a local model), there's always one song that comes to mind, and I simply can't get rid of it no matter how hard I try. The first time was with DeepSeek-R1 which, despite being blamed for "censorship", was heavily censored only via the DeepSeek API; the local model (the full-weight 685B, not the distilled ones) was pretty much unhinged on any topic.
"I am the storm that is approaching, provoking..." : )
My impression is that the choice of harness matters a lot.
[0] https://huggingface.co/unsloth/GLM-5.1-GGUF
I think the model is now tuned more towards agentic use/coding than general intelligence.
[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...
Excited to test this.
Everyone else isn't that far behind and they aren't all gonna just wall off their new model.
A reason that Anthropic will eventually give is 'the competition can do what Glasswing can do so what's the point limiting it'.
"I am the storm that is approaching, provoking..." : )