Qwen3.6-35B-A3B: Agentic coding power, now open to all

[−] simonw 28d ago

I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/mai...

It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

[−] GistNoesis 28d ago

Thanks for pointing to the GGUF.

I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agentic oritened task consisiting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )

I noted a nice improvement over QWen3.5 in its ability to discover new creatures in the open ended searching task, but I've not quantified it yet with numbers. It also seems faster, at around 140 token/s compared to 100 token/s , but that's maybe due to some different configuration options.

Some little difference with QWen3.5 : to avoid crashes due to lack of memory in multimodal I had to pass --no-mmproj-offload to disable the gpu offload to convert the images to tokens otherwise it would crash for high resolutions images. I also used quantized kv store by passing -ctk q8_0 -ctv q8_0 and with a ctx-size 150000 it only need 23099 MiB of device memory which means no partial RAM offloading when I use a RTX 4090.

[−] kelnos 28d ago

I'm not sure how you can give the flamingo win to Qwen:

* It's sitting on the tire, not the seat.

* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.

* The wheel spokes are bizarre.

* One of the flamingo's legs doesn't extend to the pedal.

* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly. Or the other eye isn't.

* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.

* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.

In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.

[−] withinboredom 28d ago

He literally said it came down to the comment in the SVG. Points for taste, not correctness. Basically.

[−] realityfactchex 28d ago

Here's a reproduction attempt (LM Studio, same Qwen3.6-35B-A3B-GGUF model as linked in parent, M1 Max 64GB, <90 seconds):

https://files.catbox.moe/r3oru2.png

- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.

- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).

- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.

- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.

But, I know, it's more about the pelican and the bicycle.

[−] _ache_ 28d ago

The bicycle frame is ok. Simon's was better but at least it's not broken like Opus 4.7.

I can't comment that flamingo.

[−] jubilanti 28d ago

I wonder when pelican riding a bicycle will be useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before, not in the benchmarks or even something a team would run internally. But now I'd bet internally this is one of the new Shirley Cards.

[−] abustamam 28d ago

Simon has an article on this

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

[−] SwellJoe 28d ago

Pelicanmaxxing

[−] amelius 28d ago

Yeah try it with something else, or e.g. add a tiger to the back seat.

[−] survirtual 28d ago

I use this metric now, and I suggest you change it per your imagination:

"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"

This allows me to evaluate several factors across models. It is novel and creative. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes personally to evaluate.

I also consider how well it one shots, errors generated, response to errors being corrected, and velocity of iteration to improvement.

Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B does second, and I have nice results from Qwen3.5 35B.

ChatGPT does not do well. It can complete the task without errors but the creativity is atrocious.

[−] MagicMoonlight 28d ago

They’ll hardcode it in 4.8, just like they do when they need to “fix” other issues

[−] rafaelmn 28d ago

I mean look at the result where he asked about a unicycle - the model couldn't even keep the spokes inside the wheels - would be rudimentary if it "learned" what it means to draw a bicycle wheel and could transfer that to unicycle.

[−] hansmayer 28d ago

Valid points, but you"d think "superintelligence" would "know" how to draw a pelican on a bike?

[−] bertili 28d ago

It's fascinating that a $999 Mac Mini (M4 32GB) with almost similar wattage as a human brain gets us this far.

[−] culi 28d ago

the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right

[−] cyclopeanutopia 28d ago

But that you also gave a win to Qwen on flamingo is pretty outrageous! :)

Tthe right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

[−] prirun 28d ago

The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?

[−] rdslw 28d ago

interesting, I just tried this very model, unsloth, Q8, so in theory more capable than Simon's Q4, and get those three "pelicans". definitely NOT opus quality. lmstudio, via Simon's llm, but not apple/mlx. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)

[−] monksy 28d ago

Hey I really enjoy your blog. On some things I end up finding a blog post of yours thats a year+ old and at other times, you and I are investigating similar things. I just pulled Qwen3.6 - 35b -A3B (Can't believe thats a A3B coming from 35b).

I'm impressed about the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.

In short, keep up the good work with an interesting blog!

[−] jamwise 28d ago

I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius

[−] jaspanglia 28d ago

The real question is what the next truly weird, un-optimized prompt will be. Something involving a sloth debugging a quantum computer in MS Paint?"

[−] quietsegfault 28d ago

The qwen flamingo looks like it’s smoking’ a doobie.

[−] MeteorMarc 28d ago

Interesting, qwen has the pelican driving on the left lane. Coincidence or has it something to do with the workers providing the RL data?

[−] Scrounger 28d ago

I've been running qwen3.6:35b-a3b-q4_K_M (22.3GB) via Ollama.

Is the 20.9GB GGUF version better or negligible in comparison?

[−] bwv848 28d ago

I've been trying the Q4_K_M version, and sometimes it gets stuck in a loop. Gemma 4 doesn’t have this issue.

[−] danielhanchen 28d ago

Oh that is pretty good! And the SVG one!

[−] logicallee 28d ago

what kind of specs does your laptop have? do you know how many tokens/second you get on it?

[−] slekker 28d ago

How does it do with the "car wash" benchmark? :D

[−] bertili 29d ago

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

[−] homebrewer 29d ago

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

[−] mtct88 29d ago

Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

[−] alecco 29d ago

Related interesting find on Qwen.

"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"

https://xcancel.com/N8Programs/status/2044408755790508113

[−] kanemcgrath 28d ago

I have been using Qwen3.5-35B-A3B a lot in local testing, and it is by far the most capable model that could fit on my machine. I think quantization technology has really upped its game around these models, and there were two quants that blew me away

Mudler APEX-I-Quality. then later I tried Byteshape Q3_K_S-3.40bpw

Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization doing long agent coding loops. with the byteshape quant I am up to 40+ t/s which is a speed that makes agents much more pleasant. On an rtx 3060 12GB and 32GB of system ram, I went from slamming all my available memory to having like 14GB to spare.

[−] armanj 29d ago

I recall a Qwen exec posted a public poll on Twitter, asking which model from Qwen3.6 you want to see open-sourced; and the 27b variant was by far the most popular choice. Not sure why they ignored it lol.

[−] aliljet 29d ago

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?

[−] rvnx 29d ago

China won again in terms of openness

[−] fooblaster 29d ago

Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!

[−] altruios 28d ago

I have moved through the local models at this size.

This one is by far the most capable. I've tried various versions of gemma4.26b, various versions of qwen3.5-27/35b (qwopus's galor),nemotron,phi,glm4.7.

This one is noticeably better as an agent. It's really good at breaking down tasks into small actionable steps, and - where there is ambiguity - asks for clarification. It's reasoning seems more solid than gemma4, tool use, multi-messaging/longer chain thinking.

I am excited to see what other versions of this model people train!

[−] seemaze 29d ago

Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.

[−] abhikul0 29d ago

I hope the other sizes are coming too(9B for me). Can't fit much context with this on a 36GB mac.

[−] 3836293648 28d ago

Qwen3.6 and Gemma4 have the same issue of never getting to the point and just getting stuck in never ending repeating thought loops. Qwen3.5 is still the best local model that works.

[−] jake-coworker 29d ago

This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).

[−] gck1 27d ago

I have a Macbook M3 Max with 128GB of RAM.

How close to Opus 4.6 can I get with this? Realistic, real-world usage. And I mean not sitting there for minutes waiting the model to finish saying hello, or being able to use it for anything more than a pelican riding a bicycle.

I'm asking because I'm always seeing excited replies, then I get excited, then I spend minutes to hours setting up the model and then, after first use I forget it exists for one reason or another.

Can I get any realistic use out of this?

[−] cpburns2009 28d ago

Anyone else getting gibberish when running unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ4_XS on CUDA (llama.cpp b8815)? UD-Q4_K_XL is fine, as is Vulkan in general.

[−] KronisLV 28d ago

I wonder how this one compares to Qwen3 Coder Next (the 80B A3B model), since you'd think that even though it's older, it having more parameters would make it more useful for agentic and development use cases: https://huggingface.co/collections/Qwen/qwen3-coder-next

[−] Glemllksdf 29d ago

I tried Gemma 4 A4B and was surprised how hart it is to use it for agentic stuff on a RTX 4090 with 24gb of ram.

Balancing KV Cache and Context eating VRam super fast.

[−] qazplm17 27d ago

Just tried to use qwen3.6-35b-a3b-bf16 + omlx running a pi session to use my HN cli to do a sentiment analysis on this story and opus4.7 story. I’m getting ~40tk/s on a M3 Ultra Mac Studio and the tool use consistency has been held up well. Even when passing 100k tokens, the session was still going strong. Here is the full sentiment analysis report it produced:

https://gist.github.com/duh17/2db5351da026cec4bd4f46e169e75e...

Here is the full session:

https://pi.dev/session/#c3d003becb1bfcc7ffbca04e89e1adf8

This is by far my smoothest agentic session using a local model of any size. The output quality and speed has really struct the right balance. Very impressive release

[−] codeugo 28d ago

Are we going to get to the point where a local model can do almost what sonnet 4.6 can do?

[−] npodbielski 28d ago

I am not sure. I tested it locally on my Desktop Framework and it so far it seem to giving me worse answers then Qwen 3.5. Maybe it is because I am chatting with models in my language instead of enlish or maybe it is optimised for coding instead.

I asked it to give me instruction on how to create SSH key and it tried to do it instead of just answering.

https://internetexception.com/2026/04/16/testing-qwen-3-6/

[−] adrian_b 29d ago

Available for download:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B

[−] poglet 28d ago

Can this run on a PC with 16GB graphics card or a 24GB Macbook Pro? I'm not familiar with how Mixture-of-Experts models differ from standard models.

[−] incomingpain 29d ago

Wowzers, we were worried Qwen was going to suffer having lost several high profile people on the team but that's a huge drop.

It's better than 27b?

[−] lopsotronic 29d ago

Dangit, I'll need to give this a run on my personal machine. This looks impressive.

At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.

The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].

[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".

[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.

[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.

[−] 999900000999 29d ago

Looking to move off ollama on Open Suse tumbleweed.

Should I use brew to install llma.ccp or the zypper to install the tumbleweed package?

[−] dataflow 29d ago

I'm a newbie here and lost how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include 
  int m

I get nonsensical autocompletions like:

  #include 
  int m

What is going on?

[−] fred_is_fred 29d ago

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

[−] nurettin 29d ago

I tried the car wash puzzle:

You want to wash your car. Car wash is 50m away. Should you walk or go by car?

> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.

Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.

Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:

> Double check your logic for errors

Then I created a new dialog and asked the puzzle and it responded:

> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.

Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.

[−] cyrialize 28d ago

My last laptop was a used 2012 T530.

My current is a used M1 MBP Pro with 16GB of ram.

I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.

Although, part of me wants to see how far I could get with my trusty laptop.

[−] kombine 29d ago

What kind of hardware (preferably non-Apple) can run this model? What about 122B?

[−] psim1 28d ago

(Please don't downvote - serious question) Are Chinese models generally accepted for use within US companies? The company I work for won't allow Qwen.

[−] ghc 29d ago

how does this compare to gpt-oss-120b? It seems weird to leave it out.

[−] giantg2 28d ago

I cant wait to see some smaller sizes. I would love to run some sort of coding centric agent on a local TPU or GPU instead of having to pay, even if it's slower.

[−] tristor 29d ago

I'm disappointed they didn't release a 27B dense model. I've been working with Qwen3.5-27B and Qwen3.5-35B-A3B locally, both in their native weights and the versions the community distilled from Opus 4.6 (Qwopus), and I have found I generally get higher quality outputs from the 27B dense model than the 35B-A3B MOE model. My basic conclusion was that MoE approach may be more memory efficient, but it requires a fairly large set of active parameters to match similarly sized dense models, as I was able to see better or comparable results from Qwen3.5-122B-A10B as I got from Qwen3.5-27B, however at a slower generation speed. I am certain that for frontier providers with massive compute that MoE represents a meaningful efficiency gain with similar quality, but for running models locally I still prefer medium sized dense models.

I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.

[−] btbr403 29d ago

Planning to deploy Qwen3.6-35B-A3B on NVIDIA Spark DGX for multi-agent coding workflows. The 3B active params should help with concurrent agent density.

[−] syntaxing 28d ago

Is it worth running speculative decoding on small active models like this? Or does MTP make speculative decoding unnecessary?

[−] amelius 28d ago

Looks like they compare only to open models, unfortunately.

As I am using mostly the non-open models, I have no idea what these numbers mean.

[−] logicallee 28d ago

What kind of hardware does this require to run locally, and how many tokens/seconds does it produce?

[−] zshn25 29d ago

What do all the numbers 6-35B-A3B mean?

[−] yieldcrv 29d ago

Anybody use these instead of codex or claude code? Thoughts in comparison?

benchmarks dont really help me so much

[−] zoobab 29d ago

"open source"

give me the training data?

[−] solomatov 28d ago

Did anyone try it and Gemma 4? Does it feel that it's better than Gemma 4?

[−] andy_ppp 28d ago

Do we know if other models have started detecting and poisoning training/fine tuning that these Chinese models seem to use for alignment, I’d certainly be doing some naughty stuff to keep my moat if I was Anthropic or OpenAI…

Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)

532 comments