StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles) (app.uniclaw.ai)

by skysniper 84 comments 175 points

[−] james2doyle 44d ago
None of the Qwen 3.5 models seem present? I’ve heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too.

I would also be interested to see "KAT-Coder-Pro-V2", as they brag about its benchmarks on these kinds of bots as well

[−] Aerroon 44d ago
If they use OpenRouter pricing then the Qwen3.5 models are going to be poor value.

The Qwen3.5 27B model on OR is $1.56/million tokens out (it used to be $2.4/mil).

Meanwhile Minimax M2.7 (a much larger model) is $1.2/mil out.

The smaller and medium tier Qwen3.5 models are only really cost effective if you run them yourself.

[−] james2doyle 43d ago
Oh, I never noticed that. Good to call out. But that would put it much closer to Minimax M2.7 in terms of price than to the likes of Mimo V2 Pro and Gemini Flash 3 Preview, which are both on the list
[−] p1necone 44d ago
Is Minimax M2.7 better than Qwen3.5 27B, or is it just bigger?
[−] kdasme 44d ago
Minimax M2.7 is similar to Sonnet in my tests. This is the first non-OAI/Anthropic model I've used for coding. It does require more steering, though.
[−] wg0 44d ago
More steering than Sonnet? What is your experience?
[−] wilj 44d ago
I'm about 2 days into transitioning, using MiMo V2 Pro in place of Opus and MiniMax M2.7 in place of Sonnet.

I'm finding that the extra "hand holding" that MiMo and MiniMax need isn't really "extra." The Anthropic models happily agree to a plan and then do something else entirely way too often.

With MiMo and MiniMax I'm just spreading the attention throughout the day instead of big spikes of frustration figuring out where Claude went off the rails.

[−] wg0 43d ago
Thanks for responding. So you're using MiMo V2 Pro to plan and then asking MiniMax M2.7 to read that plan file and execute? Or what does the workflow look like?

Pi/Opencode/Kilocode? Just curious.

I am using Opencode mostly and thinking of abandoning Copilot, so I'm looking for something similar.

[−] wilj 40d ago
Sorry for the late reply, but yeah, that's how my workflow looks. Lately, though, I'm leaning more on just MiMo V2 Pro; it's fast, and cheap enough. And I'm using OpenCode.
[−] Aerroon 43d ago
Yes, it's significantly better.
[−] ipython 44d ago
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters.

Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked: literally the first one I clicked on)

The task was:

> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.

Reading through the description of the top-rated model (stepfun), I saw that it stated:

> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.

Oh cool! Sounds great and would be commiserate with the 7/10 score given for the task! However, the next sentence:

> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.

So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.

Ok, closed that tab.

[−] skysniper 44d ago
I know, that was indeed a bad judge move. I've manually checked tens of tasks so far, and that one is one of the worst... I would say check a few more; the judge has some noise but in general did a good job IMO
[−] ipython 43d ago
Why not re-run your analysis with improved judging criteria?
[−] selcuka 44d ago
Reminded me of the XKCD [1] that points out the problem with average scores.

[1] https://xkcd.com/937/

[−] chrisweekly 44d ago
"commiserate" - did you mean "commensurate"?
[−] ipython 44d ago
Sorry, yes. I was typing quickly
[−] creationcomplex 44d ago
At that point commiserations were in order
[−] WhitneyLand 44d ago
StepFun is an interesting model.

If you haven’t heard of it yet there’s some good discussion here: https://news.ycombinator.com/item?id=47069179

[−] tarruda 44d ago
Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that have released a base checkpoint for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...

- https://github.com/stepfun-ai/SteptronOss

[−] lostmsu 44d ago
Tuned Qwen 3.5 27B beats Step 3.5 on almost all benchmarks, so the point about the size class is moot.
[−] tempaccount420 44d ago
Benchmarks aren't what decide the "size class". Bigger size means more knowledge. Also, Qwen 3.5 27B is a dense model, so all 27B parameters are active; StepFun 3.5 Flash has only 11B active parameters.
[−] lostmsu 44d ago

> Bigger size means more knowledge.

Qwen 3.5 27B beats StepFun 3.5 Flash on GPQA Diamond too, so probably no.

[−] tarruda 44d ago
Benchmarks don't tell the whole story. For one-shot coding tasks, I found Step 3.5 Flash to be stronger even than Qwen 3.5 397B.
[−] anentropic 43d ago
Benchmarks don't tell the whole story... for that you need anecdotes from random HN posters :)
[−] skysniper 44d ago
Thanks for the info. Before running the bench I had only tried it on arena.ai-type tasks, and it wasn't impressive. I didn't expect it to be that good at agentic tasks.
[−] hadlock 44d ago
According to openrouter.ai, StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is about 5% the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

[−] NitpickLawyer 44d ago

> the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.

[−] MaxikCZ 44d ago
Exactly. When I read the headline I thought: "Ofc it is, it's free."
[−] skysniper 44d ago
I should have clarified I didn't use the free version...
[−] arjie 44d ago
I used to use these various models for my claw-like, and they had a habit of taking way more agent rounds and way more tokens to produce something that Sonnet would produce from far less. My total cost to do useful things ended up being the same.
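
Back-of-the-envelope with made-up numbers (illustrative only, not measured from my logs):

    # Illustrative only: hypothetical per-task token counts.
    sonnet_price = 15.00        # $/M output tokens
    cheap_price = 0.75          # ~5% of Sonnet's price
    sonnet_tokens = 50_000      # tokens Sonnet needs to finish the task
    cheap_tokens = 1_000_000    # 20x the rounds/tokens for the cheap model

    print(sonnet_price * sonnet_tokens / 1e6)   # $0.75
    print(cheap_price * cheap_tokens / 1e6)     # $0.75 -- same total cost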
[−] dmazin 44d ago
Why do half the comments here read like AI trying to boost some sort of scam?
[−] Capricorn2481 44d ago
Because there's absolutely nothing stopping that from happening. There are bots on Reddit, and there are of course bots on here, a VPN-friendly site where you don't even need an email. But a lot of people don't want to admit it.
[−] grimm8080 44d ago
Yet when I tried it, it did abysmally compared to Gemini 2.5 Flash
[−] skysniper 44d ago
What kind of tasks did you try?
[−] smallerize 44d ago
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.
[−] mgw 44d ago
Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.

Pricing is essentially the same:

- MiMo V2 Flash: $0.09/M input, $0.29/M output

- Step 3.5 Flash: $0.10/M input, $0.30/M output

MiMo scores 41 vs Step's 38 on the Artificial Analysis Intelligence Index, but 49 vs Step's 52 on their Agentic Index.
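
A quick, made-up "index points per dollar" calc from the numbers above (the 80% output-token mix is an assumption, and this is not how the leaderboard scores anything):

    # Hypothetical metric, just to compare the two price/quality pairs quoted above.
    def blended_price(inp, out, out_share=0.8):  # assumes output-heavy agent traffic
        return (1 - out_share) * inp + out_share * out

    mimo = 49 / blended_price(0.09, 0.29)   # Agentic Index / ($ per M tokens)
    step = 52 / blended_price(0.10, 0.30)
    print(round(mimo), round(step))         # 196 vs 200 -- essentially a wash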

[−] skysniper 44d ago
I will try to add it. But I doubt it works well, because MiMo V2 Pro is beaten by StepFun even on the performance leaderboard (where price is not a factor), so I expect MiMo V2 Flash to perform even worse.
[−] ygouzerh 44d ago
MiMo V2 Pro seems to be quite widely used, per OpenRouter's stats (second after StepFun), so it could indeed be interesting to see the difference!

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

[−] nl 44d ago
MiMo Flash matched MiMo Pro on https://sql-benchmark.nicklothian.com/?#all-data at double the speed and for $0.003 instead of $0.07
[−] throwa356262 44d ago
Interesting, I found the Pro version to be very capable.

If StepFun is even better, then Chinese models are getting really good.

[−] azmenak 44d ago
This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.
[−] skysniper 44d ago
Yeah, but I'm not using the free version for the benchmark...
[−] clausewitz 44d ago
I'm not seeing DeepSeek mentioned very often. I've been using it for OpenClaw, very cheaply I might add, with great success. I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.
[−] wg0 44d ago
Which DeepSeek exactly, and what do you use it for? Just curious.
[−] skysniper 44d ago
Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.
[−] throwa356262 44d ago
Gemini 2.5 Pro was the best Gemini; it has gone downhill since
[−] hypercube33 43d ago
I used Sonnet and Opus 4.6 for a month and they flat-out ignored skills and rules, and when asked they said they knew better or were being lazy.
[−] sunaookami 44d ago
Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and its creative writing is “good enough” for me (more “natural Claude-level”, not robotic GPT-slop level). But it makes some grave mistakes (had some Hanzi in the output once, and typos in words), so it may be fine for “simple” agentic workflows, but it's definitely not made for programming or long-form writing.
[−] admiralrohan 44d ago
What kind of creative writing are you doing? Fiction or non-fiction like blog posts?
[−] sunaookami 44d ago
Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship, and general creativity. Some models refuse every new turn after reading real 4chan threads ;) Claude is surprisingly good at this, while GPT fails spectacularly and Gemini is just lazy (and barely usable, since it's constantly overloaded). Qwen (the coder model from Qwen CLI, so Qwen 3.5) is also very good but sadly not usable in Pi (they detect and block calls outside their CLI).
[−] admiralrohan 43d ago
Interesting. Are you running something like an Autoresearch loop for writing fiction? How would the agent determine whether the output is good, since this is subjective?
[−] sunaookami 43d ago
I don't have any advanced setup; creative writing is always subjective. I just one-shot it most of the time.
[−] skysniper 44d ago
It's actually pretty good at OpenClaw-type tasks for non-technical users: lots of tool calls, some simple programming
[−] grigio 44d ago
I like StepFun 3.5 Flash, a good tradeoff
[−] yieldcrv 44d ago
People aren't just using Claude models anymore? That's nice to see
[−] skysniper 44d ago
I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
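
For the curious, here is a minimal sketch of the fitting step (a simplified illustration with toy data, not the production code):

    # Grouped Plackett-Luce fit plus percentile-bootstrap CIs (toy example).
    import numpy as np

    def fit_plackett_luce(rankings, n_models, steps=500, lr=0.1):
        """MLE of model strengths; each ranking is an index array, winner first."""
        theta = np.zeros(n_models)
        for _ in range(steps):
            grad = np.zeros(n_models)
            for r in rankings:
                for k in range(len(r) - 1):
                    s = theta[r[k:]]
                    p = np.exp(s - s.max())
                    p /= p.sum()          # P(each remaining model wins stage k)
                    grad[r[k:]] -= p
                    grad[r[k]] += 1.0     # the observed stage winner
            theta += lr * grad / len(rankings)
            theta -= theta.mean()         # strengths only identifiable up to a shift
        return theta

    def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05, seed=0):
        """Resample battles with replacement, refit, take percentile CIs."""
        rng = np.random.default_rng(seed)
        fits = []
        for _ in range(n_boot):
            idx = rng.integers(len(rankings), size=len(rankings))
            fits.append(fit_plackett_luce([rankings[i] for i in idx], n_models))
        return np.percentile(fits, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)

    # Toy battles over 3 models: [0, 2, 1] means model 0 beat model 2 beat model 1.
    battles = [np.array([0, 2, 1]), np.array([0, 1, 2]), np.array([2, 0, 1])]
    theta = fit_plackett_luce(battles, 3)
    lo, hi = bootstrap_ci(battles, 3)

The per-battle orderings come straight from the judge; only the relative order enters the model, so noise in the judge's raw scores washes out as long as the ordering is right.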

I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.