Oh I never noticed that. Good to call out. But that would put it much closer to MiniMax M2.7 in terms of price than to the likes of MiMo V2 Pro and Gemini Flash 3 Preview, which are both on the list
I'm about 2 days into transitioning, using MiMo V2 Pro in place of Opus and MiniMax M2.7 in place of Sonnet.
I'm finding that the extra "hand holding" that MiMo and MiniMax need isn't really "extra." The Anthropic models happily agree to a plan and then do something else entirely way too often.
With MiMo and MiniMax I'm just spreading the attention throughout the day instead of big spikes of frustration figuring out where Claude went off the rails.
Thanks for responding. So you're using MiMo V2 Pro to plan and then asking MiniMax M2.7 to read that plan file and execute? Or what does the workflow look like?
Pi/Opencode/Kilocode?
Just curious.
I'm mostly using OpenCode and thinking of abandoning Copilot, so I'm looking for something similar.
Sorry for the late reply, but yeah, that's how my workflow looks. I'm also leaning more on just MiMo V2 Pro now; it's fast and cheap enough. And I'm using OpenCode.
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters.
Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry picked- literally the first one I clicked on)
The task was:
> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.
Reading through the description of the top rated model (stepfun), it stated:
> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.
Oh cool! Sounds great and would be commensurate with the score given of 7/10 for the task! However, the next sentence:
> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.
So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.
Ok, closed that tab.
I know, that was indeed a bad judge move. I've manually checked tens of tasks so far, and that one is one of the worst... I would say check a few more; the judge has some noise, but in general it did a good job IMO.
I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.
Benchmarks aren't what decides the "size class". Bigger size means more knowledge. Also, Qwen 3.5 27B is a dense model with 27B active parameters, while StepFun 3.5 Flash has 11B active parameters.
Thanks for the info. Before running the bench I only tried it on arena.ai-type tasks and it wasn't impressive. I didn't expect it to be that good at agentic tasks.
According to openrouter.ai it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is ~5% of the price of Sonnet.
I used to use these various models for my claw-like setup, and they had a habit of taking way more agent rounds and way more tokens to produce something that Sonnet would produce from far less. My total cost to do useful things ended up being the same.
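Rough numbers to illustrate (completely made up, just to show how a ~12x cheaper per-token model can wash out if it burns ~12x the output tokens per task):

    # made-up numbers: a model ~1/12 the output price can still cost the same
    # per finished task if it burns ~12x the output tokens getting there
    sonnet_price = 15.0 / 1_000_000      # $/output token (Sonnet-class pricing)
    cheap_price  = 1.2  / 1_000_000      # $/output token (MiniMax-class pricing)

    sonnet_tokens = 6 * 4_000            # 6 agent rounds x 4k output tokens each
    cheap_tokens  = 30 * 10_000          # 30 agent rounds x 10k output tokens each

    print(f"sonnet: ${sonnet_price * sonnet_tokens:.2f} per task")   # $0.36
    print(f"cheap:  ${cheap_price  * cheap_tokens:.2f} per task")    # $0.36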
Because there's absolutely nothing stopping that from happening. There are bots on Reddit, and there are of course bots on here, a VPN-friendly site where you don't even need an email. But a lot of people don't want to admit it.
I will try to add it. But I doubt it works well, because MiMo V2 Pro is beaten by stepfun even on the performance leaderboard (price is not a factor there), so I expect MiMo V2 Flash to perform even worse.
I'm not seeing DeepSeek mentioned very often. I've been using it for OpenClaw, very cheaply I might add, with great success. I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.
Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.
Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and the creative writing is "good enough" for me (more "natural Claude-level" and not robotic GPT-slop level), but it makes some grave mistakes (had some Hanzi in the output once, and typos in words). So it may be good for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.
Fiction. One of my "benchmarks" is giving the model a bunch of (self-made) text and having it simulate a 4chan thread about it. This tests tool use (calling the APIs), some skills, censorship and general creativity. Some models refuse every new turn after reading real 4chan threads ;)
Claude is surprisingly good at this, while GPT fails spectacularly and Gemini is just lazy (and barely usable since it's constantly overloaded). Qwen (the coder model from Qwen CLI, so Qwen 3.5) is also very good, but sadly not usable in Pi (they detect and block calls from outside their CLI).
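For anyone wanting to try something similar, a minimal sketch against OpenRouter's OpenAI-compatible endpoint (not my exact setup; the model slug, file name, and prompt wording are just placeholders, and the tool-calling/skills parts are omitted):

    import os, requests

    source_text = open("my_story.txt").read()   # placeholder: your own self-written text

    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "stepfun-ai/step-3.5-flash",   # placeholder model slug
            "messages": [
                {"role": "system",
                 "content": "Write a plausible imageboard-style thread of anonymous users reacting to the text the user provides."},
                {"role": "user", "content": source_text},
            ],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])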
Interesting. Are you running something like Autoresearch loop for writing fiction? How will the agent determine whether the output is good as this is subjective.
I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.
The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 cost-effectiveness, #5 performance.
Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
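For the curious, the fitting step is conceptually something like this (an illustrative sketch, not the Arena's actual code; the model names, regularization constant, and per-task ranking format are simplified placeholders):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(scores, rankings, index):
        # Plackett-Luce: P(ranking) = prod_j exp(s_j) / sum_{l >= j} exp(s_l)
        nll = 0.0
        for ranking in rankings:
            s = np.array([scores[index[m]] for m in ranking])
            for j in range(len(s) - 1):
                nll -= s[j] - np.logaddexp.reduce(s[j:])
        return nll

    def fit_plackett_luce(rankings, models):
        index = {m: i for i, m in enumerate(models)}
        x0 = np.zeros(len(models))
        # scores are only identified up to an additive constant,
        # so a tiny L2 penalty pins them down
        obj = lambda x: neg_log_likelihood(x, rankings, index) + 1e-3 * np.sum(x ** 2)
        res = minimize(obj, x0, method="L-BFGS-B")
        return dict(zip(models, res.x))

    def bootstrap_cis(rankings, models, n_boot=200, seed=0):
        # resample whole rankings with replacement, refit, take percentiles
        rng = np.random.default_rng(seed)
        draws = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(rankings), len(rankings))
            draws.append(fit_plackett_luce([rankings[i] for i in idx], models))
        return {m: (np.percentile([d[m] for d in draws], 2.5),
                    np.percentile([d[m] for d in draws], 97.5)) for m in models}

    # toy usage: each ranking is best-to-worst for one task
    models = ["opus-4.6", "stepfun-3.5-flash", "gemini-3.1-pro"]
    rankings = [
        ["opus-4.6", "stepfun-3.5-flash", "gemini-3.1-pro"],
        ["stepfun-3.5-flash", "opus-4.6", "gemini-3.1-pro"],
        ["opus-4.6", "gemini-3.1-pro", "stepfun-3.5-flash"],
    ]
    print(fit_plackett_luce(rankings, models))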
I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.
I would also be interested to see "KAT-Coder-Pro-V2", as they brag about their benchmarks for these kinds of bots as well.
The Qwen3.5 27B model on OR is $1.56/million tokens out (it used to be $2.4/mil).
Meanwhile MiniMax M2.7 (a much larger model) is $1.2/mil out.
The smaller and medium-tier Qwen3.5 models are only really cost-effective if you run them yourself.
If you haven’t heard of it yet, there’s some good discussion here: https://news.ycombinator.com/item?id=47069179
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...
I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.
They also released the entire training pipeline:
- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...
- https://github.com/stepfun-ai/SteptronOss
> Bigger size means more knowledge.
Qwen 3.5 27B beats StepFun 3.5 Flash on GPQA Diamond too, so probably no.
https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F
> the most popular model
It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast1.
Pricing is essentially the same:
MiMo V2 Flash: $0.09/M input, $0.29/M output
Step 3.5 Flash: $0.10/M input, $0.30/M output
MiMo has 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but it's 49 vs 52 for Step on their Agentic Index.
If stepfun is even better, then Chinese models are getting really good.