The key point of the article is that token structure interpretation is a training-time concern, not just an input/output processing concern (which already leads to plenty of inconsistency and fragmentation on its own!). That means two things: training stakeholders at model development shops need to be closely involved in the tool/syntax development process, which creates friction and slowdowns; and any current improvements or standardizations in the way we do structured LLM I/O will necessarily be adopted on the training side only after a lag of months or years, given the time it takes to develop and train new models.
That makes for a pretty thorny mess ... and that's before we get into disincentives for standardization (standardization risks big AI labs' moat/lockin).
One of the most relevant posts about AI on HN this year. It's not hype-y, but it's imperative to discuss.
I find it strange that the industry hasn't converged on at least a somewhat standardized format, but I guess despite all the progress we're still in the very early days...
In our benchmarks we exclusively use a custom harness for measuring tool capability. It has the common tools that any harness would have, like a thin wrapper around shell commands, basic file editors, etc., but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, especially Anthropic models, and improving with each release. I think a standardized format will become less and less important over time. Benchmarks at https://gertlabs.com
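For illustration, here's a minimal sketch of the kind of thin shell-wrapper tool such a harness might expose. The schema, names, and truncation choices are hypothetical, not the actual harness:

```python
import json
import subprocess

# Hypothetical tool schema, in the JSON-schema style most harnesses use.
SHELL_TOOL = {
    "name": "shell",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def run_shell(arguments: dict) -> str:
    """Execute the command with a timeout and return a JSON result."""
    proc = subprocess.run(
        arguments["command"], shell=True,
        capture_output=True, text=True, timeout=60,
    )
    return json.dumps({
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-4000:],   # truncate output, as most harnesses do
        "stderr": proc.stderr[-4000:],
    })
```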
This is backwards. If you think the models are capable of adapting to any format, then they will have an easier time adapting to the more popular and more common formats, which will eventually become de-facto standards.
The only case where a standard wouldn't win is where models can only support their baked-in format, and even that could be solved by adopting a standard format.
This is one of the first tech waves where I feel like I'm on the very, very ground floor for a lot of the exploration, and it only feels like people have been paying closer attention in the last year. I can't imagine too many 'standard' standards becoming a standard that quickly.
It's new enough that Google seems to be throwing pasta against the wall and seeing what products and protocols stick. Antigravity, for example, seems too early to me; I think they just came out with another type of orchestrator, but the whole field seems to be exploring at the same time.
Everyone and their uncle is making an orchestrator now! I take a very cautious approach lately: I haven't been loading up my tools (agents, IDEs, browsers, phones) with too much extra stuff, because as soon as I switch something, or something new comes out that doesn't support something I built a workflow around, the tool either becomes inaccessible to me or carries a bigger learning curve than I have the patience for.
I've been a big proponent of trying to get all these things working locally for myself (I need to bite the bullet on some beefy video cards finally), and even just getting tool calls to work with some Qwen models has been so counterintuitive.
I guess I fail to see why this is such a problem. Yes, it would be nice if the wire format were standardized or had a standard schema description, but is writing a parser that handles several formats actually a difficult problem? Modern models could probably whip up a "libToolCallParser" with bindings for all popular languages in an afternoon. You could probably also have an automated workflow for adding any new formats with minimal fuss. An annoyance, yes, but it does not seem like a really "hard" problem. It seems like more of a social problem, that open source hasn't coalesced around a library that handles it easily yet. Or am I missing something?
There already exist products like LiteLLM that adapt tool calling to different providers. FWIW, incompatibility isn't just an open-source problem: OpenAI and Anthropic also use different syntax for tool registration and invocation.
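A hedged sketch of what the core of such a "libToolCallParser" might look like. The two formats handled below are simplified stand-ins for the JSON-style and XML-ish variants, not any provider's exact syntax:

```python
import json
import re
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def parse_tool_calls(text: str) -> list[ToolCall]:
    """Normalize several (simplified) wire formats into one structure."""
    calls = []
    # JSON-style: a bare object {"name": ..., "arguments": {...}}
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and "name" in obj:
            args = obj.get("arguments", {})
            if isinstance(args, str):  # some providers double-encode arguments
                args = json.loads(args)
            calls.append(ToolCall(obj["name"], args))
            return calls
    except json.JSONDecodeError:
        pass
    # XML-ish: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
    for m in re.finditer(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        obj = json.loads(m.group(1))
        calls.append(ToolCall(obj["name"], obj.get("arguments", {})))
    return calls
```

The annoying part in practice is less the happy path shown here and more the per-provider quirks (double-encoded arguments, truncated streams, interleaved prose), which is presumably where the social coordination problem bites.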
I would guess that lack of standardization of what tools are provided by different agents is as much of a problem as the differences in syntax, since the ideal case would be for a model to be trained end-to-end for use with a specific agent and set of tools, as I believe Anthropic do. Any agent interacting with a model that wasn't specifically trained to work with that agent/toolset is going to be at a disadvantage.
Presumably the hosting services are resolving all of this in their OpenAI/Anthropic compatibility layer, which is what most tools are using. So this is really just a problem for local engines that have to do the same thing but are expected to work right away for every new model drop.
Does anyone know why there hasn’t been more widespread adoption of OpenAI’s Harmony format? Or will it just take another model generation to see adoption?
It's a good question. Opinionated* answer: it's the wackiest one by far, and I'm not sure it's actually good in the long run. It's much more intense than the other formats, and I don't know quite how to describe this, but I think it puts the model in a weird place where it has to think in this odd framework of channels, and the channel names also shade how it thinks about what it's doing.
It's less of a problem than I'm making it sound, obviously the GPTs are doing just fine. But the counterexample of not having such a complex and unique format and still having things like parallel tool calls has also played out just fine.
When I think on it, the incremental step that made the more classical formats work might have been the shift toward the model having dedicated special tokens like ... That helped a ton, because you could shift to JSON-ifying stuff inside the parameters instead of having the LLM do it.
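Roughly, the pattern being described looks like this. The sentinel token names are hypothetical, and the Harmony line is an approximation of the published spec rather than verbatim:

```python
# Hypothetical sentinel-token format: special tokens delimit the call, and
# the arguments are plain JSON, so the model never has to invent syntax.
classic_example = (
    "<|tool_call|>"
    '{"name": "read_file", "arguments": {"path": "main.py"}}'
    "<|/tool_call|>"
)

# Contrast with Harmony, which (roughly, per the published spec) routes the
# same call through named channels and several more special tokens:
harmony_example = (
    "<|start|>assistant<|channel|>commentary to=functions.read_file"
    '<|constrain|>json<|message|>{"path": "main.py"}<|call|>'
)
```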
Also, FWIW, the lore on Harmony was that Microsoft pushed it on them to avoid issues with 2023 Bing and prompt injection and such. The MS VP for Bing claimed this, so I'm not sure how true it is. Not that he's unreliable, he's an awesome guy; just, language is loose. Maybe he meant the "concept of channels" and not Harmony in toto. Pointing it out because it may be an indicator that it was rushed and over-designed, which would explain its relative complexity compared to ~anyone else's.
* I hate talking about myself, but hate it less than being verbose and free-associating without some justification of relevant knowledge: quit Google in late 2022 to build a Flutter all-platform LLM client, based on llama.cpp / any 3rd party provider you can think of. Had to write Harmony parsing twice, as well as any other important local model format you can think of.
Ironically, LLMs solve the MxN problem he's complaining about. He wants to get rid of the problem entirely, but fails to see the value in the seemingly pointless differences.
It's the same kind of hubris that asks why we don't all speak one language. In the future we will all speak one language, and we will all speak either our own or a DSL shared by only a few others: in America we will all speak English, and in Japan even the tourists will all speak Japanese. Very few will know English, but some will know it better than anyone.
Useful article, I was fighting with GLM's tool calling format just last night. Stripping and sanitization to make it compatible with my UI consistently has been... fun.
I wonder if stuffing tool call formatting into an engram layer (see DeepSeek's engram paper) that could be swapped at runtime would be a useful solution here.
The idea would be to encode tool calling semantics once on a single layer, and inject as-needed. Harness providers could then give users their bespoke tool calling layer that is injected at model load-time.
Dunno, seems like it might work. I think most open source models can have an engram layer injected (some testing would be required to see where the layer best fits).
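A rough sketch of what load-time injection could look like, assuming a Llama-style layer layout. The layer index, the weights file, and the premise that a harness vendor publishes such a layer are all hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical: a tool-calling "engram" layer shipped by a harness vendor.
ENGRAM_PATH = "harness_toolcall_layer.pt"  # assumed file, not a real artifact
LAYER_IDX = 20                             # placement would need empirical testing

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Swap the chosen transformer block's weights at load time. The attribute
# path below assumes a Llama/Qwen-style module layout.
engram_state = torch.load(ENGRAM_PATH, map_location="cpu")
model.model.layers[LAYER_IDX].load_state_dict(engram_state)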
The engram idea is actually technically clever, but IMO it approaches the solution bottom-up, while Louf's real argument is a top-down view. His solution (declarative specs) solves that by centralizing the spec, making it versioned and composable, independent of any actual model.
Engram layers just move the coordination problem earlier and lock it in. Coordination problems between models and providers would still exist, requiring a layer injection in each open source model and another variant produced for each. Users would still need to choose between "Qwen-8b" and "Qwen-8b-engram", multiplied across model families and sizes. Is that cleaner?
The issue with a top-level spec, that I can see, is that models fall back to their training when it comes to tools. This is why I recommended the engram approach, because as far as I can tell the problem is a model problem not a systems problem.
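For reference, the declarative-spec side of this exchange might look something like the sketch below: a versioned, model-independent description of the call syntax that a harness or constrained-decoding library could compile into a grammar. All field names are hypothetical, not Louf's actual proposal:

```python
# A hypothetical versioned, model-independent tool-call spec. A harness (or
# a constrained-decoding library) would compile this into the concrete
# grammar at inference time, rather than each model baking its own syntax
# in at training time.
TOOL_CALL_SPEC = {
    "version": "1.0",
    "call": {
        "open": "<tool_call>",
        "close": "</tool_call>",
        "payload": "json",  # arguments are a JSON object
        "schema": {"name": "string", "arguments": "object"},
    },
    "result": {"open": "<tool_result>", "close": "</tool_result>"},
}
```

The fallback-to-training objection above is exactly the catch: constrained decoding can force the spec's syntax, but it can't force the model to be *good* at a syntax it never saw in training.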
MCP is the wire format between agent and tool, not the format the model itself uses to emit the call. That part (Harmony, JSON, XML-ish) is still model-specific. So the M×N the article describes is really two problems stacked — MCP only solves the lower half.
Also in practice Claude Code, Cursor and Codex handle the same MCP tool differently — required params, tool descriptions, response truncation. So MCP gives you the contract but the client UX still leaks.
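To make the "two problems stacked" point concrete, a hedged sketch: the lower half is a JSON-RPC 2.0 `tools/call` message on the wire, which MCP does specify, while the upper half is whatever the model was trained to emit. The XML-ish emission below is an illustrative stand-in, not any particular model's syntax:

```python
# Lower half: what the agent sends to the tool server over MCP (JSON-RPC 2.0).
mcp_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Tokyo"}},
}

# Upper half: what the model actually emitted to request that call. This part
# is model-specific; an XML-ish example (tag names illustrative):
model_emission = (
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
    "</tool_call>"
)
```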
In my experience it's actually very doable to do reliable tool calling with a generic response format across models. You just need to disable native tool calling completely and provide a clearly defined response/tool format that conforms well to pretraining across a variety of models (e.g. XML-like syntaxes).
For example:
``Let me take a look at that``
The hard part is building a streaming XML parser that handles these responses robustly, can adjust for edge cases, and normalizes predictable mishaps in the history in order to ensure continued response-format adherence.
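A minimal sketch of such a streaming parser, assuming a hypothetical `<tool_call>` tag syntax. A production version would also need the edge-case handling and history normalization mentioned above:

```python
import json
import re

class StreamingToolCallParser:
    """Incrementally split a streamed response into prose and tool calls."""

    TAG = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str):
        """Feed a chunk; yield ('text', str) and ('call', dict) events."""
        self.buffer += chunk
        while True:
            match = self.TAG.search(self.buffer)
            if match:
                if match.start() > 0:
                    yield ("text", self.buffer[:match.start()])
                yield ("call", json.loads(match.group(1)))
                self.buffer = self.buffer[match.end():]
                continue
            # Hold back anything that could be the start of a tag, so we
            # never emit a half-open "<tool_ca" as prose.
            cut = self.buffer.rfind("<")
            safe = self.buffer if cut == -1 else self.buffer[:cut]
            if safe:
                yield ("text", safe)
                self.buffer = self.buffer[len(safe):]
            return

# Usage: for kind, payload in parser.feed(chunk): ...
```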
The native way to skip all that is to train a small thingy to map hidden state -> the token/thingy you care about, once per model family, or just do it once and Procrustes the states from the model you're using onto whatever you made the map for.
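A hedged sketch of that idea with placeholder data: fit a linear probe on model A's hidden states once, then align model B's states onto A's with an orthogonal Procrustes map and reuse the probe. The dimensions, data, and probe task are all stand-ins:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

# Placeholder data: hidden states (n examples x d dims) with known labels
# (e.g. "this position starts a tool call"), collected from model A.
H_a = rng.normal(size=(1024, 256))
y = rng.integers(0, 2, size=1024)

# Train the probe once on model A's states (least-squares linear classifier).
W, *_ = np.linalg.lstsq(H_a, y.astype(float), rcond=None)

# Simulate model B as a rotated copy of model A's states on the same inputs.
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))   # random orthogonal matrix
H_b = H_a @ Q + 0.01 * rng.normal(size=(1024, 256))

# Find R minimizing ||H_b @ R - H_a||_F, then reuse model A's probe.
R, _ = orthogonal_procrustes(H_b, H_a)
preds = (H_b @ R) @ W > 0.5
```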
Feedback: I don't usually comment on formatting, but that fat indent is too much. I applied "hide distracting items" to the graphic, and the indent is still there. Reader works.
> Ironically LLMs solve the MxN problem he's complaining about
Enlighten me please