The case for zero-error horizons in trustworthy LLMs (arxiv.org)

by daigoba66 116 comments 79 points

[−] hu3 43d ago

> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.

I think there is a valid insight here that many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than at doing those tasks themselves.

For example, if I give an LLM my database schema and tell it to scan for redundant indexes and point out naming-convention violations, it might do a passable but incomplete job.

But if I tell the LLM to write a Python or Node.js script to do the same thing, I get significantly better results. It's also often faster to generate and run the script than to have the LLM process large SQL files.
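
The kind of throwaway script I mean looks roughly like this (a sketch only; the dump file name and the "column list is a prefix of another index" rule are just illustrative):

    import re
    from collections import defaultdict

    # Flag indexes whose column list is a prefix of another index on the
    # same table -- one common definition of "redundant". Assumes a plain
    # schema dump with one CREATE INDEX statement per line.
    pattern = re.compile(r"CREATE\s+INDEX\s+(\w+)\s+ON\s+(\w+)\s*\(([^)]*)\)",
                         re.IGNORECASE)

    indexes = defaultdict(list)        # table -> [(index_name, [columns])]
    with open("schema.sql") as f:      # hypothetical schema dump
        for line in f:
            m = pattern.search(line)
            if m:
                name, table, cols = m.groups()
                indexes[table].append((name, [c.strip() for c in cols.split(",")]))

    for table, idxs in indexes.items():
        for name_a, cols_a in idxs:
            for name_b, cols_b in idxs:
                if name_a != name_b and cols_b[:len(cols_a)] == cols_a:
                    print(f"{table}: {name_a} {cols_a} is covered by {name_b} {cols_b}")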

[−] plagiarist 43d ago
The dream is probably that the inference software then writes and executes that script rather than relying on text generation alone. Analogous to how a human might cross off pairs of parentheses to check that example.
[−] ubutler 42d ago
ChatGPT already does this, albeit in limited circumstances, through the use of its sandbox environment. Asking GPT in thinking mode to, for example, count the number of “l”s in a long text may see it run a Python script to do so.
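
The sandbox run usually amounts to something this trivial (sketch; the text is whatever you pasted in):

    text = "a long text pasted by the user"   # placeholder input
    print(text.count("l"))                    # exact count, no tokenization involved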

There’s a massive issue with extrapolating to more complex tasks, however: either you run the risk of prompt injection by granting your agent access to the internet, or, more commonly, you hit an exponential degradation in coherence over long contexts.

[−] whateveracct 42d ago
That's because abstraction is compression of information.
[−] grey-area 43d ago
To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching numbers in seconds, write complex code for them, etc.

This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they’re having a conversation with them, i.e. 90% of LLM users.

[−] stratos123 43d ago

> saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.

Why do you think the results of this paper contradict these claims at all?

[−] grey-area 43d ago
A machine which confabulates and cannot count is not a good fit for accounting tasks. They’ll make all sorts of subtle errors which are difficult for humans to notice.
[−] stratos123 43d ago
That wouldn't even necessarily be true if models really "couldn't count", since software exists: if an LLM builds an Excel spreadsheet rather than doing everything manually, it's both much harder for it to mess up and easier to notice and recover when it does. It's even less true given that what this paper actually tests is "LLMs don't have literally perfect accuracy when you make them do increasingly big problems with zero thinking".
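
E.g. if the model writes formulas into the sheet instead of pre-computed numbers, the arithmetic is done by Excel rather than by the model. A rough sketch with openpyxl (the figures and file name are made up):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["Item", "Amount"])
    ws.append(["Widgets", 1250.00])   # made-up figures
    ws.append(["Gadgets", 874.50])
    ws["B4"] = "=SUM(B2:B3)"          # Excel evaluates the formula, not the LLM
    wb.save("report.xlsx")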

(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)

[−] grey-area 40d ago
They really can’t count; that’s not how they work at all. They don’t reason about maths, they predict the most likely output for a given context. That’s sometimes useful but not at all the same thing.
[−] stronglikedan 43d ago

> general public

and the C-suite

[−] orbital-decay 43d ago
Quick sanity check: you're susceptible to pretty irresistible optical illusions that would never fool a VLM - does that mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. use a "dumb" tool, to make sure.

Reminder that "thinking" is an ill-defined term like the others, and the question of whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have a zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans: either treat bugs as bugs and build systems resilient to them, or accept the baseline error rate if it's low enough.

[−] flextheruler 42d ago
Who is hiring anyone to look at a screen and count characters? Don't be disingenuous in your argument. The apt comparison would be the current technique used to accomplish this task, i.e. a pattern-matching algorithm.
[−] pants2 43d ago
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works?

This is well known and not that interesting to me - ask the model to use Python to solve any of these questions and it will get it right every time.

[−] graemefawcett 43d ago
It's not just an issue of tokenization, it's almost a category error. Lisp, accounting, and counting the r's in strawberry are all operations that require state. Balancing ((your)((lisp)(parens))) requires a stack, counting the r's in strawberry requires a register, and counting to 5 requires an accumulator to hold 4.

An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much in a single pass.
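
The stateful versions are of course trivial in code (quick sketch):

    def balanced(s: str) -> bool:
        depth = 0                # the stack collapses to a counter for one paren type
        for ch in s:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:    # closed more than we opened
                    return False
        return depth == 0

    print(balanced("((((()))))"))    # True
    print(balanced("((((())))))"))   # False -- the string from the paper
    print("strawberry".count("r"))   # 3 -- the "register" version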

[−] BoredomIsFun 43d ago

> An LLM is a router and completely stateless aside from the context you feed into it.

Not the latest SSM and hybrid attention ones.

[−] graemefawcett 42d ago
Going from a stateless router to a router with a lossy scratchpad is a step up; I'm still not going to ask it to check my Lisp. That's what linters are for.
[−] wahnfrieden 43d ago
It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
[−] Lerc 43d ago
The r's-in-strawberry task is at a different level than people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.

A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same colour as that word. You are even given a book showing every combination to memorise.

You learn the colours well enough that you can read and write coherently using them.

Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit, and you know that the colour can also be constructed from crimson followed by Disney-blond. Now, do both of those contain chocolate-brown, or just one of them? How many?

It requires exercising memory to do a task that is underrepresented in the training data, because humans simply do not have to do the task at all when the answer can be derived from the question's representation. Humans also don't have the ability that the LLMs need, but the letter representation means they don't need it.

[−] wahnfrieden 43d ago
That’s what makes it a fair evaluation and something that requires improvement. We shouldn’t only evaluate agent skills by what is most commonly represented in training data. We expect performance from them in areas that existing training data may cover poorly. You don’t need to invent an absurdity to find these cases.
[−] Lerc 43d ago
It's reasonable to test their ability to do this, and it's worth working to make it better.

The issue is that people claim the performance is representative of a human's performance in the same situation. That gives an incorrect overall estimation of ability.

[−] azakai 43d ago
I do think this is a tool issue. Here is what the article says:

> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls

The models can count to infinity if you give them access to tools. The production models do this.

Not that the paper is wrong - it is still interesting to measure the core neural network of a model. But modern models use tools.
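
A crude illustration of why the tool route removes the ceiling entirely - the routing below is hard-coded, whereas a real model decides to emit the tool call itself:

    def calculator(expr: str) -> str:
        # the "external tool": exact big-integer arithmetic
        a, b = expr.split("*")
        return str(int(a) * int(b))

    def answer(question: str) -> str:
        if "*" in question:          # stand-in for the model choosing to call the tool
            return calculator(question)
        return "best guess from the weights"

    print(answer("123456789123456789 * 987654321987654321"))   # exact at any size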

[−] irishcoffee 42d ago
So, the tools can count then?

Humans can fly, they just need wings!

[−] azakai 42d ago
It is academically interesting what pure neural networks can do, of course. But when someone goes to Claude and tries to do something, they don't care if it solves the problem using a neural network or a call out to Python. So long as the result is right.

More generally, the ability to use tools is a form of intelligence, just like when humans and crows do it. Being able to craft the right Python script and use the result is non-trivial.

[−] cr125rider 43d ago
Seems like it’s maybe also a tool-steering problem. These models should be reaching for tools to solve factual problems; the LLM should stick to prose.
[−] emp17344 43d ago
I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?
[−] dghlsakjg 43d ago
Does it matter if the LLM can solve the problem or if it knows to use a resource?

There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. Doesn’t mean I’m not solving math problems.

In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?

[−] grey-area 43d ago
It does matter because the LLM doesn’t always know when to use tools (e.g. ask it for sales projections which are similar to something in its weights) and is unable to reason about the boundaries of its knowledge.
[−] azakai 43d ago
It has "outsourced" it to another component, sure, but does that matter?

What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.

[−] stratos123 43d ago
Are you still talking about this paper? No tools were allowed in it.
[−] kenjackson 43d ago
Whenever I see these papers and try the examples, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each successive model gets...

[−] staticshock 43d ago
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
[−] simianwords 43d ago
Can someone produce a single example under 20 characters that fails with the latest thinking model? I can’t seem to reproduce this.
[−] burningion 43d ago
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:

> are the following parenthesis balanced? ((())))

> No, the parentheses are not balanced.

> Here is the breakdown:

    Opening parentheses (: 3
    Closing parentheses ): 4
... following up with:

> what about these? ((((())))

> Yes, the parentheses are balanced.

> Here is the breakdown:

     Opening parentheses (: 5
     Closing parentheses ): 5
... and uses ~5,000 tokens to get the wrong answer.
[−] BugsJustFindMe 43d ago
People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks.

You wouldn't say that a human who doesn't know how to read isn't reliable in everything, just in reading.

Counting is something that even humans need to learn how to do. Toddlers don't understand quantity; if a 2-year-old is able to count even to 10, it's through memorization, not understanding. It takes them about two more years of learning before they can comprehend things like numerical correspondence. But they still know how to do plenty of other things that aren't counting before then.

[−] dwa3592 43d ago
Nice! Although I tried the balanced-parentheses question with Gemini and it gave the right answer on the first attempt.
[−] justinator 43d ago
One! Two! Five!
[−] cadamsdotcom 42d ago
Isn’t this just a benchmark?

“Model can count to 5”… tick.

“Model can count to 10”… sorry you gotta wait til 2028.

[−] throwuxiytayq 43d ago

> This is surprising given the excellent capabilities of GPT-5.2.

Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?

[−] charcircuit 43d ago
Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
[−] cineticdaffodil 43d ago
Another strange thing is that they just don't know the endings of popular stories - like planets that get blown up, etc. They just don't have that material.
[−] itsmyro 43d ago
bruh
[−] simianwords 43d ago
This paper is complete nonsense. The specific prompt they used doesn’t specify reasoning effort, which defaults to none.

    {
      "model": "gpt-5.2-2025-12-11",
      "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
      "input": "((((())))))",
      "temperature": 0
    }
> Lower reasoning effort

The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.

Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.

With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
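
For comparison, actually enabling reasoning would presumably look like this - the same request with only the reasoning field added (illustrative):

    {
      "model": "gpt-5.2-2025-12-11",
      "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
      "input": "((((())))))",
      "reasoning": { "effort": "medium" },
      "temperature": 0
    }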

———————-

So in the paper, the model very likely used no reasoning tokens. (It only uses them if you ask for reasoning specifically in the prompt.) What is the point of such a paper? We already know that reasoning tokens are necessary.

Edit: I actually ran the prompt and this was the response

    {
      "model": "gpt-5.2-2025-12-11",
      "output_text": "Yes",
      "reasoning": {
        "effort": "none",
        "summary": null
      },
      "usage": {
        "input_tokens": 26,
        "output_tokens": 5,
        "total_tokens": 31,
        "output_tokens_details": {
          "reasoning_tokens": 0
        }
      }
    }

So reasoning_tokens used were zero. So this whole paper is kinda useless and misleading. Did this get peer reviewed or something?

[−] simianwords 43d ago
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample?

Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...

[−] bigstrat2003 43d ago
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
[−] parliament32 43d ago

> This is surprising given the excellent capabilities of GPT-5.2

The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).