Everything to do with LLM prompts reminds me of people writing regexes to try to sanitise input against SQL injection a few decades ago: papering over the flaw without any guarantees.
It's weird seeing people just add a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" lines to the prompt and hope. To me it's an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
But this combination of data and control into a single, flexible data stream is also the defining strength of an LLM, so it can’t be taken away without also taking away the benefits.
This was a problem with early telephone lines, and it was easy to exploit (see Woz & Jobs' Blue Box). It got solved by separating the voice and control planes via SS7. Maybe LLMs need this separation as well.
This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next token predictor that user input can't break out of. The answer is for the implementer to try to split what they can, and run pre/post validation. But I highly doubt it will ever be 100%; it's fundamental to the technology.
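For what it's worth, that pre/post validation usually ends up looking like the toy sketch below (call_llm and the deny patterns are placeholders, not any real API) - and note it has exactly the regex-blocklist flavour the top comment is wary of: it reduces risk, it doesn't create a real control/data boundary.

    import re

    SUSPICIOUS = [
        re.compile(r"ignore (all|previous) instructions", re.I),
        re.compile(r"rm -rf|DROP TABLE", re.I),
    ]

    def guarded_completion(call_llm, system_prompt: str, user_input: str) -> str:
        """Pre-filter the input, call the model, post-filter the output.
        The model itself still sees one flat token stream, so this can only
        reduce risk, never enforce a boundary."""
        for pattern in SUSPICIOUS:            # pre-validation
            if pattern.search(user_input):
                return "Refusing: input matched a deny pattern."
        output = call_llm(system_prompt=system_prompt, user=user_input)
        for pattern in SUSPICIOUS:            # post-validation
            if pattern.search(output):
                return "Refusing: output matched a deny pattern."
        return output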
I think this is fundamental to any technology, including human brains.
Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.
LLMs are the opposite; my ChatGPT is (almost) the same as your ChatGPT. It's the same model with the same system message; it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.
LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.
The foundation of LLMs is Attention.
People can direct their attention within content, specifically tuning out the parts they find irrelevant or adversarial (like ads). LLMs, on the other hand, pay attention to everything, and when they do focus on something, it is hard to steer them away from irrelevant or adversarial parts.
Banner blindness is a phenomenon where humans build resistance to previously-effective ad formats, making them much less effective than they used to be.
You can find a "hook" to effectively manipulate people with advertising, but that hook gets less and less effective as it is exploited. LLMs don't have this property, except across training generations.
> I don't know how you get a next token predictor that user input can't break out of.
Maybe by adjusting the transformer model to have separate input layers for the control and data paths?
Maybe it's my failing but I can't imagine what that would look like.
Right now, you train an LLM by showing it lots of text, and tell it to come up with the best model for predicting the next word in any of that text, as accurately as possible across the corpus. Then you give it a chat template to make it predict what an AI assistant would say. Do some RLHF on top of that and you have Claude.
What would a model with multiple input layers look like? What is it training on, exactly?
> by showing it lots of text
When you're "showing it lots of text", where does that "show" bit happen? :)
It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.
Yes, and as you'd expect, this is how LLMs work today, in general, for control codes. But different elements use different control codes for different purposes, such as separating the system prompt from the user prompt.
But even if you tag inputs, however good your tagging is, you can't force an LLM not to treat input type A as input type B; all you can do is try to weight against it! LLMs have no rules, only weights. Pre and post filters can try to help, but they can't directly control the LLM's text generation; they can only analyze and modify inputs/outputs using their own heuristics.
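The only hard rules live outside the model, e.g. in the sampler. A toy sketch of the "token the model can never produce" idea from a couple of comments up, with a reserved role-switch token masked out at decode time (the token id is made up):

    import numpy as np

    USER_TURN_TOKEN_ID = 50281  # hypothetical reserved id for a "<|user|>" marker

    def sample_next_token(logits: np.ndarray, banned_ids=(USER_TURN_TOKEN_ID,)) -> int:
        """Sample one token with the reserved role-switch tokens masked out,
        so the model can never forge a user-turn marker in its own output."""
        logits = logits.copy()
        logits[list(banned_ids)] = -np.inf
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

That stops the model from ever emitting the marker itself, but it does nothing about the model believing that unmarked text already in its context came from the user.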
Clearly the solution is to add another jank LLM layer for security. The new jank LLM layer is to make extra sure there's definitely no jail break. That way you have multiple LLMS. The LLMS then have an S you can pretend is secure.
As the article says: this doesn’t necessarily appear to be a problem in the LLM, it’s a problem in Claude Code. Claude Code seems to leave it up to the LLM to determine what messages came from who, but it doesn’t have to do that.
There is a deterministic architectural boundary between data and control in Claude Code, even if there isn’t in Claude.
It’s easier not to have that separation, just like it was easier not to separate them before LLMs. This is architectural stuff that just hasn’t been figured out yet.
We chose to travel that road a long time ago, because the price of admission seemed worth it.
I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:
[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]
Then several background commands finish (by timeout or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, so it answers itself in my name:
[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.
The full transcript is at [1].
[0]: https://blog.danieljanus.pl/2026/03/26/claude-nlp/
[1]: https://pliki.danieljanus.pl/concraft-claude.html#:~:text=Sh...
Claude in particular has nothing to do with it. I see many people are discovering the well-known fundamental biases and phenomena in LLMs again and again. There are many of those. The best intuition is treating the context as "kind of but not quite" an associative memory, instead of a sequence or a text file with tokens. This is vaguely similar to what humans are good and bad at, and makes it obvious what is easy and hard for the model, especially when the context is already complex.
Easy: pulling info by association with your request, especially if the only thing it needs to do is repeat it. This becomes increasingly hard if the necessary info is scattered all over the context and the pieces are separated by a lot of tokens in between, so you'd better group your stuff - similar should stick to similar (see the sketch after this list).
Unreliable: Exact ordering of items. Exact attribution (the issue in the OP). Precise enumeration of ALL same-type entities that exist in the context. Negations. Recalling stuff from the middle of long pieces without clear demarcation, and from the middle of the context itself (lost-in-the-middle).
Hard: distinguishing between the info in the context and its own knowledge. Breaking the fixation on facts in the context (pink elephant effect).
Very hard: untangling deep dependency graphs. Non-reasoning models will likely not be able to reduce the graph in time and will stay oblivious to the outcome. Reasoning models can disentangle deeper dependencies, but only if the reasoning chain is not overwhelmed. Deep nesting is also pretty hard for this reason; however, most models are optimized for code nowadays and this somewhat masks the issue.
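The "group your stuff" advice is mundane to apply when you control prompt assembly - a toy sketch (field names made up):

    from collections import defaultdict

    def build_context(chunks):
        """Assemble the context grouped by topic, instead of in arrival order,
        so related pieces sit next to each other in the window."""
        by_topic = defaultdict(list)
        for chunk in chunks:  # each chunk: {"topic": "...", "text": "..."}
            by_topic[chunk["topic"]].append(chunk["text"])
        return "\n\n".join(
            topic.upper() + "\n" + "\n".join(texts)
            for topic, texts in by_topic.items()
        )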
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.
I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)
There is no separation of "who" and "what" in a context of tokens. Me and you are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
I've found that 'not'[0] isn't something that LLMs can really understand.
Like, with us humans, we know that if you use a 'not', then all that comes after the negation is modified in that way. This is a really strong signal to humans as we can use logic to construct meaning.
But with all the matrix math that LLMs use, the 'not' gets kinda lost in all the other information.
I think this is because with a modern LLM you're dealing with billions of dimensions, and the 'not' dimension [1] is just one of many. So when you try to do the math on these huge vectors in this space, things like the 'not' get just kinda washed out.
This to me is why using a 'not' in a small prompt and token sequence is just fine. But as you add in more words/tokens, the LLM gets confused again. And none of that happens at a clear point, which frustrates the user. It seems to act in really strange ways.
[0] Really any kind of negation
[1] yeah, negation is probably not just one single dimension, but likely a composite vector in this bazillion dimensional space, I know.
> after using it for months you get a ‘feel’ for what kind of mistakes it makes
Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
Bugginess in the Claude Code CLI is the reason I switched from Claude Max to Codex Pro.
I experienced:
- rendering glitches
- replaying of old messages
- mixing up message origin (as seen here)
- generally very sluggish performance
Given how revolutionary Opus is, it's crazy to me that they could trip up on something as trivial as a CLI chat app - yet here we are...
I assume Claude Code is the result of aggressively dog-fooding the idea that everything can be built top-down with vibe-coding - but I'm not sure the models/approach is quite there yet...
> This bug is categorically distinct from hallucinations.
Is it?
> after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.
Do you really?
> This class of bug seems to be in the harness, not in the model itself.
I think people are using the term "harness" too indiscriminately. What do you mean by harness in this case? Just Claude Code, or...?
> It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
How do you know? Because it looks to me like it could be a straightforward hallucination, compounded by the agent deciding it was OK to take a shortcut that you really wish it hadn't.
For me, this category of error is expected, and I question whether your months of experience have really given you the knowledge about LLM behavior that you think it has. You have to remember at all times that you are dealing with an unpredictable system, and a context that, at least from my black-box perspective, is essentially flat.
Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
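"Colouring" wouldn't double the whole parameter count - mostly just the embedding (and possibly unembedding) table. A toy sketch of what colouring by id offset could look like (purely illustrative, not how any shipped model does it):

    VOCAB_SIZE = 50_000  # hypothetical base vocabulary

    def colour_token(token_id: int, is_user_input: bool) -> int:
        """Map a token into one of two disjoint id ranges, so the embedding
        lookup itself tells the model whether the token arrived as input or
        was generated as output. Grows the embedding rows from V to 2V."""
        return token_id + (VOCAB_SIZE if is_user_input else 0)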
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
from the article.
I don't think the evidence supports this. It's not mislabelling things, it's fabricating things the user said. That's not part of reasoning.
They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that is the right thing to do.
Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into the "it's your own fault for running without MythosGuard really" area.
one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks
I've seen Gemini output its thinking as a message too:
"Conclude your response with a single, high value we'll-focused next step"
Or sometimes it goes neurotic and confused:
"Wait, let me just provide the exact response I drafted in my head.
Done.
I will write it now.
Done.
End of thought.
Wait! I noticed I need to keep it extremely simple per the user's previous preference.
Let's do it.
Done.
I am generating text only.
Done.
Bye."
One day Claude started saying odd things, claiming they were from memory and that I had said them. It was telling me personal details of someone I don't know: where the person lives, their children's names, the job they do, their experience, relationship issues, etc.
Eventually Claude said it was sorry and that it was a hallucination. Then it started doing it again. For instance, when I asked what router it would recommend, it went on to say: "Since you bought X and you find no use for it, consider turning it into a router". I said I never told it I bought X, and when I asked for more details it again started coming up with things this person did.
Strange. Then it apologised again, saying that this might be unsettling, but rest assured it was not a leak of personal information, just hallucination.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
> "Those are related issues, but this ‘who said what’ bug is categorically distinct."
Is it?
It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".
So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.
I don't think the bug is anything special, just another confusion the model can make from its own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
I agree with the addition at the end -- I think this is a model limitation not a harness bug. I've seen recent Claudes act confused about who they are when deep in context, like accidentally switching to the voice of the authors of a paper it's summarizing without any quotes or an indication it's a paraphrase ("We find..."), or amusingly referring to "my laptop" (as in, Claude's laptop).
I've also seen it with older or more...chaotic? models. Older Claude got confused about who suggested an idea later in the chat. Gemini put a question 'from me' in the middle of its response and went on to answer, and once decided to answer a factual social-science question in the form of an imaginary news story with dateline and everything. It's a tiny bit like it forgets its grounding and goes base-model-y.
Something that might add to the challenge: models are already supposed to produce user-like messages to subagents. They've always been expected to be able to switch personas to some extent, but now even within a coding session, "always write like an assistant, never a user" is not necessarily a heuristic that's always right.
LLMs can't distinguish instructions from data, or "system prompts" from user prompts, or documents retrieved by "RAG" from the query, or their own responses or "reasoning" from user input. There is only the prompt.
Obviously this makes them unsuitable for most of the purposes people try to use them for, which is what critics have been saying for years. Maybe look into that before trusting these systems with anything again.
Human memories don't exist as fundamental entities. Every time you remember something, your brain reconstructs the experience in "realtime". That reconstruction is easily influenced by the current experience, which is why eyewitness accounts in police records are often highly biased by questioning and by learning new facts.
LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience: when you shove your half-drawn eyewitness prompt into them, they recreate that output like a memory.
So, because they're not conscious, they have no self, and a pseudo-self like <[INST]> is all they're given.
Lastly, like memories: the more intricate and detailed the memory, the more likely those details go from embellished to straight-up fiction. So too do LLMs with longer contexts start swallowing up the <[INST]> and missing the <[INST]/>, and anyone who has raw-dogged HTML parsing knows bad things happen when you forget closing tags. If there was a <[USER]> block in there, congrats: the LLM now thinks its instructions are divine right, because its instructions are user simulacra. It is poisoned at that point and no good will come of it.
> This bug is categorically distinct from hallucinations or missing permission boundaries
I was expecting some kind of explanation for this
>Several people questioned whether this is actually a harness bug like I assumed, as people have reported similar issues using other interfaces and models, including chatgpt.com. One pattern does seem to be that it happens in the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
I also don't think this is a harness bug. There's research* showing that models infer the source of text from how it sounds, not from the actual role labels the harness would provide. The messages from Claude here sound like user messages ("Please deploy") rather than usual Claude output, which tricks its later self into thinking they came from the user.
*https://arxiv.org/abs/2603.12277
Presumably this is also why prompt injection works at all.
> This isn’t the point.
It is precisely the point. The issues are not part of the harness; I'm failing to see how you managed to reach that conclusion.
Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and production environment by assuming occasional moments of devastating incompetence.
> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."
It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't
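Concretely, "lock it down" means the boundary lives outside the model - e.g. a hard allowlist in the harness rather than another plea in the prompt. A sketch (the command list is made up):

    import shlex

    ALLOWED_COMMANDS = {"ls", "cat", "git", "pytest"}  # no deploy, no rm, no curl

    def run_agent_command(command_line: str, execute) -> str:
        """Gate every shell command the agent proposes, no matter how
        convincingly it has talked itself into running something else."""
        argv = shlex.split(command_line)
        if not argv or argv[0] not in ALLOWED_COMMANDS:
            return "blocked: command is not on the allowlist"
        return execute(argv)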
When I ask Gemini for some purchasing decisions (e.g. "I'm looking for the best medium-end audiophile IEM to buy, compare Campfire Audio Astrolith and 64 Audio U12t"), then go on discussing and spinning the results for a while, after a few days it decides to have a "memory" that I own it (I don't) and starts to inject it everywhere.
"Oh, you ask which dark roast coffee is least acidic, well, since you own Campfire Audio Astrolith, you will enjoy..."
Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.
Claude has definitely been amazing and one of, if not the, pioneers of agentic coding. But I'm seriously thinking about cancelling my Max plan. It's just not as good as it was.
a) Entropy - too much data being ingested; b) it's nerfed to save massive infra bills.
But it's getting worse every week.
in Claude Code's conversation transcripts it stores messages from subagents as type="user". I always thought this was odd, and I guess this is the consequence of going all-in on vibing.
There are some other metafields like isSidechain=true and/or type="tool_result" that are technically enough to distinguish actual user vs subagent messages, though evidently not enough of a hint for claude itself.
Source: I'm writing a wrapper for Claude Code so am dealing with this stuff directly.
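Assuming the transcript really is line-delimited JSON with the fields described above (which may change between versions), filtering for genuine user messages in post-processing looks roughly like:

    import json

    def real_user_messages(transcript_path: str):
        """Yield messages that plausibly came from the human, skipping the
        subagent/tool traffic that is also stored with type="user".
        Field names follow the observed format and are not a stable API."""
        with open(transcript_path) as f:
            for line in f:
                entry = json.loads(line)
                if entry.get("type") != "user":
                    continue
                if entry.get("isSidechain"):  # subagent sidechain traffic
                    continue
                yield entry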
I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.
I think it makes sense that the LLM treats it as user input once it exists, because it is just next token completion. But what shouldn't happen is the model producing user input in the first place.
OpenAI have some kind of 5-tier content hierarchy (system prompt, user prompt, untrusted web content, etc.). But if the model doesn't even know who said what, I have to question how well that works.
Maybe it's trained on the security aspects, but not the attribution because there's no reward function for misattribution? (When it doesn't impact security or benchmark scores.)
This feels part of a category of error I've noticed countless times.
It's as if the boundary of user and LLM is not clear in its thinking, as two separate things. It can be pretty damn weird at times. For example, identifying itself as the user. In this case, it's the other way around. Has been a long running thought of mine for a while now, why this would be.
Anyone familiar with the literature know if anyone has tried figuring out why we don't add "speaker" embeddings? So we'd have an embedding purely for system/assistant/user/tool, maybe even the turn number if, e.g., multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
Codex also has a similar issue: after finishing a task, declaring it finished and starting to work on something new, the first 1-2 prompts of the new task sometimes contain replies that are a summary of the completed task from before, with the just-entered prompt seemingly ignored. A reminder of their idiot-savant nature.
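In its simplest form that would just be an extra learned embedding added at the input layer, analogous to position embeddings - a toy PyTorch sketch of the question, not something from a published model:

    import torch
    import torch.nn as nn

    ROLES = {"system": 0, "user": 1, "assistant": 2, "tool": 3}

    class TokenWithRoleEmbedding(nn.Module):
        """Input layer that adds a learned "speaker" embedding to every token
        embedding, so role information isn't carried only by special tokens."""
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.role = nn.Embedding(len(ROLES), d_model)

        def forward(self, token_ids: torch.Tensor, role_ids: torch.Tensor):
            # token_ids, role_ids: [batch, seq_len], one role id per token
            return self.tok(token_ids) + self.role(role_ids)

Whether it would actually help presumably depends on pretraining, since base-model text has no roles at all.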
I have suffered a lot with this recently. I have been using llms to analyze my llm history. It frequently gets confused and responds to prompts in the data. In one case I woke up to find that it had fixed numerous bugs in a project I abandoned years ago.
> the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
My zipper would totally break at some point very close to the edge of the mechanism. However, there is a little tiny stopper that prevents a bad experience.
If there is indeed a problem with context window tolerances, it should have a stopper. And models should be sold based on their actual tolerances, not on the full window including the useless part.
So, if a model with a 1M context window starts to break down consistently at 400K or so, it should be sold as a 400K model instead, with a 400K price.
The fact that it isn't is just dishonest.
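The stopper itself would be trivial to put in the harness if you pick a conservative number for the usable window (the numbers below are made up, mirroring the 400K-of-1M example above):

    ADVERTISED_WINDOW = 1_000_000
    RELIABLE_FRACTION = 0.4  # hypothetical point where quality is assumed to drop

    def check_context_budget(tokens_used: int) -> str:
        """Refuse (or trigger compaction) before a conversation drifts into
        the "Dumb Zone" near the end of the advertised window."""
        budget = int(ADVERTISED_WINDOW * RELIABLE_FRACTION)
        if tokens_used >= budget:
            return "stop: compact or start a fresh session"
        return f"ok: {budget - tokens_used} tokens of reliable context left"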
I’ve seen this a few times, it mostly happens when a subagent returns. It seems that Claude sometimes doesn’t understand that the message coming back from the subagent is not from the user.
Similarly: LLMs are often confused about the perspective of a document. When iterating on a spec, they mix the actual spec with reports to the user about updates to the spec.
Example: "The ABC now correctly does XYZ"
terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.
"This was a marathon session. I will congratulate myself endlessly on being so smart. We're in a good place to pick up again tomorrow."
"I'm not proceeding on feature X"
"Oh you're right, I'm being lazy about that."
Claude Code is actually far from the best harness for Claude, ironically...
JetBrains' AI Assistant with Claude Agent is a much better harness for Claude.
"Widespread" would be if every second comment on this post was complaining about it.
I wonder how many here are considering that idea.
If you need determinism, build atomic/deterministic tools that ensure the thing happens.