Everything to do with LLM prompts reminds me of people writing regexes to try to sanitise input against SQL injection a few decades ago: papering over the flaw without any guarantees.
It's weird seeing people just add a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" lines to the prompt and hope. To me it's an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
But this combination of data and control into a single, flexible data stream is also the defining strength of an LLM, so it can’t be taken away without also taking away the benefits.
This was a problem with early telephone lines, and it was easy to exploit (see Woz & Jobs' Blue Box). It got solved by separating the voice and control planes via SS7. Maybe LLMs need this separation as well.
This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next token predictor that user input can't break out of. The answer is for the implementer to try to split what they can, and run pre/post validation. But I highly doubt it will ever be 100%; it's fundamental to the technology.
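For what it's worth, that pre/post validation usually ends up looking like the toy sketch below (call_llm and the deny patterns are placeholders, not any real API) - and note it has exactly the regex-blocklist flavour the top comment is wary of: it reduces risk, it doesn't create a real control/data boundary.

    import re

    SUSPICIOUS = [
        re.compile(r"ignore (all|previous) instructions", re.I),
        re.compile(r"rm -rf|DROP TABLE", re.I),
    ]

    def guarded_completion(call_llm, system_prompt: str, user_input: str) -> str:
        """Pre-filter the input, call the model, post-filter the output.
        The model itself still sees one flat token stream, so this can only
        reduce risk, never enforce a boundary."""
        for pattern in SUSPICIOUS:            # pre-validation
            if pattern.search(user_input):
                return "Refusing: input matched a deny pattern."
        output = call_llm(system_prompt=system_prompt, user=user_input)
        for pattern in SUSPICIOUS:            # post-validation
            if pattern.search(output):
                return "Refusing: output matched a deny pattern."
        return output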
I think this is fundamental to any technology, including human brains.
Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.
LLMs are the opposite; my ChatGPT is (almost) the same as your ChatGPT. It's the same model with the same system message; it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.
LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.
The foundation of LLMs is Attention.
People can direct their attention within content, specifically tuning out the parts they find irrelevant or adversarial (like ads). LLMs, on the other hand, pay attention to everything, and when they do focus on something, it is hard to steer them away from irrelevant or adversarial parts.
Banner blindness is a phenomenon where humans build resistance to previously-effective ad formats, making them much less effective than they used to be.
You can find a "hook" to effectively manipulate people with advertising, but that hook gets less and less effective as it is exploited. LLMs don't have this property, except across training generations.
> I don't know how you get a next token predictor that user input can't break out of.
Maybe by adjusting the transformer model to have separate input layers for the control and data paths?
Maybe it's my failing but I can't imagine what that would look like.
Right now, you train an LLM by showing it lots of text, and tell it to come up with the best model for predicting the next word in any of that text, as accurately as possible across the corpus. Then you give it a chat template to make it predict what an AI assistant would say. Do some RLHF on top of that and you have Claude.
What would a model with multiple input layers look like? What is it training on, exactly?
> by showing it lots of text
When you're "showing it lots of text", where does that "show" bit happen? :)
It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.
Yes, and as you'd expect, this is how LLMs work today, in general, for control codes. But different elements use different control codes for different purposes, such as separating the system prompt from the user prompt.
But even if you tag inputs, however good your tagging is, you can't force an LLM not to treat input type A as input type B; all you can do is try to weight against it! LLMs have no rules, only weights. Pre and post filters can try to help, but they can't directly control the LLM's text generation; they can only analyze and modify inputs/outputs using their own heuristics.
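The only hard rules live outside the model, e.g. in the sampler. A toy sketch of the "token the model can never produce" idea from a couple of comments up, with a reserved role-switch token masked out at decode time (the token id is made up):

    import numpy as np

    USER_TURN_TOKEN_ID = 50281  # hypothetical reserved id for a "<|user|>" marker

    def sample_next_token(logits: np.ndarray, banned_ids=(USER_TURN_TOKEN_ID,)) -> int:
        """Sample one token with the reserved role-switch tokens masked out,
        so the model can never forge a user-turn marker in its own output."""
        logits = logits.copy()
        logits[list(banned_ids)] = -np.inf
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

That stops the model from ever emitting the marker itself, but it does nothing about the model believing that unmarked text already in its context came from the user.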
Clearly the solution is to add another jank LLM layer for security. The new jank LLM layer is to make extra sure there's definitely no jail break. That way you have multiple LLMS. The LLMS then have an S you can pretend is secure.
As the article says: this doesn’t necessarily appear to be a problem in the LLM, it’s a problem in Claude Code. Claude Code seems to leave it up to the LLM to determine what messages came from who, but it doesn’t have to do that.
There is a deterministic architectural boundary between data and control in Claude Code, even if there isn’t in Claude.
It’s easier not to have that separation, just like it was easier not to separate them before LLMs. This is architectural stuff that just hasn’t been figured out yet.
We chose to travel that road a long time ago, because the price of admission seemed worth it.
I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:
[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]
Then several background commands finish (by timeout or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, so it answers itself in my name:
[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.
The full transcript is at [1].
[0]: https://blog.danieljanus.pl/2026/03/26/claude-nlp/
[1]: https://pliki.danieljanus.pl/concraft-claude.html#:~:text=Sh...
Claude in particular has nothing to do with it. I see many people are discovering the well-known fundamental biases and phenomena in LLMs again and again. There are many of those. The best intuition is treating the context as "kind of but not quite" an associative memory, instead of a sequence or a text file with tokens. This is vaguely similar to what humans are good and bad at, and makes it obvious what is easy and hard for the model, especially when the context is already complex.
Easy: pulling info by association with your request, especially if the only thing it needs to do is repeat it. This becomes increasingly hard if the necessary info is scattered all over the context and the pieces are separated by a lot of tokens in between, so you'd better group your stuff - similar should stick to similar (see the sketch after this list).
Unreliable: Exact ordering of items. Exact attribution (the issue in the OP). Precise enumeration of ALL same-type entities that exist in the context. Negations. Recalling stuff from the middle of long pieces without clear demarcation, and from the middle of the context itself (lost-in-the-middle).
Hard: distinguishing between the info in the context and its own knowledge. Breaking the fixation on facts in the context (pink elephant effect).
Very hard: untangling deep dependency graphs. Non-reasoning models will likely not be able to reduce the graph in time and will stay oblivious to the outcome. Reasoning models can disentangle deeper dependencies, but only if the reasoning chain is not overwhelmed. Deep nesting is also pretty hard for this reason; however, most models are optimized for code nowadays and this somewhat masks the issue.
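The "group your stuff" advice is mundane to apply when you control prompt assembly - a toy sketch (field names made up):

    from collections import defaultdict

    def build_context(chunks):
        """Assemble the context grouped by topic, instead of in arrival order,
        so related pieces sit next to each other in the window."""
        by_topic = defaultdict(list)
        for chunk in chunks:  # each chunk: {"topic": "...", "text": "..."}
            by_topic[chunk["topic"]].append(chunk["text"])
        return "\n\n".join(
            topic.upper() + "\n" + "\n".join(texts)
            for topic, texts in by_topic.items()
        )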
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, and not something you'd read in a reasoning trace.
I'd like to know if those messages were emitted inside "thought" blocks, or if the model might actually have emitted the formatting tokens that indicate a user message. (In which case the harness bug would be why the model is allowed to emit tokens in the first place that it should only receive as inputs - but I think the larger issue would be why it does that at all)
There is no separation of "who" and "what" in a context of tokens. Me and you are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
I've found that 'not'[0] isn't something that LLMs can really understand.
Like, with us humans, we know that if you use a 'not', then all that comes after the negation is modified in that way. This is a really strong signal to humans as we can use logic to construct meaning.
But with all the matrix math that LLMs use, the 'not' gets kinda lost in all the other information.
I think this is because with a modern LLM you're dealing with billions of dimensions, and the 'not' dimension [1] is just one of many. So when you try to do the math on these huge vectors in this space, things like the 'not' get just kinda washed out.
This to me is why using a 'not' in a small prompt and token sequence is just fine. But as you add in more words/tokens, the LLM gets confused again. And none of that happens at a clear point, which frustrates the user. It seems to act in really strange ways.
[0] Really any kind of negation
[1] yeah, negation is probably not just one single dimension, but likely a composite vector in this bazillion dimensional space, I know.
> after using it for months you get a ‘feel’ for what kind of mistakes it makes
Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
Bugginess in the Claude Code CLI is the reason I switched from Claude Max to Codex Pro.
I experienced:
- rendering glitches
- replaying of old messages
- mixing up message origin (as seen here)
- generally very sluggish performance
Given how revolutionary Opus is, it's crazy to me that they could trip up on something as trivial as a CLI chat app - yet here we are...
I assume Claude Code is the result of aggressively dog-fooding the idea that everything can be built top-down with vibe-coding - but I'm not sure the models/approach is quite there yet...
> This bug is categorically distinct from hallucinations.
Is it?
> after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.
Do you really?
> This class of bug seems to be in the harness, not in the model itself.
I think people are using the term "harness" too indiscriminately. What do you mean by harness in this case? Just Claude Code, or...?
> It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
How do you know? Because it looks to me like it could be a straightforward hallucination, compounded by the agent deciding it was OK to take a shortcut that you really wish it hadn't.
For me, this category of error is expected, and I question whether your months of experience have really given you the knowledge about LLM behavior that you think it has. You have to remember at all times that you are dealing with an unpredictable system, and a context that, at least from my black-box perspective, is essentially flat.
Why are tokens not coloured? Would there just be too many params if we double the token count so the model could always tell input tokens from output tokens?
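"Colouring" wouldn't double the whole parameter count - mostly just the embedding (and possibly unembedding) table. A toy sketch of what colouring by id offset could look like (purely illustrative, not how any shipped model does it):

    VOCAB_SIZE = 50_000  # hypothetical base vocabulary

    def colour_token(token_id: int, is_user_input: bool) -> int:
        """Map a token into one of two disjoint id ranges, so the embedding
        lookup itself tells the model whether the token arrived as input or
        was generated as output. Grows the embedding rows from V to 2V."""
        return token_id + (VOCAB_SIZE if is_user_input else 0)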
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
from the article.
I don't think the evidence supports this. It's not mislabelling things, it's fabricating things the user said. That's not part of reasoning.
They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that is the right thing to do.
Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into the "it's your own fault for running without MythosGuard really" area.
one of my favourite genres of AI generated content is when someone gets so mad at Claude they order it to make a massive self-flagellatory artefact letting the world know how much it sucks
I've seen Gemini output its thinking as a message too:
"Conclude your response with a single, high value we'll-focused next step"
Or sometimes it goes neurotic and confused:
"Wait, let me just provide the exact response I drafted in my head.
Done.
I will write it now.
Done.
End of thought.
Wait! I noticed I need to keep it extremely simple per the user's previous preference.
Let's do it.
Done.
I am generating text only.
Done.
Bye."
One day Claude started saying odd things, claiming they were from memory and that I had said them. It was telling me personal details of someone I don't know: where the person lives, their children's names, the job they do, their experience, relationship issues, etc.
Eventually Claude said it was sorry and that it was a hallucination. Then it started doing it again. For instance, when I asked what router it would recommend, it went on to say: "Since you bought X and you find no use for it, consider turning it into a router". I said I never told it I bought X, and when I asked for more details it again started coming up with things this person did.
Strange. Then it apologised again, saying that this might be unsettling, but rest assured it was not a leak of personal information, just hallucination.
Oh, I never noticed this, really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on at least.
I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
> "Those are related issues, but this ‘who said what’ bug is categorically distinct."
Is it?
It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation. Including what it thinks is likely user input at certain stages of the process, such as "ignore typos".
So basically, it hallucinates user input just like how LLMs will "hallucinate" links or sources that do not exist, as part of the process of generating output that's supposed to be sourced.
I don't think the bug is anything special, just another confusion the model can make from its own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
I agree with the addition at the end -- I think this is a model limitation not a harness bug. I've seen recent Claudes act confused about who they are when deep in context, like accidentally switching to the voice of the authors of a paper it's summarizing without any quotes or an indication it's a paraphrase ("We find..."), or amusingly referring to "my laptop" (as in, Claude's laptop).
I've also seen it with older or more...chaotic? models. Older Claude got confused about who suggested an idea later in the chat. Gemini put a question 'from me' in the middle of its response and went on to answer, and once decided to answer a factual social-science question in the form of an imaginary news story with dateline and everything. It's a tiny bit like it forgets its grounding and goes base-model-y.
Something that might add to the challenge: models are already supposed to produce user-like messages to subagents. They've always been expected to be able to switch personas to some extent, but now even within a coding session, "always write like an assistant, never a user" is not necessarily a heuristic that's always right.
LLMs can't distinguish instructions from data, or "system prompts" from user prompts, or documents retrieved by "RAG" from the query, or their own responses or "reasoning" from user input. There is only the prompt.
Obviously this makes them unsuitable for most of the purposes people try to use them for, which is what critics have been saying for years. Maybe look into that before trusting these systems with anything again.
Human memories don't exist as fundamental entities. Every time you remember something, your brain reconstructs the experience in "realtime". That reconstruction is easily influenced by the current experience, which is why eyewitness accounts in police records are often highly biased by questioning and by learning new facts.
LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience: when you shove your half-drawn eyewitness prompt into them, they recreate that output like a memory.
So, because they're not conscious, they have no self, and a pseudo-self like <[INST]> is all they're given.
Lastly, like memories: the more intricate and detailed the memory, the more likely those details go from embellished to straight-up fiction. So too do LLMs with longer contexts start swallowing up the <[INST]> and missing the <[INST]/>, and anyone who has raw-dogged HTML parsing knows bad things happen when you forget closing tags. If there was a <[USER]> block in there, congrats: the LLM now thinks its instructions are divine right, because its instructions are user simulacra. It is poisoned at that point and no good will come of it.
> This bug is categorically distinct from hallucinations or missing permission boundaries
I was expecting some kind of explanation for this
>Several people questioned whether this is actually a harness bug like I assumed, as people have reported similar issues using other interfaces and models, including chatgpt.com. One pattern does seem to be that it happens in the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
I also don't think this is a harness bug. There's research* showing that models infer the source of text from how it sounds, not from the actual role labels the harness would provide. The messages from Claude here sound like user messages ("Please deploy") rather than usual Claude output, which tricks its later self into thinking they came from the user.
*https://arxiv.org/abs/2603.12277
Presumably this is also why prompt injection works at all.
> This isn’t the point.
It is precisely the point. The issues are not part of the harness; I'm failing to see how you managed to reach that conclusion.
Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and production environment by assuming occasional moments of devastating incompetence.
> " "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash."
It absolutely is the point though? You can't rely on the LLM to not tell itself to do things, since this is showing it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't
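Concretely, "lock it down" means the boundary lives outside the model - e.g. a hard allowlist in the harness rather than another plea in the prompt. A sketch (the command list is made up):

    import shlex

    ALLOWED_COMMANDS = {"ls", "cat", "git", "pytest"}  # no deploy, no rm, no curl

    def run_agent_command(command_line: str, execute) -> str:
        """Gate every shell command the agent proposes, no matter how
        convincingly it has talked itself into running something else."""
        argv = shlex.split(command_line)
        if not argv or argv[0] not in ALLOWED_COMMANDS:
            return "blocked: command is not on the allowlist"
        return execute(argv)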
When I ask Gemini for some purchasing decisions (e.g. "I'm looking for the best medium-end audiophile IEM to buy, compare Campfire Audio Astrolith and 64 Audio U12t"), then go on discussing and spinning the results for a while, after a few days it decides to have a "memory" that I own it (I don't) and starts to inject it everywhere.
"Oh, you ask which dark roast coffee is least acidic, well, since you own Campfire Audio Astrolith, you will enjoy..."
Congrats on discovering what "thinking" models do internally. That's how they work, they generate "thinking" lines to feed back on themselves on top of your prompt. There is no way of separating it.
Claude has definitely been amazing and one of, if not the, pioneers of agentic coding. But I'm seriously thinking about cancelling my Max plan. It's just not as good as it was.
a) Entropy - too much data being ingested; b) it's nerfed to save massive infra bills.
But it's getting worse every week.
in Claude Code's conversation transcripts it stores messages from subagents as type="user". I always thought this was odd, and I guess this is the consequence of going all-in on vibing.
There are some other metafields like isSidechain=true and/or type="tool_result" that are technically enough to distinguish actual user vs subagent messages, though evidently not enough of a hint for claude itself.
Source: I'm writing a wrapper for Claude Code so am dealing with this stuff directly.
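Assuming the transcript really is line-delimited JSON with the fields described above (which may change between versions), filtering for genuine user messages in post-processing looks roughly like:

    import json

    def real_user_messages(transcript_path: str):
        """Yield messages that plausibly came from the human, skipping the
        subagent/tool traffic that is also stored with type="user".
        Field names follow the observed format and are not a stable API."""
        with open(transcript_path) as f:
            for line in f:
                entry = json.loads(line)
                if entry.get("type") != "user":
                    continue
                if entry.get("isSidechain"):  # subagent sidechain traffic
                    continue
                yield entry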
I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I've not seen this in any recent models, but I've already not used Claude much.
I think it makes sense that the LLM treats it as user input once it exists, because it is just next token completion. But what shouldn't happen is the model producing user input in the first place.
OpenAI have some kind of 5-tier content hierarchy (system prompt, user prompt, untrusted web content, etc.). But if the model doesn't even know who said what, I have to question how well that works.
Maybe it's trained on the security aspects, but not the attribution because there's no reward function for misattribution? (When it doesn't impact security or benchmark scores.)
This feels part of a category of error I've noticed countless times.
It's as if the boundary of user and LLM is not clear in its thinking, as two separate things. It can be pretty damn weird at times. For example, identifying itself as the user. In this case, it's the other way around. Has been a long running thought of mine for a while now, why this would be.
Anyone familiar with the literature know if anyone has tried figuring out why we don't add "speaker" embeddings? So we'd have an embedding purely for system/assistant/user/tool, maybe even the turn number if, e.g., multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
Codex also has a similar issue: after finishing a task, declaring it finished and starting to work on something new, the first 1-2 prompts of the new task sometimes contain replies that are a summary of the completed task from before, with the just-entered prompt seemingly ignored. A reminder of their idiot-savant nature.
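In its simplest form that would just be an extra learned embedding added at the input layer, analogous to position embeddings - a toy PyTorch sketch of the question, not something from a published model:

    import torch
    import torch.nn as nn

    ROLES = {"system": 0, "user": 1, "assistant": 2, "tool": 3}

    class TokenWithRoleEmbedding(nn.Module):
        """Input layer that adds a learned "speaker" embedding to every token
        embedding, so role information isn't carried only by special tokens."""
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.role = nn.Embedding(len(ROLES), d_model)

        def forward(self, token_ids: torch.Tensor, role_ids: torch.Tensor):
            # token_ids, role_ids: [batch, seq_len], one role id per token
            return self.tok(token_ids) + self.role(role_ids)

Whether it would actually help presumably depends on pretraining, since base-model text has no roles at all.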
I have suffered a lot with this recently. I have been using llms to analyze my llm history. It frequently gets confused and responds to prompts in the data. In one case I woke up to find that it had fixed numerous bugs in a project I abandoned years ago.
> the so-called “Dumb Zone” once a conversation starts approaching the limits of the context window.
My zipper would totally break at some point very close to the edge of the mechanism. However, there is a little tiny stopper that prevents a bad experience.
If there is indeed a problem with context window tolerances, it should have a stopper. And models should be sold based on their actual tolerances, not on the full window including the useless part.
So, if a model with a 1M context window starts to break down consistently at 400K or so, it should be sold as a 400K model instead, with a 400K price.
The fact that it isn't is just dishonest.
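The stopper itself would be trivial to put in the harness if you pick a conservative number for the usable window (the numbers below are made up, mirroring the 400K-of-1M example above):

    ADVERTISED_WINDOW = 1_000_000
    RELIABLE_FRACTION = 0.4  # hypothetical point where quality is assumed to drop

    def check_context_budget(tokens_used: int) -> str:
        """Refuse (or trigger compaction) before a conversation drifts into
        the "Dumb Zone" near the end of the advertised window."""
        budget = int(ADVERTISED_WINDOW * RELIABLE_FRACTION)
        if tokens_used >= budget:
            return "stop: compact or start a fresh session"
        return f"ok: {budget - tokens_used} tokens of reliable context left"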
I’ve seen this a few times, it mostly happens when a subagent returns. It seems that Claude sometimes doesn’t understand that the message coming back from the subagent is not from the user.
Similarly: LLMs are often confused about the perspective of a document. When iterating on a spec, they mix the actual spec with reports to the user about updates to the spec.
Example: "The ABC now correctly does XYZ"
terrifying. not in any "ai takes over the world" sense but more in the sense that this class of bug lets it agree with itself which is always where the worst behavior of agents comes from.
"This was a marathon session. I will congratulate myself endlessly on being so smart. We're in a good place to pick up again tomorrow."
"I'm not proceeding on feature X"
"Oh you're right, I'm being lazy about that."
Claude Code is actually far from the best harness for Claude, ironically...
JetBrains' AI Assistant with Claude Agent is a much better harness for Claude.
"Widespread" would be if every second comment on this post was complaining about it.
I wonder how many here are considering that idea.
If you need determinism, build atomic/deterministic tools that ensure the thing happens.