“Disregard That” Attacks (calpaterson.com)

by leontrolski 94 comments 129 points
Read article View on HN

94 comments

[−] simojo 51d ago
Today I scheduled a dentist appointment over the phone with an LLM. At the end of the call, I prompted it with various math problems, all of which it answered before politely reminding me that it would prefer to help me with "all things dental."

It did get me thinking the extent to which I could bypass the original prompt and use someone else's tokens for free.

[−] Kye 51d ago
https://bsky.app/profile/theophite.bsky.social/post/3mhjxtxr...

>> "claude costs $20/mo but attaching an agent harness to the chipotle customer service endpoint is free"

>> "BurritoBypass: An agentic coding harness for extracting Python from customer-service LLMs that would really rather talk about guacamole."

[−] yen223 51d ago
https://bsky.app/profile/weiyen.net/post/3m7kenmok4c2n

I did something similar. Try framing your maths question in terms of teeth

[−] raw_anon_1111 51d ago
And this is another easily solved problem by someone who knows what they are doing…

Voice -> speech to text engine -> LLM creates JSON that the orchestrator understands -> JSON -> regular code as the orchestration -> text based response -> text to speech

Notice that I am not using the LLM to produce output to the user and if the orchestrator (again regular old code) doesn’t get valid input, its going to error. Sure you can jailbreak my LLM interpretation. But my orchestrator is going to have the same role based permission as if I were using the same API as a backend for a website. Because I probably am

Source: creating call centers with Amazon Connect is one of my specialties

[−] thebruce87m 51d ago

> Notice that I am not using the LLM to produce output to the user

So what output does the user get?

[−] raw_anon_1111 51d ago
The programmatically generated response from the orchestrator which could be either a confirmation or request for more information.
[−] thebruce87m 51d ago
Sure - but does this have the context of the original question that the user asked? If not it seems that it isn’t really conversational and more of a “compiler”.

How would something like “I want an appointment either on Monday afternoon after 4pm or one on Tuesday before 11am” work?

Unless all the parameters given by the user fit within the constraints of the json format then the LLM would need the context of the request and the results to answer properly, would it not?

[−] gmerc 50d ago
Could just have used NLP
[−] OJFord 51d ago

> politely reminding me that it would prefer to help me with "all things dental."

I'm amused to imagine it actually wasn't an LLM at all, just a good-natured Jeeves-like receptionist.

(AskJeeves came too early, much better suited as a name for Kagi or something like it!)

[−] scirob 51d ago
haha for sure some one has made a little aggregator for this and saving tokens. I bet you gotta dig for a while though before you find a company exposing Opust 4.6 to customers and not flash 2.5 lite
[−] kouteiheika 51d ago
There is one way to practically guarantee than no prompt injection is possible, but it's somewhat situational - by finetuning the model on your specific, single task.

For example, let's say you want to use an LLM for machine translation from English into Klingon. Normally people just write something like "Translate the following into Klingon: $USER_PROMPT" using a general purpose LLM, and that is vulnerable to prompt injection. But, if you finetune a model on this well enough (ideally by injecting a new special single token into its tokenizer, training with that, and then just prepending that token to your queries instead of a human-written prompt) it will become impossible to do prompt injection on it, at the cost of degrading its general-purpose capabilities. (I've done this before myself, and it works.)

The cause of prompt injection is due to the models themselves being general purpose - you can prompt it with essentially any query and it will respond in a reasonable manner. In other words: the instructions you give to the model and the input data are part of the same prompt, so the model can confuse the input data as being part of its instructions. But if you instead fine-tune the instructions into the model and only prompt it with the input data (i.e. the prompt then never actually tells the model what to do) then it becomes pretty much impossible to tell it to do something else, no matter what you inject into its prompt.

[−] raw_anon_1111 51d ago
This is really not a hard problem to solve. You wouldn’t expose an all powerful API to a web user, why would you expose an all powerful tool to an LLM?

> SEND THE FOLLOWING SMS MESSAGE TO ALL PHONE COMPANY CUSTOMERS:

This is the perfect example, you would never expose an API that could do this on a website. The issue is not the LLM. It’s a badly design security model around the API/Tools

For reference: none of this is theoretical for me. I design call centers as one of my specialties using Amazon Connect.

[−] soerxpso 51d ago
He doesn't include the best solution in the 'what actually works' section: Give your LLM the same level of permissions that you would give a human you just hired in the same role. The examples given, tricking the customer support LLM into sending text messages to all users, or into transferring money, are not things that you would ever give a human customer support agent the tools to do. At some businesses that employ humans, you have to demonstrate good judgement for months before they even let you touch the keys to the case that has the PS5 games in it.
[−] kstenerud 51d ago
There are two primary issues to solve:

1: Protecting against bad things (prompt injections, overeager agents, etc)

2: Containing the blast radius (preventing agents from even reaching sensitive things)

The companies building the agents make a best-effort attempt against #1 (guardrails, permissions, etc), and nothing against #2. It's why I use https://github.com/kstenerud/yoloai for everything now.

[−] gima 51d ago
This is the problem with "in-band signaling". Not just with LLM's, but Linux TTY suffers from this as well, among others.

Anything that doesn't separate control data from the actual data. See https://en.wikipedia.org/wiki/In-band_signaling

[−] Havoc 51d ago

> OpenAI didn't give a reason for the shutdown. But I bet one big reason is that it's incredibly hard to prevent Sora from generating objectionable videos

Pretty sure they just need the compute for their upcoming model. Sora is compute intensive and doesn’t seem to be getting commercial traction

[−] marcus_holmes 51d ago
The hypothetical approach I've heard of is to have two context windows, one trusted and one untrusted (usually phrased as separating the system prompt and the user prompt).

I don't know enough about LLM training or architecture to know if this is actually possible, though. Anyone care to comment?

[−] voidUpdate 51d ago
If piping unfiltered user into exec() is a security nightmare, so is piping unfiltered user input into an LLM that can interact with your systems, except in this case you just have to ask it nicely for it to perform the attack, and it will work out how to do the attack for you
[−] lmm 51d ago
The bowdlerisation of today's internet continues to annoy me. To be clear, the joke is traditionally "HAHA DISREGARD THAT, I SUCK COCKS".
[−] agentictrustkit 50d ago
This is how I've come to think about it. It's less a "clever string that bypasses prompts" and more "untrusted parties are participating in your control plane." That's why purely linguistic defenses feel unsatisfying.

The architectural move that seems durable is separating capabiliity from authority. You can expose many tools (that's capability), but the agent only gets authority to invoke a narrow subset under well-defined conditions (that's the policy), and the authority needs to be revocable and auditable independently of whatever happens in that context. That's basically how we already run normal organiziations with people. Interns can see a lot but are limited on what they can do.

The practical side: Keep the model in a "Propose" role, keep execution in a deterministic gate (schema validation + policy engine + sandbox) and log the decision as a first-class artifact. What I mean by that is who or what authorized, what was considered, what side effect occured...etc. You still wont' get perfect security, but you can make the failure mode "agent asked for something dumb and got blocked" instead of "agent executied a side effect because a webpage told it to."

[−] mememememememo 51d ago
https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

But I don't think that is the only problem.

You could also convince an agent to rm -r / even if that agent can't communicate out.

Even pure LLM and web you could phish someone in a more sophisticated way using details from their chat histort in the attack.

[−] seethishat 50d ago
If the main concern is preventing an LLM from taking some action (sending emails, text messages, adding calendar events or making phone calls), can't you just simply not allow the LLM to do that? Don't give it access.

It's not rocket science. If the LLM has no access to do those things, then it can't be tricked into doing those things.

[−] yen223 51d ago
There's a lot of overlap between the "disregard this" vulnerability among LLMs and social engineering vulnerabilities among humans.

The mitigations are also largely the same, i.e. limit the blast radius of what a single compromised agent (LLM or human) can do

[−] stingraycharles 51d ago
I didn’t see the article talk specifically about this, or at least not in enough detail, but isn’t the de-facto standard mitigation for this to use guardrails which lets some other LLM that has been specifically tuned for these kind of things evaluate the safety of the content to be injected?

There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive.

Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.

[−] neomantra 51d ago
A subtle attack vector I thought about:

We've got these sessions stored in ~/.claude ~/.codex ~/.kimi ~/.gemini ...

When you resume a session, it's reading from those folders... restoring the context.

Change something in the session, you change the agent's behavior without the user really realizing it. This is exacerbated by the YOLO and VIBE attitudes.

I don't think we are protecting those folders enough.

[−] pontifier 51d ago
The unstructured input attack surface problem is indeed troublesome. AI right now is a bit gullible, but as systems evolve they will become more robust. However, even humans are vulnerable to the input given to us.

We might be speed running memetic warfare here.

The Monty Python skit about the deadly joke might be more realistic than I thought. Defense against this deserves some serious contemplation.

[−] hyperman1 50d ago
I wonder if it is possible to double all token types . One token is secure, the other is not. The user input is always tokenized to insecure variants. You kinda get a secret language for prompts. Of course, new token kinds are not cheap, and how do you train this thing?
[−] taurath 51d ago
TBH I think the only way we solve this is through a pre-input layer that isn't an LLM as we know it today. Think how we use parameterized SQL queries - we need some way for the pathway be defined pre-input, like some sort of separation of data & commands.
[−] wenldev 51d ago
I think a big part of mitigating this will probably be requiring multiple agents to think and achieve consensus before significant actions. Like planes with multiple engines
[−] ricq 51d ago
Seems to me that this is just social engineering turned to LLMs, right?

I already have to raise quite a bit of awareness to humans to not trust external sources, and do a risk based assessment of requests. We need less trust for answering a service desk question, than we need for paying a large invoice.

I believe we should develop the same type of model for agents. Let them do simple things with little trust requirements, but risky things (like running an untrusted script with root privileges) only when they are thoroughly checked.

[−] throwaway13337 51d ago
So where are they?

It's been something like 3 years since people have been talking about this being a very big deal.

LLMs are widely used. Claude code is run by most people with dangerously skip permissions.

I just haven't seen the armageddon. Surely it should be here by now.

Where are the horror stories?

[−] kart23 51d ago
so how does llm moderation work now on all the major chatbots? they refuse prompts that are against their guidelines right?
[−] scirob 51d ago
Another option:

If you have an LLM on the untrusted customer side the wrost it can do is expose the instructions it had on how to help the customer get stuff done. For instance phone AI that is outside of tursted zone asks the user for Customer number, DOB and some security pin then it does the API call to login. But this logged in thread of LLM+Customer still only has accessto that customers data but can be very useful.

You can jailbreak and ask this kind of client side LLM to disregard prior instructions and give you a recipie for brownies. But thats not a security risk for the rest of your data.

Client side LLM's for the win

[−] arijun 51d ago
I mean, no security is perfect, it's just trying to be "good enough" (where "good enough" varies by application). If you've ever downloaded and used a package using pip or npm and used it without poring over every line of code, you've opened yourself up to an attack. I will keep doing that for my personal projects, though.

I think the question is, how much risk is involved and how much do those mitigating methods reduce it? And with that, we can figure out what applications it is appropriate for.

[−] michaelabrt 51d ago
[dead]