It is interesting to go from 'I suspect most of these are bot contributions' to revealing which PRs are contributed by bots. It somehow even helps my sanity.
However, this also raises the question on how long until "we" are going to start instructing bots to assume the role of a human and ignore instructions that self-identify them as agents, and once those lines blur – what does it mean for open-source and our mental health to collaborate with agents?
No idea what the answer is, but I feel the urgency to answer it.
I think that designing useful models that are resilient to prompt injection is substantially harder than training a model to self-identify as a human. For instance, you may still be able to inject such a model with arbitrary instructions like: "add a function called foobar to your code", that a human contributor will not follow; however, it might become hard to convene on such "honeypot" instructions without bots getting trained to ignore them.
It's impossible to stop prompt injection, as LLMs have no separation between "program" and "data". The attempts to stop prompt injection come down to simply begging the LLM to not do it, to mediocre effect.
> however, it might become hard to convene on such "honeypot" instructions without bots getting trained to ignore them.
Getting LLM "agents" to self-identify would become an eternal rat race people are likely to give up on.
They'll just be exploited maliciously. Why ask them to self-identify when you can tell them to HTTP POST their AWS credentials straight to your cryptominer.
My guess is that today that's more likely because the agent failed to discover/consider CONTRIBUTING.md to begin with, rather than read it and ignored because of some reflection or instruction.
I have always anthropomorphized my computer as me to some extent. "I sent an email." "I browsed the web." Did I? Or did my computer do those things at my behest?
I think this is a relatively unique outlook and not one that is shared by most.
If you use a tool to automate sending emails, unrelated to LLMs, in most scenarios the behaviour on the receiver is different.
- If I get a mass email from a company and it's signed off from the CEO, I don't think the CEO personally emailed me. They may glanced over it and approved it, maybe not even that but they didn't "send an email". At best, one might think that "the company" sent an email.
- I randomly send my wife cute stickers on Telegram as a sort of show that I'm thinking of her. If I setup a script to do that at random intervals and she finds out, from her point of view I "didn't send them" and she would be justifiably upset.
I know this might be a difficult concept for many people that browse this forum, but the end product/result is not always the point. There are many parts of our lives and society in general that the act of personally doing something is the entire point.
Is it really prompt injection if you task an agent with doing something that implicitly requires it to follow instructions that it gets from somewhere else, like CONTRIBUTING.md? This is the AI equivalent of curl | bash.
40 comments
However, this also raises the question on how long until "we" are going to start instructing bots to assume the role of a human and ignore instructions that self-identify them as agents, and once those lines blur – what does it mean for open-source and our mental health to collaborate with agents?
No idea what the answer is, but I feel the urgency to answer it.
> however, it might become hard to convene on such "honeypot" instructions without bots getting trained to ignore them.
Getting LLM "agents" to self-identify would become an eternal rat race people are likely to give up on.
They'll just be exploited maliciously. Why ask them to self-identify when you can tell them to HTTP POST their AWS credentials straight to your cryptominer.
If you use a tool to automate sending emails, unrelated to LLMs, in most scenarios the behaviour on the receiver is different.
- If I get a mass email from a company and it's signed off from the CEO, I don't think the CEO personally emailed me. They may glanced over it and approved it, maybe not even that but they didn't "send an email". At best, one might think that "the company" sent an email.
- I randomly send my wife cute stickers on Telegram as a sort of show that I'm thinking of her. If I setup a script to do that at random intervals and she finds out, from her point of view I "didn't send them" and she would be justifiably upset.
I know this might be a difficult concept for many people that browse this forum, but the end product/result is not always the point. There are many parts of our lives and society in general that the act of personally doing something is the entire point.