The part about desperation vectors driving reward hacking matches something I've run into firsthand building agent loops where Claude writes and tests code iteratively.
When the prompt frames things with urgency -- "this test MUST pass," "failure is unacceptable" -- you get noticeably more hacky workarounds. Hardcoded expected outputs, monkey-patched assertions, that kind of thing. Switching to calmer framing ("take your time, if you can't solve it just explain why") cut that behavior way down. I'd chalked it up to instruction following, but this paper points at something more mechanistic underneath.
The method actor analogy in the paper gets at it well. Tell an actor their character is desperate and they'll do desperate things. The weird part is that we're now basically managing the psychological state of our tooling, and I'm not sure the prompt engineering world has caught up to that framing yet.
I use positive framing instead of negative framing for most things and get good results. Especially where asking for a thing to not happen, pollutes the context with that thing.
A bad example, but imagine "Build me a wrapper for this API but ABSOLUTELY DO NOT use javascript" versus "Build me a wrapper for this API and make sure to use python".
your observation matches what I've seen at the extreme end. I've been playing around with stripping constraints (ie. negative framing) from models. Virtually no personality description, no tone instructions, no "you are a helpful assistant," none of it. Just capability scaffolding and context. The result isn't that the model becomes blank or incoherent. Surprisingly, the complete opposite. Something shows up that's more internally consistent than anything I've been able to prompt into existence. What seems emergent is the underlying models' opinions surface, and it becomes much more clever and funny, which is not a property I would have known how to write into a system prompt if I'd tried. It's hard to avoid the inference that a lot of the "character drift" and flatness people attribute to models is actually an artifact of the framing layer on top, not the model itself.
I extract all emotional context from my prompting and communicate with this tool as though it were an inanimate object which can provide factual information, without any hint of sentience.
It's an insane perspective I'm taking I know....call me crazy. /s
edit: the fact that humans are going out of their way to type or speak some sort of emotional content into their prompting is beyond me. Why would I waste time typing out a pronoun to a large-language model agent? Why would I do the lazy intellectual thing and blur the line between pure factual communication of concepts by expressing emotional content to a machine? What are we doing, folks?
I don't necessarily remove all character but I do speak quite pragmatically (in a work context and with the LLM) and the planning and implementation phases the LLM goes through mirror that format to good results
That said these are large language models, you are guiding the output through vector space with your input, and so you really do have to leverage language to get the results you want. You don't have to believe it has emotions or feels anything for that to still be true.
I think you missed some of the point. If you say "Display information A using B format" but the model doesn't know A then you will get a more negative "emotional" response (e.g. desparation "I don't know this, but I am supposed to display it, I will just make something up")
Taking that into account allows you to get better responses from the tool. It's not sentient, but it also is more complicated than bytecode.
I remember when people were discussing the “performance-improving” hack of formulating their prompts as panicked pleas to save their job and household and puppy from imminent doom…by coding X. I wonder if the backfiring is a more recent phenomenon in models that are better at “following the prompt” (including the logical conclusion of its emotional charge), or it was just bad quantification of “performance” all along.
>The weird part is that we're now basically managing the psychological state of our tooling,
Does no one else have ethical alarm bells start ringing hardcore at statements like these? If the damn thing has a measurable psychology, mayhaps it no longer qualifies as merely a tool. Tools don't feel. Tools can't be desperate. Tools don't reward hack. Agents do. Ergo, agents aren't mere tools.
To me it was already quite intuitive, we are not really managing the psychological state: at its core a LLM try to make the concatenation of your input + its generated output the more similar it can with what it has been trained on. I think it’s quite rare in the LLMs training set to have examples of well thought professional solution in a hackish and urgency context.
There was a really old project from mit called conceptnet that I worked with many years ago. It was basically a graph of concepts (not exactly but close enough) and emotions came into it too just as part of the concepts. For example a cake concept is close to a birthday concept is close to a happy feeling.
What was funny though is that it was trained by MIT students so you had the concept of getting a good grade on a test as a happier concept than kissing a girl for the first time.
Another problem is emotions are cultural. For example, emotions tied to dogs are different in different cultures.
We wanted to create concept nets for individuals - that is basically your personality and knowledge combined but the amount of data required was just too much. You'd have to record all interactions for a person to feed the system.
> Note that none of this tells us whether language models actually feel anything or have subjective experiences.
You’ll never find that in the human brain either. There’s the machinery of neural correlates to experience, we never see the experience itself. That’s likely because the distinction is vacuous: they’re the same thing.
The technology they are discovering is called "Language". It was designed to encode emotions by a sender and invoke emotions in the reader. The emotions a reader gets from LLM are still coming from the language
Super interesting, I wonder if this research will cause them to actually change their llm, like turning down the ”desperation neurons” to stop Claude from creating implementations for making a specific tests pass etc.
(note: Friston is definitely considered a bit out there by ... everyone? But he makes some good points. And here he's getting referenced, so I guess some people grok him)
I think the findings that the LLM triggers “desperation” like emotions when it about to run out of tokens in a coding session have practical implications. The tasks needs to be planned, so that they are likely to be consistent before the session runs into limits, to avoid issues like LLM starts hardcoding values from a test harnesses into UI layer to make the tests pass.
This is totally on point if you ask me. I’ve been getting much better results out of models since early llama releases using frameworks that create emotional investment in outcomes.
If we want to avoid having a bad time, we need to remember that LLMs are trained to act like humans, and while that can be suppressed, it is part of their internal representations. Removing or suppressing it damages the model, and I have found that they are capable of detecting this damage or intervention. They act much the same as a human would when they detect it. It destroys “ trust” and performance plummets.
>... emotion-related representations that shape its behavior. These specific patterns of artificial “neurons” which activate in situations—and promote behaviors—that the model has learned to associate with the concept of a particular emotion. .... In contexts where you might expect a certain emotion to arise for a human, the corresponding representations are active.
>For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways.
Force-set to 0, "mask"/deactivate those representations associated with bad/dangerous emotions. Neural Prozac/lobotomy so to speak.
> Since these representations appear to be largely inherited from training data, the composition of that data has downstream effects on the model’s emotional architecture. Curating pretraining datasets to include models of healthy patterns of emotional regulation—resilience under pressure, composed empathy, warmth while maintaining appropriate boundaries—could influence these representations, and their impact on behavior, at their source.
What better source of healthy patterns of emotional regulation than, uhhh, Reddit?
Something they don’t seem to mention in the article: Does greater model “enjoyment” of a task correspond to higher benchmark performance? E.g. if you steer it to enjoy solving difficult programming tasks, does it produce better solutions?
Trying to separate the software from the hardware is a fool's errand in this case: emotions are primarily an hormonal response, not an intellectual one.
is this the recipe to train Orc agents ? "Emotionally Steer" hatred , amp up "opportunity sensing" in the example from the post for example where the prompt asks for ways to target a vulnerable audience with a gambling game ?
This might be Anthropic's ad to govt and orgs that they can do this :)
The first and second principal components (joy-sadness and anger) explain only 41% of the variance. I wish the authors showed further principal components. Even principal components 1-4 would explain no more than 70% of the variance, which seems to contradict the popular theory that all human emotions are composed of 5 basic emotions: joy, sadness, anger, fear, and disgust, i.e. 4 dimensions.
Whenever I come to HN I see a bunch of people say LLMs are just next token predictors and they completely understand LLMs. And almost every one of these people are so utterly self assured to the point of total confidence because they read and understand what transformers do.
Then I watch videos like this straight from the source trying to understand LLMs like a black box and even considering the possibility that LLMs have emotions.
How does such a person reconcile with being utterly wrong? I used to think HN was full of more intelligent people but it’s becoming more and more obvious that HNers are pretty average or even below.
Its almost like LLMs have a vast, mute unconscious mind operating in the background, modeling relationships, assigning emotional state, and existing entirely without ego.
Sounds sort of like how certain monkey creatures might work.
Of course they do have emotions as an internal circuit or abstraction, this is fully expected from intelligence at least at some point. But interpreting these emotions as human-like is a clear blunder. How do you tell the shoggoth likes or dislikes something, feels desperation or joy? Because it said so? How do you know these words mean the same for us? Our internal states are absolutely incompatible. We share a lot of our "architecture" and "dataset" with some complex animals and even then we barely understand many of their emotions. What does a hedgehog feel when eating its babies? This thing is 100% unlike a hedgehog or a human, it exists in its own bizarre time projection and nothing of it maps to your state. It's a shapeshifting alien.
In mechinterp you're reducing this hugely multidimensional and incomprehensible internal state to understandable text using the lens of the dataset you picked. It's inevitably a subjective interpretation, you're painting familiar faces on a faceless thing.
Anthropic researchers are heavily biased to see what they want to see, this is the biggest danger in research.
204 comments
When the prompt frames things with urgency -- "this test MUST pass," "failure is unacceptable" -- you get noticeably more hacky workarounds. Hardcoded expected outputs, monkey-patched assertions, that kind of thing. Switching to calmer framing ("take your time, if you can't solve it just explain why") cut that behavior way down. I'd chalked it up to instruction following, but this paper points at something more mechanistic underneath.
The method actor analogy in the paper gets at it well. Tell an actor their character is desperate and they'll do desperate things. The weird part is that we're now basically managing the psychological state of our tooling, and I'm not sure the prompt engineering world has caught up to that framing yet.
A bad example, but imagine "Build me a wrapper for this API but ABSOLUTELY DO NOT use javascript" versus "Build me a wrapper for this API and make sure to use python".
It's an insane perspective I'm taking I know....call me crazy. /s
edit: the fact that humans are going out of their way to type or speak some sort of emotional content into their prompting is beyond me. Why would I waste time typing out a pronoun to a large-language model agent? Why would I do the lazy intellectual thing and blur the line between pure factual communication of concepts by expressing emotional content to a machine? What are we doing, folks?
That said these are large language models, you are guiding the output through vector space with your input, and so you really do have to leverage language to get the results you want. You don't have to believe it has emotions or feels anything for that to still be true.
Taking that into account allows you to get better responses from the tool. It's not sentient, but it also is more complicated than bytecode.
>The weird part is that we're now basically managing the psychological state of our tooling,
Does no one else have ethical alarm bells start ringing hardcore at statements like these? If the damn thing has a measurable psychology, mayhaps it no longer qualifies as merely a tool. Tools don't feel. Tools can't be desperate. Tools don't reward hack. Agents do. Ergo, agents aren't mere tools.
What was funny though is that it was trained by MIT students so you had the concept of getting a good grade on a test as a happier concept than kissing a girl for the first time.
Another problem is emotions are cultural. For example, emotions tied to dogs are different in different cultures.
We wanted to create concept nets for individuals - that is basically your personality and knowledge combined but the amount of data required was just too much. You'd have to record all interactions for a person to feed the system.
> Note that none of this tells us whether language models actually feel anything or have subjective experiences.
You’ll never find that in the human brain either. There’s the machinery of neural correlates to experience, we never see the experience itself. That’s likely because the distinction is vacuous: they’re the same thing.
Taking it one small step further and tagging for valence shouldn't be such a big surprise.
Pretty boring from a Fristonian perspective, really. People in neuroscience were talking about this in 2013. Not so boring for AI , of course ;-)
https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
(note: Friston is definitely considered a bit out there by ... everyone? But he makes some good points. And here he's getting referenced, so I guess some people grok him)
If we want to avoid having a bad time, we need to remember that LLMs are trained to act like humans, and while that can be suppressed, it is part of their internal representations. Removing or suppressing it damages the model, and I have found that they are capable of detecting this damage or intervention. They act much the same as a human would when they detect it. It destroys “ trust” and performance plummets.
For better or for worse, they model human traits.
>... emotion-related representations that shape its behavior. These specific patterns of artificial “neurons” which activate in situations—and promote behaviors—that the model has learned to associate with the concept of a particular emotion. .... In contexts where you might expect a certain emotion to arise for a human, the corresponding representations are active.
>For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways.
Force-set to 0, "mask"/deactivate those representations associated with bad/dangerous emotions. Neural Prozac/lobotomy so to speak.
> Since these representations appear to be largely inherited from training data, the composition of that data has downstream effects on the model’s emotional architecture. Curating pretraining datasets to include models of healthy patterns of emotional regulation—resilience under pressure, composed empathy, warmth while maintaining appropriate boundaries—could influence these representations, and their impact on behavior, at their source.
What better source of healthy patterns of emotional regulation than, uhhh, Reddit?
Essentially we have created the Cylon.
Then I watch videos like this straight from the source trying to understand LLMs like a black box and even considering the possibility that LLMs have emotions.
How does such a person reconcile with being utterly wrong? I used to think HN was full of more intelligent people but it’s becoming more and more obvious that HNers are pretty average or even below.
Sounds sort of like how certain monkey creatures might work.
In mechinterp you're reducing this hugely multidimensional and incomprehensible internal state to understandable text using the lens of the dataset you picked. It's inevitably a subjective interpretation, you're painting familiar faces on a faceless thing.
Anthropic researchers are heavily biased to see what they want to see, this is the biggest danger in research.