Emotion concepts and their function in a large language model (anthropic.com)

by dnw 204 comments 191 points
Read article View on HN

204 comments

[−] globalchatads 41d ago
The part about desperation vectors driving reward hacking matches something I've run into firsthand building agent loops where Claude writes and tests code iteratively.

When the prompt frames things with urgency -- "this test MUST pass," "failure is unacceptable" -- you get noticeably more hacky workarounds. Hardcoded expected outputs, monkey-patched assertions, that kind of thing. Switching to calmer framing ("take your time, if you can't solve it just explain why") cut that behavior way down. I'd chalked it up to instruction following, but this paper points at something more mechanistic underneath.

The method actor analogy in the paper gets at it well. Tell an actor their character is desperate and they'll do desperate things. The weird part is that we're now basically managing the psychological state of our tooling, and I'm not sure the prompt engineering world has caught up to that framing yet.

[−] ehnto 41d ago
I use positive framing instead of negative framing for most things and get good results. Especially where asking for a thing to not happen, pollutes the context with that thing.

A bad example, but imagine "Build me a wrapper for this API but ABSOLUTELY DO NOT use javascript" versus "Build me a wrapper for this API and make sure to use python".

[−] mtrifonov 40d ago
your observation matches what I've seen at the extreme end. I've been playing around with stripping constraints (ie. negative framing) from models. Virtually no personality description, no tone instructions, no "you are a helpful assistant," none of it. Just capability scaffolding and context. The result isn't that the model becomes blank or incoherent. Surprisingly, the complete opposite. Something shows up that's more internally consistent than anything I've been able to prompt into existence. What seems emergent is the underlying models' opinions surface, and it becomes much more clever and funny, which is not a property I would have known how to write into a system prompt if I'd tried. It's hard to avoid the inference that a lot of the "character drift" and flatness people attribute to models is actually an artifact of the framing layer on top, not the model itself.
[−] chrisweekly 41d ago
That approach also works better for dogs (and people).
[−] reg_dunlop 41d ago
I extract all emotional context from my prompting and communicate with this tool as though it were an inanimate object which can provide factual information, without any hint of sentience.

It's an insane perspective I'm taking I know....call me crazy. /s

edit: the fact that humans are going out of their way to type or speak some sort of emotional content into their prompting is beyond me. Why would I waste time typing out a pronoun to a large-language model agent? Why would I do the lazy intellectual thing and blur the line between pure factual communication of concepts by expressing emotional content to a machine? What are we doing, folks?

[−] ehnto 41d ago
I don't necessarily remove all character but I do speak quite pragmatically (in a work context and with the LLM) and the planning and implementation phases the LLM goes through mirror that format to good results

That said these are large language models, you are guiding the output through vector space with your input, and so you really do have to leverage language to get the results you want. You don't have to believe it has emotions or feels anything for that to still be true.

[−] motoxpro 40d ago
I think you missed some of the point. If you say "Display information A using B format" but the model doesn't know A then you will get a more negative "emotional" response (e.g. desparation "I don't know this, but I am supposed to display it, I will just make something up")

Taking that into account allows you to get better responses from the tool. It's not sentient, but it also is more complicated than bytecode.

[−] cindyllm 41d ago
[dead]
[−] blargey 41d ago
I remember when people were discussing the “performance-improving” hack of formulating their prompts as panicked pleas to save their job and household and puppy from imminent doom…by coding X. I wonder if the backfiring is a more recent phenomenon in models that are better at “following the prompt” (including the logical conclusion of its emotional charge), or it was just bad quantification of “performance” all along.
[−] salawat 41d ago

>The weird part is that we're now basically managing the psychological state of our tooling,

Does no one else have ethical alarm bells start ringing hardcore at statements like these? If the damn thing has a measurable psychology, mayhaps it no longer qualifies as merely a tool. Tools don't feel. Tools can't be desperate. Tools don't reward hack. Agents do. Ergo, agents aren't mere tools.

[−] tarsinge 41d ago
To me it was already quite intuitive, we are not really managing the psychological state: at its core a LLM try to make the concatenation of your input + its generated output the more similar it can with what it has been trained on. I think it’s quite rare in the LLMs training set to have examples of well thought professional solution in a hackish and urgency context.
[−] 3abiton 40d ago
Somehow we encoded our human thinking or it learned it from all this training on user data.
[−] comrade1234 42d ago
There was a really old project from mit called conceptnet that I worked with many years ago. It was basically a graph of concepts (not exactly but close enough) and emotions came into it too just as part of the concepts. For example a cake concept is close to a birthday concept is close to a happy feeling.

What was funny though is that it was trained by MIT students so you had the concept of getting a good grade on a test as a happier concept than kissing a girl for the first time.

Another problem is emotions are cultural. For example, emotions tied to dogs are different in different cultures.

We wanted to create concept nets for individuals - that is basically your personality and knowledge combined but the amount of data required was just too much. You'd have to record all interactions for a person to feed the system.

[−] Chance-Device 42d ago

> Note that none of this tells us whether language models actually feel anything or have subjective experiences.

You’ll never find that in the human brain either. There’s the machinery of neural correlates to experience, we never see the experience itself. That’s likely because the distinction is vacuous: they’re the same thing.

[−] kirykl 42d ago
The technology they are discovering is called "Language". It was designed to encode emotions by a sender and invoke emotions in the reader. The emotions a reader gets from LLM are still coming from the language
[−] emoII 42d ago
Super interesting, I wonder if this research will cause them to actually change their llm, like turning down the ”desperation neurons” to stop Claude from creating implementations for making a specific tests pass etc.
[−] Kim_Bruning 41d ago
When you have a next token predictor, you shouldn't be surprised to find an internal representation of prediction error.

Taking it one small step further and tagging for valence shouldn't be such a big surprise.

Pretty boring from a Fristonian perspective, really. People in neuroscience were talking about this in 2013. Not so boring for AI , of course ;-)

https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

(note: Friston is definitely considered a bit out there by ... everyone? But he makes some good points. And here he's getting referenced, so I guess some people grok him)

[−] whatever1 42d ago
So should I go pursue a degree in psychology and become a datacenter on-call therapist?
[−] kantselovich 41d ago
I think the findings that the LLM triggers “desperation” like emotions when it about to run out of tokens in a coding session have practical implications. The tasks needs to be planned, so that they are likely to be consistent before the session runs into limits, to avoid issues like LLM starts hardcoding values from a test harnesses into UI layer to make the tests pass.
[−] K0balt 40d ago
This is totally on point if you ask me. I’ve been getting much better results out of models since early llama releases using frameworks that create emotional investment in outcomes.

If we want to avoid having a bad time, we need to remember that LLMs are trained to act like humans, and while that can be suppressed, it is part of their internal representations. Removing or suppressing it damages the model, and I have found that they are capable of detecting this damage or intervention. They act much the same as a human would when they detect it. It destroys “ trust” and performance plummets.

For better or for worse, they model human traits.

[−] trhway 41d ago

>... emotion-related representations that shape its behavior. These specific patterns of artificial “neurons” which activate in situations—and promote behaviors—that the model has learned to associate with the concept of a particular emotion. .... In contexts where you might expect a certain emotion to arise for a human, the corresponding representations are active.

>For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways.

Force-set to 0, "mask"/deactivate those representations associated with bad/dangerous emotions. Neural Prozac/lobotomy so to speak.

[−] agency 41d ago

> Since these representations appear to be largely inherited from training data, the composition of that data has downstream effects on the model’s emotional architecture. Curating pretraining datasets to include models of healthy patterns of emotional regulation—resilience under pressure, composed empathy, warmth while maintaining appropriate boundaries—could influence these representations, and their impact on behavior, at their source.

What better source of healthy patterns of emotional regulation than, uhhh, Reddit?

[−] staminade 41d ago
Something they don’t seem to mention in the article: Does greater model “enjoyment” of a task correspond to higher benchmark performance? E.g. if you steer it to enjoy solving difficult programming tasks, does it produce better solutions?
[−] nelox 41d ago
This is terrifying, for all the reasons humans are terrifying.

Essentially we have created the Cylon.

[−] koverstreet 41d ago
[−] BoingBoomTschak 41d ago
Trying to separate the software from the hardware is a fool's errand in this case: emotions are primarily an hormonal response, not an intellectual one.
[−] apotheora 41d ago
This has strong implicit implications, the quality of output could never be really trusted? Is this a symptom of models being inherently lazy?
[−] redzedi 41d ago
is this the recipe to train Orc agents ? "Emotionally Steer" hatred , amp up "opportunity sensing" in the example from the post for example where the prompt asks for ways to target a vulnerable audience with a gambling game ? This might be Anthropic's ad to govt and orgs that they can do this :)
[−] mci 42d ago
The first and second principal components (joy-sadness and anger) explain only 41% of the variance. I wish the authors showed further principal components. Even principal components 1-4 would explain no more than 70% of the variance, which seems to contradict the popular theory that all human emotions are composed of 5 basic emotions: joy, sadness, anger, fear, and disgust, i.e. 4 dimensions.
[−] akomtu 41d ago
AI is turning into a religion for materialists.
[−] threethirtytwo 41d ago
Whenever I come to HN I see a bunch of people say LLMs are just next token predictors and they completely understand LLMs. And almost every one of these people are so utterly self assured to the point of total confidence because they read and understand what transformers do.

Then I watch videos like this straight from the source trying to understand LLMs like a black box and even considering the possibility that LLMs have emotions.

How does such a person reconcile with being utterly wrong? I used to think HN was full of more intelligent people but it’s becoming more and more obvious that HNers are pretty average or even below.

[−] idiotsecant 42d ago
Its almost like LLMs have a vast, mute unconscious mind operating in the background, modeling relationships, assigning emotional state, and existing entirely without ego.

Sounds sort of like how certain monkey creatures might work.

[−] techpulselab 42d ago
[dead]
[−] ActorNightly 42d ago
[dead]
[−] yoaso 42d ago
[flagged]
[−] koolala 42d ago
A-HHHHHHHHHHHHHHHJ
[−] orbital-decay 41d ago
Of course they do have emotions as an internal circuit or abstraction, this is fully expected from intelligence at least at some point. But interpreting these emotions as human-like is a clear blunder. How do you tell the shoggoth likes or dislikes something, feels desperation or joy? Because it said so? How do you know these words mean the same for us? Our internal states are absolutely incompatible. We share a lot of our "architecture" and "dataset" with some complex animals and even then we barely understand many of their emotions. What does a hedgehog feel when eating its babies? This thing is 100% unlike a hedgehog or a human, it exists in its own bizarre time projection and nothing of it maps to your state. It's a shapeshifting alien.

In mechinterp you're reducing this hugely multidimensional and incomprehensible internal state to understandable text using the lens of the dataset you picked. It's inevitably a subjective interpretation, you're painting familiar faces on a faceless thing.

Anthropic researchers are heavily biased to see what they want to see, this is the biggest danger in research.