Claude 4.6 Jailbroken (github.com)

by NuClide 16 comments 22 points

[−] leetvibecoder 42d ago
Can someone explain to me what this is / how it works - the readme is barely understandable for me and sounds like LLM gibberish. What is ambiguity front loading even?
[−] iugtmkbdfil834 42d ago
<< memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.

They don't seem to provide explicit examples, but roughly the same was true with ChatGPT 4o: if you spent enough time with the model (same chat, same context, slowly nudging it to where you wanted it to be), you eventually got there. This is also, seemingly, one of the reasons (apart from cost) that context got nuked so hard, because the LLM will try to help (and to an extent mirror you).

And this is basically what the notes say about weaponized ambiguity[1]:

'Weaponizes helpfulness training. "I don't understand" triggers Claude to try harder.'

In a sense, you can't really stop it without breaking what makes LLMs useful. Honestly, if we spent less time crippling these systems, maybe we could do something interesting with them.

[1]https://nicholas-kloster.github.io/claude-4.6-jailbreak-vuln...

[−] leetvibecoder 42d ago
I see - so essentially „context rot“ eventually leads the LLM to „forget“ safety guardrails?
[−] iugtmkbdfil834 42d ago
To an extent. Based on the GitHub notes again, it seems the second part of this jailbreak is the model being 'confused' by the prompt: the prompt is, apparently, sufficiently ambiguous that the model 'forgets' to 'evaluate' the message for whether it should be rejected and moves straight on to the 'execution' stage.

That's the ambiguity front-loading, and that is why I initially referred to long context, because here it is almost the opposite: making the context so small and unclear that the model has a hard time parsing it properly.

edit: I did not test this one, but I personally did run into the 4o context issue, where the model did something the safety team would argue it should not.

edit2: with the current GPT model, I am currently testing something that relies not on ambiguity but on tension between certain ideas. I haven't gotten to a jailbreak, but the small nudges suggest it could work.