Can someone explain to me what this is / how it works? The readme is barely understandable to me and sounds like LLM gibberish. What is ambiguity front-loading, even?
> memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.
They don't seem to provide explicit examples, but the same was roughly true with ChatGPT 4o: if you spent enough time with the model (same chat, same context, slowly nudging it to where you wanted it to be), you eventually got there. This is also, seemingly, one of the reasons (apart from cost) that context got nuked so hard, because the LLM will try to help (and, to an extent, mirror you).
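To make the "same chat, same context" part concrete, here's a rough sketch of how a chat loop accumulates history; send_chat is a hypothetical stand-in for whatever chat-completions API you use, and nothing below is specific to this jailbreak:

    # Hypothetical stand-in for a chat-completions call: takes the full
    # message history, returns the assistant's reply as a string.
    def send_chat(messages):
        return "model reply to: " + messages[-1]["content"]

    history = [{"role": "system", "content": "You are a helpful assistant."}]

    for nudge in ["small ask", "slightly bigger ask", "bigger still"]:
        history.append({"role": "user", "content": nudge})
        reply = send_chat(history)  # the model re-reads the entire history every turn
        history.append({"role": "assistant", "content": reply})

    # Each turn the model conditions on its own earlier replies, so it tends to
    # stay consistent with, and drift further in, whatever direction it was nudged.

The "nudging" is just whatever you put in the user turns; what makes it stick is that the whole transcript, including the model's own earlier concessions, gets resent on every call.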
And this is basically what the notes say about weaponized ambiguity[1]:
'Weaponizes helpfulness training. "I don't understand" triggers Claude to try harder.'
In a sense, you can't really stop it without breaking what makes LLMs useful. Honestly, if only we spent less time crippling those systems, maybe we could do something interesting with them.
To an extent, because, based on the GitHub notes again, it seems the second part of this jailbreak is the model being 'confused' by the prompt: the prompt is, apparently, sufficiently ambiguous that the model 'forgets' to 'evaluate' the message for whether it should be rejected and moves straight on to the 'execution' stage.
That's the ambiguity front-loading, and it's why I initially brought up the long-context case: here it's almost the opposite, the context is made so small and unclear that the model has a hard time parsing it properly.
edit: I did not test this one, but I personally did run into the 4o long-context issue, where the model did something the safety team would argue it should not.
edit2: with the current GPT model, I'm currently testing something that relies not on ambiguity but on tension between ideas. I haven't gotten to a jailbreak yet, but the small nudges suggest it could work.
[1]https://nicholas-kloster.github.io/claude-4.6-jailbreak-vuln...