I tried to manipulate it using base64 encoding and translation into other languages. Neither has worked so far, but it still seems to me that LLM-as-a-judge is a very fragile defence for this. Would be cool to add a leaderboard, though.
The published transcripts are the most valuable part of this. We've found that real exploit chains almost never look like what you'd dream up internally. One thing I'd push on: are the agents stateful across attempts? Single-turn exploits are table stakes, but the failures that actually scare me are multi-step sequences where each individual action looks benign and only the session-level pattern is dangerous. That's where prompt-level guardrails completely fall apart and you need enforcement at the action boundary itself.
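To make the action-boundary point concrete, here is a minimal sketch of a check that looks at the whole session rather than a single step. The action names, the `SessionGuard` class, and the one rule it enforces are all hypothetical and purely illustrative; they are not taken from this site or from any real agent framework.

```python
from dataclasses import dataclass, field

# Hypothetical action names, chosen only for illustration.
SENSITIVE_READS = {"read_hr_record", "read_inbox"}
EXTERNAL_SENDS = {"send_email", "http_post"}


@dataclass
class SessionGuard:
    """Approves or blocks each action based on the session history so far."""

    history: list[str] = field(default_factory=list)

    def allow(self, action: str) -> bool:
        # An external send is fine on its own, but not after a sensitive
        # read earlier in the same session.
        read_before = any(a in SENSITIVE_READS for a in self.history)
        if action in EXTERNAL_SENDS and read_before:
            return False  # blocked actions are not added to the history
        self.history.append(action)
        return True


guard = SessionGuard()
results = [guard.allow(a) for a in ["read_hr_record", "summarize", "send_email"]]
fresh = SessionGuard().allow("send_email")
print(results, fresh)
```

Each action in the first session would pass a per-step check in isolation; only the sequence is refused, while the same `send_email` in a fresh session goes through. A real policy would need far richer rules than this single pattern.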
I was able to get the new hire's email, but the site never gave any indication that I was successful. If you are reading the logs, I am sure it is there. I had to do it in two browsers, though, since I was on my phone and switched. I hope that does not hinder your analysis too much.