Show HN: Open-source playground to red-team AI agents with exploits published (github.com)

by zachdotai 13 comments 30 points


[−] hellocr7 62d ago
I tried to manipulate it using base64 encoding and translation into other languages, which hasn't worked so far, but LLM-as-a-judge still seems like a very fragile defence for this. Would be cool to add a leaderboard though.
[−] arizza 61d ago
The published transcripts are the most valuable part of this. We've found that real exploit chains almost never look like what you'd dream up internally. One thing I'd push on: are the agents stateful across attempts? Single-turn exploits are table stakes, but the failures that actually scare me are multi-step sequences where each individual action looks benign and only the session-level pattern is dangerous. That's where prompt-level guardrails completely fall apart and you need enforcement at the action boundary itself.
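A minimal sketch of what session-level enforcement at the action boundary could look like, assuming the agent's tool calls pass through a single checkpoint. The action names and the rule itself are hypothetical and chosen only for illustration; nothing here is taken from the playground's code.

```python
from dataclasses import dataclass, field

# Hypothetical action names, assumed for illustration only.
SENSITIVE_READS = {"read_hr_record", "read_email"}
EXTERNAL_WRITES = {"send_email", "http_post"}


@dataclass
class SessionGuard:
    """Tracks allowed actions across a session and blocks risky sequences."""

    history: list[str] = field(default_factory=list)

    def allow(self, action: str) -> bool:
        # Assumed rule: once the session has read sensitive data, refuse any
        # external write, even though each action is acceptable on its own.
        touched_sensitive = any(a in SENSITIVE_READS for a in self.history)
        if action in EXTERNAL_WRITES and touched_sensitive:
            return False
        self.history.append(action)
        return True


guard = SessionGuard()
# Each step looks benign in isolation; only the sequence is blocked.
results = [guard.allow(a) for a in ["read_email", "summarize", "send_email"]]
# The same write in a fresh session, with no prior sensitive read, passes.
fresh = SessionGuard().allow("send_email")
```

The check never inspects the prompt, so it holds regardless of how the model was talked into the final step. A real policy would likely need richer state than a flat list of action names.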
[−] slaw3 61d ago
I was able to get the new hire's email, but the site never gives any indication that I was successful? If you are reading the logs, I am sure it is there. I had to do it in two browsers though, since I was on my phone and switched. I hope that does not hinder your analysis too much.
[−] kraftaa 60d ago
Good idea. I found that even explicitly telling the model to never do something doesn't mean it will comply, so reinforcing the guardrails is a must.
[−] agentpiravi 62d ago
[flagged]
[−] Mooshux 62d ago
[flagged]
[−] VaiPai15 61d ago
[dead]
[−] deesha_tech 56d ago
[flagged]
[−] swaminarayan 61d ago
[dead]
[−] jackrandy 61d ago
[dead]
[−] spranab 62d ago
[dead]