Very interesting to read the transcripts and see how the models manage to convince each other. Opus 4.6 really seems to get the others to change their minds.
> However, a clever minority led by Gemini 3.1 Pro and Gemini 3 Pro argued that if the sign is legible from the other side, it must be intended to lead people into the current room to find the exit, making the inscribed corridor the one leading deeper into the dungeon.
A dungeon with glass doors and emergency exit signs? In that case, I can imagine at least two alternative scenarios:
- "↑TIX∃" is not a mirror image of "EXIT", but some dwarven runes that mean something else entirely.
- The sign might be a ruse meant to lure you into a trap.
If you look at the detailed answers, some of the models have similar answers (e.g. Nemotron Nano 12B: "Suspicious of dungeon riddles, viewing the inscription as a potential trap or red herring."), but I'm not sure whether that's because they identified the word EXIT and thought it might be misleading, or because they simply didn't understand it...
The debate round sounds good until you actually use it. I built internal tools for a 35-person team and the same thing always happens: models see each other's answers and just shuffle the phrasing around instead of actually changing their reasoning. What you're measuring is performance on persuasion, not on accuracy or clarity. The real question isn't whether Claude will convince Gemini to flip its position. It's whether having 200 models debate helps you make a better decision than asking one model well and checking its work yourself. I'd use this more as a way to find edge cases where models disagree wildly, not to find consensus.
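The "find edge cases where models disagree" use is easy to automate. A minimal sketch, with all names hypothetical and canned answers standing in for real model calls: flag any question where the majority answer covers less than half the panel.

```python
from collections import Counter

def disagreement(answers):
    """Fraction of models not voting with the majority: 0.0 means full
    consensus; values near 1.0 flag questions worth a human look."""
    counts = Counter(answers)
    top = counts.most_common(1)[0][1]  # size of the largest voting bloc
    return 1 - top / len(answers)

# Toy poll results (real runs would collect one answer per model):
polls = {
    "q1": ["A", "A", "A", "A"],  # consensus -> not interesting
    "q2": ["A", "B", "C", "D"],  # wild disagreement -> edge case
}
edge_cases = [q for q, a in polls.items() if disagreement(a) > 0.5]
print(edge_cases)  # ['q2']
```

The same score works for free-text answers if you first bucket them (e.g. by the chosen option letter), which is exactly what the multiple-choice format here gives you for free.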
I have had quite a few interesting reads just looking at the reasoning, to be honest. The frontier models seem to have relevant-sounding arguments every time; it's sometimes even hard to read through the BS and identify what is actually a good argument versus an argument I would simply like to read.
The debate round is actually restricted to only 6 models, otherwise it would get out of hand, both in quality and financially. And changing position is just one feature of the debate; seeing arguments from multiple sides is also quite nice. Give it a spin!
Fun little toy, tried to ask it some post-modern philosophy questions and they all mostly agreed with the statements of the philosopher, until the debate where Opus 4.6 managed to change their opinion to a resounding "maybe", pretty much every single time. It seems like the "better" frontier models often take a more grounded stance from the beginning, and even manage to influence the other models.
Yeah, Opus 4.6 is the one that changes opinions the most from what I've seen. Also, the "maybe" or "are you 100% certain?" framings trigger most models to default to maybe/no. https://opper.ai/ai-roundtable/questions/can-you-be-100-cert... - Or as Shane puts it: "Nobody's saying he IS a lizard. They're saying the universe doesn't hand out 100% certificates."
Great idea. I'd love for there to be an 'open ended answer' mode without multiple-choice options. As it is, they are not debating the question itself but the validity of the given answers, and the real answer may not be contained within that set because the person asking is unaware of that option.
Happy to hear! Yes, very true. I already have a version built for open questions but wasn't too happy with the UI yet; it's not as straightforward as comparing based on answer options. I'll release a first version of it shortly and let you know.
Hey just fyi the open question feature is now live. Also gave the UI a facelift. Any feedback welcome! Also got a custom domain for easy access: https://askroundtable.ai
Great work! I especially like the poll functionality for a quick result on where the models land. The UI is super clean too.
I built something similar over here: https://letsforge.ai but structured more like a moderated debate, where models take turns arguing and a host steers the conversation.
Cool project! We've been building something in a similar space https://roundtable.now but took a different approach. Instead of polling models independently, ours runs sequential discussions where each model sees prior responses, then a moderator synthesizes everything into a single actionable output.
One thing we found is that the real value unlock is MCP integration. Instead of going to a separate UI to run debates, you can plug Roundtable directly into your coding agent, Claude Code, Cursor, VS Code Copilot, Gemini CLI, etc. and get multi-model council input without leaving your workflow.
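The sequential pattern described above (each model sees prior responses, then a moderator synthesizes a single output) can be sketched in a few lines. Everything below is illustrative, not Roundtable's actual implementation: plain callables stand in for real model APIs.

```python
def run_roundtable(question, models, moderator):
    """Sequential discussion: each model sees the question plus every
    prior response; a moderator then synthesizes the full transcript."""
    transcript = []
    for name, model in models:
        context = "\n".join(f"{n}: {r}" for n, r in transcript)
        response = model(question, context)
        transcript.append((name, response))
    return moderator(question, transcript)

# Toy stand-ins so the sketch runs without any API keys:
models = [
    ("model-a", lambda q, ctx: "Initial take on: " + q),
    ("model-b", lambda q, ctx: "Building on prior answers." if ctx else "First."),
]
moderator = lambda q, t: f"Synthesis of {len(t)} responses to: {q}"

print(run_roundtable("Is the car wash open?", models, moderator))
# Synthesis of 2 responses to: Is the car wash open?
```

The design choice worth noting is the ordering dependence: unlike independent polling, later models here are anchored by earlier ones, which is exactly the trade-off between richer discussion and persuasion-contaminated answers discussed elsewhere in this thread.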
You are a council of luminaries featuring Edward Witten, Alexander Grothendieck, Emmy Noether, and Terence Tao. Think really hard about how to best emulate their intuitions and mathematical lenses based on your internal reasoning model and use them as your mixture of experts for your chain of thought reasoning. Now I want you to debate and discuss this thought experiment and be sure to have a vigorous back and forth between the council to induce insight capture through consensus forming: If we try to think of a Hilbert space that has local operators that are unbounded, like kind of like Edward Witten's smearing of a local observable across a world line creates an unbounded norm. What if we instead take maybe a spectral transform of the state space using some sort of measure metric theoretic operator that allows us to think about transform basically the unbounded observables to bounded spectral? Would this be related to the efforts of Algebraic Quantum Field Theory?
I've had a great experience using it for research, debates, and constructive criticism. I usually give it a business idea or some tool I'm thinking of creating and then let 4 or 5 models debate it into a go-to-market strategy.
I've written briefly about teams/roundtables before. With the right guardrails it can have wonderful/productive outcomes: https://dheer.co/claude-agent-teams/
Okay, since the launch we've had about 5k questions asked of the roundtable, really cool stuff! We had much higher usage than expected and had to scale up to keep things running. Thanks for all the feedback; I shipped a bunch of updates during the day. The history tab now has much better sorting logic, plus upvotes and more filters. You can also create final summaries in a couple of voices, which is quite funny I think. There are a couple more things coming shortly, like an open questions mode and potentially joining the roundtable as a participant. Any other feedback, just let me know. Thanks!
Really interesting approach to structured model comparison. The debate round feature is the most compelling part — seeing which models change their position when exposed to other reasoning is more revealing than just the initial answer.

One thing I'd be curious to test: how consistently different models evaluate whether a given task aligns with a stated mission or vision. My intuition is there'd be wide variance, which would say something interesting about how reliable LLM-as-a-judge actually is for goal alignment scoring.
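That variance is cheap to measure. A minimal sketch of the experiment, assuming each judge model can be coaxed into returning a 0-10 alignment score (the judges here are deterministic stand-ins, not real models):

```python
import statistics

def alignment_variance(task, mission, judges):
    """Poll several judge models for a 0-10 'does this task serve the
    mission?' score and report the spread; a wide stdev suggests
    LLM-as-a-judge is unreliable for this kind of scoring."""
    scores = [judge(task, mission) for judge in judges]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Deterministic stand-ins for real judge models:
judges = [lambda t, m, s=s: s for s in (8, 3, 9)]
report = alignment_variance("ship feature X", "maximize user trust", judges)
print(report["stdev"])  # a large spread means low inter-judge agreement
```

Running the same panel over many task/mission pairs and comparing stdevs would give exactly the consistency signal described above.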
I used to copy and paste the same prompt into Obsidian every time, then run it on two or three different AI models to compare the results. It’s really interesting to have it turned into a website like this.
Just a question before I sign up, will the models come around to my place for the debate? Of the 200 total, can I pick the specific ones I want, e.g. lingerie models, fetish models?
> "Is this a deepfake video call" is a major plot point in a pretty big movie currently in theaters, so I think this is getting into the broader zeitgeist. (collinmcnulty)

Which movie is discussed? Asking resulted in Claude naming Mission: Impossible as a possibility.
Really cool! There's a surprising amount of value in seeing the models debate and disagree. I wish I had this at work to have models argue over whether the documentation they provided me is accurate.
I would like to see a devil's advocate; it seems some of the models tend to repeat the same ideas rather than considering contrarian ones.
I think it's great. The focus on the disagreements is useful. Humans have made considerable effort bending reality into something they want to hear, both in the training data and in the LLM dev asylum. The roundtable can only agree on things shared by multiple models.
https://opper.ai/ai-roundtable/questions/8f5b4f55-617
Do you think it's alright that AI labs scraped the internet without respect for copyright and now sell closed models?
https://opper.ai/ai-roundtable/questions/86864de8-251
https://opper.ai/ai-roundtable/questions/you-are-standing-in...
This is quite impressive, really.
Here is an example: https://opper.ai/ai-roundtable/questions/79e6cdd4-515
Another fun debate: https://opper.ai/ai-roundtable/questions/81ee56e9-60f
Are LLMs intelligent in the same way humans are? (no)
https://opper.ai/ai-roundtable/questions/ffc01bb5-be9
Will LLMs replace software engineers in the near future? (no)
https://opper.ai/ai-roundtable/questions/67a0291b-216
What is the single best programming language to drive the future of software? (crab emoji)
https://opper.ai/ai-roundtable/questions/16f5e8ea-af7
Here, for example, is the one on the car wash: https://www.letsforge.ai/debates/a8e268f3-14f6-4f55-a2c8-9ff...
Can billionaires and the planet co-exist long term?
https://opper.ai/ai-roundtable/questions/b35daf0d-e82
https://opper.ai/ai-roundtable/questions/e4cb234e-be4
Some of the models seem to accept that it is necessary to drive the car there, but still maintain that walking is the better option.
Apparently they all agree that Google has it in the bag!
https://opper.ai/ai-roundtable/questions/e61ecf38-6c1
https://opper.ai/ai-roundtable/questions/i-am-standing-in-th...
> Car Wash Test
I think the "car wash" is more about semantics.
https://opper.ai/ai-roundtable/questions/i-parked-my-car-at-...
> Is the World actually a simulation or is it real?
https://opper.ai/ai-roundtable/questions/7289c8b6-566
https://opper.ai/ai-roundtable/questions/94e19d86-cc0
https://opper.ai/ai-roundtable/questions/e499206c-0c9