> Question: A fair die rolling a 6 twice in a row is more likely than rolling 1-2-3-4-5-6 in sequence
Two 6s in a row is 1/36 chance (1/6)^2
1-2-3-4-5-6 is a 1/46656 chance (1/6)^6
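The arithmetic above is easy to verify with exact fractions (a quick sketch, not anything from the site itself):

```python
# Quick check of the two probabilities from the question above.
from fractions import Fraction

p_two_sixes = Fraction(1, 6) ** 2   # a 6, then another 6
p_sequence = Fraction(1, 6) ** 6    # 1-2-3-4-5-6 in exact order

print(p_two_sixes)              # 1/36
print(p_sequence)               # 1/46656
print(p_two_sixes > p_sequence) # True: two 6s in a row is far more likely
```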
Website is claiming they are the same probability:
> Same probability: 1/46,656 — Both outcomes have exactly the same probability: (1/6)^6 = 1/46,656. This illustrates the representativeness heuristic — random-looking sequences feel more probable than ordered ones.
Website's "answer" is wrong: was the question supposed to be rolling a 6 six times in a row?

Yeah, most likely it was trying to identify a bias of human perception: that 1-2-3-4-5-6 would feel more probable than six 6s in a row.
A better way to illustrate this bias is with coin flips. People will tell you that the odds of 6 heads are rarer than the odds of 3 tails followed by 3 heads. The difficulty is understanding whether they mean "in order" or "as a group".
If it's in order, the odds are the same: every ordering of H/T has the same probability, but humans see "all heads" and think that's rarer. The important bit is whether the ordering is clearly understood.
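The "in order" vs. "as a group" distinction can be made concrete by enumerating all 6-flip sequences (a small illustrative sketch, not the site's code):

```python
# Enumerate all 2^6 = 64 equally likely sequences of 6 coin flips.
from itertools import product
from fractions import Fraction

flips = list(product("HT", repeat=6))
p_any_one = Fraction(1, len(flips))

# "In order": any single specific sequence has the same probability,
# so HHHHHH and TTTHHH are equally likely.
print(p_any_one)  # 1/64

# "As a group": exactly 3 heads in ANY order is much more likely,
# because 20 of the 64 sequences qualify.
n_three_heads = sum(1 for s in flips if s.count("H") == 3)
print(Fraction(n_three_heads, len(flips)))  # 20/64 = 5/16
```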
You're right, that's a mistake in how I phrased the question. It should say "six times in a row" not "twice in a row". Fixing it now! Thanks for pointing that out!
Maybe I don't know enough about "calibration" in a technical sense, but it seems like this quiz can't really distinguish between factual knowledge and calibration skill?
Is this type of quiz reproducible for individuals and across various cross-sections of the population?
Are there studies on this? Is the quiz based on these studies?
There’s a bias, I think. When I saw the title that is about how bad I’m at estimating, I’ve leaned towards counterintuitive answers. This got me quite a high score. I think test set should also include intuitive facts (or maybe I was just lucky).
I think this might be conflating confidence with accuracy. I tried leaving the slider in the middle (nominally the least confident position) and it gave a score of 0.25 and diagnosed it as 'overconfident'.
Apologies if this is off-topic, but having spent more time than I'd like to admit creating and editing webapps that emerged entirely out of Claude Code, Cursor, Codex, etc. with minimal to no direct code-writing by their human subscribers, this website has strong AI smells:
- Inter font
- all caps section headers
- Lucide icons
- em dashes, of course the em dashes
- bubble status badges (of course with all-caps "IN PROGRESS" and "COMING SOON" that mean the same thing)
- Uncited claims like "Most founders are overconfident in the 70-90% range" and "Most people score between 0.20 and 0.30"
- No less than FOUR blog articles all published April 4
None of these points is by any means a dealbreaker. And after all, I suppose a product should be judged on its merits and the value it delivers to its users, not on the tools used to create it. But together, the frontend bears the unmistakeable generative AI "smell" that telegraphs that the human(s) directing the tools building this app might be optimizing for speed over rigor and quality (further supported by the volunteer QA/QC happening in the comments), and may only be as good and reliable as the uncritically accepted outputs of a $20/month coding assistant.
Wait, so roughly is it rewarding being confident when correct, and penalizing being confident when wrong? Meaning that the highest score is only achievable if you answer fully confident true or false, and get all 10 correct?
If so, isn't that conflating knowledge with over/under confidence?
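Assuming the quiz uses the standard Brier formula (the mean of squared differences between stated confidence and the 0/1 outcome, which is not confirmed anywhere on the site), the scoring behaves exactly as described: full confidence plus full correctness is the only way to reach 0, and a slider stuck at 50% yields a flat 0.25 no matter what.

```python
# Hypothetical standard Brier score: mean of (confidence - outcome)^2,
# where outcome is 1 if the statement was true, else 0.
def brier(forecasts):
    return sum((p, o) and (p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Fully confident and always right -> perfect 0.
print(brier([(1.0, 1), (0.0, 0)]))  # 0.0
# Always answering 50% -> a flat 0.25, regardless of outcomes.
print(brier([(0.5, 1), (0.5, 0)]))  # 0.25
# Fully confident and wrong is punished hardest.
print(brier([(1.0, 0)]))            # 1.0
```

Note the middle case matches the 0.25 another commenter got by leaving the slider centered.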
Brier scoring works on questions with cheap, fast resolution; the strategic decisions you mention (hiring, equipment, big purchases) resolve over months or years, often ambiguously, and the counterfactual never resolves at all. Curious whether the calibration gains from the rapid-feedback quiz actually transfer to the slow-feedback domains the tool is designed to help with, or whether it ends up training a slightly different skill. A second thing: most of my strategic decisions weren't solo, and once one calibrated person sits in a room with two louder uncalibrated ones, the calibration math stops being load-bearing. Have you thought about a team variant?
I’ve taken the quiz but not been compelled to sign up. The site feels manipulative, e.g., the “show me all the questions” link is tiny and hidden between two larger boxes, and even then it only shows 2 questions with a signup CTA. Maybe that’s best-practice growth hacking these days, but to me it’s a manipulative turnoff. If you’d given me all the questions and answers simply, then I would have signed up for more, especially with the discount code. Otherwise, how am I supposed to even know what I’m signing up for? Every interaction I’ve had with the site so far is a sales attempt, so mostly I expect more of those.
Having previously spent a reasonable amount of time on Metaculus I’m familiar with Brier scores and rating my confidence. I assume that’s how I was able to get better than average results. It’s an interesting app.
It’s something I’m interested in improving on, as well as predictions in general. I saw you suggested Thinking, Fast and Slow, and I’ve skimmed through some of Superforecasting. Metaculus has a bunch of resources too.
Quick update: 1,934 quiz completions, 44.5% scored overconfident. Most interesting finding was that the quiz itself got more engagement than the product behind it. Added educational tooltips, a public roadmap with voting, and UTM tracking based on feedback here and from users that reached out directly!
I didn't find the questions very representative of estimation ability. That is, maybe if you happen to know the many random rote facts they were based on, then applying them might be a relevant test of estimation. I really felt more like I was making uneducated guesses (0.155). I suppose I was expecting more "ping pong balls in airplanes"-style questions.
Made a few changes based on feedback from this thread: full results now shown immediately with no email gate, changed the UX to include true/false/uncertain buttons + a confidence slider, I cleaned up the quiz result page, and fixed the die probability question. Thanks for all the honest feedback!
Update at 2 hours: 1350+ quiz takers! 50% overconfident, 40% well-calibrated, and 10% underconfident. The average score is around 0.228, with the best score still at 0.007 (nearly perfect). The pattern so far is people are most overconfident in the 70-90% range, but are right closer to ~55% of the time.
Update: 400+ quiz takers now... insane. Best Brier score so far is 0.007 (nearly perfect calibration). The worst came in at 0.600. Average is 0.230, still just better than a coin flip. Where did you land?
Interesting data from the quiz so far: 160+ quiz takers! The average is 0.239 (barely better than a coin flip at 0.25), but almost everyone indicates they are confident in their answers.
The slider disappearing when sliding between extremes is very confusing. I think the slider should be the only thing displayed, and the buttons should be removed entirely.
It is very disappointing that you can't see what you got right or wrong without giving out your email. I'm not even sure what one would learn from the email, or whatever the calibration result is.
I'm happy for you if it works but I sure feel cheated. I hope others also feel it's against the spirit of a Show HN. But maybe it's just me.
I unsubscribe from mails that aren't useful to me day-to-day because they're distracting.
Other than that it seems like a cool idea. I'd recommend slightly bigger fonts. I often have this issue with Gemini.
https://www.metaculus.com/help/prediction-resources/
https://taketest.xyz/confidence-calibration
The same site also has something with a fixed confidence level: https://taketest.xyz/ci-calibration
>0.188
Slightly above avg - yay
Heck yeah.
> Your Calibration Results: 9/10 correct direction
> Brier Score: 0.131 (lower is better; 0 = perfect)
> Diagnosis: Well Calibrated. Strong score. You were right more often than your confidence suggested. Trust your gut more.