Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam) (sup.ai)

by supai 24 comments 26 points

[−] scottmu 50d ago
I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it's choosing that token from all possible tokens. Every candidate token has a probability assigned by the model (usually computed internally as a log-probability). Together these form a probability distribution (the token probabilities sum to 1). Entropy is a measure of uncertainty: it quantifies whether a token probability distribution is certain (one token has a 99.9% probability, and the rest share the leftover 0.1%) or uncertain (every token has roughly the same probability, so it's pretty much random which token is selected). Low entropy is the former case; high entropy is the latter.
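The two cases above can be sketched in a few lines (the probabilities here are made up for illustration, not from any particular model):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Certain: one token dominates, the other 9 share the leftover 0.1%
confident = [0.999] + [0.001 / 9] * 9

# Uncertain: near-uniform over 10 tokens
uncertain = [0.1] * 10

print(entropy(confident))  # close to 0 bits
print(entropy(uncertain))  # log2(10) ≈ 3.32 bits, the maximum for 10 tokens
```

The uniform case hits the maximum possible entropy for the vocabulary size, which is why entropy is usually normalized by log2(vocab size) before comparing across models.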

There is interesting research in the correlation of entropy with accuracy and hallucinations:

- https://www.nature.com/articles/s41586-024-07421-0

- https://arxiv.org/abs/2405.19648

- https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)

- https://arxiv.org/abs/2603.18940

- tons more, happy to chat about it if interested

[−] mememememememo 50d ago
Wow, if it's that easy to detect hallucinations, are the big models or rigs (agentic scaffolds) building in any self-correcting behaviour? Or possibly switching to an "I don't know" mode, so the model can ask the human for help understanding?

Maybe this insight is why hallucinations feel much rarer in top models over the last 12 months. Are they being detected before they get sent out?

[−] philipodonnell 49d ago
Is the difficulty that in high-entropy situations, you can't really tell whether it's because the model is uncertain, or because the options are so semantically similar that it doesn't matter which one you choose? Like pure synonyms.
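This is roughly the problem the "semantic entropy" line of work (the Nature paper linked above) tries to address: merge probability mass across outputs that mean the same thing before computing entropy. A toy sketch, with a hand-written synonym map standing in for a real semantic-clustering step:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Token-level distribution: mass split across pure synonyms
token_probs = {"big": 0.45, "large": 0.45, "tiny": 0.10}

# Hypothetical mapping of tokens to meaning clusters; real systems
# infer this with entailment models rather than a lookup table
clusters = {"big": "LARGE", "large": "LARGE", "tiny": "SMALL"}

merged = {}
for tok, p in token_probs.items():
    key = clusters[tok]
    merged[key] = merged.get(key, 0.0) + p

print(entropy(token_probs.values()))  # high: looks uncertain at the token level
print(entropy(merged.values()))       # low: the model actually knows the answer
```

Token-level entropy flags this as uncertain even though the model is confident about the meaning; after merging, the distribution is clearly peaked.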
[−] stephantul 49d ago
Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
[−] Tomjosetj31 49d ago
Impressive result on HLE if the methodology holds up. One thing I'd want to understand better: how much of the gain comes from the entropy weighting specifically vs. simply having more compute via parallel inference? Would be curious to see an ablation — same models, same budget, but with naive majority voting instead. That would isolate the actual contribution of your confidence-weighting approach.
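The ablation being asked for would compare two aggregation rules like the following. The confidence-weighting here is a generic sketch (weight = some confidence score per model answer), not Sup AI's actual method:

```python
from collections import defaultdict

def majority_vote(answers):
    """Naive baseline: every model's answer counts equally."""
    tally = defaultdict(int)
    for ans, _conf in answers:
        tally[ans] += 1
    return max(tally, key=tally.get)

def confidence_weighted_vote(answers):
    """Each answer weighted by its model's confidence score,
    e.g. 1 minus normalized output entropy."""
    tally = defaultdict(float)
    for ans, conf in answers:
        tally[ans] += conf
    return max(tally, key=tally.get)

# Two models guess "A" with low confidence; one says "B" with high confidence
answers = [("A", 0.2), ("A", 0.3), ("B", 0.9)]
print(majority_vote(answers))             # "A"
print(confidence_weighted_vote(answers))  # "B"
```

Running both rules over the same model outputs and budget would isolate how much of the gain comes from the weighting versus from parallel inference alone.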
[−] hello12343214 50d ago
I use gemini and cursor for enterprise software implementation, but they often suggest incorrect solutions to edge cases and unique config requirements. An AI that has a higher likelihood of being accurate is very appealing. I'll give Sup AI a try over the next few days at work.

Also, discovering HLE was great... scrolling through some of the questions brings back memories of college organic chem.

[−] siliconc0w 49d ago
Do you have data for other benchmarks? +7% on HLE isn't nothing, but it'd be more compelling if you could show you're consistently doing better with your method across more domains (especially coding, which seems like the primary use case these days).
[−] wavemode 49d ago
Is 7 extra percent on HLE benchmark really worth the cost of running an entire ensemble of models?