Is it just me, or do they very carefully avoid reporting performance on GPT-5.4 Pro, only the default GPT-5.4? They also very carefully left Anthropic models out of their comparison.
I went back to the BixBench benchmark which they mentioned. I couldn't find official results for Anthropic models, but I found a project taking Opus 4.6 from 65.3% to 92.0% (which would be above GPT-Rosalind) with nearly 200 carefully crafted skills [1]. There also appear to be competitor models with scores on par with this tuned GPT.
BixBench seems like a really interesting/useful idea, but most of the value for a layperson (like me) is in comparing the results of different models on the benchmark. From what I can find, there is no centralised, regularly updated set of model results. Shame.
> GPT‑Rosalind is now available … for qualified customers …
It’s kind of gross to make money off her name (if that’s what’s happening) posthumously. It’s a complicated story anyway. IIRC her sister referred to it as “the Cult of Rosalind” when people were cashing in on books about her.
I'd rather the AI companies make up names, or name their products things like "Clod" than use my name (if they were to ask) - as no matter how good it looks today eventually it'll be some form of laughingstock.
At least GPT is pretty "unique" and they've not polluted search (except for those looking for the GUID Partition Table, RIP).
Any name you pick will immediately override anything that comes before - naming a model Socrates would confuse searches, for example (and it's why I hate the rename of iTunes to "Music" which is a generic term!).
For me too, it was around that time last year, with GPT-5, Claude Sonnet 4.5 and then Gemini 3 that I started feeling that these models are clearly becoming great at reasoning. I'm not at all opposed to saying that they are around PhD-level on at least some domains.
I work for a life sciences company. It will be a long time before anyone trusts a generative model to do the actual science when mathematically provable models are as good as they are today. There is room for AI in the field, but it's not in the science directly.
The voiceover in the promo video on this page seems to be AI generated, with some weird artifacts. Right at the beginning it sounds like it says "cormbiying structure daya retrieval and lirrachure search".
While this model set (GPT-Rosalind) is limited to certain organizations, the announcement also included the release of a Life Sciences Plugin, which is more broadly available on Codex [1].
If you have something like this, how about demonstrating a way to really help, and demonstrating (as opposed to claiming) what it can do? Make a cheap vaccine against the new resistant forms of TBC, or if you truly want to impress, against HIV. DON'T get it approved at all, just publish how it would work, maybe with a simulation (so it can't be patented). This shouldn't even be so hard; it's just really hard to make money on either of those vaccines, as rich first-world countries have little need for them (HIV, perhaps, but vaccines don't make much money; a TBC vaccine definitely doesn't), so you're not "getting in the way of business" by doing it.
Why? AI's reputation would be greatly improved by saving a few 10s of millions of lives (per year, I might add). And either of those advances would do just that.
Oh, and another reason. Do either of these things and you'll have very rich businesses screaming to become your customer coming out of every hole. Guaranteed.
[1] https://github.com/jaechang-hits/SciAgent-Skills
Sam Altman, August 2025
https://www.bbc.com/news/articles/cy5prvgw0r1o
[1] https://github.com/openai/plugins/tree/main/plugins/life-sci...