For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?
Not related to this release, but is anyone aware of what's happening with DeepSeek? The usual cascade of synced releases has been missing this frontier-lab whale for a while now.