Day 1 of ARC-AGI-3 (symbolica.ai)

by lairv 76 comments 90 points


[−] lairv 50d ago
Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard

According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461

[−] fchollet 50d ago
It is 100% ARC-AGI-3 specific though, just read through the prompts https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
[−] boxed 50d ago
What a dick move. Open-sourcing that prompt probably means every lab that doesn't want to cheat will scrape it anyway and accidentally cheat in their next models.
[−] cxdorn 50d ago
(disclaimer: I worked on early versions of agentica_sdk, but wasn't involved in recent developments or the ARC solver)

As other comments point out, this is about harness development and harness efficiency. Agentica SDK is a sort of meta-harness that makes one thing easy: plug any "internal API" (as defined natively in your codebase) directly into your agent. Agentica SDK itself is not application-specific, but the APIs of your application are... application-specific.

Re: the linked prompt. A harness is a set of tools, descriptions of how best to use those tools, and sometimes some external control flow based on the outcomes of using them. How to "best use the tools" should always be part of the prompt (as in this case).

So this work tries to answer: "short of telling the agent any solutions, can you build a simple but efficient API for playing the games, hand it to the agent, and see how well it does?" In the world of harness development I think that's an interesting question to answer!
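To make the "tools + usage instructions + control loop" definition of a harness concrete, here is a minimal sketch in Python. All names (the toy tools, the scripted policy) are invented for illustration; this is not Agentica SDK's actual API, just the general shape the comment describes.

```python
# A harness, per the comment above: (1) a set of tools, (2) a prompt
# describing how to use them, (3) a control loop that routes the
# agent's tool calls. All names here are hypothetical.

def reset_game(state):
    """Tool: restart the current game level."""
    state["position"] = 0
    return state

def move(state, delta):
    """Tool: move the player by delta steps."""
    state["position"] += delta
    return state

TOOLS = {"reset_game": reset_game, "move": move}

SYSTEM_PROMPT = (
    "You can call these tools: " + ", ".join(TOOLS)
    + ". Call reset_game before your first move."
)

def run_agent(agent_policy, max_steps=10):
    """Control loop: ask the policy for a tool call, execute it, repeat."""
    state = {"position": 0}
    for _ in range(max_steps):
        call = agent_policy(SYSTEM_PROMPT, state)
        if call is None:  # the policy decides it is done
            break
        name, args = call
        state = TOOLS[name](state, *args)
    return state

# A trivial scripted "policy" standing in for an LLM:
script = iter([("reset_game", ()), ("move", (3,)), ("move", (2,)), None])
final = run_agent(lambda prompt, state: next(script))
print(final["position"])  # 5
```

The point of the sketch is that none of this encodes game solutions; whether the prompt part stays generic or leaks task knowledge is exactly what the thread is arguing about.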

[−] DetroitThrow 49d ago

>In the world of harness development I think that's an interesting question to answer!

The challenge isn't about harness development though, and a sufficiently complex harness can solve these tasks rather easily.

And presenting it as if you've made a novel development for solving ARC-AGI-3 leads me to believe you're willing to waste all of our time for your benefit at every step in the future.

[−] diwank 50d ago
this is so disingenuous on symbolica's part. these insincere announcements just make it harder for genuine attempts and novel ideas
[−] DetroitThrow 50d ago
Um, yes, this is an extremely specific benchmark harness. It encodes a ton of knowledge about the tasks at hand. The tweet is dishonest even in the best light.

The hard part of these tests isn't purely reasoning ability ffs.

[−] krackers 50d ago

> this uses a harness

This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

[−] osti 50d ago
Don't the chat versions of ChatGPT and Gemini also have interleaved tool calls? Do those count as having harnesses too?
[−] mmaunder 50d ago
We're calling agents harnesses now?
[−] stephantul 50d ago
The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.

What if you give Opus the same harness? Do people even care about meaningful comparisons anymore, or is it all just "numbers go up"?

[−] gslin 50d ago
https://en.wikipedia.org/wiki/Goodhart's_law

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

[−] modeless 50d ago
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
[−] mohsen1 50d ago
Uses the public dataset for evaluation, which is not what it's meant for. Writes a super-specific prompt[1] and claims eye-catching results.

This is the state of "AI" these days I guess...

[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

[−] padolsey 50d ago
Knowing the nature of a test ahead of time and building out your capabilities and tooling before entering the exam hall, when your peers don't have that advantage, makes you a cheater.
[−] andy12_ 50d ago
Apparently the score would be a little higher if scores weren't penalized for being worse than the human baseline while getting no reward for being better than it (which seems like an arbitrary decision; the human baseline is not optimal).
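The asymmetry described here amounts to clamping the score at the human baseline. This is a rough sketch of my reading of the comment, not the official ARC-AGI-3 scoring formula; the function name and the actions-based metric are assumptions for illustration.

```python
# Hypothetical clamped scoring: an agent that needs more actions than
# the human baseline is penalized proportionally, but an agent that
# beats the baseline gets no extra credit. Not the official formula.

def efficiency_score(agent_actions, human_actions):
    """Return a score in (0, 1], capped at 1.0 at the human baseline."""
    return min(1.0, human_actions / agent_actions)

print(efficiency_score(200, 100))  # worse than human: 0.5
print(efficiency_score(50, 100))   # better than human: still 1.0
```

Under a rule shaped like this, any improvement past the human baseline is invisible in the headline number, which is the commenter's complaint.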
[−] esafak 50d ago
Anybody used this Agentica of theirs?
[−] bytesandbits 50d ago
We constantly underestimate the power of inference scaffolding. I've seen it in every domain: coding, ASR, ARC-AGI benchmarks, you name it. Scaffolding can do a lot, and so can post-training. I'm confident current pre-trained models can score over 80% on this benchmark with the right post-training and scaffolding.

That said, I don't think ARC-AGI proves much. It's not a useful task in the wild; it's just a game, and a strange and confusing one. For me this is a pointless pseudo-academic exercise: good to have, but it by no means measures intelligence, much less the utility of a model.
[−] dsfadfasdf 47d ago
Can someone clarify whether image inputs are allowed, so VLMs can be used? I haven't been able to find that information anywhere.
[−] AbanoubRodolf 50d ago
[flagged]