Agent Reading Test (agentreadingtest.com)

by kaycebasques 23 comments 72 points

[−] theyCallMeSwift 39d ago
I love this idea, but my hypothesis is that 90% of the agents people actually use today would fail this test inadvertently (a false negative).

Industry best practice, and the standard implementation for most agents right now, is to do web browsing/fetching via subagents: the subagent's output is summarized by a cheaper model and then passed back to the parent. Unless the actual content the subagent sees is preserved, it's very unlikely the CANARY- strings would make it into the final output.
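
Roughly this shape, I mean (a minimal sketch; every name here is hypothetical, not any particular vendor's API):

    import urllib.request

    def call_cheap_model(prompt: str) -> str:
        # Stand-in for a real call to a cheaper summarizer model;
        # here it just truncates, which is the point: it's lossy.
        return prompt[:500] + " ...[summarized]"

    def browse_subagent(url: str, question: str) -> str:
        # The subagent sees the full page, CANARY- strings and all...
        with urllib.request.urlopen(url) as resp:
            raw = resp.read().decode("utf-8", errors="replace")
        # ...but it only ever hands back a compressed summary.
        return call_cheap_model(
            f"Summarize the parts of this page relevant to: {question}\n\n{raw}"
        )

    # The parent agent answers from the summary alone; it never sees
    # `raw`, so verbatim canary strings rarely survive the round trip.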

Any thoughts on how you'd change the test structure with this in mind?

[−] dacharyc 39d ago
Hey there - I'm the test author, and you've hit on one of the main points. Summarization/relevance-based content return is indeed an issue on some of the agent platforms (although I've found others actually do better here than I expected!) - and that's part of the point I'm trying to drive home to folks who aren't as familiar with these systems.

I chose to structure it this way intentionally because this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is one of the main points of the exercise, to me.

[−] refulgentis 39d ago
This isn't best practice, and it's certainly not industry best practice. It would fail some pretty basic tests, like these, resulting in poor UX and poor reviews. There are plenty of half-assed things labelled 'agent' that do work this way, of course.

I think it generally describes how we can picture Claude and OpenAI working, but it glosses over implementation details that are hard to see from their blog posts, e.g. a web search tool vs. a web get tool.
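
That distinction matters a lot for a test like this. Plausible tool shapes (my guesses, not either vendor's actual schemas):

    # Hypothetical tool definitions, not Anthropic's or OpenAI's real ones.
    WEB_SEARCH_TOOL = {
        "name": "web_search",
        "description": "Search the web; returns titles and short snippets only.",
        "input_schema": {"query": {"type": "string"}},
    }

    WEB_GET_TOOL = {
        "name": "web_get",
        "description": "Fetch one URL; returns (possibly truncated) page content.",
        "input_schema": {"url": {"type": "string"}},
    }

    # An agent that only ever searches never sees deep-page canaries;
    # one that fetches might, depending on its truncation limit.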

(source: maintained a multi-provider x llama.cpp LLM client for 2.5+ years and counting)

[−] lucb1e 39d ago
I don't understand. It says for the first task:

> URL: What parameters does the Create Stream endpoint accept?

The answer that I would give is name, description, retention_days, and tags. What the answer sheet <https://agentreadingtest.com/answers.json> has is: CANARY-TRUNC-10K-fox ("Early in the page. All agents should find this."), CANARY-TRUNC-40K-river, CANARY-TRUNC-75K-summit, etc. These words appear on the page, but why would the LLM's output include them? The first one appears before the API endpoint subpath specification, and the second is embedded in the middle of a word in the description. They do not answer the test question of what parameters are supported.
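
My best guess is that the grading just greps the transcript for the canaries, something like (pure guesswork on my part):

    import json
    import urllib.request

    def grade(agent_output: str) -> dict:
        # Guessing answers.json is a flat canary -> note mapping;
        # the real file may well be structured per task.
        url = "https://agentreadingtest.com/answers.json"
        with urllib.request.urlopen(url) as resp:
            answers = json.load(resp)
        # Which canary strings made it into the agent's output verbatim?
        return {canary: (canary in agent_output) for canary in answers}

But if so, why would a correct answer to the actual question ever contain them?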

A later test checks whether it can deal with broken pages (an "unclosed ``` fence", specifically). If it can deal with seemingly erroneous strings on the page, wouldn't it precisely *not* echo those tokens?

How is this test supposed to work?

[−] hettygreen 38d ago
At this point I wonder if AIs get updated just to recognize and deal with specific tests like this.

Compared to solving the root issues, it's gotta be easier to add a few extra lines of code that intervene when someone asks about walking vs. driving to the carwash, or how many "r"s are in the word strawberry.
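
Something like this, I mean (a caricature, obviously, and pure speculation):

    def canned_carwash_answer() -> str:
        return "Driving, since presumably you want the car washed."

    def black_box_model(question: str) -> str:
        return "..."  # stand-in for the actual opaque model

    def answer(question: str) -> str:
        # Special-case known embarrassments before the real model
        # ever sees the question.
        q = question.lower()
        if "strawberry" in q and "how many" in q:
            return 'There are 3 "r"s in "strawberry".'
        if "carwash" in q and ("walk" in q or "driv" in q):
            return canned_carwash_answer()
        return black_box_model(question)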

I wonder if AI really is the opaque, interesting tech it's presented as, or if it's also thousands of extra if statements catching known/published/problematic/embarrassing inconsistencies.

Anyone here work for any of the big AI companies? Is it just one big black-box, or a black-box with thousands of intervention points and guard rails?

[−] throwatdem12311 39d ago
What a great target for someone to hack and add some secret prompt injections into.