1. playwright-cli for exploration and ad-hoc scraping, in order to determine what works.
2. playwright code generation based on 1, which captures a repeatable workflow
3. agent skills - these can be playwright based, but in some cases if I can just rely on built-in tools like Web Search and Web Fetch, I will.
Playwright is one of the unsung heroes of agentic workflows. I heavily rely on it. In addition to the obvious DOM inspection capabilities, the fact that the console and network can be inspected is a game changer for debugging. Watching an agent get rapid feedback or do live TDD is one of the most satisfying things ever.
Browser automation, and being able to record the graphics buffer as video during a run, open up many possibilities.
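A minimal sketch of what that feedback loop looks like with the Playwright library (the URL is a placeholder): the agent sees console output and failing requests the moment they happen.

```typescript
import { chromium } from 'playwright';

// Minimal sketch: stream console messages and failed network calls back
// while the page is being driven. The URL is a placeholder.
const browser = await chromium.launch();
const page = await browser.newPage();

page.on('console', msg => console.log(`[console.${msg.type()}]`, msg.text()));
page.on('response', res => {
  if (!res.ok()) console.log(`[network] ${res.status()} ${res.request().method()} ${res.url()}`);
});

await page.goto('https://example.com');
// ...drive the page; the listeners above provide the rapid feedback loop.
await browser.close();
```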
"Claude, reverse engineer the APIs of this website and build a client. Use Dev Tools."
I have succeeded with this on 8 out of 8 websites.
Sites like Booking.com and Hotels.com try to identify real humans with their AWS anti-bot solution and Cloudflare, and Playwright is detected and often blocked. But you can just solve the captcha yourself and log in, and from then on the session is indistinguishable from a human.
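One way to do that with Playwright is to point it at a profile or browser where you already logged in by hand; a rough sketch (the path and port are assumptions):

```typescript
import { chromium } from 'playwright';

// Option A: persistent profile - solve the captcha / log in once by hand,
// and later runs reuse the same cookies and storage. Path is an assumption.
const ctx = await chromium.launchPersistentContext('/tmp/my-profile', { headless: false });

// Option B: attach to a browser you launched yourself with
// `chrome --remote-debugging-port=9222`, keeping the human session intact.
const cdpBrowser = await chromium.connectOverCDP('http://localhost:9222');
const page = cdpBrowser.contexts()[0].pages()[0];
```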
Agreed! One thing that we felt was missing from the existing MCP tools was user recording. For old and shitty healthcare websites it's easier to just show the workflow than to explain it.
The playwright codegen tool exists, but the script it generates is super simple and it can't handle loops or data extraction.
So for libretto we often use a mix of instructions + a recording of my actions for the agent. That makes the process faster than just relying on a description and waiting for the agent to figure out the whole flow.
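For contrast, here's roughly the kind of loop-plus-extraction that codegen won't emit but an agent can write afterwards (the URL and roles are invented):

```typescript
import { chromium } from 'playwright';

// Invented example: paginate a results table and pull all rows out,
// which `playwright codegen` can't produce on its own.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/results');

const rows: string[][] = [];
while (true) {
  for (const row of await page.getByRole('row').all()) {
    rows.push(await row.getByRole('cell').allTextContents());
  }
  const next = page.getByRole('link', { name: 'Next' });
  if (!(await next.isVisible())) break;
  await next.click();
}
console.log(rows);
await browser.close();
```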
Same, Playwright is phenomenal. You can also have the agent browse with MCP to figure out the workflow, then bang out a repeatable Playwright script for it. It's a great combo.
I literally _just_ put up an announcement on our internal Slack of a tool I had spent a few weeks trying to get right. Strange to post the announcement and, literally the same day, see a better, publicly available toolkit that enables that very workflow!
I'm also using Playwright to automate a platform that has a maze of iframes, referer links, etc. Hopefully I can replace the internals with a script I get from this project.
Did you consider MCP sampling to avoid requiring your own LLM access? (for the clients that support it of course, but I think it's important and will become standard anyway)
Not totally sure I understand, but if you're talking about the snapshot command (which requires an API key): we initially had it spin up a tmux session to analyze the snapshot instead of using the API. But we switched it to use the API for two reasons:
1. We noticed that the API was a couple of seconds faster than spinning up the coding agent.
2. With a separate agent you can't guarantee its behavior, and we wanted to enforce that only a single LLM call was made to read the snapshot and analyze the selector. You can guarantee this with an API call but not with a local coding agent.
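Not libretto's actual code, but the shape of "exactly one LLM call" looks roughly like this, assuming the Anthropic TypeScript SDK (model id and prompt are placeholders):

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Hypothetical sketch (not libretto's code): a single, bounded API call that
// gets a screenshot + condensed DOM and must return a selector. No agent loop.
const anthropic = new Anthropic();

async function analyzeSnapshot(screenshotBase64: string, dom: string, goal: string) {
  const msg = await anthropic.messages.create({
    model: 'claude-sonnet-4-5', // placeholder model id
    max_tokens: 512,
    messages: [{
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 } },
        { type: 'text', text: `Return the most robust selector for: ${goal}\n\nDOM:\n${dom}` },
      ],
    }],
  });
  return msg.content[0].type === 'text' ? msg.content[0].text : '';
}
```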
Sorry, yeah, it was a bit vague. I was thinking about creating a Libretto MCP since it's a/the standard way to share AI tooling nowadays, and that would make it usable in more contexts.
In that case, the protocol has a feature called "sampling" that allows the MCP server (Libretto) to send completion requests to the MCP client (the main agent/harness the user interacts with). That means Libretto would not need its own LLM API keys to work; it would piggyback on the LLMs configured in the main harness (sampling supports "picking" the style of model you prefer too - smart vs fast etc.).
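For what it's worth, a rough sketch of the server side, assuming the MCP TypeScript SDK's Server exposes a createMessage helper for sampling/createMessage (prompts and names are invented):

```typescript
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

// Rough sketch, assuming the SDK's createMessage helper: instead of calling
// an LLM provider with its own key, the server asks the *client* (Claude
// Code, Codex, etc.) to run the completion on its behalf.
const server = new Server({ name: 'libretto', version: '0.0.1' }, { capabilities: {} });

async function pickSelectorViaSampling(dom: string, goal: string) {
  const result = await server.createMessage({
    messages: [{
      role: 'user',
      content: { type: 'text', text: `Pick a robust selector for: ${goal}\n\n${dom}` },
    }],
    // "style of model you prefer": lean toward fast over smart for this call
    modelPreferences: { speedPriority: 0.8, intelligencePriority: 0.3 },
    maxTokens: 256,
  });
  return result.content.type === 'text' ? result.content.text : '';
}
```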
Hey Muchael, we had similar thoughts at Retriever AI about moving from runtime agentic inference to writing scripts that combine webpage interactions and reverse-engineered site APIs.
Compared to your approach, we do this entirely within a browser extension, meeting users where they already do their existing work.
Within the extension you just record yourself doing a task; we reverse engineer the APIs and write a script. The script is then executed from within the webpage so that auth headers/tokens get added automatically.
You can then just prompt to supply parameters and reuse the script at zero token cost.
The use cases we were targeting are things like Instagram DMs or LinkedIn connection requests, but it should also work for your healthcare use case!
Deeper dive: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
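Roughly, the in-page execution trick is this (endpoint and payload invented for illustration): because the call runs under the site's own origin, the logged-in session's cookies and auth come along for free.

```typescript
// Invented example of "execute from within the webpage" (e.g. an extension
// content script): the fetch runs under the page's origin, so the session's
// cookies and auth are attached automatically. Endpoint/payload are made up.
async function sendConnectionRequest(profileId: string, note: string) {
  const res = await fetch('/api/v1/connections', {
    method: 'POST',
    credentials: 'include', // reuse the user's logged-in session
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ profileId, note }),
  });
  if (!res.ok) throw new Error(`request failed: ${res.status}`);
  return res.json();
}
```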
It's a good callout. We have a BAA + ZDR with Anthropic and OpenAI, and if you want to use libretto for healthcare use cases having a BAA is essential. We were using Codex in the demo, and we've seen that both Claude and Codex work pretty well.
The 'deterministic' framing is the part I'd want to understand better. When a model generates a Playwright script, selector choice is often the fragile element: LLMs frequently generate CSS class selectors or XPath rather than Playwright's recommended getByRole/getByLabel/getByText approach, even when accessible-name selectors would work. The generated code can 'work' on first run but break on the first layout tweak.
@muchael: does Libretto constrain the model to prefer accessible-name-based selectors during generation, or does the determinism come primarily from the execution-verification loop (run → fail → self-correct)? The two approaches have meaningfully different failure modes—the first makes the initial code robust, the second only catches brittleness at runtime.
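To make the contrast concrete (button name invented), the two shapes in question:

```typescript
// Brittle: what generated scripts often contain - tied to framework class
// names that change on the next build or redesign.
await page.click('.MuiButton-root.css-1x2y3z');

// Robust: Playwright's recommended accessible-name selector, which keeps
// working across layout and styling churn.
await page.getByRole('button', { name: 'Submit claim' }).click();
```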
This is a great flag and something we want to spend more time experimenting with as we continue to build out the repo.
Right now we kind of have a mixture of the two approaches, but there's a lot of room for improvement.
- When libretto performs the code generation it initially inspects the page and sends the network calls/playwright actions using the snapshot and exec tools to test them individually. After it's tested all of the individual selectors and thinks it's finished, it creates a script and then runs the script from scratch. Oftentimes the generated script will fail, and that will trigger libretto to identify the failure, update the code, and repeat this process until the script works. That iteration process helps make the scripts much more reliable.
- The way our snapshot command works is that we send a screenshot + DOM (depending on size, it may be condensed) to a separate LLM and ask it to figure out the relevant selectors. We do this to not pollute the main agent's context with the DOM + lots of screenshots. As part of that analyzer's prompt we tell it to prefer selectors using: data-testid, data-test, aria-label, name, id, role. This just lives in the analyzer prompt and is not deterministic though. It'd be interesting to see if we can improve script quality by adding a hard constraint on the selectors or with different prompting.
I'm also curious if you have any guidance for prompt improvements we can give the snapshot analyzer LLM to help it pick more robust selectors right off the bat.
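A hypothetical sketch of that outer generate/run/repair loop (not libretto's actual code), just to pin down the shape being described:

```typescript
// Hypothetical sketch of the iteration described above: generate a script,
// run it from a clean state, feed the failure back, and repeat until it passes.
async function hardenScript(
  generate: (feedback?: string) => Promise<string>,
  run: (script: string) => Promise<{ ok: boolean; error?: string }>,
  maxAttempts = 5,
): Promise<string> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const script = await generate(feedback);
    const result = await run(script);
    if (result.ok) return script;   // script survived a from-scratch run
    feedback = result.error;        // the failure drives the next revision
  }
  throw new Error('script never stabilized');
}
```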
Looks awesome, but I wonder if its functionality could be exposed to existing CLIs such as Claude Code instead of having to run it through its own CLI, mainly because I don't want to spend on credits when I've already got a CC subscription.
EDIT: To clarify, I realize there are skill files that can be used with Claude directly, but the snapshot analysis model seems to require a key. Any way to route that effort through Claude Code itself, such as for example exporting the raw snapshot to a file and instructing Claude Code to use a built-in subagent instead?
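The export step could be as simple as this sketch (file names arbitrary; ariaSnapshot assumes a recent Playwright version), after which a local subagent could read the files instead of a keyed API:

```typescript
import { writeFile } from 'node:fs/promises';
import { chromium } from 'playwright';

// Sketch of "export the raw snapshot to files" so a local coding agent can
// analyze it instead of a separate keyed model. File names are arbitrary.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

await page.screenshot({ path: 'snapshot.png', fullPage: true });
await writeFile('snapshot.html', await page.content());
await writeFile('snapshot-aria.yaml', await page.locator('body').ariaSnapshot()); // recent Playwright

await browser.close();
```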
I built something very similar for my company internally. The idea was that the maintenance of the code is on the agent and the code is purely an optimization. If it breaks, the agent runs it iteratively and fixes the code for next time. Happy to replace my tool with this and see how it does!
This is what I found doing Playwright-based extraction against anti-bot defenses. Runtime agents were brittle. It felt like trying to debug/audit a black box.
We used to deal with RPA stuff at work. Always fragile. Good to see evolution in the space.
Very interesting idea. Old-school solutions, but with new methods.
But maybe we can't make everything deterministic for the complex cases, the scenarios that opened up once LLMs arrived on the scene. Maybe we need a mix of both.
Cool. Thank you for sharing. While AI tools are extremely powerful, packages like this help create some good standards and stepping stones for connectivity that the models haven’t gotten around to yet. Thanks again.
Thanks for this! We have clear answers for things that are 100% and 0% automated, but it’s always that 80%-99% automated slice where the frontier is, great idea.
Edit: nevermind. I see from the website it is MIT. Probably should add a COPYING.md or LICENSE.md to the repository itself.