1. playwright-cli for exploration and ad-hoc scraping, in order to determine what works.
2. playwright code generation based on 1, which captures a repeatable workflow
3. agent skills - these can be playwright based, but in some cases if I can just rely on built-in tools like Web Search and Web Fetch, I will.
Playwright is one of the unsung heroes of agentic workflows. I heavily rely on it. In addition to the obvious DOM inspection capabilities, the fact that the console and network can be inspected is a game changer for debugging. Watching an agent get rapid feedback or do live TDD is one of the most satisfying things ever.
Browser automation, and being able to record the graphics buffer as video during a run, open up many possibilities.
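A minimal sketch of what that feedback loop looks like with the Playwright library (the URL is a placeholder): the agent sees console output and failing requests the moment they happen.

```typescript
import { chromium } from 'playwright';

// Minimal sketch: stream console messages and failed network calls back
// while the page is being driven. The URL is a placeholder.
const browser = await chromium.launch();
const page = await browser.newPage();

page.on('console', msg => console.log(`[console.${msg.type()}]`, msg.text()));
page.on('response', res => {
  if (!res.ok()) console.log(`[network] ${res.status()} ${res.request().method()} ${res.url()}`);
});

await page.goto('https://example.com');
// ...drive the page; the listeners above provide the rapid feedback loop.
await browser.close();
```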
"Claude, reverse engineer the APIs of this website and build a client. Use Dev Tools."
I have succeeded with this on 8 out of 8 websites.
Sites like Booking.com and Hotels.com try to identify real humans with their AWS anti-bot solution and Cloudflare, and Playwright is detected and often blocked. But you can just solve the captcha yourself and log in, and from then on the session is indistinguishable from a human.
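One way to do that with Playwright is to point it at a profile or browser where you already logged in by hand; a rough sketch (the path and port are assumptions):

```typescript
import { chromium } from 'playwright';

// Option A: persistent profile - solve the captcha / log in once by hand,
// and later runs reuse the same cookies and storage. Path is an assumption.
const ctx = await chromium.launchPersistentContext('/tmp/my-profile', { headless: false });

// Option B: attach to a browser you launched yourself with
// `chrome --remote-debugging-port=9222`, keeping the human session intact.
const cdpBrowser = await chromium.connectOverCDP('http://localhost:9222');
const page = cdpBrowser.contexts()[0].pages()[0];
```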
Agreed! One thing that we felt was missing from the existing MCP tools was user recording. For old and shitty healthcare websites it's easier to just show the workflow than to explain it.
The playwright codegen tool exists, but the script it generates is super simple and it can't handle loops or data extraction.
So for libretto we often use a mix of instructions + a recording of my actions for the agent. That makes the process faster than just relying on a description and waiting for the agent to figure out the whole flow.
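For contrast, here's roughly the kind of loop-plus-extraction that codegen won't emit but an agent can write afterwards (the URL and roles are invented):

```typescript
import { chromium } from 'playwright';

// Invented example: paginate a results table and pull all rows out,
// which `playwright codegen` can't produce on its own.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/results');

const rows: string[][] = [];
while (true) {
  for (const row of await page.getByRole('row').all()) {
    rows.push(await row.getByRole('cell').allTextContents());
  }
  const next = page.getByRole('link', { name: 'Next' });
  if (!(await next.isVisible())) break;
  await next.click();
}
console.log(rows);
await browser.close();
```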
Same, Playwright is phenomenal. You can also have the agent browse with MCP to figure out the workflow, then bang out a repeatable Playwright script for it. It's a great combo.
I literally _just_ put up an announcement on our internal Slack of a tool I had spent a few weeks trying to get right. Strange to post the announcement and, literally the same day, see a better, publicly available toolkit that enables that very workflow!
I'm also using Playwright to automate a platform that has a maze of iframes, referer links, etc. Hopefully I can replace the internals with a script I get from this project.
Did you consider MCP sampling to avoid requiring your own LLM access? (for the clients that support it of course, but I think it's important and will become standard anyway)
Not totally sure I understand, but if you're talking about the snapshot command (which requires an API key): we initially had it spin up a tmux session to analyze the snapshot instead of using the API. But we switched it to use the API for two reasons:
1. We noticed that the API was a couple of seconds faster than spinning up the coding agent.
2. With a separate agent you can't guarantee its behavior, and we wanted to enforce that only a single LLM call was made to read the snapshot and analyze the selector. You can guarantee this with an API call but not with a local coding agent.
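Not libretto's actual code, but the shape of "exactly one LLM call" looks roughly like this, assuming the Anthropic TypeScript SDK (model id and prompt are placeholders):

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Hypothetical sketch (not libretto's code): a single, bounded API call that
// gets a screenshot + condensed DOM and must return a selector. No agent loop.
const anthropic = new Anthropic();

async function analyzeSnapshot(screenshotBase64: string, dom: string, goal: string) {
  const msg = await anthropic.messages.create({
    model: 'claude-sonnet-4-5', // placeholder model id
    max_tokens: 512,
    messages: [{
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 } },
        { type: 'text', text: `Return the most robust selector for: ${goal}\n\nDOM:\n${dom}` },
      ],
    }],
  });
  return msg.content[0].type === 'text' ? msg.content[0].text : '';
}
```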
Sorry, yeah, it was a bit vague. I was thinking about creating a Libretto MCP since it's a/the standard way to share AI tooling nowadays, and that would make it usable in more contexts.
In that case, the protocol has a feature called "sampling" that allows the MCP server (Libretto) to send completion requests to the MCP client (the main agent/harness the user interacts with). That means Libretto would not need its own LLM API keys to work; it would piggyback on the LLMs configured in the main harness (sampling supports "picking" the style of model you prefer too - smart vs fast etc.).
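For what it's worth, a rough sketch of the server side, assuming the MCP TypeScript SDK's Server exposes a createMessage helper for sampling/createMessage (prompts and names are invented):

```typescript
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

// Rough sketch, assuming the SDK's createMessage helper: instead of calling
// an LLM provider with its own key, the server asks the *client* (Claude
// Code, Codex, etc.) to run the completion on its behalf.
const server = new Server({ name: 'libretto', version: '0.0.1' }, { capabilities: {} });

async function pickSelectorViaSampling(dom: string, goal: string) {
  const result = await server.createMessage({
    messages: [{
      role: 'user',
      content: { type: 'text', text: `Pick a robust selector for: ${goal}\n\n${dom}` },
    }],
    // "style of model you prefer": lean toward fast over smart for this call
    modelPreferences: { speedPriority: 0.8, intelligencePriority: 0.3 },
    maxTokens: 256,
  });
  return result.content.type === 'text' ? result.content.text : '';
}
```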
Hey Muchael, we had similar thoughts at Retriever AI about moving from runtime agentic inference to writing scripts that combine webpage interactions and reverse-engineered site APIs.
Compared to your approach, we do this entirely within a browser extension, meeting users where they already do their existing work.
Within the extension you just record yourself doing a task; we reverse engineer the APIs and write a script. The script is then executed from within the webpage so that auth headers/tokens get added automatically.
You can then just prompt to supply parameters and reuse the script at zero token cost.
The use cases we were targeting are things like Instagram DMs or LinkedIn connection requests, but it should also work for your healthcare use case!
Deeper dive: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
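Roughly, the in-page execution trick is this (endpoint and payload invented for illustration): because the call runs under the site's own origin, the logged-in session's cookies and auth come along for free.

```typescript
// Invented example of "execute from within the webpage" (e.g. an extension
// content script): the fetch runs under the page's origin, so the session's
// cookies and auth are attached automatically. Endpoint/payload are made up.
async function sendConnectionRequest(profileId: string, note: string) {
  const res = await fetch('/api/v1/connections', {
    method: 'POST',
    credentials: 'include', // reuse the user's logged-in session
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ profileId, note }),
  });
  if (!res.ok) throw new Error(`request failed: ${res.status}`);
  return res.json();
}
```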
It's a good callout. We have a BAA + ZDR with Anthropic and OpenAI, and if you want to use libretto for healthcare use cases having a BAA is essential. We were using Codex in the demo, and we've seen that both Claude and Codex work pretty well.
The 'deterministic' framing is the part I'd want to understand better. When a model generates a Playwright script, selector choice is often the fragile element: LLMs frequently generate CSS class selectors or XPath rather than Playwright's recommended getByRole/getByLabel/getByText approach, even when accessible-name selectors would work. The generated code can 'work' on first run but break on the first layout tweak.
@muchael: does Libretto constrain the model to prefer accessible-name-based selectors during generation, or does the determinism come primarily from the execution-verification loop (run → fail → self-correct)? The two approaches have meaningfully different failure modes—the first makes the initial code robust, the second only catches brittleness at runtime.
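To make the contrast concrete (button name invented), the two shapes in question:

```typescript
// Brittle: what generated scripts often contain - tied to framework class
// names that change on the next build or redesign.
await page.click('.MuiButton-root.css-1x2y3z');

// Robust: Playwright's recommended accessible-name selector, which keeps
// working across layout and styling churn.
await page.getByRole('button', { name: 'Submit claim' }).click();
```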
This is a great flag and something we want to spend more time experimenting with as we continue to build out the repo.
Right now we kind of have a mixture of the two approaches, but there's a lot of room for improvement.
- When libretto performs the code generation it initially inspects the page and sends the network calls/playwright actions using the snapshot and exec tools to test them individually. After it's tested all of the individual selectors and thinks it's finished, it creates a script and then runs the script from scratch. Oftentimes the generated script will fail, and that will trigger libretto to identify the failure, update the code, and repeat this process until the script works. That iteration process helps make the scripts much more reliable.
- The way our snapshot command works is that we send a screenshot + DOM (depending on size, it may be condensed) to a separate LLM and ask it to figure out the relevant selectors. We do this to not pollute the main agent's context with the DOM + lots of screenshots. As part of that analyzer's prompt we tell it to prefer selectors using: data-testid, data-test, aria-label, name, id, role. This just lives in the analyzer prompt and is not deterministic though. It'd be interesting to see if we can improve script quality by adding a hard constraint on the selectors or with different prompting.
I'm also curious if you have any guidance for prompt improvements we can give the snapshot analyzer LLM to help it pick more robust selectors right off the bat.
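A hypothetical sketch of that outer generate/run/repair loop (not libretto's actual code), just to pin down the shape being described:

```typescript
// Hypothetical sketch of the iteration described above: generate a script,
// run it from a clean state, feed the failure back, and repeat until it passes.
async function hardenScript(
  generate: (feedback?: string) => Promise<string>,
  run: (script: string) => Promise<{ ok: boolean; error?: string }>,
  maxAttempts = 5,
): Promise<string> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const script = await generate(feedback);
    const result = await run(script);
    if (result.ok) return script;   // script survived a from-scratch run
    feedback = result.error;        // the failure drives the next revision
  }
  throw new Error('script never stabilized');
}
```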
Looks awesome, but I wonder if its functionality could be exposed to existing CLIs such as Claude Code instead of having to run it through its own CLI, mainly because I don't want to spend on credits when I've already got a CC subscription.
EDIT: To clarify, I realize there are skill files that can be used with Claude directly, but the snapshot analysis model seems to require a key. Any way to route that effort through Claude Code itself, such as for example exporting the raw snapshot to a file and instructing Claude Code to use a built-in subagent instead?
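The export step could be as simple as this sketch (file names arbitrary; ariaSnapshot assumes a recent Playwright version), after which a local subagent could read the files instead of a keyed API:

```typescript
import { writeFile } from 'node:fs/promises';
import { chromium } from 'playwright';

// Sketch of "export the raw snapshot to files" so a local coding agent can
// analyze it instead of a separate keyed model. File names are arbitrary.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

await page.screenshot({ path: 'snapshot.png', fullPage: true });
await writeFile('snapshot.html', await page.content());
await writeFile('snapshot-aria.yaml', await page.locator('body').ariaSnapshot()); // recent Playwright

await browser.close();
```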
I built something very similar for my company internally. The idea was that the maintenance of the code is on the agent and the code is purely an optimization. If it breaks, the agent runs it iteratively and fixes the code for next time. Happy to replace my tool with this and see how it does!
This is what I found doing Playwright-based extraction against anti-bot defenses. Runtime agents were brittle. It felt like trying to debug/audit a black box.
We used to deal with RPA stuff at work. Always fragile. Good to see evolution in the space.
Very interesting idea. Old-school solutions, but with new methods.
But maybe we can't make everything deterministic for the complex cases, the scenarios that opened up once LLMs arrived on the scene. Maybe we need a mix of both.
Cool. Thank you for sharing. While AI tools are extremely powerful, packages like this help create some good standards and stepping stones for connectivity that the models haven’t gotten around to yet. Thanks again.
Thanks for this! We have clear answers for things that are 100% and 0% automated, but it’s always that 80%-99% automated slice where the frontier is, great idea.
Edit: nevermind. I see from the website it is MIT. Probably should add a COPYING.md or LICENSE.md to the repository itself.