Launch HN: Canary (YC W26) – AI QA that understands your code

by Visweshyc 26 comments 58 points


[−] pastescreenshot 57d ago
The interesting question to me is not whether the system can generate a plausible PR-time test, but whether the useful ones survive after the PR is gone. If Canary catches a real regression, how often can that check be promoted into a stable long-lived regression test without turning into a flaky, environment-coupled browser script? That conversion rate feels closer to the real moat than the generation demo.
[−] Visweshyc 57d ago
Good point. To keep regression tests reliable as the app evolves, we run a reliability cascade. First, we generate and execute deterministic Playwright tests from the codebase. If execution fails, we fall back to the DOM and ARIA tree. If that still fails, we fall back to vision agents that verify what the user actually sees before flagging a drift in application behavior.
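For intuition, a minimal sketch of that cascade in Playwright terms. The selectors and verifyWithVisionAgent are hypothetical stand-ins, not our actual internals:

    // Sketch of a selector-reliability cascade, assuming Playwright.
    // All selectors and verifyWithVisionAgent are illustrative.
    import { Page } from 'playwright';

    type CheckResult = { passed: boolean; strategy: string };

    // Hypothetical vision check; in practice this would call a VLM.
    async function verifyWithVisionAgent(img: Buffer, expectation: string): Promise<boolean> {
      // ... send `img` and `expectation` to a vision model ...
      return false;
    }

    async function runCheckWithCascade(page: Page): Promise<CheckResult> {
      // 1. Deterministic selectors generated from the codebase.
      try {
        await page.locator('[data-testid="query-card"]').first().waitFor({ timeout: 5_000 });
        return { passed: true, strategy: 'generated-selectors' };
      } catch { /* fall through */ }

      // 2. Fall back to the accessibility (ARIA) tree.
      try {
        await page.getByRole('listitem', { name: /query/i }).first().waitFor({ timeout: 5_000 });
        return { passed: true, strategy: 'aria-tree' };
      } catch { /* fall through */ }

      // 3. Last resort: a vision agent judges a screenshot of what
      // the user actually sees before we flag a behavioral drift.
      const screenshot = await page.screenshot();
      const passed = await verifyWithVisionAgent(screenshot, 'query card is visible');
      return { passed, strategy: 'vision-agent' };
    }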
[−] warmcat 58d ago
Good work. But what makes this different from just another feature in Gemini Code Assist or GitHub Copilot?
[−] Visweshyc 58d ago
Thanks! To execute these tests reliably you would need custom browser fleets, ephemeral environments, data seeding, and device farms.
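As one example, the data-seeding step alone looks roughly like this. The EPHEMERAL_ENV_URL variable and /api/test-seed endpoint are invented for illustration; every real deployment wires this differently:

    // Minimal sketch of seeding deterministic test data into an
    // ephemeral environment before a run (hypothetical endpoint).
    const BASE_URL = process.env.EPHEMERAL_ENV_URL ?? 'http://localhost:3000';

    async function seed() {
      const res = await fetch(`${BASE_URL}/api/test-seed`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({
          users: [{ email: 'qa+1@example.com', role: 'admin' }],
          resetFirst: true, // start every run from a known state
        }),
      });
      if (!res.ok) throw new Error(`seed failed: ${res.status}`);
    }

    seed().catch((err) => { console.error(err); process.exit(1); });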
[−] mikestorrent 57d ago
If that's what you guys are bringing, you should put that front and center: make it clear you're providing ingredients that Claude et al. won't provide on their own without Real Actual Software behind them.
[−] Visweshyc 57d ago
Fair feedback. Will make that clearer. Appreciate it.
[−] wenldev 53d ago
Automated QA is for sure a huge problem. I'm curious about the type of tests though. Are we talking integration or end-to-end?
[−] solfox 58d ago
Not a direct competitor but another YC company I use and enjoy for PR reviews is cubic.dev. I like your focus on automated tests.
[−] Bnjoroge 58d ago
What kinds of tests does it generate, and how is this different from the dozens of code review startups out there?
[−] Visweshyc 58d ago
The system focuses on going beyond the happy path and generating edge-case tests that try to break the application. For example, a Grafana PR added visual drag feedback to query cards. The system came up with an edge case like: does drag feedback still work when there's only one card in the list, with nothing to reorder against?
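Concretely, the generated check might look something like this. The URL, selectors, and class name are illustrative, not the actual Grafana markup:

    // Sketch of the edge case described above: drag feedback with a
    // single query card. Route, selectors, and class are hypothetical.
    import { test, expect } from '@playwright/test';

    test('drag feedback renders with a single query card', async ({ page }) => {
      await page.goto('http://localhost:3000/d/demo/edit-panel'); // hypothetical route

      const cards = page.locator('[data-testid="query-card"]');
      await expect(cards).toHaveCount(1); // the edge case: nothing to reorder against

      // Start a drag and hold mid-gesture so the feedback state is visible.
      const box = await cards.first().boundingBox();
      if (!box) throw new Error('card not rendered');
      await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
      await page.mouse.down();
      await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2 + 40, { steps: 5 });

      // The visual drag feedback should still appear even with one card.
      await expect(cards.first()).toHaveClass(/dragging/); // hypothetical class
      await page.mouse.up();
    });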
[−] vivzkestrel 57d ago
- There are at least ten dozen code review startups at this point, and I see a new one from YC every week.

- What is your differentiator?

[−] Visweshyc 57d ago
We see this as different from code review. The system generates tests that catch second-order effects and executes them against the live application to surface real bugs.