What CI looks like at a 100-person team (PostHog) (mendral.com)

by shad42 30 comments 56 points

[−] sd9 60d ago
It just seems weird to me to throw all these stats together. Handling 75GB of logs and managing the compute for this many parallel workflows seem like problems on totally different scales, yet they are put in the same category.

Unfortunately I didn’t really get the point of the article after being bombarded with stats, except that the authors have an AI tool to sell.

[−] joncrane 60d ago
We get it! They have 22,477 tests with a 99.98% pass rate, ship 65 commits to main daily, and keep 98 engineers productive on a single monorepo.

I thought the repetition of these statistics was a little tired, but overall that's an impressive solution. Also totally get that the hardest part is log ingestion and indexing.

[−] Havoc 60d ago
To me that reads more like the monorepo is a central point of failure and they’re scrambling to bandaid the consequences of that decision. And the bandaids aren’t gonna scale to 1000 people.

I guess they’re missing whatever Google has to make their monorepo scale.

[−] dpark 60d ago
Problems don’t go away with fractured repos. They just change shape. Many repos maybe get you more reliable CI, but you pay for it with increased cost of integrating dependencies and increased complexity with debugging breaks in production (assuming many repos mean many services).

In my experience, multiple small repos don’t even have better CI reliability than a monorepo, because less is invested in CI when it affects fewer people. 10-person repos regularly have flaky tests that never get addressed because “we’ll deal with it later”. The tolerance for flakiness goes up when you can attribute it to a close teammate you know is heads down on something critical instead of it feeling like a random test you don’t even care about.

[−] shad42 60d ago
Mendral co-founder here. What happens at PostHog is not uncommon. While building Mendral, we talked to hundreds of teams and they are all in a similar situation. Initially they come to us to make their CI pipelines faster. But as the agent dives in, the urgent problem becomes keeping all pipelines reliable. That comes from growing a code base along with its test suite. Of course it has to change eventually: splitting the test suite, running only specific parts of the CI depending on the code that changed, etc. But the situation described in the article is widespread for a product that grows quickly.

[−] simianwords 60d ago
interesting that they have an agent that is triggered on flaky CI failures. but it seems far too specific -- you could have it open pull requests on many other triggers.

there doesn't seem to be any upside to having it only for flaky tests because the workflow is really agnostic to the context.

[−] SirensOfTitan 60d ago
I don't really think this is at all at the quality bar for posts here. This is obviously AI-slop -- why should I invest more time reading your slop than you took to write it?

Even so, at what point do we consider the LLM-ification of all of tech a hazard? I've seen Claude go and lazily fix a test by loosening invariants. AI writes your code, AI writes your tests. Where is your human judgment?

Someone is going to lose money or get hurt by this level of automation. If the humans on your team cannot keep track of the code being committed, then I would prefer not to use your product.

[−] jofzar 60d ago

> These are not the numbers of a team with a CI problem. These are the numbers of a team that moves extremely fast and takes testing seriously.

Please no AI slop, write your own bloody blog posts.

[−] IshKebab 60d ago

> Every commit to main triggers an average of 221 parallel jobs

Jesus, this is why Bazel was invented.

[−] elteto 60d ago
I think this is the first article that truly gave me “slop nausea”. So many “It’s not X. It’s Y.” Do people not realize how awful this reads? It’s not a novel either, just a few thousand words. Just fucking write it and edit it yourself.

[−] zX41ZdbW 60d ago
Two problematic statements in this article:

1. A test pass rate of 99.98% is not good - the only acceptable rate is 100%.

2. Tests should not be quarantined or disabled. Every flaky test deserves attention.
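
For a rough sense of why a 99.98% pass rate can still hurt, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that 99.98% is a per-test pass rate and that the 22,477 tests quoted in this thread fail independently of each other; the article may measure the rate differently, and real flaky tests are often correlated. The function names are made up for this example.

```python
def prob_all_pass(num_tests: int, per_test_pass_rate: float) -> float:
    """Chance that every test passes in a single full run.

    Assumes each test passes independently with the same probability,
    which is a simplification of how real test suites behave.
    """
    return per_test_pass_rate ** num_tests


def expected_failures(num_tests: int, per_test_pass_rate: float) -> float:
    """Average number of failing tests in a single full run."""
    return num_tests * (1.0 - per_test_pass_rate)


if __name__ == "__main__":
    # Figures quoted in the thread; their interpretation is an assumption.
    tests = 22_477
    rate = 0.9998
    print(f"P(fully green run): {prob_all_pass(tests, rate):.4f}")
    print(f"Expected failures per run: {expected_failures(tests, rate):.2f}")
```

Under those assumptions a fully green run is rare (on the order of one percent of runs) and a full run sees a handful of failures on average, which is why a small per-test failure rate still matters at this suite size.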