Show HN: Git bayesect – Bayesian Git bisection for non-deterministic bugs (github.com)

by hauntsaninja 43 comments 337 points

[−] rs545837 44d ago
Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.
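For the curious, the conjugacy trick in miniature (a hypothetical sketch, not the repo's code): with a Beta(α, β) prior on an unknown failure rate, the marginal likelihood of a particular sequence of k failures in n runs reduces to a ratio of Beta functions.

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b), computed via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(k, n, alpha=1.0, beta=1.0):
    # log P(one particular sequence of k failures in n runs), with the
    # unknown failure rate integrated out under a Beta(alpha, beta) prior
    return log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta)

# 3 failures in 10 runs under a uniform Beta(1, 1) prior: exactly 1/1320
prob = exp(log_marginal_likelihood(3, 10))
```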

We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.

One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free: zero test runs needed.

Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.

We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.

[−] sfink 44d ago

> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.

I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.

[−] rs545837 44d ago
Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget of 300 test runs. Sorry, I should have clarified that.
[−] hauntsaninja 44d ago
Yep, you can run bayesect to an arbitrary confidence level.

This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you a) that the confidence level is calibrated, and b) how quickly you get to that confidence level (on average, p50 and p95)

For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests.

[−] rs545837 44d ago
You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.

Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:

128 commits, 500 trials:

Budget=50, 70/30: uniform 22% → graph 33%
Budget=50, 80/20: uniform 71% → graph 77%
Budget=100, 70/30: uniform 56% → graph 65%

At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.

Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...
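For reference, the shape of that prior in miniature (hypothetical sketch; a real version would read transitive-dependent counts from the call graph rather than sampling them):

```python
import random

random.seed(0)
n_commits = 128
# stand-in for the call-graph signal: transitive-dependent counts,
# heavy-tailed, so a few commits touch highly connected code
dependents = [random.paretovariate(1.5) for _ in range(n_commits)]
total = sum(dependents)
# per-commit prior mass proportional to transitive dependents,
# replacing the uniform 1/n prior over breakpoints
prior = [w / total for w in dependents]
```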

[−] hauntsaninja 48d ago
git bisect works great for tracking down regressions, but it relies on the bug presenting deterministically. What if the bug is non-deterministic? Or worse, what if the behaviour was always non-deterministic, but something has changed, e.g. your tests went from somewhat flaky to very flaky?

In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html
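For anyone who wants the gist before reading the math post, here's a toy sketch of the update (simplified, with known failure rates; the real implementation puts Beta priors on the rates and picks which commit to test by minimizing expected entropy):

```python
def update(posterior, tested_idx, failed, p_old=0.1, p_new=0.9):
    """Bayes update of the posterior over the breakpoint commit after one
    (possibly flaky) test run at commit tested_idx."""
    new_post = []
    for b, pb in enumerate(posterior):
        # if the bug entered at commit b, then commit tested_idx fails at
        # the new rate when b <= tested_idx, else at the old baseline rate
        p_fail = p_new if b <= tested_idx else p_old
        new_post.append(pb * (p_fail if failed else 1 - p_fail))
    z = sum(new_post)
    return [p / z for p in new_post]

post = [1 / 8] * 8                        # uniform prior over 8 commits
post = update(post, tested_idx=3, failed=True)
# a failure at commit 3 shifts mass toward breakpoints b <= 3
```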

[−] ajb 44d ago
Nice! I implemented a similar thing a while back: https://github.com/Ealdwulf/BBChop

I'm going to have to check out how you got linear time with Shannon entropy; I used Rényi entropy for that, to make the algebra easier.

It's also possible to do it over the DAG rather than a linear history, although that makes the code a lot more complicated. Unfortunately there doesn't seem to be a linear-time cumulative-sum algorithm over DAGs, so it's super-linear in some cases.

[−] Myrmornis 44d ago
This is really cool! Is there an alternative way of thinking about it involving a hidden markov model, looking for a change in value of an unknown latent P(fail)? Or does your approach end up being similar to whatever the appropriate Bayesian approach to the HMM would be?
[−] belden 44d ago
My team doesn’t always have cleanly bisectable branches being merged to main -- it’s not uncommon to see “fix syntax error” types of commits.

But to merge, we need all tests to pass. (If tests flakily pass, then we get new flaky tests, yay!)

I know git-bisect doesn’t support this: but could git-bayesect have an option to only consider merge commits? Being able to track a flake change back to an individual PR would be really useful.

[−] llasram 43d ago
Neat! I work on a system with some very similar math, but a slightly different model. I really like how bayesect uses independent Beta priors on the two error directions to allow asymmetric error rates while keeping the computations nice and symmetric.

I haven't worked these all the way through, but I'm slightly skeptical or at least confused by a few details:

1. Another way to frame P(D|B=b) would be to have the old vs new side draws be beta-binomial distributed, in which case we should then have binomial coefficients for each of the draw side probabilities for the number of possible orderings of the observations. Do they end up cancelling out somewhere? [ed: Oh yes, of course -- D includes that in each case we observe exactly one of the C(n,k) orderings.]

2. I think your expected conditional entropy code is treating the imputed new observations as independent from the past observations, though even if that's the case it may not affect it much in this model. If it does though, it might be worth explicitly unit-testing the naive vs efficient calculations to ensure they match.

Anyway, thanks for sharing!

[−] Retr0id 44d ago
Super cool!

A related situation I was in recently was where I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" vs "bad" commit without repeated trials (in practice I just did repeats).

I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
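One speculative way it could look (hypothetical sketch, not something the tool does): keep the same breakpoint posterior, but swap the pass/fail likelihood for a Gaussian on the raw score, so every benchmark run updates the posterior without thresholding.

```python
import math

def gaussian_update(posterior, tested_idx, score,
                    mu_old=100.0, mu_new=120.0, sigma=10.0):
    """Same breakpoint update as Bayesian bisection, but with a Gaussian
    likelihood on the raw benchmark score instead of pass/fail.
    All parameters here are hypothetical placeholders."""
    def lik(mu):
        return math.exp(-((score - mu) ** 2) / (2 * sigma ** 2))
    new_post = [pb * (lik(mu_new) if b <= tested_idx else lik(mu_old))
                for b, pb in enumerate(posterior)]
    z = sum(new_post)
    return [p / z for p in new_post]
```

In practice you'd also have to learn the means and noise level rather than assume them, which is where it gets harder.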

[−] supermdguy 48d ago
Okay this is really fun and mathematically satisfying. Could even be useful for tough bugs that are technically deterministic, but you might not have precise reproduction steps.

Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
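If it did, the update would presumably be the standard conjugate one (toy sketch, assuming independent runs):

```python
# conjugate Beta update for k failures in n independent runs of one commit
alpha, beta = 1.0, 1.0            # Beta(1, 1) prior pseudo-counts
k, n = 4, 10                      # observed: 4 failures out of 10 runs
alpha, beta = alpha + k, beta + (n - k)
posterior_mean = alpha / (alpha + beta)   # 5/12, the estimated failure rate
```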

[−] SugarReflex 44d ago
I hope this comment is not out of place, but I am wondering what the application for all this is? How can this help us or what does it teach us or help us prove? I am asking out of genuine curiosity as I barely understand it but I believe it has something to do with probability.

edit: thanks for the responses! I was not even familiar with git bisect before this, so I've got some new things to learn.

[−] IshKebab 44d ago
Does this tool assume it takes the same amount of time to test two commits once as it does to test one commit twice? Maybe true for interpreted languages, but if you're waiting 15 minutes to compile LLVM you're probably going to want to run your 1 second flaky test more than once. Probably pretty trivial to fix this though?

Great idea anyway!

[−] shiandow 44d ago

> our entropy calculation will now have to use the posterior means

Now hang on for a bit, you can't just plug in averages.

At least that's what I initially thought, but in this particular instance it works out correctly because you're calculating an expected value of the entropy from the two possible outcomes and there the posterior mean is indeed the correct probability to use.

You do have to take the prior into account when calculating the posterior distributions for B, but that formula is in the article.
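Concretely, a two-outcome toy version of that calculation (hypothetical names, collapsing the model to a single tested commit; the predictive failure probabilities are exactly the posterior means, which is why plugging them in works):

```python
from math import log2

def H(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def expected_entropy(prior_b, mean_new, mean_old):
    """Expected entropy of P(bug present at the tested commit) after one more
    run, where mean_new/mean_old are the posterior-mean failure rates."""
    # marginal (posterior-predictive) probability that the next run fails
    p_fail = prior_b * mean_new + (1 - prior_b) * mean_old
    # Bayes' rule on each of the two possible outcomes
    b_fail = prior_b * mean_new / p_fail
    b_pass = prior_b * (1 - mean_new) / (1 - p_fail)
    return p_fail * H(b_fail) + (1 - p_fail) * H(b_pass)

# one run at a 50/50 commit with posterior-mean rates 0.8 vs 0.1
# reduces expected entropy below the current 1 bit
assert expected_entropy(0.5, 0.8, 0.1) < H(0.5)
```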

[−] davidkunz 44d ago
Useful for tests with LLM interactions.
[−] OfirMarom 44d ago
What I love about this is you applied Bayesian methods to a problem I never would have thought to use it on!
[−] convexly 44d ago
Do you expose the posterior probabilities anywhere so you can see how confident it is in the result?