Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.
We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.
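For concreteness, here's roughly what entropy-minimizing probe selection looks like in the simplest setting: known error rates, a posterior over which commit is the culprit, and a naive O(n^2) search. This is just the shape of the computation; the writeup describes a linear-time version with the error rates marginalized out.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def pick_probe(post, p_old, p_new):
        # post[b] = current P(culprit = b); testing commit c fails with
        # probability p_new if the bug is present there (b <= c), else p_old
        n = len(post)
        best_c, best_h = None, np.inf
        for c in range(n):
            p_fail_given_b = np.where(np.arange(n) <= c, p_new, p_old)
            p_fail = (post * p_fail_given_b).sum()
            post_fail = post * p_fail_given_b / p_fail
            post_pass = post * (1 - p_fail_given_b) / (1 - p_fail)
            h = p_fail * entropy(post_fail) + (1 - p_fail) * entropy(post_pass)
            if h < best_h:
                best_c, best_h = c, h
        return best_c

With a noiseless test this reduces to splitting the posterior mass as close to 50/50 as possible, which is why plain median splitting stops being a good proxy once outcomes get noisy.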
One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free: zero test runs needed.
Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
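To make that arithmetic concrete (a sketch; which direction of D_KL applies depends on the model, so take the constants as illustrative):

    import math

    def kl_bernoulli(p, q):
        # D(Bernoulli(p) || Bernoulli(q)) in bits
        return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

    n = 1024                             # commits under suspicion
    d = kl_bernoulli(0.8, 0.2)           # 80/20 flakiness: ~1.2 bits per test
    i_prior = 2.0                        # assumed worth of the structural prior, in bits

    print(math.log2(n) / d)              # ~8.3 tests with a uniform prior
    print((math.log2(n) - i_prior) / d)  # ~6.7 tests with the graph prior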
We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.
> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.
I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.
Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget (300 test runs) when I was running. Sorry, I should have clarified that.
This script in the repo https://github.com/hauntsaninja/git_bayesect/blob/main/scrip... will show you (a) that the confidence level is calibrated, and (b) how quickly you get to that confidence level (on average, p50 and p95).
For the failure rates you describe, calibration.py shows that you should see much higher accuracy at 300 tests.
You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.
Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:
128 commits, 500 trials:
Budget=50, 70/30: uniform 22% → graph 33%
Budget=50, 80/20: uniform 71% → graph 77%
Budget=100, 70/30: uniform 56% → graph 65%
At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.
Script: https://gist.github.com/rs545837/b3266ecf22e12726f0d55c56466...
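For anyone who wants the shape of it without opening the gist: a minimal sketch of the prior construction, assuming you already have a per-commit transitive-dependent count (the function below is hypothetical, not part of the gist):

    import numpy as np

    def graph_prior(dependents):
        # dependents[i]: how many entities transitively depend on code touched
        # by commit i; add-one smoothing keeps isolated commits reachable
        w = np.asarray(dependents, dtype=float) + 1.0
        return w / w.sum()

    # e.g. simulate Pareto-distributed dependent counts for 128 commits
    rng = np.random.default_rng(0)
    prior = graph_prior(rng.pareto(a=1.5, size=128))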
git bisect works great for tracking down regressions, but it relies on the bug presenting deterministically. What if the bug is non-deterministic? Or worse, what if your behaviour was always non-deterministic, but something has changed, e.g. your tests went from somewhat flaky to very flaky?
In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html
I'm going to have to check out how you got linear time with Shannon entropy, because I used Rényi entropy for that, to make the algebra easier.
It's also possible to do it over the DAG, rather than a linear history, although that makes the code a lot more complicated. Unfortunately there doesn't seem to be a linear-time cumulative-sum algorithm over DAGs, so it's superlinear in some cases.
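For the linear-history case, the cumulative-sum structure is roughly the following (a sketch with fixed, known error rates; with Beta priors the counts enter through Beta functions instead, but the same prefix sums give you every split point in O(n)):

    import numpy as np

    def log_likelihoods(passes, fails, p_old, p_new):
        # log P(D | B = b) for every candidate culprit b, O(n) total:
        # commits before b fail at rate p_old, commits at/after b at p_new
        s = np.asarray(passes, dtype=float)
        f = np.asarray(fails, dtype=float)
        s_left = np.concatenate(([0.0], np.cumsum(s)[:-1]))   # passes at commits < b
        f_left = np.concatenate(([0.0], np.cumsum(f)[:-1]))   # fails at commits < b
        s_right, f_right = s.sum() - s_left, f.sum() - f_left
        return (s_left * np.log(1 - p_old) + f_left * np.log(p_old)
                + s_right * np.log(1 - p_new) + f_right * np.log(p_new))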
This is really cool! Is there an alternative way of thinking about it involving a hidden Markov model, looking for a change in the value of an unknown latent P(fail)? Or does your approach end up being similar to whatever the appropriate Bayesian approach to the HMM would be?
My team doesn't always have cleanly bisectable branches being merged to main; it's not uncommon to see "fix syntax error" types of commits.
But to merge, we need to have all tests pass. (If tests flakily pass then we get new flaky tests, yay!)
I know git-bisect doesn't support this, but could git-bayesect have an option to only consider merge commits? Being able to track a flake change back to an individual PR would be really useful.
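Not a supported option as far as I know, but you can precompute the candidate set yourself; something like this (a sketch) gives one candidate per PR on a merge-based workflow:

    import subprocess

    def merge_candidates(good, bad):
        # first-parent merge commits in (good, bad], oldest first (one per PR)
        out = subprocess.run(
            ["git", "rev-list", "--first-parent", "--merges",
             "--reverse", f"{good}..{bad}"],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.split()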
Neat! I work on a system with some very similar math, but a slightly different model. I really like how, in bayesect, making the error rates asymmetric via independent Beta priors on the two error directions still lets the computations come out nice and symmetric.
I haven't worked these all the way through, but I'm slightly skeptical or at least confused by a few details:
1. Another way to frame P(D|B=b) would be to have the old vs new side draws be beta-binomial distributed, in which case we should then have binomial coefficients for each of the draw side probabilities for the number of possible orderings of the observations. Do they end up cancelling out somewhere? [ed: Oh yes, of course -- D includes that in each case we observe exactly one of the C(n,k) orderings.]
2. I think your expected-conditional-entropy code is treating the imputed new observations as independent of the past observations. Even if so, it may not matter much in this model, but if it does, it might be worth explicitly unit-testing the naive vs efficient calculations to ensure they match.
Anyway, thanks for sharing!
A related situation I ran into recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" vs "bad" commit without repeated trials (in practice I just did repeats).
I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
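Probably not too hard in principle: the Bernoulli likelihood is the only place the pass/fail assumption enters, so you could swap in a continuous one. A sketch with known-variance Gaussians (not something bayesect does today as far as I know; a fuller treatment would put priors on the two means, analogous to the Beta priors on the error rates):

    import numpy as np
    from scipy.stats import norm

    def log_likelihoods(scores, mu_old, mu_new, sigma):
        # log P(scores | B = b) for every candidate culprit b:
        # commits before b score ~ N(mu_old, sigma), at/after b ~ N(mu_new, sigma)
        lp = norm.logpdf(scores, mu_old, sigma)
        lq = norm.logpdf(scores, mu_new, sigma)
        prefix = np.concatenate(([0.0], np.cumsum(lp)[:-1]))  # sum of lp over i < b
        suffix = np.cumsum(lq[::-1])[::-1]                    # sum of lq over i >= b
        return prefix + suffix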
Okay, this is really fun and mathematically satisfying. It could even be useful for tough bugs that are technically deterministic, but where you might not have precise reproduction steps.
Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
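Re the Beta update with multiple trials: k failures in n runs is just a binomial observation, so conjugacy makes the update the obvious one (a sketch):

    def update_beta(alpha, beta, failures, runs):
        # Beta(alpha, beta) posterior over a commit's failure rate after
        # observing `failures` out of `runs` independent trials
        return alpha + failures, beta + (runs - failures)

    # e.g. a uniform Beta(1, 1) prior after 3 failures in 10 runs
    alpha, beta = update_beta(1.0, 1.0, failures=3, runs=10)
    print(alpha / (alpha + beta))   # posterior mean failure rate, ~0.33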
I hope this comment is not out of place, but I am wondering what the application for all this is. How can it help us, or what does it teach us or help us prove? I am asking out of genuine curiosity: I barely understand it, but I believe it has something to do with probability.
edit: thanks for the responses! I was not even familiar with git bisect before this, so I've got some new things to learn. Great idea anyway!
Does this tool assume it takes the same amount of time to test two commits once as it does to test one commit twice? Maybe true for interpreted languages, but if you're waiting 15 minutes to compile LLVM you're probably going to want to run your 1-second flaky test more than once. Probably pretty trivial to fix this, though?
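Probably, yes. One sketch of a fix: score candidate actions by expected information per unit time instead of per test run (everything below, the helper names and the candidate run counts, is hypothetical):

    def best_action(commits, expected_bits, checkout_cost, run_cost, current=None):
        # pick (commit, runs) maximizing expected entropy reduction per second;
        # re-running the currently checked-out commit skips the build cost
        def cost(c, r):
            build = 0.0 if c == current else checkout_cost(c)
            return build + r * run_cost(c)
        return max(((c, r) for c in commits for r in (1, 2, 4, 8)),
                   key=lambda a: expected_bits(*a) / cost(*a))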
> our entropy calculation will now have to use the posterior means
Now hang on for a bit, you can't just plug in averages.
At least that's what I initially thought, but in this particular instance it works out correctly: you're calculating an expected value of the entropy over the two possible outcomes, and there the posterior mean is indeed the correct probability to use.
You do have to take the prior into account when calculating the posterior distributions for B, but that formula is in the article.
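A quick Monte Carlo check of that last point: the probability of the next outcome, integrated over the Beta posterior, is exactly the posterior mean, so plugging the mean into the two-outcome expectation is exact rather than an approximation.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 3.0, 7.0                        # some Beta(a, b) posterior over P(fail)
    p = rng.beta(a, b, size=1_000_000)     # draws of the latent failure rate
    fail = rng.random(1_000_000) < p       # one imputed outcome per draw

    print(fail.mean())    # Monte Carlo P(next test fails), ~0.300
    print(a / (a + b))    # posterior mean: 0.3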