Flaky Test Bisect

What this skill does

Pins down which commit introduced a flaky test by combining git bisect run with a flake-aware test runner that re-runs each candidate N times. Output is the offending commit hash plus a minimal local repro.

Inputs

repo_dir: clean checkout of the repo.
test_id: the flaky test in tool-specific syntax (e.g., tests/foo.test.ts -t 'connects' for Vitest).
good_ref and bad_ref: a commit where the test passes reliably, and a later commit where it fails or flakes.
iterations: how many times to run the test per commit (default 25). A failure rate above 5% counts as flaky-bad.
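The 5% threshold can be checked with integer arithmetic, which avoids floating point in shell. The following is a minimal sketch; the helper name is_flaky_bad and the variable THRESHOLD_PCT are illustrative assumptions, not part of the skill's defined interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the observed failure rate is
# strictly above the threshold, i.e. the commit counts as flaky-bad.
# THRESHOLD_PCT is an illustrative name; 5 mirrors the default described above.
THRESHOLD_PCT="${THRESHOLD_PCT:-5}"

is_flaky_bad() {
  local failures="$1" iterations="$2"
  # Compare failures/iterations > THRESHOLD_PCT/100 without floating point.
  [ $((failures * 100)) -gt $((THRESHOLD_PCT * iterations)) ]
}
```

With the default of 25 iterations, a single failure is a 4% rate and stays under the threshold, while two failures (8%) cross it. If any single failure should count as bad, the threshold would need to be lowered or the iteration count reduced.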

Steps

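Check out the bad commit with git -C <repo_dir> checkout <bad_ref> and run the test <iterations> times to confirm it actually fails or flakes; a single run of a flaky test can pass by chance. Repeat at <good_ref> to confirm it passes cleanly.
Write a runner script bisect-run.sh that first runs npm install --silent || exit 125, so that git bisect run skips commits whose dependencies cannot be installed instead of marking them bad. The script then runs <test cmd> <iterations> times, counts the failures, and exits 1 if the failure rate is above the 5% threshold and 0 otherwise. Exiting on the first failure (<test cmd> || exit 1 inside the loop) is faster but treats any single failure as bad. Make the script executable.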
Start bisect: git bisect start <bad_ref> <good_ref>.
Run git bisect run ./bisect-run.sh. Bisect will narrow to the first bad commit.
Capture the resulting git bisect log and the printed first-bad-commit hash.
Run git bisect reset to end the bisect session and return to the commit that was checked out before it started.
Inspect the first-bad-commit with git show --stat <hash> and isolate the file(s) most likely related (test under question, its imports, shared fixtures).
Write a minimal repro script repro.sh that clones the repo at <hash>^ (the last good commit), applies only the relevant hunk(s) from <hash>, runs the test <iterations> times, and reports whether it flakes.
If runner-side state matters (DB, network), document the environment: record the output of env | sort and pin the Node/Python version with .nvmrc/.python-version.
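The runner from the steps above can be sketched so that it applies the failure-rate threshold instead of stopping at the first failure. This is a sketch under stated assumptions, not a definitive implementation: TEST_CMD, INSTALL_CMD, ITERATIONS, and THRESHOLD_PCT are illustrative environment variables, and run_candidate is a hypothetical function name. The exit codes follow the git bisect run convention: 0 is good, 125 is skip, and other codes from 1 to 127 are bad.

```shell
#!/usr/bin/env bash
# Sketch of a flake-aware bisect-run.sh. Assumptions (not from the original):
# TEST_CMD, INSTALL_CMD, ITERATIONS and THRESHOLD_PCT are illustrative
# environment variables; the script does nothing unless TEST_CMD is set.

run_candidate() {
  local iterations="${ITERATIONS:-25}" threshold="${THRESHOLD_PCT:-5}"
  local failures=0 i

  # A commit whose dependencies cannot be installed is untestable:
  # exit code 125 tells git bisect run to skip it rather than mark it bad.
  if [ -n "${INSTALL_CMD:-}" ]; then
    sh -c "$INSTALL_CMD" >/dev/null 2>&1 || return 125
  fi

  # Run the test the requested number of times and count failing runs.
  for i in $(seq 1 "$iterations"); do
    sh -c "$TEST_CMD" >/dev/null 2>&1 || failures=$((failures + 1))
  done

  echo "failures: $failures/$iterations" >&2
  # Bad (1) when the failure rate is strictly above the threshold, else good (0).
  if [ $((failures * 100)) -gt $((threshold * iterations)) ]; then
    return 1
  fi
  return 0
}

# Only act when a test command was supplied, for example:
#   TEST_CMD="npx vitest run tests/foo.test.ts -t 'connects'" \
#   INSTALL_CMD="npm install --silent" git bisect run ./bisect-run.sh
if [ -n "${TEST_CMD:-}" ]; then
  run_candidate
  exit $?
fi
```

Counting failures rather than exiting early costs more time per commit, but it keeps the good/bad decision consistent with the threshold and avoids passing an arbitrary test-runner exit code back to git bisect run.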

Output

A markdown file bisect-report.md containing the bad commit hash, the diffstat, the bisect log, and a fenced repro.sh. Exit 0 when bisect completes successfully, 1 if the test passed at bad_ref (no flake reproducible), 2 if good_ref already flakes.
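Assembling the report could look like the sketch below. The function write_report, its four arguments, and the section layout are assumptions made for illustration; only the four required contents (hash, diffstat, bisect log, fenced repro.sh) come from the description above.

```shell
#!/usr/bin/env bash
# Sketch of a report writer. write_report and its arguments are hypothetical
# names; the heading layout is an assumption.

write_report() {
  local hash="$1" diffstat="$2" bisect_log="$3" repro_file="$4"
  local fence
  fence=$(printf '\140\140\140')   # three backticks, built from octal escapes

  printf '# Flaky test bisect report\n\n'
  printf 'First bad commit: %s\n\n' "$hash"
  printf '## Diffstat\n\n%s\n%s\n%s\n\n' "$fence" "$diffstat" "$fence"
  printf '## Bisect log\n\n%s\n%s\n%s\n\n' "$fence" "$bisect_log" "$fence"
  printf '## repro.sh\n\n%ssh\n' "$fence"
  cat "$repro_file"
  printf '%s\n' "$fence"
}
```

A possible invocation, assuming the bisect log was saved to a file named bisect.log before git bisect reset: write_report "$hash" "$(git show --stat "$hash")" "$(cat bisect.log)" repro.sh > bisect-report.md.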

Verification

Re-run bisect-run.sh at <hash>^ and confirm zero failures across all <iterations> runs; then run it at <hash> and confirm at least one failure. If both pass cleanly, the bisect latched onto noise rather than a real regression; raise iterations and rerun. Finally, run the produced repro.sh from a fresh clone to confirm it reproduces the flake without depending on the local working tree.
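The decision logic of this check can be written down as a small helper that takes the failure counts measured at <hash>^ and at <hash>. This is a sketch: classify_verification and its labels are hypothetical, and the "parent-flaky" outcome is an added assumption, since the text above only covers the other two cases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret failure counts measured at <hash>^ and <hash>.
# The "parent-flaky" label is an assumption; the original text describes only
# the "confirmed" and "noise" outcomes.

classify_verification() {
  local parent_failures="$1" hash_failures="$2"
  if [ "$parent_failures" -gt 0 ]; then
    echo "parent-flaky"   # <hash>^ already fails: the culprit is likely earlier
  elif [ "$hash_failures" -gt 0 ]; then
    echo "confirmed"      # clean parent, failing commit: a real regression
  else
    echo "noise"          # both clean: raise iterations and bisect again
  fi
}
```

For example, classify_verification 0 3 prints confirmed, and classify_verification 0 0 prints noise.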

Edge cases

Test depends on wall-clock time: pin TZ=UTC and use faketime if available; flakes due to time should be tagged but not bisected.
Test depends on a stateful service (DB, Redis): run the bisect inside a Docker Compose harness so each iteration starts clean.
Bisect lands on a merge commit: if the merge cannot be reverted in isolation, rerun on the linearized history by starting the session with git bisect start --first-parent (available in Git 2.29 and later).
All commits in range fail: move good_ref further back; the regression may predate the assumed range.
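For the wall-clock case, the test command can be wrapped so that the timezone is always pinned and the clock is pinned when the faketime CLI (shipped with libfaketime) is installed. This is a minimal sketch; with_pinned_clock and FAKE_NOW are illustrative names, and the default timestamp is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Sketch: run a command with a pinned timezone and, when the faketime CLI
# is installed, a pinned wall clock as well.
# with_pinned_clock and FAKE_NOW are illustrative names.

with_pinned_clock() {
  local fake_now="${FAKE_NOW:-2024-01-01 00:00:00}"
  if command -v faketime >/dev/null 2>&1; then
    # faketime takes the timestamp first, then the program and its arguments.
    TZ=UTC faketime "$fake_now" "$@"
  else
    # No faketime available: pin only the timezone.
    TZ=UTC "$@"
  fi
}
```

If the flake disappears under with_pinned_clock <test cmd>, that is a hint the test is time-dependent and should be tagged rather than bisected.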

amitte/flaky-test-bisect