Wednesday, December 21, 2016

Life of Kernel Bisecter

Bisecting is extremely useful for fixing regressions in big projects like the upstream kernel. The goal here is to get the regression fixed instead of just reporting it and forgetting about it. Upstream regression reports are easily ignored because of kernel developers' limited bandwidth, the complexity of the code analysis needed to find the root cause, developers' limited access to the hardware, and so on. However, since it is a regression, it is usually possible to track down exactly which commit introduced it. That makes it much easier for developers to figure out the root cause and come up with a fix. Also, the original authors who introduced the regression usually respond quickly (within one working day) because they want to maintain a good reputation within the community. Introducing regressions with their patches and not fixing them quickly makes it harder for them to get future patches accepted by Linus and the subsystem maintainers. Linus and friends are not afraid of applying public peer pressure when that happens, and they are good at it. In the worst case, the solution is to send a revert patch to fix the regression. Usually it will be accepted, because Linus and friends absolutely hate regressions, even trivial ones.

However, bisecting a kernel regression with git is usually not an easy task. Especially for a big project like the upstream kernel, a lot can go on between the commit that introduced the regression and the moment people actually encounter it, so lots of things can go wrong during the "git bisect". Therefore, it is important to test the upstream kernel as often as possible to make the bisecter's life easier. Below are some hard lessons learned from my years of experience bisecting the kernel with git.

Always test tagged commits first if possible. Tagged commits like v4.7 and v4.7-rc2 are usually more stable and have fewer compilation errors or boot issues that you would otherwise need to deal with before bisecting further. For example, if v4.7 is bad and v4.2 is good, don't start "git bisect" just yet, as the next commit to test will be some random commit in between. Instead, test a tagged commit in the middle, like v4.5. If v4.7 is bad and v4.6 is good, keep bisecting manually across the tagged commits v4.7-rc* until you know the exact v4.7-rc release that introduced the regression before starting the "git bisect".
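
The manual tag-first search above is just a binary search over the release tags. Here is a minimal sketch of it, assuming the tags are listed in release order, the first one is known good, the last one is known bad, and the regression stays once introduced. The function name, the tag list and the fake test are made up for illustration; in real life "is_bad" means building, booting and testing that tag yourself.

```python
from typing import Callable, List, Sequence


def first_bad_tag(tags: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad tag, assuming tags[0] is good and tags[-1] is bad."""
    lo, hi = 0, len(tags) - 1          # invariant: tags[lo] is good, tags[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(tags[mid]):
            hi = mid                   # regression is at mid or earlier
        else:
            lo = mid                   # regression is after mid
    return tags[hi]


# Hypothetical scenario: pretend the regression first showed up in v4.7-rc3.
TAGS = ["v4.6", "v4.7-rc1", "v4.7-rc2", "v4.7-rc3", "v4.7-rc4",
        "v4.7-rc5", "v4.7-rc6", "v4.7-rc7", "v4.7"]
tested: List[str] = []


def fake_test(tag: str) -> bool:
    """Stand-in for building, booting and running the reproducer on `tag`."""
    tested.append(tag)
    return TAGS.index(tag) >= TAGS.index("v4.7-rc3")


culprit = first_bad_tag(TAGS, fake_test)
print(culprit, "found after testing", tested)
```

In this made-up scenario only three tagged kernels need to be built, and the real bisect can then start between v4.7-rc2 (good) and v4.7-rc3 (bad), e.g. with "git bisect start v4.7-rc3 v4.7-rc2".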

If something unexpected happens while testing one of the commits, such as compilation errors, boot issues, or early events that mask reproduction of the regression, you usually need to deal with it first by figuring out the patch that fixes it (which may mean starting another, reversed mini-bisect to find that patch) and manually carrying it in the kernel during the bisecting. If you are lucky, you might be able to get past it by using "git bisect skip", tweaking the kernel config, searching the git log, mailing lists, and bugzilla, or finding some other workaround.

If you are chasing a lockdep regression, make sure you have applied patches for any other lockdep issues that would trigger before you run the reproducer. Otherwise, you may get a misleading result, because lockdep by design only reports the first issue it hits. Always check dmesg before running the reproducer to see whether lock debugging has been disabled. For example, a kernel with KASAN enabled will likely end up with lock debugging disabled because the kernel gets tainted.
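
The dmesg check can be scripted so it is not forgotten between runs. The sketch below is only an illustration: the two marker strings are the messages as I remember them from the kernel sources, so verify them against the kernel you are testing, and the sample log is invented.

```python
from typing import List

# Messages that indicate lockdep is no longer reporting.
# Assumption: wording recalled from kernel sources; check your kernel version.
LOCKDEP_OFF_MARKERS = (
    "Disabling lock debugging due to kernel taint",
    "INFO: lockdep is turned off",
)


def lockdep_disabled_lines(dmesg_text: str) -> List[str]:
    """Return the dmesg lines that say lock debugging has been switched off."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in LOCKDEP_OFF_MARKERS)
    ]


# Invented sample log, for illustration only.
sample = """\
[    0.000000] Linux version 4.9.0 (hypothetical sample log)
[   12.345678] Disabling lock debugging due to kernel taint
[   13.000000] EXT4-fs (sda1): mounted filesystem
"""

hits = lockdep_disabled_lines(sample)
if hits:
    print("lockdep is already off, the reproducer result would be meaningless:")
    for line in hits:
        print("  " + line)
```

On a real test machine the text could come from something like subprocess.run(["dmesg"], capture_output=True, text=True).stdout, checked right before starting the reproducer.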

You will need lots of CPUs to compile lots of kernels efficiently using "make -j". Otherwise, it could take you weeks to find the culprit. You had also better have big partitions for /boot/ and /lib/ to store many kernels, so you save time by not needing to delete old kernels to make room for new ones. Usually you can skip "make clean" to speed up kernel compilation. However, if you suspect that the new kernel picked up stale objects built against old data structures, recompile the kernel after "make clean". Also use tools like ccache or distcc if possible.

The more you know about kernel internals, the more efficient the bisecting process can be. Bisecting is essentially a black-box testing process, but you can use your knowledge of kernel internals to reduce the number of commits you need to test. Once the number of commits left to test is small enough that you can manage to read their commit logs, you can read all of them while waiting for the kernel to compile. Then you can test some suspected commits directly. If your guesses are wrong, make sure you still mark those commits as either good or bad as usual, so you can resume the black-box bisecting from there.
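
To get a feeling for what narrowing the range buys you, here is a rough back-of-the-envelope calculation. It assumes an idealized linear history where every test halves the remaining commits, and the 30 minutes per build, boot and test cycle is a made-up number; real histories with merges and skipped commits will differ.

```python
def bisect_steps(commit_count: int) -> int:
    """Worst-case number of kernels to test to isolate one commit out of
    `commit_count` candidates, i.e. the integer ceiling of log2(commit_count)."""
    if commit_count < 1:
        raise ValueError("need at least one candidate commit")
    return (commit_count - 1).bit_length()


MINUTES_PER_STEP = 30  # hypothetical cost of one build + boot + test cycle

for commits in (10000, 500, 16):
    steps = bisect_steps(commits)
    hours = steps * MINUTES_PER_STEP / 60
    print(f"{commits:>6} commits -> about {steps} steps, roughly {hours:.1f} hours")
```

Every halving of the candidate range saves only one step, so the big win from kernel knowledge comes once the list is short enough to read and a good guess can replace the last few builds.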

Usually rc1 is the release most likely to introduce regressions, since most of the big merges happen at that time. Hence, give rc1 a slightly higher priority for testing during the process if necessary. LWN has good summaries of what is included in each RC release.

If the bisect points to a merge commit like,
commit 711bef65e91d2a06730bf8c64bb00ecab48815a1
Merge: acdfffb 0f5aa88
you will then run "git bisect good acdfffb" and "git bisect bad 0f5aa88" in order to find the exact non-merge commit that introduced the regression.
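
If you hit merge commits often, the two follow-up commands can be derived from the "Merge:" line of the commit header. This is only a small sketch with hypothetical helper names, and it bakes in the same assumption as the example above: the first parent is the good side and the second parent (the tip of the merged branch) is the bad side.

```python
import re
from typing import List, Tuple


def merge_parents(commit_header: str) -> Tuple[str, str]:
    """Extract the two parent IDs from a 'Merge: <p1> <p2>' header line."""
    match = re.search(r"^Merge:\s+(\S+)\s+(\S+)\s*$", commit_header, re.MULTILINE)
    if match is None:
        raise ValueError("not a two-parent merge commit header")
    return match.group(1), match.group(2)


def follow_up_commands(commit_header: str) -> List[str]:
    """Build the bisect commands, assuming parent 1 is good and parent 2 is bad."""
    first_parent, second_parent = merge_parents(commit_header)
    return [f"git bisect good {first_parent}", f"git bisect bad {second_parent}"]


header = """commit 711bef65e91d2a06730bf8c64bb00ecab48815a1
Merge: acdfffb 0f5aa88
"""

commands = follow_up_commands(header)
for command in commands:
    print(command)
```

It is still worth confirming by hand that the first parent really tests good before trusting the result.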

Once you have found the culprit commit, test it again by reverting the commit on top of the latest git head to confirm the finding. The older the commit, the less likely it is to be a straightforward revert. If there are revert conflicts, try to resolve as many of them as possible. If that is too difficult, check out the culprit commit so that it is the head and revert it there instead, which confirms the finding while avoiding side effects such as kernel config differences. Include this information in the final email (below) to the community.

Beware of kernel config differences that could themselves cause the regression you are trying to bisect. If you suspect that the git bisect process is leading to commits that do not make sense, double-check the kernel config differences between the closest good and bad commits to see whether they may be causing the problem.
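
Comparing two .config files by eye is error-prone, so a small script helps. The sketch below is my own illustration (the kernel tree also ships scripts/diffconfig for this job); the function names and the two tiny sample configs are made up, and "n" is used here as shorthand for "is not set".

```python
import re
from typing import Dict, Optional, Tuple


def parse_config(text: str) -> Dict[str, str]:
    """Map option name to value; '# CONFIG_FOO is not set' is recorded as 'n'."""
    options: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        unset = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset:
            options[unset.group(1)] = "n"
            continue
        assigned = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if assigned:
            options[assigned.group(1)] = assigned.group(2)
    return options


def diff_configs(good_text: str, bad_text: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {option: (value_in_good, value_in_bad)} for options that differ.
    None means the option does not appear in that config at all."""
    good, bad = parse_config(good_text), parse_config(bad_text)
    changes = {}
    for name in sorted(set(good) | set(bad)):
        if good.get(name) != bad.get(name):
            changes[name] = (good.get(name), bad.get(name))
    return changes


# Made-up sample configs for the closest good and bad kernels.
good_config = """\
CONFIG_PROVE_LOCKING=y
# CONFIG_KASAN is not set
CONFIG_XFS_FS=m
"""
bad_config = """\
CONFIG_PROVE_LOCKING=y
CONFIG_KASAN=y
CONFIG_XFS_FS=m
CONFIG_DEBUG_PAGEALLOC=y
"""

changes = diff_configs(good_config, bad_config)
for name, (good_value, bad_value) in changes.items():
    print(f"{name}: good={good_value} bad={bad_value}")
```

If an option that differs looks related to the symptom, rebuild the bad commit with the good commit's value before trusting the bisect result.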

Once you are ready to send your bisecting result by email, always add the original author and the people who provided a "Signed-off-by" tag to the "To" list, and put a single relevant mailing list on the "CC" list, e.g., linux-xfs@ for xfs issues, so the report is archived somewhere for future reference. Optionally, you can add the people who provided a "Reported-by" tag to the CC list as well, since they may be interested in testing the patches once available. Prepend something like "[Bisected]" to the email subject to draw more attention to your hard work. If there is no response from anyone after one working day, file the issue in the upstream bugzilla with the regression flag set, so it will likely be included in future automated regression reports and will not get lost.
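
Building the recipient lists from the culprit commit message can be scripted as well. This is a minimal sketch following the rules above; the function name, the people, the addresses and the list address are all placeholders, not real contacts.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"^\s*(Signed-off-by|Reported-by):\s*(.+?)\s*$", re.MULTILINE)


def recipients(author: str, commit_message: str, mailing_list: str) -> Tuple[List[str], List[str]]:
    """Return (to, cc): author and Signed-off-by people go to To,
    the mailing list and Reported-by people go to CC, without duplicates."""
    to, cc = [author], [mailing_list]
    for tag, person in TAG_RE.findall(commit_message):
        if person in to or person in cc:
            continue                       # already addressed, e.g. the author
        (to if tag == "Signed-off-by" else cc).append(person)
    return to, cc


# Hypothetical commit message; names and addresses are placeholders.
message = """\
xfs: some hypothetical change

Reported-by: Carol Tester <carol@example.org>
Signed-off-by: Alice Author <alice@example.org>
Signed-off-by: Bob Maintainer <bob@example.org>
"""

to_list, cc_list = recipients(
    "Alice Author <alice@example.org>", message, "some-list@example.org"
)
subject = "[Bisected] " + "some hypothetical regression summary"
print("To:", ", ".join(to_list))
print("Cc:", ", ".join(cc_list))
print("Subject:", subject)
```

The commit message itself can be taken from the output of "git show" on the culprit commit.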