Fixes what's left of https://bugs.llvm.org/show_bug.cgi?id=51781.
- Applied the same fix to the custom state machine, as suggested by @jdoerfert privately, and extended the new test to cover it. On the NVIDIA Pascal GPUs I tried, that test passed even without the custom state machine fix. Perhaps in that version, the master thread manages to be selected for execution before the other threads in its warp. However, the custom state machine fix did prove necessary for that test on an AMD GPU I tried. Another test might show that it matters for Pascal GPUs too, but I haven't looked for one.
- Moved fix to callers of generic state machine functions, as suggested by @tianshilei1992.
- Pointed out this fix is also relevant to AMD GPUs, as suggested by @JonChesterfield.
This doesn't quite work because now every thread in the last warp will execute the user code.
if (InitCB <u BlockSize) return;
and then whatever we had before.
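
To make the concern about the last warp concrete, here is a minimal host-side sketch of how threads in a block could be classified in generic mode. It assumes the main (master) thread is the first thread of the last warp, computed as `(BlockSize - 1) & ~(WarpSize - 1)`; the names `mainThreadId`, `classify`, and `Role` are hypothetical and are not part of the device runtime or of this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical roles, for illustration only.
enum class Role { Worker, Main, SurplusInMainWarp };

// Assumption: the main thread is the first thread of the last warp.
// Requires warpSize to be a power of two and blockSize to be nonzero.
inline uint32_t mainThreadId(uint32_t blockSize, uint32_t warpSize) {
  assert(blockSize > 0);
  assert(warpSize > 0 && (warpSize & (warpSize - 1)) == 0);
  return (blockSize - 1) & ~(warpSize - 1);
}

// Classify a thread by its index within the block.
inline Role classify(uint32_t tid, uint32_t blockSize, uint32_t warpSize) {
  const uint32_t mainId = mainThreadId(blockSize, warpSize);
  if (tid < mainId)
    return Role::Worker; // enters the worker state machine
  if (tid == mainId)
    return Role::Main; // the only thread that should run the user code
  // Shares a warp with the main thread; must not run the user code.
  return Role::SurplusInMainWarp;
}
```

Under these assumptions, a block of 128 threads with a warp size of 32 has thread 96 as the main thread, threads 0 to 95 as workers, and threads 97 to 127 as the surplus threads that need to return early rather than fall through to the user code.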