Simplifies control flow to allow store/load forwarding
This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards SPMD kernels where, given sufficiently aggressive
inlining, the loads from parallelLevel fold away and the nested parallel
handling is discarded when it is unused.
Transform:
    int threadId = GetThreadIdInBlock();
    if (threadId == 0) {
      parallelLevel[0] = expr;
    } else if (GetLaneId() == 0) {
      parallelLevel[GetWarpId()] = expr;
    }
    // =>
    if (GetLaneId() == 0) {
      parallelLevel[GetWarpId()] = expr;
    }
    // because
    unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1); }
    // so whenever threadId == 0, GetLaneId() is also 0, and GetWarpId()
    // (threadId / WARPSIZE) is 0 as well, so the store hits parallelLevel[0].
That replaces stores in two distinct basic blocks with a single store.
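The equivalence can be checked exhaustively on the host. The following is a minimal sketch, not the runtime's code: the WARPSIZE and BLOCK_SIZE values, the 0xFF sentinel, and the helper names storeOriginal, storeFolded and formsAgree are assumptions made for illustration, and GetLaneId/GetWarpId take the thread id as a parameter instead of reading it from the hardware.

```cpp
#include <array>
#include <cstdint>

// Assumed values for illustration; the real runtime defines its own.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned BLOCK_SIZE = 1024;
constexpr unsigned NUM_WARPS = BLOCK_SIZE / WARPSIZE;

using Levels = std::array<std::uint8_t, NUM_WARPS>;

// Host-side stand-ins: the thread id is passed in explicitly.
constexpr unsigned GetLaneId(unsigned threadId) {
  return threadId & (WARPSIZE - 1);
}
constexpr unsigned GetWarpId(unsigned threadId) { return threadId / WARPSIZE; }

// Original form: two stores in two distinct basic blocks.
inline void storeOriginal(Levels &parallelLevel, unsigned threadId,
                          std::uint8_t expr) {
  if (threadId == 0) {
    parallelLevel[0] = expr;
  } else if (GetLaneId(threadId) == 0) {
    parallelLevel[GetWarpId(threadId)] = expr;
  }
}

// Folded form: a single guarded store.
inline void storeFolded(Levels &parallelLevel, unsigned threadId,
                        std::uint8_t expr) {
  if (GetLaneId(threadId) == 0) {
    parallelLevel[GetWarpId(threadId)] = expr;
  }
}

// Returns true if, for every thread id in the block, both forms leave
// parallelLevel in the same state. expr should differ from the 0xFF
// sentinel so that a missing store is visible.
inline bool formsAgree(std::uint8_t expr) {
  for (unsigned threadId = 0; threadId < BLOCK_SIZE; ++threadId) {
    Levels a, b;
    a.fill(0xFF);
    b.fill(0xFF);
    storeOriginal(a, threadId, expr);
    storeFolded(b, threadId, expr);
    if (a != b)
      return false;
  }
  return true;
}
```

Calling formsAgree(1) walks every thread id in the assumed block and compares the resulting arrays, which exercises both the threadId == 0 case and the lane-0 case of every other warp.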
A more aggressive follow-up is possible if the threads in the warp/wave
are allowed to race to write the same value to the same address. This is
not done as part of this change.
    if (GetLaneId() == 0) {
      parallelLevel[GetWarpId()] = expr;
    }
    // =>
    parallelLevel[GetWarpId()] = expr;
    // because
    unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
    // so GetWarpId will index the same element for every thread in the warp
    // and, because expr is lane-invariant in this case, every lane stores the
    // same value to this unique address
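As a rough illustration of why the final memory state is unchanged by the follow-up, the sketch below models the guarded and unguarded stores sequentially on the host. It is only a model under stated assumptions: WARPSIZE, BLOCK_SIZE, the 0xFF sentinel, and the helper names runGuarded and runUnguarded are hypothetical, and a sequential loop says nothing about whether the concurrent same-value stores are acceptable on a given target.

```cpp
#include <array>
#include <cstdint>

// Assumed values for illustration; the real runtime defines its own.
constexpr unsigned WARPSIZE = 32;
constexpr unsigned BLOCK_SIZE = 128;
constexpr unsigned NUM_WARPS = BLOCK_SIZE / WARPSIZE;

using Levels = std::array<std::uint8_t, NUM_WARPS>;

// Host-side stand-ins: the thread id is passed in explicitly.
constexpr unsigned GetLaneId(unsigned threadId) {
  return threadId & (WARPSIZE - 1);
}
constexpr unsigned GetWarpId(unsigned threadId) { return threadId / WARPSIZE; }

// Guarded form: only lane 0 of each warp stores.
inline Levels runGuarded(std::uint8_t expr) {
  Levels parallelLevel;
  parallelLevel.fill(0xFF);
  for (unsigned t = 0; t < BLOCK_SIZE; ++t) {
    if (GetLaneId(t) == 0) {
      parallelLevel[GetWarpId(t)] = expr;
    }
  }
  return parallelLevel;
}

// Unguarded form: every lane stores the same lane-invariant value.
// Threads are visited in reverse order to show that the visiting order
// does not affect the final state.
inline Levels runUnguarded(std::uint8_t expr) {
  Levels parallelLevel;
  parallelLevel.fill(0xFF);
  for (unsigned t = BLOCK_SIZE; t-- > 0;) {
    parallelLevel[GetWarpId(t)] = expr;
  }
  return parallelLevel;
}
```

Comparing runGuarded(3) with runUnguarded(3) shows both leave one entry per warp set to the stored value; the result would differ if expr depended on the lane, which is why lane invariance is required.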