Currently, loop unrolling is conservative about loops containing convergent instructions.
It does not allow a remainder loop for such loops, which essentially disables the unroll
count requested by a pragma and results in a fully unrolled loop in many cases.
As a result, a user may specify pragma unroll 32 but instead get the loop unrolled 512 times,
which results in extremely long compilation time.
For some targets, e.g. AMDGPU, the remainder loop does not cause extra divergence and
should be allowed.
This patch introduces AllowRemainderForConvergentLoop in
TargetTransformInfo::UnrollingPreferences and allows each target to specify
whether unrolling a convergent loop with a remainder is allowed. It is false by
default, so there is no functional change for other targets.
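
To illustrate the intended effect of the flag, here is a minimal, self-contained sketch. It is not LLVM's actual unroller logic: the struct is a simplified stand-in for TargetTransformInfo::UnrollingPreferences, and setAMDGPULikePreferences, remainderAllowed and chooseUnrollCount are hypothetical helpers invented for this example. The fallback to a full unroll is also an assumption, modelling the behaviour described above.

```cpp
#include <cassert>

// Simplified stand-in for TargetTransformInfo::UnrollingPreferences.
// Only AllowRemainderForConvergentLoop comes from this patch; the rest is
// reduced to what the example needs.
struct UnrollingPreferences {
  bool AllowRemainder = true;
  // Default is false, so targets that do not opt in keep today's behaviour.
  bool AllowRemainderForConvergentLoop = false;
};

// Hypothetical target hook: a target such as AMDGPU, where the remainder
// loop does not add divergence, opts in.
void setAMDGPULikePreferences(UnrollingPreferences &UP) {
  UP.AllowRemainderForConvergentLoop = true;
}

// Hypothetical helper: may the unroller emit a remainder loop?
bool remainderAllowed(const UnrollingPreferences &UP, bool HasConvergentOp) {
  if (!HasConvergentOp)
    return UP.AllowRemainder;
  return UP.AllowRemainder && UP.AllowRemainderForConvergentLoop;
}

// Hypothetical helper: honour the pragma count when a remainder loop is
// allowed or not needed; otherwise fall back to a full unroll (assumed).
unsigned chooseUnrollCount(const UnrollingPreferences &UP, unsigned TripCount,
                           unsigned PragmaCount, bool HasConvergentOp) {
  assert(PragmaCount > 0 && "pragma unroll count must be positive");
  bool NeedsRemainder = (TripCount % PragmaCount) != 0;
  if (!NeedsRemainder || remainderAllowed(UP, HasConvergentOp))
    return PragmaCount;
  return TripCount; // full unroll
}
```

With a trip count of 500 and pragma unroll 32, a target that leaves the flag at its default ends up with a full unroll of a convergent loop in this model, while a target that opts in keeps the requested count of 32.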
This patch fixes the shmembench-ocl compilation time issue on AMDGPU.
I don't like sticking this here.
From your description, it sounds like it's a *correctness* property of the target, whether or not certain transforms which duplicate convergent operations are allowed. In that case, it's not really about unrolling at all; it could apply to other transforms which clone code. So at the very least, this should be a separate hook, with a clear explanation of exactly which transforms this allows.