This allows hoisting of common code, for instance when the denominator is loop invariant. The current change is expansion only; adding LICM to the target's pass list is going to be a separate patch. With this patch the changes to codegen are minor, since the expansion is similar to the one done on the DAG. The DAG expansion still must remain for R600.
Should we enable BypassSlowDivision or possibly merge this expansion with it?
The DAG expansion also probably needs to remain for all targets; DAGCombiner could still potentially introduce new div nodes that would need to be handled.
| lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | |
|---|---|
| 573–574 | CreateIntrinsic should work for all of these |
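For reference, a hedged sketch of the suggestion (names and the chosen intrinsic are illustrative, not the patch's actual code): `IRBuilder::CreateIntrinsic` replaces a separate `Intrinsic::getDeclaration` + `CreateCall` pair.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h" // on older LLVM the enum lives in llvm/IR/Intrinsics.h
using namespace llvm;

// Hypothetical helper, illustrative only: emit llvm.amdgcn.rcp.f32
// through the generic CreateIntrinsic helper in one call.
static Value *emitRcp(IRBuilder<> &Builder, Value *Den) {
  return Builder.CreateIntrinsic(Intrinsic::amdgcn_rcp,
                                 {Builder.getFloatTy()}, {Den});
}
```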
Bypass is a separate question, as it does runtime resolution. In fact it is a questionable optimization for a SIMT target: it is enough to have just one thread doing slow division for the whole wave to pay the overhead penalty. In any case this is really a separate optimization.
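For context, a rough sketch of what BypassSlowDivision-style runtime resolution amounts to (a hand-written illustration, not the pass's actual output), which shows why a single divergent lane defeats it on a SIMT target:

```cpp
#include <cstdint>

// Hand-written illustration of the kind of runtime check
// BypassSlowDivision inserts (not the pass's actual output): if both
// operands of a wide divide fit in a narrower type, take a cheaper
// narrow divide instead.
std::uint64_t div64(std::uint64_t a, std::uint64_t b) {
  if (((a | b) >> 32) == 0)  // both operands fit in 32 bits
    return static_cast<std::uint32_t>(a) / static_cast<std::uint32_t>(b);
  return a / b;              // slow 64-bit path
}
// On a SIMT target the branch above is divergent: if even one lane in
// a wave needs the slow path, the whole wave pays for it, so the
// bypass only helps when all lanes agree.
```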
Probably for scalar cases only. Vector tests are too unstable with the scheduler. Even scalar tests will fluctuate with the scheduler and RA. Do you think it makes sense to have a separate set of scalar cases?
I believe it was specifically introduced for PTX, so apparently it is common enough in real workloads.
I can still see this being a trade-off depending on the test, and as a result still a separate change.
Won't doing this break the case where both the div and rem are used, so the full expansion will be used twice?
opt combines them. For instance:

```cpp
a[0] = x % y;
a[1] = x / y;
```

becomes:

```
%4 = udiv i32 %1, %2
%5 = mul i32 %4, %2
%6 = sub i32 %1, %5
```

i.e. the rem is rewritten in terms of the udiv result, so the full expansion is emitted only once.
| lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | |
|---|---|
| 492–494 | While this is what we do in the DAG, this isn't really the canonical IR way to do this. Truncate, and shift + truncate should probably be used instead here. |
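Assuming the code in question is pulling the 32-bit halves out of a 64-bit value, a minimal sketch of the canonical form being suggested (an illustrative helper, not the patch's code):

```cpp
#include "llvm/IR/IRBuilder.h"
#include <utility>
using namespace llvm;

// Illustrative helper, not the patch's code: split an i64 into its
// 32-bit halves the canonical IR way, i.e. trunc for the low half and
// lshr-then-trunc for the high half, rather than a DAG-style
// bitcast-to-vector and extract.
static std::pair<Value *, Value *> splitI64(IRBuilder<> &Builder, Value *V) {
  Type *I32 = Builder.getInt32Ty();
  Value *Lo = Builder.CreateTrunc(V, I32);
  Value *Hi = Builder.CreateTrunc(Builder.CreateLShr(V, 32), I32);
  return {Lo, Hi};
}
```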