This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Early expansion of 32 bit udiv/urem
ClosedPublic

Authored by rampitec on Jun 25 2018, 11:38 PM.

Details

Summary

This allows hoisting of a common code, for instance if denominator
is loop invariant. Current change is expansion only, adding licm to
the target pass list going to be a separate patch. Given this patch
changes to codegen are minor as the expansion is similar to that on
DAG. DAG expansion still must remain for R600.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jun 25 2018, 11:38 PM

Should we enable BypassSlowDivision or possibly merge this expansion with it?

The DAG expansion also probably needs to remain for all targets. DAGCombiner could still potentially introduce new div nodes that would need to be handled

lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
573–574 ↗(On Diff #152843)

CreateIntrinsic should work for all of these

Probably should add an update_test_checks test for the full expnasions

Should we enable BypassSlowDivision or possibly merge this expansion with it?

Bypass is a separate question as it does runtime resolution. In fact it is questionable optimization for a SIMT, that is enough to have just one thread doing slow division to get the overhead penalty. In anyway this is really a separate optimization.

Probably should add an update_test_checks test for the full expnasions

Probably for scalar cases only. Vectors with scheduler are too unstable. Even scalar will fluctuate with scheduler and RA. Do you think it makes sense to have a separate set of scalar cases?

Probably should add an update_test_checks test for the full expnasions

Probably for scalar cases only. Vectors with scheduler are too unstable. Even scalar will fluctuate with scheduler and RA. Do you think it makes sense to have a separate set of scalar cases?

It's a single IR pass expansion, the scheduler isn't involved

Should we enable BypassSlowDivision or possibly merge this expansion with it?

Bypass is a separate question as it does runtime resolution. In fact it is questionable optimization for a SIMT, that is enough to have just one thread doing slow division to get the overhead penalty. In anyway this is really a separate optimization.

I believe it was specifically introduced for PTX, so apparently it is common enough in real workloads

Probably should add an update_test_checks test for the full expnasions

Probably for scalar cases only. Vectors with scheduler are too unstable. Even scalar will fluctuate with scheduler and RA. Do you think it makes sense to have a separate set of scalar cases?

It's a single IR pass expansion, the scheduler isn't involved

So you mean to do this for the IR, not the final ISA? Ok.

Should we enable BypassSlowDivision or possibly merge this expansion with it?

Bypass is a separate question as it does runtime resolution. In fact it is questionable optimization for a SIMT, that is enough to have just one thread doing slow division to get the overhead penalty. In anyway this is really a separate optimization.

I believe it was specifically introduced for PTX, so apparently it is common enough in real workloads

I still can see this is a trade-off depending on the test. And still a separate change as a result.

rampitec updated this revision to Diff 152906.Jun 26 2018, 9:50 AM
rampitec marked an inline comment as done.
  • Added IR test
  • Switched to Builder.CreateIntrinsic()

Won't doing this break the case where both the div and rem are used, so the full expansion will be used twice?

Won't doing this break the case where both the div and rem are used, so the full expansion will be used twice?

opt combines its. For instance:

a[0] = x % y;
a[1] = x / y;
%4 = udiv i32 %1, %2
%5 = mul i32 %4, %2
%6 = sub i32 %1, %5
arsenm added inline comments.Jun 26 2018, 1:51 PM
lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
492–494 ↗(On Diff #152906)

While this is what we do in the DAG, this isn't really the canonical IR way to do this. Truncate, and shift + truncate should probably be used instead here

rampitec updated this revision to Diff 152965.Jun 26 2018, 2:42 PM
rampitec marked an inline comment as done.

Changed hi/lo split method as suggested.

arsenm accepted this revision.Jun 28 2018, 2:12 AM

LGTM

This revision is now accepted and ready to land.Jun 28 2018, 2:12 AM
This revision was automatically updated to reflect the committed changes.