This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Unroll preferences improvements
ClosedPublic

Authored by rampitec on Feb 2 2017, 2:41 PM.

Details

Summary

Exit loop analysis early if suitable private access found.
Do not account for GEPs which are invariant to loop induction variable.
Do not account for Allocas which are too big to fit into register file anyway.
Add option for tuning: -amdgpu-unroll-threshold-private.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Feb 2 2017, 2:41 PM
arsenm edited edge metadata.Feb 2 2017, 2:50 PM

Needs tests in test/Transforms/LoopUnroll/AMDGPU

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35 ↗(On Diff #86883)

I don't think we need this. The regular option I think overrides the target preference if set

64–65 ↗(On Diff #86883)

Won't LICM move this out of the loop before unrolling?

93–96 ↗(On Diff #86883)

I don't think we should change this away from the hardware sizes. I think this hook is only used by the vectorizers we don't use. We should define a different constant for use for the alloca heuristic

rampitec marked 2 inline comments as done.Feb 2 2017, 4:14 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35 ↗(On Diff #86883)

Right, it does.

64–65 ↗(On Diff #86883)

In most cases it will, but consider this:

for (int i=0; i < 16; i++)
  for (int j=0; j < 100; j++)
    arr[(j + gid) % 64] = a[i];

Here arr[j] depends on the induction variable of inner loop, so it is not hoisted. We do not want to unroll outer loop because of that though.
Actually this check is not sufficient, so I have rewritten this place. I will update review shortly.

93–96 ↗(On Diff #86883)

The numbers here were just incorrect. SI to CI have 128 registers. Then it makes sense to take into consideration real register file size, which is target dependent. As a todo we need to limit it further if we have occupancy attributes.

arsenm added inline comments.Feb 2 2017, 4:31 PM
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
93–96 ↗(On Diff #86883)

No, there have always been 256 VGPRs.

rampitec marked 5 inline comments as done.Feb 2 2017, 4:34 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
93–96 ↗(On Diff #86883)

Ouch. My memory is wrong. Will fix.

rampitec updated this revision to Diff 86928.Feb 2 2017, 5:25 PM
rampitec marked an inline comment as done.
rampitec edited the summary of this revision. (Show Details)

Added test.
Reverted portion about number of registers.
Removed general unroll threshold option.
Added logic to go deep into loop nest to see if that is actually contained loop needs to be unrolled, but not outer.

arsenm accepted this revision.Feb 2 2017, 5:54 PM

LGTM except for the isSized question

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59 ↗(On Diff #86928)

Can you add a test where isSized is necessary? I thought it was illegal to alloca such a type

67 ↗(On Diff #86928)

!inst

This revision is now accepted and ready to land.Feb 2 2017, 5:54 PM
rampitec marked an inline comment as done.Feb 2 2017, 6:01 PM
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59 ↗(On Diff #86928)

It happens when you have opaque type, like image. I doubt I can easily write a test like this, but when it happens getTypeAllocSize() asserts, so it is better to keep it.

This revision was automatically updated to reflect the committed changes.
vpykhtin added inline comments.
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
64–65 ↗(On Diff #86883)

Please add this motivating example as a comment

vpykhtin added inline comments.Feb 3 2017, 5:16 AM
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69

Why is this check nessesary? Is this an early exit when GEP is dependent on more than 1 induction variable?

tstellarAMD added inline comments.Feb 3 2017, 6:39 AM
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88

Do you also want to set PartialThreshold here?

vpykhtin added inline comments.Feb 3 2017, 7:08 AM
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88

I thought partialy unrolled loops won't make it possible to SROA private arrays. What are the benefits of partial unrolling on AMDGPU btw? What comes in mind: mem ops clustering/widening, less branches? What else?

tstellarAMD added inline comments.Feb 3 2017, 8:33 AM
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88

I had a test case where bumping the PartialThreshold helped more non-partial loops be unrolled, but I looked at the case again and increasing the normal Threshold has the same affect, so I don't think this is needed.

rampitec marked 4 inline comments as done.Feb 3 2017, 10:17 AM
rampitec added inline comments.
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69

This is just a check that real dependency is in the inner loop, which really needs to be unrolled. When you iterate through blocks a a loop you will get those belonging to inner loops as well. I'm just checking we are about to unroll a right one.

88

Actually I agree, partial unroll does not help to SROA an array. There can be other motivation, but not this.