Exit loop analysis early if suitable private access found.
Do not account for GEPs which are invariant to loop induction variable.
Do not account for Allocas which are too big to fit into register file anyway.
Add option for tuning: -amdgpu-unroll-threshold-private.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Needs tests in test/Transforms/LoopUnroll/AMDGPU
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
32–35 ↗ | (On Diff #86883) | I don't think we need this. The regular option I think overrides the target preference if set |
64–65 ↗ | (On Diff #86883) | Won't LICM move this out of the loop before unrolling? |
93–96 ↗ | (On Diff #86883) | I don't think we should change this away from the hardware sizes. I think this hook is only used by the vectorizers we don't use. We should define a different constant for use for the alloca heuristic |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
32–35 ↗ | (On Diff #86883) | Right, it does. |
64–65 ↗ | (On Diff #86883) | In most cases it will, but consider this: for (int i=0; i < 16; i++) for (int j=0; j < 100; j++) arr[(j + gid) % 64] = a[i]; Here arr[j] depends on the induction variable of inner loop, so it is not hoisted. We do not want to unroll outer loop because of that though. |
93–96 ↗ | (On Diff #86883) | The numbers here were just incorrect. SI to CI have 128 registers. Then it makes sense to take into consideration real register file size, which is target dependent. As a todo we need to limit it further if we have occupancy attributes. |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
93–96 ↗ | (On Diff #86883) | No, there have always been 256 VGPRs. |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
93–96 ↗ | (On Diff #86883) | Ouch. My memory is wrong. Will fix. |
Added test.
Reverted portion about number of registers.
Removed general unroll threshold option.
Added logic to go deep into loop nest to see if that is actually contained loop needs to be unrolled, but not outer.
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
59 ↗ | (On Diff #86928) | It happens when you have opaque type, like image. I doubt I can easily write a test like this, but when it happens getTypeAllocSize() asserts, so it is better to keep it. |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
64–65 ↗ | (On Diff #86883) | Please add this motivating example as a comment |
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
69 | Why is this check nessesary? Is this an early exit when GEP is dependent on more than 1 induction variable? |
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
88 | Do you also want to set PartialThreshold here? |
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
88 | I thought partialy unrolled loops won't make it possible to SROA private arrays. What are the benefits of partial unrolling on AMDGPU btw? What comes in mind: mem ops clustering/widening, less branches? What else? |
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
88 | I had a test case where bumping the PartialThreshold helped more non-partial loops be unrolled, but I looked at the case again and increasing the normal Threshold has the same affect, so I don't think this is needed. |
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
69 | This is just a check that real dependency is in the inner loop, which really needs to be unrolled. When you iterate through blocks a a loop you will get those belonging to inner loops as well. I'm just checking we are about to unroll a right one. | |
88 | Actually I agree, partial unroll does not help to SROA an array. There can be other motivation, but not this. |
Why is this check nessesary? Is this an early exit when GEP is dependent on more than 1 induction variable?