[AMDGPU] Fix for issue in alloca to vector promotion pass
ClosedPublic

Authored by dstuttard on Apr 5 2017, 7:54 AM.

Details

Summary

Alloca promotion pass not coping with mode where array is loaded or saved into value as an array
aggregate and accessed using ExtractValue and InsertValue

This fixes the problem for the cases submitted with the bug, and also makes the routine more
graceful in situations it can't handle - in which case it will back-off the vectorization.

The function to attempt vectorization has been re-factored and split into several functions to make
it clearer what is happening.

Also added some test cases for the new modes.

Diff Detail

Repository
rL LLVM
dstuttard created this revision.Apr 5 2017, 7:54 AM
dstuttard updated this revision to Diff 94233.Apr 5 2017, 8:13 AM

Removed whitespace difference

arsenm added a comment.Apr 5 2017, 9:33 AM

Fixing crashes on it is good, but why are you spending so much effort on optimizing non-canonical IR? InstCombine decomposes aggregate loads and stores to loads and stores of the individual components. Similarly we give up on array allocas in the form where they are allocating N items of the element type.

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
483–486 ↗(On Diff #94233)

Commented out code

test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
1 ↗(On Diff #94233)

Can you change the check prefix to IR or OPT or something in case we want to add a codegen run line as well

5 ↗(On Diff #94233)

These should have ( to end the name to avoid potentially matching cases with the same prefix

10–11 ↗(On Diff #94233)

addrspace 6/7?

11 ↗(On Diff #94233)

Anonymous global

13 ↗(On Diff #94233)

Remove these comments

17 ↗(On Diff #94233)

You should run opt -instnamer on these tests to avoid anonymous values

103 ↗(On Diff #94233)

Can this case be reduced?

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The question is, what should this pass do when valid, but non-canonical IR is passed in. My inclination is to strip out the code to transform to vector in this case and instead just more gracefully handle instead (the tryPromoteAlloca function will just return false - as with other cases it can't handle).

Not sure what you mean with your last comment

Similarly we give up on array allocas in the form where they are allocating N items of the element type.

If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.

arsenm added a comment.Apr 6 2017, 9:56 AM

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The frontend emitting it seems fine if convenient, but instcombine will decompose it for you. If you pass pipeline somehow isn't running instcombine, you have a problem.

The question is, what should this pass do when valid, but non-canonical IR is passed in. My inclination is to strip out the code to transform to vector in this case and instead just more gracefully handle instead (the tryPromoteAlloca function will just return false - as with other cases it can't handle).

Not sure what you mean with your last comment

Similarly we give up on array allocas in the form where they are allocating N items of the element type.

If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.

No, I mean the I.isArrayAllocation() early exit. An alloca with a constant number of elements is turned into an alloca of 1 element of array type by instcombine.

FYI Changpeng has also been looking at this pass to handle more cases.

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The question is, what should this pass do when valid, but non-canonical IR is passed in. My inclination is to strip out the code to transform to vector in this case and instead just more gracefully handle instead (the tryPromoteAlloca function will just return false - as with other cases it can't handle).

Not sure what you mean with your last comment

Non-canonical IR are constructs that the basic canonicalization passes like instcombine produce. This is supposed to make later optimizations lives easier since they don't have to worry about every possible way of representing something. It is sufficient to just not crash or miscompile on the non-canonical inputs.

dstuttard updated this revision to Diff 94518.Apr 7 2017, 7:43 AM

Backing out of the main change - adding some more checks to the original code so the pass handles
non-canonical form more gracefully.

dstuttard updated this revision to Diff 94519.Apr 7 2017, 7:48 AM

Test case file had gained executable status - changed back again

dstuttard updated this revision to Diff 94520.Apr 7 2017, 7:51 AM

Change anonymous globals

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The frontend emitting it seems fine if convenient, but instcombine will decompose it for you. If you pass pipeline somehow isn't running instcombine, you have a problem.

Yes. This is now fixed.

The question is, what should this pass do when valid, but non-canonical IR is passed in. My inclination is to strip out the code to transform to vector in this case and instead just more gracefully handle instead (the tryPromoteAlloca function will just return false - as with other cases it can't handle).

Not sure what you mean with your last comment

Similarly we give up on array allocas in the form where they are allocating N items of the element type.

If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.

No, I mean the I.isArrayAllocation() early exit. An alloca with a constant number of elements is turned into an alloca of 1 element of array type by instcombine.

FYI Changpeng has also been looking at this pass to handle more cases.

Ok - I see what you mean. Hopefully this minor fix will not be an issue for Changpeng.

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The question is, what should this pass do when valid, but non-canonical IR is passed in. My inclination is to strip out the code to transform to vector in this case and instead just more gracefully handle instead (the tryPromoteAlloca function will just return false - as with other cases it can't handle).

Not sure what you mean with your last comment

Non-canonical IR are constructs that the basic canonicalization passes like instcombine produce. This is supposed to make later optimizations lives easier since they don't have to worry about every possible way of representing something. It is sufficient to just not crash or miscompile on the non-canonical inputs.

Agreed. That's what it now does.

test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
103 ↗(On Diff #94233)

I removed this one as I don't think it really added much. Particularly with the subsequent changes.

arsenm added inline comments.Apr 11 2017, 2:05 PM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗(On Diff #94520)

This isn't necessarily true. You could have a direct store to an alloca pointer for example

432 ↗(On Diff #94520)

I can see how this would be a problem and not handled today, but I don't think anything flatten array types. You could still see something like [4 x [4 x i32]], though the elements will still be individually addressed, not as aggregate loads and stores

dstuttard added inline comments.Apr 12 2017, 8:42 AM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗(On Diff #94520)

Ok - but since the current implementation expects a gep that's what is being verified here.
Would a comment change be sufficient?

432 ↗(On Diff #94520)

Yes you're correct. It should be possible to extend this routine to handle (for instance) [2 x [2 x i32]] and still fit in the current size constraints.
The current implementation won't handle this though.
Perhaps a FIXME or TODO comment to that effect in here?

arsenm added inline comments.Apr 19 2017, 10:10 AM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗(On Diff #94520)

Yes

432 ↗(On Diff #94520)

Yes

dstuttard updated this revision to Diff 96917.Apr 27 2017, 7:05 AM

Amending comments as per review

Any chance of a final review on this change?

arsenm accepted this revision.Jun 5 2017, 7:54 AM
arsenm added inline comments.
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
401 ↗(On Diff #96917)

Incomplete comment

409–410 ↗(On Diff #96917)

Same comment leftover

test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
40 ↗(On Diff #96917)

These tests are all missing check lines

47 ↗(On Diff #96917)

What are address spaces 6 and 7? We can't codegen these, and it is somewhat preferable to be able to codegen any IR tests (although there are some exceptions)

This revision is now accepted and ready to land.Jun 5 2017, 7:54 AM
dstuttard updated this revision to Diff 101896.Jun 8 2017, 5:42 AM

Made changes to comments
Changed address spaces in tests and added some checks

dstuttard marked 4 inline comments as done.Jun 8 2017, 5:44 AM

Made suggested changes prior to submission

This revision was automatically updated to reflect the committed changes.