This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Split entry basic block after alloca instructions.
ClosedPublic

Authored by hsmhsm on Aug 31 2021, 12:27 AM.

Details

Summary

While initializing the LDS pointers within entry basic block of kernel(s), make
sure that the entry basic block is split after alloca instructions.

Diff Detail

Event Timeline

hsmhsm created this revision.Aug 31 2021, 12:27 AM
hsmhsm requested review of this revision.Aug 31 2021, 12:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 31 2021, 12:27 AM
foad added inline comments.Aug 31 2021, 1:12 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
196

Wouldn't this break if the entry block contains something that needs the lds pointer initialized, followed by an alloca?

hsmhsm added inline comments.Aug 31 2021, 2:05 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
196

Only such possibility for breaking is - there is a call to non-kernel function foo() before alloca, and foo() uses LDS. But, the assumption here is that usually won't happen, because allocas are usually put at the beggining of the entry block.

This revision is now accepted and ready to land.Aug 31 2021, 10:09 AM
This revision was automatically updated to reflect the committed changes.
hsmhsm marked an inline comment as done.

Only such possibility for breaking is - there is a call to non-kernel function foo() before alloca, and foo() uses LDS. But, the assumption here is that usually won't happen, because allocas are usually put at the beggining of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

Subscribing @ronlieb in case this fix turns out to be insufficient for D109062

Only such possibility for breaking is - there is a call to non-kernel function foo() before alloca, and foo() uses LDS. But, the assumption here is that usually won't happen, because allocas are usually put at the beggining of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

This needs to work correctly. Alloca can legally be placed anywhere

hsmhsm added a comment.EditedSep 1 2021, 8:39 PM

Only such possibility for breaking is - there is a call to non-kernel function foo() before alloca, and foo() uses LDS. But, the assumption here is that usually won't happen, because allocas are usually put at the beggining of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

This needs to work correctly. Alloca can legally be placed anywhere

I think we discussed these items in our last Monday's internal weekly meeting. I am recapping it here again:

  • First, the entry block splitting here should happen after all the allocas which are inserted at the beginning of the block before any other non-alloca instructions which actually make sense. Note down the word beginning of the block here. But, this will not guarentee that split is always happened after alloca. As Jon mentioned, there could be a call to function foo(), and foo() has allocas, and which is inlined so that these allocas do appear after split and which is perfectly legal. But, AMDGPU back-end is not handling it correctly at the moment which is a bug in AMDGPU back-end.
  • Second, root-cause why AMDGPU back-end is not able to handle allocas which are not inserted at the beginning of the block and fix it.
arsenm added a comment.Sep 2 2021, 6:43 PM
  • First, the entry block splitting here should happen after all the allocas which are inserted at the beginning of the block before any other non-alloca instructions which actually make sense.

Allocas can appear anywhere in the block. They do not have to be clustered at the beginning. As long as it's in the entry block, it doesn't look like a dynamic alloca

Only such possibility for breaking is - there is a call to non-kernel function foo() before alloca, and foo() uses LDS. But, the assumption here is that usually won't happen, because allocas are usually put at the beggining of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

It's legal for an alloca to appear anywhere but this patch doesn't need to worry about it. If you want optimal code, you're expected to put them at the start of the entry block. Since this is just an optimization, this only needs to concern itself with these allocas. Any problems with allocas in other positions is an unrelated problem