This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not only rely on BB number when finding bottom loop
ClosedPublic

Authored by msearles on Feb 27 2018, 12:20 PM.

Details

Summary

We should also check that the "bottom" basic block of a loop
is a successor of the "header" basic block, otherwise we don't
propagate the information correctly when the CFG is complex.

This fixes an important rendering problem with Wolfsentein 2,
because of one vector-memory wait was missing.

Please review,
Thanks!

Diff Detail

Event Timeline

hakzsam created this revision.Feb 27 2018, 12:20 PM

is it possible to include the test as well?

hakzsam updated this revision to Diff 136696.EditedMar 2 2018, 1:24 AM

This new revision attaches a LLVM IR testcase that reproduces the issue, not sure if the test can still be reduced compared to the original monster shader.

I have noticed one weird thing after running shader-db, it looks like there is a side effect that produces some unnecessary expcnt(0) instructions in some compute shaders. Maybe this patch triggers a new issue? Otherwise, I don't see any regressions with RADV.

I will investigate about the extra expcnt(0).

Please review, thanks!

Here's a test case where a expcnt(0) is generated https://hastebin.com/uyuhabeqov.pl

If you look at the CFG, there is two loops (BB1 and BB5). The previous algorithm returns BB12 for both loop headers which looks quite wrong. The real bottom basic blocks are BB2 for BB1 and BB6 for BB5 (which is what this patch does return).

I think something isn't correctly re-initialized. Anyway, generating expcnt(0) instructions in compute shaders is just useless.

Oh, I thought expcnt() was for parameters export only, but it can be used for GDS instructions as well.

So, the expcnt(0) is generated because there is a buffer_atomic_add in that test case, which actually makes sense to me. Looks like this patch also fixes that case.

Could you please reduce the testcase?

Could you please reduce the testcase?

Okay, I will try to.

hakzsam updated this revision to Diff 137003.Mar 5 2018, 7:31 AM

v3: reduced testcase

For the testcase, would you also run -instnamer to rename instances of %<number>? You may want to combine it with other clean-up options, something like this: opt -S -deadarghaX0r -strip -strip-debug -strip-dead-prototypes -instnamer

For the testcase, would you also run -instnamer to rename instances of %<number>? You may want to combine it with other clean-up options, something like this: opt -S -deadarghaX0r -strip -strip-debug -strip-dead-prototypes -instnamer

Ah yeah, thanks for the hints. I will try to clean up the testcase a bit more.

hakzsam updated this revision to Diff 137043.Mar 5 2018, 10:52 AM
hakzsam retitled this revision from [RFC] [AMDGPU] Do not only rely on BB number when finding bottom loop to [AMDGPU] Do not only rely on BB number when finding bottom loop.

v4: - run 'opt -S -deadarghaX0r -strip -strip-debug -strip-dead-prototypes -instnamer'

  • remove RFC on the subject

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

test/CodeGen/AMDGPU/waitcnt-looptest.ll
145–146 ↗(On Diff #137043)

These attributes clash with the attributes defined later on in the added part of the tests. I'm surprised there is no warning about that.

Please put all attributes at the end of the file and make their numbers unique.

nhaehnle added inline comments.Mar 5 2018, 11:42 PM
test/CodeGen/AMDGPU/waitcnt-looptest.ll
170 ↗(On Diff #137043)

This and possibly others are redundant. Have you tried running -dce?

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.

Exactly, I can try to write a test with multi back-edges and we will see what happens in this situation.

test/CodeGen/AMDGPU/waitcnt-looptest.ll
170 ↗(On Diff #137043)

No, I didn't, I will try and update the patch again if needed, Thanks!

hakzsam updated this revision to Diff 137139.Mar 6 2018, 1:56 AM

v5: run -dce to remove redundant phis

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.

https://hastebin.com/joqihayohe One case that doesn't work (ie. a vmcnt(0) is missing).

I haven't yet had the time to fully grok what's going on here, but I suspect this code is fundamentally broken because the whole assumption of having a single loopBottom is wrong. It happens to be true for structured loops with non-uniform control flow, but when there's uniform control flow, there may well be multiple back-edges. What happens in those cases?

I'm not sure either if the current code correctly supports multiple back-edges. I think if there is more than one, loopBottom() will select the basic block with the highest number if it's a successor of the header loop.

The other way around: it will select the basic block with the highest number which has the loop header as its successor.

The question is whether that ends up doing the right thing in all cases.

https://hastebin.com/joqihayohe One case that doesn't work (ie. a vmcnt(0) is missing).

I should have a fix for the multiple back-edges case, I'm going to run CTS to be sure it doesn't break anything before updating.

hakzsam updated this revision to Diff 137189.Mar 6 2018, 7:01 AM

v6: fix the multiple back-edges case and update the testcase

arsenm added a comment.Mar 6 2018, 8:16 AM

I think this really calls for a MIR testcase

hakzsam updated this revision to Diff 137384.Mar 7 2018, 7:07 AM

v7:

  • replace the LLVM IR testcase with a MIR one
  • remove one unused variable in isLoopBottom()
  • do not rely on block numbfer in isLoopBottom()
arsenm added inline comments.Mar 7 2018, 7:21 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1556

lowercase true

1562

Braces

1563

Swap order of checks

1572

Braces

hakzsam updated this revision to Diff 137399.Mar 7 2018, 8:25 AM

v8:

  • add a sanity check when a block is looping over itself
  • cosmetic changes
msearles added inline comments.Mar 8 2018, 6:40 PM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1581

Can you explain why you avoid re-visiting a loop when a block loops over itself? I suppose this makes sense considering the current waitcnt pass sources. I.e., it admits that it does not handle single block loops correctly. See an "IMPORTANT NOTE" to this effect in SIInsertWaitcnts::mergeInputScoreBrackets() at SIInsertWaitcnts.cpp:1315-1317 . However, in order for it to correctly handle single block loops, it will need to revisit them and then the sanitify check won't make sense. I have a patch that fixes SIInsertWaitcnts::mergeInputScoreBrackets() in the case of a single block loop, but it's only useful if the block is revisited.

See the reduced testcase. Note:

  • the first GLOBAL_STORE_DWORD has its write-data in $vgpr11
  • the last GLOBAL_LOAD_DWORD loads into $vgpr11
  • so, a waitcnt is needed before the store to wait for the load
  • if the pass does not revisit bb.0, it has no way of knowing this


$ llc -mcpu=gfx900 -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts testcase.mir
$ cat testcase.mir
--- |
define amdgpu_kernel void @benchmark_func() {
ret void
}
...
---
name: benchmark_func
body: |
bb.0:
GLOBAL_STORE_DWORD $vgpr7_vgpr8, $vgpr11, 0, 0, 0, implicit $exec
$vgpr21 = GLOBAL_LOAD_DWORD $vgpr4_vgpr5, 0, 0, 0, implicit $exec
$vgpr10 = GLOBAL_LOAD_DWORD $vgpr10_vgpr11, 0, 0, 0, implicit $exec
GLOBAL_STORE_DWORD $vgpr14_vgpr15, $vgpr21, 0, 0, 0, implicit $exec
$vgpr11 = GLOBAL_LOAD_DWORD $vgpr11_vgpr12, 0, 0, 0, implicit $exec
S_CBRANCH_SCC1 %bb.0, implicit $scc

bb.1:
S_ENDPGM
...

msearles added inline comments.Mar 14 2018, 9:35 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1581

Hi @hakzsam, I'd like to see this patch move forward, so any comment on my comment? Thanks.

hakzsam added inline comments.Mar 15 2018, 1:17 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1581

Hi,

Sorry, I was working on something more important.

Yes, I would like this patch to move forward too, I have to handle your testcase. I will update in the next few days.

Thanks for your feedbacks!

msearles added inline comments.Mar 15 2018, 8:50 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1581

Sounds good; note that the testcase is llvm/trunk/test/CodeGen/AMDGPU/waitcnt-loop-single-basic-block.mir , which was pushed as part of https://reviews.llvm.org/rL327583 . I think if you remove that sanitify check, then the testcase will continue to pass; however, it will fail with the santify check as the single bb is not revisited.

Mark

msearles added inline comments.Apr 7 2018, 7:11 AM
lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1581

Hi @hakzsam - ping; let me know if you don't have time to work on this; I can pick-up where you left off.

Mark

Hi Mark,

Yes, feel free to update this patch if you have time because I'm not sure when I will be able to.

Thanks.

msearles commandeered this revision.Apr 10 2018, 9:12 AM
msearles edited reviewers, added: hakzsam; removed: msearles.

Picking up where @hakzsam left off

msearles updated this revision to Diff 141867.Apr 10 2018, 9:15 AM
  • Change isLoopBottom() to return true in the case of a single basic block loop
  • Clear LoopWaitcntDataMap in between functions
  • Change isLoopBottom() to return true in the case of a single basic block loop
  • Clear LoopWaitcntDataMap in between functions

Thanks for taking care of this!

The patch seems good to me, but as my LLVM skills are not that great, I'm not able to say "LGTM". :)

I will let AMDGPU LLVM people reply.

Ping; the patch looks good to the original author and to me, the follow-on author. However, it would be nice if someone other than one of us accepts the revision.

This revision is now accepted and ready to land.Apr 17 2018, 12:51 PM
This revision was automatically updated to reflect the committed changes.