Page MenuHomePhabricator

[AMDGPU] Uniform values being used outside loop marked non-divergent
Needs ReviewPublic

Authored by rtaylor on Apr 17 2019, 12:53 PM.

Details

Summary

This fixes an issue where values are uniform inside of a loop but
the uses of that value outside the loop can be divergent.

Change-Id: I94d3d2e30cc2a6ae8d59e92cadf6f1b6cb7e708b

Diff Detail

Event Timeline

rtaylor created this revision.Apr 17 2019, 12:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 17 2019, 12:53 PM
arsenm requested changes to this revision.Apr 17 2019, 12:58 PM

This is a workaround. The structurizer / annotator must be correct without relying on another pass to hide situations they don't handle correctly

This revision now requires changes to proceed.Apr 17 2019, 12:58 PM

We have a test case such that a value that is uniform in the loop is used outside the loop where threads might have diverged.

For example:

define amdgpu_ps void @_amdgpu_ps_main(<4 x i32> inreg %desc, float %divergent, <2 x i32> %ptrish) {
bb59:

br label %.preheader

.preheader:

%tmp62 = phi i32 [ %tmp105, %bb104 ], [ 0, %bb59 ]
cmp and branch here

bb104:

%tmp105 = add nuw nsw i32 %tmp62, 1
cmp and branch here

.loopexit:

%load2 = tail call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp62, i32 0)

}

This calls lcssa after StructurizeCFG which inserts PHI nodes into the exit block for this type of value, allowing proper DA of the value after the loop.

This is a workaround. The structurizer / annotator must be correct without relying on another pass to hide situations they don't handle correctly

We discussed going to the new DA which may track these properly. I haven't gotten to finding out if that is the case yet or not. Some of the group thought it was worthwhile to upstream this for now, I've added you to an email chain regarding this issue.

It sort of intuitively makes sense to me that the control flow lowering would like LCSSA. However, this should not be handled by adding it directly to the pass pipeline. You can add this as a dependency, e.g. AU.addRequiredID(LCSSAID);

I would also like to see the an IR->IR testcase showing LCSSA was implicitly run

It sort of intuitively makes sense to me that the control flow lowering would like LCSSA. However, this should not be handled by adding it directly to the pass pipeline. You can add this as a dependency, e.g. AU.addRequiredID(LCSSAID);

I would also like to see the an IR->IR testcase showing LCSSA was implicitly run

Actually, what really requires LCSSA? Is it DivergenceAnalysis or StructurizeCFG directly?

rtaylor added a comment.EditedApr 17 2019, 3:33 PM

It sort of intuitively makes sense to me that the control flow lowering would like LCSSA. However, this should not be handled by adding it directly to the pass pipeline. You can add this as a dependency, e.g. AU.addRequiredID(LCSSAID);

I would also like to see the an IR->IR testcase showing LCSSA was implicitly run

Actually, what really requires LCSSA? Is it DivergenceAnalysis or StructurizeCFG directly?

How I understand the problem is that DA is not looking across blocks and therefore won't see that tmp62 is actually divergent in loop exit (though it is uniform in the loop). LCSSA provides a phi node for the loop exit block (where it is divergent) and allows DA to mark it divergent so that the s_buffer_load can be lowered to a buffer_load.

It sort of intuitively makes sense to me that the control flow lowering would like LCSSA. However, this should not be handled by adding it directly to the pass pipeline. You can add this as a dependency, e.g. AU.addRequiredID(LCSSAID);

I would also like to see the an IR->IR testcase showing LCSSA was implicitly run

Actually, what really requires LCSSA? Is it DivergenceAnalysis or StructurizeCFG directly?

How I understand the problem is that DA is not looking across blocks and therefore won't see that tmp62 is actually divergent in loop exit (though it is uniform in the loop). LCSSA provides a phi node for the loop exit block (where it is divergent) and allows DA to mark it divergent so that the s_buffer_load can be lowered to a buffer_load.

OK, so it seems to me like the dependency should be on DivergenceAnalysis. if you just patch it here, the same problem will occur for the handful of other passes that optionally use it

I think there are some misunderstandings here. None of the IR passes require LCSSA. The problem is in getting the divergence data into the SelectionDAG.

Specifically, you can have, in a weird mixture of IR and SelectionDAG:

loop:
  ...
  %uni = ...                             ; Value is uniform here
  ...
  br i1 %div, label %loop, label %next   ; Divergent loop exit

next:
  %0 = CopyFromReg N(corresponding to %uni)
  use %0

In this case, %0 must be labeled divergent. However, %0 does not exist at an IR level, and so the code in isSDNodeSourceOfDivergence can only query for the divergence of %uni. However, %uni itself is uniform.

One way to look at the problem is that DivergenceAnalysis::isDivergent is really "isDivergentAtDefinition", and what we need is a query "isDivergentAtUse". Implementing that query isn't entirely trivial, and LCSSA is effectively an alternative way of making the right query.

I think there are some misunderstandings here. None of the IR passes require LCSSA. The problem is in getting the divergence data into the SelectionDAG.

Specifically, you can have, in a weird mixture of IR and SelectionDAG:

loop:
  ...
  %uni = ...                             ; Value is uniform here
  ...
  br i1 %div, label %loop, label %next   ; Divergent loop exit

next:
  %0 = CopyFromReg N(corresponding to %uni)
  use %0

In this case, %0 must be labeled divergent. However, %0 does not exist at an IR level, and so the code in isSDNodeSourceOfDivergence can only query for the divergence of %uni. However, %uni itself is uniform.

One way to look at the problem is that DivergenceAnalysis::isDivergent is really "isDivergentAtDefinition", and what we need is a query "isDivergentAtUse". Implementing that query isn't entirely trivial, and LCSSA is effectively an alternative way of making the right query.

So really amdgpu-isel is dependent on LCSSA

I am hitting this assert in LuxMark with this patch:
Assertion failed: (IncomingDef->isPHI()), function lowerPhis, file ../lib/Target/AMDGPU/SILowerI1Copies.cpp, line 534.
Stack dump:
0. Program arguments: /Users/matt/src/llvm/build_debug/bin/clang -cc1 -triple amdgcn-amd-amdhsa -emit-obj -disable-free -main-file-name t_9348_21.bc -mrelocation-model pic -pic-level 1 -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -fvisibility hidden -fapply-global-visibility-to-externs -target-cpu gfx900 -target-feature -wavefrontsize16 -target-feature -wavefrontsize32 -target-feature +wavefrontsize64 -target-feature -sram-ecc -target-feature -code-object-v3 -target-feature +cumode -dwarf-column-info -debugger-tuning=gdb -resource-dir /home/marsenau/builds/opencl_amdgpu_scratch/bin/lib/clang/8.0 -O3 -fdebug-compilation-dir /home/marsenau/src/LuxMark-3.1 -ferror-limit 19 -fmessage-length 201 -cl-kernel-arg-info -fobjc-runtime=gcc -fdiagnostics-show-option -vectorize-loops -vectorize-slp -mllvm -amdgpu-internalize-symbols -mllvm -amdgpu-early-inline-all -o /tmp/t_9348_21-9f1436.o -x ir AMD_9348_7/t_9348_21.bc -faddrsig

  1. Code generation
  2. Running pass 'CallGraph Pass Manager' on module 'AMD_9348_7/t_9348_21.bc'.
  3. Running pass 'SI Lower i1 Copies' on function '@scheduler'

0 clang 0x0000000109a1f8bc llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 60
1 clang 0x0000000109a1fe79 PrintStackTraceSignalHandler(void*) + 25
2 clang 0x0000000109a1db36 llvm::sys::RunSignalHandlers() + 118
3 clang 0x0000000109a23a42 SignalHandler(int) + 210
4 libsystem_platform.dylib 0x00007fff7ddd1b5d _sigtramp + 29
5 libsystem_platform.dylib 0x000000012a983938 _sigtramp + 2897944056
6 libsystem_c.dylib 0x00007fff7dc916a6 abort + 127
7 libsystem_c.dylib 0x00007fff7dc5a20d basename_r + 0
8 clang 0x0000000106af4af5 (anonymous namespace)::SILowerI1Copies::lowerPhis() + 1141
9 clang 0x0000000106af40ba (anonymous namespace)::SILowerI1Copies::runOnMachineFunction(llvm::MachineFunction&) + 186
10 clang 0x000000010860f8de llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 542
11 clang 0x0000000108bfab35 llvm::FPPassManager::runOnFunction(llvm::Function&) + 613
12 clang 0x0000000107f5c8ad (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) + 925
13 clang 0x0000000107f596ed (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) + 541
14 clang 0x0000000107f58ec1 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) + 433
15 clang 0x0000000108bfb8d5 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) + 789

I'm working on reducing it

rtaylor updated this revision to Diff 196238.Apr 23 2019, 6:42 AM

Moved LCSSA call to after the sinking pass.

arsenm requested changes to this revision.Apr 23 2019, 7:03 AM

This should really be expressed as a pass dependency, not explicitly adding the pass to the pipeline

This revision now requires changes to proceed.Apr 23 2019, 7:03 AM

Also needs a comment explaining why LCSSA is needed

Also needs a comment explaining why LCSSA is needed

So are you looking for a dependency and a comment or just a comment?

rtaylor updated this revision to Diff 197792.May 2 2019, 8:16 AM

I was misunderstanding the dependency issues, this should fix it.

Scratch that last commit, this is still broken.

arsenm added inline comments.May 2 2019, 8:18 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
815 ↗(On Diff #197792)

You should remove the explicit pass since it's a dependency now

rtaylor updated this revision to Diff 197814.May 2 2019, 9:56 AM

Preserving StackProtector in LCSSA to avoid pass scheduling conflict
Removed explicit call of LCSSA in AMDGPUTargetMachine

arsenm added inline comments.May 2 2019, 10:38 AM
lib/Transforms/Utils/LCSSA.cpp
450

If you use the ID form you should be able to avoid the include

rtaylor marked an inline comment as done.May 2 2019, 11:24 AM
rtaylor added inline comments.
lib/Transforms/Utils/LCSSA.cpp
450

I don't think there is one for StackProtector.

arsenm added inline comments.May 2 2019, 12:34 PM
lib/Transforms/Utils/LCSSA.cpp
450

You can add that

rtaylor marked an inline comment as done.May 2 2019, 3:36 PM
rtaylor added inline comments.
lib/Transforms/Utils/LCSSA.cpp
450

I could but depending on where it goes it might still need a header, if it's not already included.

rtaylor updated this revision to Diff 198293.May 6 2019, 9:10 AM

Added StackProtectorID and changed LCSSA to use it instead of StackProtector directly.

arsenm added inline comments.May 6 2019, 1:42 PM
lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
92

A little more about why it doesn’t get this information would be good

rtaylor updated this revision to Diff 198471.May 7 2019, 7:55 AM

More detailed explanation of why LCSSA is needed

nhaehnle accepted this revision.Jun 3 2019, 12:36 AM
nhaehnle added a subscriber: alex-t.

Assuming any linker errors with this are fixed, I think this is ready to go in.

Okay, so the linker errors haven't been resolved. If issues related to this problem are burning people urgently, I think a reasonable quick fix would be to add LCSSA not as a dependency but as a codegen pass. It'd be great if we could avoid that, but as hacks go, that still seems cleaner to me than some of the alternatives.