Page MenuHomePhabricator

[AMDGPU] Move WQM Pass after MI Scheduler
ClosedPublic

Authored by critson on Sep 22 2020, 3:04 AM.

Details

Summary

Exec mask manipulation inserted by SIWholeQuadMode barriers to
instruction scheduling. Move the entire pass after the machine
instruction scheduler and make changes so pass is correct for
non-SSA operation. These changes should leave the pass still
usable pre-scheduler, although tests have be updated to reflect
post-scheduler results.

Diff Detail

Event Timeline

critson created this revision.Sep 22 2020, 3:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2020, 3:04 AM
critson requested review of this revision.Sep 22 2020, 3:04 AM

This passes VulkanCTS as much as stock LLVM does for graphics.
I still need to do some porting work so I can test performance impact.

foad added inline comments.Sep 22 2020, 3:41 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
993

Why set VerifyAfter = false?

Also, a nit, I think insertPass(&MachineSchedulerID, &SIPreAllocateWWMRegsID, false) would be slightly easier to understand, once you know that insertPass(A,B) just appends B to the list of passes to be inserted after A.

critson updated this revision to Diff 296422.Oct 6 2020, 5:31 AM
  • Address comments about pass insertion.
  • Fix bug in removal of trivial SGPR copies from WWM.
critson marked an inline comment as done.Oct 6 2020, 5:35 AM
arsenm added inline comments.Oct 6 2020, 7:31 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
362–371

Isn't this pass required to be post-SSA if it's after the scheduler?

critson updated this revision to Diff 296867.Oct 7 2020, 10:30 PM
  • Fix assumptions about SCC live intervals which are not valid late in compilation.
critson added inline comments.Oct 8 2020, 7:12 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
362–371

I am currently retaining the ability to run in both SSA and non-SSA modes.

nhaehnle added inline comments.Oct 12 2020, 11:19 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I don't understand the logic here. Why the special treatment of operand 0?

349–363

I don't understand this logic. A use is a use -- why should implicitness or tiedness make a difference? This seems pretty wrong.

835–851

This looks suspicious. Can you please explain what is happening here?

979

This is incorrect, there could be other users of the value. Just keep the simpler case below.

critson added inline comments.Oct 12 2020, 9:34 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Are you saying I should iterate MI->defs() instead?
The code here is intended to mark all instructions defining parts of the specified register.
If this is a partial register write then we need to follow the input values to mark the other instructions.

349–363

Agreed, this should go away.
It was an early hack before I wrote a working markDefs.

835–851

This is analogous to the code above that modifies the SI_ELSE to make it respect the EXEC mask -- it is a very special case match and fix up based on current code generation so I do not like it either.
If we make SI_ELSE always respect modifications to the EXEC mask then this can go away. Then perhaps we add a late peephole to clear up some of the unnecessary instructions when they are not required.
Do you have an opinion on this?

979

There should not be other users of the value, it is a kill?

I am not going to fight to keep this, but we would benefit from more late clean up of unnecessary copies.
I guess this ties into some of the things I am touching on in D89187, so the follow up to that might solve this.

nhaehnle added inline comments.Oct 14 2020, 4:04 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Are you saying I should iterate MI->defs() instead?

Well, there's the question of whether you need to follow implicit defs as well. I just don't see why you're treating operand 0 differently from others.

835–851

We really shouldn't rely on such details of current code generation here.

I think you're on to something. Digging into this more... SI_ELSE is currently lowered as:

s_or_saveexec_bNN dst, src
...
s_xor_bNN_term exec, exec, dst

What if we instead lowered it as:

s_or_saveexec_bNN tmp, src
...
s_and_bNN_term dst, exec, tmp
s_xor_bNN_term exec, exec, dst

One of the OptimizeExecMasking passes can then just remove the s_and_bNN_term if there is no modification of exec in the middle.

I think I'd be happy about that solution. In practice, the scan backwards from s_and_bNN isn't that expensive and I believe it's required anyway. Thoughts?

979

I appreciate the desire to remove some unnecessary copies, but let's first figure out whether this one is correct.

Specifically, I thought LRQ.isKill() only means that the use of SrcReg in MI is the _last_ use. There could be other uses of the same definition of SrcReg that come earlier, right?

So maybe you could still eliminate the copy here if you updated those other uses as well. I would still ask you to keep things simpler here for this change and see if you can find a good place to eliminate this kind of copy separately in a dedicated pass. This code is quite difficult to follow even without this.

critson marked 5 inline comments as done.Oct 16 2020, 5:05 AM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I don't think I am intentionally treating operand 0 different, this was just written to inspect the definition of the instruction and should be based on defs().

The point of this loop is to follow the chain of all definitions of a register (or parts of it), and mark each instruction involved. The idea is it stops when the entire register has been defined. (Or rather it needs to keep going if the definitions are only partial.) For that reason we should also look at implicit defs, as the first whole register implicit def should be a valid stopping point.

835–851

Yes, your solution is what I was trying to suggest.
Your instead case is what happens when we do MI.getOperand(3).setImm(1).
I will put it in as a separate Phabricator review shortly to simplify SI_ELSE lowering and optimise out the unnecessary s_and in OptimizeExecMasking.

979

OK, yep I see that there /could/ be other users, although in practice I had not encountered them.
I will work on cleaning this up in a later pass.

nhaehnle added inline comments.Oct 19 2020, 9:57 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Well the code treats operand 0 specially, because there's literally a getOperand(0) in there, with a hard-coded 0. If the intention is to treat all defs in the same way (which I think it should be), then why not have a single homogenous loop over operands?

I do understand the point about partial defs, that makes sense and it's not what I'm worried about.

835–851

Sounds good!

979

What do you mean by cleaning this up in a later pass? The goal should be to keep the MachineIR an accurate representation of the program at all times.

critson updated this revision to Diff 299556.Oct 20 2020, 11:08 PM
critson marked 2 inline comments as done.
  • Rebase
  • Fix markDefs to iterate all operands of MI
  • Remove fix up for SI_ELSE as this is no longer required
  • Remove elimination of trivial SGPR to SGPR WWM copies (this adds cruft in atomic optimizer tests)
critson marked 5 inline comments as done.Oct 20 2020, 11:15 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I have fixed this to iterate all operands looking for appropriate defs to follow.

979

MachineIR is accurate.
My point is because the pass now runs later there is nothing to optimise away trivial copies it introduces when lowering WWM operations.
See cruft this adds in atomic tests, e.g. atomic_optimizations_buffer.ll

nhaehnle accepted this revision.Oct 22 2020, 8:41 AM

LGTM

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Thanks!

979

Ah, I see. Maybe that cleanup could be done as a follow-up change.

This revision is now accepted and ready to land.Oct 22 2020, 8:41 AM
This revision was landed with ongoing or failed builds.Oct 26 2020, 6:26 PM
This revision was automatically updated to reflect the committed changes.
critson marked an inline comment as done.
mceier added a subscriber: mceier.Oct 30 2020, 1:18 PM
mceier added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

Is this really correct ? CS:GO crashes with git version of llvm at line 657 (on Radeon 5700XT):

#0  0x00007fad33ab57b0 in llvm::IndexListEntry::getIndex (this=0x0) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:58
#1  0x00007fad33ab581d in llvm::SlotIndex::getIndex (this=0x7fad2ffa2960) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:125
#2  0x00007fad33ab591d in llvm::SlotIndex::operator> (this=0x7fad2ffa2960, other=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:187
#3  0x00007fad362e842c in (anonymous namespace)::SIWholeQuadMode::prepareInsertion (this=0x1460000, MBB=..., First=..., Last=..., PreferLast=false, SaveSCC=true)
    at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:657
#4  0x00007fad362e923c in (anonymous namespace)::SIWholeQuadMode::processBlock (this=0x1460000, MBB=..., LiveMaskReg=2147484277, isEntry=true)
    at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:859
#5  0x00007fad362ea140 in (anonymous namespace)::SIWholeQuadMode::runOnMachineFunction (this=0x1460000, MF=...)
    at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:1055
#6  0x00007fad33d15048 in llvm::MachineFunctionPass::runOnFunction (this=0x1460000, F=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/CodeGen/MachineFunctionPass.cpp:73
#7  0x00007fad3391406a in llvm::FPPassManager::runOnFunction (this=0x148ba80, F=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1519
#8  0x00007fad352ab96b in (anonymous namespace)::CGPassManager::RunPassOnSCC (this=0x148bc40, P=0x148ba80, CurSCC=..., CG=..., CallGraphUpToDate=@0x7fad2ffa2d8d: true, DevirtualizedCall=@0x7fad2ffa2e30: false)
    at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:178
#9  0x00007fad352ac4ed in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC (this=0x148bc40, CurSCC=..., CG=..., DevirtualizedCall=@0x7fad2ffa2e30: false)
    at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:476
#10 0x00007fad352ac7e4 in (anonymous namespace)::CGPassManager::runOnModule (this=0x148bc40, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:541
#11 0x00007fad339147f6 in (anonymous namespace)::MPPassManager::runOnModule (this=0x1412800, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1634
#12 0x00007fad3390f6ce in llvm::legacy::PassManagerImpl::run (this=0x1430200, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:615
#13 0x00007fad33915087 in llvm::legacy::PassManager::run (this=0x1427bb8, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1761
#14 0x00007fad3abc7145 in ac_compile_module_to_elf (p=0x1427b60, module=0x35af9300, pelf_buffer=0x111a5ad0, pelf_size=0x111a5ad8) at ../mesa-9999/src/amd/llvm/ac_llvm_helper.cpp:259
#15 0x00007fad3aad8501 in si_compile_llvm (sscreen=0xc23400, binary=0x111a5ad0, conf=0x111a5ae8, compiler=0xc23cb0, ac=0x7fad2ffa3490, debug=0x2577a030, stage=MESA_SHADER_FRAGMENT,
    name=0x7fad3ad7582a "Pixel Shader", less_optimized=false) at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader_llvm.c:104
#16 0x00007fad3aad61d2 in si_llvm_compile_shader (sscreen=0xc23400, compiler=0xc23cb0, shader=0x111a5a00, debug=0x2577a030, nir=0xc5a6450, free_nir=false)
    at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader.c:1891
#17 0x00007fad3aad634d in si_compile_shader (sscreen=0xc23400, compiler=0xc23cb0, shader=0x111a5a00, debug=0x2577a030) at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader.c:1927
#18 0x00007fad3ab109c3 in si_init_shader_selector_async (job=0x2577a000, thread_index=0) at ../mesa-9999/src/gallium/drivers/radeonsi/si_state_shaders.c:2492
#19 0x00007fad3a73196b in util_queue_thread_func (input=0xc34ed0) at ../mesa-9999/src/util/u_queue.c:304
#20 0x00007fad3a730acb in impl_thrd_routine (p=0xc34ec0) at ../mesa-9999/include/c11/threads_posix.h:87
#21 0x00007fad3fbcbf9e in start_thread (arg=0x7fad2ffa5640) at pthread_create.c:463
#22 0x00007fad3ff4865f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
critson added inline comments.Nov 5 2020, 4:10 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

Thanks for bringing this to my attention. Sorry for the slow response. I am currently investigating. Do you happen to have any more details for reproduction?

mceier added inline comments.Nov 5 2020, 6:02 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

The crash happens at CS:GO startup and unfortunately besides this and the stacktrace I don't have more details about it; I don't know how to extract the shader that triggers this crash (not even sure if it will help with anything). So far only CS:GO triggers it.

I wonder if the move of "insertPass" from line 1000 to 993 might be the culprit ? If so I could rebuild the llvm (I didn't do it yet because it takes a lot of time) and verify this.

mceier added inline comments.Nov 5 2020, 11:49 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

I tried changing the order of passes and CS:GO still crashed. Also I was able to dump LLVM IR of the shader (by setting AMD_DEBUG=ps,vs):

It's always shader 387 that crashes. Hopefully it will help you debug this.

mceier added inline comments.Nov 5 2020, 11:53 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

After commenting out top 2 lines, llc produces the same stacktrace as CS:GO.

critson added inline comments.Nov 6 2020, 5:14 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

Thanks for this shader.
Can you share you llc command line options as I was unable to get a crash feeding the shader to: llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs

mceier added inline comments.Nov 6 2020, 5:25 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

Well, I just renamed shader4.txt to shader4.ll, commented out 2 top lines and ran llc shader4.ll and llc crashed. I didn't use any llc options.

"llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs shader4.ll" also doesn't crash here.

LLVM commit is f738aee0bbf39d11b9f0104e094c7893ffca040c

llc --version shows:

LLVM (http://llvm.org/):

LLVM version 12.0.0git
Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: skylake

Registered Targets:
  aarch64    - AArch64 (little endian)
  aarch64_32 - AArch64 (little endian ILP32)
  aarch64_be - AArch64 (big endian)
  amdgcn     - AMD GCN GPUs
  arm        - ARM
  arm64      - ARM64 (little endian)
  arm64_32   - ARM64 (little endian ILP32)
  armeb      - ARM (big endian)
  bpf        - BPF (host endian)
  bpfeb      - BPF (big endian)
  bpfel      - BPF (little endian)
  nvptx      - NVIDIA PTX 32-bit
  nvptx64    - NVIDIA PTX 64-bit
  r600       - AMD GPUs HD2XXX-HD6XXX
  riscv32    - 32-bit RISC-V
  riscv64    - 64-bit RISC-V
  thumb      - Thumb
  thumbeb    - Thumb (big endian)
  wasm32     - WebAssembly 32-bit
  wasm64     - WebAssembly 64-bit
  x86        - 32-bit X86: Pentium-Pro and above
  x86-64     - 64-bit X86: EM64T and AMD64
mceier added inline comments.Nov 6 2020, 6:07 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

llc crashes only for -mcpu=generic and -mcpu=generic-hsa

btw. shader4.txt.gz on phabricator seems to be gzipped twice (I uploaded gzipped file and I didn't know phabricator will gzip it again) and has to be ungzipped twice.

critson added inline comments.Nov 6 2020, 11:19 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
653–657

Thanks.
I have created D90997 which hopefully should address the issue.