Exec mask manipulation inserted by SIWholeQuadMode barriers to
instruction scheduling. Move the entire pass after the machine
instruction scheduler and make changes so pass is correct for
non-SSA operation. These changes should leave the pass still
usable pre-scheduler, although tests have be updated to reflect
post-scheduler results.
Details
- Reviewers
nhaehnle foad arsenm - Commits
- rG7a880ab38892: [AMDGPU] Move WQM Pass after MI Scheduler
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
This passes VulkanCTS as much as stock LLVM does for graphics.
I still need to do some porting work so I can test performance impact.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
1001 | Why set VerifyAfter = false? Also, a nit, I think insertPass(&MachineSchedulerID, &SIPreAllocateWWMRegsID, false) would be slightly easier to understand, once you know that insertPass(A,B) just appends B to the list of passes to be inserted after A. |
- Address comments about pass insertion.
- Fix bug in removal of trivial SGPR copies from WWM.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
387 | Isn't this pass required to be post-SSA if it's after the scheduler? |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
387 | I am currently retaining the ability to run in both SSA and non-SSA modes. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 | I don't understand the logic here. Why the special treatment of operand 0? | |
358–372 | I don't understand this logic. A use is a use -- why should implicitness or tiedness make a difference? This seems pretty wrong. | |
862–878 | This looks suspicious. Can you please explain what is happening here? | |
1012 | This is incorrect, there could be other users of the value. Just keep the simpler case below. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 | Are you saying I should iterate MI->defs() instead? | |
358–372 | Agreed, this should go away. | |
862–878 | This is analogous to the code above that modifies the SI_ELSE to make it respect the EXEC mask -- it is a very special case match and fix up based on current code generation so I do not like it either. | |
1012 | There should not be other users of the value, it is a kill? I am not going to fight to keep this, but we would benefit from more late clean up of unnecessary copies. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 |
Well, there's the question of whether you need to follow implicit defs as well. I just don't see why you're treating operand 0 differently from others. | |
862–878 | We really shouldn't rely on such details of current code generation here. I think you're on to something. Digging into this more... SI_ELSE is currently lowered as: s_or_saveexec_bNN dst, src ... s_xor_bNN_term exec, exec, dst What if we instead lowered it as: s_or_saveexec_bNN tmp, src ... s_and_bNN_term dst, exec, tmp s_xor_bNN_term exec, exec, dst One of the OptimizeExecMasking passes can then just remove the s_and_bNN_term if there is no modification of exec in the middle. I think I'd be happy about that solution. In practice, the scan backwards from s_and_bNN isn't that expensive and I believe it's required anyway. Thoughts? | |
1012 | I appreciate the desire to remove some unnecessary copies, but let's first figure out whether this one is correct. Specifically, I thought LRQ.isKill() only means that the use of SrcReg in MI is the _last_ use. There could be other uses of the same definition of SrcReg that come earlier, right? So maybe you could still eliminate the copy here if you updated those other uses as well. I would still ask you to keep things simpler here for this change and see if you can find a good place to eliminate this kind of copy separately in a dedicated pass. This code is quite difficult to follow even without this. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 | I don't think I am intentionally treating operand 0 different, this was just written to inspect the definition of the instruction and should be based on defs(). The point of this loop is to follow the chain of all definitions of a register (or parts of it), and mark each instruction involved. The idea is it stops when the entire register has been defined. (Or rather it needs to keep going if the definitions are only partial.) For that reason we should also look at implicit defs, as the first whole register implicit def should be a valid stopping point. | |
862–878 | Yes, your solution is what I was trying to suggest. | |
1012 | OK, yep I see that there /could/ be other users, although in practice I had not encountered them. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 | Well the code treats operand 0 specially, because there's literally a getOperand(0) in there, with a hard-coded 0. If the intention is to treat all defs in the same way (which I think it should be), then why not have a single homogenous loop over operands? I do understand the point about partial defs, that makes sense and it's not what I'm worried about. | |
862–878 | Sounds good! | |
1012 | What do you mean by cleaning this up in a later pass? The goal should be to keep the MachineIR an accurate representation of the program at all times. |
- Rebase
- Fix markDefs to iterate all operands of MI
- Remove fix up for SI_ELSE as this is no longer required
- Remove elimination of trivial SGPR to SGPR WWM copies (this adds cruft in atomic optimizer tests)
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
315–332 | I have fixed this to iterate all operands looking for appropriate defs to follow. | |
1012 | MachineIR is accurate. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | Is this really correct ? CS:GO crashes with git version of llvm at line 657 (on Radeon 5700XT): #0 0x00007fad33ab57b0 in llvm::IndexListEntry::getIndex (this=0x0) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:58 #1 0x00007fad33ab581d in llvm::SlotIndex::getIndex (this=0x7fad2ffa2960) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:125 #2 0x00007fad33ab591d in llvm::SlotIndex::operator> (this=0x7fad2ffa2960, other=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/include/llvm/CodeGen/SlotIndexes.h:187 #3 0x00007fad362e842c in (anonymous namespace)::SIWholeQuadMode::prepareInsertion (this=0x1460000, MBB=..., First=..., Last=..., PreferLast=false, SaveSCC=true) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:657 #4 0x00007fad362e923c in (anonymous namespace)::SIWholeQuadMode::processBlock (this=0x1460000, MBB=..., LiveMaskReg=2147484277, isEntry=true) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:859 #5 0x00007fad362ea140 in (anonymous namespace)::SIWholeQuadMode::runOnMachineFunction (this=0x1460000, MF=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp:1055 #6 0x00007fad33d15048 in llvm::MachineFunctionPass::runOnFunction (this=0x1460000, F=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/CodeGen/MachineFunctionPass.cpp:73 #7 0x00007fad3391406a in llvm::FPPassManager::runOnFunction (this=0x148ba80, F=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1519 #8 0x00007fad352ab96b in (anonymous namespace)::CGPassManager::RunPassOnSCC (this=0x148bc40, P=0x148ba80, CurSCC=..., CG=..., CallGraphUpToDate=@0x7fad2ffa2d8d: true, DevirtualizedCall=@0x7fad2ffa2e30: false) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:178 #9 0x00007fad352ac4ed in (anonymous namespace)::CGPassManager::RunAllPassesOnSCC (this=0x148bc40, CurSCC=..., CG=..., DevirtualizedCall=@0x7fad2ffa2e30: false) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:476 #10 0x00007fad352ac7e4 in (anonymous namespace)::CGPassManager::runOnModule (this=0x148bc40, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/Analysis/CallGraphSCCPass.cpp:541 #11 0x00007fad339147f6 in (anonymous namespace)::MPPassManager::runOnModule (this=0x1412800, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1634 #12 0x00007fad3390f6ce in llvm::legacy::PassManagerImpl::run (this=0x1430200, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:615 #13 0x00007fad33915087 in llvm::legacy::PassManager::run (this=0x1427bb8, M=...) at /var/tmp/portage/sys-devel/llvm-12.0.0.9999-r1/work/llvm/lib/IR/LegacyPassManager.cpp:1761 #14 0x00007fad3abc7145 in ac_compile_module_to_elf (p=0x1427b60, module=0x35af9300, pelf_buffer=0x111a5ad0, pelf_size=0x111a5ad8) at ../mesa-9999/src/amd/llvm/ac_llvm_helper.cpp:259 #15 0x00007fad3aad8501 in si_compile_llvm (sscreen=0xc23400, binary=0x111a5ad0, conf=0x111a5ae8, compiler=0xc23cb0, ac=0x7fad2ffa3490, debug=0x2577a030, stage=MESA_SHADER_FRAGMENT, name=0x7fad3ad7582a "Pixel Shader", less_optimized=false) at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader_llvm.c:104 #16 0x00007fad3aad61d2 in si_llvm_compile_shader (sscreen=0xc23400, compiler=0xc23cb0, shader=0x111a5a00, debug=0x2577a030, nir=0xc5a6450, free_nir=false) at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader.c:1891 #17 0x00007fad3aad634d in si_compile_shader (sscreen=0xc23400, compiler=0xc23cb0, shader=0x111a5a00, debug=0x2577a030) at ../mesa-9999/src/gallium/drivers/radeonsi/si_shader.c:1927 #18 0x00007fad3ab109c3 in si_init_shader_selector_async (job=0x2577a000, thread_index=0) at ../mesa-9999/src/gallium/drivers/radeonsi/si_state_shaders.c:2492 #19 0x00007fad3a73196b in util_queue_thread_func (input=0xc34ed0) at ../mesa-9999/src/util/u_queue.c:304 #20 0x00007fad3a730acb in impl_thrd_routine (p=0xc34ec0) at ../mesa-9999/include/c11/threads_posix.h:87 #21 0x00007fad3fbcbf9e in start_thread (arg=0x7fad2ffa5640) at pthread_create.c:463 #22 0x00007fad3ff4865f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | Thanks for bringing this to my attention. Sorry for the slow response. I am currently investigating. Do you happen to have any more details for reproduction? |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | The crash happens at CS:GO startup and unfortunately besides this and the stacktrace I don't have more details about it; I don't know how to extract the shader that triggers this crash (not even sure if it will help with anything). So far only CS:GO triggers it. I wonder if the move of "insertPass" from line 1000 to 993 might be the culprit ? If so I could rebuild the llvm (I didn't do it yet because it takes a lot of time) and verify this. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | I tried changing the order of passes and CS:GO still crashed. Also I was able to dump LLVM IR of the shader (by setting AMD_DEBUG=ps,vs): It's always shader 387 that crashes. Hopefully it will help you debug this. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | After commenting out top 2 lines, llc produces the same stacktrace as CS:GO. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | Thanks for this shader. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | Well, I just renamed shader4.txt to shader4.ll, commented out 2 top lines and ran llc shader4.ll and llc crashed. I didn't use any llc options. "llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs shader4.ll" also doesn't crash here. LLVM commit is f738aee0bbf39d11b9f0104e094c7893ffca040c llc --version shows: LLVM (http://llvm.org/): LLVM version 12.0.0git Optimized build. Default target: x86_64-pc-linux-gnu Host CPU: skylake Registered Targets: aarch64 - AArch64 (little endian) aarch64_32 - AArch64 (little endian ILP32) aarch64_be - AArch64 (big endian) amdgcn - AMD GCN GPUs arm - ARM arm64 - ARM64 (little endian) arm64_32 - ARM64 (little endian ILP32) armeb - ARM (big endian) bpf - BPF (host endian) bpfeb - BPF (big endian) bpfel - BPF (little endian) nvptx - NVIDIA PTX 32-bit nvptx64 - NVIDIA PTX 64-bit r600 - AMD GPUs HD2XXX-HD6XXX riscv32 - 32-bit RISC-V riscv64 - 64-bit RISC-V thumb - Thumb thumbeb - Thumb (big endian) wasm32 - WebAssembly 32-bit wasm64 - WebAssembly 64-bit x86 - 32-bit X86: Pentium-Pro and above x86-64 - 64-bit X86: EM64T and AMD64 |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
678–679 | llc crashes only for -mcpu=generic and -mcpu=generic-hsa btw. shader4.txt.gz on phabricator seems to be gzipped twice (I uploaded gzipped file and I didn't know phabricator will gzip it again) and has to be ungzipped twice. |
Why set VerifyAfter = false?
Also, a nit, I think insertPass(&MachineSchedulerID, &SIPreAllocateWWMRegsID, false) would be slightly easier to understand, once you know that insertPass(A,B) just appends B to the list of passes to be inserted after A.