Page MenuHomePhabricator

[AMDGPU] Move WQM Pass after MI Scheduler
AcceptedPublic

Authored by critson on Sep 22 2020, 3:04 AM.

Details

Summary

Exec mask manipulation inserted by SIWholeQuadMode barriers to
instruction scheduling. Move the entire pass after the machine
instruction scheduler and make changes so pass is correct for
non-SSA operation. These changes should leave the pass still
usable pre-scheduler, although tests have be updated to reflect
post-scheduler results.

Diff Detail

Unit TestsFailed

TimeTest
380 mslinux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp
Script: -- : 'RUN: at line 3'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang --driver-mode=g++ -m64 -gline-tables-only -fsanitize=hwaddress -fuse-ld=lld -mcmodel=large -mllvm -hwasan-globals -mllvm -hwasan-use-short-granules -mllvm -hwasan-instrument-landing-pads=0 -mllvm -hwasan-instrument-personality-functions /mnt/disks/ssd0/agent/llvm-project/compiler-rt/test/hwasan/TestCases/sizes.cpp -nostdlib++ -lstdc++ -o /mnt/disks/ssd0/agent/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Output/sizes.cpp.tmp
120 mslinux > Polly.ScopInfo/NonAffine::non-affine-loop-condition-dependent-access_3.ll
Script: -- : 'RUN: at line 1'; opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo/NonAffine -polly-codegen-verify -basic-aa -polly-scops -polly-allow-nonaffine-branches -polly-allow-nonaffine-loops=false -analyze < /mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo/NonAffine/non-affine-loop-condition-dependent-access_3.ll | FileCheck /mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo/NonAffine/non-affine-loop-condition-dependent-access_3.ll --check-prefix=INNERMOST

Event Timeline

critson created this revision.Sep 22 2020, 3:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2020, 3:04 AM
critson requested review of this revision.Sep 22 2020, 3:04 AM

This passes VulkanCTS as much as stock LLVM does for graphics.
I still need to do some porting work so I can test performance impact.

foad added inline comments.Sep 22 2020, 3:41 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
993

Why set VerifyAfter = false?

Also, a nit, I think insertPass(&MachineSchedulerID, &SIPreAllocateWWMRegsID, false) would be slightly easier to understand, once you know that insertPass(A,B) just appends B to the list of passes to be inserted after A.

critson updated this revision to Diff 296422.Tue, Oct 6, 5:31 AM
  • Address comments about pass insertion.
  • Fix bug in removal of trivial SGPR copies from WWM.
critson marked an inline comment as done.Tue, Oct 6, 5:35 AM
arsenm added inline comments.Tue, Oct 6, 7:31 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
362–371

Isn't this pass required to be post-SSA if it's after the scheduler?

critson updated this revision to Diff 296867.Wed, Oct 7, 10:30 PM
  • Fix assumptions about SCC live intervals which are not valid late in compilation.
critson added inline comments.Thu, Oct 8, 7:12 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
362–371

I am currently retaining the ability to run in both SSA and non-SSA modes.

nhaehnle added inline comments.Mon, Oct 12, 11:19 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I don't understand the logic here. Why the special treatment of operand 0?

349–363

I don't understand this logic. A use is a use -- why should implicitness or tiedness make a difference? This seems pretty wrong.

835–851

This looks suspicious. Can you please explain what is happening here?

979

This is incorrect, there could be other users of the value. Just keep the simpler case below.

critson added inline comments.Mon, Oct 12, 9:34 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Are you saying I should iterate MI->defs() instead?
The code here is intended to mark all instructions defining parts of the specified register.
If this is a partial register write then we need to follow the input values to mark the other instructions.

349–363

Agreed, this should go away.
It was an early hack before I wrote a working markDefs.

835–851

This is analogous to the code above that modifies the SI_ELSE to make it respect the EXEC mask -- it is a very special case match and fix up based on current code generation so I do not like it either.
If we make SI_ELSE always respect modifications to the EXEC mask then this can go away. Then perhaps we add a late peephole to clear up some of the unnecessary instructions when they are not required.
Do you have an opinion on this?

979

There should not be other users of the value, it is a kill?

I am not going to fight to keep this, but we would benefit from more late clean up of unnecessary copies.
I guess this ties into some of the things I am touching on in D89187, so the follow up to that might solve this.

nhaehnle added inline comments.Wed, Oct 14, 4:04 PM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Are you saying I should iterate MI->defs() instead?

Well, there's the question of whether you need to follow implicit defs as well. I just don't see why you're treating operand 0 differently from others.

835–851

We really shouldn't rely on such details of current code generation here.

I think you're on to something. Digging into this more... SI_ELSE is currently lowered as:

s_or_saveexec_bNN dst, src
...
s_xor_bNN_term exec, exec, dst

What if we instead lowered it as:

s_or_saveexec_bNN tmp, src
...
s_and_bNN_term dst, exec, tmp
s_xor_bNN_term exec, exec, dst

One of the OptimizeExecMasking passes can then just remove the s_and_bNN_term if there is no modification of exec in the middle.

I think I'd be happy about that solution. In practice, the scan backwards from s_and_bNN isn't that expensive and I believe it's required anyway. Thoughts?

979

I appreciate the desire to remove some unnecessary copies, but let's first figure out whether this one is correct.

Specifically, I thought LRQ.isKill() only means that the use of SrcReg in MI is the _last_ use. There could be other uses of the same definition of SrcReg that come earlier, right?

So maybe you could still eliminate the copy here if you updated those other uses as well. I would still ask you to keep things simpler here for this change and see if you can find a good place to eliminate this kind of copy separately in a dedicated pass. This code is quite difficult to follow even without this.

critson marked 5 inline comments as done.Fri, Oct 16, 5:05 AM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I don't think I am intentionally treating operand 0 different, this was just written to inspect the definition of the instruction and should be based on defs().

The point of this loop is to follow the chain of all definitions of a register (or parts of it), and mark each instruction involved. The idea is it stops when the entire register has been defined. (Or rather it needs to keep going if the definitions are only partial.) For that reason we should also look at implicit defs, as the first whole register implicit def should be a valid stopping point.

835–851

Yes, your solution is what I was trying to suggest.
Your instead case is what happens when we do MI.getOperand(3).setImm(1).
I will put it in as a separate Phabricator review shortly to simplify SI_ELSE lowering and optimise out the unnecessary s_and in OptimizeExecMasking.

979

OK, yep I see that there /could/ be other users, although in practice I had not encountered them.
I will work on cleaning this up in a later pass.

nhaehnle added inline comments.Mon, Oct 19, 9:57 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Well the code treats operand 0 specially, because there's literally a getOperand(0) in there, with a hard-coded 0. If the intention is to treat all defs in the same way (which I think it should be), then why not have a single homogenous loop over operands?

I do understand the point about partial defs, that makes sense and it's not what I'm worried about.

835–851

Sounds good!

979

What do you mean by cleaning this up in a later pass? The goal should be to keep the MachineIR an accurate representation of the program at all times.

critson updated this revision to Diff 299556.Tue, Oct 20, 11:08 PM
critson marked 2 inline comments as done.
  • Rebase
  • Fix markDefs to iterate all operands of MI
  • Remove fix up for SI_ELSE as this is no longer required
  • Remove elimination of trivial SGPR to SGPR WWM copies (this adds cruft in atomic optimizer tests)
critson marked 5 inline comments as done.Tue, Oct 20, 11:15 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

I have fixed this to iterate all operands looking for appropriate defs to follow.

979

MachineIR is accurate.
My point is because the pass now runs later there is nothing to optimise away trivial copies it introduces when lowering WWM operations.
See cruft this adds in atomic tests, e.g. atomic_optimizations_buffer.ll

nhaehnle accepted this revision.Thu, Oct 22, 8:41 AM

LGTM

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
315–332

Thanks!

979

Ah, I see. Maybe that cleanup could be done as a follow-up change.

This revision is now accepted and ready to land.Thu, Oct 22, 8:41 AM