This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Combine s_or_saveexec, s_xor instructions.
Closed · Public

Authored by tsymalla on Jul 4 2022, 4:16 AM.

Details

Summary

This patch merges a consecutive sequence of

s_or_saveexec s_o, s_i
s_xor exec, exec, s_o

into a single

s_andn2_saveexec s_o, s_i instruction.
This patch also cleans up the SIOptimizeExecMasking pass a bit.
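The two forms are equivalent because (EXEC | s_i) ^ EXEC == s_i & ~EXEC, which is exactly what s_andn2_saveexec computes while also saving the old EXEC in s_o. A minimal standalone C++ sketch that checks this identity exhaustively (not part of the patch; 6-bit masks are used for brevity, while the real EXEC is 32 or 64 bits wide):

#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t Exec = 0; Exec < 64; ++Exec) {
    for (uint32_t Si = 0; Si < 64; ++Si) {
      // s_or_saveexec s_o, s_i ; s_xor exec, exec, s_o
      uint32_t So = Exec;                   // saveexec copies the old EXEC
      uint32_t NewExec1 = (Exec | Si) ^ So; // or, then xor with the saved mask
      // s_andn2_saveexec s_o, s_i
      uint32_t So2 = Exec;                  // saveexec copies the old EXEC
      uint32_t NewExec2 = Si & ~Exec;       // EXEC = s_i & ~(old EXEC)
      assert(NewExec1 == NewExec2 && So == So2);
    }
  }
  return 0;
}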

Diff Detail

Event Timeline

tsymalla created this revision.Jul 4 2022, 4:16 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 4 2022, 4:16 AM
tsymalla requested review of this revision.Jul 4 2022, 4:16 AM
foad added a comment.Jul 4 2022, 4:34 AM

This patch also cleans up the SIOptimizeExecMasking pass a bit.

Please put that in a separate NFC patch. It really does make life easier for reviewers.

This patch also cleans up the SIOptimizeExecMasking pass a bit.

Please put that in a separate NFC patch. It really does make life easier for reviewers.

Will do that.

Cleanup submitted as a separate NFC change in https://reviews.llvm.org/D129086; I will update this patch after the NFC has landed.

tsymalla updated this revision to Diff 442493.Jul 6 2022, 3:15 AM

Rebased on top of the latest refactoring in SIOptimizeExecMasking

Could you please have a look at this one again?

This is a good start. However, I have some high-level questions:

  1. Scanning over entire basic blocks is bad for compile times, and this pass is already doing some scans. For example, optimizeExecSequence already scans for copies to exec. Can this be improved? Notice how the scan in optimizeExecSequence goes backwards in the basic block and limits itself to only a small number of instructions. I could imagine a restructuring of the pass so that every basic block is scanned backwards for an EXEC-writing instruction. Depending on what the instruction is (copy-to-exec, s_and_saveexec, s_xor), one of the optimizations can be applied.
  2. Why is this change done in SIOptimizeExecMasking instead of SIOptimizeExecMaskingPreRA? Actually, I don't remember why we have the two passes in the first place. Perhaps @rampitec remembers?
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
750

MI == MBB.end() is always false here.

tsymalla edited reviewers, added: sebastian-ne; removed: critson.Jul 7 2022, 7:17 AM
arsenm added a comment.Jul 7 2022, 7:21 AM
  2. Why is this change done in SIOptimizeExecMasking instead of SIOptimizeExecMaskingPreRA? Actually, I don't remember why we have the two passes in the first place. Perhaps @rampitec remembers?

The primary reason is that we can't produce the terminators with output registers before register allocation, in case we need to insert spills of the saved exec value that are live out of the block. RA needs to insert spills before terminators.

sebastian-ne added inline comments.Jul 7 2022, 7:22 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
733–734

I guess this also works if the input register is not the same as the output register?

s_or_saveexec s_o, s_i
s_xor exec, exec, s_o
llvm/test/CodeGen/AMDGPU/s_or_saveexec_xor_combine.mir
99

Can you add a test where S_OR_SAVEEXEC is the last instruction?

This is a good start. However, I have some high-level questions:

  1. Scanning over entire basic blocks is bad for compile times, and this pass is already doing some scans. For example, optimizeExecSequence already scans for copies to exec. Can this be improved? Notice how the scan in optimizeExecSequence goes backwards in the basic block and limits itself to only a small number of instructions. I could imagine a restructuring of the pass so that every basic block is scanned backwards for an EXEC-writing instruction. Depending on what the instruction is (copy-to-exec, s_and_saveexec, s_xor), one of the optimizations can be applied.

Good idea. I will try to reuse the existing scan to inject the optimizations, if possible.

  2. Why is this change done in SIOptimizeExecMasking instead of SIOptimizeExecMaskingPreRA? Actually, I don't remember why we have the two passes in the first place. Perhaps @rampitec remembers?

I don't know exactly what the reasons are for having two passes. However, since SIOptimizeExecMasking swaps S_*BINOP*_{B32, B64} instructions with their SAVEEXEC counterparts, it made sense to me to introduce the change here. This is the last point at which such instructions are inserted by a pass, so the pattern I am trying to match is likely to appear only after SIOptimizeExecMasking has run. But please correct me if I'm wrong.

This is a good start. However, I have some high-level questions:

  1. Scanning over entire basic blocks is bad for compile times, and this pass is already doing some scans. For example, optimizeExecSequence already scans for copies to exec. Can this be improved? Notice how the scan in optimizeExecSequence goes backwards in the basic block and limits itself to only a small number of instructions. I could imagine a restructuring of the pass so that every basic block is scanned backwards for an EXEC-writing instruction. Depending on what the instruction is (copy-to-exec, s_and_saveexec, s_xor), one of the optimizations can be applied.
  2. Why is this change done in SIOptimizeExecMasking instead of SIOptimizeExecMaskingPreRA? Actually, I don't remember why we have the two passes in the first place. Perhaps @rampitec remembers?

We only had the post-RA pass initially. AFAIR I added the pre-RA pass to actually save registers (D35967).

  1. Scanning over entire basic blocks is bad for compile times, and this pass is already doing some scans. For example, optimizeExecSequence already scans for copies to exec. Can this be improved? Notice how the scan in optimizeExecSequence goes backwards in the basic block and limits itself to only a small number of instructions. I could imagine a restructuring of the pass so that every basic block is scanned backwards for an EXEC-writing instruction. Depending on what the instruction is (copy-to-exec, s_and_saveexec, s_xor), one of the optimizations can be applied.

It can probably start from the first terminator, scan backwards, and bail out at the first instruction that does not modify exec.
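A rough sketch of that shape against LLVM's MachineInstr API (the dispatch comment and surrounding details are illustrative assumptions, not the code that eventually landed):

// Walk backwards from the end of the block; every instruction this pass
// cares about writes EXEC, so stop at the first one that doesn't.
for (MachineInstr &MI : llvm::reverse(MBB)) {
  if (!MI.modifiesRegister(AMDGPU::EXEC, TRI))
    break;
  // Dispatch on what was found: a copy to exec, an s_and_saveexec fed by
  // a v_cmp, or the s_or_saveexec + s_xor pair handled by this patch.
}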

  2. Why is this change done in SIOptimizeExecMasking instead of SIOptimizeExecMaskingPreRA? Actually, I don't remember why we have the two passes in the first place. Perhaps @rampitec remembers?

The primary reason is that we can't produce the terminators with output registers before register allocation, in case we need to insert spills of the saved exec value that are live out of the block. RA needs to insert spills before terminators.

That makes sense, thanks.

tsymalla marked 2 inline comments as done.Jul 8 2022, 2:28 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
733–734

No, I don't think so.
If EXEC = 0b111000, s0 = 0b001001, s1 = 0b010010, then after
s_or_saveexec s0, s1
s_xor exec, exec, s0

EXEC = 0b010010

while for s_andn2_saveexec s1, s0, EXEC = 0b100100, and for s_andn2_saveexec s0, s1, EXEC = 0b000001.
The same goes if you change the order of operands, so DST of s_or_saveexec must be equal to SRC0 of s_or_saveexec and thus needs to be DST and SRC0 of s_andn2_saveexec. Please correct me if I'm wrong.

sebastian-ne added inline comments.Jul 8 2022, 2:40 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
733–734

Not quite sure if I’m following, but isn’t it

=> EXEC = 0b111000, s0 = 0b001001, s1 = 0b010010
s_or_saveexec s0, s1
=> EXEC = 0b111010, s0 = 0b111000, s1 = 0b010010
s_xor exec, exec, s0
=> EXEC = 0b000010, s0 = 0b111000, s1 = 0b010010

and

=> EXEC = 0b111000, s0 = 0b001001, s1 = 0b010010
s_andn2_saveexec s0, s1
=> EXEC = 0b000010, s0 = 0b111000, s1 = 0b010010
(EXEC = ~EXEC & s1 = 0b000111 & 0b010010 = 0b000010)

?

tsymalla added inline comments.Jul 8 2022, 2:49 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
733–734

Yes, you are correct. I probably messed up somewhere.
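For the record, the corrected numbers above are easy to re-check with a tiny standalone program (a hypothetical checker, not something from the review):

#include <cassert>
#include <cstdint>

int main() {
  uint32_t Exec = 0b111000, S1 = 0b010010;
  uint32_t S0 = Exec;                     // saveexec copies the old EXEC
  assert(((Exec | S1) ^ S0) == 0b000010); // s_or_saveexec + s_xor
  assert((S1 & ~Exec) == 0b000010);       // s_andn2_saveexec
  return 0;
}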

tsymalla updated this revision to Diff 443929.Jul 12 2022, 5:53 AM

Merge two scans into one to reduce compile time.
Combine even if src and dst of s_or_saveexec are not equal.

Seems fine to me if you update the commit message and the comment.
I’ll leave approval to someone else.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
733–734

Can you update the comment please?

tsymalla edited the summary of this revision. (Show Details)Jul 12 2022, 6:35 AM
tsymalla marked an inline comment as done.
tsymalla edited the summary of this revision. (Show Details)Jul 12 2022, 6:43 AM
tsymalla updated this revision to Diff 443935.Jul 12 2022, 7:05 AM

Update inline comment.

arsenm added inline comments.Jul 12 2022, 10:48 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
830–831

Why do you need to collect every instance in the function before processing them? Each of these can be handled standalone?

tsymalla added inline comments.Jul 13 2022, 12:44 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
830–831

I find that more convenient and easier to follow, especially if the optimization handles a bunch of cases. Instead of mixing the find-pattern and do-combine steps, I prefer to first exclude all irrelevant matches and then transform the findings one by one.
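As an illustration, that two-phase structure looks roughly like this (findOrXorPair and combineOrXorPair are hypothetical stand-ins, not the patch's actual helpers):

// Phase 1: collect all candidate pairs, rejecting irrelevant matches.
SmallVector<std::pair<MachineInstr *, MachineInstr *>, 4> Candidates;
for (MachineBasicBlock &MBB : MF)
  if (auto Pair = findOrXorPair(MBB))     // hypothetical matcher
    Candidates.push_back(*Pair);
// Phase 2: transform the surviving candidates one by one.
for (auto &[OrSaveExec, Xor] : Candidates)
  combineOrXorPair(OrSaveExec, Xor);      // hypothetical rewriter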

Could anyone have a look at this again?

This looks pretty good already, but I do have a bunch of code quality comments inline.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
466

Why the empty line?

706–710

Confusing comment. Just remove it.

711–713

Merge this function and the callee.

715

Inline this function into its caller.

722–723

Move those into optimizeVCmpSaveExecSequence (also maybe rename the method while you're at it? The singular "sequence" already implies that it's a single sequence...)

751–754

This is sort of backwards. When matching code for this kind of combine, the general pattern is to start from the definition of the final value you want to rewrite. In this case, that would be the S_XOR. You can see the approach taken in findPossibleVCMPVCMPXOptimization, using findInstrBackwards.
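The general shape of such a use-to-def match, sketched generically (this is not the actual findInstrBackwards signature from the pass):

// Starting from the instruction that defines the final value (here, the
// S_XOR writing EXEC), walk backwards to the instruction defining Reg.
MachineInstr *findDefBackwards(MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator Start,
                               Register Reg, const TargetRegisterInfo *TRI) {
  for (MachineBasicBlock::iterator I = Start; I != MBB.begin();) {
    --I;
    if (I->definesRegister(Reg, TRI))
      return &*I;
  }
  return nullptr;
}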

765

Remove OrXorPair and use emplace_back here

781–782

Move into the inner scope.

818–820

This isn't true. The v_cmp may potentially be far away, but the s_and_saveexec must be near the end of the block, so the entire loop should exit after the SearchWindow.

Also, and I know this isn't due to your change, but debug instructions should not be counted towards the search window. The goal is to avoid codegen changes caused by debug.

Also also, shouldn't we be able to exit this loop once we've found the first write to EXEC?

tsymalla marked 6 inline comments as done.Jul 20 2022, 1:46 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
751–754

But wouldn't this cause more work? I think the compiler is more likely to first find an S_XOR instruction, which it could use as a starting point for looking for the s_or_saveexec. The issue is that we would basically have to stop at a lot of s_xor instructions and check whether the preceding instruction is an eligible s_or_saveexec. The number of checks would be smaller if we could first stop at the s_or_saveexec instruction and _then_ look at the s_xor instruction.

See following (artificial) example:

BB0:
s_or_saveexec_b32 s0, s1
s_xor_b32 exec_lo, exec_lo, s0
s_xor_b32 exec_lo, exec_lo, s1
s_xor_b32 exec_lo, exec_lo, s2

Checking s_xor first means three checks for the three s_xor instructions, and two of them will fail. In comparison, with the current approach we'd only check once, at the cost of incrementing the iterator.

818–820

I don't get the first part of your comment. If we exit after SearchWindow, we're likely to ignore some v_cmp instructions.

For the debug instructions, you're correct. I missed that.

For the EXEC part, I'll double-check.

nhaehnle added inline comments.Jul 20 2022, 6:22 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
751–754

Can such sequences actually happen, though? Before register allocation, we generally follow the rule that EXEC can only be written by special terminator instructions, and there is really no reason to have more than one of those. So I don't see where this sequence would come from.

See also the note about being able to stop the search after seeing the first instruction (from the end) that writes EXEC.

818–820

My point in the first part was that for v_cmp + s_and_saveexec -> s_mov + v_cmpx, we scan the search window for the s_and_saveexec. And then *from there* we look backwards for the v_cmp. This scan (via findInstrBackwards) isn't limited by SearchWindow.

tsymalla added inline comments.Jul 20 2022, 6:33 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
751–754

Not exactly. My point is that we will see single occurrences of s_xor_b32 exec_lo, exec_lo, s* as terminator instructions and will stop at every one of them, even though in a lot of cases the combine won't be applied. But I can change the order of checks anyhow.

tsymalla updated this revision to Diff 446409.Jul 21 2022, 3:05 AM

Addressed @nhaehnle's comments.

This revision is now accepted and ready to land.Jul 21 2022, 3:32 AM
This revision was landed with ongoing or failed builds.Jul 21 2022, 5:16 AM
This revision was automatically updated to reflect the committed changes.