Page MenuHomePhabricator

[ADMGPU] SDWA peephole optimization pass.

Authored by SamWot on Feb 16 2017, 4:10 AM.



First iteration of SDWA peephole.

This pass tries to combine several instruction into one SDWA instruction. E.g. it converts:

V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
V_ADD_I32_e32 %vreg2, %vreg0, %vreg3
V_LSHLREV_B32_e32 %vreg4, 16, %vreg2


V_ADD_I32_sdwa %vreg4, %vreg1, %vreg3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD

Added --amdgpu-sdwa-peephole command-line option that enables this pass. by default pass is disabled.

Pass structure:

  1. Iterate over machine instruction in basic block and try to apply "SDWA patterns" to each of them. SDWA patterns match machine instruction into either source or destination SDWA operand. E.g. V_LSHRREV_B32_e32 %vreg0, 16, %vreg1 is matched to source SDWA operand %vreg1 src_sel:WORD_1.
  2. Iterate over found SDWA operands and find instruction that could be potentially coverted into SDWA. E.g. for source SDWA operand potential instruction are all instruction in this basic block that uses %vreg0
  3. Iterate over all potential instructions and check if they can be converted into SDWA.
  4. Convert instructions to SDWA.

This review contains basic implementation of SDWA peephole pass. This pass requires additional testing fot both correctness and performance (no performance testing done).
There are several ways this pass can be improved:

  1. Make this pass work on whole function not only basic block. As I can see this can be done right now without changes to pass.
  2. Introduce more SDWA patterns
  3. Introduce mnemonics to limit when SDWA patterns should apply

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
SamWot edited the summary of this revision. (Show Details)Feb 16 2017, 4:13 AM
SamWot edited the summary of this revision. (Show Details)
SamWot added a subscriber: Restricted Project.
SamWot edited the summary of this revision. (Show Details)Feb 16 2017, 4:15 AM
arsenm added inline comments.Feb 16 2017, 10:26 AM

This seems late to me, and also introduces a 3rd run of DeadMachineInstructionElim. I would expect to run this right after SIFoldOperands. You have code dealing with VOPC/vcc required uses, but that is of limited help right after the pre-RA run. I consider it a bug that we see any of those at all at this point.


Should be included last.


Should not need shared_ptr, unique_ptr is enough. The struct is also small enough to store values of


usual style is static functions without anon namespace


Single quotes




This function shouldn't be necessary. I think all you want to do is check the blocks have the same parent? You shouldn't be seeing situations where you're seeing instructions from multiple functions


I think this should be removed. There should be no cases where MRI is unavailable, so the null checks here and on its callers are unnecessary. It would be better to just pass around the one MRI reference


I think this is too aggressive. I would expect to restrict the number of uses to fold. In a typical case if the shift has more than one use, you are increasing code size. I would also expect a legality check here, because if there is an unfoldable user, there's no point in folding any of them.


I think you're going through too much trouble to loop over explicit_uses and then figuring out the operand index. In similar places we just directly query src0/src1


You could create a new virtual register and replace all uses with. If you are going through the trouble of deleting the leftover instructions yourself there's no need to run dead mi elimination again


This reasoning seems backwards to me. If you run before SIShrinkInstructions you shouldn't need to worry about the e32 case (which is part of why we try to keep things in VOP3 form until necessary)


You don't need to check this. The instruction is broken and will fail the verifier


For a next round patch I would expect and x, bitmask to select the low half




I would expect to consolidate this with convertToSDWA and return false if it can't be converted


I think these asserts are overly aggressive


This should be a const reference. shared_ptr is pretty expensive


Single dbg statement, and also should be wrapped in DEBUG()


This check is unnecessary?


This doesn't seem like a problem to me. I would expect combinable patterns to be sunk down earlier


A better name for this might be collectSDWABitExtracts? I also think you should try folding the sources into the uses in the same loop. Once you see one of these defs, it's use is going to be later, so you don't need to collect every possible use in the block ahead of time.

You could also look at the sources for SDWA convertible instructions instead of looking at this from the other direction.


You can replace func with GCN. This will never be shared with r600


The naming convention in other tests is to include the operation in the name, in this case mul, and then suffix with the llvm type, _v2i16


This needs a lot more tests. I would like to see a range of tests using half as well. These tests are not also stressing some of the illegal cases you check for, like SGPR operands. There should also be multiple uses tests, with some foldable and unfoldable uses.

There's also a lot of code for checking parent blocks, but these tests all only have one block.

rampitec added inline comments.Feb 16 2017, 10:29 AM

Are there plans to enable it?


It should come after llvm's includes.


Looks like you only need FirstBB == SecondBB.


Isn't MRI just always available? It is just getParentInst()->getParent()->getRegInfo(). No checks really needed. Most likely this function is not needed too.


Check is not needed.

rampitec added inline comments.Feb 16 2017, 1:02 PM

I'm not sure it is always a vreg. It is better got giveup on physreg. The same for multiply defined vregs, you are running the pass late and can have multiple defs. Just bail out.


Should be wrapped in DEBUG().


Not needed.


Space before colon. Here and in other places.

SamWot added inline comments.Feb 17 2017, 3:17 AM

SDWA peephole extensively use results of SIShrinkInstructions pass. I considered placing peephole after SiFoldOperands but at that point we still have all instructions in 64-bit encoding - but they can't be encoded as SDWA. So SDWA pass would in many ways repeat work done by SiShrinkInstruction. Possibly we can move both shrink instructions and SDWA peephole earlier.


I can't store by value because it would lead to type slicing.


Yes, this is overkill but if I check toher thing here then why not to check that functions are also same, just for confidence?


This should work but I don't want to campare pointers. If internal structure of MachineFunction changes later then this would not work. Also basic block number is exactly what we need to check if this is same basic block - then why not to use it?
What I should really fix is that FirstMI->getParent()->getNumber(); should be FirstBB->getNumber().


I agree that we should limit number of uses. This is what I wanted to say with "Introduce mnemonics to limit when SDWA patterns should apply".
One thing that I don't know is if one instruction uses register twice would it appear in use_instructions() twice ore single time.


We don't yet support SDWA on gfx9.


This method collects not only BitExtracts but also BitInserts.

Splitting processes of finding SDWA patterns and folding sources allows to generalize finding patterns not only for source operands but also for destination operands. With that we can fold whole bunch of operands into instruction instead of folding instructions. Also I think this split makes pass much more clear and maintainable.

rampitec added inline comments.Feb 17 2017, 10:54 AM

You do not check the module is the same? This is function pass, you do not get blocks from different functions.


All llvm works this way. BTW, numbers can change as well. You are taking parent basic block pointers from both instructions right here. Do you expect blocks to change in between of these two getParent() calls?


It then needs TODO comment here.

arsenm added inline comments.Feb 17 2017, 12:07 PM

If you iterate uses you will see the same instruction for repeated uses. use_instructions skips to unique instructions

SamWot updated this revision to Diff 90672.Mar 6 2017, 3:50 AM
SamWot marked 29 inline comments as done.

Added support for subregs.
Fixed most of previous comments.

rampitec added inline comments.Mar 6 2017, 11:18 AM

You still do not need logic more than FirstMI->getParent() == SecondMI->getParent()


TRI->getMatchingSuperReg() or TRI->getMatchingSuperRegClass() maybe?


No need to convert lanemask to integer and back.

SamWot updated this revision to Diff 91150.Mar 9 2017, 3:00 AM

Fixed isSameBB

SamWot added inline comments.Mar 9 2017, 3:03 AM

This functions work with physical registers if I understand correctly, and I need to work with virtual regs.


I only convert from LaneBitmask to integer. This is done because getSubRegIndexLaneMask takes integer as input.

SamWot updated this revision to Diff 91533.Mar 13 2017, 3:48 AM

Introduced new sdwa patterns

rampitec edited edge metadata.Mar 13 2017, 10:40 AM

Introduced new sdwa patterns

I would like to see this patch fixed, submitted, and optimization enabled first, before new patterns are introduced.

I would like to see this patch fixed, submitted, and optimization enabled first, before new patterns are introduced.

I alredy either fixed or answered on comments to previous version. Currently I'm waiting for new comments or acceptance thats why I started working on new functionality.

SamWot updated this revision to Diff 91991.Mar 16 2017, 6:25 AM
SamWot marked 18 inline comments as done.

Cleanup for remaining comments.

SamWot added inline comments.Mar 16 2017, 6:27 AM

I want first to submit this patch so that I would be able to fully test it and ensure that it doesn't break anything.


I have to delete instructions in that case to ensure validity of machine code. Simply intoducing new virtual registers doesn't solve this problem.
Dead code elimination is important for removing src patterns. I don't want to remove by hand because later I want to introduce more complex patterns consisting of several instructions which removal by hand might become to hard.


Same as for enabling pass by default: I want to submit this patch first, then extend it.

rampitec added inline comments.Mar 16 2017, 10:38 AM

I still do not see a check for single use. Am I missing it somewhere?

SamWot updated this revision to Diff 92032.Mar 16 2017, 11:29 AM

Add check for single use of src sdwa operand.

rampitec added inline comments.Mar 16 2017, 4:24 PM

You only need to do:

if (PotentialMI != PotentialMO.getParent()) {
  return nullptr;

If != nullptr and continue are not needed.


Looks like parent will be erased anyway. I assume convertToSDWA should return status.


Space after comma.


You are working in pre-regalloc... I think you need to update LIS when you create or delete instructions.


There can be only one SDWA operand, right? It seems this loop may try to create more than one...

rampitec added inline comments.Mar 16 2017, 4:40 PM

What will happen if instruction already comes to the pass with either SDWA or DPP in it? There can be only one such operand per instruction.

SamWot updated this revision to Diff 92311.Mar 20 2017, 3:54 AM
SamWot marked 5 inline comments as done.

Changes for latest comments

SamWot marked an inline comment as done.Mar 20 2017, 3:56 AM
SamWot added inline comments.

Rewrote potentialToConvert function so that they should be more clear.


Currently instruction with SDWA or DPP would not be converted. They would not pass "isConvertibleToSDWA" check.
Later it would be nice to extend this pass so that it would accept SDWA instructions.


First LiveIntervalAnalysis is runnig after my pass. Liveness informatioin is inavailable for my pass.


No, there might be several SDWA operands for one instructions. One for each instruction operands.

This revision is now accepted and ready to land.Mar 20 2017, 9:54 AM
This revision was automatically updated to reflect the committed changes.