Page MenuHomePhabricator

[ADMGPU] SDWA peephole optimization pass.
ClosedPublic

Authored by SamWot on Feb 16 2017, 4:10 AM.

Details

Summary

First iteration of SDWA peephole.

This pass tries to combine several instruction into one SDWA instruction. E.g. it converts:

V_LSHRREV_B32_e32 %vreg0, 16, %vreg1
V_ADD_I32_e32 %vreg2, %vreg0, %vreg3
V_LSHLREV_B32_e32 %vreg4, 16, %vreg2

Into:

V_ADD_I32_sdwa %vreg4, %vreg1, %vreg3 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD

Added --amdgpu-sdwa-peephole command-line option that enables this pass. by default pass is disabled.

Pass structure:

  1. Iterate over machine instruction in basic block and try to apply "SDWA patterns" to each of them. SDWA patterns match machine instruction into either source or destination SDWA operand. E.g. V_LSHRREV_B32_e32 %vreg0, 16, %vreg1 is matched to source SDWA operand %vreg1 src_sel:WORD_1.
  2. Iterate over found SDWA operands and find instruction that could be potentially coverted into SDWA. E.g. for source SDWA operand potential instruction are all instruction in this basic block that uses %vreg0
  3. Iterate over all potential instructions and check if they can be converted into SDWA.
  4. Convert instructions to SDWA.

This review contains basic implementation of SDWA peephole pass. This pass requires additional testing fot both correctness and performance (no performance testing done).
There are several ways this pass can be improved:

  1. Make this pass work on whole function not only basic block. As I can see this can be done right now without changes to pass.
  2. Introduce more SDWA patterns
  3. Introduce mnemonics to limit when SDWA patterns should apply

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
SamWot edited the summary of this revision. (Show Details)Feb 16 2017, 4:13 AM
SamWot edited the summary of this revision. (Show Details)
SamWot added a subscriber: Restricted Project.
SamWot edited the summary of this revision. (Show Details)Feb 16 2017, 4:15 AM
arsenm added inline comments.Feb 16 2017, 10:26 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
673–676

This seems late to me, and also introduces a 3rd run of DeadMachineInstructionElim. I would expect to run this right after SIFoldOperands. You have code dealing with VOPC/vcc required uses, but that is of limited help right after the pre-RA run. I consider it a bug that we see any of those at all at this point.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
24

Should be included last.

60

Should not need shared_ptr, unique_ptr is enough. The struct is also small enough to store values of

138

usual style is static functions without anon namespace

164

Single quotes

171

ditto

175

This function shouldn't be necessary. I think all you want to do is check the blocks have the same parent? You shouldn't be seeing situations where you're seeing instructions from multiple functions

206

I think this should be removed. There should be no cases where MRI is unavailable, so the null checks here and on its callers are unnecessary. It would be better to just pass around the one MRI reference

222

I think this is too aggressive. I would expect to restrict the number of uses to fold. In a typical case if the shift has more than one use, you are increasing code size. I would also expect a legality check here, because if there is an unfoldable user, there's no point in folding any of them.

236

I think you're going through too much trouble to loop over explicit_uses and then figuring out the operand index. In similar places we just directly query src0/src1

306–308

You could create a new virtual register and replace all uses with. If you are going through the trouble of deleting the leftover instructions yourself there's no need to run dead mi elimination again

313–315

This reasoning seems backwards to me. If you run before SIShrinkInstructions you shouldn't need to worry about the e32 case (which is part of why we try to keep things in VOP3 form until necessary)

331–334

You don't need to check this. The instruction is broken and will fail the verifier

343

For a next round patch I would expect and x, bitmask to select the low half

346–348

Ditto

370

I would expect to consolidate this with convertToSDWA and return false if it can't be converted

408–411

I think these asserts are overly aggressive

452

This should be a const reference. shared_ptr is pretty expensive

456–457

Single dbg statement, and also should be wrapped in DEBUG()

467

This check is unnecessary?

477

This doesn't seem like a problem to me. I would expect combinable patterns to be sunk down earlier

480

A better name for this might be collectSDWABitExtracts? I also think you should try folding the sources into the uses in the same loop. Once you see one of these defs, it's use is going to be later, so you don't need to collect every possible use in the block ahead of time.

You could also look at the sources for SDWA convertible instructions instead of looking at this from the other direction.

test/CodeGen/AMDGPU/sdwa-peephole.ll
2

You can replace func with GCN. This will never be shared with r600

60

The naming convention in other tests is to include the operation in the name, in this case mul, and then suffix with the llvm type, _v2i16

67

This needs a lot more tests. I would like to see a range of tests using half as well. These tests are not also stressing some of the illegal cases you check for, like SGPR operands. There should also be multiple uses tests, with some foldable and unfoldable uses.

There's also a lot of code for checking parent blocks, but these tests all only have one block.

rampitec added inline comments.Feb 16 2017, 10:29 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
99

Are there plans to enable it?

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
24

It should come after llvm's includes.

180

Looks like you only need FirstBB == SecondBB.

206

Isn't MRI just always available? It is just getParentInst()->getParent()->getRegInfo(). No checks really needed. Most likely this function is not needed too.

220

Check is not needed.

rampitec added inline comments.Feb 16 2017, 1:02 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
277

I'm not sure it is always a vreg. It is better got giveup on physreg. The same for multiply defined vregs, you are running the pass late and can have multiple defs. Just bail out.

336

Should be wrapped in DEBUG().

365

Not needed.

467
>= VOLCANIC_ISLANDS?
478

Space before colon. Here and in other places.

SamWot added inline comments.Feb 17 2017, 3:17 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
673–676

SDWA peephole extensively use results of SIShrinkInstructions pass. I considered placing peephole after SiFoldOperands but at that point we still have all instructions in 64-bit encoding - but they can't be encoded as SDWA. So SDWA pass would in many ways repeat work done by SiShrinkInstruction. Possibly we can move both shrink instructions and SDWA peephole earlier.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
60

I can't store by value because it would lead to type slicing.

175

Yes, this is overkill but if I check toher thing here then why not to check that functions are also same, just for confidence?

180

This should work but I don't want to campare pointers. If internal structure of MachineFunction changes later then this would not work. Also basic block number is exactly what we need to check if this is same basic block - then why not to use it?
What I should really fix is that FirstMI->getParent()->getNumber(); should be FirstBB->getNumber().

222

I agree that we should limit number of uses. This is what I wanted to say with "Introduce mnemonics to limit when SDWA patterns should apply".
One thing that I don't know is if one instruction uses register twice would it appear in use_instructions() twice ore single time.

467

We don't yet support SDWA on gfx9.

480

This method collects not only BitExtracts but also BitInserts.

Splitting processes of finding SDWA patterns and folding sources allows to generalize finding patterns not only for source operands but also for destination operands. With that we can fold whole bunch of operands into instruction instead of folding instructions. Also I think this split makes pass much more clear and maintainable.

rampitec added inline comments.Feb 17 2017, 10:54 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
175

You do not check the module is the same? This is function pass, you do not get blocks from different functions.

180

All llvm works this way. BTW, numbers can change as well. You are taking parent basic block pointers from both instructions right here. Do you expect blocks to change in between of these two getParent() calls?

467

It then needs TODO comment here.

arsenm added inline comments.Feb 17 2017, 12:07 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
222

If you iterate uses you will see the same instruction for repeated uses. use_instructions skips to unique instructions

SamWot updated this revision to Diff 90672.Mar 6 2017, 3:50 AM
SamWot marked 29 inline comments as done.

Added support for subregs.
Fixed most of previous comments.

rampitec added inline comments.Mar 6 2017, 11:18 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
197

You still do not need logic more than FirstMI->getParent() == SecondMI->getParent()

219

TRI->getMatchingSuperReg() or TRI->getMatchingSuperRegClass() maybe?

232

No need to convert lanemask to integer and back.

SamWot updated this revision to Diff 91150.Mar 9 2017, 3:00 AM

Fixed isSameBB

SamWot added inline comments.Mar 9 2017, 3:03 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
219

This functions work with physical registers if I understand correctly, and I need to work with virtual regs.

232

I only convert from LaneBitmask to integer. This is done because getSubRegIndexLaneMask takes integer as input.

SamWot updated this revision to Diff 91533.Mar 13 2017, 3:48 AM

Introduced new sdwa patterns

rampitec edited edge metadata.Mar 13 2017, 10:40 AM

Introduced new sdwa patterns

I would like to see this patch fixed, submitted, and optimization enabled first, before new patterns are introduced.

I would like to see this patch fixed, submitted, and optimization enabled first, before new patterns are introduced.

I alredy either fixed or answered on comments to previous version. Currently I'm waiting for new comments or acceptance thats why I started working on new functionality.

SamWot updated this revision to Diff 91991.Mar 16 2017, 6:25 AM
SamWot marked 18 inline comments as done.

Cleanup for remaining comments.

SamWot added inline comments.Mar 16 2017, 6:27 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
99

I want first to submit this patch so that I would be able to fully test it and ensure that it doesn't break anything.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
306–308

I have to delete instructions in that case to ensure validity of machine code. Simply intoducing new virtual registers doesn't solve this problem.
Dead code elimination is important for removing src patterns. I don't want to remove by hand because later I want to introduce more complex patterns consisting of several instructions which removal by hand might become to hard.

477

Same as for enabling pass by default: I want to submit this patch first, then extend it.

rampitec added inline comments.Mar 16 2017, 10:38 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
222

I still do not see a check for single use. Am I missing it somewhere?

SamWot updated this revision to Diff 92032.Mar 16 2017, 11:29 AM

Add check for single use of src sdwa operand.

rampitec added inline comments.Mar 16 2017, 4:24 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
277

You only need to do:

if (PotentialMI != PotentialMO.getParent()) {
  return nullptr;

If != nullptr and continue are not needed.

312

Looks like parent will be erased anyway. I assume convertToSDWA should return status.

580

Space after comma.

580

You are working in pre-regalloc... I think you need to update LIS when you create or delete instructions.

647

There can be only one SDWA operand, right? It seems this loop may try to create more than one...

rampitec added inline comments.Mar 16 2017, 4:40 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
385

What will happen if instruction already comes to the pass with either SDWA or DPP in it? There can be only one such operand per instruction.

SamWot updated this revision to Diff 92311.Mar 20 2017, 3:54 AM
SamWot marked 5 inline comments as done.

Changes for latest comments

SamWot marked an inline comment as done.Mar 20 2017, 3:56 AM
SamWot added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
277

Rewrote potentialToConvert function so that they should be more clear.

385

Currently instruction with SDWA or DPP would not be converted. They would not pass "isConvertibleToSDWA" check.
Later it would be nice to extend this pass so that it would accept SDWA instructions.

580

First LiveIntervalAnalysis is runnig after my pass. Liveness informatioin is inavailable for my pass.

647

No, there might be several SDWA operands for one instructions. One for each instruction operands.

This revision is now accepted and ready to land.Mar 20 2017, 9:54 AM
This revision was automatically updated to reflect the committed changes.