This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Rearrange mul(dup(sext/zext)) to mul(sext/zext(dup))
ClosedPublic

Authored by NickGuy on Nov 11 2020, 6:22 AM.

Details

Summary

Performing this rearrangement allows for existing patterns
to match against cases where operands are hoisted up to a loop
header (by LICM), allowing for optimisations such as umull/smull
codegen, while still allowing loop-invariant values to be hoisted
out of the loop afer other optimisations.

Diff Detail

Event Timeline

NickGuy created this revision.Nov 11 2020, 6:22 AM
NickGuy requested review of this revision.Nov 11 2020, 6:22 AM
NickGuy added inline comments.Nov 11 2020, 6:27 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10900–10902

In a number of tests, this line is hit more than a few times. Only 1 of which it fails on. Using .dump() to try and identify why didn't help, as it appears to be the same style as the others that pass (e.g. %tmp3 = sext <8 x i8> %arg to <8 x i16>)
If anyone can provide insight to this, I would greatly appreciate it :)

15491–15500

I'm unsure as to when PerformDAGCombine is invoked. If this function generates a new DUP node, would this function then be invoked with that node? Or does this function need a bit more scaffolding to support this case?

SjoerdMeijer added inline comments.Nov 11 2020, 6:59 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10900–10902

what do you mean by fail? I guess that's a segfault?
I guess you need to make sure it is an instruction first with dyn_cast, then you check its operands and uses.

NickGuy added inline comments.Nov 11 2020, 7:05 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10900–10902

Sorry, I should've been a bit clearer.
It fails with an assertion error that stems from an invalid cast in CodeGenPrepare::tryToSinkFreeOperands.

Shuffle->getOperandUse(0).dump() outputs something that looks like an instruction, but reportedly isn't, and only in a single case. It's simple to work around, but I was curious as to if anyone had an idea as to what was happening.

SjoerdMeijer added inline comments.Nov 11 2020, 7:28 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10893

Nit: do you need the enumerate? Can you do just:

for (auto *OpIdx : I->operands())
10900–10902

Still a bit unclear to me.
But are you actually checking Shuffle is an instruction? I also don't see what the workaround is.

15457

Nit: A comment on what we are exactly combining here.

15458

Nit: performDUPSextCombine -> performDUPSExtCombine

15459

Nit: think the coding style is to use:

auto *Operand = ...

when it is a pointer.

Same for more declarations below.

15465

Can you run clang-format on your patch? I've tried reading this function, but all these lint message makes it pretty unreadable to me.

Hello

This looks like two separate patches to me. One that folds dup(ext(..)) into ext(dup(..)), and another that tries to sink operands into a loop.

They should be separate patches with their own set of tests. The first, done universally, will need some thought (and maybe some benchmarks) to make sure it's always the correct thing to do, it's not just something that happens to work here. The fact that no other tests are failing is a good sign at least.

NickGuy updated this revision to Diff 304529.Nov 11 2020, 8:48 AM
NickGuy marked 2 inline comments as done.

This looks like two separate patches to me.

I've pulled the operand sinking out to it's own patch now.

The first, done universally, will need some thought (and maybe some benchmarks) to make sure it's always the correct thing to do, it's not just something that happens to work here

Agreed, this review request was simply to gather some more opinions on the matter. There was no intention of merging this as-is.

The fact that no other tests are failing is a good sign at least.

Now that you mention it, it's a bit concerning. It's possible that there are no existing tests covering this case.

NickGuy added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10900–10902

Shuffle is always an instruction at this point, the part that's causing the assertion failure is pushing the operand use into Ops. The result of Shuffle->getOperandUse(0) is what is sometimes not an instruction.
The workaround is the isa<Instruction>(&Shuffle->getOperandUse(0) check.

15458

I opted for performDUPExtCombine, as it performs both signed and zero extends

15459

None of these are pointer types, though that's good to know for future :)

15465

Done, sorry about that

Thanks

So what this seems to be proposing is that ever instance of dup(sext(..)) is transformed into sext(dup(..)). That's a pretty general transform. Off the top of my head... the extend could be free in places (zext i32->i64) but it may allow more folding into other instructions like the smull/saddl/ssubl it can allow here. If the operand is a load, that should at least have been folded into a single instruction already prior to the dup being made. But I would probably expect a sxth+dup to be quicker than a dup+sshll in general.

I would suggest trying this on a number of small tests and seeing how it does in terms of instruction count. Alternatively try compiling the llvm testsuite and seeing how many things change, and if they look better or worse in the process. But like I said this may need to be a bit more targeted.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15462

The llvm style guide suggests not to over-use auto like this. It just makes it more difficult telling what things are.

15468

These is no such thing as a "non-vector dup", as far as I understand.

15487

Create a SDLoc DL(N) and use it in both getNodes.
This doesn't need getVTList for single types.
Or {} for the operands I don't think.

15501

Detecting the extend is probably best done inside the function. It's common to do:

if (SDValue Ext = performDUPExtCombine(..))
  return Ext;

I would personally leave out at least AssertSext and AssertZext until you at least have a test that shows them being needed. If this does SIGN_EXTEND_INREG it should probably handle the equivalent AND as well.

llvm/test/CodeGen/AArch64/aarch64-matrix-smull.ll
4 ↗(On Diff #304529)

I would expect tests that looked something like (but I got and edited this one from mve):

define <4 x i32> @vdup_i16(i16 %src) {
; CHECK-LABEL: vdup_i16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vdup.16 q0, r0
; CHECK-NEXT:    bx lr
entry:
  %0 = insertelement <4 x i16> undef, i16 %src, i32 0
  %x = shufflevector <4 x i16> %0, <4 x i16> undef, <4 x i32> zeroinitializer
  %out = sext <4 x i16> %0 to <4 x i32>
  ret <4 x i32> %out
}

But for all type and sizes that this transform supports. Which seems to be a lot at the moment. Don't forget scalable types too.

Having tests that show that mul(sext(...), dup(sext(...))) are also folded sounds useful too, but they can hopefully be equally small.

dmgreen added inline comments.Nov 11 2020, 2:02 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15483

I'm not sure what this is doing. It should be sign extending from the smaller type with the correct number of vector lanes (either ExtOperand.getValueType() for a sext/zext or Operand.getOperand(1) for a SIGN_EXTEND_INREG, and something based on the mask for an AND). It them probably has to make sure that the new extend is a legal operation.

NickGuy updated this revision to Diff 308966.Dec 2 2020, 7:48 AM
NickGuy marked 6 inline comments as done.
NickGuy retitled this revision from [AArch64] Rearrange (dup(sext/zext)) to (sext/zext(dup)) to [AArch64] Rearrange mul(dup(sext/zext)) to mul(sext/zext(dup)).

Addressed comments, and changed to be more targeted. The rearrangement will only be performed if the pattern feeds into a mul instruction, and only if the type combination is valid for smull/umull folding.

llvm/test/CodeGen/AArch64/aarch64-matrix-smull.ll
4 ↗(On Diff #304529)

I've added some tests testing the behaviour for different types and sizes (all generated, so the IR is identical apart from the types). I've omitted support for scalable types, as I encountered some issues when testing them. I plan to make progress with the fixed types first, then revisit scalable types later (Unless it turns out that to support them, I'm missing 1 line somewhere).

NickGuy updated this revision to Diff 308975.Dec 2 2020, 8:15 AM

Removed redundant/useless debug prints that were erroneously included in the patch.

Sorry for the delay. There is a lot of code here all of a sudden. More than I expected!

Can you run clang-format on the patch to make it more readable?

NickGuy updated this revision to Diff 309844.Dec 7 2020, 2:25 AM

Can you run clang-format on the patch to make it more readable?

Done, sorry about that (Guess I should have another look at my pre-commit hooks)

NickGuy updated this revision to Diff 309909.Dec 7 2020, 7:17 AM

Fixed more formatting issues (that were apparently missed by clang-format before?)

Thanks, that makes it clearer at least.

There's quite a bit of code here. I was hoping this would be a lot simpler, and it would be good if some of this could be simplified (but a lot of this might well be needed for all the cases that are being supported?). It makes it more difficult to check all the things it's doing.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11740

There is a convenience function, DAG.getUNDEF(VT);

Probably doesn't need a lambda as a result.

11772

Debug messages are fairly uncommon in ISel Lowering. It tends to trigger a lot of times where they are not important, and the debug messages it prints already are usually enough to give a rough indication of what is going on.

This whole function probably simplifies a lot I think? Is it something like TargetType.getHalfNumVectorElementsVT == PreExtendVT and the TargetType is one of {...}? I'm not sure how the MVT::v16i8 extended to MVT::v8i16 works out, from ISel. I think because ISel is type checked, those cases would not come up even if the underlying instructions supported it? We are not optimizing smull2 as far as I understand.

11803

Do we need to check both shuffles and DUP's? It that for the i64 types that are otherwise illegal somehow?

11828

A mul should always be an integer I think?

11853–11854

This comment is very short (as in - the length of the line). It probably doesn't need the newline before it either.

11861

Multiplies (and other BinOps) only have 2 operands. It would probably be simpler to just check both the operands as opposed to needing the loop.

11870

This also seems to make changes without necessarily detecting that they are useful. That is probably fine here considering what is being transformed, but it can be better to represent it as "test if it will help, if so then make the change".

NickGuy updated this revision to Diff 311506.Dec 14 2020, 12:26 AM
NickGuy marked 5 inline comments as done.
NickGuy added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11740

Ah, thanks for pointing that out.
Lambda is removed, too.

11772

Debug messages are fairly uncommon in ISel Lowering

Removed the debug messages.

11803

No, we don't. Looks like checking for shuffles catches the same DUP cases before they become DUPs. I've removed the DUP-specific check, and have unified it down the VectorShuffle check

11828

That was a remnant of before we were restricting this to muls only. Regardless, this has now been removed.

11861

It would probably be simpler to just check both the operands as opposed to needing the loop.

I disagree. Having the loop here makes it clearer that each operand is being handled in the same way, while having them separated needlessly duplicates the code.

NickGuy updated this revision to Diff 311840.Dec 15 2020, 1:37 AM

Fixed broken commit

NickGuy updated this revision to Diff 311853.Dec 15 2020, 3:08 AM

Fixed formatting, and removed "Unsupported combines" tests as they were failing due to a separate issue (and contribute very little value to this patch)

Fixed formatting, and removed "Unsupported combines" tests as they were failing due to a separate issue (and contribute very little value to this patch)

It might be worth keeping one or two to show it doesn't do anything wrong/crash on incorrect types.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11660

Add a message explaining what this function does. It looks generally useful.

It might also be worth having this function return the PreExtendType instead, and letting the parent function do the extend to vector type. Either way is fine but it might allow the function to be used for other things, if they come up.

11684

Can remove debug here and below.

11706

Nit: Doesn't need a break like this after a return.

11709

I think this can be PostExtendType.changeVectorElementType(PreExtendType);

11715

SDNode -> SDValue

11724

I may be wrong, but is the PreExtendVT != MVT::v16i8 half of this ever used? TargetType and PreExtendVT should have the same number of vector elements I believe. And then TargetType.getScalarSizeInBits() == 2 * PreExtendVT.getScalarSizeInBits() maybe?

That might help this lambda simplify further, possibly to the point that it is easier to inline it, now that it is only used in one place.

11757

Nit: DebugLoc is usually called just DL (I think DebugLoc is already the name of the type used in Machine Instructions).

11761

Nit: It doesn't need the {} around the operands, there are overloaded methods to handle that automatically already.

11768

It may be simpler to generate the AArch64ISD::DUP directly? I'm not sure either way which is better, but it's less nodes and we know it's going to get there eventually. Be careful about illegal types though.

11861

It would seem simpler (and smaller) to just call performCommonVectorExtendCombine for each operand. It may need to use something like Op0 ? Op0 : Mul->getOperand(0), but it would remove the need to track Changed and most of the rest of this loop.

NickGuy updated this revision to Diff 312207.Dec 16 2020, 7:14 AM
NickGuy marked 12 inline comments as done.

Addressed comments, and re-added a couple of "Unsupported combines" tests.

Thanks for the updates, this is looking good.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11672

SDNode* -> SDValue

11778–11782

I think this will always produce a new value, as opposed to returning SDValue when nothing is done. It's probably best to make it something like this instead:

SDValue Op0 = performCommonVectorExtendCombine(Mul->getOperand(0), DAG);
SDValue Op1 = performCommonVectorExtendCombine(Mul->getOperand(1), DAG);
if (!Op0 && !Op1)
  return SDValue();

SDLoc DL(Mul);
return DAG.getNode(Mul->getOpcode(), DL, Mul->getValueType(0),
                   Op0 ? Op0 : Mul->getOperand(0),
                   Op1 ? Op1 : Mul->getOperand(1));
NickGuy updated this revision to Diff 312736.Dec 18 2020, 2:56 AM
NickGuy marked 2 inline comments as done.

Addressed Comments

NickGuy added inline comments.Dec 18 2020, 3:02 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11724

Not entirely sure it is, I added that check to explicitly allow the types accepted by u/smull. I've changed this now.

11768

Generating the DUP directly, while simpler at this point, failed due to this combine being performed fairly early, and the DUP not being handled as part of lowering. Given that, keeping it as shuffle_vector at this point seems simpler, as we can also benefit from the existing VECTOR_SHUFFLE->DUP lowering checks.

NickGuy updated this revision to Diff 312773.Dec 18 2020, 5:49 AM

Fixed incorrect error-handling

dmgreen added inline comments.Dec 21 2020, 12:32 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11673

const SDValue &ExtOperand -> SDValue ExtOperand

Its also only used in one place and needn't get a variable.

11682

const SDValue &TypeOperand -> SDValue TypeOperand

11720

This doesn't seem to be checking that the shuffle/insert are actually a splat.

There is an isSplatValue method that could help, depending on if it's really checking what this wants to check.

NickGuy updated this revision to Diff 314185.Dec 31 2020, 5:01 AM
NickGuy marked 3 inline comments as done.
NickGuy added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11720

Done, though I wasn't able to produce IR that represented a shuffle/insert that wasn't a splat, so I haven't got any tests for this case.

NickGuy updated this revision to Diff 314550.Jan 5 2021, 2:29 AM

Added test case for non-splat shuffles

dmgreen added inline comments.Jan 5 2021, 8:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11717

Although (I know..) I suggested this, it does more than what we might expect here. And there's the chance it could change in any number of weird and wonderful ways in the future. As we are relying on this being a vector shuffle with an insert element and a zero mask, I think we should check for that. I think it should be enough to test that:

  • The shuffle vector has a zero mask (ShuffleVectorSDNode(VectorShuffle)->isSplat(), getSplatIndex()==0)
  • The first first operand is an insert
  • That is inserting into lane 0.
llvm/test/CodeGen/AArch64/aarch64-dup-ext.ll
169

I think this could still count as a splat, as the elements are undef (can happily take any value, including 0).
It's probably fine to use a extra parameter and something like an ext shuffle:
%broadcast.splat = shufflevector <8 x i16> %c, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1>
That should hopefully test things like it not being a splat and the insert not existing.

NickGuy updated this revision to Diff 314861.Jan 6 2021, 4:59 AM
NickGuy marked 2 inline comments as done.

Addressed comments, updating the checks and tests

dmgreen accepted this revision.Jan 6 2021, 5:44 AM

Thanks. LGTM with a couple of suggestions.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11714

You can probably remove this now and just rely on the dyn_cast<ShuffleVectorSDNode> check below.

11724

Maybe move this comment up above the if, to avoid the misleading indentation.

11733–11739

Maybe something like this?

// Ensures the insert is inserting into lane 0
auto *Constant = dyn_cast<ConstantSDNode>(InsertLane.getNode());
if (!Constant || Constant->getZExtValue() != 0)
  return SDValue();
This revision is now accepted and ready to land.Jan 6 2021, 5:44 AM
This revision was automatically updated to reflect the committed changes.
NickGuy marked 3 inline comments as done.
dmajor added a subscriber: dmajor.Jan 7 2021, 7:20 AM

This commit caused a crash on our bots with a null LLVMTy at changeExtendedVectorElementType. Reduced case: https://godbolt.org/z/7ehffx

This commit caused a crash on our bots with a null LLVMTy at changeExtendedVectorElementType. Reduced case: https://godbolt.org/z/7ehffx

Thanks for bringing this to my attention, I've got a fix in review at D94234.