Performing this rearrangement lets existing patterns match cases where operands have been hoisted up to a loop header (by LICM), enabling optimisations such as umull/smull codegen, while still allowing loop-invariant values to be hoisted out of the loop after other optimisations.
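For illustration, a minimal source-level example (hypothetical, not taken from the patch's tests) of the kind of loop this targets: the loop-invariant scalar b gets splatted and extended once the loop is vectorised, and rearranging dup(sext(x)) into sext(dup(x)) lets the widening multiply be selected as smull/umull.

    #include <cstdint>

    // Hypothetical motivating case: 'b' is loop invariant, so after
    // vectorisation it becomes an insertelement/shufflevector splat that is
    // sign-extended; the sign-extended loads of 'a' provide the other operand.
    // With the rearrangement, the widening multiply can become smull.
    void mul_by_scalar(const int16_t *a, int16_t b, int32_t *out, int n) {
      for (int i = 0; i < n; ++i)
        out[i] = int32_t(a[i]) * int32_t(b);
    }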
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10900–10902 | In a number of tests, this line is hit more than a few times, but only one of them fails. Using .dump() to try to identify why didn't help, as it appears to be the same style as the others that pass (e.g. %tmp3 = sext <8 x i8> %arg to <8 x i16>). | |
15491–15500 | I'm unsure as to when PerformDAGCombine is invoked. If this function generates a new DUP node, would this function then be invoked with that node? Or does this function need a bit more scaffolding to support this case? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10900–10902 | What do you mean by fail? I guess that's a segfault? |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10900–10902 | Sorry, I should've been a bit clearer. Shuffle->getOperandUse(0).dump() outputs something that looks like an instruction, but reportedly isn't, and only in a single case. It's simple to work around, but I was curious whether anyone had an idea of what was happening. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10893 | Nit: do you need the enumerate? Can you do just: for (auto *OpIdx : I->operands()) | |
10900–10902 | Still a bit unclear to me. | |
15457 | Nit: A comment on what we are exactly combining here. | |
15458 | Nit: performDUPSextCombine -> performDUPSExtCombine | |
15459 | Nit: think the coding style is to use: auto *Operand = ... when it is a pointer. Same for more declarations below. | |
15465 | Can you run clang-format on your patch? I've tried reading this function, but all these lint messages make it pretty unreadable to me. |
Hello
This looks like two separate patches to me. One that folds dup(ext(..)) into ext(dup(..)), and another that tries to sink operands into a loop.
They should be separate patches with their own set of tests. The first, done universally, will need some thought (and maybe some benchmarks) to make sure it's always the correct thing to do, not just something that happens to work here. The fact that no other tests are failing is a good sign at least.
This looks like two separate patches to me.
I've pulled the operand sinking out to its own patch now.
The first, done universally, will need some thought (and maybe some benchmarks) to make sure it's always the correct thing to do, not just something that happens to work here
Agreed, this review request was simply to gather some more opinions on the matter. There was no intention of merging this as-is.
The fact that no other tests are failing is a good sign at least.
Now that you mention it, it's a bit concerning. It's possible that there are no existing tests covering this case.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
10900–10902 | Shuffle is always an instruction at this point; the part that's causing the assertion failure is pushing the operand use into Ops. The result of Shuffle->getOperandUse(0) is what is sometimes not an instruction. | |
15458 | I opted for performDUPExtCombine, as it performs both signed and zero extends | |
15459 | None of these are pointer types, though that's good to know for future :) | |
15465 | Done, sorry about that |
Thanks
So what this seems to be proposing is that every instance of dup(sext(..)) is transformed into sext(dup(..)). That's a pretty general transform. Off the top of my head... the extend could be free in places (zext i32->i64), but it may allow more folding into other instructions, like the smull/saddl/ssubl it enables here. If the operand is a load, that should at least have been folded into a single instruction already prior to the dup being made. But I would probably expect a sxth+dup to be quicker than a dup+sshll in general.
I would suggest trying this on a number of small tests and seeing how it does in terms of instruction count. Alternatively try compiling the llvm testsuite and seeing how many things change, and if they look better or worse in the process. But like I said this may need to be a bit more targeted.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
15462 | The llvm style guide suggests not to over-use auto like this. It just makes it more difficult to tell what things are. | |
15468 | There is no such thing as a "non-vector dup", as far as I understand. | |
15487 | Create a SDLoc DL(N) and use it in both getNodes. | |
15501 | Detecting the extend is probably best done inside the function. It's common to do: if (SDValue Ext = performDUPExtCombine(..)) return Ext; I would personally leave out at least AssertSext and AssertZext until you at least have a test that shows them being needed. If this does SIGN_EXTEND_INREG it should probably handle the equivalent AND as well. | |
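As a rough illustration of the early-return shape suggested in the 15501 comment (names and details are illustrative, not the patch's actual code; AssertSext/AssertZext and the AND handling are left out, as discussed):

    // Sketch only: N is the splat (AArch64ISD::DUP) node; the extend is
    // detected inside the helper, which returns SDValue() when it does nothing.
    static SDValue performDUPExtCombine(SDNode *N, SelectionDAG &DAG) {
      SDValue Ext = N->getOperand(0);
      unsigned ExtOpc = Ext.getOpcode();
      if (ExtOpc != ISD::SIGN_EXTEND && ExtOpc != ISD::ZERO_EXTEND)
        return SDValue();
      EVT VT = N->getValueType(0);
      SDLoc DL(N);
      // dup(ext(x)) -> ext(dup(x)): splat the narrow scalar, then extend the
      // whole vector. (Type-legality checks are skipped in this sketch.)
      EVT NarrowVT =
          VT.changeVectorElementType(Ext.getOperand(0).getValueType());
      SDValue NarrowDup =
          DAG.getNode(AArch64ISD::DUP, DL, NarrowVT, Ext.getOperand(0));
      return DAG.getNode(ExtOpc, DL, VT, NarrowDup);
    }

    // Caller, inside performDAGCombine's DUP handling:
    //   if (SDValue Ext = performDUPExtCombine(N, DAG))
    //     return Ext;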
llvm/test/CodeGen/AArch64/aarch64-matrix-smull.ll | ||
4 ↗ | (On Diff #304529) | I would expect tests that looked something like (but I got and edited this one from mve):
define <4 x i32> @vdup_i16(i16 %src) {
; CHECK-LABEL: vdup_i16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vdup.16 q0, r0
; CHECK-NEXT:    bx lr
entry:
  %0 = insertelement <4 x i16> undef, i16 %src, i32 0
  %x = shufflevector <4 x i16> %0, <4 x i16> undef, <4 x i32> zeroinitializer
  %out = sext <4 x i16> %0 to <4 x i32>
  ret <4 x i32> %out
}
But for all types and sizes that this transform supports, which seems to be a lot at the moment. Don't forget scalable types too. Having tests that show that mul(sext(...), dup(sext(...))) are also folded sounds useful too, but they can hopefully be equally small. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
15483 | I'm not sure what this is doing. It should be sign extending from the smaller type with the correct number of vector lanes (either ExtOperand.getValueType() for a sext/zext or Operand.getOperand(1) for a SIGN_EXTEND_INREG, and something based on the mask for an AND). It then probably has to make sure that the new extend is a legal operation. |
Addressed comments, and changed to be more targeted. The rearrangement will only be performed if the pattern feeds into a mul instruction, and only if the type combination is valid for smull/umull folding.
llvm/test/CodeGen/AArch64/aarch64-matrix-smull.ll | ||
---|---|---|
4 ↗ | (On Diff #304529) | I've added some tests testing the behaviour for different types and sizes (all generated, so the IR is identical apart from the types). I've omitted support for scalable types, as I encountered some issues when testing them. I plan to make progress with the fixed types first, then revisit scalable types later (unless it turns out that I'm just missing one line somewhere to support them). |
Sorry for the delay. There is a lot of code here all of a sudden. More than I expected!
Can you run clang-format on the patch to make it more readable?
Can you run clang-format on the patch to make it more readable?
Done, sorry about that (Guess I should have another look at my pre-commit hooks)
Thanks, that makes it clearer at least.
There's quite a bit of code here. I was hoping this would be a lot simpler, and it would be good if some of this could be simplified (but a lot of this might well be needed for all the cases that are being supported?). It makes it more difficult to check all the things it's doing.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11740 | There is a convenience function, DAG.getUNDEF(VT); this probably doesn't need a lambda as a result. | |
11772 | Debug messages are fairly uncommon in ISel Lowering. It tends to trigger a lot of times where they are not important, and the debug messages it prints already are usually enough to give a rough indication of what is going on. This whole function probably simplifies a lot I think? Is it something like TargetType.getHalfNumVectorElementsVT == PreExtendVT and the TargetType is one of {...}? I'm not sure how the MVT::v16i8 extended to MVT::v8i16 works out, from ISel. I think because ISel is type checked, those cases would not come up even if the underlying instructions supported it? We are not optimizing smull2 as far as I understand. | |
11803 | Do we need to check both shuffles and DUPs? Is that for the i64 types that are otherwise illegal somehow? | |
11828 | A mul should always be an integer I think? | |
11853–11854 | This comment is very short (as in - the length of the line). It probably doesn't need the newline before it either. | |
11861 | Multiplies (and other BinOps) only have 2 operands. It would probably be simpler to just check both the operands as opposed to needing the loop. | |
11870 | This also seems to make changes without necessarily detecting that they are useful. That is probably fine here considering what is being transformed, but it can be better to represent it as "test if it will help, if so then make the change". |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11740 | Ah, thanks for pointing that out. | |
11772 | Removed the debug messages. | |
11803 | No, we don't. It looks like checking for shuffles catches the same DUP cases before they become DUPs. I've removed the DUP-specific check and unified it down to the VectorShuffle check. | |
11828 | That was a remnant of before we were restricting this to muls only. Regardless, this has now been removed. | |
11861 | I disagree. Having the loop here makes it clearer that each operand is being handled in the same way, while having them separated needlessly duplicates the code. |
Fixed formatting, and removed "Unsupported combines" tests as they were failing due to a separate issue (and contribute very little value to this patch)
It might be worth keeping one or two to show it doesn't do anything wrong/crash on incorrect types.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11660 | Add a message explaining what this function does. It looks generally useful. It might also be worth having this function return the PreExtendType instead, and letting the parent function do the extend to vector type. Either way is fine but it might allow the function to be used for other things, if they come up. | |
11684 | Can remove debug here and below. | |
11706 | Nit: Doesn't need a break like this after a return. | |
11709 | I think this can be PostExtendType.changeVectorElementType(PreExtendType); | |
11715 | SDNode -> SDValue | |
11724 | I may be wrong, but is the PreExtendVT != MVT::v16i8 half of this ever used? TargetType and PreExtendVT should have the same number of vector elements I believe. And then TargetType.getScalarSizeInBits() == 2 * PreExtendVT.getScalarSizeInBits() maybe? That might help this lambda simplify further, possibly to the point that it is easier to inline it, now that it is only used in one place. (A sketch of this check follows after this table.) | |
11757 | Nit: DebugLoc is usually called just DL (I think DebugLoc is already the name of the type used in Machine Instructions). | |
11761 | Nit: It doesn't need the {} around the operands; there are overloaded methods to handle that automatically already. | |
11768 | It may be simpler to generate the AArch64ISD::DUP directly? I'm not sure either way which is better, but it's less nodes and we know it's going to get there eventually. Be careful about illegal types though. | |
11861 | It would seem simpler (and smaller) to just call performCommonVectorExtendCombine for each operand. It may need to use something like Op0 ? Op0 : Mul->getOperand(0), but it would remove the need to track Changed and most of the rest of this loop. |
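Picking up the 11724 suggestion above, a minimal sketch of the simplified legality check (names follow the review discussion; the committed form may differ):

    // Same number of lanes, and the extended type is exactly twice as wide
    // per lane: the shape that smull/umull can consume.
    bool IsValidExtend =
        TargetType.isVector() && PreExtendVT.isVector() &&
        TargetType.getVectorNumElements() == PreExtendVT.getVectorNumElements() &&
        TargetType.getScalarSizeInBits() == 2 * PreExtendVT.getScalarSizeInBits();
    if (!IsValidExtend)
      return SDValue();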
Thanks for the updates, this is looking good.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11672 | SDNode* -> SDValue | |
11778–11782 | I think this will always produce a new value, as opposed to returning SDValue when nothing is done. It's probably best to make it something like this instead:
SDValue Op0 = performCommonVectorExtendCombine(Mul->getOperand(0), DAG);
SDValue Op1 = performCommonVectorExtendCombine(Mul->getOperand(1), DAG);
if (!Op0 && !Op1)
  return SDValue();
SDLoc DL(Mul);
return DAG.getNode(Mul->getOpcode(), DL, Mul->getValueType(0),
                   Op0 ? Op0 : Mul->getOperand(0),
                   Op1 ? Op1 : Mul->getOperand(1)); |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11724 | Not entirely sure it is; I added that check to explicitly allow the types accepted by u/smull. I've changed this now. | |
11768 | Generating the DUP directly, while simpler at this point, failed due to this combine being performed fairly early, and the DUP not being handled as part of lowering. Given that, keeping it as shuffle_vector at this point seems simpler, as we can also benefit from the existing VECTOR_SHUFFLE->DUP lowering checks. |
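To illustrate the design choice described above, a sketch of keeping the splat as a generic shuffle so the existing VECTOR_SHUFFLE -> DUP lowering still applies (Mul, NarrowScalar, PreExtendVT, TargetType and IsSExt are illustrative names, not the patch's code):

    // Rebuild the narrow splat as insert_vector_elt + shuffle with an all-zero
    // mask, instead of emitting AArch64ISD::DUP directly, since this combine
    // runs before DUP is handled by lowering.
    SDLoc DL(Mul);
    SDValue NarrowInsert =
        DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, PreExtendVT,
                    DAG.getUNDEF(PreExtendVT), NarrowScalar,
                    DAG.getConstant(0, DL, MVT::i64));
    SmallVector<int, 16> ZeroMask(PreExtendVT.getVectorNumElements(), 0);
    SDValue NarrowSplat = DAG.getVectorShuffle(
        PreExtendVT, DL, NarrowInsert, DAG.getUNDEF(PreExtendVT), ZeroMask);
    SDValue Extended = DAG.getNode(IsSExt ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND,
                                   DL, TargetType, NarrowSplat);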
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11673 | const SDValue &ExtOperand -> SDValue ExtOperand. It's also only used in one place and needn't get a variable. | |
11682 | const SDValue &TypeOperand -> SDValue TypeOperand | |
11720 | This doesn't seem to be checking that the shuffle/insert are actually a splat. There is an isSplatValue method that could help, depending on whether it's really checking what this wants to check. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11720 | Done, though I wasn't able to produce IR that represented a shuffle/insert that wasn't a splat, so I haven't got any tests for this case. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11717 | Although (I know..) I suggested this, it does more than we might expect here. And there's the chance it could change in any number of weird and wonderful ways in the future. As we are relying on this being a vector shuffle with an insert element and a zero mask, I think we should check for that directly; I think it should be enough to test for that shape (a sketch follows after this table). | |
llvm/test/CodeGen/AArch64/aarch64-dup-ext.ll | ||
169 | I think this could still count as a splat, as the elements are undef (can happily take any value, including 0). |
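A rough sketch of the explicit shape checks discussed in the 11717 comment above, rather than relying on isSplatValue (V is an illustrative name for the value being inspected; undef mask elements are accepted, per the 169 comment):

    auto *Shuffle = dyn_cast<ShuffleVectorSDNode>(V.getNode());
    if (!Shuffle)
      return SDValue();
    // Every mask element either reads lane 0 or is undef.
    if (!llvm::all_of(Shuffle->getMask(),
                      [](int Elt) { return Elt == 0 || Elt == -1; }))
      return SDValue();
    // The shuffled value must be an insert into an undef vector; the lane-0
    // check on the insert index is shown in a later comment below.
    SDValue Insert = Shuffle->getOperand(0);
    if (Insert.getOpcode() != ISD::INSERT_VECTOR_ELT ||
        !Insert.getOperand(0).isUndef())
      return SDValue();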
Thanks. LGTM with a couple of suggestions.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
11714 | You can probably remove this now and just rely on the dyn_cast<ShuffleVectorSDNode> check below. | |
11724 | Maybe move this comment up above the if, to avoid the misleading indentation. | |
11733–11739 | Maybe something like this?
// Ensures the insert is inserting into lane 0
auto *Constant = dyn_cast<ConstantSDNode>(InsertLane.getNode());
if (!Constant || Constant->getZExtValue() != 0)
  return SDValue(); |
This commit caused a crash on our bots with a null LLVMTy at changeExtendedVectorElementType. Reduced case: https://godbolt.org/z/7ehffx
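For context, one possible shape of a guard (purely a hypothetical sketch, not necessarily the committed fix): building the narrow vector type through the DAG's LLVMContext avoids EVT::changeVectorElementType's extended-type path, which needs an underlying IR type and is where the null LLVMTy is hit.

    // Hypothetical sketch only: construct the narrow vector type via the
    // LLVMContext so that non-simple results don't go through
    // changeExtendedVectorElementType and its (possibly null) LLVMTy.
    EVT PreExtendVT = EVT::getVectorVT(*DAG.getContext(), PreExtendType,
                                       TargetType.getVectorNumElements());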