This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE]: custom lower AVGFloor/AVGCeil.

Authored by hassnaa-arm on Feb 3 2023, 9:57 AM.



-Lower AVGFloor(A, B) to:

SRL(A) + SRL(B) + (A&B)&1.

-Lower AVGCeil(A, B) to:

SRL(A) + SRL(B) + (A|B)&1.

Diff Detail

Event Timeline

hassnaa-arm created this revision.Feb 3 2023, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 3 2023, 9:57 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
hassnaa-arm requested review of this revision.Feb 3 2023, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 3 2023, 9:57 AM
sdesmalen requested changes to this revision.Feb 6 2023, 12:58 AM

This patch doesn't seem like it is in a state to review yet (e.g. it doesn't have any tests), perhaps you can add 'WIP' to the title to make it clear that this is a Work In Progress?

1420–1434 ↗(On Diff #494676)

Just as a suggestion for future uses of these pattern matches, is that you could write this in a more compact way like this:

if (match(Op1, m_SpecificInt(1)) &&
    match(Op2, m_Add(m_Add(m_OneUse(m_Value(A)), m_OneUse(m_Value(B))),
                     m_ConstantInt(C))) &&
    (C->getZExtValue() == 0 || C->getZExtValue() == 1)) {
1456 ↗(On Diff #494676)

This transform is only profitable when the target has no explicit instructions for this pattern (e.g. like Arm's urhadd/srhadd) and the extends are expensive. The InstCombine pass transforms the IR to a canonical form and doesn't use the CostModel to decide whether a certain form is profitable or not. A better place to do this would be in an DAGCombine (to start it's good enough for this to be target-specific, e.g. a combine in AArch64ISelLowering.cpp) where we can simply check if the target being compiled for has the urhadd/shradd instructions, and if not, do the transform.

This revision now requires changes to proceed.Feb 6 2023, 12:58 AM
hassnaa-arm retitled this revision from [Transform][InstCombine]: transform lshr pattern. to [Transform][InstCombine]: transform lshr pattern. [WIP].Feb 6 2023, 2:01 AM
Matt added a subscriber: Matt.Feb 6 2023, 12:16 PM

Move combining trunc shift and extend shift to AArch64

hassnaa-arm retitled this revision from [Transform][InstCombine]: transform lshr pattern. [WIP] to [AArch64][combine]: transform lshr pattern. [WIP].Feb 7 2023, 10:28 AM
hassnaa-arm edited the summary of this revision. (Show Details)
hassnaa-arm retitled this revision from [AArch64][combine]: transform lshr pattern. [WIP] to [AArch64][combine]: combine lshr pattern. [WIP].Feb 7 2023, 10:28 AM
hassnaa-arm added inline comments.Feb 7 2023, 10:36 AM
22 ↗(On Diff #495593)

I investigated the generated DAG and the effect of my changes,
I found that my changes affected the DAG.
Here is the DAG just before the step of instruction selection:

SelectionDAG has 19 nodes:
  t0: ch,glue = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
          t50: i32 = and t2, Constant:i32<254>
        t37: i32 = srl t50, Constant:i64<1>
          t52: i32 = and t4, Constant:i32<254>
        t42: i32 = srl t52, Constant:i64<1>
      t43: i32 = add t37, t42
        t39: i32 = or t2, t4
      t40: i32 = and t39, Constant:i32<1>
    t44: i32 = add t43, t40
  t17: ch,glue = CopyToReg t0, Register:i32 $w0, t44
  t18: ch = AArch64ISD::RET_FLAG t17, Register:i32 $w0, t17:1
sdesmalen added inline comments.Feb 8 2023, 3:00 AM

As I understand it, the point to simplify the following:

   trunc(lshr(add(add(ext(a), ext(b)), 1), 1))
-> lshr(a, 1) + lshr(b, 1) + (a | b) & 1

iff the type of ext(a) is not a legal type

I see that in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp it tries to combine this pattern into a simpler AVGFLOORS/AVGFLOORU/AVGCEILS/AVGCEILU node in the function combineShiftToAVG. Is there a way to re-use this existing mechanism and then Custom lower this node to the desired set of instructions? I even wonder if we could generalise this code (i.e. not specific to AArch64), when the nodes are set to Expand.


You're testing for SVE2 (vector extension), but the test is using scalar types, not vector types. That doesn't seem entirely right?


I don't believe this is a transform we want to make.

Add AArch64 implementation for custom-lowering AVGFloor/AVGCeil

hassnaa-arm retitled this revision from [AArch64][combine]: combine lshr pattern. [WIP] to [AArch64][SVE]: custom lower AVGFloor/AVGCeil. [WIP].Feb 10 2023, 4:02 AM
hassnaa-arm edited the summary of this revision. (Show Details)

Remove neon-lshr.ll

Remove old code that is not used now.

Please pre-commit the new testcases, so the changes are more visible.

The <vscale x 2 x i128> variations currently crash the compiler; I'd recommend fixing that before trying to optimize them.

The "expected" code for haddu_v2i32 is worse than what we currently generate.

Add precursory patch.

Optimize the generated code by checking if the extended node was previously truncated.

Optimize the generated code by checking if the extended nodes were previously truncated.

hassnaa-arm retitled this revision from [AArch64][SVE]: custom lower AVGFloor/AVGCeil. [WIP] to [AArch64][SVE]: custom lower AVGFloor/AVGCeil..Feb 20 2023, 2:04 AM
sdesmalen added inline comments.Feb 20 2023, 3:41 AM
1034 ↗(On Diff #498046)

This is probably better added as a separate DAGCombine that folds away the sign-extend-in-regs explicitly, because that is already performed by the avgfloor operation.

avgfloors(sextinreg(x), sextinreg(y)) -> avgfloors(x, y)

You seem to be trying to optimise the case where one of the operands is a vector of ones, but I don't see any motivating tests for this?
I would also expect these cases to be folded away already by existing DAGCombines.


This doesn't look like an improvement, we don't really want to do this transform if it makes the resulting code worse. Do you know why this results in worse code?

hassnaa-arm added inline comments.Feb 20 2023, 4:37 AM
1034 ↗(On Diff #498046)

Sorry, I don't understand this comment, may you clarify if more ?
Why did you mention sign-extend-in-regs ? why is it related to this code ?

sdesmalen added inline comments.Feb 20 2023, 4:52 AM
1034 ↗(On Diff #498046)

The truncate + sign-extend will be combined into a sign-extend-in-reg. You can see that if you disable the code you added below on lines 1046-1055, and run the test @hadd8_sext_asr

hassnaa-arm added inline comments.Feb 20 2023, 5:11 AM

In the new changes, the generated instructions are exactly the equivalent for AVGFloor, no additional instructions.
I think nothing can be done for this case.

sdesmalen added inline comments.Feb 20 2023, 8:45 AM
1034 ↗(On Diff #498046)

Given my other suggestion, please ignore my suggestion above to add a DAGCombine (which I'm not even sure is legal; the semantics of the AVGFLOOR operation aren't entirely clear to me still)


Should this be using an arithmetic shift when the operation is signed?


For these cases (where the input is an unpacked vector that is promoted using an and (for the unsigned case) or sign_extend_inreg (for the signed case)) we can emit the original code when Custom lowering these nodes. That way we can mitigate the regression.


Should this use ashr (arithmetic shift, since the values s0s/s1s are signed)?

Same question for other tests in this file.

hassnaa-arm marked 4 inline comments as done.

Check if it's better to emit the original code or custom lower AVGFloor/Ceil

Change lshr to ashr for signed cases in the precursory patch.

In case of CEIL, Put ADD operation for constant 1.

sdesmalen added inline comments.Feb 23 2023, 1:57 AM

This is missing a check to ensure that the and value is a mask, and that the masked 'type' is smaller than the size of the current type.
Perhaps you can get some inspiration from isExtendOrShiftOperand which checks whether the operation is zero or sign-extended.

In either case, it would be nice to have this logic in separate isSignedExtended() and isZeroExtended() functions.


There are several uses of Node->getOpcode(), you can move that to a separate variable.

It would also be nice to have some variables:

  • IsSigned (for AVGFLOORS and AVGCEILS)
  • IsFloor (for AVGFLOORS and AVGFLOORU)
  • ShiftOpc = IsSigned ? ISD::SRA : ISD::SRL;

That way you can simplify the code a bit, without having to check the opcodes everywhere, and you can combine some of the if/else branches.


The filename has both a dash (-) and underscores (_). Can you rename it to sve-avg-floor-ceil.ll ?


Can you add two RUN lines, one for -mattr=+sve and one for -mattr=+sve2 ?


Can you add these tests to the precursory patch, so that this patch only shows the diff?

hassnaa-arm marked 4 inline comments as done.Feb 23 2023, 8:41 AM
hassnaa-arm added inline comments.

Adding them cause a crash.
Crash message is: "LLVM ERROR: Scalarization of scalable vectors is not supported."
It crashes while legalising the result type of nxv2i128

Enhance code readability.

Hello. Sorry for the delay in looking at this but I wasn't sure exactly what you were trying to do, and I've never been a huge fan of DAG combines that create the wrong node just to expand it later. It looks like for legal types this can lead to a nice decrease in instruction count though.

For smaller types I'm not sure that checking for individual opcodes for extension will work well. They could already be extending loads for example. I've not thought about it too much yet, but As far as I understand from the original hadd (/avg) work it would probably be better to be checking that the top bits are known 0 for unsigned, and that there are >1 sign bits for the signed cases.

In any case, it would be good to see alive proofs for the transforms you are making.


This can be moved into the start of LowerAVG to make this function a little more regular. (I know it's not super regular already, but other parts already do the same thing).


Can it use countTrailingOnes as opposed to countPopulation? Although I think these functions might be better removed and use computeKnownBits checks instead.


This can just be called LowerAVG


check should be Check


Why is it OK to only check one of the operands?


Doesn't need the else after the return, which might help simplify if this is checking the sign bits.

3 ↗(On Diff #499878)

There is already a test for hadds in sve in sve2-hadd. This test looks like a copy of it. It would probably be better to just use the existing test with an extra run line for SVE1 targets.


Like Eli said it would be good to fix vscale x 1 vectorization so this wasn't a problem any more.

hassnaa-arm marked 6 inline comments as done.

Remove sve-avgfloor testing file.
Add RUN line for sve to sve2-hadd
rename sve2-hadd to sve-hadd

hassnaa-arm added inline comments.Feb 28 2023, 4:53 AM

I think using countTrailingOnes is much simpler than computeKnownBits because the operand and its value are already known.

sdesmalen added inline comments.Feb 28 2023, 5:46 AM

I presume that @dmgreen meant using computeKnownBits on Node as a whole, not specifically on the splatted value so that you don't need to check for the opcode explicitly.

See for example checkZExtBool, where it does:

APInt RequredZero(SizeInBits, 0xFE);
KnownBits Bits = DAG.computeKnownBits(Arg, 4);
bool ZExtBool = (Bits.Zero & RequredZero) == RequredZero;

You can probably do a similar thing for IsSignExtended, but then looking at the Bits.One instead of Bits.Zero.

Use computeKnownBits for checking zeroExtedn/signExtend.

Use ComputeNumSignBits instead of ComputeKnownBits for SIGN_EXTEND_INREG ops.

hassnaa-arm added inline comments.Mar 1 2023, 7:42 AM

I used DAG.ComputeNumSignBits instead of computeKnownBits, because computeKnownBits doesn't get any known bits for SIGN_EXTEND_INREG

104 ↗(On Diff #501511)

@dmgreen I tried to get alive proofs for this transform,
but it seems there is something wrong.
I'm not sure if the problem related to my equivalent IR to the generated code or the problem is because the transform is not correct.

Hi - I was just looking at the patch whilst you updated it! Please ignore any comments that don't apply any more.


It might be better to make these lambdas inside LowerAVG. The names sound fairly generic but they are in practice tied to hadd nodes.




I believe this only needs 1 bit to be zero, so it could probably use Known.Zero.isSignBitSet()?
There is a proof in for rhadd.


I think this one would be better to use computeNumSignBits. It's less important whether they are 0 or 1, just that they are the same as the top bit. Again I think it needs 1 bit to be valid, so computeNumSignBits(..) > 1 should do it.

It would be good to see proofs for the other bit here too where the hadd is expanded to SRL(A) + SRL(B) + (A&B)&1


I would probably do:

unsigned Opc = Op->getOpcode();
bool IsCeil = Opc == ISD::AVGCEILS || Opc == ISD::AVGCEILU;
bool IsSigned = Opc == ISD::AVGFLOORS || Op->getOpcode() == ISD::AVGCEILS;

&&, not ||, I would expect.

A HADD is equivalent to trunc(shr(add(ext(x),ext(y)), 1)), it is not directly equivalent to shr(add(x,y), 1). So we need to prove that turning it into the simpler version add+shift is better. It's not really "emit the original code".

75 ↗(On Diff #501511)

This ideally shouldn't change. Because the top bit isn't demanded instcombine will transform ashr into lshr, and we should be testing what the backend will see: I guess the lshr version isn't transformed after it gets promoted? It might be OK in this case.

104 ↗(On Diff #501511)

Can you update the link? There are some other alive links that could help.

hassnaa-arm marked 5 inline comments as done.Mar 1 2023, 12:13 PM
hassnaa-arm added inline comments.

Is that because the sign bit will be known to be zero ONLY for zero-extend case ?

75 ↗(On Diff #501511)

But for some cases, using lshr in the IR cause generating extra AND instruction.

Check if both operands of AVG are extended, not just single one.

dmgreen added inline comments.Mar 2 2023, 12:21 AM

Oh right. I was assuming that it would use if (!IsSigned && IsZeroExtended(OpA) && IsZeroExtended(OpB)) and if (IsSigned && IsSignExtended(OpA) && IsSignExtended(OpB))

An hadd / rhadd is defined as converting into a arbitrary wide integer before doing the add/shift and then converting back. It just happens that it only needs 1 bit extra for that to be equivalent to any other type sizes. So we only need to check the top bit is known to be 0. (isSignBitSet doesnt really have anything to do with signedness here, its just a way of checking the top bit).

hassnaa-arm marked an inline comment as done.Mar 2 2023, 6:59 AM

While checking isZeroExtending, only checking the signbit of known Zeros is enough.

sdesmalen added inline comments.Mar 2 2023, 7:34 AM

Move this closer to use.


nit: unnecessary newline.

Also, these lines exceed the 80char limit, so please run clang-format on this code.


You can remove this condition, it's covered by DAG.ComputeNumSignBits.


You can combine these cases doing something like this:

if ((IsSigned && IsSignExtended(OpA) && IsSignExtended(OpB)) ||
    (!IsSigned && IsZeroExtended(OpA) && IsZeroExtended(OpB))) {
  unsigned ShiftOpc = IsSigned ? ISD::SRA : ISD::SRL;
  return DAG.getNode(ShiftOpc, dl, VT, Add, ConstantOne);

nit: SDValue Tmp = DAG.getNode(IsCeil ? ISD::OR : ISD::AND, dl, VT, OpA, OpB);

hassnaa-arm marked 5 inline comments as done.

Enhance code readability.

Thanks. here are some alive proofs for the transform in and

Can you extend the testing to include both ashr and lshr versions? They should both be useful if we are custom legalizing the nodes. Otherwise I think this looks good.

25 ↗(On Diff #502090)

Can you copy these tests so there are versions with both lshr and ashr.

Add test cases for logical shr.

hassnaa-arm added inline comments.Mar 7 2023, 2:17 AM
904 ↗(On Diff #502962)

here is a transformation proof for this case:

dmgreen accepted this revision.Mar 7 2023, 2:59 AM

I think it's worth adding test for both the ashr and lshr versions, but otherwise I think this LGTM. Thanks

166 ↗(On Diff #502962)

Is there a lshr version of this one? It would be good to have some that are "full width" and use lshr.
As instcombine will convert all the ashr to lshr it might be best to make sure there are tests for all the functions that were changed.

904 ↗(On Diff #502962)

I think that's another case. This one would be From what I can tell they all look OK, according to alive.

Thanks for all the changes @hassnaa-arm, I've just left some final minor comments.


Can you add a comment explaining what this does, e.g. something like

When x and y are extended, lower:
  avgfloor(x, y) -> (x + y) >> 1
  avgceil(x, y)  -> (x + y + 1) >> 1

Otherwise, lower to:
  avgfloor(x, y) -> (x >> 1) + (y >> 1) + (x && y && 1)
  avgceil(x, y)  -> (x >> 1) + (y >> 1) + ((x || y) && 1)

nit: this comment would be redundant if you address my other comment (to add a comment for the function itself)



SDValue ShiftOpA = DAG.getNode(ShiftOpc, dl, VT, OpA, ConstantOne);
SDValue ShiftOpB = DAG.getNode(ShiftOpc, dl, VT, OpB, ConstantOne);
hassnaa-arm marked 3 inline comments as done.Mar 7 2023, 4:42 AM
hassnaa-arm added inline comments.
166 ↗(On Diff #502962)

Sorry, I don't understand what you mean by "full width" ?

Add comments explaining what LowerAvg() does.

dmgreen added inline comments.Mar 7 2023, 5:10 AM
166 ↗(On Diff #502962)

I just meant a multiple of 128 - a <vscale x 4 x i32>. There appear to be <vscale x 2 x ..> tests for lshr, but we should have all the others sizes too.

hassnaa-arm marked an inline comment as done.Mar 8 2023, 1:38 PM

Add test cases that use lshr.

@dmgreen Thanks for reviewing the patch. Do you have any further comments ?

sdesmalen accepted this revision.Mar 9 2023, 6:14 AM

Thanks for the changes @hassnaa-arm, I'm satisfied with the patch now so removing my 'requesting changes'.
Unless @dmgreen has more comments on the tests, I'm happy for this patch to land.

This revision is now accepted and ready to land.Mar 9 2023, 6:14 AM
dmgreen accepted this revision.Mar 10 2023, 3:14 AM

Yeah nothing else from me. LGTM, thanks for the changes.

This revision was landed with ongoing or failed builds.Mar 13 2023, 12:01 PM
This revision was automatically updated to reflect the committed changes.