Page MenuHomePhabricator

[AArch64][SVE]: custom lower AVGFloor/AVGCeil.

Authored by hassnaa-arm on Feb 3 2023, 9:57 AM.



-Lower AVGFloor(A, B) to:

SRL(A) + SRL(B) + (A&B)&1.

-Lower AVGCeil(A, B) to:

SRL(A) + SRL(B) + (A|B)&1.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In case of CEIL, Put ADD operation for constant 1.

sdesmalen added inline comments.Feb 23 2023, 1:57 AM

This is missing a check to ensure that the and value is a mask, and that the masked 'type' is smaller than the size of the current type.
Perhaps you can get some inspiration from isExtendOrShiftOperand which checks whether the operation is zero or sign-extended.

In either case, it would be nice to have this logic in separate isSignedExtended() and isZeroExtended() functions.


There are several uses of Node->getOpcode(), you can move that to a separate variable.

It would also be nice to have some variables:

  • IsSigned (for AVGFLOORS and AVGCEILS)
  • IsFloor (for AVGFLOORS and AVGFLOORU)
  • ShiftOpc = IsSigned ? ISD::SRA : ISD::SRL;

That way you can simplify the code a bit, without having to check the opcodes everywhere, and you can combine some of the if/else branches.

4 ↗(On Diff #499568)

Can you add these tests to the precursory patch, so that this patch only shows the diff?

1 ↗(On Diff #499126)

The filename has both a dash (-) and underscores (_). Can you rename it to sve-avg-floor-ceil.ll ?

2 ↗(On Diff #499126)

Can you add two RUN lines, one for -mattr=+sve and one for -mattr=+sve2 ?

hassnaa-arm marked 4 inline comments as done.Feb 23 2023, 8:41 AM
hassnaa-arm added inline comments.
4 ↗(On Diff #499568)

Adding them cause a crash.
Crash message is: "LLVM ERROR: Scalarization of scalable vectors is not supported."
It crashes while legalising the result type of nxv2i128

Enhance code readability.

Hello. Sorry for the delay in looking at this but I wasn't sure exactly what you were trying to do, and I've never been a huge fan of DAG combines that create the wrong node just to expand it later. It looks like for legal types this can lead to a nice decrease in instruction count though.

For smaller types I'm not sure that checking for individual opcodes for extension will work well. They could already be extending loads for example. I've not thought about it too much yet, but As far as I understand from the original hadd (/avg) work it would probably be better to be checking that the top bits are known 0 for unsigned, and that there are >1 sign bits for the signed cases.

In any case, it would be good to see alive proofs for the transforms you are making.


This can be moved into the start of LowerAVG to make this function a little more regular. (I know it's not super regular already, but other parts already do the same thing).


Can it use countTrailingOnes as opposed to countPopulation? Although I think these functions might be better removed and use computeKnownBits checks instead.


This can just be called LowerAVG


check should be Check


Why is it OK to only check one of the operands?


Doesn't need the else after the return, which might help simplify if this is checking the sign bits.


There is already a test for hadds in sve in sve2-hadd. This test looks like a copy of it. It would probably be better to just use the existing test with an extra run line for SVE1 targets.

4 ↗(On Diff #499568)

Like Eli said it would be good to fix vscale x 1 vectorization so this wasn't a problem any more.

hassnaa-arm marked 6 inline comments as done.

Remove sve-avgfloor testing file.
Add RUN line for sve to sve2-hadd
rename sve2-hadd to sve-hadd

hassnaa-arm added inline comments.Feb 28 2023, 4:53 AM

I think using countTrailingOnes is much simpler than computeKnownBits because the operand and its value are already known.

sdesmalen added inline comments.Feb 28 2023, 5:46 AM

I presume that @dmgreen meant using computeKnownBits on Node as a whole, not specifically on the splatted value so that you don't need to check for the opcode explicitly.

See for example checkZExtBool, where it does:

APInt RequredZero(SizeInBits, 0xFE);
KnownBits Bits = DAG.computeKnownBits(Arg, 4);
bool ZExtBool = (Bits.Zero & RequredZero) == RequredZero;

You can probably do a similar thing for IsSignExtended, but then looking at the Bits.One instead of Bits.Zero.

Use computeKnownBits for checking zeroExtedn/signExtend.

Use ComputeNumSignBits instead of ComputeKnownBits for SIGN_EXTEND_INREG ops.

hassnaa-arm added inline comments.Mar 1 2023, 7:42 AM

I used DAG.ComputeNumSignBits instead of computeKnownBits, because computeKnownBits doesn't get any known bits for SIGN_EXTEND_INREG

104 ↗(On Diff #501511)

@dmgreen I tried to get alive proofs for this transform,
but it seems there is something wrong.
I'm not sure if the problem related to my equivalent IR to the generated code or the problem is because the transform is not correct.

Hi - I was just looking at the patch whilst you updated it! Please ignore any comments that don't apply any more.


It might be better to make these lambdas inside LowerAVG. The names sound fairly generic but they are in practice tied to hadd nodes.




I believe this only needs 1 bit to be zero, so it could probably use Known.Zero.isSignBitSet()?
There is a proof in for rhadd.


I think this one would be better to use computeNumSignBits. It's less important whether they are 0 or 1, just that they are the same as the top bit. Again I think it needs 1 bit to be valid, so computeNumSignBits(..) > 1 should do it.

It would be good to see proofs for the other bit here too where the hadd is expanded to SRL(A) + SRL(B) + (A&B)&1


I would probably do:

unsigned Opc = Op->getOpcode();
bool IsCeil = Opc == ISD::AVGCEILS || Opc == ISD::AVGCEILU;
bool IsSigned = Opc == ISD::AVGFLOORS || Op->getOpcode() == ISD::AVGCEILS;

&&, not ||, I would expect.

A HADD is equivalent to trunc(shr(add(ext(x),ext(y)), 1)), it is not directly equivalent to shr(add(x,y), 1). So we need to prove that turning it into the simpler version add+shift is better. It's not really "emit the original code".

75 ↗(On Diff #501511)

This ideally shouldn't change. Because the top bit isn't demanded instcombine will transform ashr into lshr, and we should be testing what the backend will see: I guess the lshr version isn't transformed after it gets promoted? It might be OK in this case.

104 ↗(On Diff #501511)

Can you update the link? There are some other alive links that could help.

hassnaa-arm marked 5 inline comments as done.Mar 1 2023, 12:13 PM
hassnaa-arm added inline comments.

Is that because the sign bit will be known to be zero ONLY for zero-extend case ?

75 ↗(On Diff #501511)

But for some cases, using lshr in the IR cause generating extra AND instruction.

Check if both operands of AVG are extended, not just single one.

dmgreen added inline comments.Mar 2 2023, 12:21 AM

Oh right. I was assuming that it would use if (!IsSigned && IsZeroExtended(OpA) && IsZeroExtended(OpB)) and if (IsSigned && IsSignExtended(OpA) && IsSignExtended(OpB))

An hadd / rhadd is defined as converting into a arbitrary wide integer before doing the add/shift and then converting back. It just happens that it only needs 1 bit extra for that to be equivalent to any other type sizes. So we only need to check the top bit is known to be 0. (isSignBitSet doesnt really have anything to do with signedness here, its just a way of checking the top bit).

hassnaa-arm marked an inline comment as done.Mar 2 2023, 6:59 AM

While checking isZeroExtending, only checking the signbit of known Zeros is enough.

sdesmalen added inline comments.Mar 2 2023, 7:34 AM

Move this closer to use.


nit: unnecessary newline.

Also, these lines exceed the 80char limit, so please run clang-format on this code.


You can remove this condition, it's covered by DAG.ComputeNumSignBits.


You can combine these cases doing something like this:

if ((IsSigned && IsSignExtended(OpA) && IsSignExtended(OpB)) ||
    (!IsSigned && IsZeroExtended(OpA) && IsZeroExtended(OpB))) {
  unsigned ShiftOpc = IsSigned ? ISD::SRA : ISD::SRL;
  return DAG.getNode(ShiftOpc, dl, VT, Add, ConstantOne);

nit: SDValue Tmp = DAG.getNode(IsCeil ? ISD::OR : ISD::AND, dl, VT, OpA, OpB);

hassnaa-arm marked 5 inline comments as done.

Enhance code readability.

Thanks. here are some alive proofs for the transform in and

Can you extend the testing to include both ashr and lshr versions? They should both be useful if we are custom legalizing the nodes. Otherwise I think this looks good.

25 ↗(On Diff #502090)

Can you copy these tests so there are versions with both lshr and ashr.

Add test cases for logical shr.

hassnaa-arm added inline comments.Mar 7 2023, 2:17 AM
904 ↗(On Diff #502962)

here is a transformation proof for this case:

dmgreen accepted this revision.Mar 7 2023, 2:59 AM

I think it's worth adding test for both the ashr and lshr versions, but otherwise I think this LGTM. Thanks

166 ↗(On Diff #502962)

Is there a lshr version of this one? It would be good to have some that are "full width" and use lshr.
As instcombine will convert all the ashr to lshr it might be best to make sure there are tests for all the functions that were changed.

904 ↗(On Diff #502962)

I think that's another case. This one would be From what I can tell they all look OK, according to alive.

Thanks for all the changes @hassnaa-arm, I've just left some final minor comments.


Can you add a comment explaining what this does, e.g. something like

When x and y are extended, lower:
  avgfloor(x, y) -> (x + y) >> 1
  avgceil(x, y)  -> (x + y + 1) >> 1

Otherwise, lower to:
  avgfloor(x, y) -> (x >> 1) + (y >> 1) + (x && y && 1)
  avgceil(x, y)  -> (x >> 1) + (y >> 1) + ((x || y) && 1)

nit: this comment would be redundant if you address my other comment (to add a comment for the function itself)



SDValue ShiftOpA = DAG.getNode(ShiftOpc, dl, VT, OpA, ConstantOne);
SDValue ShiftOpB = DAG.getNode(ShiftOpc, dl, VT, OpB, ConstantOne);
hassnaa-arm marked 3 inline comments as done.Mar 7 2023, 4:42 AM
hassnaa-arm added inline comments.
166 ↗(On Diff #502962)

Sorry, I don't understand what you mean by "full width" ?

Add comments explaining what LowerAvg() does.

dmgreen added inline comments.Mar 7 2023, 5:10 AM
166 ↗(On Diff #502962)

I just meant a multiple of 128 - a <vscale x 4 x i32>. There appear to be <vscale x 2 x ..> tests for lshr, but we should have all the others sizes too.

hassnaa-arm marked an inline comment as done.Mar 8 2023, 1:38 PM

Add test cases that use lshr.

@dmgreen Thanks for reviewing the patch. Do you have any further comments ?

sdesmalen accepted this revision.Mar 9 2023, 6:14 AM

Thanks for the changes @hassnaa-arm, I'm satisfied with the patch now so removing my 'requesting changes'.
Unless @dmgreen has more comments on the tests, I'm happy for this patch to land.

This revision is now accepted and ready to land.Mar 9 2023, 6:14 AM
dmgreen accepted this revision.Mar 10 2023, 3:14 AM

Yeah nothing else from me. LGTM, thanks for the changes.

This revision was landed with ongoing or failed builds.Mar 13 2023, 12:01 PM
This revision was automatically updated to reflect the committed changes.