This is an archive of the discontinued LLVM Phabricator instance.

[ISel] Port AArch64 HADD and RHADD to ISel
ClosedPublic

Authored by dmgreen on Jul 18 2021, 7:12 AM.

Details

Summary

This ports the aarch64 combines for HADD and RHADD over to DAG combine, so that they can be used in more architectures (notably MVE in a followup patch). They are renamed to AVGFLOOR and AVGCEIL in the process, to avoid confusion with instructions such as X86 hadd. The code was also rewritten slightly to remove the AArch64 idiosyncrasies.

The general pattern for a AVGFLOORS is

%xe = sext i8 %x to i32
%ye = sext i8 %y to i32
%a = add i32 %xe, %ye
%r = lshr i32 %a, 1
%t = trunc i32 %r to i8

An AVGFLOORU is equivalent with zext. Because of the truncate lshr==ashr, as the top bits are not demanded. An AVGCEIL also includes an extra rounding, so includes an extra add of 1.

Diff Detail

Event Timeline

dmgreen created this revision.Jul 18 2021, 7:12 AM
dmgreen requested review of this revision.Jul 18 2021, 7:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2021, 7:12 AM

X86 might be able to reuse this for X86ISD::AVG

llvm/include/llvm/CodeGen/ISDOpcodes.h
620

having add?
Maybe add a code snippet?

625

Maybe add a code snippet?

efriedma added inline comments.Jul 18 2021, 11:44 AM
llvm/include/llvm/CodeGen/ISDOpcodes.h
620

There's really very little incentive to keep ISD opcode names short, particularly rarely used ones like this. I'd prefer to spell it out (particularly the "h"; for example, x86 has "haddps", where the "h" stands for "horizontal").

Maybe ADD_HALVE_UNSIGNED_ROUND_DOWN and ADD_HALVE_SIGNED_ROUND_UP?

RKSimon requested changes to this revision.Nov 3 2021, 8:27 AM

@dmgreen - if you're still looking at this, the patch needs a rename of the node enum types and preferably we replace X86ISD::AVG to use this as well.

This revision now requires changes to proceed.Nov 3 2021, 8:27 AM
craig.topper added inline comments.Nov 5 2021, 12:21 PM
llvm/include/llvm/CodeGen/ISDOpcodes.h
620

Halide calls these halving_add and rounding_halving_add.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
935 ↗(On Diff #359623)

Is zext supposed to be sext?

945 ↗(On Diff #359623)

Can this be DemandedBits.isPositive() or DemandedBits.isSignBitClear()? Counting bits seems like overkill.

Instead of making this part of SimplifyDemandedBits, could you emit (and (sext (hadds X, Y)), 0x7fffffff) for the (lshr (add (sext(X), sext(Y)), 1) case and let the AND be optimized by itself? Or would the transform not be profitable if it doesn't get removed?

I haven't had a change to look at it recently - I've been busy with other things. I hope to get back to this at some point - it would be good to share it between backends.

llvm/include/llvm/CodeGen/ISDOpcodes.h
620

I'll be honest - I didn't much like the sound of ADD_HALVE_UNSIGNED_ROUND_DOWN. I'm not a fan of how long that name is. Arm uses the name HADD, with S/U or V before it. X86 uses AVG and RISCV uses AAVG I believe, if I was looking up the right instruction. I'm not sure if they have the same concepts of rounding RHADD though, which rounds up. I think gcc calls them AVG_FLOOR and AVG_CEIL.

Any of those names or HALVING_ADD would be find by me.

Instead of making this part of SimplifyDemandedBits, could you emit (and (sext (hadds X, Y)), 0x7fffffff) for the (lshr (add (sext(X), sext(Y)), 1) case and let the AND be optimized by itself? Or would the transform not be profitable if it doesn't get removed?

Yeah that might work. Like https://alive2.llvm.org/ce/z/9pwKEi. I'll have to check if it looks worse anywhere - I think it should be fine in general, possibly minus the cost of materializing the constant.

arcbbb added a subscriber: arcbbb.Nov 7 2021, 9:03 AM
dmgreen updated this revision to Diff 395353.Dec 19 2021, 1:20 PM
dmgreen edited the summary of this revision. (Show Details)

This now:

  • Calls the nodes AVGFLOOR and AVGCEIL.
  • Ports X86ISD::AVG to use AVGCEILU.
  • Doesn't remove detectAVGPattern, as it can detect more pattern, such as illegal types. It adds some basic widening support to handle them.
lebedev.ri added a subscriber: lebedev.ri.
lebedev.ri added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
908 ↗(On Diff #395353)

There's also the case where one of the operands is a constant, so you don't have an explicit +1

947–950 ↗(On Diff #395353)

Since you have already avoided the other misdesign (by starting from the shift and not trunc),
can we also avoid one here, by looking at the known sign/leading zero bits instead of an explicit extension?

960–961 ↗(On Diff #395353)

Don't you need to check that the addition is no-wrap,
i.e. don't you need to check how many bits were added during extension?

Right now this seems like a miscompile, one that will be avoided
if you use known bits instead of looking for extensions.

dmgreen updated this revision to Diff 396444.Dec 28 2021, 4:48 PM
dmgreen edited the summary of this revision. (Show Details)

Now using some known bits.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
908 ↗(On Diff #395353)

For ARM/Arch64 that would be an hadd (avgfloor), IIUC. For X86 where avgceil is legal but avgfloor is not, we can probably do something about that if needed, or make it target specific. It's probably best not to try and address ever issue in this single patch though.

947–950 ↗(On Diff #395353)

Sure, sounds good. I've given it a go, but it might be a bit messy. We have to get a type to use for the legalization, which was previously just using the extend type. Let me know if you have any suggestions.

960–961 ↗(On Diff #395353)

MVE already has some nowrap tablegen patterns, as they don't require any extends and can be recognised from a add+shift.

I've not added anything with nowrap here (maybe we can extend this in a future patch, so long as this is correct so far). The other patterns mostly verify with a single bit of extension, so long as I have that right, and I've added checks where they seem to need 2.

lebedev.ri added inline comments.Dec 29 2021, 1:54 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
987–988 ↗(On Diff #396444)

Can we get to here where the operations aren't legal yet?

908 ↗(On Diff #395353)

Sure, this can be done in a follow-up.

What i'm saying is that if you want to catch more cases,
you might want to try to deal with add(ext, C) as with add(ext, add(C-1, 1)) (or vice versa).
Probably this might be best done by adding a wrapper function,
and adding a parameter to this function, bool LookingForCeil.

dmgreen added inline comments.Jan 3 2022, 7:10 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
987–988 ↗(On Diff #396444)

Yes I think so (IIUYC). It'll be run as part of any of the DAG combines. There is very little legalization for AVG at the moment, so I'm not sure we can create a lot of illegal nodes.

(But I'm not an expert of when it's best to create illegal target-independent nodes and when not)

Matt added a subscriber: Matt.Jan 7 2022, 7:33 AM
craig.topper added inline comments.Jan 7 2022, 1:15 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18643

This APInt goes with the code after this new code. Move it down?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
946 ↗(On Diff #396444)

This needs to forward the Depth from the caller.

957 ↗(On Diff #396444)

ComputeNumSignBits can call computeKnownBits internally and will count leading 0s as sign bits. Does the zero check need to have priority here?

Instead of making this part of SimplifyDemandedBits, could you emit (and (sext (hadds X, Y)), 0x7fffffff) for the (lshr (add (sext(X), sext(Y)), 1) case and let the AND be optimized by itself? Or would the transform not be profitable if it doesn't get removed?

Yeah that might work. Like https://alive2.llvm.org/ce/z/9pwKEi. I'll have to check if it looks worse anywhere - I think it should be fine in general, possibly minus the cost of materializing the constant.

Did you check on this?

RKSimon added inline comments.Jan 10 2022, 10:21 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1351 ↗(On Diff #396444)

These can be custom lowered on AVX1 targets by splitting, but I'm OK with just a TODO for now.

1654 ↗(On Diff #396444)

These can be custom lowered on non-AVX512BW targets by splitting, but I'm OK with just a TODO for now.

dmgreen updated this revision to Diff 402013.Jan 21 2022, 9:25 AM
dmgreen marked 14 inline comments as done.
dmgreen edited the summary of this revision. (Show Details)

Instead of making this part of SimplifyDemandedBits, could you emit (and (sext (hadds X, Y)), 0x7fffffff) for the (lshr (add (sext(X), sext(Y)), 1) case and let the AND be optimized by itself? Or would the transform not be profitable if it doesn't get removed?

Yeah that might work. Like https://alive2.llvm.org/ce/z/9pwKEi. I'll have to check if it looks worse anywhere - I think it should be fine in general, possibly minus the cost of materializing the constant.

Did you check on this?

Yeah - AArch64 as a saddl instruction that can perform 'add(sext, sext)' as a single instruction. That with a shift will be better than a hadd;extend;bic, and it will look worse if the And is less efficient (possibly having to dup a constant).

With the trunc it will perform OK, I'm not sure what the likelyhood of sext with lshr will be without it. With the trunc it will be one of the most common cases, given how the ashr is converted to a lshr.

Providing that the depth is limited, is there an issue with doing it as a demand bits combine?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
957 ↗(On Diff #396444)

I was trying to keep SRA producing HADDS, and SRL preferring HADDU. I think it should be using the known bits to preferably make the lowest bitwidth hadd.

llvm/lib/Target/X86/X86ISelLowering.cpp
1351 ↗(On Diff #396444)

Thanks. I thought that the natural splitting of vectors would capture that already.

Providing that the depth is limited, is there an issue with doing it as a demand bits combine?

My main reason for questioning was because as far as I know this is a larger transform than anything else we do in SimplifiedDemandedBits. So it seemed a bit of a new direction for SimplifiedDemandedBits, but if we're not concerned about that then I don't object to this patch.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
953 ↗(On Diff #402013)

I'm not sure, but would it be better to use countMaxActiveBits and ComputeMaxSignificantBits. And do all the comparisons against the scalar bit width instead of against 1 and 2.

1000 ↗(On Diff #402013)

changeVectorElementType is a little broken. If the VT is an MVT, but the VT with the element type replaced isn't an MVT, the function will assert because it needs an LLVMContext to create an EVT which it can't get from the MVT.

Maybe it's ok in this case as long as our MVTs always have all power of 2 element types smaller for each MVT. Like MVT::vXi32 would need MVT::vXi16 and MVT::vXi8 to exist.

dmgreen updated this revision to Diff 402489.Jan 24 2022, 5:42 AM

Now uses EVT::getVectorVT

Providing that the depth is limited, is there an issue with doing it as a demand bits combine?

My main reason for questioning was because as far as I know this is a larger transform than anything else we do in SimplifiedDemandedBits. So it seemed a bit of a new direction for SimplifiedDemandedBits, but if we're not concerned about that then I don't object to this patch.

Yeah that's fair enough. I had considered it, and was hoping that the initial condition (ConstantSDNode *N1C = isConstOrConstSplat(Op.getOperand(1)); if(N1C && N1C->isOne())..) would rule most stuff out, limiting how expensive this is in practice. In my mind recursing into a SimplifiedDemandedBits call for "N" steps is going to be more expensive than any one transform inside it, due to the number of nodes potentially visited. But they will be related, and I have no hard evidence about which part is potentially slower.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
953 ↗(On Diff #402013)

In my mind, saying an add requires a single extra bit is simpler than dealing with MaxActiveBits and BitWidths. But if you think it's worth changing I'm happy to change it.

I've lost track of this patch a little - would it make sense to split it into two parts - the first just replaces the target opcodes with the generic ISD variants, and the second then begins the process of moving the matching into generic DAG?

dmgreen planned changes to this revision.Jan 30 2022, 11:54 PM

Yeah sounds good. This patch was certainly feeling too big

dmgreen updated this revision to Diff 406179.Feb 6 2022, 1:33 AM
dmgreen edited the summary of this revision. (Show Details)

This is now a more straight-forward porting of the code from AArch64 to DAGCombine. The X86 parts have been moved into a different patch, and this now starts at a truncate in the same way that the AArch64 code did.

Please tidyup the clang-format warnings.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14357

update the urhadd references

dmgreen updated this revision to Diff 406287.Feb 6 2022, 2:32 PM

Attempt to clean up formatting, spelling and some node names.

RKSimon accepted this revision.Feb 7 2022, 12:11 AM

LGTM with a few minors

llvm/include/llvm/CodeGen/ISDOpcodes.h
622

Really pedantic but none of these description refer to these being averaging adds, yet a lot of the comments in other places just refers to them as averages.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12699

assert(N->getOpcode() == ISD::TRUNCATE && "TRUNCATE node expected");

12759

Is this assert necessary?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14349

update these as well

This revision is now accepted and ready to land.Feb 7 2022, 12:11 AM
This revision was landed with ongoing or failed builds.Feb 11 2022, 10:29 AM
This revision was automatically updated to reflect the committed changes.