This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Attempt to match OP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
ClosedPublic

Authored by RKSimon on Jul 14 2020, 9:59 AM.

Details

Summary

An initial backend patch towards fixing the various poor HADD combines (PR34724, PR41813, PR45747 etc.).

This extends isHorizontalBinOp to check if we have per-element horizontal ops (odd+even element pairs), but not in the expected serial order - in which case we build a "post shuffle mask" that we can apply to the HOP result, assuming we have fast-hops/optsize etc.

The next step will be to extend the SHUFFLE(HOP(X,Y)) combines as suggested on PR41813 - accepting more post-shuffle masks even on slow-hop targets if we can fold it into another shuffle.

Diff Detail

Event Timeline

RKSimon created this revision.Jul 14 2020, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2020, 9:59 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon updated this revision to Diff 277905.Jul 14 2020, 11:11 AM

clang-format

Need to add more tests to make sure integer and 256-bit work as expected.

What do you think about consolidating the h-op creation in 1 helper as a preliminary to avoid the caller code duplication? We can also more easily assert properties of the shuffle mask (eg, only chooses from operand 0?).

I had this:

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 450927aaf5c..3ed354d483f 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -44507,21 +44507,66 @@ static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, SelectionDAG &DAG,
   return true;
 }
 
+/// Try to synthesize horizontal add/sub from adds/subs of shuffles.
+static SDValue getHorizontalBinop(SDNode *N, SelectionDAG &DAG,
+                                  const X86Subtarget &Subtarget) {
+  unsigned Opcode = N->getOpcode();
+  unsigned HorizOpcode;
+  switch (Opcode) {
+  case ISD::ADD: HorizOpcode = X86ISD::HADD; break;
+  case ISD::SUB: HorizOpcode = X86ISD::HSUB; break;
+  case ISD::FADD: HorizOpcode = X86ISD::FHADD; break;
+  case ISD::FSUB: HorizOpcode = X86ISD::FHSUB; break;
+  default:
+    llvm_unreachable("Unexpected opcode for horizontal op");
+  }
+
+  EVT VT = N->getValueType(0);
+  if (!VT.isSimple())
+    return SDValue();
+  switch (VT.getSimpleVT().SimpleTy) {
+  case MVT::v8i16:
+  case MVT::v4i32:
+  case MVT::v16i16:
+  case MVT::v8i32:
+    // 256-bit vectors without AVX2 are handled by splitting below.
+    if (!Subtarget.hasSSSE3())
+      return SDValue();
+    break;
+  case MVT::v4f32:
+  case MVT::v2f64:
+    if (!Subtarget.hasSSE3())
+      return SDValue();
+    break;
+  case MVT::v8f32:
+  case MVT::v4f64:
+    if (!Subtarget.hasAVX())
+      return SDValue();
+    break;
+  default:
+    return SDValue();
+  }
+
+  SDValue Op0 = N->getOperand(0), Op1 = N->getOperand(1);
+  bool IsCommutable = (Opcode == ISD::ADD || Opcode == ISD::FADD);
+  if (!isHorizontalBinOp(Op0, Op1, DAG, Subtarget, IsCommutable))
+    return SDValue();
+
+  if (VT.getScalarType().isFloatingPoint())
+    return DAG.getNode(HorizOpcode, SDLoc(N), VT, Op0, Op1);
+
+  auto HopBuilder = [&](SelectionDAG &DAG, const SDLoc &DL,
+                        ArrayRef<SDValue> Ops) {
+    return DAG.getNode(HorizOpcode, DL, Ops[0].getValueType(), Ops);
+  };
+  return SplitOpsAndApply(DAG, Subtarget, SDLoc(N), VT, {Op0, Op1}, HopBuilder);
+}
+
 /// Do target-specific dag combines on floating-point adds/subs.
 static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,
                                const X86Subtarget &Subtarget) {
RKSimon updated this revision to Diff 280198.Jul 23 2020, 10:35 AM

rebase with new tests

RKSimon updated this revision to Diff 280679.Jul 25 2020, 7:35 AM

Relax lane crossing limitations for AVX2+ targets

RKSimon updated this revision to Diff 280687.Jul 25 2020, 11:48 AM

Always permit lane crossing on AVX2 targets, and on non-AVX2 targets, integer hops will always be split to 128-bits so permit lane crossing there as well.

spatel added inline comments.Jul 27 2020, 11:47 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44457

Reduce negative logic?
// Avoid 128-bit lane crossing if this is pre-AVX2 and FP (integer will be split).
if (!Subtarget.hasAVX2 && VT.isFloatingPoint() && ...)

RKSimon updated this revision to Diff 281024.Jul 27 2020, 12:41 PM

simplify lane-crossing logic

spatel accepted this revision.Jul 27 2020, 1:01 PM

LGTM

This revision is now accepted and ready to land.Jul 27 2020, 1:01 PM
This revision was landed with ongoing or failed builds.Jul 28 2020, 2:04 AM
This revision was automatically updated to reflect the committed changes.