This is an archive of the discontinued LLVM Phabricator instance.

[x86] favor vector constant load to avoid GPR to XMM transfer, part 2
ClosedPublic

Authored by spatel on May 18 2020, 8:49 AM.

Details

Summary

This replaces the build_vector lowering code that was just added in D80013 and instead matches the pattern later, from the x86-specific "vzext_movl" node. That seems to result in the same or better improvements and gets rid of the TODO items from that patch.

AFAICT, we always shrink wider constant vectors to 128-bit on these patterns, so we still get the implicit zero-extension to ymm/zmm without wasting space on larger vector constants. There's a trade-off there because that means we miss potential load-folding.

Similarly, we could load scalar constants here with implicit zero-extension even to 128-bit. That saves constant space, but it means we forgo load-folding, and so it increases register pressure. This seems like a good middle ground between those two options?
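
As an illustration of the kind of codegen difference being discussed, consider inserting a scalar constant into lane 0 of an otherwise-zero vector. The function name and the assembly in the comments are assumptions about typical output, not taken from this review's tests.

#include <immintrin.h>

// Hypothetical example: a scalar constant placed in lane 0 of a zeroed vector.
__m128i make_low_constant() {
  // Before: materialize the immediate in a GPR, then transfer it to an XMM
  // register (assumed pre-patch codegen):
  //   movl  $42, %eax
  //   vmovd %eax, %xmm0
  // After: load a zero-padded 128-bit constant-pool entry directly
  // (assumed post-patch codegen):
  //   vmovaps .LCPI0_0(%rip), %xmm0
  return _mm_cvtsi32_si128(42);
}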

Diff Detail

Event Timeline

spatel created this revision. May 18 2020, 8:49 AM
Herald added a project: Restricted Project. May 18 2020, 8:49 AM
spatel retitled this revision from [x86] favor vector constant load to avoid GPR to XMM transert, part 2 to [x86] favor vector constant load to avoid GPR to XMM transfer, part 2. May 18 2020, 8:49 AM
RKSimon added inline comments. May 18 2020, 11:18 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
35802

Would creating X86ISD::VZEXT_LOAD be better? We'd need to zext i8/i16 cases to i32 constant pool entries but we'd avoid the upper zero constants.

spatel marked an inline comment as done. May 18 2020, 1:20 PM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
35802

My 1st draft of this patch used VZEXT_LOAD, and I think it worked without having to explicitly zext here. Does this match what you're thinking of? If so, this leads to the trade-off mentioned in the description - we save some constant space, but lose some load folds.

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 655147076a4..dc61867b418 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -35788,6 +35788,26 @@ static SDValue combineTargetShuffle(SDValue N, SelectionDAG &DAG,
       }
     }

+    // Load a scalar integer constant directly to XMM instead of transferring an
+    // immediate value from GPR.
+    // TODO: Would it be better to load a 128-bit vector constant instead?
+    // vzext_movl (scalar_to_vector C) --> vzext_load &C
+    if (N0.getOpcode() == ISD::SCALAR_TO_VECTOR) {
+      if (auto *C = dyn_cast<ConstantSDNode>(N0.getOperand(0))) {
+        MVT PVT = DAG.getTargetLoweringInfo().getPointerTy(DAG.getDataLayout());
+        SDValue CP = DAG.getConstantPool(C->getConstantIntValue(), PVT);
+        Align Alignment = cast<ConstantPoolSDNode>(CP)->getAlign();
+        EVT ScalarVT = N0.getOperand(0).getValueType();
+        SDVTList MemTys = DAG.getVTList(VT, MVT::Other);
+        SDValue MemOps[] = {DAG.getEntryNode(), CP};
+        MachinePointerInfo MPI =
+            MachinePointerInfo::getConstantPool(DAG.getMachineFunction());
+        return DAG.getMemIntrinsicNode(X86ISD::VZEXT_LOAD, DL, MemTys, MemOps,
+                                       ScalarVT, MPI, Alignment,
+                                       MachineMemOperand::MOLoad);
+      }
+    }
+
     return SDValue();
   }
   case X86ISD::BLENDI: {
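
For reference, here is an illustration of the load-folding trade-off mentioned above; the function name and the assembly in the comments are assumptions about typical codegen, not taken from this patch's tests. A narrow vzext_load cannot be folded into an instruction that needs a full 128-bit memory operand, whereas a zero-padded 128-bit constant-pool entry can be.

#include <immintrin.h>

__m128i add_constant(__m128i x) {
  // A vzext_load of only 4 bytes cannot be folded into vpaddd, which reads a
  // full 16 bytes from memory. With a zero-padded 128-bit constant-pool entry,
  // the load folds directly (assumed codegen):
  //   vpaddd .LCPI0_0(%rip), %xmm0, %xmm0
  return _mm_add_epi32(x, _mm_cvtsi32_si128(42));
}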
spatel updated this revision to Diff 264959. May 19 2020, 10:38 AM

Patch updated:
This is rebased on top of D80223, so we can see incremental diffs like enabling more load folding.

Ideally we'd have a way to fold vzext_load ops inside X86InstrInfo::foldMemoryOperandCustom (by zero padding the constant pool entry where necessary) but I'm not certain how easy that is.

So we probably want to go with this variant (sorry for the wild goose chase).

@craig.topper any thoughts?

I think the caller of foldMemoryOperandImpl is responsible for copying the memoperand over to the new instruction, so changing the memory reference out from under it will break that at the very least. We'd also be deferring our usual load folding to the peephole pass, which isn't quite as strong as SelectionDAG, I think.

If the load is in a loop, we potentially unfold it in MachineLICM and hoist it out of the loop. So maybe what we really want is a later constant-pool shrinking pass that runs after MachineLICM. We have a similar issue with broadcasts from the constant pool, don't we? Lowering of build_vector favors forming broadcasts of constants without knowing if we can fold.
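
A minimal sketch of the "zero padding the constant pool entry" idea mentioned above, using the IR-level constant APIs; the helper name and shape are assumptions for illustration, not code from this review.

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"

// Widen a scalar integer constant into a zero-padded 128-bit vector constant,
// so that a full-width load of the pool entry yields the same value as a
// vzext_load of the scalar alone.
static llvm::Constant *widenToV128(llvm::ConstantInt *C) {
  unsigned ScalarBits = C->getBitWidth();                  // e.g. 32 for i32
  unsigned NumElts = 128 / ScalarBits;                     // lanes in 128 bits
  llvm::SmallVector<llvm::Constant *, 16> Elts(
      NumElts, llvm::ConstantInt::get(C->getType(), 0));   // zero padding
  Elts[0] = C;                                             // value in lane 0
  return llvm::ConstantVector::get(Elts);                  // <C, 0, 0, ...>
}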

RKSimon accepted this revision. May 23 2020, 7:53 AM

LGTM - let's go with this approach. I've raised PR46048 to hopefully help us with reducing the size of constant pool loads when useful.

This revision is now accepted and ready to land. May 23 2020, 7:53 AM
This revision was automatically updated to reflect the committed changes.