This is an archive of the discontinued LLVM Phabricator instance.

lkail retitled this revision from Merge consecutive stores of vector elements before types are legalized to [PowerPC] Merge consecutive stores of vector elements before types are legalized. Jun 4 2019, 10:49 PM

lkail added reviewers: bogner, niravd, MaskRay. Jun 5 2019, 11:44 PM

This seems like it should be folded into the already existing checks. Do you know why NumStoresToMerge was not being set before? I expect it's either the requirement of a legal type in pre-legalization merges or PPC's check for allowed misaligned accesses missing some cases. If it's the former, I suspect we can disable the legality requirement before type legalization for non-truncated stores (replace TLI.isTypeLegal(ty) with isTypeLegal(ty)).

niravd added a subscriber: fhahn. Jun 7 2019, 11:34 AM

In D62890#1532667, @niravd wrote:

This seems like it should be folded into the already existing checks. Do you know why NumStoresToMerge was not being set before? I expect it's either the requirement of a legal type in pre-legalization merges or PPC's check for allowed misaligned accesses missing some cases. If it's the former, I suspect we can disable the legality requirement before type legalization for non-truncated stores (replace TLI.isTypeLegal(ty) with isTypeLegal(ty)).

Thanks for responding, @niravd. I currently have no idea why NumStoresToMerge was not being set before. I'll look at the patches related to this portion of the code. I notice that merging consecutive stores of vector elements was first introduced by https://reviews.llvm.org/rL224611, in which a legal type was already a requirement. I'll try what you suggested.

In D62890#1532667, @niravd wrote:

This seems like it should be folded into the already existing checks. Do you know why NumStoresToMerge was not being set before? I expect it's either the requirement of a legal type in pre-legalization merges or PPC's check for allowed misaligned accesses missing some cases. If it's the former, I suspect we can disable the legality requirement before type legalization for non-truncated stores (replace TLI.isTypeLegal(ty) with isTypeLegal(ty)).

Hi @niravd, after investigating the code carefully, I think this check might not be redundant. Consider the case where we have 3 i32 values extracted from vectors: both isTypeLegal and TLI.isTypeLegal see v3i32 as illegal. However, MergeStoresOfConstantsOrVecElts, which is called later by MergeConsecutiveStores, doesn't require type legality and builds a BUILD_VECTOR node whose elements are 3 EXTRACT_VECTOR_ELT values. PowerPC's vector type legalizer can handle such cases, so it can benefit from getNumStoresOfVectorElementsToMergePreLegalize. I know it's odd to add such a check in a context surrounded by type legality checks. I considered implementing it in PPCTargetLowering::PerformDAGCombine, but that might duplicate code.

Hi @niravd, after investigating the code carefully, I think this check might not be redundant. Consider the case where we have 3 i32 values extracted from vectors: both isTypeLegal and TLI.isTypeLegal see v3i32 as illegal.

Are you certain isTypeLegal was returning false in prelegalization? isTypeLegal(x) should be "!LegalTypes || TLI.isTypeLegal(x)". I was expecting that if you swapped out TLI.isTypeLegal for isTypeLegal where it was failing, we could generate an invalid node, but it sounds like that's not the case. If so, I think we should double check.
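The distinction between the two predicates can be sketched with a toy model. This is only an illustration of the relationship "!LegalTypes || TLI.isTypeLegal(x)" stated above; ToyTargetLowering, ToyCombiner and the string type names are hypothetical stand-ins, not LLVM's classes.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the target's view of legal types.
struct ToyTargetLowering {
  std::set<std::string> Legal; // types the toy target has registers for

  bool isTypeLegal(const std::string &VT) const {
    return Legal.count(VT) != 0;
  }
};

// Hypothetical stand-in for the combiner, which knows the current phase.
struct ToyCombiner {
  const ToyTargetLowering &TLI;
  bool LegalTypes; // true once type legalization has run

  // Before type legalization every type passes; afterwards the
  // question is delegated to the target.
  bool isTypeLegal(const std::string &VT) const {
    return !LegalTypes || TLI.isTypeLegal(VT);
  }
};
```

Under this model, a v3i32 query succeeds through the combiner wrapper before type legalization and fails afterwards, while the target-level query fails in both phases.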

However, MergeStoresOfConstantsOrVecElts, which is called later by MergeConsecutiveStores, doesn't require type legality and builds a BUILD_VECTOR node whose elements are 3 EXTRACT_VECTOR_ELT values. PowerPC's vector type legalizer can handle such cases, so it can benefit from getNumStoresOfVectorElementsToMergePreLegalize. I know it's odd to add such a check in a context surrounded by type legality checks. I considered implementing it in PPCTargetLowering::PerformDAGCombine, but that might duplicate code.

@niravd Thanks for pointing out my mistake. I tried swapping out TLI.isTypeLegal for isTypeLegal, but the code generated for PowerPC is not what I expected, because TLI.allowsMemoryAccess returns false if Ty is something like v3i32. This change also breaks some SystemZ and X86 regression tests. I don't quite understand what 'double check' means here; could you please explain more?

By double check, I just meant to look at the results again with isTypeLegal checked, which is where we are?

FWIW, it's probably fine to do something like isTypeLegal with allowsMemoryAccess, though there will likely be more changes in other backends. I expect most to be mundane. It may be worth it to update and see if others are motivated to look into real regressions.

In D62890#1549589, @lkail wrote:

@niravd Thanks for pointing out my mistake. I tried swapping out TLI.isTypeLegal for isTypeLegal, but the code generated for PowerPC is not what I expected, because TLI.allowsMemoryAccess returns false if Ty is something like v3i32. This change also breaks some SystemZ and X86 regression tests. I don't quite understand what 'double check' means here; could you please explain more?

lkail updated this revision to Diff 205740. Jun 19 2019, 11:26 PM

lkail retitled this revision from [PowerPC] Merge consecutive stores of vector elements before types are legalized to [DAGCombiner] Merge consecutive stores of vector elements before types are legalized.

lkail edited the summary of this revision.

Updated the patch. @niravd any further suggestions?

lkail edited reviewers, added: craig.topper, RKSimon, uweigand; removed: MaskRay. Jun 19 2019, 11:34 PM

Looks like it's mostly an improvement though there are some potential regressions around vector shuffles. I'll leave it to the others if it's acceptable to land.

Also, what happened to the PPC test changes?

The X86 changes LGTM

niravd added a reviewer: jonpa. Jun 20 2019, 1:22 PM

Also, what happened to the PPC test changes?

Because PPC's allowsMemoryAccess lacks information about the current CombineLevel, it fails the check (only a few vector types are allowed to have misaligned access). As a result, there are currently no changes in PPC's code. I might try to solve this problem in another patch.

Hi @jonpa, could you check whether the SystemZ change is a real regression?

Because PPC's allowsMemoryAccess lacks information about the current CombineLevel, it fails the check (only a few vector types are allowed to have misaligned access). As a result, there are currently no changes in PPC's code. I might try to solve this problem in another patch.

Ah. Can you solve this by doing the analog to isTypeLegal (vs. TLI.isTypeLegal) and assuming it's true before legalization? If it's that small, you should (but don't feel like you must) fold it into this patch.

In D62890#1554934, @lkail wrote:

Hi @jonpa, could you check whether the SystemZ change is a real regression?

At a first glance this seems to be generating worse code now since we need to do all those permute-type instructions in order to get the bytes into the correct order in a single register to store ... I'll clarify with the hardware folks which of the sequences would actually be preferable.

If it turns out that there are instances where *not* merging stores (even when it would be *possible*) is preferred from a performance perspective, should common code use some cost function here?

Considering the suggestions of @niravd and @uweigand, would it be appropriate to have an implementation like

bool DAGCombiner::isAbleToMergeConsecutiveStoresPreLegalize(ArrayRef<SDNode *> elements, ... /* params related to alignment, address space, etc. */) {
  if (LegalTypes)
    return false;
  // Let the target decide the cost, considering the elements to be stored.
  return TLI.canMergeConsecutiveStoresOfVectorElements(elements, ...);
}

And the new check is

if (isTypeLegal(Ty) &&
    TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG) &&
    ((TLI.allowsMemoryAccess(Context, DL, Ty,
                             *FirstInChain->getMemOperand(), &IsFast) &&
      IsFast) || isAbleToMergeConsecutiveStoresPreLegalize(elements, ...)))

It's reasonable assuming we want to prohibit the SystemZ cases. Note that TLI.canMergeConsecutiveStoresOfVectorElements and TLI.canMergeStoresTo are both only called here and should be merged into a single method.

That said, the permute expression seems like a sign that there are some permutation peepholes that may be worthwhile. We could simplify the permute into a BUILD_VECTOR of a vector stores which should be close enough to the original element-wise stores here.

In D62890#1557119, @lkail wrote:

Considering the suggestions of @niravd and @uweigand, would it be appropriate to have an implementation like

bool DAGCombiner::isAbleToMergeConsecutiveStoresPreLegalize(ArrayRef<SDNode *> elements, ... /* params related to alignment, address space, etc. */) {
  if (LegalTypes)
    return false;
  // Let the target decide the cost, considering the elements to be stored.
  return TLI.canMergeConsecutiveStoresOfVectorElements(elements, ...);
}

And the new check is

if (isTypeLegal(Ty) &&
    TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG) &&
    ((TLI.allowsMemoryAccess(Context, DL, Ty,
                             *FirstInChain->getMemOperand(), &IsFast) &&
      IsFast) || isAbleToMergeConsecutiveStoresPreLegalize(elements, ...)))

@lkail Are you still looking at this please?

Herald added a subscriber: wuzish. Aug 19 2019, 8:28 AM

Hi @RKSimon , currently I'm not working on it. I should have abandoned this patch. I might have another patch which also works for PowerPC.

Since this patch has not been updated for a long time, I'll close it and plan another patch that also works for PowerPC.

tingwang commandeered this revision. Jul 31 2022, 11:24 PM

tingwang added a reviewer: lkail.

Herald added a project: Restricted Project. Jul 31 2022, 11:24 PM

Herald added subscribers: StephenFan, ecnelises, pengfei.

With this change, the SystemZ case is not touched. Two x86 cases still need to be confirmed.

According to previous comments, it may be better to have some cost function to decide whether merging is preferred. I'm not sure how to implement the cost function. One reason is that the final code sequence generated may depend on target-specific flags etc., and the backend at this stage may not have enough information to decide. For example, the results in extract-and-store.ll depend on -ppc-disable-perfect-shuffle=false.

Herald added a project: Restricted Project. Jul 31 2022, 11:45 PM

Harbormaster completed remote builds in B178505: Diff 448921. Jul 31 2022, 11:46 PM

Instead of a cost function, could we use (possibly tweaked) isMultiStoresCheaperThanBitsMerge?

llvm/test/CodeGen/PowerPC/extract-and-store.ll
533	is this really an issue with the store merging or the ppc shuffle combines have gotten messed up?

In D62890#3690606, @RKSimon wrote:

Instead of a cost function, could we use (possibly tweaked) isMultiStoresCheaperThanBitsMerge?

Thank you for pointing that out. I will try to come up with something to guard against the cases that regressed.

llvm/test/CodeGen/PowerPC/extract-and-store.ll
533	Had a quick check: this is a case where PPC::LowerVECTOR_SHUFFLE does not have an efficient solution, so it turned into VPERM as a last resort. Probably the cost function should avoid this kind of situation.

RKSimon added inline comments. Aug 1 2022, 5:12 AM

llvm/test/CodeGen/PowerPC/extract-and-store.ll
533	I'd be nervous about a cost function as that is likely to be very difficult to keep balanced. I'd probably recommend just overriding canMergeStoresTo or isMultiStoresCheaperThanBitsMerge for PPC

Added cost function for target as suggested. The function on PPC tries to avoid some patterns that lead to TOC accesses.

Harbormaster completed remote builds in B178715: Diff 449219. Aug 2 2022, 1:38 AM

tingwang added inline comments. Aug 2 2022, 1:42 AM

llvm/test/CodeGen/PowerPC/extract-and-store.ll
533	Thank you for the advice. I created a new function, hope that is fine.

This code seems quite unnecessarily complex. I can achieve essentially the same results with something like this:

diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 36973f5bddb0..984e84ba6fdc 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -939,6 +939,9 @@ public:
            (unsigned)VT.getSimpleVT().SimpleTy < array_lengthof(RegClassForVT));
     return VT.isSimple() && RegClassForVT[VT.getSimpleVT().SimpleTy] != nullptr;
   }
+  virtual bool isTypeLegalForMemAccess(EVT VT) const {
+    return isTypeLegal(VT);
+  }
 
   class ValueTypeActionImpl {
     /// ValueTypeActions - For each value type, keep a LegalizeTypeAction enum
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 5e77317572af..6acde2a5ae91 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -18486,7 +18486,7 @@ bool DAGCombiner::tryStoreMergeOfExtracts(
       if (Ty.getSizeInBits() > MaximumLegalStoreInBits)
         break;
 
-      if (TLI.isTypeLegal(Ty) &&
+      if (TLI.isTypeLegalForMemAccess(Ty) &&
           TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG.getMachineFunction()) &&
           TLI.allowsMemoryAccess(Context, DL, Ty,
                                  *FirstInChain->getMemOperand(), &IsFast) &&
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index 862d2ebc75a6..ceed4c1ffc91 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -17569,7 +17569,8 @@ bool PPCTargetLowering::allowsMisalignedMemoryAccesses(EVT VT,
       return true;
     if (Subtarget.hasVSX()) {
       if (VT != MVT::v2f64 && VT != MVT::v2i64 &&
-          VT != MVT::v4f32 && VT != MVT::v4i32)
+          VT != MVT::v4f32 && VT != MVT::v4i32 &&
+          VT != MVT::v2f32 && VT != MVT::v2i32)
         return false;
     } else {
       return false;
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.h b/llvm/lib/Target/PowerPC/PPCISelLowering.h
index 2fa6d45bfe1a..1f0051f8d273 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.h
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.h
@@ -1101,6 +1101,11 @@ namespace llvm {
     EVT getOptimalMemOpType(const MemOp &Op,
                             const AttributeList &FuncAttributes) const override;
 
+    bool isTypeLegalForMemAccess(EVT VT) const override {
+      bool Ret = TargetLoweringBase::isTypeLegalForMemAccess(VT) || VT == MVT::v2i32 || VT == MVT::v2f32;
+      return Ret;
+    }
+
     /// Is unaligned memory access allowed for the given type, and is it fast
     /// relative to software emulation.
     bool allowsMisalignedMemoryAccesses(

Sure, it produces some vperm's with this test case, but I don't see an issue with that - in most cases that matter, the constant pool loads aren't likely to lead to a lot of cache misses.

Thank you. A good lesson for me to learn how to simplify logic!

The original approach is too complex, and the same effect can be achieved more simply as Nemanja pointed out.

I'm adopting the whole approach, and added a guard to make sure this is applied only before type is legalized.
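A minimal sketch of what such a guarded hook could look like, using toy types rather than LLVM's. ToyLoweringBase, ToyPPCLowering, the VT enum and the chosen set of legal types are assumptions made for illustration; only the shape of the override follows the approach described in this thread.

```cpp
// Hypothetical value types; a small subset chosen for illustration only.
enum class VT { v2i32, v2f32, v3i32, v4i32, v4f32, v2i64, v2f64 };

struct ToyLoweringBase {
  virtual ~ToyLoweringBase() = default;

  // Assumption: the toy target only has registers for 128-bit vectors.
  virtual bool isTypeLegal(VT T) const {
    return T == VT::v4i32 || T == VT::v4f32 || T == VT::v2i64 ||
           T == VT::v2f64;
  }

  // Default hook: same answer as isTypeLegal, regardless of the phase.
  virtual bool isTypeLegalForMemAccess(VT T,
                                       bool /*BeforeTypeLegalized*/) const {
    return isTypeLegal(T);
  }
};

struct ToyPPCLowering : ToyLoweringBase {
  // Additionally accept 64-bit two-element vectors, but only while
  // types have not been legalized yet (the guard).
  bool isTypeLegalForMemAccess(VT T, bool BeforeTypeLegalized) const override {
    return ToyLoweringBase::isTypeLegalForMemAccess(T, BeforeTypeLegalized) ||
           (BeforeTypeLegalized && (T == VT::v2i32 || T == VT::v2f32));
  }
};
```

With the guard, a v2i32 query succeeds only when the caller says types are not yet legalized, so later combine stages never see the illegal type accepted.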

Harbormaster completed remote builds in B180097: Diff 451048. Aug 9 2022, 1:52 AM

RKSimon added inline comments. Aug 9 2022, 2:27 AM

llvm/include/llvm/CodeGen/TargetLowering.h
945	please add doxygen descriptions of the params (B in particular....)

Update parameter naming and comments.

Harbormaster completed remote builds in B180132: Diff 451097. Aug 9 2022, 5:29 AM

Gentle ping.

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	8 lines
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	2 lines
llvm/lib/Target/PowerPC/PPCISelLowering.h	9 lines
llvm/lib/Target/PowerPC/PPCISelLowering.cpp	4 lines
llvm/test/CodeGen/PowerPC/extract-and-store.ll	188 lines

Diff 451097

llvm/include/llvm/CodeGen/TargetLowering.h

@@ public:
   /// This means that it has a register that directly holds it without
   /// promotions or expansions.
   bool isTypeLegal(EVT VT) const {
     assert(!VT.isSimple() ||
            (unsigned)VT.getSimpleVT().SimpleTy < array_lengthof(RegClassForVT));
     return VT.isSimple() && RegClassForVT[VT.getSimpleVT().SimpleTy] != nullptr;
   }
 
+  /// Return true if the target has native support for the specified value type.
+  /// Provide opportunity for target to decide before type is legalized. On PPC
+  /// for example, there is efficient pattern to do two vector extracts and
+  /// store into consecutive memory locations \p BeforeTypeLegalized.
+  virtual bool isTypeLegalForMemAccess(EVT VT, bool BeforeTypeLegalized) const {
+    return isTypeLegal(VT);
+  }
+
   class ValueTypeActionImpl {
     /// ValueTypeActions - For each value type, keep a LegalizeTypeAction enum
     /// that indicates how instruction selection should deal with the type.
     LegalizeTypeAction ValueTypeActions[MVT::VALUETYPE_SIZE];
 
   public:
     ValueTypeActionImpl() {
       std::fill(std::begin(ValueTypeActions), std::end(ValueTypeActions),

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

@@ for (unsigned i = 0; i < NumConsecutiveStores; ++i) {
       unsigned Elts = (i + 1) * NumMemElts;
       EVT Ty = EVT::getVectorVT(*DAG.getContext(), MemVT.getScalarType(), Elts);
       bool IsFast = false;
 
       // Break early when size is too large to be legal.
       if (Ty.getSizeInBits() > MaximumLegalStoreInBits)
         break;
 
-      if (TLI.isTypeLegal(Ty) &&
+      if (TLI.isTypeLegalForMemAccess(Ty, Level == BeforeLegalizeTypes) &&
           TLI.canMergeStoresTo(FirstStoreAS, Ty, DAG.getMachineFunction()) &&
           TLI.allowsMemoryAccess(Context, DL, Ty,
                                  *FirstInChain->getMemOperand(), &IsFast) &&
           IsFast)
         NumStoresToMerge = i + 1;
     }
 
     // Check if we found a legal integer type creating a meaningful
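The search performed by the loop in this hunk can be sketched in isolation. The function below is a simplified, hypothetical model: countStoresToMerge and its parameters are names invented for illustration, and the single IsUsableEltCount callback stands in for the combined isTypeLegalForMemAccess / canMergeStoresTo / allowsMemoryAccess conditions.

```cpp
#include <functional>

// Toy model: find the largest prefix of consecutive element stores that
// can be replaced by one vector store. Returns 0 if no width is usable.
unsigned countStoresToMerge(
    unsigned NumConsecutiveStores, unsigned NumMemElts, unsigned EltBits,
    unsigned MaxLegalStoreBits,
    const std::function<bool(unsigned)> &IsUsableEltCount) {
  unsigned NumStoresToMerge = 0;
  for (unsigned i = 0; i < NumConsecutiveStores; ++i) {
    unsigned Elts = (i + 1) * NumMemElts;

    // Stop as soon as the candidate vector is too wide to be stored.
    if (Elts * EltBits > MaxLegalStoreBits)
      break;

    // Remember the widest candidate accepted so far; narrower rejected
    // widths (for example 3 elements) do not end the search.
    if (IsUsableEltCount(Elts))
      NumStoresToMerge = i + 1;
  }
  return NumStoresToMerge;
}
```

In this model, two consecutive i32 stores are merged only if a two-element vector is accepted by the callback, which is the effect that allowing v2i32 and v2f32 before type legalization has in the patch.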

llvm/lib/Target/PowerPC/PPCISelLowering.h

@@ bool getTgtMemIntrinsic(IntrinsicInfo &Info,
                             MachineFunction &MF,
                             unsigned Intrinsic) const override;
 
     /// It returns EVT::Other if the type should be determined using generic
     /// target-independent logic.
     EVT getOptimalMemOpType(const MemOp &Op,
                             const AttributeList &FuncAttributes) const override;
 
+    /// Two vector extracts and store into consecutive memory locations is
+    /// allowed \p BeforeTypeLegalized.
+    bool isTypeLegalForMemAccess(EVT VT,
+                                 bool BeforeTypeLegalized) const override {
+      return TargetLoweringBase::isTypeLegalForMemAccess(VT,
+                                                         BeforeTypeLegalized) ||
+             (BeforeTypeLegalized && (VT == MVT::v2i32 || VT == MVT::v2f32));
+    }
+
     /// Is unaligned memory access allowed for the given type, and is it fast
     /// relative to software emulation.
     bool allowsMisalignedMemoryAccesses(
         EVT VT, unsigned AddrSpace, Align Alignment = Align(1),
         MachineMemOperand::Flags Flags = MachineMemOperand::MONone,
         bool *Fast = nullptr) const override;
 
     /// isFMAFasterThanFMulAndFAdd - Return true if an FMA operation is faster

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

@@ if (!VT.isSimple())
     return false;
 
   if (VT.isFloatingPoint() && !VT.isVector() &&
       !Subtarget.allowsUnalignedFPAccess())
     return false;
 
   if (VT.getSimpleVT().isVector()) {
     if (Subtarget.hasVSX()) {
-      if (VT != MVT::v2f64 && VT != MVT::v2i64 &&
-          VT != MVT::v4f32 && VT != MVT::v4i32)
+      if (VT != MVT::v2f64 && VT != MVT::v2i64 && VT != MVT::v4f32 &&
+          VT != MVT::v4i32 && VT != MVT::v2f32 && VT != MVT::v2i32)
         return false;
     } else {
       return false;
     }
   }
 
   if (VT == MVT::ppcf128)
     return false;

llvm/test/CodeGen/PowerPC/extract-and-store.ll

Show First 20 Lines • Show All 476 Lines • ▼ Show 20 Lines	entry:
%arrayidx = getelementptr inbounds i32, i32* %ap, i64 3		%arrayidx = getelementptr inbounds i32, i32* %ap, i64 3
store i32 %vecext, i32* %arrayidx, align 4		store i32 %vecext, i32* %arrayidx, align 4
ret <4 x i32> %a		ret <4 x i32> %a
}		}

define dso_local void @test_consecutive_i32(<4 x i32> %a, i32* nocapture %b) local_unnamed_addr #0 {		define dso_local void @test_consecutive_i32(<4 x i32> %a, i32* nocapture %b) local_unnamed_addr #0 {
; CHECK-LABEL: test_consecutive_i32:		; CHECK-LABEL: test_consecutive_i32:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-NEXT: vpkudum v2, v2, v2
; CHECK-NEXT: li r3, 4		; CHECK-NEXT: xxswapd vs0, vs34
; CHECK-NEXT: stxsiwx vs34, r5, r3		; CHECK-NEXT: stfdx f0, 0, r5
; CHECK-NEXT: stfiwx f0, 0, r5
; CHECK-NEXT: blr		; CHECK-NEXT: blr
;		;
; CHECK-BE-LABEL: test_consecutive_i32:		; CHECK-BE-LABEL: test_consecutive_i32:
; CHECK-BE: # %bb.0: # %entry		; CHECK-BE: # %bb.0: # %entry
; CHECK-BE-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-BE-NEXT: addis r3, r2, .LCPI14_0@toc@ha
; CHECK-BE-NEXT: xxsldwi vs1, vs34, vs34, 1		; CHECK-BE-NEXT: addi r3, r3, .LCPI14_0@toc@l
; CHECK-BE-NEXT: li r3, 4		; CHECK-BE-NEXT: lxvw4x vs35, 0, r3
; CHECK-BE-NEXT: stfiwx f0, 0, r5		; CHECK-BE-NEXT: vperm v2, v2, v2, v3
; CHECK-BE-NEXT: stfiwx f1, r5, r3		; CHECK-BE-NEXT: stxsdx vs34, 0, r5
; CHECK-BE-NEXT: blr		; CHECK-BE-NEXT: blr
;		;
; CHECK-P9-LABEL: test_consecutive_i32:		; CHECK-P9-LABEL: test_consecutive_i32:
; CHECK-P9: # %bb.0: # %entry		; CHECK-P9: # %bb.0: # %entry
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-P9-NEXT: vpkudum v2, v2, v2
; CHECK-P9-NEXT: li r3, 4		; CHECK-P9-NEXT: xxswapd vs0, vs34
; CHECK-P9-NEXT: stxsiwx vs34, r5, r3		; CHECK-P9-NEXT: stfd f0, 0(r5)
; CHECK-P9-NEXT: stfiwx f0, 0, r5
; CHECK-P9-NEXT: blr		; CHECK-P9-NEXT: blr
;		;
; CHECK-P9-BE-LABEL: test_consecutive_i32:		; CHECK-P9-BE-LABEL: test_consecutive_i32:
; CHECK-P9-BE: # %bb.0: # %entry		; CHECK-P9-BE: # %bb.0: # %entry
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-P9-BE-NEXT: addis r3, r2, .LCPI14_0@toc@ha
; CHECK-P9-BE-NEXT: li r3, 4		; CHECK-P9-BE-NEXT: addi r3, r3, .LCPI14_0@toc@l
; CHECK-P9-BE-NEXT: stfiwx f0, 0, r5		; CHECK-P9-BE-NEXT: lxv vs35, 0(r3)
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 1		; CHECK-P9-BE-NEXT: vperm v2, v2, v2, v3
; CHECK-P9-BE-NEXT: stfiwx f0, r5, r3		; CHECK-P9-BE-NEXT: stxsd v2, 0(r5)
; CHECK-P9-BE-NEXT: blr		; CHECK-P9-BE-NEXT: blr
entry:		entry:

%vecext = extractelement <4 x i32> %a, i32 0		%vecext = extractelement <4 x i32> %a, i32 0
store i32 %vecext, i32* %b, align 4		store i32 %vecext, i32* %b, align 4
%vecext1 = extractelement <4 x i32> %a, i32 2		%vecext1 = extractelement <4 x i32> %a, i32 2
%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 1		%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 1
store i32 %vecext1, i32* %arrayidx2, align 4		store i32 %vecext1, i32* %arrayidx2, align 4
ret void		ret void
}		}

define dso_local void @test_consecutive_float(<4 x float> %a, float* nocapture %b) local_unnamed_addr #0 {		define dso_local void @test_consecutive_float(<4 x float> %a, float* nocapture %b) local_unnamed_addr #0 {
; CHECK-LABEL: test_consecutive_float:		; CHECK-LABEL: test_consecutive_float:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xxsldwi vs0, vs34, vs34, 1		; CHECK-NEXT: addis r3, r2, .LCPI15_0@toc@ha
; CHECK-NEXT: xxsldwi vs1, vs34, vs34, 3		; CHECK-NEXT: addi r3, r3, .LCPI15_0@toc@l
; CHECK-NEXT: li r3, 4		; CHECK-NEXT: lxvd2x vs0, 0, r3
; CHECK-NEXT: stfiwx f0, 0, r5		; CHECK-NEXT: xxswapd vs35, vs0
; CHECK-NEXT: stfiwx f1, r5, r3		; CHECK-NEXT: vperm v2, v2, v2, v3
		; CHECK-NEXT: xxswapd vs0, vs34
		; CHECK-NEXT: stfdx f0, 0, r5
		RKSimonUnsubmitted Not Done Reply Inline Actions is this really an issue with the store merging or the ppc shuffle combines have gotten messed up? RKSimon: is this really an issue with the store merging or the ppc shuffle combines have gotten messed…
		tingwangAuthorUnsubmitted Done Reply Inline Actions Had a quick check, this is the case PPC::LowerVECTOR_SHUFFLE does not have efficient solution, so it turned into VPERM as last resort. Probably the cost function should avoid this kind of situation. tingwang: Had a quick check, this is the case PPC::LowerVECTOR_SHUFFLE does not have efficient solution…
		RKSimonUnsubmitted Not Done Reply Inline Actions I'd be nervous about a cost function as that is likely to be very difficult to keep balanced. I'd probably recommend just overriding canMergeStoresTo or isMultiStoresCheaperThanBitsMerge for PPC RKSimon: I'd be nervous about a cost function as that is likely to be very difficult to keep balanced.
		tingwangAuthorUnsubmitted Done Reply Inline Actions Thank you for the advice. I created a new function, hope that is fine. tingwang: Thank you for the advice. I created a new function, hope that is fine.
; CHECK-NEXT: blr		; CHECK-NEXT: blr
;		;
; CHECK-BE-LABEL: test_consecutive_float:		; CHECK-BE-LABEL: test_consecutive_float:
; CHECK-BE: # %bb.0: # %entry		; CHECK-BE: # %bb.0: # %entry
; CHECK-BE-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-BE-NEXT: vpkudum v2, v2, v2
; CHECK-BE-NEXT: li r3, 4		; CHECK-BE-NEXT: stxsdx vs34, 0, r5
; CHECK-BE-NEXT: stxsiwx vs34, 0, r5
; CHECK-BE-NEXT: stfiwx f0, r5, r3
; CHECK-BE-NEXT: blr		; CHECK-BE-NEXT: blr
;		;
; CHECK-P9-LABEL: test_consecutive_float:		; CHECK-P9-LABEL: test_consecutive_float:
; CHECK-P9: # %bb.0: # %entry		; CHECK-P9: # %bb.0: # %entry
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 1		; CHECK-P9-NEXT: addis r3, r2, .LCPI15_0@toc@ha
; CHECK-P9-NEXT: li r3, 4		; CHECK-P9-NEXT: addi r3, r3, .LCPI15_0@toc@l
; CHECK-P9-NEXT: stfiwx f0, 0, r5		; CHECK-P9-NEXT: lxv vs35, 0(r3)
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-P9-NEXT: vperm v2, v2, v2, v3
; CHECK-P9-NEXT: stfiwx f0, r5, r3		; CHECK-P9-NEXT: xxswapd vs0, vs34
		; CHECK-P9-NEXT: stfd f0, 0(r5)
; CHECK-P9-NEXT: blr		; CHECK-P9-NEXT: blr
;		;
; CHECK-P9-BE-LABEL: test_consecutive_float:		; CHECK-P9-BE-LABEL: test_consecutive_float:
; CHECK-P9-BE: # %bb.0: # %entry		; CHECK-P9-BE: # %bb.0: # %entry
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-P9-BE-NEXT: vpkudum v2, v2, v2
; CHECK-P9-BE-NEXT: li r3, 4		; CHECK-P9-BE-NEXT: stxsd v2, 0(r5)
; CHECK-P9-BE-NEXT: stxsiwx vs34, 0, r5
; CHECK-P9-BE-NEXT: stfiwx f0, r5, r3
; CHECK-P9-BE-NEXT: blr		; CHECK-P9-BE-NEXT: blr
entry:		entry:
%vecext = extractelement <4 x float> %a, i32 1		%vecext = extractelement <4 x float> %a, i32 1
store float %vecext, float* %b, align 4		store float %vecext, float* %b, align 4
%vecext1 = extractelement <4 x float> %a, i32 3		%vecext1 = extractelement <4 x float> %a, i32 3
%arrayidx2 = getelementptr inbounds float, float* %b, i64 1		%arrayidx2 = getelementptr inbounds float, float* %b, i64 1
store float %vecext1, float* %arrayidx2, align 4		store float %vecext1, float* %arrayidx2, align 4
ret void		ret void
}		}

define dso_local void @test_stores_exceed_vec_size(<4 x i32> %a, i32* nocapture %b) local_unnamed_addr #0 {		define dso_local void @test_stores_exceed_vec_size(<4 x i32> %a, i32* nocapture %b) local_unnamed_addr #0 {
; CHECK-LABEL: test_stores_exceed_vec_size:		; CHECK-LABEL: test_stores_exceed_vec_size:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: addis r3, r2, .LCPI16_0@toc@ha		; CHECK-NEXT: addis r3, r2, .LCPI16_0@toc@ha
; CHECK-NEXT: xxsldwi vs1, vs34, vs34, 1
; CHECK-NEXT: li r4, 20
; CHECK-NEXT: addi r3, r3, .LCPI16_0@toc@l		; CHECK-NEXT: addi r3, r3, .LCPI16_0@toc@l
; CHECK-NEXT: lxvd2x vs0, 0, r3		; CHECK-NEXT: lxvd2x vs0, 0, r3
; CHECK-NEXT: li r3, 16
; CHECK-NEXT: xxswapd vs35, vs0		; CHECK-NEXT: xxswapd vs35, vs0
; CHECK-NEXT: vperm v3, v2, v2, v3		; CHECK-NEXT: vperm v3, v2, v2, v3
		; CHECK-NEXT: vsldoi v2, v2, v2, 12
; CHECK-NEXT: xxswapd vs0, vs35		; CHECK-NEXT: xxswapd vs0, vs35
		; CHECK-NEXT: xxswapd vs1, vs34
; CHECK-NEXT: stxvd2x vs0, 0, r5		; CHECK-NEXT: stxvd2x vs0, 0, r5
; CHECK-NEXT: stfiwx f1, r5, r3		; CHECK-NEXT: stfd f1, 16(r5)
; CHECK-NEXT: stxsiwx vs34, r5, r4
; CHECK-NEXT: blr		; CHECK-NEXT: blr
;		;
; CHECK-BE-LABEL: test_stores_exceed_vec_size:		; CHECK-BE-LABEL: test_stores_exceed_vec_size:
; CHECK-BE: # %bb.0: # %entry		; CHECK-BE: # %bb.0: # %entry
; CHECK-BE-NEXT: addis r3, r2, .LCPI16_0@toc@ha		; CHECK-BE-NEXT: addis r3, r2, .LCPI16_0@toc@ha
; CHECK-BE-NEXT: xxsldwi vs0, vs34, vs34, 1
; CHECK-BE-NEXT: li r4, 20
; CHECK-BE-NEXT: addi r3, r3, .LCPI16_0@toc@l		; CHECK-BE-NEXT: addi r3, r3, .LCPI16_0@toc@l
; CHECK-BE-NEXT: lxvw4x vs35, 0, r3		; CHECK-BE-NEXT: lxvw4x vs35, 0, r3
; CHECK-BE-NEXT: li r3, 16		; CHECK-BE-NEXT: li r3, 16
; CHECK-BE-NEXT: stxsiwx vs34, r5, r3
; CHECK-BE-NEXT: stfiwx f0, r5, r4
; CHECK-BE-NEXT: vperm v3, v2, v2, v3		; CHECK-BE-NEXT: vperm v3, v2, v2, v3
		; CHECK-BE-NEXT: vsldoi v2, v2, v2, 4
; CHECK-BE-NEXT: stxvw4x vs35, 0, r5		; CHECK-BE-NEXT: stxvw4x vs35, 0, r5
		; CHECK-BE-NEXT: stxsdx vs34, r5, r3
; CHECK-BE-NEXT: blr		; CHECK-BE-NEXT: blr
;		;
; CHECK-P9-LABEL: test_stores_exceed_vec_size:		; CHECK-P9-LABEL: test_stores_exceed_vec_size:
; CHECK-P9: # %bb.0: # %entry		; CHECK-P9: # %bb.0: # %entry
; CHECK-P9-NEXT: addis r3, r2, .LCPI16_0@toc@ha		; CHECK-P9-NEXT: addis r3, r2, .LCPI16_0@toc@ha
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 1
; CHECK-P9-NEXT: addi r3, r3, .LCPI16_0@toc@l		; CHECK-P9-NEXT: addi r3, r3, .LCPI16_0@toc@l
; CHECK-P9-NEXT: lxv vs35, 0(r3)		; CHECK-P9-NEXT: lxv vs35, 0(r3)
; CHECK-P9-NEXT: li r3, 16
; CHECK-P9-NEXT: stfiwx f0, r5, r3
; CHECK-P9-NEXT: li r3, 20
; CHECK-P9-NEXT: stxsiwx vs34, r5, r3
; CHECK-P9-NEXT: vperm v3, v2, v2, v3		; CHECK-P9-NEXT: vperm v3, v2, v2, v3
		; CHECK-P9-NEXT: vsldoi v2, v2, v2, 12
		; CHECK-P9-NEXT: xxswapd vs0, vs34
; CHECK-P9-NEXT: stxv vs35, 0(r5)		; CHECK-P9-NEXT: stxv vs35, 0(r5)
		; CHECK-P9-NEXT: stfd f0, 16(r5)
; CHECK-P9-NEXT: blr		; CHECK-P9-NEXT: blr
;		;
; CHECK-P9-BE-LABEL: test_stores_exceed_vec_size:		; CHECK-P9-BE-LABEL: test_stores_exceed_vec_size:
; CHECK-P9-BE: # %bb.0: # %entry		; CHECK-P9-BE: # %bb.0: # %entry
; CHECK-P9-BE-NEXT: addis r3, r2, .LCPI16_0@toc@ha		; CHECK-P9-BE-NEXT: addis r3, r2, .LCPI16_0@toc@ha
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 1
; CHECK-P9-BE-NEXT: addi r3, r3, .LCPI16_0@toc@l		; CHECK-P9-BE-NEXT: addi r3, r3, .LCPI16_0@toc@l
; CHECK-P9-BE-NEXT: lxv vs35, 0(r3)		; CHECK-P9-BE-NEXT: lxv vs35, 0(r3)
; CHECK-P9-BE-NEXT: li r3, 16
; CHECK-P9-BE-NEXT: stxsiwx vs34, r5, r3
; CHECK-P9-BE-NEXT: li r3, 20
; CHECK-P9-BE-NEXT: stfiwx f0, r5, r3
; CHECK-P9-BE-NEXT: vperm v3, v2, v2, v3		; CHECK-P9-BE-NEXT: vperm v3, v2, v2, v3
		; CHECK-P9-BE-NEXT: vsldoi v2, v2, v2, 4
; CHECK-P9-BE-NEXT: stxv vs35, 0(r5)		; CHECK-P9-BE-NEXT: stxv vs35, 0(r5)
		; CHECK-P9-BE-NEXT: stxsd v2, 16(r5)
; CHECK-P9-BE-NEXT: blr		; CHECK-P9-BE-NEXT: blr
entry:		entry:
%vecext = extractelement <4 x i32> %a, i32 2		%vecext = extractelement <4 x i32> %a, i32 2
store i32 %vecext, i32* %b, align 4		store i32 %vecext, i32* %b, align 4
%vecext1 = extractelement <4 x i32> %a, i32 3		%vecext1 = extractelement <4 x i32> %a, i32 3
%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 1		%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 1
store i32 %vecext1, i32* %arrayidx2, align 4		store i32 %vecext1, i32* %arrayidx2, align 4
%vecext3 = extractelement <4 x i32> %a, i32 0		%vecext3 = extractelement <4 x i32> %a, i32 0
; [285 lines collapsed in the diff view]
%arrayidx24 = getelementptr inbounds i8, i8* %b, i64 12		%arrayidx24 = getelementptr inbounds i8, i8* %b, i64 12
store i8 %vecext23, i8* %arrayidx24, align 1		store i8 %vecext23, i8* %arrayidx24, align 1
ret void		ret void
}		}

define void @test_elements_from_two_vec(<4 x i32> %a, <4 x i32> %b, i32* nocapture %c) local_unnamed_addr #0 {		define void @test_elements_from_two_vec(<4 x i32> %a, <4 x i32> %b, i32* nocapture %c) local_unnamed_addr #0 {
; CHECK-LABEL: test_elements_from_two_vec:		; CHECK-LABEL: test_elements_from_two_vec:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-NEXT: addis r3, r2, .LCPI19_0@toc@ha
; CHECK-NEXT: xxsldwi vs1, vs35, vs35, 1		; CHECK-NEXT: addi r3, r3, .LCPI19_0@toc@l
; CHECK-NEXT: li r3, 4		; CHECK-NEXT: lxvd2x vs0, 0, r3
; CHECK-NEXT: stfiwx f0, r7, r3		; CHECK-NEXT: xxswapd vs36, vs0
; CHECK-NEXT: stfiwx f1, 0, r7		; CHECK-NEXT: vperm v2, v2, v3, v4
		; CHECK-NEXT: xxswapd vs0, vs34
		; CHECK-NEXT: stfdx f0, 0, r7
; CHECK-NEXT: blr		; CHECK-NEXT: blr
;		;
; CHECK-BE-LABEL: test_elements_from_two_vec:		; CHECK-BE-LABEL: test_elements_from_two_vec:
; CHECK-BE: # %bb.0: # %entry		; CHECK-BE: # %bb.0: # %entry
; CHECK-BE-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-BE-NEXT: addis r3, r2, .LCPI19_0@toc@ha
; CHECK-BE-NEXT: li r3, 4		; CHECK-BE-NEXT: addi r3, r3, .LCPI19_0@toc@l
; CHECK-BE-NEXT: stxsiwx vs35, 0, r7		; CHECK-BE-NEXT: lxvw4x vs36, 0, r3
; CHECK-BE-NEXT: stfiwx f0, r7, r3		; CHECK-BE-NEXT: vperm v2, v3, v2, v4
		; CHECK-BE-NEXT: stxsdx vs34, 0, r7
; CHECK-BE-NEXT: blr		; CHECK-BE-NEXT: blr
;		;
; CHECK-P9-LABEL: test_elements_from_two_vec:		; CHECK-P9-LABEL: test_elements_from_two_vec:
; CHECK-P9: # %bb.0: # %entry		; CHECK-P9: # %bb.0: # %entry
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-P9-NEXT: addis r3, r2, .LCPI19_0@toc@ha
; CHECK-P9-NEXT: li r3, 4		; CHECK-P9-NEXT: addi r3, r3, .LCPI19_0@toc@l
; CHECK-P9-NEXT: stfiwx f0, r7, r3		; CHECK-P9-NEXT: lxv vs36, 0(r3)
; CHECK-P9-NEXT: xxsldwi vs0, vs35, vs35, 1		; CHECK-P9-NEXT: vperm v2, v2, v3, v4
; CHECK-P9-NEXT: stfiwx f0, 0, r7		; CHECK-P9-NEXT: xxswapd vs0, vs34
		; CHECK-P9-NEXT: stfd f0, 0(r7)
; CHECK-P9-NEXT: blr		; CHECK-P9-NEXT: blr
;		;
; CHECK-P9-BE-LABEL: test_elements_from_two_vec:		; CHECK-P9-BE-LABEL: test_elements_from_two_vec:
; CHECK-P9-BE: # %bb.0: # %entry		; CHECK-P9-BE: # %bb.0: # %entry
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-P9-BE-NEXT: addis r3, r2, .LCPI19_0@toc@ha
; CHECK-P9-BE-NEXT: li r3, 4		; CHECK-P9-BE-NEXT: addi r3, r3, .LCPI19_0@toc@l
; CHECK-P9-BE-NEXT: stxsiwx vs35, 0, r7		; CHECK-P9-BE-NEXT: lxv vs36, 0(r3)
; CHECK-P9-BE-NEXT: stfiwx f0, r7, r3		; CHECK-P9-BE-NEXT: vperm v2, v3, v2, v4
		; CHECK-P9-BE-NEXT: stxsd v2, 0(r7)
; CHECK-P9-BE-NEXT: blr		; CHECK-P9-BE-NEXT: blr
entry:		entry:
%vecext = extractelement <4 x i32> %a, i32 0		%vecext = extractelement <4 x i32> %a, i32 0
%arrayidx = getelementptr inbounds i32, i32* %c, i64 1		%arrayidx = getelementptr inbounds i32, i32* %c, i64 1
store i32 %vecext, i32* %arrayidx, align 4		store i32 %vecext, i32* %arrayidx, align 4
%vecext1 = extractelement <4 x i32> %b, i32 1		%vecext1 = extractelement <4 x i32> %b, i32 1
store i32 %vecext1, i32* %c, align 4		store i32 %vecext1, i32* %c, align 4
ret void		ret void
}		}

define dso_local void @test_elements_from_three_vec(<4 x float> %a, <4 x float> %b, <4 x float> %c, float* nocapture %d) local_unnamed_addr #0 {		define dso_local void @test_elements_from_three_vec(<4 x float> %a, <4 x float> %b, <4 x float> %c, float* nocapture %d) local_unnamed_addr #0 {
; CHECK-LABEL: test_elements_from_three_vec:		; CHECK-LABEL: test_elements_from_three_vec:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-NEXT: addis r3, r2, .LCPI20_0@toc@ha
; CHECK-NEXT: xxsldwi vs1, vs36, vs36, 1		; CHECK-NEXT: xxsldwi vs1, vs36, vs36, 1
; CHECK-NEXT: li r3, 4		; CHECK-NEXT: addi r3, r3, .LCPI20_0@toc@l
; CHECK-NEXT: li r4, 8		; CHECK-NEXT: lxvd2x vs0, 0, r3
; CHECK-NEXT: stxsiwx vs35, r9, r3		; CHECK-NEXT: li r3, 8
; CHECK-NEXT: stfiwx f0, 0, r9		; CHECK-NEXT: stfiwx f1, r9, r3
; CHECK-NEXT: stfiwx f1, r9, r4		; CHECK-NEXT: xxswapd vs37, vs0
		; CHECK-NEXT: vperm v2, v3, v2, v5
		; CHECK-NEXT: xxswapd vs0, vs34
		; CHECK-NEXT: stfdx f0, 0, r9
; CHECK-NEXT: blr		; CHECK-NEXT: blr
;		;
; CHECK-BE-LABEL: test_elements_from_three_vec:		; CHECK-BE-LABEL: test_elements_from_three_vec:
; CHECK-BE: # %bb.0: # %entry		; CHECK-BE: # %bb.0: # %entry
; CHECK-BE-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-BE-NEXT: addis r3, r2, .LCPI20_0@toc@ha
; CHECK-BE-NEXT: xxsldwi vs1, vs35, vs35, 1		; CHECK-BE-NEXT: addi r3, r3, .LCPI20_0@toc@l
; CHECK-BE-NEXT: li r3, 4		; CHECK-BE-NEXT: lxvw4x vs37, 0, r3
; CHECK-BE-NEXT: li r4, 8		; CHECK-BE-NEXT: li r3, 8
; CHECK-BE-NEXT: stxsiwx vs36, r9, r4		; CHECK-BE-NEXT: stxsiwx vs36, r9, r3
; CHECK-BE-NEXT: stfiwx f1, r9, r3		; CHECK-BE-NEXT: vperm v2, v2, v3, v5
; CHECK-BE-NEXT: stfiwx f0, 0, r9		; CHECK-BE-NEXT: stxsdx vs34, 0, r9
; CHECK-BE-NEXT: blr		; CHECK-BE-NEXT: blr
;		;
; CHECK-P9-LABEL: test_elements_from_three_vec:		; CHECK-P9-LABEL: test_elements_from_three_vec:
; CHECK-P9: # %bb.0: # %entry		; CHECK-P9: # %bb.0: # %entry
; CHECK-P9-NEXT: xxsldwi vs0, vs34, vs34, 3		; CHECK-P9-NEXT: addis r3, r2, .LCPI20_0@toc@ha
; CHECK-P9-NEXT: li r3, 4		; CHECK-P9-NEXT: addi r3, r3, .LCPI20_0@toc@l
; CHECK-P9-NEXT: stxsiwx vs35, r9, r3		; CHECK-P9-NEXT: lxv vs37, 0(r3)
; CHECK-P9-NEXT: li r3, 8		; CHECK-P9-NEXT: li r3, 8
; CHECK-P9-NEXT: stfiwx f0, 0, r9		; CHECK-P9-NEXT: vperm v2, v3, v2, v5
		; CHECK-P9-NEXT: xxswapd vs0, vs34
		; CHECK-P9-NEXT: stfd f0, 0(r9)
; CHECK-P9-NEXT: xxsldwi vs0, vs36, vs36, 1		; CHECK-P9-NEXT: xxsldwi vs0, vs36, vs36, 1
; CHECK-P9-NEXT: stfiwx f0, r9, r3		; CHECK-P9-NEXT: stfiwx f0, r9, r3
; CHECK-P9-NEXT: blr		; CHECK-P9-NEXT: blr
;		;
; CHECK-P9-BE-LABEL: test_elements_from_three_vec:		; CHECK-P9-BE-LABEL: test_elements_from_three_vec:
; CHECK-P9-BE: # %bb.0: # %entry		; CHECK-P9-BE: # %bb.0: # %entry
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs34, vs34, 2		; CHECK-P9-BE-NEXT: addis r3, r2, .LCPI20_0@toc@ha
; CHECK-P9-BE-NEXT: li r3, 4		; CHECK-P9-BE-NEXT: addi r3, r3, .LCPI20_0@toc@l
; CHECK-P9-BE-NEXT: stfiwx f0, 0, r9		; CHECK-P9-BE-NEXT: lxv vs37, 0(r3)
; CHECK-P9-BE-NEXT: xxsldwi vs0, vs35, vs35, 1
; CHECK-P9-BE-NEXT: stfiwx f0, r9, r3
; CHECK-P9-BE-NEXT: li r3, 8		; CHECK-P9-BE-NEXT: li r3, 8
; CHECK-P9-BE-NEXT: stxsiwx vs36, r9, r3		; CHECK-P9-BE-NEXT: stxsiwx vs36, r9, r3
		; CHECK-P9-BE-NEXT: vperm v2, v2, v3, v5
		; CHECK-P9-BE-NEXT: stxsd v2, 0(r9)
; CHECK-P9-BE-NEXT: blr		; CHECK-P9-BE-NEXT: blr
entry:		entry:
%vecext = extractelement <4 x float> %a, i32 3		%vecext = extractelement <4 x float> %a, i32 3
store float %vecext, float* %d, align 4		store float %vecext, float* %d, align 4
%vecext1 = extractelement <4 x float> %b, i32 2		%vecext1 = extractelement <4 x float> %b, i32 2
%arrayidx2 = getelementptr inbounds float, float* %d, i64 1		%arrayidx2 = getelementptr inbounds float, float* %d, i64 1
store float %vecext1, float* %arrayidx2, align 4		store float %vecext1, float* %arrayidx2, align 4
%vecext3 = extractelement <4 x float> %c, i32 1		%vecext3 = extractelement <4 x float> %c, i32 1
%arrayidx4 = getelementptr inbounds float, float* %d, i64 2		%arrayidx4 = getelementptr inbounds float, float* %d, i64 2
store float %vecext3, float* %arrayidx4, align 4		store float %vecext3, float* %arrayidx4, align 4
ret void		ret void
}		}

[DAGCombiner] Improve tryStoreMergeOfExtracts to merge stores before type is legalized
Needs Review, Public

Revision Contents

Diff 451097

llvm/include/llvm/CodeGen/TargetLowering.h

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

llvm/lib/Target/PowerPC/PPCISelLowering.h

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

llvm/test/CodeGen/PowerPC/extract-and-store.ll
