This is an archive of the discontinued LLVM Phabricator instance.

Differential D138883

[SelectionDAG][PowerPC] Memset reuse vector element for tail store
ClosedPublic

Authored by tingwang on Nov 28 2022, 5:29 PM.

Details

Reviewers

shchenz
nemanjai
rzurob
RKSimon
dmgreen
asavonic
lkail
ecnelises

Group Reviewers

Restricted Project

Commits

rG71be020dda2c: [SelectionDAG][PowerPC] Memset reuse vector element for tail store

Summary

On PPC there are instructions that store an element from a vector (e.g. stxsdx/stxsiwx). These instructions can be leveraged to avoid materializing a separate tail constant in memset and constant-splat array initialization.

This patch tries to explore these opportunities.
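As a rough illustration of the idea in the summary, here is a minimal sketch in plain C++ of a 24-byte fill done as one 16-byte "vector" store plus an 8-byte tail taken from the same splatted value. The names `Vec16`, `splatByte` and `memset24ReusingVector` are hypothetical, and the sketch ignores which element a real target would pick (that depends on endianness and the store instruction).

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Toy model of a 16-byte vector register holding a splatted fill byte.
using Vec16 = std::array<std::uint8_t, 16>;

Vec16 splatByte(std::uint8_t fill) {
  Vec16 v;
  v.fill(fill);
  return v;
}

// Fill 24 bytes at dst: one full 16-byte vector store, then an 8-byte tail
// store that reuses bytes already present in the vector instead of building
// a separate 64-bit scalar constant.
void memset24ReusingVector(void *dst, std::uint8_t fill) {
  const Vec16 v = splatByte(fill);
  auto *p = static_cast<unsigned char *>(dst);
  std::memcpy(p, v.data(), 16);     // vector store of the first 16 bytes
  std::memcpy(p + 16, v.data(), 8); // tail store taken from one 8-byte element
}
```

Because every element of a splat holds the same bytes, the tail store can read from the vector that is already live, which is the saving the patch is after.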

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tingwang created this revision. Nov 28 2022, 5:29 PM

Herald added a project: Restricted Project. Nov 28 2022, 5:29 PM

Herald added subscribers: kbarton, hiraditya.

tingwang requested review of this revision. Nov 28 2022, 5:29 PM

Herald added a subscriber: llvm-commits. Nov 28 2022, 5:29 PM

tingwang added a child revision: D138881: [PowerPC][NFC] Add test case for memset tail store. Nov 28 2022, 5:29 PM

Harbormaster completed remote builds in B199907: Diff 478424. Nov 28 2022, 5:29 PM

lkail added a reviewer: RKSimon.Nov 28 2022, 5:44 PM

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

llvm/include/llvm/CodeGen/TargetLowering.h
672 ↗	(On Diff #478424)	Do we need this? Can `canCombineStoreAndExtract()` suffice for this purpose?
llvm/test/CodeGen/PowerPC/memset-tail.ll
195	Why do we now get the redundant swap for the vector store that we didn't get before? Was it eliminated by the swap elimination before and now it is not because we have a use of the partial vector?

nemanjai added inline comments.Nov 28 2022, 8:00 PM

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7310	Can we reduce the nesting here by converting this to `else if`?

Updated according to the comments:
(1) Use the existing canCombineStoreAndExtract() instead of creating a new hook.
(2) Nest the else statement properly.
(3) Two test cases changed as a result of (1).

Harbormaster completed remote builds in B199988: Diff 478535.Nov 29 2022, 5:15 AM

tingwang marked an inline comment as done.Nov 29 2022, 5:20 AM

tingwang added inline comments.

llvm/include/llvm/CodeGen/TargetLowering.h
672 ↗	(On Diff #478424)	Thank you! I gave it a try and updated the patch. Two test cases changed; I will look into the details tomorrow.
llvm/test/CodeGen/PowerPC/memset-tail.ll
195	Debug-only `ppc-vsx-swaps` shows "Web 0 rejected for physreg, partial reg, or not swap[pable]". I will look into it and probably post another patch to fix the issue. Thank you!

Realized I need to use PPCTTIImpl::getVectorInstrCost() API to determine the cost of instructions. I'm working on it now.

Changes in this update:
(1) I tried to use TTI.getVectorInstrCost() to query the instruction cost in PPCTargetLowering::canCombineStoreAndExtract(). However, I was not able to reach TTI from there and did not find any precedent for doing so in SDAG. Since the original ARM implementation of canCombineStoreAndExtract() computes Cost with its own logic, I followed that approach and implemented the logic by referring to PPCTTIImpl::getVectorInstrCost().

(2) Refactored the logic inside getMemsetStores(). On PPC, CombineStoreAndExtract is beneficial only for specific element indexes, depending on endianness and instruction (extracting and storing elements at other indexes requires a vector permutation, which makes the whole idea less attractive). Since this is platform-independent logic, I query the cost for each index and pick the least-cost one for the combine.
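The index selection described in (2) can be sketched outside of LLVM roughly as follows. This is only an illustration: `CostQuery` is a hypothetical stand-in for a target hook such as canCombineStoreAndExtract(), and `pickLeastCostIndex` is an invented name.

```cpp
#include <functional>
#include <limits>

// Hypothetical stand-in for a target hook: returns true and sets `cost` if
// extracting element `index` can be combined with the store.
using CostQuery = std::function<bool(unsigned index, unsigned &cost)>;

// Scan every element index and keep the cheapest one the target accepts.
// Returns false (and leaves `bestIndex` untouched) if no index is accepted.
// Ties are resolved in favor of the lowest index.
bool pickLeastCostIndex(unsigned numElts, const CostQuery &query,
                        unsigned &bestIndex) {
  unsigned bestCost = std::numeric_limits<unsigned>::max();
  bool found = false;
  for (unsigned i = 0; i < numElts; ++i) {
    unsigned cost = 0;
    if (query(i, cost) && cost < bestCost) {
      bestCost = cost;
      bestIndex = i;
      found = true;
    }
  }
  return found;
}
```

A target that only accepts one index per element width, as described for PPC, would simply make the query fail for every other index.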

tingwang added inline comments.Nov 29 2022, 11:57 PM

llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
31–36	The instruction sequence change in `PowerPC/p10-fi-elim.ll` results from `CodeGenPrepare::optimizeExtractElementInst()`, which can now generate the combined pattern since we enabled `canCombineStoreAndExtract()`. It seems we can avoid two mfvsrd instructions.

tingwang added inline comments.Nov 29 2022, 11:59 PM

llvm/test/CodeGen/ARM/memset-align.ll
21	Hello @dmgreen @asavonic. This patch tries to reuse a vector element for the tail store in memset by implementing `canCombineStoreAndExtract()` on PPC. This change introduced a test case change on ARM in llvm/test/CodeGen/ARM/memset-align.ll. Could you please check whether the change looks good? Thank you! I looked into the scenario on ARM: if the i8 fill value of the memset is zero, it creates a vector for the initial 16B and a constant tail for the remaining bytes, which is exactly the scenario this patch targets. For other values, it creates i32 stores for the memset and is not impacted by this patch.

Add memset-tail.ll changes.

Harbormaster completed remote builds in B200192: Diff 478827.Nov 30 2022, 12:10 AM

I would expect that not only memset but also some consecutive stores could reuse the result of a vector splat, see https://godbolt.org/z/77aMvncb4.
For

void foo(long a[3]) {
    a[0] = 12;
    a[1] = 12;
    a[2] = 12;
}

foo(long*):                               # @foo(long*)
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        xxlxor 0, 0, 0
        li 4, 12
        xxsplti32dx 0, 1, 12
        std 4, 16(3)
        stxv 0, 0(3)
        blr
        .long   0
        .quad   0

We don't reuse the result of xxsplti32dx.

In D138883#3962255, @lkail wrote:
I would expect that not only memset but also some consecutive stores could reuse the result of a vector splat, see https://godbolt.org/z/77aMvncb4.
For
void foo(long a[3]) {
    a[0] = 12;
    a[1] = 12;
    a[2] = 12;
}

foo(long*):                               # @foo(long*)
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        xxlxor 0, 0, 0
        li 4, 12
        xxsplti32dx 0, 1, 12
        std 4, 16(3)
        stxv 0, 0(3)
        blr
        .long   0
        .quad   0
We don't reuse the result of xxsplti32dx.

Sure. The posted IR could be handled by DAGCombiner::mergeConsecutiveStores(), and I agree similar combine can be applied there. But for this case, memset stores are volatile, and DAGCombiner::getStoreMergeCandidates() does not accept volatile store currently.

asavonic added inline comments.Dec 1 2022, 12:22 PM

llvm/test/CodeGen/ARM/memset-align.ll
21	This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.

tingwang added inline comments.Dec 1 2022, 3:45 PM

llvm/test/CodeGen/ARM/memset-align.ll
21	> This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.
	Thank you for confirming!

Test case update.

Harbormaster completed remote builds in B200714: Diff 479543.Dec 2 2022, 12:50 AM

tingwang mentioned this in D139193: [PowerPC] remove XXSWAPD after vector splat immediate.Dec 2 2022, 4:42 AM

tingwang added a parent revision: D139193: [PowerPC] remove XXSWAPD after vector splat immediate.

tingwang added a child revision: D139491: [PowerPC] remove XXSWAPD after load from CP which is a splat value.Dec 6 2022, 5:07 PM

In D138883#3955904, @nemanjai wrote:

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

According to my test cases, there are two kinds of swaps on LE: (1) a swap after a vector splat immediate; (2) a swap after a load from the constant pool. I submitted two patches, D139193 and D139491, to address them separately.

shchenz mentioned this in D138881: [PowerPC][NFC] Add test case for memset tail store.Dec 7 2022, 12:37 AM

Gentle ping.

Update patch as following test case pattern changed:
llvm/test/CodeGen/ARM/memset-align.ll

Harbormaster completed remote builds in B205563: Diff 486119.Jan 3 2023, 5:15 PM

Gentle ping.

Rebase && Gentle ping.

Harbormaster completed remote builds in B210890: Diff 493461.Jan 30 2023, 5:39 PM

tingwang mentioned this in D139691: [PowerPC] add a peephole to remove redundant swap instructions created by expandVSXStoreForLE.Feb 5 2023, 5:46 PM

While I am not principally against this approach, it doesn't really give me a great feeling going in this direction. The issue is more widespread than just memset/memcpy/memmove as Kai's example illustrates.
I wonder if it would be a more complete solution to add a DAG combine that looks up the chain from the store to see if a store of a splat of the same value exists. That should certainly cover both examples.
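The chain walk suggested here can be pictured with a toy model. The sketch below is not LLVM code: `ToyStore`, its fields, `findSplatStore` and the `maxNodes` bound are all hypothetical and only show the shape of "look up the chain for a store of a splat of the same value".

```cpp
#include <cstdint>

// Toy stand-in for a store node on a chain; none of these names are LLVM APIs.
struct ToyStore {
  const ToyStore *chain;    // previous store on the chain, or nullptr
  bool isSplatVector;       // true if the stored value is a vector splat
  std::uint64_t splatValue; // per-element value when isSplatVector is true
};

// Walk up the chain from `start` (exclusive), visiting at most `maxNodes`
// predecessors, and return the first splat-vector store whose element value
// equals `scalar`. Returns nullptr if none is found within the bound.
const ToyStore *findSplatStore(const ToyStore *start, std::uint64_t scalar,
                               unsigned maxNodes) {
  const ToyStore *cur = start ? start->chain : nullptr;
  for (unsigned n = 0; cur && n < maxNodes; ++n, cur = cur->chain) {
    if (cur->isSplatVector && cur->splatValue == scalar)
      return cur;
  }
  return nullptr;
}
```

Bounding the walk keeps the cost of the search predictable; a real combine would additionally have to check types, offsets and memory ordering, which this sketch leaves out.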

I will try to get similar results by DAG combine. Thanks to Nemanja and Kai's insight!

tingwang added a parent revision: D144235: [PowerPC][NFC] add const-splat-array-init.ll.Feb 16 2023, 10:52 PM

tingwang retitled this revision from [SelectionDAG][PowerPC] Memset reuse vector element for tail store to [PowerPC] find and reuse ConstantSplatVector to combine constant store into extract and store. Feb 16 2023, 11:06 PM

tingwang edited the summary of this revision.

tingwang added reviewers: lkail, ecnelises.

tingwang removed a project: Restricted Project.

Herald added a project: Restricted Project. Feb 16 2023, 11:06 PM

Redid the implementation; now both memset and constant-splat array initialization are affected.

Herald added a subscriber: qcolombet. Feb 16 2023, 11:12 PM

Harbormaster completed remote builds in B214332: Diff 498255.Feb 16 2023, 11:13 PM

Plan to continue improving the patch...

(1) Format code to follow coding style guidance.
(2) Fix SplatValue check.
(3) Remaining redundant instructions like mtfprd will be fixed in separate patches.

Harbormaster completed remote builds in B216705: Diff 501485.Mar 1 2023, 6:11 AM

(1) Update element type ElemTy which now matches the type expected by both STFIWX and STXSIX PPCISD nodes.
(2) Add missing match pattern for PPCstxsix.

Harbormaster completed remote builds in B216925: Diff 501775.Mar 2 2023, 12:40 AM

Update StoreSizeInBits check to skip on PowerOf2 bit size less than 8.

Harbormaster completed remote builds in B217094: Diff 502028.Mar 2 2023, 5:15 PM

Attempt to push the logic into DAGCombiner::mergeConsecutiveStores()...

Sorry, I had unsubmitted comments. Not sure if they still apply.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
14854	Nit: you don't need the name of the function here.
14868	What if `dyn_cast` returns `null` (i.e. if operand 1 is not a constant)?
14877–14879	We don't need to construct an `APInt` just to check whether it is a power of 2. You can just use `isPowerOf2_64()` from `MathExtras.h`.
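For reference, the power-of-two test mentioned in the last comment needs no arbitrary-precision integer. The following hand-written check on a plain 64-bit value is a sketch intended to behave like the `isPowerOf2_64()` helper the reviewer names; the function name `isPowerOfTwo64` is made up for this example.

```cpp
#include <cstdint>

// Power-of-two test on a plain 64-bit integer: true only when exactly one bit
// is set. Zero is not a power of two.
constexpr bool isPowerOfTwo64(std::uint64_t value) {
  return value != 0 && (value & (value - 1)) == 0;
}
```

Clearing the lowest set bit with `value & (value - 1)` leaves zero exactly when a single bit was set.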

In D138883#4181453, @nemanjai wrote:

Sorry, I had unsubmitted comments. Not sure if they still apply.

Hi Nemanja,

Appreciate your help! I planned the change because DAGCombiner::getStoreMergeCandidates() already walks the chain of stores, and I realized it could be a better place to find candidates for this opportunity. By the way, the constant-splat criterion could perhaps be relaxed to match only the subsection that is extracted by the target store; I would like to give that a try.

I hope next version will be final for review. Thank you again for taking time looking into this!

Since the canCombineStoreAndExtract() target hook does not work well on PPC, I reconsidered and am using this approach to do the combine.

Change in this version:
(1) Address comments from previous review.
(2) Add a check to make sure we do not combine truncating stores or stores that return a value.

Harbormaster completed remote builds in B223906: Diff 511247.Apr 5 2023, 6:32 PM

Gentle ping. Since the alternative path (D146602/D146610) does not look promising, shall we take this approach forward? Any comments are welcome. Thank you!

Minor update:
(1) Add an "is a multiple of" bit-width check for the isSplat() call.
(2) Reduce MaxSearchNodes from 4 to 3, this is the minimum setting to allow target patterns in test cases.

And ping...

Harbormaster completed remote builds in B237197: Diff 529204.Jun 7 2023, 1:01 AM

Rebase and added some comments.

Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you!

Harbormaster completed remote builds in B243642: Diff 537959.Jul 6 2023, 8:21 PM

In D138883#4479380, @tingwang wrote:

Rebase and added some comments.

Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you!

I am really sorry about the delay...
While I am not completely opposed to this, it seems like a fair bit of machinery to add for something that we could solve more simply with unaligned stores (i.e. the same way we would codegen a memset where the tail is *not* a power of 2).

I don't think that

xxspltib 0, 165
li 4, 16
stxsibx 0, 3, 4
stxv 0, 0(3)

is any better than

xxspltib 0, 165
li 4, 1
stxvx 0, 3, 4
stxv 0, 0(3)

Is it possible to do something like that without walking the chain and with existing capabilities?
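The unaligned-store alternative raised in this comment amounts to covering a tail with a second full-width store that overlaps the first. A minimal sketch in plain C++, assuming a 16-byte store width and a hypothetical helper name `memsetWithOverlappingStores`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill `size` bytes (16 < size <= 32) using two 16-byte stores. The second
// store is moved back so that it ends exactly at dst + size; it overlaps the
// first store whenever size < 32, so no bytes past the end are written.
void memsetWithOverlappingStores(void *dst, std::uint8_t fill,
                                 std::size_t size) {
  assert(size > 16 && size <= 32);
  unsigned char vec[16];
  std::memset(vec, fill, sizeof(vec)); // splat of the fill byte
  auto *p = static_cast<unsigned char *>(dst);
  std::memcpy(p, vec, 16);               // store at offset 0
  std::memcpy(p + (size - 16), vec, 16); // unaligned store at offset size - 16
}
```

For a 17-byte fill the second store lands at offset 1, which corresponds to the `li 4, 1` / `stxvx` pair in the second assembly listing above.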

Hi @nemanjai, appreciate your time looking into this patch. Thank you!

I agree with you, and I think walking the chain burns CPU cycles without achieving anything. I realized it is difficult for me to pursue both targets (the original memset case and the one @lkail mentioned) in this patch, so I would like to drop the second target in order to focus on the first one.

I like the idea of using an unaligned store, and quickly tested it to check for potential issues. I created memset.c with multiple memset(p, 0xXY, 24); lines to stress the performance. According to my numbers from Power10, extract-and-store (https://reviews.llvm.org/D138883?id=493461) was 17% faster than baseline, whereas the unaligned store was about 30% slower than baseline.

From a performance perspective, I think I should pursue the original approach. However, since the canCombineStoreAndExtract target hook has proved not beneficial on PPC (https://reviews.llvm.org/D146602), I probably need to create a PPC-only hook for now.

Let me know if you have any comments. I will post the patch shortly.

Attachment: memset.c (92 KB)

Returned to the initial proposal after exploring different approaches. Since canCombineStoreAndExtract() is not beneficial on PPC, I created another filter for PPC.

Harbormaster completed remote builds in B244374: Diff 538961.Jul 11 2023, 1:45 AM

tingwang added inline comments.Jul 11 2023, 1:50 AM

llvm/test/CodeGen/PowerPC/memset-tail.ll
195	Will be eliminated by https://reviews.llvm.org/D139193.
237	Plan to address this pattern in separate patch.
246	Will be eliminated by https://reviews.llvm.org/D139193.
261	Plan to address this pattern in separate patch.
299	Will be eliminated by https://reviews.llvm.org/D139193.
353	Will be eliminated by https://reviews.llvm.org/D139193.
474	Plan to address this pattern in separate patch.
474	Plan to address this pattern in separate patch.

tingwang mentioned this in rG0bcef1d93de8: [PowerPC] remove XXSWAPD after vector splat immediate.Jul 11 2023, 9:59 PM

(1) Rebase after commit D139193.
(2) Add two P10 patterns to match extract-and-store.

Now test case is clean. Gentle ping.

Harbormaster completed remote builds in B244667: Diff 539375.Jul 12 2023, 1:54 AM

Rebase && Ping.

Harbormaster completed remote builds in B249970: Diff 546725.Aug 3 2023, 1:57 AM

Gentle ping...

Herald added a subscriber: sunshaoce. Aug 20 2023, 5:18 PM

shchenz added inline comments.Sep 4 2023, 7:45 PM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
16828	If using `stfd` is allowed for tail size 5/6/7, then can we use `stfd` for tail size 3/4 too? (I assume the change here impacts cases `memsetTailV1B3` and `memsetTailV1B4`?)
llvm/test/CodeGen/PowerPC/memset-tail.ll
245	This seems to be a legacy issue, because I found the same issue in case `memsetTailV1B12` and also on the left side of this case. Is it safe to extend the store length from 23 bytes to 32 (or 24) bytes here? Nothing indicates that the memory after `(char *)p + 7` is writable by the user. The related logic is in `allowsMisalignedMemoryAccesses()`. But is it correct to assume that this memset can write more memory, even when it handles aligned memory? What do you think? @nemanjai

shchenz added inline comments.Sep 4 2023, 8:12 PM

llvm/test/CodeGen/PowerPC/memset-tail.ll
245	Sorry, please ignore this comment. I didn't realize that the two stores `stxsdx` and `stxvd2x` have overlaps. So the real write size is not extended.

tingwang added inline comments.Sep 5 2023, 12:30 AM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
16828	It seems `TargetLowering::findOptimalMemOpLowering()` decides the type of each store. I guess if we change the type for the size 3/4 case from i32 to i64, then it will result in stfd.
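To make the "type of each store" point more concrete, here is a simplified toy model of a greedy store planner. It is an assumption for illustration only and not a copy of `TargetLowering::findOptimalMemOpLowering()`: widths are limited to 16/8/4/2/1 bytes, alignment and type legality are ignored, and `ToyStoreOp` and `planStores` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// One planned store: byte offset from the destination and width in bytes.
struct ToyStoreOp {
  std::size_t offset;
  std::size_t width;
};

// Greedily cover `size` bytes, starting with 16-byte stores. When the current
// width no longer fits, either narrow it, or (if overlap is allowed, a store
// was already emitted, and the next narrower width could not cover the rest)
// keep the width and slide the final store back so it overlaps the previous one.
std::vector<ToyStoreOp> planStores(std::size_t size, bool allowOverlap) {
  std::vector<ToyStoreOp> ops;
  std::size_t width = 16, offset = 0, remaining = size;
  while (remaining > 0) {
    bool overlap = false;
    while (width > remaining) {
      const std::size_t narrower = width / 2;
      if (allowOverlap && !ops.empty() && narrower < remaining) {
        overlap = true;
        break;
      }
      width = narrower;
    }
    if (overlap) {
      ops.push_back({offset - (width - remaining), width});
      remaining = 0;
    } else {
      ops.push_back({offset, width});
      offset += width;
      remaining -= width;
    }
  }
  return ops;
}
```

In this model, `planStores(23, true)` yields a 16-byte store at offset 0 followed by an 8-byte store at offset 15, the same kind of overlapping pair discussed for the 23-byte case, while a 3-byte tail after one vector store becomes a single overlapping 4-byte store.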

This LGTM with some nits.

Let's first target the memset cases, as this is the common case where splat values occur.

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7271	Nit: this comment needs update?
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1617	nit: We may need a comment here explaining why we don't try to extract a constant for `ElemSizeInBits` 8/16. (I guess the reason is that there is no benefit, as we need `li` to load the index and this `li` can also be used to load the 8/16-bit imm?)
16828	Thanks. Better to add some comment here why we need to set the type to `MVT::v8i16`

This revision is now accepted and ready to land.Sep 5 2023, 7:29 PM

In D138883#4639161, @shchenz wrote:

This LGTM with some nits.

Let's first target for the memset cases as this is the common case where splat values happens.

Thank you! I will address the remaining comments in the commit.

Closed by commit rG71be020dda2c: [SelectionDAG][PowerPC] Memset reuse vector element for tail store (authored by tingwang). Sep 5 2023, 10:55 PM

This revision was automatically updated to reflect the committed changes.

tingwang added a commit: rG71be020dda2c: [SelectionDAG][PowerPC] Memset reuse vector element for tail store.

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp	34 lines
llvm/lib/Target/PowerPC/PPCISelLowering.h	3 lines
llvm/lib/Target/PowerPC/PPCISelLowering.cpp	48 lines
llvm/test/CodeGen/ARM/memset-align.ll	4 lines
llvm/test/CodeGen/PowerPC/memset-tail.ll	134 lines
llvm/test/CodeGen/PowerPC/p10-fi-elim.ll	40 lines

Diff 478827

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

@@ 7,262 lines not shown @@ for (unsigned i = 0; i < NumMemOps; i++) {
     if (VTSize > Size) {
       // Issuing an unaligned load / store pair that overlaps with the previous
       // pair. Adjust the offset accordingly.
       assert(i == NumMemOps-1 && i != 0);
       DstOff -= VTSize - Size;
     }

     // If this store is smaller than the largest store see whether we can get
     // the smaller value for free with a truncate.
     SDValue Value = MemSetValue;
     if (VT.bitsLT(LargestVT)) {
+      // Helper function to query least cost index for CombineStoreAndExtract.
+      const auto QueryIndex = [](const TargetLowering &TLI, SelectionDAG &DAG,
+                                 EVT Type, unsigned NumElts, unsigned &Index) {
+        Index = -1U;
+        unsigned Cost = std::numeric_limits<unsigned>::max();
+        bool Ret = false;
+        for (unsigned i = 0; i < NumElts; ++i) {
+          unsigned TmpC;
+          if (TLI.canCombineStoreAndExtract(
+                  Type.getTypeForEVT(*DAG.getContext()),
+                  ConstantInt::get(*DAG.getContext(), APInt(8, i)), TmpC) &&
+              TmpC < Cost) {
+            Cost = TmpC;
+            Index = i;
+            Ret = true;
+          }
+        }
+        return Ret;
+      };
+
+      unsigned Index;
+      unsigned NElts = LargestVT.getSizeInBits() / VT.getSizeInBits();
+      EVT SVT = EVT::getVectorVT(*DAG.getContext(), VT.getScalarType(), NElts);
       if (!LargestVT.isVector() && !VT.isVector() &&
           TLI.isTruncateFree(LargestVT, VT))
         Value = DAG.getNode(ISD::TRUNCATE, dl, VT, MemSetValue);
-      else
+      else if (LargestVT.isVector() && !VT.isVector() &&
+               QueryIndex(TLI, DAG, SVT, NElts, Index) &&
+               TLI.isTypeLegal(SVT) &&
+               LargestVT.getSizeInBits() == SVT.getSizeInBits()) {
+        // Target which can combine store(extractelement VectorTy, Idx) can get
+        // the smaller value for free.
+        SDValue TailValue = DAG.getNode(ISD::BITCAST, dl, SVT, MemSetValue);
+        Value = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, TailValue,
+                            DAG.getVectorIdxConstant(Index, dl));
+      } else
         Value = getMemsetValue(Src, VT, DAG, dl);
     }
     assert(Value.getValueType() == VT && "Value with wrong type.");
     SDValue Store = DAG.getStore(
         Chain, dl, Value,
         DAG.getMemBasePlusOffset(Dst, TypeSize::Fixed(DstOff), dl),
         DstPtrInfo.getWithOffset(DstOff), Alignment,
         isVol ? MachineMemOperand::MOVolatile : MachineMemOperand::MONone,
         NewAAInfo);
@@ 4,830 lines not shown @@

llvm/lib/Target/PowerPC/PPCISelLowering.h

@@ 798 lines not shown @@ public:
     bool isCheapToSpeculateCttz(Type *Ty) const override {
       return true;
     }

     bool isCheapToSpeculateCtlz(Type *Ty) const override {
       return true;
     }

+    bool canCombineStoreAndExtract(Type *VectorTy, Value *Idx,
+                                   unsigned &Cost) const override;
+
     bool isCtlzFast() const override {
       return true;
     }

     bool isEqualityCmpFoldedWithSignedCmp() const override {
       return false;
     }

@@ 689 lines not shown @@

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

@@ 1,602 lines not shown @@
 bool PPCTargetLowering::hasSPE() const {
   return Subtarget.hasSPE();
 }

 bool PPCTargetLowering::preferIncOfAddToSubOfNot(EVT VT) const {
   return VT.isScalarInteger();
 }

+bool PPCTargetLowering::canCombineStoreAndExtract(Type *VectorTy, Value *Idx,
+                                                  unsigned &Cost) const {
+  if (!Subtarget.isPPC64() || !Subtarget.hasVSX())
+    return false;
+
+  if (!isa<ConstantInt>(Idx))
+    return false;
+
+  if (auto *VTy = dyn_cast<VectorType>(VectorTy)) {
+    if (VTy->getScalarType()->isIntegerTy()) {
+      unsigned BitWidth = VTy->getScalarSizeInBits();
+      unsigned ElemIdx;
+      // Accept the combine only if the element index matches the one that can
+      // be directly move-from VSR.
+      if (BitWidth == 32) {
+        ElemIdx = Subtarget.isLittleEndian() ? 2 : 1;
+      } else if (BitWidth == 64) {
+        ElemIdx = Subtarget.isLittleEndian() ? 1 : 0;
+      } else {
+        return false;
+      }
+      if (cast<ConstantInt>(Idx)->getZExtValue() == ElemIdx) {
+        Cost = 1;
+        return true;
+      }
+    }
+  }
+  return false;
+}
+
 const char *PPCTargetLowering::getTargetNodeName(unsigned Opcode) const {
   switch ((PPCISD::NodeType)Opcode) {
   case PPCISD::FIRST_NUMBER: break;
   case PPCISD::FSEL: return "PPCISD::FSEL";
   case PPCISD::XSMAXC: return "PPCISD::XSMAXC";
   case PPCISD::XSMINC: return "PPCISD::XSMINC";
   case PPCISD::FCFID: return "PPCISD::FCFID";
   case PPCISD::FCFIDU: return "PPCISD::FCFIDU";
@@ 13,197 lines not shown @@ SDValue PPCTargetLowering::expandVSXStoreForLE(SDNode *N,
   SDValue StoreOps[] = { Chain, Swap, Base };
   SDValue Store = DAG.getMemIntrinsicNode(PPCISD::STXVD2X, dl,
                                           DAG.getVTList(MVT::Other),
                                           StoreOps, VecTy, MMO);
   DCI.AddToWorklist(Store.getNode());
   return Store;
 }

 // Handle DAG combine for STORE (FP_TO_INT F).
 SDValue PPCTargetLowering::combineStoreFPToInt(SDNode *N,
                                                DAGCombinerInfo &DCI) const {

   SelectionDAG &DAG = DCI.DAG;
   SDLoc dl(N);
   unsigned Opcode = N->getOperand(1).getOpcode();

   assert((Opcode == ISD::FP_TO_SINT || Opcode == ISD::FP_TO_UINT)
          && "Not a FP_TO_INT Instruction!");

   SDValue Val = N->getOperand(1).getOperand(0);
   EVT Op1VT = N->getOperand(1).getValueType();
   EVT ResVT = Val.getValueType();

   if (!isTypeLegal(ResVT))
     return SDValue();

   // Only perform combine for conversion to i64/i32 or power9 i16/i8.
   bool ValidTypeForStoreFltAsInt =
         (Op1VT == MVT::i32 || Op1VT == MVT::i64 ||
          (Subtarget.hasP9Vector() && (Op1VT == MVT::i16 || Op1VT == MVT::i8)));

   if (ResVT == MVT::f128 && !Subtarget.hasP9Vector())
     return SDValue();

   if (ResVT == MVT::ppcf128 || !Subtarget.hasP8Vector() ||
       cast<StoreSDNode>(N)->isTruncatingStore() || !ValidTypeForStoreFltAsInt)
     return SDValue();

   // Extend f32 values to f64
   if (ResVT.getScalarSizeInBits() == 32) {
     Val = DAG.getNode(ISD::FP_EXTEND, dl, MVT::f64, Val);
     DCI.AddToWorklist(Val.getNode());
@@ 1,926 lines not shown @@

 /// It returns EVT::Other if the type should be determined using generic
 /// target-independent logic.
 EVT PPCTargetLowering::getOptimalMemOpType(
     const MemOp &Op, const AttributeList &FuncAttributes) const {
   if (getTargetMachine().getOptLevel() != CodeGenOpt::None) {
     // We should use Altivec/VSX loads and stores when available. For unaligned
     // addresses, unaligned VSX loads are only fast starting with the P8.
-    if (Subtarget.hasAltivec() && Op.size() >= 16 &&
-        (Op.isAligned(Align(16)) ||
-         ((Op.isMemset() && Subtarget.hasVSX()) || Subtarget.hasP8Vector())))
-      return MVT::v4i32;
+    if (Subtarget.hasAltivec() && Op.size() >= 16) {
+      if (Op.isMemset() && Subtarget.hasVSX()) {
+        uint64_t TailSize = Op.size() % 16;
+        // For memset lowering, tail size need be different from vector element
+        // size to allow borrow tail from vector, otherwise constant tail will
+        // be generated.
+        if (TailSize > 2 && TailSize <= 4) {
+          return MVT::v8i16;
+        }
+        return MVT::v4i32;
+      }
+      if (Op.isAligned(Align(16)) || Subtarget.hasP8Vector())
+        return MVT::v4i32;
+    }
   }

   if (Subtarget.isPPC64()) {
     return MVT::i64;
   }

   return MVT::i32;
 }

@@ 1,607 lines not shown @@

llvm/test/CodeGen/ARM/memset-align.ll

	Show All 12 Lines
	; CHECK-NEXT: vmov.i32 q8, #0x0			; CHECK-NEXT: vmov.i32 q8, #0x0
	; CHECK-NEXT: mov r0, sp			; CHECK-NEXT: mov r0, sp
	; CHECK-NEXT: mov.w r1, #-1			; CHECK-NEXT: mov.w r1, #-1
	; CHECK-NEXT: mov r2, r0			; CHECK-NEXT: mov r2, r0
	; CHECK-NEXT: strd r1, r1, [sp, #8]			; CHECK-NEXT: strd r1, r1, [sp, #8]
	; CHECK-NEXT: strd r1, r1, [sp]			; CHECK-NEXT: strd r1, r1, [sp]
	; CHECK-NEXT: vst1.64 {d16, d17}, [r2]!			; CHECK-NEXT: vst1.64 {d16, d17}, [r2]!
	; CHECK-NEXT: str r1, [r2]			; CHECK-NEXT: str r1, [r2]
				; CHECK-NEXT: add.w r2, r0, #15
				tingwang (author): Hello @dmgreen @asavonic. This patch tries to reuse a vector element for the tail store in memset by implementing `canCombineStoreAndExtract()` on PPC. This change introduced a test case change on ARM in llvm/test/CodeGen/ARM/memset-align.ll. Could you please help me check whether the change looks good? Thank you! I looked into the scenario on ARM: if the i8 fill value of the memset is zero, it creates a vector for the initial 16B and a constant tail for the remaining bytes, which is exactly the scenario this patch targets. For other values, it creates i32 stores for the memset and will not be impacted by this patch.
				asavonic: This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.
				tingwang (author), quoting asavonic ("This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right."): Thank you for confirming!
				; CHECK-NEXT: vst1.32 {d16[0]}, [r2]
	; CHECK-NEXT: str r1, [sp, #20]			; CHECK-NEXT: str r1, [sp, #20]
	; CHECK-NEXT: movs r1, #0
	; CHECK-NEXT: str.w r1, [sp, #15]
	; CHECK-NEXT: bl callee			; CHECK-NEXT: bl callee
	; CHECK-NEXT: add sp, #24			; CHECK-NEXT: add sp, #24
	; CHECK-NEXT: pop {r7, pc}			; CHECK-NEXT: pop {r7, pc}
	entry:			entry:
	%a = alloca %struct.af, align 8			%a = alloca %struct.af, align 8
	%0 = bitcast %struct.af* %a to i8*			%0 = bitcast %struct.af* %a to i8*
	%1 = bitcast %struct.af* %a to i8*			%1 = bitcast %struct.af* %a to i8*
	call void @llvm.memset.p0i8.i64(i8* align 8 %1, i8 -1, i64 24, i1 false)			call void @llvm.memset.p0i8.i64(i8* align 8 %1, i8 -1, i64 24, i1 false)
	call void @llvm.memset.p0i8.i64(i8* align 8 %0, i8 0, i64 19, i1 false)			call void @llvm.memset.p0i8.i64(i8* align 8 %0, i8 0, i64 19, i1 false)
	call void @callee(%struct.af* %a)			call void @callee(%struct.af* %a)
	ret void			ret void
	}			}

	declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg)			declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg)
	declare void @callee(%struct.af*) local_unnamed_addr #1			declare void @callee(%struct.af*) local_unnamed_addr #1

llvm/test/CodeGen/PowerPC/memset-tail.ll

[... 163 lines not shown ...]
		entry:
tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 25, i1 false)		tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 25, i1 false)
ret void		ret void
}		}

define dso_local void @memsetTailV1B8(ptr nocapture noundef writeonly %p) local_unnamed_addr {		define dso_local void @memsetTailV1B8(ptr nocapture noundef writeonly %p) local_unnamed_addr {
; P8-BE-LABEL: memsetTailV1B8:		; P8-BE-LABEL: memsetTailV1B8:
; P8-BE: # %bb.0: # %entry		; P8-BE: # %bb.0: # %entry
; P8-BE-NEXT: vspltisb 2, 15		; P8-BE-NEXT: vspltisb 2, 15
; P8-BE-NEXT: lis 4, 3855		; P8-BE-NEXT: li 4, 16
; P8-BE-NEXT: ori 4, 4, 3855		; P8-BE-NEXT: stxsdx 34, 3, 4
; P8-BE-NEXT: rldimi 4, 4, 32, 0
; P8-BE-NEXT: stxvw4x 34, 0, 3		; P8-BE-NEXT: stxvw4x 34, 0, 3
; P8-BE-NEXT: std 4, 16(3)
; P8-BE-NEXT: blr		; P8-BE-NEXT: blr
;		;
; P9-BE-LABEL: memsetTailV1B8:		; P9-BE-LABEL: memsetTailV1B8:
; P9-BE: # %bb.0: # %entry		; P9-BE: # %bb.0: # %entry
; P9-BE-NEXT: lis 4, 3855
; P9-BE-NEXT: xxspltib 0, 15		; P9-BE-NEXT: xxspltib 0, 15
; P9-BE-NEXT: ori 4, 4, 3855
; P9-BE-NEXT: stxv 0, 0(3)		; P9-BE-NEXT: stxv 0, 0(3)
; P9-BE-NEXT: rldimi 4, 4, 32, 0		; P9-BE-NEXT: stfd 0, 16(3)
; P9-BE-NEXT: std 4, 16(3)
; P9-BE-NEXT: blr		; P9-BE-NEXT: blr
;		;
; P10-BE-LABEL: memsetTailV1B8:		; P10-BE-LABEL: memsetTailV1B8:
; P10-BE: # %bb.0: # %entry		; P10-BE: # %bb.0: # %entry
; P10-BE-NEXT: pli 4, 252645135
; P10-BE-NEXT: rldimi 4, 4, 32, 0
; P10-BE-NEXT: std 4, 16(3)
; P10-BE-NEXT: xxspltib 0, 15		; P10-BE-NEXT: xxspltib 0, 15
; P10-BE-NEXT: stxv 0, 0(3)		; P10-BE-NEXT: stxv 0, 0(3)
		; P10-BE-NEXT: stfd 0, 16(3)
; P10-BE-NEXT: blr		; P10-BE-NEXT: blr
;		;
; P8-LE-LABEL: memsetTailV1B8:		; P8-LE-LABEL: memsetTailV1B8:
; P8-LE: # %bb.0: # %entry		; P8-LE: # %bb.0: # %entry
; P8-LE-NEXT: lis 4, 3855
; P8-LE-NEXT: vspltisb 2, 15		; P8-LE-NEXT: vspltisb 2, 15
; P8-LE-NEXT: ori 4, 4, 3855		; P8-LE-NEXT: li 4, 16
; P8-LE-NEXT: rldimi 4, 4, 32, 0		; P8-LE-NEXT: xxswapd 0, 34
		nemanjai: Why do we now get the redundant swap for the vector store that we didn't get before? Was it eliminated by the swap elimination before and now it is not because we have a use of the partial vector?
		tingwang (author): Debug-only `ppc-vsx-swaps` shows "Web 0 rejected for physreg, partial reg, or not swap[pable]". I will look into it and probably post another patch to fix the issue. Thank you!
		tingwang (author): Will be eliminated by https://reviews.llvm.org/D139193.
; P8-LE-NEXT: std 4, 16(3)		; P8-LE-NEXT: stxsdx 34, 3, 4
; P8-LE-NEXT: stxvd2x 34, 0, 3		; P8-LE-NEXT: stxvd2x 0, 0, 3
; P8-LE-NEXT: blr		; P8-LE-NEXT: blr
;		;
; P9-LE-LABEL: memsetTailV1B8:		; P9-LE-LABEL: memsetTailV1B8:
; P9-LE: # %bb.0: # %entry		; P9-LE: # %bb.0: # %entry
; P9-LE-NEXT: lis 4, 3855
; P9-LE-NEXT: xxspltib 0, 15		; P9-LE-NEXT: xxspltib 0, 15
; P9-LE-NEXT: ori 4, 4, 3855
; P9-LE-NEXT: stxv 0, 0(3)		; P9-LE-NEXT: stxv 0, 0(3)
; P9-LE-NEXT: rldimi 4, 4, 32, 0		; P9-LE-NEXT: stfd 0, 16(3)
; P9-LE-NEXT: std 4, 16(3)
; P9-LE-NEXT: blr		; P9-LE-NEXT: blr
;		;
; P10-LE-LABEL: memsetTailV1B8:		; P10-LE-LABEL: memsetTailV1B8:
; P10-LE: # %bb.0: # %entry		; P10-LE: # %bb.0: # %entry
; P10-LE-NEXT: pli 4, 252645135
; P10-LE-NEXT: rldimi 4, 4, 32, 0
; P10-LE-NEXT: std 4, 16(3)
; P10-LE-NEXT: xxspltib 0, 15		; P10-LE-NEXT: xxspltib 0, 15
; P10-LE-NEXT: stxv 0, 0(3)		; P10-LE-NEXT: stxv 0, 0(3)
		; P10-LE-NEXT: stfd 0, 16(3)
; P10-LE-NEXT: blr		; P10-LE-NEXT: blr
entry:		entry:
tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 24, i1 false)		tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 24, i1 false)
ret void		ret void
}		}

define dso_local void @memsetTailV1B7(ptr nocapture noundef writeonly %p) local_unnamed_addr {		define dso_local void @memsetTailV1B7(ptr nocapture noundef writeonly %p) local_unnamed_addr {
; P8-BE-LABEL: memsetTailV1B7:		; P8-BE-LABEL: memsetTailV1B7:
; P8-BE: # %bb.0: # %entry		; P8-BE: # %bb.0: # %entry
; P8-BE-NEXT: lis 4, 3855
; P8-BE-NEXT: vspltisb 2, 15		; P8-BE-NEXT: vspltisb 2, 15
; P8-BE-NEXT: li 5, 15		; P8-BE-NEXT: li 4, 15
; P8-BE-NEXT: ori 4, 4, 3855		; P8-BE-NEXT: stxsdx 34, 3, 4
; P8-BE-NEXT: rldimi 4, 4, 32, 0
; P8-BE-NEXT: stdx 4, 3, 5
; P8-BE-NEXT: stxvw4x 34, 0, 3		; P8-BE-NEXT: stxvw4x 34, 0, 3
; P8-BE-NEXT: blr		; P8-BE-NEXT: blr
;		;
; P9-BE-LABEL: memsetTailV1B7:		; P9-BE-LABEL: memsetTailV1B7:
; P9-BE: # %bb.0: # %entry		; P9-BE: # %bb.0: # %entry
; P9-BE-NEXT: lis 4, 3855
; P9-BE-NEXT: li 5, 15
; P9-BE-NEXT: ori 4, 4, 3855
; P9-BE-NEXT: rldimi 4, 4, 32, 0
; P9-BE-NEXT: stdx 4, 3, 5
; P9-BE-NEXT: xxspltib 0, 15		; P9-BE-NEXT: xxspltib 0, 15
		; P9-BE-NEXT: stfd 0, 15(3)
; P9-BE-NEXT: stxv 0, 0(3)		; P9-BE-NEXT: stxv 0, 0(3)
; P9-BE-NEXT: blr		; P9-BE-NEXT: blr
;		;
; P10-BE-LABEL: memsetTailV1B7:		; P10-BE-LABEL: memsetTailV1B7:
; P10-BE: # %bb.0: # %entry		; P10-BE: # %bb.0: # %entry
; P10-BE-NEXT: pli 4, 252645135
; P10-BE-NEXT: rldimi 4, 4, 32, 0
; P10-BE-NEXT: pstd 4, 15(3), 0
; P10-BE-NEXT: xxspltib 0, 15		; P10-BE-NEXT: xxspltib 0, 15
		; P10-BE-NEXT: mffprd 4, 0
		tingwang (author): Plan to address this pattern in a separate patch.
; P10-BE-NEXT: stxv 0, 0(3)		; P10-BE-NEXT: stxv 0, 0(3)
		; P10-BE-NEXT: pstd 4, 15(3), 0
; P10-BE-NEXT: blr		; P10-BE-NEXT: blr
;		;
; P8-LE-LABEL: memsetTailV1B7:		; P8-LE-LABEL: memsetTailV1B7:
; P8-LE: # %bb.0: # %entry		; P8-LE: # %bb.0: # %entry
; P8-LE-NEXT: lis 4, 3855
; P8-LE-NEXT: vspltisb 2, 15		; P8-LE-NEXT: vspltisb 2, 15
; P8-LE-NEXT: li 5, 15		; P8-LE-NEXT: li 4, 15
		shchenz: This seems to be a legacy issue, because I also found the same issue in case `memsetTailV1B12` and on the left side of this case. Is it safe to extend the store length from 23 bytes to 32 (or 24) bytes here? Nothing says that memory after `(char *)p + 7` is writable by the user. The related logic is in `allowsMisalignedMemoryAccesses()`. But is it correct to assume that this memset can write more memory, even when it handles aligned memory? What do you think, @nemanjai?
		shchenz: Sorry, please ignore this comment. I didn't realize that the two stores `stxsdx` and `stxvd2x` overlap, so the real write size is not extended.
; P8-LE-NEXT: ori 4, 4, 3855		; P8-LE-NEXT: xxswapd 0, 34
		tingwang (author): Will be eliminated by https://reviews.llvm.org/D139193.
; P8-LE-NEXT: rldimi 4, 4, 32, 0		; P8-LE-NEXT: stxsdx 34, 3, 4
; P8-LE-NEXT: stdx 4, 3, 5		; P8-LE-NEXT: stxvd2x 0, 0, 3
; P8-LE-NEXT: stxvd2x 34, 0, 3
; P8-LE-NEXT: blr		; P8-LE-NEXT: blr
;		;
; P9-LE-LABEL: memsetTailV1B7:		; P9-LE-LABEL: memsetTailV1B7:
; P9-LE: # %bb.0: # %entry		; P9-LE: # %bb.0: # %entry
; P9-LE-NEXT: lis 4, 3855
; P9-LE-NEXT: li 5, 15
; P9-LE-NEXT: ori 4, 4, 3855
; P9-LE-NEXT: rldimi 4, 4, 32, 0
; P9-LE-NEXT: stdx 4, 3, 5
; P9-LE-NEXT: xxspltib 0, 15		; P9-LE-NEXT: xxspltib 0, 15
		; P9-LE-NEXT: stfd 0, 15(3)
; P9-LE-NEXT: stxv 0, 0(3)		; P9-LE-NEXT: stxv 0, 0(3)
; P9-LE-NEXT: blr		; P9-LE-NEXT: blr
;		;
; P10-LE-LABEL: memsetTailV1B7:		; P10-LE-LABEL: memsetTailV1B7:
; P10-LE: # %bb.0: # %entry		; P10-LE: # %bb.0: # %entry
; P10-LE-NEXT: pli 4, 252645135
; P10-LE-NEXT: rldimi 4, 4, 32, 0
; P10-LE-NEXT: pstd 4, 15(3), 0
; P10-LE-NEXT: xxspltib 0, 15		; P10-LE-NEXT: xxspltib 0, 15
		; P10-LE-NEXT: mffprd 4, 0
		tingwang (author): Plan to address this pattern in a separate patch.
; P10-LE-NEXT: stxv 0, 0(3)		; P10-LE-NEXT: stxv 0, 0(3)
		; P10-LE-NEXT: pstd 4, 15(3), 0
; P10-LE-NEXT: blr		; P10-LE-NEXT: blr
entry:		entry:
tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 23, i1 false)		tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 23, i1 false)
ret void		ret void
}		}
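
The overlap that resolved the review question on `memsetTailV1B7` can be checked with a little arithmetic. The sketch below is illustrative only: `Store`, `endOfStores`, and `overlapBytes` are hypothetical helpers, not LLVM APIs. It models the 16-byte vector store at offset 0 and the 8-byte tail store at offset 15 seen in the P9 CHECK lines (`stxv 0, 0(3)` and `stfd 0, 15(3)`).

```cpp
#include <cstdint>

// Hypothetical description of one store: where it starts and how many bytes it writes.
struct Store {
  uint64_t Offset;
  uint64_t Bytes;
};

// One past the highest byte index written by either store.
inline uint64_t endOfStores(Store A, Store B) {
  uint64_t EndA = A.Offset + A.Bytes;
  uint64_t EndB = B.Offset + B.Bytes;
  return EndA > EndB ? EndA : EndB;
}

// Number of bytes written by both stores.
inline uint64_t overlapBytes(Store A, Store B) {
  uint64_t EndA = A.Offset + A.Bytes;
  uint64_t EndB = B.Offset + B.Bytes;
  uint64_t Lo = A.Offset > B.Offset ? A.Offset : B.Offset;
  uint64_t Hi = EndA < EndB ? EndA : EndB;
  return Hi > Lo ? Hi - Lo : 0;
}
```

With `Store{0, 16}` and `Store{15, 8}`, the two stores share one byte and the last byte written is at index 22, so the 23-byte memset does not write past its requested length.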

define dso_local void @memsetTailV1B4(ptr nocapture noundef writeonly %p) local_unnamed_addr {		define dso_local void @memsetTailV1B4(ptr nocapture noundef writeonly %p) local_unnamed_addr {
; P8-BE-LABEL: memsetTailV1B4:		; P8-BE-LABEL: memsetTailV1B4:
; P8-BE: # %bb.0: # %entry		; P8-BE: # %bb.0: # %entry
; P8-BE-NEXT: vspltisb 2, 15		; P8-BE-NEXT: vspltisb 2, 15
; P8-BE-NEXT: lis 4, 3855		; P8-BE-NEXT: li 4, 16
; P8-BE-NEXT: ori 4, 4, 3855		; P8-BE-NEXT: stxsiwx 34, 3, 4
; P8-BE-NEXT: stw 4, 16(3)
; P8-BE-NEXT: stxvw4x 34, 0, 3		; P8-BE-NEXT: stxvw4x 34, 0, 3
; P8-BE-NEXT: blr		; P8-BE-NEXT: blr
;		;
; P9-BE-LABEL: memsetTailV1B4:		; P9-BE-LABEL: memsetTailV1B4:
; P9-BE: # %bb.0: # %entry		; P9-BE: # %bb.0: # %entry
; P9-BE-NEXT: lis 4, 3855
; P9-BE-NEXT: ori 4, 4, 3855
; P9-BE-NEXT: stw 4, 16(3)
; P9-BE-NEXT: xxspltib 0, 15		; P9-BE-NEXT: xxspltib 0, 15
		; P9-BE-NEXT: li 4, 16
		; P9-BE-NEXT: stfiwx 0, 3, 4
; P9-BE-NEXT: stxv 0, 0(3)		; P9-BE-NEXT: stxv 0, 0(3)
; P9-BE-NEXT: blr		; P9-BE-NEXT: blr
;		;
; P10-BE-LABEL: memsetTailV1B4:		; P10-BE-LABEL: memsetTailV1B4:
; P10-BE: # %bb.0: # %entry		; P10-BE: # %bb.0: # %entry
; P10-BE-NEXT: pli 4, 252645135
; P10-BE-NEXT: stw 4, 16(3)
; P10-BE-NEXT: xxspltib 0, 15		; P10-BE-NEXT: xxspltib 0, 15
		; P10-BE-NEXT: li 4, 16
		; P10-BE-NEXT: stfiwx 0, 3, 4
; P10-BE-NEXT: stxv 0, 0(3)		; P10-BE-NEXT: stxv 0, 0(3)
; P10-BE-NEXT: blr		; P10-BE-NEXT: blr
;		;
; P8-LE-LABEL: memsetTailV1B4:		; P8-LE-LABEL: memsetTailV1B4:
; P8-LE: # %bb.0: # %entry		; P8-LE: # %bb.0: # %entry
; P8-LE-NEXT: vspltisb 2, 15		; P8-LE-NEXT: vspltisb 2, 15
; P8-LE-NEXT: lis 4, 3855		; P8-LE-NEXT: li 4, 16
; P8-LE-NEXT: ori 4, 4, 3855		; P8-LE-NEXT: xxswapd 0, 34
		tingwang (author): Will be eliminated by https://reviews.llvm.org/D139193.
; P8-LE-NEXT: stw 4, 16(3)		; P8-LE-NEXT: stxsiwx 34, 3, 4
; P8-LE-NEXT: stxvd2x 34, 0, 3		; P8-LE-NEXT: stxvd2x 0, 0, 3
; P8-LE-NEXT: blr		; P8-LE-NEXT: blr
;		;
; P9-LE-LABEL: memsetTailV1B4:		; P9-LE-LABEL: memsetTailV1B4:
; P9-LE: # %bb.0: # %entry		; P9-LE: # %bb.0: # %entry
; P9-LE-NEXT: lis 4, 3855
; P9-LE-NEXT: ori 4, 4, 3855
; P9-LE-NEXT: stw 4, 16(3)
; P9-LE-NEXT: xxspltib 0, 15		; P9-LE-NEXT: xxspltib 0, 15
		; P9-LE-NEXT: li 4, 16
		; P9-LE-NEXT: stfiwx 0, 3, 4
; P9-LE-NEXT: stxv 0, 0(3)		; P9-LE-NEXT: stxv 0, 0(3)
; P9-LE-NEXT: blr		; P9-LE-NEXT: blr
;		;
; P10-LE-LABEL: memsetTailV1B4:		; P10-LE-LABEL: memsetTailV1B4:
; P10-LE: # %bb.0: # %entry		; P10-LE: # %bb.0: # %entry
; P10-LE-NEXT: pli 4, 252645135
; P10-LE-NEXT: stw 4, 16(3)
; P10-LE-NEXT: xxspltib 0, 15		; P10-LE-NEXT: xxspltib 0, 15
		; P10-LE-NEXT: li 4, 16
		; P10-LE-NEXT: stfiwx 0, 3, 4
; P10-LE-NEXT: stxv 0, 0(3)		; P10-LE-NEXT: stxv 0, 0(3)
; P10-LE-NEXT: blr		; P10-LE-NEXT: blr
entry:		entry:
tail call void @llvm.memset.p0.i32(ptr %p, i8 15, i32 20, i1 false)		tail call void @llvm.memset.p0.i32(ptr %p, i8 15, i32 20, i1 false)
ret void		ret void
}		}

define dso_local void @memsetTailV1B3(ptr nocapture noundef writeonly %p) local_unnamed_addr {		define dso_local void @memsetTailV1B3(ptr nocapture noundef writeonly %p) local_unnamed_addr {
; P8-BE-LABEL: memsetTailV1B3:		; P8-BE-LABEL: memsetTailV1B3:
; P8-BE: # %bb.0: # %entry		; P8-BE: # %bb.0: # %entry
; P8-BE-NEXT: vspltisb 2, 15		; P8-BE-NEXT: vspltisb 2, 15
; P8-BE-NEXT: lis 4, 3855		; P8-BE-NEXT: li 4, 15
; P8-BE-NEXT: ori 4, 4, 3855		; P8-BE-NEXT: stxsiwx 34, 3, 4
; P8-BE-NEXT: stxvw4x 34, 0, 3		; P8-BE-NEXT: stxvw4x 34, 0, 3
; P8-BE-NEXT: stw 4, 15(3)
; P8-BE-NEXT: blr		; P8-BE-NEXT: blr
;		;
; P9-BE-LABEL: memsetTailV1B3:		; P9-BE-LABEL: memsetTailV1B3:
; P9-BE: # %bb.0: # %entry		; P9-BE: # %bb.0: # %entry
; P9-BE-NEXT: lis 4, 3855
; P9-BE-NEXT: ori 4, 4, 3855
; P9-BE-NEXT: stw 4, 15(3)
; P9-BE-NEXT: xxspltib 0, 15		; P9-BE-NEXT: xxspltib 0, 15
		; P9-BE-NEXT: li 4, 15
		; P9-BE-NEXT: stfiwx 0, 3, 4
; P9-BE-NEXT: stxv 0, 0(3)		; P9-BE-NEXT: stxv 0, 0(3)
; P9-BE-NEXT: blr		; P9-BE-NEXT: blr
;		;
; P10-BE-LABEL: memsetTailV1B3:		; P10-BE-LABEL: memsetTailV1B3:
; P10-BE: # %bb.0: # %entry		; P10-BE: # %bb.0: # %entry
; P10-BE-NEXT: pli 4, 252645135
; P10-BE-NEXT: stw 4, 15(3)
; P10-BE-NEXT: xxspltib 0, 15		; P10-BE-NEXT: xxspltib 0, 15
		; P10-BE-NEXT: li 4, 15
		; P10-BE-NEXT: stfiwx 0, 3, 4
; P10-BE-NEXT: stxv 0, 0(3)		; P10-BE-NEXT: stxv 0, 0(3)
; P10-BE-NEXT: blr		; P10-BE-NEXT: blr
;		;
; P8-LE-LABEL: memsetTailV1B3:		; P8-LE-LABEL: memsetTailV1B3:
; P8-LE: # %bb.0: # %entry		; P8-LE: # %bb.0: # %entry
; P8-LE-NEXT: vspltisb 2, 15		; P8-LE-NEXT: vspltisb 2, 15
; P8-LE-NEXT: lis 4, 3855		; P8-LE-NEXT: li 4, 15
; P8-LE-NEXT: ori 4, 4, 3855		; P8-LE-NEXT: xxswapd 0, 34
		tingwang (author): Will be eliminated by https://reviews.llvm.org/D139193.
; P8-LE-NEXT: stw 4, 15(3)		; P8-LE-NEXT: stxsiwx 34, 3, 4
; P8-LE-NEXT: stxvd2x 34, 0, 3		; P8-LE-NEXT: stxvd2x 0, 0, 3
; P8-LE-NEXT: blr		; P8-LE-NEXT: blr
;		;
; P9-LE-LABEL: memsetTailV1B3:		; P9-LE-LABEL: memsetTailV1B3:
; P9-LE: # %bb.0: # %entry		; P9-LE: # %bb.0: # %entry
; P9-LE-NEXT: lis 4, 3855
; P9-LE-NEXT: ori 4, 4, 3855
; P9-LE-NEXT: stw 4, 15(3)
; P9-LE-NEXT: xxspltib 0, 15		; P9-LE-NEXT: xxspltib 0, 15
		; P9-LE-NEXT: li 4, 15
		; P9-LE-NEXT: stfiwx 0, 3, 4
; P9-LE-NEXT: stxv 0, 0(3)		; P9-LE-NEXT: stxv 0, 0(3)
; P9-LE-NEXT: blr		; P9-LE-NEXT: blr
;		;
; P10-LE-LABEL: memsetTailV1B3:		; P10-LE-LABEL: memsetTailV1B3:
; P10-LE: # %bb.0: # %entry		; P10-LE: # %bb.0: # %entry
; P10-LE-NEXT: pli 4, 252645135
; P10-LE-NEXT: stw 4, 15(3)
; P10-LE-NEXT: xxspltib 0, 15		; P10-LE-NEXT: xxspltib 0, 15
		; P10-LE-NEXT: li 4, 15
		; P10-LE-NEXT: stfiwx 0, 3, 4
; P10-LE-NEXT: stxv 0, 0(3)		; P10-LE-NEXT: stxv 0, 0(3)
; P10-LE-NEXT: blr		; P10-LE-NEXT: blr
entry:		entry:
tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 19, i1 false)		tail call void @llvm.memset.p0.i64(ptr %p, i8 15, i64 19, i1 false)
ret void		ret void
}		}

define dso_local void @memsetTailV1B2(ptr nocapture noundef writeonly %p) local_unnamed_addr {		define dso_local void @memsetTailV1B2(ptr nocapture noundef writeonly %p) local_unnamed_addr {
[... 87 lines not shown ...]
; P9-LE-NEXT: li 4, -1		; P9-LE-NEXT: li 4, -1
; P9-LE-NEXT: xxleqv 0, 0, 0		; P9-LE-NEXT: xxleqv 0, 0, 0
; P9-LE-NEXT: stb 4, 16(3)		; P9-LE-NEXT: stb 4, 16(3)
; P9-LE-NEXT: stxv 0, 0(3)		; P9-LE-NEXT: stxv 0, 0(3)
; P9-LE-NEXT: blr		; P9-LE-NEXT: blr
;		;
; P10-LE-LABEL: memsetTailV1B1:		; P10-LE-LABEL: memsetTailV1B1:
; P10-LE: # %bb.0: # %entry		; P10-LE: # %bb.0: # %entry
; P10-LE-NEXT: li 4, -1		; P10-LE-NEXT: li 4, -1
		tingwang (author): Plan to address this pattern in a separate patch.
; P10-LE-NEXT: xxleqv 0, 0, 0		; P10-LE-NEXT: xxleqv 0, 0, 0
; P10-LE-NEXT: stb 4, 16(3)		; P10-LE-NEXT: stb 4, 16(3)
; P10-LE-NEXT: stxv 0, 0(3)		; P10-LE-NEXT: stxv 0, 0(3)
; P10-LE-NEXT: blr		; P10-LE-NEXT: blr
entry:		entry:
tail call void @llvm.memset.p0.i64(ptr %p, i8 -1, i64 17, i1 false)		tail call void @llvm.memset.p0.i64(ptr %p, i8 -1, i64 17, i1 false)
ret void		ret void
}		}
[... 429 lines not shown ...]

llvm/test/CodeGen/PowerPC/p10-fi-elim.ll

	[... 20 lines not shown ...]
	define dso_local signext i32 @test_FI_elim([40 x i8]* noalias nocapture dereferenceable(40) %arg, [0 x %96]* noalias nocapture nonnull readonly %arg2, [0 x %97]* noalias nocapture nonnull readonly %arg3, [0 x %98]* noalias nocapture nonnull readonly %arg4, %100* noalias nocapture dereferenceable(48) %arg6, %101* noalias nocapture dereferenceable(72) %arg7) local_unnamed_addr #2 {			define dso_local signext i32 @test_FI_elim([40 x i8]* noalias nocapture dereferenceable(40) %arg, [0 x %96]* noalias nocapture nonnull readonly %arg2, [0 x %97]* noalias nocapture nonnull readonly %arg3, [0 x %98]* noalias nocapture nonnull readonly %arg4, %100* noalias nocapture dereferenceable(48) %arg6, %101* noalias nocapture dereferenceable(72) %arg7) local_unnamed_addr #2 {
	; CHECK-LABEL: test_FI_elim:			; CHECK-LABEL: test_FI_elim:
	; CHECK: # %bb.0: # %bb			; CHECK: # %bb.0: # %bb
	; CHECK-NEXT: mflr r0			; CHECK-NEXT: mflr r0
	; CHECK-NEXT: std r0, 16(r1)			; CHECK-NEXT: std r0, 16(r1)
	; CHECK-NEXT: stdu r1, -80(r1)			; CHECK-NEXT: stdu r1, -80(r1)
	; CHECK-NEXT: .cfi_def_cfa_offset 80			; CHECK-NEXT: .cfi_def_cfa_offset 80
	; CHECK-NEXT: .cfi_offset lr, 16			; CHECK-NEXT: .cfi_offset lr, 16
	; CHECK-NEXT: lxv v2, 0(r3)
	; CHECK-NEXT: mr r9, r6			; CHECK-NEXT: mr r9, r6
	; CHECK-NEXT: mr r6, r5			; CHECK-NEXT: mr r6, r5
	; CHECK-NEXT: li r0, 4			; CHECK-NEXT: li r5, 3
	; CHECK-NEXT: li r11, 3			; CHECK-NEXT: li r10, -127
	; CHECK-NEXT: std r0, 0(r3)			; CHECK-NEXT: lxv v2, 0(r3)
	; CHECK-NEXT: stb r11, 0(0)			; CHECK-NEXT: stb r5, 0(0)
	; CHECK-NEXT: li r12, -127			; CHECK-NEXT: stb r10, 0(r3)
	; CHECK-NEXT: stb r12, 0(r3)			; CHECK-NEXT: stb r5, 0(r3)
				tingwang (author): The instruction sequence change in `PowerPC/p10-fi-elim.ll` is a result of `CodeGenPrepare::optimizeExtractElementInst()`, which can now generate the combined pattern since we enabled `canCombineStoreAndExtract()`. It seems we can avoid two `mfvsrd` instructions.
	; CHECK-NEXT: li r2, 1
	; CHECK-NEXT: stb r11, 0(r3)
	; CHECK-NEXT: stb r12, 0(r3)
	; CHECK-NEXT: stw r2, 0(r3)
	; CHECK-NEXT: mfvsrd r5, v2
	; CHECK-NEXT: vaddudm v3, v2, v2
	; CHECK-NEXT: pstxv v2, 64(r1), 0
	; CHECK-NEXT: neg r5, r5
	; CHECK-NEXT: mfvsrd r10, v3
	; CHECK-NEXT: std r5, 0(r3)
	; CHECK-NEXT: lbz r5, 2(r7)			; CHECK-NEXT: lbz r5, 2(r7)
	; CHECK-NEXT: mr r7, r9			; CHECK-NEXT: mr r7, r9
	; CHECK-NEXT: neg r10, r10			; CHECK-NEXT: li r12, 1
	; CHECK-NEXT: std r2, 0(r3)			; CHECK-NEXT: stb r10, 0(r3)
	; CHECK-NEXT: std r0, 0(r3)			; CHECK-NEXT: stw r12, 0(r3)
	; CHECK-NEXT: std r10, 0(r3)			; CHECK-NEXT: li r11, 4
				; CHECK-NEXT: std r11, 0(r3)
				; CHECK-NEXT: vaddudm v4, v2, v2
				; CHECK-NEXT: vnegd v3, v2
				; CHECK-NEXT: pstxv v2, 64(r1), 0
	; CHECK-NEXT: rlwinm r5, r5, 0, 27, 27			; CHECK-NEXT: rlwinm r5, r5, 0, 27, 27
				; CHECK-NEXT: vnegd v2, v4
				; CHECK-NEXT: stxsd v3, 0(r3)
				; CHECK-NEXT: std r12, 0(r3)
				; CHECK-NEXT: std r11, 0(r3)
	; CHECK-NEXT: stb r5, 0(0)			; CHECK-NEXT: stb r5, 0(0)
	; CHECK-NEXT: lbz r5, 2(r8)			; CHECK-NEXT: lbz r5, 2(r8)
				; CHECK-NEXT: stxsd v2, 0(r3)
	; CHECK-NEXT: rlwinm r5, r5, 0, 27, 27			; CHECK-NEXT: rlwinm r5, r5, 0, 27, 27
	; CHECK-NEXT: stb r5, 0(r3)			; CHECK-NEXT: stb r5, 0(r3)
	; CHECK-NEXT: li r5, 2			; CHECK-NEXT: li r5, 2
	; CHECK-NEXT: stw r5, 0(r3)			; CHECK-NEXT: stw r5, 0(r3)
	; CHECK-NEXT: mr r5, r4			; CHECK-NEXT: mr r5, r4
	; CHECK-NEXT: bl foo@notoc			; CHECK-NEXT: bl foo@notoc
	; CHECK-NEXT: extsw r3, r3			; CHECK-NEXT: extsw r3, r3
	; CHECK-NEXT: addi r1, r1, 80			; CHECK-NEXT: addi r1, r1, 80
	[... 95 lines not shown ...]