This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG][PowerPC] Memset reuse vector element for tail store
ClosedPublic

Authored by tingwang on Nov 28 2022, 5:29 PM.

Details

Summary

On PPC there are instructions to store element from vector(e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail constant in memset and constant splat array initialization.

This patch tries to explore these opportunities.

Diff Detail

Event Timeline

tingwang created this revision.Nov 28 2022, 5:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 5:29 PM
tingwang requested review of this revision.Nov 28 2022, 5:29 PM

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

llvm/include/llvm/CodeGen/TargetLowering.h
677

Do we need this? Can canCombineStoreAndExtract() suffice for this purpose?

llvm/test/CodeGen/PowerPC/memset-tail.ll
197

Why do we now get the redundant swap for the vector store that we didn't get before? Was it eliminated by the swap elimination before and now it is not because we have a use of the partial vector?

nemanjai added inline comments.Nov 28 2022, 8:00 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7782

Can we reduce the nesting here by converting this to else if?

tingwang updated this revision to Diff 478535.Nov 29 2022, 5:14 AM

Update according to comments:
(1) Use existing canCombineStoreAndExtract() instead of creating new.
(2) Nest the else statement properly.
(3) Saw two cases changed due to (1).

tingwang marked an inline comment as done.Nov 29 2022, 5:20 AM
tingwang added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
677

Thank you! I did a try and updated patch. Saw two cases changed. I will look into the detail tomorrow.

llvm/test/CodeGen/PowerPC/memset-tail.ll
197

Debug-only ppc-vsx-swaps shows "Web 0 rejected for physreg, partial reg, or not swap[pable]". I will look into it and probably post another patch to fix the issue. Thank you!

tingwang planned changes to this revision.Nov 29 2022, 5:09 PM

Realized I need to use PPCTTIImpl::getVectorInstrCost() API to determine the cost of instructions. I'm working on it now.

tingwang updated this revision to Diff 478808.Nov 29 2022, 11:56 PM
tingwang added reviewers: dmgreen, asavonic.

Changes in this update:
(1) I was trying to use TTI.getVectorInstrCost() to query instruction cost in PPCTargetLowering::canCombineStoreAndExtract(). However not able to reach TTI, and didn't find any reference to do that in SDAG. Given that the original implementation of canCombineStoreAndExtract() on ARM implemented its own logic to calculate Cost, followed the approach and implemented logic by referring to PPCTTIImpl::getVectorInstrCost().

(2) Refactored logic inside getMemsetStores(). For PPC CombineStoreAndExtract is beneficial on some specific element index according to endianness and instruction (StoreAndExtract elements on other indexes requires vector permutation, which makes the whole idea less attractive). Since this is platform independent logic, I'm querying the cost for indexes, and pick the least cost to do the combine.

tingwang added inline comments.Nov 29 2022, 11:57 PM
llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
31 ↗(On Diff #478808)

Instruction sequence change in PowerPC/p10-fi-elim.ll is result of CodeGenPrepare::optimizeExtractElementInst() now can generate combined pattern since we enabled canCombineStoreAndExtract(). Seems we can avoid two mfvsrd instructions.

tingwang added inline comments.Nov 29 2022, 11:59 PM
llvm/test/CodeGen/ARM/memset-align.ll
21 ↗(On Diff #478808)

Hello @dmgreen @asavonic. This patch tries to reuse vector element for the tail store in memset by implementing canCombineStoreAndExtract() on PPC. This changed introduced test case change on ARM in llvm/test/CodeGen/ARM/memset-align.ll. Could you please help me check if the change looks good or not? Thank you!

Looked into the scenario on ARM, if the i8 fill value of memset is zero, it creates vector for the initial 16B, and constant tail for the remaining bytes, which exactly hit this patch's scenario. For other values, it creates i32 for memset and will not be impacted by this patch.

tingwang updated this revision to Diff 478827.Nov 30 2022, 12:10 AM

Add memset-tail.ll changes.

lkail added a subscriber: lkail.Nov 30 2022, 6:18 PM

I would expect not only memset, some consecutive stores could also reuse the result of vector split, see https://godbolt.org/z/77aMvncb4.
For

void foo(long a[3]) {
    a[0] = 12;
    a[1] = 12;
    a[2] = 12;
}

foo(long*):                               # @foo(long*)
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        xxlxor 0, 0, 0
        li 4, 12
        xxsplti32dx 0, 1, 12
        std 4, 16(3)
        stxv 0, 0(3)
        blr
        .long   0
        .quad   0

We don't reuse the result of xxsplti32dx.

tingwang added a comment.EditedNov 30 2022, 8:17 PM

I would expect not only memset, some consecutive stores could also reuse the result of vector split, see https://godbolt.org/z/77aMvncb4.
For

void foo(long a[3]) {
    a[0] = 12;
    a[1] = 12;
    a[2] = 12;
}

foo(long*):                               # @foo(long*)
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        xxlxor 0, 0, 0
        li 4, 12
        xxsplti32dx 0, 1, 12
        std 4, 16(3)
        stxv 0, 0(3)
        blr
        .long   0
        .quad   0

We don't reuse the result of xxsplti32dx.

Sure. The posted IR could be handled by DAGCombiner::mergeConsecutiveStores(), and I agree similar combine can be applied there. But for this case, memset stores are volatile, and DAGCombiner::getStoreMergeCandidates() does not accept volatile store currently.

asavonic added inline comments.Dec 1 2022, 12:22 PM
llvm/test/CodeGen/ARM/memset-align.ll
21 ↗(On Diff #478808)

This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.

tingwang added inline comments.Dec 1 2022, 3:45 PM
llvm/test/CodeGen/ARM/memset-align.ll
21 ↗(On Diff #478808)

This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.

Thank you for the confirm!

tingwang updated this revision to Diff 479543.Dec 2 2022, 12:49 AM

Test case update.

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

According to my test case, there are two kinds of swap in LE: (1) swap after vector splat immediate; (2) swap after load from constant-pool. Submitted two patches D139193 and D139491 to address them separately.

Gentle ping.

tingwang updated this revision to Diff 486119.Jan 3 2023, 5:14 PM

Update patch as following test case pattern changed:
llvm/test/CodeGen/ARM/memset-align.ll

Gentle ping.

Gentle ping.

tingwang updated this revision to Diff 493461.Jan 30 2023, 5:38 PM

Rebase && Gentle ping.

While I am not principally against this approach, it doesn't really give me a great feeling going in this direction. The issue is more widespread than just memset/memcpy/memmove as Kai's example illustrates.
I wonder if it would be a more complete solution to add a DAG combine that looks up the chain from the store to see if a store of a splat of the same value exists. That should certainly cover both examples.

tingwang planned changes to this revision.EditedFeb 7 2023, 10:25 PM

I will try to get similar results by DAG combine. Thanks to Nemanja and Kai's insight!

tingwang retitled this revision from [SelectionDAG][PowerPC] Memset reuse vector element for tail store to [PowerPC] find and reuse ConstantSplatVector to combine constant store into extract and store.Feb 16 2023, 11:06 PM
tingwang edited the summary of this revision. (Show Details)
tingwang added reviewers: lkail, ecnelises.
tingwang removed a project: Restricted Project.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2023, 11:06 PM
tingwang updated this revision to Diff 498255.Feb 16 2023, 11:12 PM

Redo the implementation, and now both memset and constant splat array initialization get changed.

tingwang planned changes to this revision.Feb 28 2023, 4:08 AM

Plan to continue improve the patch...

tingwang updated this revision to Diff 501485.EditedMar 1 2023, 6:10 AM

(1) Format code to follow coding style guidance.
(2) Fix SplatValue check.
(3) Remaining redundant instructions like mtfprd will be fixed in separate patches.

tingwang updated this revision to Diff 501775.Mar 2 2023, 12:39 AM

(1) Update element type ElemTy which now matches the type expected by both STFIWX and STXSIX PPCISD nodes.
(2) Add missing match pattern for PPCstxsix.

tingwang updated this revision to Diff 502028.Mar 2 2023, 5:14 PM

Update StoreSizeInBits check to skip on PowerOf2 bit size less than 8.

tingwang planned changes to this revision.Mar 8 2023, 9:05 PM

Attempt to push the logic into DAGCombiner::mergeConsecutiveStores()...

Sorry, I had unsubmitted comments. Not sure if they still apply.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15111

Nit: you don't need the name of the function here.

15125

What if dyn_cast returns null (i.e. if operand 1 is not a constant)?

15134–15136

We don't need to construct an APInt just to check whether it is a power of 2. You can just use isPowerOf2_64() from MathExtras.h.

Sorry, I had unsubmitted comments. Not sure if they still apply.

Hi Nemanja,

Appreciate your help! I planned to change due to the reason that DAGCombiner::getStoreMergeCandidates() already walks through chain of stores, and I realized that it could be a better place to find candidate for this opportunity. By the way, maybe the criteria of splat of constant could be relieved to just match the subsection that is extracted by the target store, and I would like to have a try.

I hope next version will be final for review. Thank you again for taking time looking into this!

tingwang updated this revision to Diff 511247.Apr 5 2023, 6:31 PM

Since canCombineStoreAndExtract() target hook looks not good on PPC, reconsider use this approach to do the combine.

Change in this version:
(1) Address comments from previous review.
(2) Add check to make sure do not combine on truncated stores or those stores that return value.

Gentle ping. Since the alternative path (D146602/D146610) looks not good, shall we take this approach forward? Any comments are welcome. Thank you!

tingwang updated this revision to Diff 529204.Jun 7 2023, 1:00 AM

Minor update:
(1) Add bitwidth is multiple of check for isSplat() call.
(2) Reduce MaxSearchNodes from 4 to 3, this is the minimum setting to allow target patterns in test cases.

And ping...

tingwang updated this revision to Diff 537959.Jul 6 2023, 8:20 PM

Rebase and added some comments.

Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you!

Rebase and added some comments.

Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you!

I am really sorry about the delay...
While I am not completely opposed to this, it seems like a fair bit of machinery to add for something that we could solve more simply with unaligned stores (i.e. the same way we would codegen a memset where the tail is *not* a power of 2).

I don't think that

xxspltib 0, 165
li 4, 16
stxsibx 0, 3, 4
stxv 0, 0(3)

is any better than

xxspltib 0, 165
li 4, 1
stxvx 0, 3, 4
stxv 0, 0(3)

Is it possible to do something like that without walking the chain and with existing capabilities?

Hi @nemanjai, appreciate your time looking into this patch. Thank you!

I agree with you, and I think walking the chain is burning CPU cycles without achieving anything. I realized it is difficult for me to take both targets (the original memset case, and the one @lkail mentioned) in this patch, so I would like to drop the second target, in order to focus on the first one.

I like the idea to use unaligned store, and quickly tested that to see if any potential issue. Created memset.c with multiple memset(p, 0xXY, 24); lines to stress the performance. According to my numbers from Power10, use extract-and-store (https://reviews.llvm.org/D138883?id=493461) got 17% faster than baseline, whereas unaligned store got about 30% slower than baseline.

From performance perspective, I think I should pursuit the original approach. However since canCombineStoreAndExtract target hook has been proved not beneficial (https://reviews.llvm.org/D146602) on PPC, I probably need to create one for PPC only at this moment.

Let me know if any comments. I will post patch shortly.

tingwang updated this revision to Diff 538961.Jul 11 2023, 1:44 AM
tingwang retitled this revision from [PowerPC] find and reuse ConstantSplatVector to combine constant store into extract and store to [SelectionDAG][PowerPC] Memset reuse vector element for tail store.

Return to the initial proposal after exploring different approaches. Since canCombineStoreAndExtract() is not beneficial to PPC, created another filter for PPC.

tingwang added inline comments.Jul 11 2023, 1:50 AM
llvm/test/CodeGen/PowerPC/memset-tail.ll
197

Will be eliminated by https://reviews.llvm.org/D139193.

236

Plan to address this pattern in separate patch.

246

Will be eliminated by https://reviews.llvm.org/D139193.

258

Plan to address this pattern in separate patch.

297

Will be eliminated by https://reviews.llvm.org/D139193.

350

Will be eliminated by https://reviews.llvm.org/D139193.

709

Plan to address this pattern in separate patch.

732

Plan to address this pattern in separate patch.

tingwang updated this revision to Diff 539375.Jul 11 2023, 10:04 PM

(1) Rebase after commit D139193.
(2) Add two P10 patterns to match extract-and-store.

Now test case is clean. Gentle ping.

tingwang updated this revision to Diff 546725.Aug 3 2023, 12:17 AM

Rebase && Ping.

Gentle ping...

Gentle ping...

shchenz added inline comments.Sep 4 2023, 7:45 PM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
17116

If using stfd is allowed for tail size 5/6/7, then can we use stfd for tail size 3/4 too? (I assume the change here impacts cases memsetTailV1B3 and memsetTailV1B4?)

llvm/test/CodeGen/PowerPC/memset-tail.ll
244

This seems a legacy issue because I also found same issue in case memsetTailV1B12 and also from the left side of this case.

Is it safe to extend the store length from 23 bytes to 32(or 24) bytes here? There is no clue saying that memory after (char *)p + 7 is writable by the user?

The related logic is in allowsMisalignedMemoryAccesses(). But is it correct that we can safely assume this memset can write more memory even this memset handles aligned memory?

What do you think? @nemanjai

shchenz added inline comments.Sep 4 2023, 8:12 PM
llvm/test/CodeGen/PowerPC/memset-tail.ll
244

Sorry, please ignore this comment. I didn't realize that the two stores stxsdx and stxvd2x have overlaps. So the real write size is not extended.

tingwang added inline comments.Sep 5 2023, 12:30 AM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
17116

It seems TargetLowering::findOptimalMemOpLowering() decides the type of each store. I guess if we change the type for the size 3/4 case from i32 to i64, then it will result in stfd.

shchenz accepted this revision as: shchenz.Sep 5 2023, 7:29 PM

This LGTM with some nits.

Let's first target for the memset cases as this is the common case where splat values happens.

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7760–7761

Nit: this comment needs update?

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1644

nit: We may need comments here why we don't try to extract constant for ElemSizeInBits 8/16. (I guess the reason is we don't have benefit as we need li to load the index and this li can also be used to load the 8/16 bit imm?

17116

Thanks. Better to add some comment here why we need to set the type to MVT::v8i16

This revision is now accepted and ready to land.Sep 5 2023, 7:29 PM

This LGTM with some nits.

Let's first target for the memset cases as this is the common case where splat values happens.

Thank you! I will address the remaining comments in the commit.