
[SelectionDAG][PowerPC] Memset reuse vector element for tail store
Changes Planned · Public

Authored by tingwang on Nov 28 2022, 5:29 PM.

Details

Reviewers
shchenz
nemanjai
rzurob
RKSimon
dmgreen
asavonic
Group Reviewers
Restricted Project
Summary

On PPC there are instructions that store an element from a vector (e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid materializing the tail constant in memset.

This patch tries to add a callback in SDAG to allow the reuse.
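
As an illustration of the scenario (the byte count and fill value here are hypothetical, not taken from the patch's test files): a memset whose length is not a multiple of the vector width leaves a scalar tail store after the splatted vector store.

#include <cstring>

void fill(char *p) {
  // The first 16 bytes become a store of a splatted vector; the remaining
  // 8 bytes are a "tail" store whose constant (0x0707...07) would otherwise
  // have to be materialized in a GPR. With a store-element-from-vector
  // instruction such as stxsdx, the tail could instead be stored directly
  // from the already-splatted vector register.
  std::memset(p, 7, 24);
}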

Diff Detail

Event Timeline

tingwang created this revision. Nov 28 2022, 5:29 PM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 28 2022, 5:29 PM
tingwang requested review of this revision. Nov 28 2022, 5:29 PM

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

llvm/include/llvm/CodeGen/TargetLowering.h
672 ↗(On Diff #478424)

Do we need this? Can canCombineStoreAndExtract() suffice for this purpose?

llvm/test/CodeGen/PowerPC/memset-tail.ll
195

Why do we now get the redundant swap for the vector store that we didn't get before? Was it eliminated by the swap elimination before and now it is not because we have a use of the partial vector?

nemanjai added inline comments. Nov 28 2022, 8:00 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7433

Can we reduce the nesting here by converting this to else if?

tingwang updated this revision to Diff 478535. Nov 29 2022, 5:14 AM

Update according to comments:
(1) Use the existing canCombineStoreAndExtract() instead of creating a new hook.
(2) Nest the else statement properly.
(3) Two test cases changed due to (1).

tingwang marked an inline comment as done. Nov 29 2022, 5:20 AM
tingwang added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
672 ↗(On Diff #478424)

Thank you! I gave it a try and updated the patch. Two test cases changed; I will look into the details tomorrow.

llvm/test/CodeGen/PowerPC/memset-tail.ll
195

The ppc-vsx-swaps debug output shows "Web 0 rejected for physreg, partial reg, or not swap[pable]". I will look into it and probably post another patch to fix the issue. Thank you!

tingwang planned changes to this revision. Nov 29 2022, 5:09 PM

I realized I need to use the PPCTTIImpl::getVectorInstrCost() API to determine the cost of instructions. I'm working on it now.

tingwang updated this revision to Diff 478808. Nov 29 2022, 11:56 PM
tingwang added reviewers: dmgreen, asavonic.

Changes in this update:
(1) I tried to use TTI.getVectorInstrCost() to query instruction cost in PPCTargetLowering::canCombineStoreAndExtract(). However, TTI is not reachable from there, and I did not find any precedent for doing so in SDAG. Since the original ARM implementation of canCombineStoreAndExtract() implements its own cost logic, I followed that approach and implemented the logic by referring to PPCTTIImpl::getVectorInstrCost().

(2) Refactored the logic inside getMemsetStores(). On PPC, combining the store and extract is only beneficial for specific element indices, depending on endianness and the instruction used (extracting and storing other indices requires a vector permutation, which makes the whole idea less attractive). Since getMemsetStores() is platform-independent logic, I query the cost for each index and pick the cheapest one for the combine; a sketch of that selection is below.
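
A rough sketch of that index selection; the function and the per-index cost callback below are hypothetical stand-ins for the target hook, not the actual patch:

#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical helper: given a target-specific per-index cost query, pick
// the element index with the lowest extract-and-store cost, or return false
// if no index is accepted at all.
static bool pickCheapestExtractIndex(
    unsigned NumElts,
    const std::function<std::optional<unsigned>(unsigned)> &IndexCost,
    unsigned &BestIdx) {
  bool Found = false;
  unsigned BestCost = UINT32_MAX;
  for (unsigned Idx = 0; Idx < NumElts; ++Idx) {
    // IndexCost returns the target's cost for storing element Idx straight
    // from the splat vector, or nothing if that index is not supported
    // (e.g. it would require an extra vector permutation).
    std::optional<unsigned> Cost = IndexCost(Idx);
    if (!Cost)
      continue;
    if (!Found || *Cost < BestCost) {
      Found = true;
      BestCost = *Cost;
      BestIdx = Idx;
    }
  }
  return Found;
}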

tingwang added inline comments. Nov 29 2022, 11:57 PM
llvm/test/CodeGen/PowerPC/p10-fi-elim.ll
43–44

The instruction-sequence change in PowerPC/p10-fi-elim.ll results from CodeGenPrepare::optimizeExtractElementInst() now being able to generate the combined pattern since we enabled canCombineStoreAndExtract(). It looks like we can avoid two mfvsrd instructions.

tingwang added inline comments. Nov 29 2022, 11:59 PM
llvm/test/CodeGen/ARM/memset-align.ll
21

Hello @dmgreen @asavonic. This patch tries to reuse a vector element for the tail store in memset by implementing canCombineStoreAndExtract() on PPC. The change introduced a test case change on ARM in llvm/test/CodeGen/ARM/memset-align.ll. Could you please help check whether the change looks good? Thank you!

I looked into the ARM scenario: if the i8 fill value of the memset is zero, it creates a vector for the initial 16 bytes and a constant tail for the remaining bytes, which is exactly this patch's scenario. For other fill values, it uses i32 stores for the memset and is not affected by this patch.

tingwang updated this revision to Diff 478827. Nov 30 2022, 12:10 AM

Add memset-tail.ll changes.

lkail added a subscriber: lkail. Nov 30 2022, 6:18 PM

I would expect that not only memset but also some consecutive stores could reuse the result of the vector splat; see https://godbolt.org/z/77aMvncb4.
For

void foo(long a[3]) {
    a[0] = 12;
    a[1] = 12;
    a[2] = 12;
}

foo(long*):                               # @foo(long*)
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        xxlxor 0, 0, 0
        li 4, 12
        xxsplti32dx 0, 1, 12
        std 4, 16(3)
        stxv 0, 0(3)
        blr
        .long   0
        .quad   0

We don't reuse the result of xxsplti32dx.

tingwang added a comment. Edited · Nov 30 2022, 8:17 PM

Sure. The posted IR could be handled by DAGCombiner::mergeConsecutiveStores(), and I agree a similar combine could be applied there. But in this case the memset stores are volatile, and DAGCombiner::getStoreMergeCandidates() does not currently accept volatile stores.

asavonic added inline comments. Dec 1 2022, 12:22 PM
llvm/test/CodeGen/ARM/memset-align.ll
21

This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.

tingwang added inline comments. Dec 1 2022, 3:45 PM
llvm/test/CodeGen/ARM/memset-align.ll
21

This looks fine to me. VST1 and scalar STR seem equivalent in this case, if I'm reading the docs right.

Thank you for the confirmation!

tingwang updated this revision to Diff 479543. Dec 2 2022, 12:49 AM

Test case update.

I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).

According to my test cases, there are two kinds of swap in LE: (1) a swap after a vector splat immediate; (2) a swap after a load from the constant pool. I submitted two patches, D139193 and D139491, to address them separately.

Gentle ping.

tingwang updated this revision to Diff 486119. Jan 3 2023, 5:14 PM

Update the patch since the pattern of the following test case changed:
llvm/test/CodeGen/ARM/memset-align.ll

Gentle ping.

Gentle ping.

tingwang updated this revision to Diff 493461. Mon, Jan 30, 5:38 PM

Rebase && Gentle ping.

While I am not against this approach in principle, going in this direction does not give me a great feeling. The issue is more widespread than just memset/memcpy/memmove, as Kai's example illustrates.
I wonder if a more complete solution would be to add a DAG combine that looks up the chain from the store to see if a store of a splat of the same value exists. That should certainly cover both examples.
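
For reference, a very rough, hypothetical sketch of such a combine (the function name is invented; volatile/indexed-store, type, alias and cost checks are all omitted):

// Hypothetical sketch of the suggested combine: for a scalar store, walk a
// few nodes up the chain looking for a vector store whose stored value is a
// splat of the same scalar, and re-store one element of that vector instead
// of re-materializing the scalar constant.
static SDValue tryReuseSplatForTailStore(StoreSDNode *St, SelectionDAG &DAG) {
  SDValue Scalar = St->getValue();
  SDValue Chain = St->getChain();
  for (unsigned Depth = 0; Depth < 4 && Chain.getOpcode() == ISD::STORE;
       ++Depth) {
    auto *Prev = cast<StoreSDNode>(Chain.getNode());
    SDValue Vec = Prev->getValue();
    if (Vec.getValueType().isVector() && DAG.getSplatValue(Vec) == Scalar) {
      SDLoc DL(St);
      EVT EltVT = Vec.getValueType().getVectorElementType();
      // Which element index is profitable is target- and endianness-
      // dependent; index 0 is only a placeholder here.
      SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Vec,
                                DAG.getVectorIdxConstant(0, DL));
      return DAG.getStore(St->getChain(), DL, Elt, St->getBasePtr(),
                          St->getPointerInfo(), St->getAlign());
    }
    Chain = Prev->getChain();
  }
  return SDValue();
}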

tingwang planned changes to this revision. Edited · Tue, Feb 7, 10:25 PM

I will try to get similar results with a DAG combine. Thanks to Nemanja and Kai for the insight!