On PPC there are instructions to store element from vector(e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail constant in memset and constant splat array initialization.
This patch tries to explore these opportunities.
Paths
| Differential D138883
[SelectionDAG][PowerPC] Memset reuse vector element for tail store ClosedPublic Authored by tingwang on Nov 28 2022, 5:29 PM.
Details Summary On PPC there are instructions to store element from vector(e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail constant in memset and constant splat array initialization. This patch tries to explore these opportunities.
Diff Detail
Event Timelinetingwang added a child revision: D138881: [PowerPC][NFC] Add test case for memset tail store.Nov 28 2022, 5:29 PM Comment Actions I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).
Comment Actions Update according to comments: tingwang added inline comments.
Comment Actions Realized I need to use PPCTTIImpl::getVectorInstrCost() API to determine the cost of instructions. I'm working on it now. Comment Actions Changes in this update: (2) Refactored logic inside getMemsetStores(). For PPC CombineStoreAndExtract is beneficial on some specific element index according to endianness and instruction (StoreAndExtract elements on other indexes requires vector permutation, which makes the whole idea less attractive). Since this is platform independent logic, I'm querying the cost for indexes, and pick the least cost to do the combine.
Comment Actions I would expect not only memset, some consecutive stores could also reuse the result of vector split, see https://godbolt.org/z/77aMvncb4. void foo(long a[3]) { a[0] = 12; a[1] = 12; a[2] = 12; } foo(long*): # @foo(long*) .quad .Lfunc_begin0 .quad .TOC.@tocbase .quad 0 .Lfunc_begin0: xxlxor 0, 0, 0 li 4, 12 xxsplti32dx 0, 1, 12 std 4, 16(3) stxv 0, 0(3) blr .long 0 .quad 0 We don't reuse the result of xxsplti32dx. Comment Actions
Sure. The posted IR could be handled by DAGCombiner::mergeConsecutiveStores(), and I agree similar combine can be applied there. But for this case, memset stores are volatile, and DAGCombiner::getStoreMergeCandidates() does not accept volatile store currently.
tingwang added a child revision: D139491: [PowerPC] remove XXSWAPD after load from CP which is a splat value.Dec 6 2022, 5:07 PM Comment Actions
According to my test case, there are two kinds of swap in LE: (1) swap after vector splat immediate; (2) swap after load from constant-pool. Submitted two patches D139193 and D139491 to address them separately. Comment Actions Update patch as following test case pattern changed: Comment Actions While I am not principally against this approach, it doesn't really give me a great feeling going in this direction. The issue is more widespread than just memset/memcpy/memmove as Kai's example illustrates. Comment Actions I will try to get similar results by DAG combine. Thanks to Nemanja and Kai's insight! tingwang added a parent revision: D144235: [PowerPC][NFC] add const-splat-array-init.ll.Feb 16 2023, 10:52 PM tingwang retitled this revision from [SelectionDAG][PowerPC] Memset reuse vector element for tail store to [PowerPC] find and reuse ConstantSplatVector to combine constant store into extract and store.Feb 16 2023, 11:06 PM Comment Actions Redo the implementation, and now both memset and constant splat array initialization get changed. Comment Actions (1) Format code to follow coding style guidance. Comment Actions (1) Update element type ElemTy which now matches the type expected by both STFIWX and STXSIX PPCISD nodes. Comment Actions Sorry, I had unsubmitted comments. Not sure if they still apply.
Comment Actions
Hi Nemanja, Appreciate your help! I planned to change due to the reason that DAGCombiner::getStoreMergeCandidates() already walks through chain of stores, and I realized that it could be a better place to find candidate for this opportunity. By the way, maybe the criteria of splat of constant could be relieved to just match the subsection that is extracted by the target store, and I would like to have a try. I hope next version will be final for review. Thank you again for taking time looking into this! Comment Actions Since canCombineStoreAndExtract() target hook looks not good on PPC, reconsider use this approach to do the combine. Change in this version: Comment Actions Minor update: And ping... Comment Actions Rebase and added some comments. Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you! Comment Actions
I am really sorry about the delay... I don't think that xxspltib 0, 165 li 4, 16 stxsibx 0, 3, 4 stxv 0, 0(3) is any better than xxspltib 0, 165 li 4, 1 stxvx 0, 3, 4 stxv 0, 0(3) Is it possible to do something like that without walking the chain and with existing capabilities? Comment Actions Hi @nemanjai, appreciate your time looking into this patch. Thank you! I agree with you, and I think walking the chain is burning CPU cycles without achieving anything. I realized it is difficult for me to take both targets (the original memset case, and the one @lkail mentioned) in this patch, so I would like to drop the second target, in order to focus on the first one. I like the idea to use unaligned store, and quickly tested that to see if any potential issue. Created memset.c with multiple memset(p, 0xXY, 24); lines to stress the performance. According to my numbers from Power10, use extract-and-store (https://reviews.llvm.org/D138883?id=493461) got 17% faster than baseline, whereas unaligned store got about 30% slower than baseline. From performance perspective, I think I should pursuit the original approach. However since canCombineStoreAndExtract target hook has been proved not beneficial (https://reviews.llvm.org/D146602) on PPC, I probably need to create one for PPC only at this moment. Let me know if any comments. I will post patch shortly. memset.c92 KBDownload tingwang retitled this revision from [PowerPC] find and reuse ConstantSplatVector to combine constant store into extract and store to [SelectionDAG][PowerPC] Memset reuse vector element for tail store. Comment ActionsReturn to the initial proposal after exploring different approaches. Since canCombineStoreAndExtract() is not beneficial to PPC, created another filter for PPC.
Comment Actions This LGTM with some nits. Let's first target for the memset cases as this is the common case where splat values happens.
This revision is now accepted and ready to land.Sep 5 2023, 7:29 PM Comment Actions
Thank you! I will address the remaining comments in the commit. Closed by commit rG71be020dda2c: [SelectionDAG][PowerPC] Memset reuse vector element for tail store (authored by tingwang). · Explain WhySep 5 2023, 10:55 PM This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 511247 llvm/lib/Target/PowerPC/PPCISelLowering.h
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
llvm/lib/Target/PowerPC/PPCInstrVSX.td
llvm/test/CodeGen/PowerPC/const-splat-array-init.ll
llvm/test/CodeGen/PowerPC/memset-tail.ll
|
nit: We may need comments here why we don't try to extract constant for ElemSizeInBits 8/16. (I guess the reason is we don't have benefit as we need li to load the index and this li can also be used to load the 8/16 bit imm?