On PPC there are instructions to store element from vector(e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail constant in memset and constant splat array initialization.
This patch tries to explore these opportunities.
Differential D138883
[SelectionDAG][PowerPC] Memset reuse vector element for tail store tingwang on Nov 28 2022, 5:29 PM. Authored by
Details On PPC there are instructions to store element from vector(e.g. stxsdx/stxsiwx), and these instructions can be leveraged to avoid tail constant in memset and constant splat array initialization. This patch tries to explore these opportunities.
Diff Detail
Unit Tests Event TimelineComment Actions I think this is useful, but we should ensure we can get rid of the swap that this introduces (in a separate patch).
Comment Actions Update according to comments:
Comment Actions Realized I need to use PPCTTIImpl::getVectorInstrCost() API to determine the cost of instructions. I'm working on it now. Comment Actions Changes in this update: (2) Refactored logic inside getMemsetStores(). For PPC CombineStoreAndExtract is beneficial on some specific element index according to endianness and instruction (StoreAndExtract elements on other indexes requires vector permutation, which makes the whole idea less attractive). Since this is platform independent logic, I'm querying the cost for indexes, and pick the least cost to do the combine.
Comment Actions I would expect not only memset, some consecutive stores could also reuse the result of vector split, see https://godbolt.org/z/77aMvncb4. void foo(long a[3]) { a[0] = 12; a[1] = 12; a[2] = 12; } foo(long*): # @foo(long*) .quad .Lfunc_begin0 .quad .TOC.@tocbase .quad 0 .Lfunc_begin0: xxlxor 0, 0, 0 li 4, 12 xxsplti32dx 0, 1, 12 std 4, 16(3) stxv 0, 0(3) blr .long 0 .quad 0 We don't reuse the result of xxsplti32dx. Comment Actions Sure. The posted IR could be handled by DAGCombiner::mergeConsecutiveStores(), and I agree similar combine can be applied there. But for this case, memset stores are volatile, and DAGCombiner::getStoreMergeCandidates() does not accept volatile store currently.
Comment Actions Update patch as following test case pattern changed: Comment Actions While I am not principally against this approach, it doesn't really give me a great feeling going in this direction. The issue is more widespread than just memset/memcpy/memmove as Kai's example illustrates. Comment Actions I will try to get similar results by DAG combine. Thanks to Nemanja and Kai's insight! Comment Actions Redo the implementation, and now both memset and constant splat array initialization get changed. Comment Actions (1) Format code to follow coding style guidance. Comment Actions (1) Update element type ElemTy which now matches the type expected by both STFIWX and STXSIX PPCISD nodes. Comment Actions Sorry, I had unsubmitted comments. Not sure if they still apply.
Comment Actions Hi Nemanja, Appreciate your help! I planned to change due to the reason that DAGCombiner::getStoreMergeCandidates() already walks through chain of stores, and I realized that it could be a better place to find candidate for this opportunity. By the way, maybe the criteria of splat of constant could be relieved to just match the subsection that is extracted by the target store, and I would like to have a try. I hope next version will be final for review. Thank you again for taking time looking into this! Comment Actions Since canCombineStoreAndExtract() target hook looks not good on PPC, reconsider use this approach to do the combine. Change in this version: Comment Actions Minor update: And ping... Comment Actions Rebase and added some comments. Hi @nemanjai, I accepted and addressed your previous comments. Do you have any more concerns on the approach that is implemented here? Thank you! Comment Actions I am really sorry about the delay... I don't think that xxspltib 0, 165 li 4, 16 stxsibx 0, 3, 4 stxv 0, 0(3) is any better than xxspltib 0, 165 li 4, 1 stxvx 0, 3, 4 stxv 0, 0(3) Is it possible to do something like that without walking the chain and with existing capabilities? Comment Actions Hi @nemanjai, appreciate your time looking into this patch. Thank you! I agree with you, and I think walking the chain is burning CPU cycles without achieving anything. I realized it is difficult for me to take both targets (the original memset case, and the one @lkail mentioned) in this patch, so I would like to drop the second target, in order to focus on the first one. I like the idea to use unaligned store, and quickly tested that to see if any potential issue. Created memset.c with multiple memset(p, 0xXY, 24); lines to stress the performance. According to my numbers from Power10, use extract-and-store (https://reviews.llvm.org/D138883?id=493461) got 17% faster than baseline, whereas unaligned store got about 30% slower than baseline. From performance perspective, I think I should pursuit the original approach. However since canCombineStoreAndExtract target hook has been proved not beneficial (https://reviews.llvm.org/D146602) on PPC, I probably need to create one for PPC only at this moment. Let me know if any comments. I will post patch shortly. Comment Actions Return to the initial proposal after exploring different approaches. Since canCombineStoreAndExtract() is not beneficial to PPC, created another filter for PPC.
Comment Actions This LGTM with some nits. Let's first target for the memset cases as this is the common case where splat values happens.
|
Do we need this? Can canCombineStoreAndExtract() suffice for this purpose?