
[DAGCombine] Split vector load-update-store into single element stores
Changes Planned · Public

Authored by qiucf on Nov 13 2019, 11:16 PM.

Details

Reviewers
spatel
andrewrk
fhahn
efriedma
bogner
nemanjai
Group Reviewers
Restricted Project
Summary

Currently, Clang does not generate individual stores for updates to single vector elements. For the code below:

typedef float v4sf __attribute__ ((vector_size(16)));

void foo(v4sf *a) {
  (*a)[0] = 1;
  (*a)[3] = 2;
}

LLVM generates a shuffle instruction for this, even when only a single element is updated. GCC, however, generates individual stores (at least on PowerPC).

More generally, given a chain of shufflevector/insertelement instructions, we can walk through it, track the status of each element to find which ones are updated, and finally replace the original vector store with multiple single-element stores. This patch implements that.

This optimization happens in the DAGCombiner, since each target can easily decide whether to turn it on in its own version of the hook method. The steps of the optimization are:

  1. Start at a vector store and walk up through its value operand until we find a load.
  2. On the path from the store to the load, accept only insert/shuffle nodes as operands.
  3. Track value modifications from the load to the store. Bail out if we would need to extract elements from other vectors.
  4. Generate stores of the elements changed along the path, replacing the original vector store.
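The steps above can be sketched as a standalone model. This is illustrative C++ only, not the patch's actual DAGCombiner code; the names `trackUpdates`, `emitScalarStores`, and the `Insert` type are made up for this sketch, and shuffles are assumed to have been lowered to per-lane updates:

```cpp
#include <cassert>
#include <cstdio>
#include <optional>
#include <vector>

// One step in the chain between the load and the store: an insertelement
// overwriting lane `Lane` with value `Val`. (A shuffle permuting a single
// source can be modeled as a sequence of such lane updates.)
struct Insert { unsigned Lane; float Val; };

// State per lane: std::nullopt means "still holds the loaded value";
// otherwise the lane holds a known replacement value.
using LaneStates = std::vector<std::optional<float>>;

// Steps 1-3: walk from the load (all lanes untouched) toward the store,
// recording which lanes get overwritten along the way.
LaneStates trackUpdates(unsigned NumLanes, const std::vector<Insert> &Chain) {
  LaneStates States(NumLanes); // every lane starts as "from the load"
  for (const Insert &I : Chain) {
    assert(I.Lane < NumLanes && "insert index out of range");
    States[I.Lane] = I.Val; // lane now holds a new value
  }
  return States;
}

// Step 4: emit one scalar store per modified lane instead of the vector
// store; returns how many scalar stores replaced it.
unsigned emitScalarStores(const LaneStates &States, unsigned EltBytes) {
  unsigned NumStores = 0;
  for (unsigned i = 0; i < States.size(); ++i)
    if (States[i]) {
      std::printf("store %f at offset %u\n", double(*States[i]), i * EltBytes);
      ++NumStores;
    }
  return NumStores;
}
```

For the motivating example, `trackUpdates(4, {{0, 1.0f}, {3, 2.0f}})` leaves lanes 1 and 2 untouched, so only two scalar stores are emitted in place of the vector store.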

A target hook, isCheapToSplitStore, is introduced; currently only the PowerPC target turns the optimization on.

Discussion:
http://lists.llvm.org/pipermail/llvm-dev/2019-September/135432.html
http://lists.llvm.org/pipermail/llvm-dev/2019-October/135638.html


Event Timeline

qiucf created this revision. Nov 13 2019, 11:16 PM

This is missing test coverage.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6799–6801

It would be best to do as much of this checking as early as possible, before calling getVectorUpdates()

steven.zhang added inline comments. Thu, Nov 14, 3:05 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6800

It is also beneficial to do this folding before legalization, so that illegal stores can be combined into legal stores later.

qiucf updated this revision to Diff 229465. Fri, Nov 15, 1:36 AM
  • Add regression test.
  • Check legality before doing costly operations.

I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic. Also, currently i8 and i16 are illegal on PowerPC. So this won't affect vectors like <8 x i16> or <16 x i8>.

qiucf marked 2 inline comments as done. Fri, Nov 15, 1:36 AM
qiucf updated this revision to Diff 229750. Sun, Nov 17, 11:04 PM

Remove the test-case changes to swaps-le-5 and swaps-le-6, since they were moved to a separate differential, D70373.

lkail added a subscriber: lkail. Tue, Nov 19, 1:28 AM

> I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic.

I think all test-case changes should be reflected here, so that reviewers can check whether they are regressions.

This is a very good idea. Love it!

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6640

The alignment seems strange. Please run clang-format.

6653

What if one operand is undef? Putting an undef into Path has no meaning.

6668

If Path is empty here, there is an infinite loop since Current is not changed.

6740

If changed is already set, we don't need to set it again.

6754

I see that the second value of Buf is only ever given 0/1. Can we use a bool instead?

6758

Move this comment down to line 6761, and also add the other excluded cases to the comment.

6762

Can we handle indexed stores too? If not, I guess you also need to exclude that kind of store.

6784

I think UpdatedElementsIdx and all the code that updates it are redundant. We could collect the updated element indices in getVectorUpdates and check whether States[i] needs an update via States[i].second.

6800

It would be better to add an assert here that UpdatedElementsIdx.size() will never be larger than VecLen.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1606

Do we have performance testing showing that, for a <4 x i32> store, replacing it with 3 scalar i32 stores performs well?

llvm/test/CodeGen/PowerPC/vector-store-split.ll
4

I think the test coverage is not enough, especially negative tests: for example, shuffles that change the vector length (// We don't support shuffle which changes vector length.), BUILD_VECTOR-related tests, undef-operand tests, and so on.

Added Eli and Justin in case they are interested in chiming in on at least the target-independent parts of this.

It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.

The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.

I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.
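The missing safety check described above can be modeled roughly as follows. This is a toy sketch with invented types and names (`MemNode`, `safeToFold`, `BaseAddr`, `MayWriteMem`), not LLVM's real SDNode API; real code would use the SelectionDAG's address and alias analysis instead of an integer base address:

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for the load, the store, and whatever sits between them
// on the chain. BaseAddr abstracts "targets the same address" matching,
// and MayWriteMem abstracts "could modify the memory in between".
struct MemNode { int BaseAddr; bool MayWriteMem; };

bool safeToFold(const MemNode &Load, const MemNode &Store,
                const std::vector<MemNode> &Between) {
  // The load and store must target the same address...
  if (Load.BaseAddr != Store.BaseAddr)
    return false;
  // ...and no operation between them may clobber that memory.
  for (const MemNode &N : Between)
    if (N.MayWriteMem)
      return false;
  return true;
}
```

Without both conditions, rewriting the vector store as scalar stores could drop an intervening write or store through the wrong pointer.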

nemanjai requested changes to this revision. Fri, Nov 22, 6:06 AM

This changeset is a rather complicated, hard-to-follow piece of code that appears to solve the problem in the motivating test case, but it is rather unclear whether it is a good idea or a bad idea overall.
A number of things need to be clarified for this patch to proceed:

  1. Needs empirical performance data. This is straightforward - see if it affects the performance of any benchmarks.
  2. Needs more thorough testing. We need to cover more types, more ways we may end up with these "merge-and-store" idioms, different numbers of elements changed, etc.
  3. An overview of the algorithm should be provided to aid readability. The code as written does not exactly aid readability so it would be good to provide an outline.
  4. I still think we should do this in InstCombine rather than on the SDAG. It seems like that would be a more natural place to do this.
  5. If we are doing it in InstCombine, why not simply produce a masked store? If we converted the IR for the test case in the description to this:
define dso_local void @foo(<4 x float>* nocapture %a, float %b) local_unnamed_addr #0 {
entry:
  call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> <float 1.000000e+00, float undef, float undef, float 2.000000e+00>, <4 x float>* %a, i32 1, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
  ret void
}
declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

We will get the desired codegen:

lis 4, 16256
lis 5, 16384
stw 4, 0(3)
stw 5, 12(3)

And there is target-independent handling for masked stores, so it is not at all clear to me why we'd go through the trouble of implementing this complex handling in the SDAG.

@spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.
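The back-of-the-envelope comparison in the last paragraph can be encoded as a heuristic like the one below. This is an illustrative sketch that merely matches the example numbers above, not an actual LLVM cost model, and the function name is invented:

```cpp
#include <cassert>

// One way to encode the reasoning above: the original sequence costs a
// vector load plus a vector store (2 memory operations); expanding a
// masked store with NumInserts constant-index lanes costs NumInserts
// scalar stores. Treat the fold as likely profitable only when the
// number of memory operations does not grow, so that the removed
// load/insertelement instructions are a net win.
bool maskedStoreLikelyProfitable(unsigned NumInserts) {
  const unsigned OrigMemOps = 2;         // vector load + vector store
  const unsigned NewMemOps = NumInserts; // one scalar store per lane
  return NewMemOps <= OrigMemOps;
}
```

Under this sketch, two inserted lanes (two scalar stores) pass the check, while three inserted lanes (three scalar stores, more memory traffic than before) do not, matching the two cases discussed above.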

llvm/include/llvm/CodeGen/TargetLowering.h
3413

This should be more descriptive. Perhaps:

/// Determine whether it is profitable to split a single vector store
/// into \p NumSplit scalar stores.

Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return false when NumSplit > 2, and even NumSplit == 2 is quite likely not cheaper than a single vector store.

The problem is that there is no context to determine what we would be saving if we were to split this up. If we have some sequence of operations on a vector and then need to store that vector, either with a single vector store or split into NumSplit pieces, the answer is clearly: don't split it (one store is better than many).

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6616

Nit: name here does not match the function name.

6756

For readability, move these declarations past the early exits.

6762

+1

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1609

Really? So it is cheaper to split a v16i8 vector into 15 pieces than it is to store it with a single store? I really doubt that.

This revision now requires changes to proceed. Fri, Nov 22, 6:06 AM

> @spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
> For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.

I'm still skeptical that IR canonicalization is better/easier, but I could be convinced. You're still going to have to do the memory analysis that @efriedma mentioned to make sure this is even a valid transform. Is that easier in IR?
To proceed on the IR path (and again, I'm skeptical that instcombine vs. some other IR pass is the right option), we would need to create codegen tests for multiple targets with the alternative IR sequences. Then, we would need to potentially improve that output for multiple targets. After that is done, we could transform to the masked store intrinsic in IR.

Here's an example of a codegen test of IR alternatives - as I think was originally shown in the llvm-dev thread, we want to replace 2 out of 4 elements of a vector, but this is with values that are in scalar params/registers rather than constants:

define void @insert_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %t0 = load <4 x i32>, <4 x i32>* %q, align 16
  %vecins0 = insertelement <4 x i32> %t0, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  store <4 x i32> %vecins3, <4 x i32>* %q, align 16
  ret void
}

declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)

define void @masked_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %vecins0 = insertelement <4 x i32> undef, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecins3, <4 x i32>* %q, i32 16, <4 x i1> <i1 1, i1 0, i1 0, i1 1>)
  ret void
}

The 2nd sequence looks way better for PPC, so great:

addis 6, 2, .LCPI0_0@toc@ha
mtvsrd 0, 4
lvx 4, 0, 3
addi 4, 6, .LCPI0_0@toc@l
xxswapd	34, 0
lvx 3, 0, 4
mtvsrd 0, 5
addis 4, 2, .LCPI0_1@toc@ha
addi 4, 4, .LCPI0_1@toc@l
vperm 2, 2, 4, 3
xxswapd	35, 0
lvx 4, 0, 4
vperm 2, 3, 2, 4
stvx 2, 0, 3

vs.

stw 4, 0(3)
stw 5, 12(3)

But here's what happens on x86 with AVX2 (this target has custom/legal vector inserts and vector masked store lowering):

vmovdqa	(%rdi), %xmm0
vpinsrd	$0, %esi, %xmm0, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	%xmm0, (%rdi)

vs.

vmovd	%esi, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	LCPI1_0(%rip), %xmm1    ## xmm1 = [4294967295,0,0,4294967295]
vpmaskmovd	%xmm0, %xmm1, (%rdi)

I'm not actually sure which of those we consider better. But neither is ideal. We'd be better off pretending there was no masked move instruction and getting the expanded:

movl	%esi, (%rdi)
movl	%edx, 12(%rdi)

> It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.
>
> The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.
>
> I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.

This patch might be split into two: (1) merge shuffle-insert and shuffle-shuffle pairs if they have no other uses and the RHS of the shuffle is constant; (2) make MatchVectorStoreSplit consider only a simple load-shuffle/insert-store chain.

That may exclude cases where both LHS and RHS are results of shuffles from the same root, but such cases should be rare and complex.

My question is: why didn't we implement merging shuffles in InstCombine? I saw comments saying that's unsafe or can make things worse:

> we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.

Would it help if we checked their uses before folding them?

> My question is: why didn't we implement merging shuffles in InstCombine? I saw comments saying that's unsafe or can make things worse:
>
>> we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.
>
> Would it help if we checked their uses before folding them?

I don't understand how checking uses would change that. Do you have an example?

The code comment is still accurate in general because instcombine must be good for all targets and not all targets have a generic shuffle instruction - Altivec vperm is a real luxury. :)
So it's very difficult to reverse shuffle transforms later. As an example, see how much code x86 needs to map shuffles to a series of incomplete shuffle ISAs under combineShuffle():
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/X86/X86ISelLowering.cpp

qiucf updated this revision to Diff 231380. Thu, Nov 28, 2:13 AM
qiucf marked 12 inline comments as done.

Address some comments from the community:

  • Add the swap-le test back to this revision for better review.
  • Add checks for indexed stores/loads.
  • Add more tests for length-changing shuffles, undef, etc.
  • Fix some comments that were inconsistent with the code.
  • Eliminate a possible infinite-loop case.
efriedma requested changes to this revision. Tue, Dec 3, 4:41 PM
This revision now requires changes to proceed. Tue, Dec 3, 4:41 PM
qiucf planned changes to this revision. Wed, Dec 4, 8:18 AM

Thanks for the comments and explanations from everyone. I think there are two key issues to clarify and resolve in this revision:

  1. The implementation is too complicated yet focused on a specialized case. It tries to search a tree, but what we can actually do is not that much. So I'm going to simplify the logic, and cut some extremely rare cases if necessary.
  2. How many updated elements make this optimization worthwhile? I didn't get obviously better results in benchmarks. Although we can use target information in the DAGCombiner, conceptually I think this is more suitably placed in InstCombine. Since TargetTransformInfo may not always be suitable here, should we start by doing this only for the single-element case?