
[DAGCombine] Split vector load-update-store into single element stores
Changes Planned · Public

Authored by qiucf on Nov 13 2019, 11:16 PM.

Details

Reviewers
spatel
andrewrk
fhahn
efriedma
bogner
nemanjai
Group Reviewers
Restricted Project
Summary

Currently, Clang does not generate individual stores for updates to single vector elements. For the code below:

typedef float v4sf __attribute__ ((vector_size(16)));

void foo(v4sf *a) {
  (*a)[0] = 1;
  (*a)[3] = 2;
}

LLVM generates a shuffle instruction for this, even when only a single element is updated. GCC, however, generates individual stores (at least on PowerPC).

More generally, given a chain of shufflevector/insertelement instructions, we can walk through it, track the status of each element to find which ones are updated, and finally replace the original vector store with multiple single-element stores. This patch implements that.

This optimization happens in the DAGCombiner, since each target can easily decide whether to turn it on in its own version of the hook method. The steps of the optimization are:

  1. Start at a vector store and walk up through its value operand until we find a load.
  2. On the path from the store to the load, accept only insert/shuffle nodes as operands.
  3. Track value modifications from the load to the store. Bail out if we would need to extract elements from other vectors.
  4. Generate stores of the elements changed along the path, replacing the original vector store.
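The steps above can be sketched as a standalone model. This is illustrative C++ only, not the patch's actual DAGCombiner code; the names `trackUpdates`, `emitScalarStores`, and the `Insert` type are made up for this sketch, and shuffles are assumed to have been lowered to per-lane updates:

```cpp
#include <cassert>
#include <cstdio>
#include <optional>
#include <vector>

// One step in the chain between the load and the store: an insertelement
// overwriting lane `Lane` with value `Val`. (A shuffle permuting a single
// source can be modeled as a sequence of such lane updates.)
struct Insert { unsigned Lane; float Val; };

// State per lane: std::nullopt means "still holds the loaded value";
// otherwise the lane holds a known replacement value.
using LaneStates = std::vector<std::optional<float>>;

// Steps 1-3: walk from the load (all lanes untouched) toward the store,
// recording which lanes get overwritten along the way.
LaneStates trackUpdates(unsigned NumLanes, const std::vector<Insert> &Chain) {
  LaneStates States(NumLanes); // every lane starts as "from the load"
  for (const Insert &I : Chain) {
    assert(I.Lane < NumLanes && "insert index out of range");
    States[I.Lane] = I.Val; // lane now holds a new value
  }
  return States;
}

// Step 4: emit one scalar store per modified lane instead of the vector
// store; returns how many scalar stores replaced it.
unsigned emitScalarStores(const LaneStates &States, unsigned EltBytes) {
  unsigned NumStores = 0;
  for (unsigned i = 0; i < States.size(); ++i)
    if (States[i]) {
      std::printf("store %f at offset %u\n", double(*States[i]), i * EltBytes);
      ++NumStores;
    }
  return NumStores;
}
```

For the motivating example, `trackUpdates(4, {{0, 1.0f}, {3, 2.0f}})` leaves lanes 1 and 2 untouched, so only two scalar stores are emitted in place of the vector store.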

A target hook, isCheapToSplitStore, is introduced; currently only the PowerPC target turns the optimization on.

Discussion:
http://lists.llvm.org/pipermail/llvm-dev/2019-September/135432.html
http://lists.llvm.org/pipermail/llvm-dev/2019-October/135638.html


Event Timeline

qiucf created this revision. Nov 13 2019, 11:16 PM

This is missing test coverage.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6799–6801

It would be best to do as much of this checking as early as possible, before calling getVectorUpdates()

steven.zhang added inline comments. Thu, Nov 14, 3:05 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6800

It is also beneficial to do this folding before legalization, so that illegal stores can be combined into legal stores later.

qiucf updated this revision to Diff 229465. Fri, Nov 15, 1:36 AM
  • Add regression test.
  • Check legality before doing costly operations.

I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic. Also, currently i8 and i16 are illegal on PowerPC. So this won't affect vectors like <8 x i16> or <16 x i8>.

qiucf marked 2 inline comments as done. Fri, Nov 15, 1:36 AM
qiucf updated this revision to Diff 229750. Sun, Nov 17, 11:04 PM

Remove the test-case changes to swaps-le-5 and swaps-le-6, since they were moved to a separate differential, D70373.

lkail added a subscriber: lkail. Tue, Nov 19, 1:28 AM

> I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic.

I think all test-case changes should be reflected here, so that reviewers can check whether they are regressions.

This is a very good idea. Love it!

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6640

The alignment seems strange. Please run clang-format.

6653

What if one operand is undef? Putting an undef into Path has no meaning.

6668

If Path is empty here, there is an infinite loop since Current is not changed.

6740

If changed is already set, we don't need to set it again.

6754

I see that the second value of Buf is only ever given 0/1. Can we use a bool instead?

6758

Move this comment down to line 6761, and also add the other excluded cases to the comment.

6762

Can we handle indexed stores too? If not, I guess you also need to exclude that kind of store.

6784

I think UpdatedElementsIdx and all the code that updates it are redundant. We could collect the updated element indices in getVectorUpdates and check whether States[i] needs an update via States[i].second.

6800

It would be better to add an assert here that UpdatedElementsIdx.size() will never be larger than VecLen.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1606

Do we have performance testing showing that, for a <4 x i32> store, replacing it with 3 scalar i32 stores performs well?

llvm/test/CodeGen/PowerPC/vector-store-split.ll
4

I think the test coverage is not enough, especially negative tests: for example, shuffles that change the vector length (// We don't support shuffle which changes vector length.), BUILD_VECTOR-related tests, undef-operand tests, and so on.

Added Eli and Justin in case they are interested in chiming in on at least the target-independent parts of this.

It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.

The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.

I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.
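The missing safety check described above can be modeled roughly as follows. This is a toy sketch with invented types and names (`MemNode`, `safeToFold`, `BaseAddr`, `MayWriteMem`), not LLVM's real SDNode API; real code would use the SelectionDAG's address and alias analysis instead of an integer base address:

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for the load, the store, and whatever sits between them
// on the chain. BaseAddr abstracts "targets the same address" matching,
// and MayWriteMem abstracts "could modify the memory in between".
struct MemNode { int BaseAddr; bool MayWriteMem; };

bool safeToFold(const MemNode &Load, const MemNode &Store,
                const std::vector<MemNode> &Between) {
  // The load and store must target the same address...
  if (Load.BaseAddr != Store.BaseAddr)
    return false;
  // ...and no operation between them may clobber that memory.
  for (const MemNode &N : Between)
    if (N.MayWriteMem)
      return false;
  return true;
}
```

Without both conditions, rewriting the vector store as scalar stores could drop an intervening write or store through the wrong pointer.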

nemanjai requested changes to this revision. Fri, Nov 22, 6:06 AM

This changeset is a rather complicated, hard-to-follow piece of code that appears to solve the problem in the motivating test case, but it is rather unclear whether it is a good idea or a bad idea overall.
A number of things need to be clarified for this patch to proceed:

  1. Needs empirical performance data. This is straightforward - see if it affects the performance of any benchmarks.
  2. Needs more thorough testing. We need to cover more types, more ways we may end up with these "merge-and-store" idioms, different numbers of elements changed, etc.
  3. An overview of the algorithm should be provided to aid readability. The code as written does not exactly aid readability so it would be good to provide an outline.
  4. I still think we should do this in InstCombine rather than on the SDAG. It seems like that would be a more natural place to do this.
  5. If we are doing it in InstCombine, why not simply produce a masked store? If we converted the IR for the test case in the description to this:
define dso_local void @foo(<4 x float>* nocapture %a, float %b) local_unnamed_addr #0 {
entry:
  call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> <float 1.000000e+00, float undef, float undef, float 2.000000e+00>, <4 x float>* %a, i32 1, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
  ret void
}
declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

We will get the desired codegen:

lis 4, 16256
lis 5, 16384
stw 4, 0(3)
stw 5, 12(3)

And there is target-independent handling for masked stores, so it is not at all clear to me why we'd go through the trouble of implementing this complex handling in the SDAG.

@spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.
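The back-of-the-envelope comparison in the last paragraph can be encoded as a heuristic like the one below. This is an illustrative sketch that merely matches the example numbers above, not an actual LLVM cost model, and the function name is invented:

```cpp
#include <cassert>

// One way to encode the reasoning above: the original sequence costs a
// vector load plus a vector store (2 memory operations); expanding a
// masked store with NumInserts constant-index lanes costs NumInserts
// scalar stores. Treat the fold as likely profitable only when the
// number of memory operations does not grow, so that the removed
// load/insertelement instructions are a net win.
bool maskedStoreLikelyProfitable(unsigned NumInserts) {
  const unsigned OrigMemOps = 2;         // vector load + vector store
  const unsigned NewMemOps = NumInserts; // one scalar store per lane
  return NewMemOps <= OrigMemOps;
}
```

Under this sketch, two inserted lanes (two scalar stores) pass the check, while three inserted lanes (three scalar stores, more memory traffic than before) do not, matching the two cases discussed above.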

llvm/include/llvm/CodeGen/TargetLowering.h
3413

This should be more descriptive. Perhaps:

/// Determine whether it is profitable to split a single vector store
/// into \p NumSplit scalar stores.

Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return false when NumSplit > 2, and even NumSplit == 2 is quite likely not cheaper than a single vector store.

The problem is that there is no context to determine what we would be saving if we were to split this up. If we have some sequence of operations on a vector and then need to store that vector, either with a single vector store or split into NumSplit pieces, the answer is clearly: don't split it (one store is better than many).

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6616

Nit: name here does not match the function name.

6756

For readability, move these declarations past the early exits.

6762

+1

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1609

Really? So it is cheaper to split a v16i8 vector into 15 pieces than it is to store it with a single store? I really doubt that.

This revision now requires changes to proceed. Fri, Nov 22, 6:06 AM

> @spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
> For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.

I'm still skeptical that IR canonicalization is better/easier, but I could be convinced. You're still going to have to do the memory analysis that @efriedma mentioned to make sure this is even a valid transform. Is that easier in IR?
To proceed on the IR path (and again, I'm skeptical that instcombine vs. some other IR pass is the right option), we would need to create codegen tests for multiple targets with the alternative IR sequences. Then, we would need to potentially improve that output for multiple targets. After that is done, we could transform to the masked store intrinsic in IR.

Here's an example of a codegen test of IR alternatives - as I think was originally shown in the llvm-dev thread, we want to replace 2 out of 4 elements of a vector, but this is with values that are in scalar params/registers rather than constants:

define void @insert_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %t0 = load <4 x i32>, <4 x i32>* %q, align 16
  %vecins0 = insertelement <4 x i32> %t0, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  store <4 x i32> %vecins3, <4 x i32>* %q, align 16
  ret void
}

declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)

define void @masked_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %vecins0 = insertelement <4 x i32> undef, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecins3, <4 x i32>* %q, i32 16, <4 x i1> <i1 1, i1 0, i1 0, i1 1>)
  ret void
}

The 2nd sequence looks way better for PPC, so great:

addis 6, 2, .LCPI0_0@toc@ha
mtvsrd 0, 4
lvx 4, 0, 3
addi 4, 6, .LCPI0_0@toc@l
xxswapd	34, 0
lvx 3, 0, 4
mtvsrd 0, 5
addis 4, 2, .LCPI0_1@toc@ha
addi 4, 4, .LCPI0_1@toc@l
vperm 2, 2, 4, 3
xxswapd	35, 0
lvx 4, 0, 4
vperm 2, 3, 2, 4
stvx 2, 0, 3

vs.

stw 4, 0(3)
stw 5, 12(3)

But here's what happens on x86 with AVX2 (this target has custom/legal vector inserts and vector masked store lowering):

vmovdqa	(%rdi), %xmm0
vpinsrd	$0, %esi, %xmm0, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	%xmm0, (%rdi)

vs.

vmovd	%esi, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	LCPI1_0(%rip), %xmm1    ## xmm1 = [4294967295,0,0,4294967295]
vpmaskmovd	%xmm0, %xmm1, (%rdi)

I'm not actually sure which of those we consider better. But neither is ideal. We'd be better off pretending there was no masked move instruction and getting the expanded:

movl	%esi, (%rdi)
movl	%edx, 12(%rdi)

> It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.
>
> The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.
>
> I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.

This patch might be split into two: (1) merge shuffle-insert and shuffle-shuffle pairs if they have no other uses and the RHS of the shuffle is constant; (2) make MatchVectorStoreSplit consider only a simple load-shuffle/insert-store chain.

That may exclude cases where both LHS and RHS are results of shuffles from the same root, but such cases should be rare and complex.

My question is: why didn't we implement merging shuffles in InstCombine? I saw comments saying that's unsafe or can make things worse:

> we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.

Would it help if we checked their uses before folding them?

> My question is: why didn't we implement merging shuffles in InstCombine? I saw comments saying that's unsafe or can make things worse:
>
>> we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.
>
> Would it help if we checked their uses before folding them?

I don't understand how checking uses would change that. Do you have an example?

The code comment is still accurate in general because instcombine must be good for all targets and not all targets have a generic shuffle instruction - Altivec vperm is a real luxury. :)
So it's very difficult to reverse shuffle transforms later. As an example, see how much code x86 needs to map shuffles to a series of incomplete shuffle ISAs under combineShuffle():
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/X86/X86ISelLowering.cpp

qiucf updated this revision to Diff 231380. Thu, Nov 28, 2:13 AM
qiucf marked 12 inline comments as done.

Address some comments from the community:

  • Add the swap-le test back to this revision for better review.
  • Add checks for indexed stores/loads.
  • Add more tests for length-changing shuffles, undef, etc.
  • Fix some comments that were inconsistent with the code.
  • Eliminate a possible infinite-loop case.
efriedma requested changes to this revision. Tue, Dec 3, 4:41 PM
This revision now requires changes to proceed. Tue, Dec 3, 4:41 PM
qiucf planned changes to this revision. Wed, Dec 4, 8:18 AM

Thanks for the comments and explanations from everyone. I think there are two key issues to clarify and resolve in this revision:

  1. The implementation is too complicated yet focused on a specialized case. It tries to search a tree, but what we can actually do is not that much. So I'm going to simplify the logic, and cut some extremely rare cases if necessary.
  2. How many updated elements make this optimization worthwhile? I didn't get obviously better results in benchmarks. Although we can use target information in the DAGCombiner, conceptually I think this is more suitably placed in InstCombine. Since TargetTransformInfo may not always be suitable here, should we start by doing this only for the single-element case?