This is an archive of the discontinued LLVM Phabricator instance.

Differential D70223

[DAGCombine] Split vector load-update-store into single element stores
Abandoned Public

Authored by qiucf on Nov 13 2019, 11:16 PM.

Details

Reviewers

spatel
andrewrk
fhahn
efriedma
bogner
nemanjai

Group Reviewers

Restricted Project

Summary

Currently, Clang does not generate individual stores when only some elements of a vector are updated. For the code below:

typedef float v4sf __attribute__ ((vector_size(16)));

void foo(v4sf *a) {
  (*a)[0] = 1;
  (*a)[3] = 2;
}

LLVM generates a shuffle instruction for it, even if only one element is updated, whereas GCC generates individual stores (at least on PowerPC).

Also, if we have a chain of shufflevector/insertelement instructions, we can walk through it, track the status of each element to find which ones were updated, and finally replace the original vector store with multiple element stores. This patch does that.

This optimization happens in DAGCombiner, since each target can easily set its own rules for turning it on in its own version of a hook method. The steps of the optimization are:

1. Start at a vector store and walk up through its value operand until we find a load.
2. On the path from the store to the load, only insert/shuffle nodes are accepted as operands.
3. Track value modifications from the load to the store. Quit if we need to extract from other vectors.
4. Generate stores of the elements changed along the path, to replace the original vector store.
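
The steps above can be illustrated outside of SelectionDAG with a small model. The following is only a sketch under stated assumptions, not the code of this patch: `Lane`, `Op`, `makeInsert`, `makeShuffle` and `findElementStores` are hypothetical names, element values are plain `int`s, and a shuffle's second operand is assumed to be a constant vector. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for DAG nodes; none of these types
// exist in LLVM.
struct Lane {
  enum Kind { FromLoad, Scalar, Undef } K;
  int Value; // source lane for FromLoad, payload for Scalar, unused for Undef
};

struct Op {
  enum Kind { InsertElt, Shuffle } K;
  std::size_t Index;          // InsertElt: lane being written
  int Scalar;                 // InsertElt: value being written
  std::vector<int> Mask;      // Shuffle: <0 undef, [0,N) current, [N,2N) constant
  std::vector<int> Constants; // Shuffle: lanes of the constant second operand
};

Op makeInsert(std::size_t Index, int Scalar) {
  return Op{Op::InsertElt, Index, Scalar, {}, {}};
}

Op makeShuffle(std::vector<int> Mask, std::vector<int> Constants) {
  return Op{Op::Shuffle, 0, 0, std::move(Mask), std::move(Constants)};
}

using ElementStore = std::pair<std::size_t, int>; // (lane, value)

// Walks the chain from the load towards the store and returns the scalar
// stores that could replace the vector store, or nullopt when a lane would
// have to be extracted from a different lane of the loaded vector.
std::optional<std::vector<ElementStore>>
findElementStores(std::size_t NumElts, const std::vector<Op> &Chain) {
  std::vector<Lane> State(NumElts);
  for (std::size_t I = 0; I < NumElts; ++I)
    State[I] = Lane{Lane::FromLoad, static_cast<int>(I)};

  for (const Op &O : Chain) {
    if (O.K == Op::InsertElt) {
      if (O.Index >= NumElts)
        return std::nullopt;
      State[O.Index] = Lane{Lane::Scalar, O.Scalar};
      continue;
    }
    // Shuffles that change the vector length are not handled.
    if (O.Mask.size() != NumElts || O.Constants.size() != NumElts)
      return std::nullopt;
    const std::vector<Lane> Old = State; // lanes may be exchanged: read a copy
    const int N = static_cast<int>(NumElts);
    for (std::size_t I = 0; I < NumElts; ++I) {
      const int M = O.Mask[I];
      if (M < 0)
        State[I] = Lane{Lane::Undef, 0};
      else if (M < N)
        State[I] = Old[M];
      else if (M < 2 * N)
        State[I] = Lane{Lane::Scalar, O.Constants[M - N]};
      else
        return std::nullopt;
    }
  }

  std::vector<ElementStore> Stores;
  for (std::size_t I = 0; I < NumElts; ++I) {
    const Lane &L = State[I];
    if (L.K == Lane::Scalar)
      Stores.emplace_back(I, L.Value);
    else if (L.K == Lane::FromLoad && L.Value != static_cast<int>(I))
      return std::nullopt; // would need an extract from the loaded vector
    // Unchanged and undef lanes need no store.
  }
  return Stores;
}
```

For the `foo` example above, `findElementStores(4, {makeInsert(0, 1), makeInsert(3, 2)})` yields two element stores, for lanes 0 and 3, while a shuffle that merely swaps two loaded lanes makes the function give up.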

A target hook, isCheapToSplitStore, is added. For now, only the PowerPC target turns the optimization on.

Discussion: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135432.html http://lists.llvm.org/pipermail/llvm-dev/2019-October/135638.html

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

qiucf created this revision. Nov 13 2019, 11:16 PM

Herald added subscribers: llvm-commits, steven.zhang, jsji and 3 others. Nov 13 2019, 11:16 PM

qiucf added a reviewer: fhahn. Nov 14 2019, 12:01 AM

Herald added a subscriber: wuzish. Nov 14 2019, 12:01 AM

This is missing test coverage.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6799–6801	It would be best to do as much of this checking as early as possible, before calling `getVectorUpdates()`

steven.zhang added inline comments. Nov 14 2019, 3:05 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6800	It would also be beneficial to do this folding before legalization, so that the illegal stores can be combined into legal stores later.

Add regression test.
Check legality before doing costly operations.

I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic. Also, currently i8 and i16 are illegal on PowerPC. So this won't affect vectors like <8 x i16> or <16 x i8>.

qiucf marked 2 inline comments as done. Nov 15 2019, 1:36 AM

qiucf added a parent revision: D70373: [NFC] [PowerPC] Add volatile flag to a swap optimization test. Nov 17 2019, 10:43 PM

Remove the test case changes to swaps-le-5 and swaps-le-6, since they have been moved to a separate differential, D70373.

I plan to move the two changed swap-le tests into another NFC patch, since they're not related to the logic.

I think all test case changes should be shown here so that reviewers can check whether they are regressions.

This is a very good idea. Love it!

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6643	Alignment seems strange. Please use clang-format
6656	What about one operand is undef? putting an undef into Path has no meaning?
6671	If Path is empty here, there is an infinite loop since Current is not changed.
6743	If changed is already set, don't need to set it again.
6757	I see for Buf, the second value is only given 0/1? Can we use a bool instead?
6761	Move this comment down to line 6761, and also add the other excluded cases to the comment.
6765	Can we handle indexed store too? I guess you also need to exclude that kind of store.
6787	I think `UpdatedElementsIdx` and all the code that updates `UpdatedElementsIdx` are redundant. Could we collect the updated element indices in `getVectorUpdates` and check whether States[i] needs to be updated via its States[i].second?
6803	Better to add an assert here that `UpdatedElementsIdx.size()` will never be larger than `VecLen`
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1606	Do we have perf testing showing that, for a <4 x i32> store, 3 scalar i32 stores perform well?
llvm/test/CodeGen/PowerPC/vector-store-split.ll
3 ↗	(On Diff #229750)	I think the test coverage is not enough, especially the negative testing. For example, `// We don't support shuffle which changes vector length.`, BUILD_VECTOR-related testing, undef operand testing, and so on.

Added Eli and Justin in case they are interested in chiming in on at least the target-independent parts of this.

It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.

The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.

I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.

This changeset is a rather complicated and hard-to-follow piece of code that appears to solve a problem in the motivating test case, but makes it rather unclear whether it is a good idea or a bad idea overall.
There are a number of things that need to be clarified for this patch to proceed:

Needs empirical performance data. This is straightforward - see if it affects the performance of any benchmarks.
Needs more thorough testing. We need to cover more types, more ways we may end up with these "merge-and-store" idioms, different numbers of elements changed, etc.
An overview of the algorithm should be provided to aid readability. The code as written does not exactly aid readability so it would be good to provide an outline.
I still think we should do this in InstCombine rather than on the SDAG. It seems like that would be a more natural place to do this.
If we are doing it in InstCombine, why not simply produce a masked store? If we converted the IR for the test case in the description to this:

define dso_local void @foo(<4 x float>* nocapture %a, float %b) local_unnamed_addr #0 {
entry:
  call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> <float 1.000000e+00, float undef, float undef, float 2.000000e+00>, <4 x float>* %a, i32 1, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
  ret void
}
declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

We will get the desired codegen:

lis 4, 16256
lis 5, 16384
stw 4, 0(3)
stw 5, 12(3)

And there is target-independent handling for masked stores, so it is not at all clear to me why we'd go through the trouble of implementing this complex handling in the SDAG.

@spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.

llvm/include/llvm/CodeGen/TargetLowering.h
3413	This should be more descriptive. Perhaps: /// Determine whether it is profitable to split a single vector store /// into \p NumSplit scalar stores. Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return `false` when `NumSplit > 2`, and quite likely even `NumSplit == 2` is not cheaper than a single vector store. The problem is that there is no context to determine what we would be saving if we were to split this up. If we have some sequence of operations on a vector and then we need to store that vector either with a single vector store or split into `NumSplit` pieces, the answer is clearly: don't split it (one store is better than many).
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6619	Nit: name here does not match the function name.
6759	For readability, move these declarations past the early exits.
6765	+1
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
1609	Really? So it is cheaper to split a `v16i8` vector into 15 pieces than it is to store it with a single store? I really doubt that.

This revision now requires changes to proceed. Nov 22 2019, 6:06 AM

In D70223#1756723, @nemanjai wrote:

@spatel I know you were initially against doing this in InstCombine, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, lib/CodeGen/ScalarizeMaskedMemIntrin.cpp should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of load/store/insertelement operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
For the attached example, we are getting rid of a load and two insertelement instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a load and three insertelement instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.

I'm still skeptical that IR canonicalization is better/easier, but I could be convinced. You're still going to have to do the memory analysis that @efriedma mentioned to make sure this is even a valid transform. Is that easier in IR?
To proceed on the IR path (and again, I'm skeptical that instcombine vs. some other IR pass is the right option), we would need to create codegen tests for multiple targets with the alternative IR sequences. Then, we would need to potentially improve that output for multiple targets. After that is done, we could then transform to the masked store intrinsic in IR.

Here's an example of a codegen test of IR alternatives - as I think was originally shown in the llvm-dev thread, we want to replace 2 out of 4 elements of a vector, but this is with values that are in scalar params/registers rather than constants:

define void @insert_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %t0 = load <4 x i32>, <4 x i32>* %q, align 16
  %vecins0 = insertelement <4 x i32> %t0, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  store <4 x i32> %vecins3, <4 x i32>* %q, align 16
  ret void
}

declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)

define void @masked_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
  %vecins0 = insertelement <4 x i32> undef, i32 %s0, i32 0
  %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecins3, <4 x i32>* %q, i32 16, <4 x i1> <i1 1, i1 0, i1 0, i1 1>)
  ret void
}

The 2nd sequence looks way better for PPC, so great:

addis 6, 2, .LCPI0_0@toc@ha
mtvsrd 0, 4
lvx 4, 0, 3
addi 4, 6, .LCPI0_0@toc@l
xxswapd	34, 0
lvx 3, 0, 4
mtvsrd 0, 5
addis 4, 2, .LCPI0_1@toc@ha
addi 4, 4, .LCPI0_1@toc@l
vperm 2, 2, 4, 3
xxswapd	35, 0
lvx 4, 0, 4
vperm 2, 3, 2, 4
stvx 2, 0, 3

vs.

stw 4, 0(3)
stw 5, 12(3)

But here's what happens on x86 with AVX2 (this target has custom/legal vector inserts and vector masked store lowering):

vmovdqa	(%rdi), %xmm0
vpinsrd	$0, %esi, %xmm0, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	%xmm0, (%rdi)

vs.

vmovd	%esi, %xmm0
vpinsrd	$3, %edx, %xmm0, %xmm0
vmovdqa	LCPI1_0(%rip), %xmm1    ## xmm1 = [4294967295,0,0,4294967295]
vpmaskmovd	%xmm0, %xmm1, (%rdi)

I'm not actually sure which of those we consider better. But neither is ideal. We'd be better off pretending there was no masked move instruction and get the expanded:

movl	%esi, (%rdi)
movl	%edx, 12(%rdi)
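
To make the expanded form concrete: with a constant mask, the masked store reduces to one scalar store per enabled lane, at a byte offset of lane index times element size. The snippet below is a minimal sketch of that expansion, assuming 32-bit elements; `ScalarStore` and `expandConstantMaskedStore` are hypothetical names and this is not the code of ScalarizeMaskedMemIntrin.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one scalar store relative to the vector's
// base address.
struct ScalarStore {
  std::size_t ByteOffset;
  std::uint32_t Value;
};

// Expands a masked store of i32 lanes with a compile-time-constant mask
// into the scalar stores for the enabled lanes only.
std::vector<ScalarStore>
expandConstantMaskedStore(const std::vector<std::uint32_t> &Value,
                          const std::vector<bool> &Mask) {
  assert(Value.size() == Mask.size() && "value and mask need equal lane counts");
  std::vector<ScalarStore> Stores;
  for (std::size_t Lane = 0; Lane < Mask.size(); ++Lane)
    if (Mask[Lane])
      Stores.push_back(ScalarStore{Lane * sizeof(std::uint32_t), Value[Lane]});
  return Stores;
}
```

With the mask `<1, 0, 0, 1>` from `@masked_store`, this produces stores at byte offsets 0 and 12, matching the two `movl` (and, on PPC, `stw`) instructions shown above.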

In D70223#1755719, @efriedma wrote:

It looks like this is missing some checks on the load. The code needs to check that the load and store target the same address, and that there aren't any operations between the load and the store that could modify the memory.

The profitability check probably needs to weigh the cost of the memory operations a little more carefully in cases where the total number of memory operations increases.

I'm a little worried there could be a performance penalty on certain CPUs if the vector value is loaded soon afterwards, due to the partial overlap. Depends on details of the specific CPU, though, and maybe it's rare enough that it doesn't matter.

This patch might be split into two: (1) Merge shuffle-insert and shuffle-shuffle if they have no other uses and RHS in shuffle is constant; (2) Make MatchVectorStoreSplit only consider a simple load-shuffle/insert-store chain.

That may exclude cases where both the LHS and RHS are results of shuffles from the same root. However, such cases should be rare and complex.

My question is: why didn't we implement merging shuffles in instcombine? I saw comments saying that it's unsafe or makes things worse:

we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.

Would it help if we check their uses before folding them?
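
For reference, the rule described in the quoted comment can be sketched as follows. This is an illustration only, assuming single-source shuffles (the second operand is undef) and treating -1 as an undef lane; `composeMasks`, `isSplatMask` and `isConservativeMerge` are hypothetical helpers, not functions from InstCombine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mask of the single shuffle equivalent to
//   shuffle(shuffle(X, undef, Inner), undef, Outer).
std::vector<int> composeMasks(const std::vector<int> &Inner,
                              const std::vector<int> &Outer) {
  std::vector<int> Merged;
  Merged.reserve(Outer.size());
  for (int Lane : Outer) {
    if (Lane < 0) {
      Merged.push_back(-1); // an undef lane stays undef
      continue;
    }
    assert(static_cast<std::size_t>(Lane) < Inner.size() &&
           "only single-source shuffles are modelled");
    Merged.push_back(Inner[Lane]);
  }
  return Merged;
}

// True if every lane selects the same, defined source element.
bool isSplatMask(const std::vector<int> &Mask) {
  return !Mask.empty() && Mask[0] >= 0 &&
         std::all_of(Mask.begin(), Mask.end(),
                     [&](int M) { return M == Mask[0]; });
}

// The conservative rule from the quoted comment: merge only if the result
// is a splat or is identical to one of the input masks.
bool isConservativeMerge(const std::vector<int> &Inner,
                         const std::vector<int> &Outer) {
  const std::vector<int> Merged = composeMasks(Inner, Outer);
  return isSplatMask(Merged) || Merged == Inner || Merged == Outer;
}
```

Two rotations by one lane compose into a rotation by two lanes, a mask that appears in neither input, so this rule rejects the merge; a splat followed by any permutation is still a splat, so that merge is accepted.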

In D70223#1759762, @qiucf wrote:

My question is: why didn't we implement merging shuffles in instcombine? I saw comments saying that it's unsafe or makes things worse:

we are absolutely afraid of producing a shuffle mask not in the input program, because the code gen may not be smart enough to turn a merged shuffle into two specific shuffles: it may produce worse code. As such, we only merge two shuffles if the result is either a splat or one of the input shuffle masks. In this case, merging the shuffles just removes one instruction, which we know is safe.

Would it help if we check their uses before folding them?

I don't understand how checking uses would change that. Do you have an example?

The code comment is still accurate in general because instcombine must be good for all targets and not all targets have a generic shuffle instruction - Altivec vperm is a real luxury. :)
So it's very difficult to reverse shuffle transforms later. As an example, see how much code x86 needs to map shuffles to a series of incomplete shuffle ISAs under combineShuffle():
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/X86/X86ISelLowering.cpp

qiucf mentioned this in D70373: [NFC] [PowerPC] Add volatile flag to a swap optimization test. Nov 27 2019, 2:43 AM

Address some comments from the community:

Add the swap-le test back to this revision for better review.
Add check for indexed store/loads.
Add more tests for change-length shuffle, undef, etc.
Fix some comments that were inconsistent with the code.
Eliminate a possible infinite loop case.

Herald added a subscriber: jfb. Nov 28 2019, 2:13 AM

efriedma requested changes to this revision. Dec 3 2019, 4:41 PM

This revision now requires changes to proceed. Dec 3 2019, 4:41 PM

Thanks to everyone for the comments and explanations. I think there are two key issues to clarify and solve in this revision:

The implementation is too complicated for such a specialized case. It searches a tree, but what we can actually do with the result is limited. So I'm going to simplify the logic, and cut some extremely rare cases if necessary.
How many elements are actually suitable for this optimization? I didn't get obviously better results in benchmarks. Although we can use target information in DAGCombiner, I think this conceptually belongs in InstCombine. Since TargetTransformInfo may not always be suitable here, should we start by doing this only for the 1-element case?

https://reviews.llvm.org/D71828 has been created with simpler logic in InstCombine.

Revision Contents

Path                                               Size
llvm/include/llvm/CodeGen/TargetLowering.h         4 lines
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp      206 lines
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp   5 lines
llvm/lib/Target/PowerPC/PPCISelLowering.h          3 lines
llvm/lib/Target/PowerPC/PPCISelLowering.cpp        11 lines
llvm/test/CodeGen/PowerPC/swaps-le-5.ll            7 lines
llvm/test/CodeGen/PowerPC/swaps-le-6.ll            5 lines

Diff 229232

llvm/include/llvm/CodeGen/TargetLowering.h

[3,404 lines not shown] virtual char isNegatibleForFree(SDValue Op, SelectionDAG &DAG,
                                   bool LegalOperations, bool ForCodeSize,
                                   unsigned Depth = 0) const;

   /// If isNegatibleForFree returns true, return the newly negated expression.
   virtual SDValue getNegatedExpression(SDValue Op, SelectionDAG &DAG,
                                        bool LegalOperations, bool ForCodeSize,
                                        unsigned Depth = 0) const;

+  /// If it's profitable to split vector store into individual stores.
[Inline comment, nemanjai, Done: This should be more descriptive. Perhaps: /// Determine whether it is profitable to split a single vector store /// into \p NumSplit scalar stores. Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return `false` when `NumSplit > 2`, and quite likely even `NumSplit == 2` is not cheaper than a single vector store. The problem is that there is no context to determine what we would be saving if we were to split this up. If we have some sequence of operations on a vector and then we need to store that vector either with a single vector store or split into `NumSplit` pieces, the answer is clearly: don't split it (one store is better than many).]
+  virtual bool isCheapToSplitStore(StoreSDNode *N, unsigned NumSplit,
+                                   SelectionDAG &DAG) const;
+
   //===--------------------------------------------------------------------===//
   // Lowering methods - These methods must be implemented by targets so that
   // the SelectionDAGBuilder code knows how to lower these.
   //

   /// This hook must be implemented to lower the incoming (formal) arguments,
   /// described by the Ins array, into the specified DAG. The implementation
   /// should fill in the InVals array with legal-type argument values, and
[862 lines not shown]

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

[541 lines not shown] private:
     SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);
     SDValue MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,
                               SDValue InnerPos, SDValue InnerNeg,
                               unsigned PosOpcode, unsigned NegOpcode,
                               const SDLoc &DL);
     SDValue MatchRotate(SDValue LHS, SDValue RHS, const SDLoc &DL);
     SDValue MatchLoadCombine(SDNode *N);
     SDValue MatchStoreCombine(StoreSDNode *N);
+    SDValue MatchVectorStoreSplit(StoreSDNode *N);
     SDValue ReduceLoadWidth(SDNode *N);
     SDValue ReduceLoadOpStoreWidth(SDNode *N);
     SDValue splitMergedValStore(StoreSDNode *ST);
     SDValue TransformFPLoadStorePair(SDNode *N);
     SDValue convertBuildVecZextToZext(SDNode *N);
     SDValue reduceBuildVecExtToExtBuildVec(SDNode *N);
     SDValue reduceBuildVecToShuffle(SDNode *N);
     SDValue createBuildVecShuffle(const SDLoc &DL, SDNode *N,
[6,052 lines not shown] SDValue NewStore =
       DAG.getStore(Chain, SDLoc(N), CombinedValue, FirstStore->getBasePtr(),
                    FirstStore->getPointerInfo(), FirstStore->getAlignment());

   // Rely on other DAG combine rules to remove the other individual stores.
   DAG.ReplaceAllUsesWith(N, NewStore.getNode());
   return NewStore;
 }

+/// getVectorUpdateChain - Detect a load-insert/shuffle-store chain. Returns
[Inline comment, nemanjai, Done: Nit: name here does not match the function name.]
+/// head LOAD if we met such pattern. Otherwise, this function returns null
+/// SDValue. The path here contains both nodes on the path and respective
+/// 'next' node index we should visit.
+static SDValue
+getVectorUpdatePath(SDValue Current,
+                    SmallVectorImpl<std::pair<SDValue, unsigned>> &Path) {
+  if (Current.getOpcode() == ISD::LOAD)
+    return SDValue();
+
+  const unsigned MaxDepth = 12;
+  unsigned Depth = 0;
+
+  // Here we use DFS to find the LOAD we desire. This buffer keeps nodes
+  // we've not visited. If current path is wrong, back and pick another one.
+  SmallVector<SDValue, 16> SearchBuffer;
+
+  while (Current) {
+    if (Depth >= MaxDepth)
+      return SDValue();
+
+    if (Current.getOpcode() == ISD::LOAD)
+      return Current;
+    else if (Current.getOpcode() == ISD::INSERT_VECTOR_ELT) {
+      // For INSERT_ELT nodes, we only have one path.
[Inline comment, shchenz, Done: Alignment seems strange. Please use clang-format]
+      ++Depth;
+      Path.push_back(std::make_pair(Current, 0));
+      Current = Current.getOperand(0);
+    } else if (Current.getOpcode() == ISD::VECTOR_SHUFFLE) {
+      // For VECTOR_SHUFFLE, we pick the first non-constant and push
+      // the rest into buffer.
+      ++Depth;
+      SDValue Op1 = Current.getOperand(0);
+      SDValue Op2 = Current.getOperand(1);
+
+      // Since we're desiring a LOAD at root of the path, constant vector
+      // (BUILD_VECTOR) operand in shuffle instr can't be what we want.
+      if (Op1.getOpcode() == ISD::BUILD_VECTOR) {
[Inline comment, shchenz, Done: What about one operand is undef? putting an undef into Path has no meaning?]
+        Path.push_back(std::make_pair(Current, 1));
+        Current = Op2;
+      } else {
+        Path.push_back(std::make_pair(Current, 0));
+        Current = Op1;
+        if (Op2.getOpcode() != ISD::BUILD_VECTOR)
+          SearchBuffer.push_back(Op2);
+      }
+    } else {
+      if (SearchBuffer.empty())
+        return SDValue();
+
+      // Go back until we find top of buffer is 'brother' of current node.
+      // And pick it from buffer.
+      while (!Path.empty()) {
[Inline comment, shchenz, Done: If Path is empty here, there is an infinite loop since Current is not changed.]
+        std::pair<SDValue, unsigned> Top = Path.pop_back_val();
+        if (Top.first.getOpcode() != ISD::VECTOR_SHUFFLE)
+          continue;
+        unsigned NewPath = !Top.second;
+        if (Top.first.getOperand(NewPath) == SearchBuffer.back()) {
+          SearchBuffer.pop_back();
+          Path.push_back(std::make_pair(Top.first, NewPath));
+          Current = Top.first.getOperand(NewPath);
+          break;
+        }
+      }
+    }
+  }
+
+  return SDValue();
+}

+/// getVectorUpdates - Update a state table of elements generated by
+/// a load-insert/shuffle-store chain. We can use this table to simplify
+/// vector store code.
+static bool
+getVectorUpdates(SmallVectorImpl<std::pair<SDValue, int>> &States,
+                 SmallVectorImpl<std::pair<SDValue, unsigned>> &Worklist,
+                 const SDValue &Source) {
+  bool Changed = false;
+  auto VecLen = Source.getValueType().getVectorNumElements();
+  SDValue LastItem;
+
+  // Each element in States should be one of:
+  // - The first is default SDValue if it's undef.
+  // - The first is scalar if it's a value representing scalar.
+  // - The first is vector and second is index if it's element from a vector.
+
+  while (!Worklist.empty()) {
+    auto Tail = Worklist.pop_back_val();
+
+    // INSERT only changes one element, so just change entry for that one.
+    if (Tail.first.getOpcode() == ISD::INSERT_VECTOR_ELT) {
+      auto IndexNode = dyn_cast<ConstantSDNode>(Tail.first.getOperand(2));
+      if (IndexNode == nullptr)
+        return false;
+      auto IndexValue = IndexNode->getZExtValue();
+      if (IndexValue >= VecLen)
+        return false;
+      States[IndexValue].first = Tail.first.getOperand(1);
+      Changed = true;
+    } else if (Tail.first.getOpcode() == ISD::VECTOR_SHUFFLE) {
+      // Elements may got exchanged. So we copy the table and refer
+      // to the original ones when updating.
+      const ShuffleVectorSDNode *SVN =
+          dyn_cast<ShuffleVectorSDNode>(Tail.first.getNode());
+      SmallVector<std::pair<SDValue, int>, 16> OriginalStates(States.begin(),
+                                                              States.end());
+      // We don't support shuffle which changes vector length.
+      auto MaskSize = SVN->getMask().size();
+      if (MaskSize != VecLen)
+        return false;
+
+      for (unsigned i = 0; i < MaskSize; i++) {
+        int Mask = SVN->getMaskElt(i);
+        if (Mask < 0)
+          States[i].first = SDValue();
+        else {
+          SDValue ChoosenVec = SVN->getOperand(Mask >= (int)VecLen);
+          States[i].second = (Mask < (int)VecLen) ? Mask : Mask - VecLen;
+
+          // Keep previous set vector if shuffle doesn't change this.
+          if (ChoosenVec != LastItem)
+            States[i].first = ChoosenVec;
+        }
+
+        if (OriginalStates[i] != States[i])
[Inline comment, shchenz, Done: If changed is already set, don't need to set it again.]
+          Changed = true;
+      }
+    }
+
+    LastItem = Tail.first;
+  }
+
+  return Changed;
+}

		/// Match a pattern where only a few elements of a vector are updated but the
		/// whole vector is stored. This method will recognize a chain in which vector
		/// is updated, and split the store into multiple element stores.
		SDValue DAGCombiner::MatchVectorStoreSplit(StoreSDNode *N) {
		shchenzUnsubmitted Done Reply Inline Actions I see for Buf, the second value is only given 0/1? Can we use a bool instead? shchenz: I see for Buf, the second value is only given 0/1? Can we use a bool instead?
		SmallVector<std::pair<SDValue, unsigned>, 16> Buf;
		SmallVector<std::pair<SDValue, int>, 16> States;
		nemanjaiUnsubmitted Done Reply Inline Actions For readability, move these declarations past the early exits. nemanjai: For readability, move these declarations past the early exits.

		// We don't support scalable vectors now.
		shchenzUnsubmitted Done Reply Inline Actions Move this comments down to line 6761. And also add other execluded cases in the comment shchenz: Move this comments down to line 6761. And also add other execluded cases in the comment
		EVT VecType = N->getValue().getValueType();
		if (!N->isSimple() \|\| VecType.isScalableVector() \|\| !VecType.isVector())
		return SDValue();

		shchenzUnsubmitted Done Reply Inline Actions Can we handle indexed store too? I guess you also need to exclude that kind of store. shchenz: Can we handle indexed store too? I guess you also need to exclude that kind of store.
		nemanjaiUnsubmitted Done Reply Inline Actions +1 nemanjai: +1
		SDValue SourceVecLoad = getVectorUpdatePath(N->getValue(), Buf);
		LoadSDNode *LoadNode = dyn_cast_or_null<LoadSDNode>(SourceVecLoad.getNode());
		if (!LoadNode \|\| !LoadNode->isSimple())
		return SDValue();

		// Prepare a table, initialized by the vector and its index. So we can find
		// which elements got updated in the chain.
		auto VecLen = SourceVecLoad.getValueType().getVectorNumElements();
		for (unsigned i = 0; i < VecLen; ++i)
		States.push_back(std::make_pair(SourceVecLoad, i));

		if (!getVectorUpdates(States, Buf, SourceVecLoad))
		return SDValue();

		// Go over the updated table to find updated elements.
		SmallVector<int, 16> UpdatedElementsIdx;
		for (unsigned i = 0; i < States.size(); ++i) {
		SDValue &State = States[i].first;
		if (!State)
		continue;

		// It's not profitable to extract from another vector and insert.
		shchenzUnsubmitted Not Done Reply Inline Actions I am thinking `UpdatedElementsIdx` and all the codes to update `UpdatedElementsIdx` is redundant. we could collect updated elements idx in `getVectorUpdates` and check whether States[i] need to update by its States[i].second? shchenz: I am thinking `UpdatedElementsIdx` and all the codes to update `UpdatedElementsIdx` is…
		if (State.getValueType().isVector()) {
		if (State.getOpcode() == ISD::BUILD_VECTOR) {
		State = State.getOperand(States[i].second);
		UpdatedElementsIdx.push_back(i);
		} else if (State != SourceVecLoad \|\| States[i].second != (int)i)
		return SDValue();
		} else
		UpdatedElementsIdx.push_back(i);
		}

		EVT EleVT = VecType.getScalarType();
		if (TLI.isCheapToSplitStore(N, UpdatedElementsIdx.size(), DAG) &&
		TLI.getOperationAction(ISD::STORE, EleVT) == TargetLowering::Legal &&
		steven.zhangUnsubmitted Done Reply Inline Actions And it is also benefit to do this folding before the legalization, so that, the illegal store could be combined to legal store later. steven.zhang: And it is also benefit to do this folding before the legalization, so that, the illegal store…
		TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), EleVT)) {
		lebedev.riUnsubmitted Done Reply Inline Actions It would be best to do as much of this checking as early as possible, before calling `getVectorUpdates()` lebedev.ri: It would be best to do as much of this checking as early as possible, before calling…
		SDLoc Loc = SDLoc(N);
		SDValue Index, Chain;
		shchenzUnsubmitted Done Reply Inline Actions Better to add an assert here that `UpdatedElementsIdx.size()` will never be larger than `VecLen` shchenz: Better to add an assert here that `UpdatedElementsIdx.size()` will never be larger than `VecLen`

		// Insert new stores, based on vector's address.
		for (unsigned i = 0; i < UpdatedElementsIdx.size(); ++i) {
		Index = TLI.getVectorElementPointer(
		DAG, N->getBasePtr(), VecType,
		DAG.getConstant(UpdatedElementsIdx[i], Loc, MVT::i32));
		Chain =
		DAG.getStore(Chain ? Chain : N->getChain(), Loc,
		States[UpdatedElementsIdx[i]].first,
		Index, N->getPointerInfo());
		}
		return Chain;
		}

		return SDValue();
		}
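
The per-element decision made by the loop above can be modelled outside LLVM. The following is a minimal sketch, not the patch's code: `Origin`, `LaneState` and `collectUpdatedLanes` are hypothetical names that stand in for the `(SDValue, int)` pairs in `States`, and the `Unset` case is assumed to correspond to the `!State` check that the loop skips.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for what each States[i].first can be in the patch.
enum class Origin {
  Unset,       // no tracked value; the loop skips these lanes
  SourceLoad,  // lane still comes from the original vector load
  Scalar,      // lane was overwritten by an inserted scalar
  BuildVector, // lane comes from one operand of a BUILD_VECTOR
  OtherVector  // lane would have to be extracted from another vector
};

struct LaneState {
  Origin From;
  int Lane; // lane index inside the origin (unused for Scalar and Unset)
};

// Fills Updated with the lanes that need a scalar store.
// Returns false when the transform should bail out, mirroring the
// `return SDValue()` path in the loop above.
bool collectUpdatedLanes(const std::vector<LaneState> &States,
                         std::vector<std::size_t> &Updated) {
  Updated.clear();
  for (std::size_t I = 0; I < States.size(); ++I) {
    const LaneState &S = States[I];
    switch (S.From) {
    case Origin::Unset:
      break; // nothing tracked for this lane
    case Origin::Scalar:
    case Origin::BuildVector:
      Updated.push_back(I); // a scalar value is available to store
      break;
    case Origin::SourceLoad:
      // Only an unmoved lane of the source load may be left alone.
      if (S.Lane != static_cast<int>(I))
        return false;
      break;
    case Origin::OtherVector:
      return false; // extracting from another vector is not profitable
    }
  }
  return true;
}
```

For the `foo` example in the summary, lanes 0 and 3 would be `Scalar` and lanes 1 and 2 would be `SourceLoad` at their own index, so the model reports lanes 0 and 3 as the ones to store.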

/// Match a pattern where a wide type scalar value is loaded by several narrow		/// Match a pattern where a wide type scalar value is loaded by several narrow
/// loads and combined by shifts and ors. Fold it into a single load or a load		/// loads and combined by shifts and ors. Fold it into a single load or a load
/// and a BSWAP if the targets supports it.		/// and a BSWAP if the targets supports it.
///		///
/// Assuming little endian target:		/// Assuming little endian target:
/// i8 *a = ...		/// i8 *a = ...
/// i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)		/// i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
/// =>		/// =>
[... 9,565 lines not shown; context: SDValue DAGCombiner::visitSTORE(SDNode *N) { ...]
// load / store ops.		// load / store ops.
if (SDValue NewST = TransformFPLoadStorePair(N))		if (SDValue NewST = TransformFPLoadStorePair(N))
return NewST;		return NewST;

// Try transforming several stores into STORE (BSWAP).		// Try transforming several stores into STORE (BSWAP).
if (SDValue Store = MatchStoreCombine(ST))		if (SDValue Store = MatchStoreCombine(ST))
return Store;		return Store;

		if (SDValue Split = MatchVectorStoreSplit(ST))
		return Split;

if (ST->isUnindexed()) {		if (ST->isUnindexed()) {
// Walk up chain skipping non-aliasing memory nodes, on this store and any		// Walk up chain skipping non-aliasing memory nodes, on this store and any
// adjacent stores.		// adjacent stores.
if (findBetterNeighborChains(ST)) {		if (findBetterNeighborChains(ST)) {
// replaceStoreChain uses CombineTo, which handled all of the worklist		// replaceStoreChain uses CombineTo, which handled all of the worklist
// manipulation. Return the original node to not do anything else.		// manipulation. Return the original node to not do anything else.
return SDValue(ST, 0);		return SDValue(ST, 0);
}		}
[... 4,715 lines not shown ...]

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp


[... 5,574 lines not shown; context: return DAG.getNode(ISD::FP_ROUND, SDLoc(Op), Op.getValueType(), ...]
LegalOperations, ForCodeSize,		LegalOperations, ForCodeSize,
Depth + 1),		Depth + 1),
Op.getOperand(1));		Op.getOperand(1));
}		}

llvm_unreachable("Unknown code");		llvm_unreachable("Unknown code");
}		}

		bool TargetLowering::isCheapToSplitStore(StoreSDNode *N, unsigned NumSplit,
		SelectionDAG &DAG) const {
		return false;
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Legalization Utilities		// Legalization Utilities
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

bool TargetLowering::expandMUL_LOHI(unsigned Opcode, EVT VT, SDLoc dl,		bool TargetLowering::expandMUL_LOHI(unsigned Opcode, EVT VT, SDLoc dl,
SDValue LHS, SDValue RHS,		SDValue LHS, SDValue RHS,
SmallVectorImpl<SDValue> &Result,		SmallVectorImpl<SDValue> &Result,
EVT HiLoVT, SelectionDAG &DAG,		EVT HiLoVT, SelectionDAG &DAG,
[... 1,782 lines not shown ...]

llvm/lib/Target/PowerPC/PPCISelLowering.h

[... 640 lines not shown; context: public: ...]
bool isCheapToSpeculateCttz() const override {		bool isCheapToSpeculateCttz() const override {
return true;		return true;
}		}

bool isCheapToSpeculateCtlz() const override {		bool isCheapToSpeculateCtlz() const override {
return true;		return true;
}		}

		bool isCheapToSplitStore(StoreSDNode *N, unsigned NumSplit,
		SelectionDAG &DAG) const override;

bool isCtlzFast() const override {		bool isCtlzFast() const override {
return true;		return true;
}		}

bool isEqualityCmpFoldedWithSignedCmp() const override {		bool isEqualityCmpFoldedWithSignedCmp() const override {
return false;		return false;
}		}

[... 590 lines not shown ...]

llvm/lib/Target/PowerPC/PPCISelLowering.cpp


[... 1,592 lines not shown; context: for (unsigned j = 0; j != UnitSize; ++j) { // Step over bytes within unit ...]
LHSStart+j+i*UnitSize) ||		LHSStart+j+i*UnitSize) ||
!isConstantOrUndef(N->getMaskElt(i*UnitSize*2+UnitSize+j),		!isConstantOrUndef(N->getMaskElt(i*UnitSize*2+UnitSize+j),
RHSStart+j+i*UnitSize))		RHSStart+j+i*UnitSize))
return false;		return false;
}		}
return true;		return true;
}		}

		bool PPCTargetLowering::isCheapToSplitStore(StoreSDNode *N,
		unsigned NumSplit,
		SelectionDAG &DAG) const {
		EVT VecType = N->getValue().getValueType();

		if (!VecType.isVector() || NumSplit >= VecType.getVectorNumElements())
		shchenz (inline comment, Not Done): Do we have perf testing showing that, for a <4 x i32> store, three scalar i32 stores perform well?
		return false;

		return true;
		nemanjai (inline comment, Not Done): Really? So it is cheaper to split a `v16i8` vector into 15 pieces than it is to store it with a single store? I really doubt that.
		}
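
To make the reviewers' concern concrete, the hook's rule can be written as a plain function over element counts. This is only an illustrative sketch: `isCheapToSplitStoreAsPosted` restates the condition in this revision (with `NumElts == 0` standing in for a non-vector type), while `isCheapToSplitStoreCapped` and its `MaxScalarStores` parameter are hypothetical and not part of the patch. No performance data backs any particular cap value.

```cpp
// Restates the rule from this revision: splitting is "cheap" whenever
// fewer than all elements are updated. NumElts == 0 models a non-vector.
bool isCheapToSplitStoreAsPosted(unsigned NumElts, unsigned NumSplit) {
  return NumElts != 0 && NumSplit < NumElts;
}

// Hypothetical stricter variant: additionally cap the number of scalar
// stores. MaxScalarStores is an assumed tuning knob, not an LLVM option.
bool isCheapToSplitStoreCapped(unsigned NumElts, unsigned NumSplit,
                               unsigned MaxScalarStores) {
  return isCheapToSplitStoreAsPosted(NumElts, NumSplit) &&
         NumSplit <= MaxScalarStores;
}
```

Under the posted rule, a `v16i8` store with 15 updated elements is accepted, which is the case nemanjai questions; a capped variant would reject it whenever the cap is below 15.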

/// isVMRGLShuffleMask - Return true if this is a shuffle mask suitable for		/// isVMRGLShuffleMask - Return true if this is a shuffle mask suitable for
/// a VMRGL* instruction with the specified unit size (1,2 or 4 bytes).		/// a VMRGL* instruction with the specified unit size (1,2 or 4 bytes).
/// The ShuffleKind distinguishes between big-endian merges with two		/// The ShuffleKind distinguishes between big-endian merges with two
/// different inputs (0), either-endian merges with two identical inputs (1),		/// different inputs (0), either-endian merges with two identical inputs (1),
/// and little-endian merges with two different inputs (2). For the latter,		/// and little-endian merges with two different inputs (2). For the latter,
/// the input operands are swapped (see PPCInstrAltivec.td).		/// the input operands are swapped (see PPCInstrAltivec.td).
bool PPC::isVMRGLShuffleMask(ShuffleVectorSDNode *N, unsigned UnitSize,		bool PPC::isVMRGLShuffleMask(ShuffleVectorSDNode *N, unsigned UnitSize,
unsigned ShuffleKind, SelectionDAG &DAG) {		unsigned ShuffleKind, SelectionDAG &DAG) {
[... 13,967 lines not shown ...]

llvm/test/CodeGen/PowerPC/swaps-le-5.ll

	; RUN: llc -verify-machineinstrs -mcpu=pwr8 -mtriple=powerpc64le-unknown-linux-gnu -O3 < %s | FileCheck %s			; RUN: llc -verify-machineinstrs -mcpu=pwr8 -mtriple=powerpc64le-unknown-linux-gnu -O3 < %s | FileCheck %s

	; These tests verify that VSX swap optimization works for various			; These tests verify that VSX swap optimization works for various
	; manipulations of <2 x double> vectors.			; manipulations of <2 x double> vectors.

	@x = global <2 x double> <double 9.970000e+01, double -1.032220e+02>, align 16			@x = global <2 x double> <double 9.970000e+01, double -1.032220e+02>, align 16
	@z = global <2 x double> <double 2.332000e+01, double 3.111111e+01>, align 16			@z = global <2 x double> <double 2.332000e+01, double 3.111111e+01>, align 16

	define void @bar0(double %y) {			define void @bar0(double %y) {
	entry:			entry:
	%0 = load <2 x double>, <2 x double>* @x, align 16			%0 = load <2 x double>, <2 x double>* @x, align 16
	%vecins = insertelement <2 x double> %0, double %y, i32 0			%vecins = insertelement <2 x double> %0, double %y, i32 0
	store <2 x double> %vecins, <2 x double>* @z, align 16			; Add volatile to avoid vector store split optimization.
				store volatile <2 x double> %vecins, <2 x double>* @z, align 16
	ret void			ret void
	}			}

	; CHECK-LABEL: @bar0			; CHECK-LABEL: @bar0
	; CHECK-DAG: lxvd2x [[REG1:[0-9]+]]			; CHECK-DAG: lxvd2x [[REG1:[0-9]+]]
	; CHECK-DAG: xxspltd [[REG2:[0-9]+]]			; CHECK-DAG: xxspltd [[REG2:[0-9]+]]
	; CHECK: xxpermdi [[REG3:[0-9]+]], [[REG2]], [[REG1]], 1			; CHECK: xxpermdi [[REG3:[0-9]+]], [[REG2]], [[REG1]], 1
	; CHECK: stxvd2x [[REG3]]			; CHECK: stxvd2x [[REG3]]
	; CHECK-NOT: xxswapd			; CHECK-NOT: xxswapd

	define void @bar1(double %y) {			define void @bar1(double %y) {
	entry:			entry:
	%0 = load <2 x double>, <2 x double>* @x, align 16			%0 = load <2 x double>, <2 x double>* @x, align 16
	%vecins = insertelement <2 x double> %0, double %y, i32 1			%vecins = insertelement <2 x double> %0, double %y, i32 1
	store <2 x double> %vecins, <2 x double>* @z, align 16			store volatile <2 x double> %vecins, <2 x double>* @z, align 16
	ret void			ret void
	}			}

	; CHECK-LABEL: @bar1			; CHECK-LABEL: @bar1
	; CHECK-DAG: lxvd2x [[REG1:[0-9]+]]			; CHECK-DAG: lxvd2x [[REG1:[0-9]+]]
	; CHECK-DAG: xxspltd [[REG2:[0-9]+]]			; CHECK-DAG: xxspltd [[REG2:[0-9]+]]
	; CHECK: xxmrghd [[REG3:[0-9]+]], [[REG1]], [[REG2]]			; CHECK: xxmrghd [[REG3:[0-9]+]], [[REG1]], [[REG2]]
	; CHECK: stxvd2x [[REG3]]			; CHECK: stxvd2x [[REG3]]
	[... 15 lines not shown ...]
	; CHECK: stxvd2x			; CHECK: stxvd2x
	; CHECK-NOT: xxswapd			; CHECK-NOT: xxswapd

	define void @baz1() {			define void @baz1() {
	entry:			entry:
	%0 = load <2 x double>, <2 x double>* @z, align 16			%0 = load <2 x double>, <2 x double>* @z, align 16
	%1 = load <2 x double>, <2 x double>* @x, align 16			%1 = load <2 x double>, <2 x double>* @x, align 16
	%vecins = shufflevector <2 x double> %0, <2 x double> %1, <2 x i32> <i32 3, i32 1>			%vecins = shufflevector <2 x double> %0, <2 x double> %1, <2 x i32> <i32 3, i32 1>
	store <2 x double> %vecins, <2 x double>* @z, align 16			store volatile <2 x double> %vecins, <2 x double>* @z, align 16
	ret void			ret void
	}			}

	; CHECK-LABEL: @baz1			; CHECK-LABEL: @baz1
	; CHECK: lxvd2x			; CHECK: lxvd2x
	; CHECK: lxvd2x			; CHECK: lxvd2x
	; CHECK: xxmrgld			; CHECK: xxmrgld
	; CHECK: stxvd2x			; CHECK: stxvd2x
	; CHECK-NOT: xxswapd			; CHECK-NOT: xxswapd

llvm/test/CodeGen/PowerPC/swaps-le-6.ll

	[... 54 lines not shown ...]
	; CHECK-P9: xxpermdi vs1, f1, f1, 2			; CHECK-P9: xxpermdi vs1, f1, f1, 2
	; CHECK-P9: xxpermdi vs0, vs0, vs1, 1			; CHECK-P9: xxpermdi vs0, vs0, vs1, 1
	; CHECK-P9: stxvx vs0, 0, r3			; CHECK-P9: stxvx vs0, 0, r3
	; CHECK-P9: blr			; CHECK-P9: blr
	entry:			entry:
	%0 = load <2 x double>, <2 x double>* @x, align 16			%0 = load <2 x double>, <2 x double>* @x, align 16
	%1 = load double, double* @y, align 8			%1 = load double, double* @y, align 8
	%vecins = insertelement <2 x double> %0, double %1, i32 0			%vecins = insertelement <2 x double> %0, double %1, i32 0
	store <2 x double> %vecins, <2 x double>* @z, align 16			; Add volatile to avoid vector store split optimization.
				store volatile <2 x double> %vecins, <2 x double>* @z, align 16
	ret void			ret void
	}			}

	define void @bar1() {			define void @bar1() {
	; CHECK-LABEL: bar1:			; CHECK-LABEL: bar1:
	; CHECK: # %bb.0: # %entry			; CHECK: # %bb.0: # %entry
	; CHECK: addis r3, r2, .LC0@toc@ha			; CHECK: addis r3, r2, .LC0@toc@ha
	; CHECK: addis r4, r2, .LC1@toc@ha			; CHECK: addis r4, r2, .LC1@toc@ha
	[... 28 lines not shown ...]
	; CHECK-P9: xxpermdi vs1, f1, f1, 2			; CHECK-P9: xxpermdi vs1, f1, f1, 2
	; CHECK-P9: xxmrgld vs0, vs1, vs0			; CHECK-P9: xxmrgld vs0, vs1, vs0
	; CHECK-P9: stxvx vs0, 0, r3			; CHECK-P9: stxvx vs0, 0, r3
	; CHECK-P9: blr			; CHECK-P9: blr
	entry:			entry:
	%0 = load <2 x double>, <2 x double>* @x, align 16			%0 = load <2 x double>, <2 x double>* @x, align 16
	%1 = load double, double* @y, align 8			%1 = load double, double* @y, align 8
	%vecins = insertelement <2 x double> %0, double %1, i32 1			%vecins = insertelement <2 x double> %0, double %1, i32 1
	store <2 x double> %vecins, <2 x double>* @z, align 16			store volatile <2 x double> %vecins, <2 x double>* @z, align 16
	ret void			ret void
	}			}

Revision Contents

Diff 229232
