This is an archive of the discontinued LLVM Phabricator instance.


Differential D123638

[SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64
ClosedPublic

Authored by vporpo on Apr 12 2022, 3:24 PM.


Details

Reviewers

ABataev
RKSimon
fhahn
dmgreen

Commits

rG7ba702644bac: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for…

Summary

The original patch (https://reviews.llvm.org/D121354) targets x86 and adjusts
the lookahead score of splat loads, as they can be done by the movddup
instruction, which combines the load and the broadcast and is cheap to execute.

A similar issue shows up on AArch64. The ld1r instruction performs a broadcast
load and is cheap to execute.
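
For readers unfamiliar with the term, the effect of a broadcast (splat) load can be sketched in plain C++. This is only an illustration of the semantics described above; `splatLoad` is a hypothetical helper, not an LLVM or NEON API, and it says nothing about how `ld1r` is actually selected.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not an LLVM API): models what a broadcast load does.
// One scalar is read from memory and copied into every lane of the vector.
template <typename T, std::size_t N>
std::array<T, N> splatLoad(const T *Ptr) {
  std::array<T, N> Lanes{};
  for (std::size_t I = 0; I < N; ++I)
    Lanes[I] = *Ptr; // every lane receives the same loaded value
  return Lanes;
}
```

Calling `splatLoad<double, 2>(&X)` yields a two-lane value with `X` in both lanes, which is the `2 x double` case discussed in this review.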

This patch implements the TargetTransformInfo hooks for AArch64.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vporpo created this revision. Apr 12 2022, 3:24 PM

Herald added a project: Restricted Project. Apr 12 2022, 3:24 PM

Herald added subscribers: pengfei, hiraditya, kristof.beyls.

vporpo requested review of this revision. Apr 12 2022, 3:24 PM

Herald added a subscriber: llvm-commits.

vporpo added a parent revision: D123637: [SLP][AArch64][NFC] Add test for a follow-up patch that fixes the lookahead cost of splat-loads for AArch64. Apr 12 2022, 3:24 PM

Harbormaster completed remote builds in B159331: Diff 422345. Apr 12 2022, 3:24 PM

ping

ABataev added a reviewer: dmgreen. Apr 19 2022, 1:37 PM

Thanks for looking into this! I was hoping to take a look at some point too, this has saved me a job :)

Do you have any performance results to suggest it is a good idea, past the obvious that it sounds like it should be better? Some very quick runs didn't look amazing, but perhaps they are in some ways unlucky.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2706	This code can be before the table. The table and the CostTableLookup should be together.
2708	I would expect it to have one Arg for a splat. I guess this would not match the canonical representation for a splat load too, except from where it is being created from scalar code. Not sure what to do about that, but it seems unfortunate that the cost will go up after it has been vectorized.
    %l = load i8, i8* %p
    %i = insertelement <16 x i8> poison, i8 %l, i32 0
    %s = shufflevector <16 x i8> %i, <16 x i8> poison, <16 x i32> zeroinitializer
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
281	I'm not a fan of these "isLegal" methods for things that should just have a cost. It should either be passed a VectorType or the NumElements should be an ElementCount, to allow for scalable vector types.
283	The "legal" types for NEON would be something where the legalized type is one of {v8i8, v16i8, v4i16, v8i16, v2i32, v4i32, v1i64, v2i64}, but also the floating point equivalents of them. I guess I don't really know what "legal" means here. The X86 version of this method seems to account for NumElements, but if a v4i64 can just be legalized to two v2i64's, it can still generate a splat load efficiently and copy that to another value. Hmm. For now let's say that the ElementCount size needs to be one of {8, 16, 32, 64}, and the size of the vector would be at least 64 bits. That should rule out types we don't have at least, and larger vectors can still be treated as cheap.
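
The rule proposed in the comment above can be sketched as a standalone predicate. This is a minimal model under stated assumptions, not the LLVM hook: `isCheapSplatLoadType` is a hypothetical name, it takes plain integers instead of `Type *` and `ElementCount`, and it ignores the subtarget and scalable-vector checks.

```cpp
// Hypothetical standalone model of the proposed rule (not an LLVM API):
// the element width must be one of {8, 16, 32, 64} bits and the whole
// vector must be at least 64 bits wide.
constexpr bool isCheapSplatLoadType(unsigned ElementBits, unsigned NumElements) {
  return (ElementBits == 8 || ElementBits == 16 || ElementBits == 32 ||
          ElementBits == 64) &&
         ElementBits * NumElements >= 64;
}

// 2 x 64-bit elements fill a 128-bit register.
static_assert(isCheapSplatLoadType(64, 2), "2 x double or 2 x i64");
// Wider than one register; still accepted, since it can be split.
static_assert(isCheapSplatLoadType(64, 4), "4 x i64");
// Only 32 bits in total, so it is rejected.
static_assert(!isCheapSplatLoadType(16, 2), "2 x i16");
// Element width outside the accepted set.
static_assert(!isCheapSplatLoadType(128, 2), "2 x i128");
```

Under this model a `4 x i64` vector is still accepted, which matches the remark that larger vectors can be legalized into several registers and remain cheap.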

Thank you Dave for the review.

> Do you have any performance results to suggest it is a good idea, past the obvious that it sounds like it should be better? Some very quick runs didn't look amazing, but perhaps they are in some ways unlucky.

Yes, I tried this on an AArch64 machine. It improves the BM_Dot_RealComplex_EigenDotFixed<double_16> test from the Eigen benchmark by more than 10%. But this only tests for a 2 x double type. I am not sure how it performs for other types.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2708	Yes, I think this won't be matched. `Args` is meant to be used for scalar code, but that's a great point, we could improve this in a future patch.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
281	I agree, ideally this should be handled transparently by cost functions. The reason why we need this is that we are not currently using the TTI cost model functions for the operand reordering scores in `getShallowScore()`. So we use this function as a way to check whether the target supports this type of combined load + broadcast instructions. Regarding using `ElementCount` this needs some refactoring on the X86 side too, so I will upload one refactoring patch before this, with some of these changes.
283	"Legal" means that we can efficiently generate an instruction that handles a load + broadcast. In x86 this is done with the `movddup` instruction, which seems to do this efficiently for two 64-bit elements, which is what gets accepted by `isLegalBroadcastLoad()`. The `ld1r` instruction in AArch64 seems to support a wide range of types, like the ones you listed, though I am not sure how efficient all of these are. I guess we can accept the element count sizes you propose {8, 16, 32, 64}, which sounds right, or we can stick to a `2 x double`, which we know for sure works well.

Addressed comments.

vporpo added a parent revision: D124100: [SLP] Refactoring isLegalBroadcastLoad() to use `ElementCount`. Apr 20 2022, 9:19 AM

Harbormaster completed remote builds in B160467: Diff 423930. Apr 20 2022, 9:33 AM

fhahn added inline comments. Apr 20 2022, 10:39 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
283	It looks like there is at least some test coverage missing, e.g. I think we need a test with scalable vectors and with different element types other than double.

dmgreen added inline comments. Apr 20 2022, 11:26 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	This assert looks like it will fire a lot of the time. Should we use `IsLoad && isLegalBroadcastLoad(...)`? That might make this testable from the cost model too, even if it's slightly unorthodox to use a vector load for such cases.
2708	Three-instruction patterns are tough to cost-model nicely in LLVM at the moment. It needs too much code to check down through uses and up through operands. The first part of my comment was meaning that Args should only have 1 element for a splat: `IsLoad = Args.size() == 1 && isa<LoadInst>(Args[0]);` Do we expect it to get called when there are multiple loads too?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
283	It is the "efficiently generate" that gets me, it being a pretty imprecise term. Would an architecture that uses two operations count? Many architectures will split the operation into two micro-ops in any case. Why not just give it a cost like other functions. The reciprocal throughput is 1. It's more clear what that means compared to other instructions. The other isLegalMaskedLoad/isLegalMaskedGather functions I can see a place for - they effectively tell you what the canonical representation for a gather should be - should it be treated as a vector operation that has vector optimizations on it or as a group of scalar operations that go through scalar optimizations. Ignore me though. It's a fairly minor complaint, and it working is much better than it not working :)
283	I don't think we _can_ test scalable vectors, this is only callable from the slp vectorizer, unfortunately.

vporpo added inline comments. Apr 20 2022, 2:23 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	It is already guarded by `if (IsLoad)` so I guess `IsLoad &&` is not needed.
> That might make this testable from the cost model too, even if it's slightly unorthodox to use a vector load for such cases.
I am not sure I understand. Is this about the assertion?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
283	Yeah, I am not sure how I would test scalable vectors, given that the test will run through the SLP vectorizer. I added a couple more tests for float, i32 and i64. I tried writing a test for i16 and i8 but I think it is not possible to expose the issue with the current operand reordering. These require 4x or 8x vectors respectively and a deep reduction tree. SLP would need to perform deep operand reordering across the leaf nodes of the deep reduction tree, which I think we don't support currently.

Updated test.

Harbormaster completed remote builds in B160534: Diff 424031. Apr 20 2022, 4:48 PM

dmgreen added inline comments. Apr 21 2022, 4:09 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	I meant remove the assert and turn it into a condition. This function might be checking Types which we do not consider to be legal. https://godbolt.org/z/dv56WMaP7 has an example (apparently :) ). We could presumably have a test in llvm/test/Analysis/CostModel/AArch64 that tests `load <vector>, splat-shuffle`, and it will trigger this and be treated as free?
2708	Do we need to check more than 1 arg?

vporpo added inline comments. Apr 21 2022, 6:25 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	Oops, yes this is totally broken, sorry about that. I had just copied this part of the code from X86 TTI where we have a cost table that checks the types but I skipped the table. This assertion is meant to check that the table and `isLegalBroadcastLoad` are in sync. To fix this I can either add a cost table like this:
    static const CostTblEntry NeonBroadcastLoadTbl[] = {
        {TTI::SK_Broadcast, MVT::v8i8, 0},
        {TTI::SK_Broadcast, MVT::v16i8, 0},
        {TTI::SK_Broadcast, MVT::v4i16, 0},
        {TTI::SK_Broadcast, MVT::v8i16, 0},
        {TTI::SK_Broadcast, MVT::v2i32, 0},
        {TTI::SK_Broadcast, MVT::v4i32, 0},
        {TTI::SK_Broadcast, MVT::v2f64, 0},
        {TTI::SK_Broadcast, MVT::v4f32, 0},
    };
using a similar logic as in X86TargetTransformInfo.cpp:1558, or rely on an `if (isLegalBroadcastLoad())`. I think adding the table makes it a bit more explicit, and I would prefer it. What do you think? Btw would you happen to know if a `v2f32` broadcast is handled by `ld1r` in NEON? I think this is the only 64-bit entry missing from the table.
2708	Passing the whole vector seems a bit more natural from the SLP point of view. But yes, I agree, a single argument when `Kind == TTI::SK_Broadcast` would probably make more sense from the TTI point of view. This needs changes in the x86 side too, so I will update it in a separate patch.

dmgreen added inline comments. Apr 21 2022, 6:47 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	v2f32 should work OK - there is nothing very different between a v2f32 and a v2i32 load. There are also fp16 types and possibly bf16 types (but they might not work at the moment). My slight preference would probably be towards re-using isLegalBroadcastLoad because they are then insync by design and we don't need to repeat the logic. Up to you though.
2708	I was expecting the operands to be the two inputs of the shuffle (or the scalar equivalents that would become those operands). But if it doesn't work that way, then it sounds OK to check them all.

vporpo added inline comments. Apr 21 2022, 8:24 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2708	Let me rephrase what I mean because I think I misunderstood your comments earlier. I think you raised several issues (please correct me if I am wrong):
(i) The number of `Args` passed to `getShuffleCost()` and whether we need a second one. Having one `Args` is enough for a splat, but yes I agree, in the general case `getShuffleCost()` should model any type of shuffle, which would require us to have two `Args` like `getShuffleCost(..., Args1, Args2)`, one for each operand. For the splat case we could leave one of them empty.
(ii) Populating `Args` with only one element in case of a splat. I think this makes sense too.
(iii) Whether `Args` should be a vector of scalar operands or a single vector operand. Given that we are currently using this only from within SLP, I think we can stick to a vector of scalar operands for now. But I guess it could be useful to support both.
Since these are TTI design issues I think they should be part of a separate patch. I will work on a patch for (i) and (ii).

dmgreen added inline comments. Apr 21 2022, 8:35 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2708	Yeah it can be separately to this patch. I was expecting it to work like getArithmeticInstrCost which takes an array of the Args representing the operands for the instruction (which from the loop vectorizer will be the scalar instructions that will be converted to vector operands, for example). The getIntrinsicCost does the same with the Args array.

vporpo added inline comments. Apr 21 2022, 9:25 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2610	> We could presumably have a test in llvm/test/Analysis/CostModel/AArch64 that tests `load <vector>, splat-shuffle`, and it will trigger this and be treated as free?
This won't work with the current code. TTI's `getUserCost()` won't pass shuffle's operands to `TargetTTI->getShuffleCost()`. This would have to be part of the refactoring patches.

Fixed getShuffleCost logic related to isLegalBroadcastLoad().
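
The outcome of this thread (replace the assertion with a condition and reuse the legality check) can be sketched as a small self-contained model. Everything here is an assumption for illustration: `OperandKind`, `isLegalBroadcastLoadModel` and `broadcastCostModel` are hypothetical stand-ins rather than LLVM APIs, and the fallback cost of 1 simplifies the `ShuffleTbl` lookup by ignoring the type-legalization multiplier.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the kinds of llvm::Value passed in Args.
enum class OperandKind { Load, Other };

// Hypothetical model of the legality check: element width in
// {8, 16, 32, 64} bits and a total vector width of at least 64 bits.
inline bool isLegalBroadcastLoadModel(unsigned ElementBits,
                                      unsigned NumElements) {
  const bool ElementOk = ElementBits == 8 || ElementBits == 16 ||
                         ElementBits == 32 || ElementBits == 64;
  return ElementOk && ElementBits * NumElements >= 64;
}

// Simplified model of the broadcast path of the shuffle cost query.
inline unsigned broadcastCostModel(const std::vector<OperandKind> &Args,
                                   unsigned ElementBits,
                                   unsigned NumElements) {
  // The broadcast counts as a splat load only if operands were provided
  // and every one of them is a load.
  const bool IsLoad =
      !Args.empty() &&
      std::all_of(Args.begin(), Args.end(),
                  [](OperandKind K) { return K == OperandKind::Load; });
  // A condition rather than an assertion: illegal types fall through.
  if (IsLoad && isLegalBroadcastLoadModel(ElementBits, NumElements))
    return 0; // the broadcast is folded into the load
  return 1;   // otherwise pay for a separate broadcast instruction
}
```

With this shape the legality rule lives in one place, so the cost query and the SLP lookahead score cannot drift apart, which is the reason given above for preferring it over a second cost table.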

Harbormaster completed remote builds in B160679: Diff 424232. Apr 21 2022, 11:12 AM

Thanks. This sounds like a sensible idea, even if some of the benchmarks I have don't show it, it still LGTM.

This revision is now accepted and ready to land. Apr 22 2022, 2:38 AM

This revision was landed with ongoing or failed builds. Apr 22 2022, 7:48 AM

Closed by commit rG7ba702644bac: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for… (authored by vporpo).

This revision was automatically updated to reflect the committed changes.

vporpo added a commit: rG7ba702644bac: [SLP][AArch64] Implement lookahead operand reordering score of splat loads for….

vporpo added a reverting change: rG7052a0ad689b: Revert "[SLP][AArch64] Implement lookahead operand reordering score of splat…. Apr 22 2022, 8:24 AM

dmgreen mentioned this in D145578: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation. Mar 10 2023, 2:41 AM

Revision Contents

Path                                                        Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h        17 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp      11 lines
llvm/test/Transforms/SLPVectorizer/AArch64/splat-loads.ll   96 lines

Diff 424479

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

[272 lines hidden; enclosing context: public:]

   bool isLegalMaskedGather(Type *DataType, Align Alignment) const {
     return isLegalMaskedGatherScatter(DataType);
   }
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const {
     return isLegalMaskedGatherScatter(DataType);
   }

+  bool isLegalBroadcastLoad(Type *ElementTy, ElementCount NumElements) const {
+    // Return true if we can generate a `ld1r` splat load instruction.
+    if (!ST->hasNEON() || NumElements.isScalable())
+      return false;
+    switch (unsigned ElementBits = ElementTy->getScalarSizeInBits()) {
+    case 8:
+    case 16:
+    case 32:
+    case 64: {
+      // We accept bit-widths >= 64bits and elements {8,16,32,64} bits.
+      unsigned VectorBits = NumElements.getFixedValue() * ElementBits;
+      return VectorBits >= 64;
+    }
+    }
+    return false;
+  }
+
   bool isLegalNTStore(Type *DataType, Align Alignment) {
     // NOTE: The logic below is mostly geared towards LV, which calls it with
     //       vectors with 2 elements. We might want to improve that, if other
     //       users show up.
     // Nontemporal vector stores can be directly lowered to STNP, if the vector
     // can be halved so that each half fits into a register. That's the case if
     // the element type fits into a register and the number of elements is a
     // power of 2 > 1.
[52 lines hidden]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[2,594 lines hidden; enclosing context: InstructionCost AArch64TTIImpl::getShuffleCost(TTI::ShuffleKind Kind,]
                                                ArrayRef<int> Mask, int Index,
                                                VectorType *SubTp,
                                                ArrayRef<Value *> Args) {
   Kind = improveShuffleKindFromMask(Kind, Mask);
   std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);
   if (Kind == TTI::SK_Broadcast || Kind == TTI::SK_Transpose ||
       Kind == TTI::SK_Select || Kind == TTI::SK_PermuteSingleSrc ||
       Kind == TTI::SK_Reverse) {
+
+    // Check for broadcast loads.
+    if (Kind == TTI::SK_Broadcast) {
+      bool IsLoad = !Args.empty() && llvm::all_of(Args, [](const Value *V) {
+        return isa<LoadInst>(V);
+      });
+      if (IsLoad && isLegalBroadcastLoad(Tp->getElementType(),
+                                         LT.second.getVectorElementCount()))
+        return 0; // broadcast is handled by ld1r
+    }
+
     static const CostTblEntry ShuffleTbl[] = {
       // Broadcast shuffle kinds can be performed with 'dup'.
       { TTI::SK_Broadcast, MVT::v8i8, 1 },
       { TTI::SK_Broadcast, MVT::v16i8, 1 },
       { TTI::SK_Broadcast, MVT::v4i16, 1 },
       { TTI::SK_Broadcast, MVT::v8i16, 1 },
       { TTI::SK_Broadcast, MVT::v2i32, 1 },
       { TTI::SK_Broadcast, MVT::v4i32, 1 },
[76 lines hidden; enclosing context: static const CostTblEntry ShuffleTbl[] = {]
       { TTI::SK_Reverse, MVT::nxv4f32, 1 },
       { TTI::SK_Reverse, MVT::nxv2f64, 1 },
       { TTI::SK_Reverse, MVT::nxv16i1, 1 },
       { TTI::SK_Reverse, MVT::nxv8i1, 1 },
       { TTI::SK_Reverse, MVT::nxv4i1, 1 },
       { TTI::SK_Reverse, MVT::nxv2i1, 1 },
     };
     if (const auto *Entry = CostTableLookup(ShuffleTbl, Kind, LT.second))
       return LT.first * Entry->Cost;
   }

   if (Kind == TTI::SK_Splice && isa<ScalableVectorType>(Tp))
     return getSpliceCost(Tp, Index);

   // Inserting a subvector can often be done with either a D, S or H register
   // move, so long as the inserted vector is "aligned".
   if (Kind == TTI::SK_InsertSubvector && LT.second.isFixedLengthVector() &&
       LT.second.getSizeInBits() <= 128 && SubTp) {
     std::pair<InstructionCost, MVT> SubLT =
[11 lines hidden]

llvm/test/Transforms/SLPVectorizer/AArch64/splat-loads.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S | FileCheck %s

	target triple = "aarch64--linux-gnu"			target triple = "aarch64--linux-gnu"

	; This checks that we prefer splats rather than load vectors + shuffles.			; This checks that we prefer splats rather than load vectors + shuffles.
	; A load + broadcast can be done efficiently with a single `ld1r` instruction.			; A load + broadcast can be done efficiently with a single `ld1r` instruction.
	define void @splat_loads_double(double* %array1, double* %array2, double* %ptrA, double* %ptrB) {			define void @splat_loads_double(double* %array1, double* %array2, double* %ptrA, double* %ptrB) {
	; CHECK-LABEL: @splat_loads_double(			; CHECK-LABEL: @splat_loads_double(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds double, double* [[ARRAY1:%.*]], i64 0			; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds double, double* [[ARRAY1:%.*]], i64 0
	; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds double, double* [[ARRAY2:%.*]], i64 0			; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds double, double* [[ARRAY2:%.*]], i64 0
				; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds double, double* [[ARRAY2]], i64 1
				; CHECK-NEXT: [[LD_2_0:%.*]] = load double, double* [[GEP_2_0]], align 8
				; CHECK-NEXT: [[LD_2_1:%.*]] = load double, double* [[GEP_2_1]], align 8
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = bitcast double* [[GEP_2_0]] to <2 x double>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[LD_2_0]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = load <2 x double>, <2 x double>* [[TMP2]], align 8			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[LD_2_0]], i32 1
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP3]], <2 x double> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP1]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[LD_2_1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[LD_2_1]], i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[TMP5]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x double> [[TMP1]], [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x double> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = fadd <2 x double> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP6]], double [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*
	; CHECK-NEXT: [[TMP9:%.*]] = fmul <2 x double> [[TMP1]], [[TMP8]]			; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP9]], align 8
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <2 x double> [[TMP4]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[GEP_1_0]] to <2 x double>*
	; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%gep_1_0 = getelementptr inbounds double, double* %array1, i64 0			%gep_1_0 = getelementptr inbounds double, double* %array1, i64 0
	%gep_1_1 = getelementptr inbounds double, double* %array1, i64 1			%gep_1_1 = getelementptr inbounds double, double* %array1, i64 1
	%ld_1_0 = load double, double* %gep_1_0, align 8			%ld_1_0 = load double, double* %gep_1_0, align 8
	%ld_1_1 = load double, double* %gep_1_1, align 8			%ld_1_1 = load double, double* %gep_1_1, align 8

	Show All 17 Lines
	}			}

	; Same but with float instead of double			; Same but with float instead of double
	define void @splat_loads_float(float* %array1, float* %array2, float* %ptrA, float* %ptrB) {			define void @splat_loads_float(float* %array1, float* %array2, float* %ptrA, float* %ptrB) {
	; CHECK-LABEL: @splat_loads_float(			; CHECK-LABEL: @splat_loads_float(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds float, float* [[ARRAY1:%.*]], i64 0			; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds float, float* [[ARRAY1:%.*]], i64 0
	; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds float, float* [[ARRAY2:%.*]], i64 0			; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds float, float* [[ARRAY2:%.*]], i64 0
				; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds float, float* [[ARRAY2]], i64 1
				; CHECK-NEXT: [[LD_2_0:%.*]] = load float, float* [[GEP_2_0]], align 8
				; CHECK-NEXT: [[LD_2_1:%.*]] = load float, float* [[GEP_2_1]], align 8
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast float* [[GEP_1_0]] to <2 x float>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast float* [[GEP_1_0]] to <2 x float>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, <2 x float>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, <2 x float>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[GEP_2_0]] to <2 x float>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[LD_2_0]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* [[TMP2]], align 8			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[LD_2_0]], i32 1
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP1]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x float> poison, float [[LD_2_1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x float> [[TMP5]], float [[LD_2_1]], i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x float> poison, float [[TMP5]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x float> [[TMP1]], [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x float> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = fadd <2 x float> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x float> [[TMP6]], float [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[GEP_1_0]] to <2 x float>*
	; CHECK-NEXT: [[TMP9:%.*]] = fmul <2 x float> [[TMP1]], [[TMP8]]			; CHECK-NEXT: store <2 x float> [[TMP8]], <2 x float>* [[TMP9]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <2 x float> [[TMP4]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast float* [[GEP_1_0]] to <2 x float>*
	; CHECK-NEXT: store <2 x float> [[TMP10]], <2 x float>* [[TMP11]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%gep_1_0 = getelementptr inbounds float, float* %array1, i64 0			%gep_1_0 = getelementptr inbounds float, float* %array1, i64 0
	%gep_1_1 = getelementptr inbounds float, float* %array1, i64 1			%gep_1_1 = getelementptr inbounds float, float* %array1, i64 1
	%ld_1_0 = load float, float* %gep_1_0, align 8			%ld_1_0 = load float, float* %gep_1_0, align 8
	%ld_1_1 = load float, float* %gep_1_1, align 8			%ld_1_1 = load float, float* %gep_1_1, align 8

	Show All 17 Lines
	}			}

	; Same but with i64			; Same but with i64
	define void @splat_loads_i64(i64* %array1, i64* %array2, i64* %ptrA, i64* %ptrB) {			define void @splat_loads_i64(i64* %array1, i64* %array2, i64* %ptrA, i64* %ptrB) {
	; CHECK-LABEL: @splat_loads_i64(			; CHECK-LABEL: @splat_loads_i64(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds i64, i64* [[ARRAY1:%.*]], i64 0			; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds i64, i64* [[ARRAY1:%.*]], i64 0
	; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds i64, i64* [[ARRAY2:%.*]], i64 0			; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds i64, i64* [[ARRAY2:%.*]], i64 0
				; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds i64, i64* [[ARRAY2]], i64 1
				; CHECK-NEXT: [[LD_2_0:%.*]] = load i64, i64* [[GEP_2_0]], align 8
				; CHECK-NEXT: [[LD_2_1:%.*]] = load i64, i64* [[GEP_2_1]], align 8
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[GEP_1_0]] to <2 x i64>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[GEP_1_0]] to <2 x i64>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = bitcast i64* [[GEP_2_0]] to <2 x i64>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i64> poison, i64 [[LD_2_0]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 8			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i64> [[TMP2]], i64 [[LD_2_0]], i32 1
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[TMP4:%.*]] = or <2 x i64> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = or <2 x i64> [[TMP1]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i64> poison, i64 [[LD_2_1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i64> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i64> [[TMP5]], i64 [[LD_2_1]], i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i64> poison, i64 [[TMP5]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = or <2 x i64> [[TMP1]], [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i64> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i64> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i64> [[TMP6]], i64 [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = bitcast i64* [[GEP_1_0]] to <2 x i64>*
	; CHECK-NEXT: [[TMP9:%.*]] = or <2 x i64> [[TMP1]], [[TMP8]]			; CHECK-NEXT: store <2 x i64> [[TMP8]], <2 x i64>* [[TMP9]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i64> [[TMP4]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast i64* [[GEP_1_0]] to <2 x i64>*
	; CHECK-NEXT: store <2 x i64> [[TMP10]], <2 x i64>* [[TMP11]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%gep_1_0 = getelementptr inbounds i64, i64* %array1, i64 0			%gep_1_0 = getelementptr inbounds i64, i64* %array1, i64 0
	%gep_1_1 = getelementptr inbounds i64, i64* %array1, i64 1			%gep_1_1 = getelementptr inbounds i64, i64* %array1, i64 1
	%ld_1_0 = load i64, i64* %gep_1_0, align 8			%ld_1_0 = load i64, i64* %gep_1_0, align 8
	%ld_1_1 = load i64, i64* %gep_1_1, align 8			%ld_1_1 = load i64, i64* %gep_1_1, align 8

	Show All 17 Lines
	}			}

	; Same but with i32			; Same but with i32
	define void @splat_loads_i32(i32* %array1, i32* %array2, i32* %ptrA, i32* %ptrB) {			define void @splat_loads_i32(i32* %array1, i32* %array2, i32* %ptrA, i32* %ptrB) {
	; CHECK-LABEL: @splat_loads_i32(			; CHECK-LABEL: @splat_loads_i32(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds i32, i32* [[ARRAY1:%.*]], i64 0			; CHECK-NEXT: [[GEP_1_0:%.*]] = getelementptr inbounds i32, i32* [[ARRAY1:%.*]], i64 0
	; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds i32, i32* [[ARRAY2:%.*]], i64 0			; CHECK-NEXT: [[GEP_2_0:%.*]] = getelementptr inbounds i32, i32* [[ARRAY2:%.*]], i64 0
				; CHECK-NEXT: [[GEP_2_1:%.*]] = getelementptr inbounds i32, i32* [[ARRAY2]], i64 1
				; CHECK-NEXT: [[LD_2_0:%.*]] = load i32, i32* [[GEP_2_0]], align 8
				; CHECK-NEXT: [[LD_2_1:%.*]] = load i32, i32* [[GEP_2_1]], align 8
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[GEP_1_0]] to <2 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[GEP_1_0]] to <2 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[GEP_2_0]] to <2 x i32>*			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i32> poison, i32 [[LD_2_0]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 8			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x i32> [[TMP2]], i32 [[LD_2_0]], i32 1
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> <i32 1, i32 0>			; CHECK-NEXT: [[TMP4:%.*]] = or <2 x i32> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP4:%.*]] = or <2 x i32> [[TMP1]], [[SHUFFLE]]			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[LD_2_1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 [[LD_2_1]], i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> poison, i32 [[TMP5]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = or <2 x i32> [[TMP1]], [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = add <2 x i32> [[TMP4]], [[TMP7]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP6]], i32 [[TMP7]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = bitcast i32* [[GEP_1_0]] to <2 x i32>*
	; CHECK-NEXT: [[TMP9:%.*]] = or <2 x i32> [[TMP1]], [[TMP8]]			; CHECK-NEXT: store <2 x i32> [[TMP8]], <2 x i32>* [[TMP9]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP4]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32* [[GEP_1_0]] to <2 x i32>*
	; CHECK-NEXT: store <2 x i32> [[TMP10]], <2 x i32>* [[TMP11]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%gep_1_0 = getelementptr inbounds i32, i32* %array1, i64 0			%gep_1_0 = getelementptr inbounds i32, i32* %array1, i64 0
	%gep_1_1 = getelementptr inbounds i32, i32* %array1, i64 1			%gep_1_1 = getelementptr inbounds i32, i32* %array1, i64 1
	%ld_1_0 = load i32, i32* %gep_1_0, align 8			%ld_1_0 = load i32, i32* %gep_1_0, align 8
	%ld_1_1 = load i32, i32* %gep_1_1, align 8			%ld_1_1 = load i32, i32* %gep_1_1, align 8

	Show All 18 Lines