This is an archive of the discontinued LLVM Phabricator instance.

Differential D145578

[AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation
Closed, Public

Authored by SjoerdMeijer on Mar 8 2023, 4:55 AM.

Details

Reviewers

dmgreen
fhahn
vporpo

Commits

rG775451b66a4c: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation

Summary

This slightly increases the costs of InsertElement instructions that are part of a vector splat sequence, i.e. a load, InsertElement and a shuffle. The resulting LD1R is a high latency instruction, and this slight increase in costs avoids SLP vectorisation for a couple of cases where this isn't profitable.
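
To illustrate why such a small cost increase can matter, here is a minimal sketch, assuming a simplified profitability rule in which vectorisation happens only when the vector form is strictly cheaper than the scalar form. The helper name and all numbers are hypothetical; they are not taken from this patch or from the SLP vectoriser's real cost computation.

```cpp
// Toy model, not LLVM code: the helper name and every number below are
// made up for illustration.

// Returns true when the vector form is strictly cheaper than the scalar form.
constexpr bool isProfitableToVectorise(int ScalarCost, int VectorCost) {
  return VectorCost < ScalarCost;
}

// Hypothetical cost of a small group of scalar instructions.
constexpr int ScalarCost = 6;
// Hypothetical cost of the vector form when the splat shuffle is costed as 0.
constexpr int VectorCostFreeSplat = 5;
// The same vector form once the splat shuffle is costed as 1.
constexpr int VectorCostPaidSplat = VectorCostFreeSplat + 1;

static_assert(isProfitableToVectorise(ScalarCost, VectorCostFreeSplat),
              "with a free splat the vector form wins by one unit");
static_assert(!isProfitableToVectorise(ScalarCost, VectorCostPaidSplat),
              "with a cost-1 splat the vector form no longer wins");
```

In borderline cases like this one, a single unit of cost decides the outcome, which is consistent with the change only affecting a few small regression tests.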

SPEC 2017 FP and INT performance results with this change are completely neutral, so it only seems to affect cases like the changed regression tests.

Fixes: https://github.com/llvm/llvm-project/issues/61047

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Mar 8 2023, 4:55 AM

Herald added a project: Restricted Project. Mar 8 2023, 4:55 AM

Herald added subscribers: vporpo, StephenFan, hiraditya, kristof.beyls.

SjoerdMeijer requested review of this revision. Mar 8 2023, 4:55 AM

Herald added a subscriber: pcwang-thead.

SjoerdMeijer added inline comments. Mar 8 2023, 5:00 AM

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
86	Ah, only after uploading this diff I noticed that the function names indicate that this should be profitable... I had missed that. Hmmm.... I guess that then needs looking into.

SjoerdMeijer added inline comments. Mar 8 2023, 5:02 AM

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
86	Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.

ABataev added a subscriber: ABataev. Mar 8 2023, 5:35 AM

ABataev added inline comments.

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
86	Working on fma vectorization support in SLP, hope to spend more time on this later this month.

Harbormaster completed remote builds in B218071: Diff 503327. Mar 8 2023, 6:22 AM

SjoerdMeijer added inline comments. Mar 8 2023, 7:16 AM

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
86	It's difficult to see how the SLP variant is ever going to be faster for this example with just a handful of scalar instructions (ignoring the loads/stores) that mostly overlap in dispatch and execution:

[0,3] D======eeeER .. fmul s4, s1, s0
[0,4] D======eeeER .. fmul s1, s2, s1
[0,5] D=======eeeER .. fmul s5, s3, s2
[0,6] D=======eeeER .. fmul s0, s3, s0
[0,7] D==========eeER.. fsub s2, s4, s5
[0,8] D==========eeER.. fadd s0, s0, s1

especially if we have to do things like shuffles:

[0,2] D==eeeeeeeeER . . . .. ld1r.2s { v1 }, [x8], #4
[0,3] D==========eeeeeeER . . .. ldr d2, [x8]
[0,4] D================eeeER . .. fmul.2s v1, v1, v2
[0,5] D================eeeER . .. fmul.2s v0, v2, v0[0]
[0,6] D===================eeER . .. rev64.2s v1, v1
[0,7] D=====================eeER .. fsub.2s v3, v0, v1
[0,8] .D====================eeER .. fadd.2s v0, v0, v1
[0,9] .D======================eeER .. mov.s v3[1], v0[1]
[0,10] .D========================eeER.. str d3, [x0]
[0,11] .D========================eeeeER st1.s { v2 }[1], [x1]

Here I am showing some loads/stores, but that's just to show they are not simple loads/stores anymore but higher-latency instructions, and perhaps more importantly we have got the REV and extract, so with FMAs things might look a bit better but it's difficult to beat the scalar variant. The SLP timeline is a little bit skewed because of the post-inc and the result being available a lot earlier, but the bottom line is that there is very little parallelism here as we are working on 2 floats and there's the overhead of the vectorisation.

I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.

@fhahn, @dmgreen: I think it makes sense to also not SLP vectorise this function (and the other 2 below). Do you agree?

Hello. I had to remind myself where this came from. It looks like it was introduced in D123638, and there were some comments already about the performance not always being ideal. It apparently helped for some <2 x double> vectorization. I'm not sure if there is a perfect answer, but an effective cost of 2 for the throughput of the ld1r would seem to match the hardware better. This doesn't alter isLegalBroadcastLoad and the tests added in D123638 don't seem to change.

The cost of a broadcast is already 1 so the code here doesn't seem like it would do much any more. It could be checking the cost kind and returning 0 for code size and 1 for throughput. Otherwise this bit of code could probably be removed.

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
86	I think we will fold the dup into the fmul as opposed to the load now, which seems a little cheaper: https://godbolt.org/z/4aeab8Pas. I'm not sure if that makes it cheaper overall though. I agree that the rev and the mov make the timing tight. From the look of the test it looks like "profitable" here just means that the non-fast version would be slightly more instructions, not that it was known to be profitable, i.e. the test is for testing fma combining, and this was just a negative test for that issue.

I like it when I can delete things and achieve the same, so I have just done that. This was my understanding of your comments. Thanks for the suggestion and for looking into this.

So with this new revision, the functions in the SLP vectorisation tests won't be vectorised, which is what we want.
For the cost-model test, there are 2 changes compared to the previous revision: there are now costs of 21 and 42 for two shuffles with half types. These probably need looking into and fixing in a separate/companion patch.

Harbormaster completed remote builds in B218663: Diff 504113. Mar 10 2023, 7:58 AM

The CostKind can be TCK_RecipThroughput (the default and the one we usually care most about), TCK_Latency, TCK_CodeSize or TCK_SizeAndLatency. I think if we have the code we might as well get TCK_CodeSize correct and return 0 in that case, so the load+dup have a combined cost of 1. TCK_Latency and TCK_SizeAndLatency I'm less sure about, perhaps leave them with the same costs as TCK_RecipThroughput?

So it might be a little better to change the code to this, with a comment explaining that the other costs are expected to be higher even with ld1r:

// Check for broadcast loads.
if (CostKind == TCK_CodeSize && Kind == TTI::SK_Broadcast) {
  bool IsLoad = !Args.empty() && isa<LoadInst>(Args[0]);
  if (IsLoad && LT.second.isVector() &&
      isLegalBroadcastLoad(Tp->getElementType(),
                           LT.second.getVectorElementCount()))
    return 0; // broadcast is handled by ld1r
}

Thanks, I have restored that piece of logic and added the code-size check to it (and added a code size check to the test).
That means that we now add an additional cost for TCK_RecipThroughput, as well as for TCK_Latency and TCK_SizeAndLatency.
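
As a rough, self-contained illustration of the behaviour described here, the following is a toy model rather than the LLVM implementation. The enum and function names are hypothetical stand-ins for TTI::TargetCostKind and the broadcast branch of getShuffleCost, and the base broadcast cost is passed in as a parameter because the real value depends on the type.

```cpp
// Toy model, not LLVM code. CostKind is a hypothetical stand-in for
// TTI::TargetCostKind.
enum class CostKind { RecipThroughput, Latency, CodeSize, SizeAndLatency };

// Cost of a broadcast shuffle in this model:
//  - for code size, a load + dup that can fold into a single LD1R is free;
//  - for every other cost kind, the regular broadcast cost is charged, so
//    the load + dup sequence is modelled as slightly more expensive.
constexpr int broadcastShuffleCost(CostKind Kind, bool OperandIsLoad,
                                   bool LegalBroadcastLoad,
                                   int BaseBroadcastCost) {
  return (Kind == CostKind::CodeSize && OperandIsLoad && LegalBroadcastLoad)
             ? 0
             : BaseBroadcastCost;
}

static_assert(broadcastShuffleCost(CostKind::CodeSize, true, true, 1) == 0,
              "code size: the splat folds into the load");
static_assert(broadcastShuffleCost(CostKind::RecipThroughput, true, true, 1) == 1,
              "throughput: the splat is charged the broadcast cost");
static_assert(broadcastShuffleCost(CostKind::Latency, true, true, 1) == 1,
              "latency: same as throughput in this model");
static_assert(broadcastShuffleCost(CostKind::CodeSize, false, true, 1) == 1,
              "no load operand: nothing to fold, so no discount");
```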

Brilliant, thanks. LGTM.

This revision is now accepted and ready to land. Mar 13 2023, 7:02 AM

This revision was landed with ongoing or failed builds. Mar 13 2023, 8:14 AM

Closed by commit rG775451b66a4c: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG775451b66a4c: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation.

Harbormaster completed remote builds in B219015: Diff 504618. Mar 13 2023, 8:28 AM

Just a heads up: we are seeing a 10% regression caused by this change in a very SLP-sensitive workload (the original source for the slp-fma-loss.ll tests). I still have to double check where the slowdown is coming from exactly.

In D145578#4201717, @fhahn wrote:

Just a heads up: we are seeing a 10% regression caused by this change in a very SLP-sensitive workload (the original source for the slp-fma-loss.ll tests). I still have to double check where the slowdown is coming from exactly.

I thought we were helping your case, not regressing it. :(
But thanks for the heads up, am happy to look at it.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	12 lines
llvm/test/Analysis/CostModel/AArch64/shuffle-load.ll	158 lines
llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll	155 lines

Diff 504668

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[... 3,213 lines not shown; nearest context: for (unsigned N = 0; N < NumVecs; N++) { ...]
       else
         Cost += LTNumElts;
     }
     return Cost;
   }

   Kind = improveShuffleKindFromMask(Kind, Mask);

-  // Check for broadcast loads.
-  if (Kind == TTI::SK_Broadcast) {
+  // Check for broadcast loads, which are supported by the LD1R instruction.
+  // In terms of code-size, the shuffle vector is free when a load + dup get
+  // folded into a LD1R. That's what we check and return here. For performance
+  // and reciprocal throughput, a LD1R is not completely free. In this case, we
+  // return the cost for the broadcast below (i.e. 1 for most/all types), so
+  // that we model the load + dup sequence slightly higher because LD1R is a
+  // high latency instruction.
+  if (CostKind == TTI::TCK_CodeSize && Kind == TTI::SK_Broadcast) {
     bool IsLoad = !Args.empty() && isa<LoadInst>(Args[0]);
     if (IsLoad && LT.second.isVector() &&
         isLegalBroadcastLoad(Tp->getElementType(),
                              LT.second.getVectorElementCount()))
-      return 0; // broadcast is handled by ld1r
+      return 0;
   }

   // If we have 4 elements for the shuffle and a Mask, get the cost straight
   // from the perfect shuffle tables.
   if (Mask.size() == 4 && Tp->getElementCount() == ElementCount::getFixed(4) &&
       (Tp->getScalarSizeInBits() == 16 || Tp->getScalarSizeInBits() == 32) &&
       all_of(Mask, [](int E) { return E < 8; }))
     return getPerfectShuffleCost(Mask);
[... 185 lines not shown ...]

llvm/test/Analysis/CostModel/AArch64/shuffle-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
	; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" 2>&1 -disable-output | FileCheck %s			; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" 2>&1 -disable-output | FileCheck %s
				; RUN: opt < %s -mtriple=aarch64--linux-gnu -passes="print<cost-model>" -cost-kind=code-size 2>&1 -disable-output | FileCheck %s --check-prefix=CODESIZE

	; These tests check the costs of ld1r instructions, through the			; These tests check the costs of ld1r instructions, through the
	; isLegalBroadcastLoad method.			; isLegalBroadcastLoad method.

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"

	; The tests use vector loads and splats, as opposed to scalar loads, inserts			; The tests use vector loads and splats, as opposed to scalar loads, inserts
	; and splats as that is how getShuffleCost currently recognizes them.			; and splats as that is how getShuffleCost currently recognizes them.
	define void @shuffle() {			define void @shuffle() {
	; CHECK-LABEL: 'shuffle'			; CHECK-LABEL: 'shuffle'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %lv2i8 = load <2 x i8>, ptr undef, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %lv2i8 = load <2 x i8>, ptr undef, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i8 = shufflevector <2 x i8> %lv2i8, <2 x i8> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i8 = shufflevector <2 x i8> %lv2i8, <2 x i8> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4i8 = load <4 x i8>, ptr undef, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4i8 = load <4 x i8>, ptr undef, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4i8 = shufflevector <4 x i8> %lv4i8, <4 x i8> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4i8 = shufflevector <4 x i8> %lv4i8, <4 x i8> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i8 = load <8 x i8>, ptr undef, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i8 = load <8 x i8>, ptr undef, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i8 = shufflevector <8 x i8> %lv8i8, <8 x i8> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv8i8 = shufflevector <8 x i8> %lv8i8, <8 x i8> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv16i8 = load <16 x i8>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv16i8 = load <16 x i8>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16i8 = shufflevector <16 x i8> %lv16i8, <16 x i8> undef, <16 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv16i8 = shufflevector <16 x i8> %lv16i8, <16 x i8> undef, <16 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %lv2i16 = load <2 x i16>, ptr undef, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %lv2i16 = load <2 x i16>, ptr undef, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i16 = shufflevector <2 x i16> %lv2i16, <2 x i16> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i16 = shufflevector <2 x i16> %lv2i16, <2 x i16> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i16 = load <4 x i16>, ptr undef, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i16 = load <4 x i16>, ptr undef, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i16 = shufflevector <4 x i16> %lv4i16, <4 x i16> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4i16 = shufflevector <4 x i16> %lv4i16, <4 x i16> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i16 = load <8 x i16>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i16 = load <8 x i16>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i16 = shufflevector <8 x i16> %lv8i16, <8 x i16> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv8i16 = shufflevector <8 x i16> %lv8i16, <8 x i16> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16i16 = load <16 x i16>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16i16 = load <16 x i16>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16i16 = shufflevector <16 x i16> %lv16i16, <16 x i16> undef, <16 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %sv16i16 = shufflevector <16 x i16> %lv16i16, <16 x i16> undef, <16 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i32 = load <2 x i32>, ptr undef, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i32 = load <2 x i32>, ptr undef, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2i32 = shufflevector <2 x i32> %lv2i32, <2 x i32> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i32 = shufflevector <2 x i32> %lv2i32, <2 x i32> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i32 = load <4 x i32>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i32 = load <4 x i32>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i32 = shufflevector <4 x i32> %lv4i32, <4 x i32> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4i32 = shufflevector <4 x i32> %lv4i32, <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8i32 = load <8 x i32>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8i32 = load <8 x i32>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i32 = shufflevector <8 x i32> %lv8i32, <8 x i32> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %sv8i32 = shufflevector <8 x i32> %lv8i32, <8 x i32> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i64 = load <2 x i64>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i64 = load <2 x i64>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2i64 = shufflevector <2 x i64> %lv2i64, <2 x i64> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i64 = shufflevector <2 x i64> %lv2i64, <2 x i64> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4i64 = load <4 x i64>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4i64 = load <4 x i64>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i64 = shufflevector <4 x i64> %lv4i64, <4 x i64> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %sv4i64 = shufflevector <4 x i64> %lv4i64, <4 x i64> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f16 = load <2 x half>, ptr undef, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f16 = load <2 x half>, ptr undef, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f16 = shufflevector <2 x half> %lv2f16, <2 x half> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %sv2f16 = shufflevector <2 x half> %lv2f16, <2 x half> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f16 = load <4 x half>, ptr undef, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f16 = load <4 x half>, ptr undef, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f16 = shufflevector <4 x half> %lv4f16, <4 x half> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4f16 = shufflevector <4 x half> %lv4f16, <4 x half> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8f16 = load <8 x half>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8f16 = load <8 x half>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8f16 = shufflevector <8 x half> %lv8f16, <8 x half> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %sv8f16 = shufflevector <8 x half> %lv8f16, <8 x half> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16f16 = load <16 x half>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16f16 = load <16 x half>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16f16 = shufflevector <16 x half> %lv16f16, <16 x half> undef, <16 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %sv16f16 = shufflevector <16 x half> %lv16f16, <16 x half> undef, <16 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f32 = load <2 x float>, ptr undef, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f32 = load <2 x float>, ptr undef, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f32 = shufflevector <2 x float> %lv2f32, <2 x float> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2f32 = shufflevector <2 x float> %lv2f32, <2 x float> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f32 = load <4 x float>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f32 = load <4 x float>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f32 = shufflevector <4 x float> %lv4f32, <4 x float> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4f32 = shufflevector <4 x float> %lv4f32, <4 x float> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8f32 = load <8 x float>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8f32 = load <8 x float>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8f32 = shufflevector <8 x float> %lv8f32, <8 x float> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %sv8f32 = shufflevector <8 x float> %lv8f32, <8 x float> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f64 = load <2 x double>, ptr undef, align 16			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f64 = load <2 x double>, ptr undef, align 16
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f64 = shufflevector <2 x double> %lv2f64, <2 x double> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2f64 = shufflevector <2 x double> %lv2f64, <2 x double> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4f64 = load <4 x double>, ptr undef, align 32			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4f64 = load <4 x double>, ptr undef, align 32
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f64 = shufflevector <4 x double> %lv4f64, <4 x double> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %sv4f64 = shufflevector <4 x double> %lv4f64, <4 x double> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
				; CODESIZE-LABEL: 'shuffle'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i8 = load <2 x i8>, ptr undef, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i8 = shufflevector <2 x i8> %lv2i8, <2 x i8> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i8 = load <4 x i8>, ptr undef, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv4i8 = shufflevector <4 x i8> %lv4i8, <4 x i8> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i8 = load <8 x i8>, ptr undef, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i8 = shufflevector <8 x i8> %lv8i8, <8 x i8> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv16i8 = load <16 x i8>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16i8 = shufflevector <16 x i8> %lv16i8, <16 x i8> undef, <16 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i16 = load <2 x i16>, ptr undef, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %sv2i16 = shufflevector <2 x i16> %lv2i16, <2 x i16> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i16 = load <4 x i16>, ptr undef, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i16 = shufflevector <4 x i16> %lv4i16, <4 x i16> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8i16 = load <8 x i16>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i16 = shufflevector <8 x i16> %lv8i16, <8 x i16> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16i16 = load <16 x i16>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16i16 = shufflevector <16 x i16> %lv16i16, <16 x i16> undef, <16 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i32 = load <2 x i32>, ptr undef, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2i32 = shufflevector <2 x i32> %lv2i32, <2 x i32> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4i32 = load <4 x i32>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i32 = shufflevector <4 x i32> %lv4i32, <4 x i32> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8i32 = load <8 x i32>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8i32 = shufflevector <8 x i32> %lv8i32, <8 x i32> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2i64 = load <2 x i64>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2i64 = shufflevector <2 x i64> %lv2i64, <2 x i64> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4i64 = load <4 x i64>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4i64 = shufflevector <4 x i64> %lv4i64, <4 x i64> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f16 = load <2 x half>, ptr undef, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f16 = shufflevector <2 x half> %lv2f16, <2 x half> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f16 = load <4 x half>, ptr undef, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f16 = shufflevector <4 x half> %lv4f16, <4 x half> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv8f16 = load <8 x half>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8f16 = shufflevector <8 x half> %lv8f16, <8 x half> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv16f16 = load <16 x half>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv16f16 = shufflevector <16 x half> %lv16f16, <16 x half> undef, <16 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f32 = load <2 x float>, ptr undef, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f32 = shufflevector <2 x float> %lv2f32, <2 x float> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv4f32 = load <4 x float>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f32 = shufflevector <4 x float> %lv4f32, <4 x float> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv8f32 = load <8 x float>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv8f32 = shufflevector <8 x float> %lv8f32, <8 x float> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lv2f64 = load <2 x double>, ptr undef, align 16
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv2f64 = shufflevector <2 x double> %lv2f64, <2 x double> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %lv4f64 = load <4 x double>, ptr undef, align 32
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %sv4f64 = shufflevector <4 x double> %lv4f64, <4 x double> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret void
				;
	%lv2i8 = load <2 x i8>, ptr undef			%lv2i8 = load <2 x i8>, ptr undef
	%sv2i8 = shufflevector <2 x i8> %lv2i8, <2 x i8> undef, <2 x i32> zeroinitializer			%sv2i8 = shufflevector <2 x i8> %lv2i8, <2 x i8> undef, <2 x i32> zeroinitializer
	%lv4i8 = load <4 x i8>, ptr undef			%lv4i8 = load <4 x i8>, ptr undef
	%sv4i8 = shufflevector <4 x i8> %lv4i8, <4 x i8> undef, <4 x i32> zeroinitializer			%sv4i8 = shufflevector <4 x i8> %lv4i8, <4 x i8> undef, <4 x i32> zeroinitializer
	%lv8i8 = load <8 x i8>, ptr undef			%lv8i8 = load <8 x i8>, ptr undef
	%sv8i8 = shufflevector <8 x i8> %lv8i8, <8 x i8> undef, <8 x i32> zeroinitializer			%sv8i8 = shufflevector <8 x i8> %lv8i8, <8 x i8> undef, <8 x i32> zeroinitializer
	%lv16i8 = load <16 x i8>, ptr undef			%lv16i8 = load <16 x i8>, ptr undef
	%sv16i8 = shufflevector <16 x i8> %lv16i8, <16 x i8> undef, <16 x i32> zeroinitializer			%sv16i8 = shufflevector <16 x i8> %lv16i8, <16 x i8> undef, <16 x i32> zeroinitializer
	; [47 lines not shown]			; [47 lines not shown]

	define <4 x half> @ld1r_4h_float_shuff(ptr nocapture %x) {			define <4 x half> @ld1r_4h_float_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_4h_float_shuff'			; CHECK-LABEL: 'ld1r_4h_float_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x half> undef, half %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x half> undef, half %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x half> %tmp1, <4 x half> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x half> %tmp1, <4 x half> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x half> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x half> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_4h_float_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x half> undef, half %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x half> %tmp1, <4 x half> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <4 x half> %lane
				;
	entry:			entry:
	%tmp = load half, ptr %x, align 2			%tmp = load half, ptr %x, align 2
	%tmp1 = insertelement <4 x half> undef, half %tmp, i32 0			%tmp1 = insertelement <4 x half> undef, half %tmp, i32 0
	%lane = shufflevector <4 x half> %tmp1, <4 x half> undef, <4 x i32> zeroinitializer			%lane = shufflevector <4 x half> %tmp1, <4 x half> undef, <4 x i32> zeroinitializer
	ret <4 x half> %lane			ret <4 x half> %lane
	}			}

	define <8 x half> @ld1r_8h_float_shuff(ptr nocapture %x) {			define <8 x half> @ld1r_8h_float_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_8h_float_shuff'			; CHECK-LABEL: 'ld1r_8h_float_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <8 x half> undef, half %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <8 x half> undef, half %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %lane = shufflevector <8 x half> %tmp1, <8 x half> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %lane = shufflevector <8 x half> %tmp1, <8 x half> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x half> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x half> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_8h_float_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load half, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <8 x half> undef, half %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %lane = shufflevector <8 x half> %tmp1, <8 x half> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <8 x half> %lane
				;
	entry:			entry:
	%tmp = load half, ptr %x, align 2			%tmp = load half, ptr %x, align 2
	%tmp1 = insertelement <8 x half> undef, half %tmp, i32 0			%tmp1 = insertelement <8 x half> undef, half %tmp, i32 0
	%lane = shufflevector <8 x half> %tmp1, <8 x half> undef, <8 x i32> zeroinitializer			%lane = shufflevector <8 x half> %tmp1, <8 x half> undef, <8 x i32> zeroinitializer
	ret <8 x half> %lane			ret <8 x half> %lane
	}			}

	define <2 x float> @ld1r_2s_float_shuff(ptr nocapture %x) {			define <2 x float> @ld1r_2s_float_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_2s_float_shuff'			; CHECK-LABEL: 'ld1r_2s_float_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x float> undef, float %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x float> undef, float %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x float> %tmp1, <2 x float> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x float> %tmp1, <2 x float> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x float> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x float> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_2s_float_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x float> undef, float %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x float> %tmp1, <2 x float> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <2 x float> %lane
				;
	entry:			entry:
	%tmp = load float, ptr %x, align 4			%tmp = load float, ptr %x, align 4
	%tmp1 = insertelement <2 x float> undef, float %tmp, i32 0			%tmp1 = insertelement <2 x float> undef, float %tmp, i32 0
	%lane = shufflevector <2 x float> %tmp1, <2 x float> undef, <2 x i32> zeroinitializer			%lane = shufflevector <2 x float> %tmp1, <2 x float> undef, <2 x i32> zeroinitializer
	ret <2 x float> %lane			ret <2 x float> %lane
	}			}

	define <4 x float> @ld1r_4s_float_shuff(ptr nocapture %x) {			define <4 x float> @ld1r_4s_float_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_4s_float_shuff'			; CHECK-LABEL: 'ld1r_4s_float_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x float> undef, float %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x float> undef, float %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x float> %tmp1, <4 x float> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x float> %tmp1, <4 x float> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x float> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x float> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_4s_float_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load float, ptr %x, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <4 x float> undef, float %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x float> %tmp1, <4 x float> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <4 x float> %lane
				;
	entry:			entry:
	%tmp = load float, ptr %x, align 4			%tmp = load float, ptr %x, align 4
	%tmp1 = insertelement <4 x float> undef, float %tmp, i32 0			%tmp1 = insertelement <4 x float> undef, float %tmp, i32 0
	%lane = shufflevector <4 x float> %tmp1, <4 x float> undef, <4 x i32> zeroinitializer			%lane = shufflevector <4 x float> %tmp1, <4 x float> undef, <4 x i32> zeroinitializer
	ret <4 x float> %lane			ret <4 x float> %lane
	}			}

	define <2 x double> @ld1r_2d_double_shuff(ptr nocapture %x) {			define <2 x double> @ld1r_2d_double_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_2d_double_shuff'			; CHECK-LABEL: 'ld1r_2d_double_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load double, ptr %x, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load double, ptr %x, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x double> undef, double %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x double> undef, double %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x double> %tmp1, <2 x double> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x double> %tmp1, <2 x double> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x double> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x double> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_2d_double_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load double, ptr %x, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = insertelement <2 x double> undef, double %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x double> %tmp1, <2 x double> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <2 x double> %lane
				;
	entry:			entry:
	%tmp = load double, ptr %x, align 4			%tmp = load double, ptr %x, align 4
	%tmp1 = insertelement <2 x double> undef, double %tmp, i32 0			%tmp1 = insertelement <2 x double> undef, double %tmp, i32 0
	%lane = shufflevector <2 x double> %tmp1, <2 x double> undef, <2 x i32> zeroinitializer			%lane = shufflevector <2 x double> %tmp1, <2 x double> undef, <2 x i32> zeroinitializer
	ret <2 x double> %lane			ret <2 x double> %lane
	}			}

	; Check ld1r generated from scalar integer loads			; Check ld1r generated from scalar integer loads

	define <8 x i8> @ld1r_8b_int_shuff(ptr nocapture %x) {			define <8 x i8> @ld1r_8b_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_8b_int_shuff'			; CHECK-LABEL: 'ld1r_8b_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i8> undef, i8 %tmp, i8 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i8> undef, i8 %tmp, i8 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_8b_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i8> undef, i8 %tmp, i8 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <8 x i8> %lane
				;
	entry:			entry:
	%tmp = load i8, ptr %x, align 2			%tmp = load i8, ptr %x, align 2
	%tmp1 = insertelement <8 x i8> undef, i8 %tmp, i8 0			%tmp1 = insertelement <8 x i8> undef, i8 %tmp, i8 0
	%lane = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> zeroinitializer			%lane = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> zeroinitializer
	ret <8 x i8> %lane			ret <8 x i8> %lane
	}			}

	define <16 x i8> @ld1r_16b_int_shuff(ptr nocapture %x) {			define <16 x i8> @ld1r_16b_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_16b_int_shuff'			; CHECK-LABEL: 'ld1r_16b_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <16 x i8> undef, i8 %tmp, i8 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <16 x i8> undef, i8 %tmp, i8 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <16 x i8> %tmp1, <16 x i8> undef, <16 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <16 x i8> %tmp1, <16 x i8> undef, <16 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i8> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i8> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_16b_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <16 x i8> undef, i8 %tmp, i8 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <16 x i8> %tmp1, <16 x i8> undef, <16 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <16 x i8> %lane
				;
	entry:			entry:
	%tmp = load i8, ptr %x, align 2			%tmp = load i8, ptr %x, align 2
	%tmp1 = insertelement <16 x i8> undef, i8 %tmp, i8 0			%tmp1 = insertelement <16 x i8> undef, i8 %tmp, i8 0
	%lane = shufflevector <16 x i8> %tmp1, <16 x i8> undef, <16 x i32> zeroinitializer			%lane = shufflevector <16 x i8> %tmp1, <16 x i8> undef, <16 x i32> zeroinitializer
	ret <16 x i8> %lane			ret <16 x i8> %lane
	}			}

	define <4 x i16> @ld1r_4h_int_shuff(ptr nocapture %x) {			define <4 x i16> @ld1r_4h_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_4h_int_shuff'			; CHECK-LABEL: 'ld1r_4h_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i16> undef, i16 %tmp, i16 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i16> undef, i16 %tmp, i16 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i16> %tmp1, <4 x i16> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i16> %tmp1, <4 x i16> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_4h_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i16> undef, i16 %tmp, i16 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i16> %tmp1, <4 x i16> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <4 x i16> %lane
				;
	entry:			entry:
	%tmp = load i16, ptr %x, align 2			%tmp = load i16, ptr %x, align 2
	%tmp1 = insertelement <4 x i16> undef, i16 %tmp, i16 0			%tmp1 = insertelement <4 x i16> undef, i16 %tmp, i16 0
	%lane = shufflevector <4 x i16> %tmp1, <4 x i16> undef, <4 x i32> zeroinitializer			%lane = shufflevector <4 x i16> %tmp1, <4 x i16> undef, <4 x i32> zeroinitializer
	ret <4 x i16> %lane			ret <4 x i16> %lane
	}			}

	define <8 x i16> @ld1r_8h_int_shuff(ptr nocapture %x) {			define <8 x i16> @ld1r_8h_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_8h_int_shuff'			; CHECK-LABEL: 'ld1r_8h_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i16> undef, i16 %tmp, i16 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i16> undef, i16 %tmp, i16 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <8 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <8 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i16> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i16> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_8h_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i16, ptr %x, align 2
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <8 x i16> undef, i16 %tmp, i16 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <8 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <8 x i16> %lane
				;
	entry:			entry:
	%tmp = load i16, ptr %x, align 2			%tmp = load i16, ptr %x, align 2
	%tmp1 = insertelement <8 x i16> undef, i16 %tmp, i16 0			%tmp1 = insertelement <8 x i16> undef, i16 %tmp, i16 0
	%lane = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <8 x i32> zeroinitializer			%lane = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <8 x i32> zeroinitializer
	ret <8 x i16> %lane			ret <8 x i16> %lane
	}			}

	define <2 x i32> @ld1r_2s_int_shuff(ptr nocapture %x) {			define <2 x i32> @ld1r_2s_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_2s_int_shuff'			; CHECK-LABEL: 'ld1r_2s_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i32> undef, i32 %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i32> undef, i32 %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_2s_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i32> undef, i32 %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i32> %lane
				;
	entry:			entry:
	%tmp = load i32, ptr %x, align 4			%tmp = load i32, ptr %x, align 4
	%tmp1 = insertelement <2 x i32> undef, i32 %tmp, i32 0			%tmp1 = insertelement <2 x i32> undef, i32 %tmp, i32 0
	%lane = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <2 x i32> zeroinitializer			%lane = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <2 x i32> zeroinitializer
	ret <2 x i32> %lane			ret <2 x i32> %lane
	}			}

	define <4 x i32> @ld1r_4s_int_shuff(ptr nocapture %x) {			define <4 x i32> @ld1r_4s_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_4s_int_shuff'			; CHECK-LABEL: 'ld1r_4s_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i32> undef, i32 %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i32> undef, i32 %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <4 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <4 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_4s_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i32, ptr %x, align 4
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <4 x i32> undef, i32 %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <4 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <4 x i32> %lane
				;
	entry:			entry:
	%tmp = load i32, ptr %x, align 4			%tmp = load i32, ptr %x, align 4
	%tmp1 = insertelement <4 x i32> undef, i32 %tmp, i32 0			%tmp1 = insertelement <4 x i32> undef, i32 %tmp, i32 0
	%lane = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <4 x i32> zeroinitializer			%lane = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <4 x i32> zeroinitializer
	ret <4 x i32> %lane			ret <4 x i32> %lane
	}			}

	define <2 x i64> @ld1r_2d_int_shuff(ptr nocapture %x) {			define <2 x i64> @ld1r_2d_int_shuff(ptr nocapture %x) {
	; CHECK-LABEL: 'ld1r_2d_int_shuff'			; CHECK-LABEL: 'ld1r_2d_int_shuff'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i64, ptr %x, align 8			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i64, ptr %x, align 8
	; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i64> undef, i64 %tmp, i32 0			; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i64> undef, i64 %tmp, i32 0
	; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i64> %tmp1, <2 x i64> undef, <2 x i32> zeroinitializer			; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i64> %tmp1, <2 x i64> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %lane			; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %lane
	;			;
				; CODESIZE-LABEL: 'ld1r_2d_int_shuff'
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i64, ptr %x, align 8
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %tmp1 = insertelement <2 x i64> undef, i64 %tmp, i32 0
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %lane = shufflevector <2 x i64> %tmp1, <2 x i64> undef, <2 x i32> zeroinitializer
				; CODESIZE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret <2 x i64> %lane
				;
	entry:			entry:
	%tmp = load i64, ptr %x, align 8			%tmp = load i64, ptr %x, align 8
	%tmp1 = insertelement <2 x i64> undef, i64 %tmp, i32 0			%tmp1 = insertelement <2 x i64> undef, i64 %tmp, i32 0
	%lane = shufflevector <2 x i64> %tmp1, <2 x i64> undef, <2 x i32> zeroinitializer			%lane = shufflevector <2 x i64> %tmp1, <2 x i64> undef, <2 x i32> zeroinitializer
	ret <2 x i64> %lane			ret <2 x i64> %lane
	}			}

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -passes=slp-vectorizer -mtriple=arm64-apple-ios -S %s | FileCheck %s		; RUN: opt -passes=slp-vectorizer -mtriple=arm64-apple-ios -S %s | FileCheck %s

; Test case where not vectorizing is more profitable because multiple		; Test case where not vectorizing is more profitable because multiple
; fmul/{fadd,fsub} pairs can be lowered to fma instructions.		; fmul/{fadd,fsub} pairs can be lowered to fma instructions.
define void @slp_not_profitable_with_fast_fmf(ptr %A, ptr %B) {		define void @slp_not_profitable_with_fast_fmf(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_not_profitable_with_fast_fmf(		; CHECK-LABEL: @slp_not_profitable_with_fast_fmf(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1		; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4
		; CHECK-NEXT: [[B_1:%.*]] = load float, ptr [[GEP_B_1]], align 4
		; CHECK-NEXT: [[MUL_0:%.*]] = fmul fast float [[B_1]], [[A_0]]
; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4		; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_B_1]], align 4		; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i64 2
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[B_0]], i32 0		; CHECK-NEXT: [[B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[MUL_1:%.*]] = fmul fast float [[B_2]], [[B_0]]
; CHECK-NEXT: [[TMP3:%.*]] = fmul fast <2 x float> [[SHUFFLE1]], [[TMP1]]		; CHECK-NEXT: [[SUB:%.*]] = fsub fast float [[MUL_0]], [[MUL_1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[MUL_2:%.*]] = fmul fast float [[B_0]], [[B_1]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[A_0]], i32 0		; CHECK-NEXT: [[MUL_3:%.*]] = fmul fast float [[B_2]], [[A_0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[MUL_3]], [[MUL_2]]
; CHECK-NEXT: [[TMP5:%.*]] = fmul fast <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: store float [[SUB]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub fast <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds float, ptr [[A]], i64 1
; CHECK-NEXT: [[TMP7:%.*]] = fadd fast <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: store float [[ADD]], ptr [[GEP_A_1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: store float [[B_2]], ptr [[B]], align 4
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP1]], i32 1
; CHECK-NEXT: store float [[TMP9]], ptr [[B]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1		%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1
%A.0 = load float, ptr %A, align 4		%A.0 = load float, ptr %A, align 4
%B.1 = load float, ptr %gep.B.1, align 4		%B.1 = load float, ptr %gep.B.1, align 4
%mul.0 = fmul fast float %B.1, %A.0		%mul.0 = fmul fast float %B.1, %A.0
%B.0 = load float, ptr %B, align 4		%B.0 = load float, ptr %B, align 4
%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2		%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2
; [9 lines not shown]		; [9 lines not shown]
store float %B.2, ptr %B, align 4		store float %B.2, ptr %B, align 4
ret void		ret void
}		}

define void @slp_not_profitable_with_reassoc_fmf(ptr %A, ptr %B) {		define void @slp_not_profitable_with_reassoc_fmf(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_not_profitable_with_reassoc_fmf(		; CHECK-LABEL: @slp_not_profitable_with_reassoc_fmf(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1		; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4
		; CHECK-NEXT: [[B_1:%.*]] = load float, ptr [[GEP_B_1]], align 4
		; CHECK-NEXT: [[MUL_0:%.*]] = fmul reassoc float [[B_1]], [[A_0]]
; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4		; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_B_1]], align 4		; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i64 2
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[B_0]], i32 0		; CHECK-NEXT: [[B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[B_2]], [[B_0]]
; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x float> [[SHUFFLE1]], [[TMP1]]		; CHECK-NEXT: [[SUB:%.*]] = fsub reassoc float [[MUL_0]], [[MUL_1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[B_0]], [[B_1]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[A_0]], i32 0		; CHECK-NEXT: [[MUL_3:%.*]] = fmul reassoc float [[B_2]], [[A_0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[ADD:%.*]] = fadd reassoc float [[MUL_3]], [[MUL_2]]
; CHECK-NEXT: [[TMP5:%.*]] = fmul reassoc <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: store float [[SUB]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub reassoc <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds float, ptr [[A]], i64 1
; CHECK-NEXT: [[TMP7:%.*]] = fadd reassoc <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: store float [[ADD]], ptr [[GEP_A_1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: store float [[B_2]], ptr [[B]], align 4
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP1]], i32 1
; CHECK-NEXT: store float [[TMP9]], ptr [[B]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1		%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1
%A.0 = load float, ptr %A, align 4		%A.0 = load float, ptr %A, align 4
%B.1 = load float, ptr %gep.B.1, align 4		%B.1 = load float, ptr %gep.B.1, align 4
%mul.0 = fmul reassoc float %B.1, %A.0		%mul.0 = fmul reassoc float %B.1, %A.0
%B.0 = load float, ptr %B, align 4		%B.0 = load float, ptr %B, align 4
%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2		%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2
%B.2 = load float, ptr %gep.B.2, align 4		%B.2 = load float, ptr %gep.B.2, align 4
%mul.1 = fmul float %B.2, %B.0		%mul.1 = fmul float %B.2, %B.0
%sub = fsub reassoc float %mul.0, %mul.1		%sub = fsub reassoc float %mul.0, %mul.1
%mul.2 = fmul float %B.0, %B.1		%mul.2 = fmul float %B.0, %B.1
%mul.3 = fmul reassoc float %B.2, %A.0		%mul.3 = fmul reassoc float %B.2, %A.0
%add = fadd reassoc float %mul.3, %mul.2		%add = fadd reassoc float %mul.3, %mul.2
store float %sub, ptr %A, align 4		store float %sub, ptr %A, align 4
%gep.A.1 = getelementptr inbounds float, ptr %A, i64 1		%gep.A.1 = getelementptr inbounds float, ptr %A, i64 1
store float %add, ptr %gep.A.1, align 4		store float %add, ptr %gep.A.1, align 4
store float %B.2, ptr %B, align 4		store float %B.2, ptr %B, align 4
ret void		ret void
}		}

; FMA cannot be used due to missing fast-math flags, so SLP should kick in.		; FMA cannot be used due to missing fast-math flags, so SLP should kick in.
define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {		define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(		; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(
SjoerdMeijer (author): Ah, only after uploading this diff I noticed that the function names indicate that this should be profitable... I had missed that. Hmmm.... I guess that then needs looking into.

SjoerdMeijer (author): Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.

ABataev: Working on fma vectorization support in SLP, hope to spend more time on this later this month.

SjoerdMeijer (author): It's difficult to see how the SLP variant is ever going to be faster for this example with just a handful of scalar instructions (ignoring the loads/stores) that mostly overlap in dispatch and execution:

    [0,3] D======eeeER .. fmul s4, s1, s0
    [0,4] D======eeeER .. fmul s1, s2, s1
    [0,5] D=======eeeER .. fmul s5, s3, s2
    [0,6] D=======eeeER .. fmul s0, s3, s0
    [0,7] D==========eeER.. fsub s2, s4, s5
    [0,8] D==========eeER.. fadd s0, s0, s1

especially if we have to do things like shuffles:

    [0,2] D==eeeeeeeeER . . . .. ld1r.2s { v1 }, [x8], #4
    [0,3] D==========eeeeeeER . . .. ldr d2, [x8]
    [0,4] D================eeeER . .. fmul.2s v1, v1, v2
    [0,5] D================eeeER . .. fmul.2s v0, v2, v0[0]
    [0,6] D===================eeER . .. rev64.2s v1, v1
    [0,7] D=====================eeER .. fsub.2s v3, v0, v1
    [0,8] .D====================eeER .. fadd.2s v0, v0, v1
    [0,9] .D======================eeER .. mov.s v3[1], v0[1]
    [0,10] .D========================eeER.. str d3, [x0]
    [0,11] .D========================eeeeER st1.s { v2 }[1], [x1]

Here I am showing some loads/stores, but that's just to show they are not simple loads/stores anymore but higher-latency instructions. Perhaps more importantly, we have got the REV and the extract, so with FMAs things might look a bit better, but it's difficult to beat the scalar variant. The SLP timeline is a little bit skewed because of the post-inc and the result being available a lot earlier, but the bottom line is that there is very little parallelism here, as we are working on 2 floats and there's the overhead of the vectorisation. I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.

@fhahn, @dmgreen: I think it makes sense to also not SLP vectorise this function (and the other 2 below). Do you agree?

dmgreen: I think we will fold the dup into the fmul as opposed to the load now, which seems a little cheaper. https://godbolt.org/z/4aeab8Pas I'm not sure if that makes it cheaper overall though. I agree that the rev and the mov make the timing tight. From the look of the test, "profitable" here just means that the non-fast version would be slightly more instructions, not that it was known to be profitable; i.e. the test is for testing fma combining, and this was just a negative test for that issue.
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1		; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4
		; CHECK-NEXT: [[B_1:%.*]] = load float, ptr [[GEP_B_1]], align 4
		; CHECK-NEXT: [[MUL_0:%.*]] = fmul fast float [[B_1]], [[A_0]]
; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4		; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_B_1]], align 4		; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i64 2
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[B_0]], i32 0		; CHECK-NEXT: [[B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[MUL_1:%.*]] = fmul fast float [[B_2]], [[B_0]]
; CHECK-NEXT: [[TMP3:%.*]] = fmul fast <2 x float> [[SHUFFLE1]], [[TMP1]]		; CHECK-NEXT: [[SUB:%.*]] = fsub float [[MUL_0]], [[MUL_1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[MUL_2:%.*]] = fmul fast float [[B_0]], [[B_1]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[A_0]], i32 0		; CHECK-NEXT: [[MUL_3:%.*]] = fmul fast float [[B_2]], [[A_0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[ADD:%.*]] = fadd float [[MUL_3]], [[MUL_2]]
; CHECK-NEXT: [[TMP5:%.*]] = fmul fast <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: store float [[SUB]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds float, ptr [[A]], i64 1
; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: store float [[ADD]], ptr [[GEP_A_1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: store float [[B_2]], ptr [[B]], align 4
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP1]], i32 1
; CHECK-NEXT: store float [[TMP9]], ptr [[B]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1		%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1
%A.0 = load float, ptr %A, align 4		%A.0 = load float, ptr %A, align 4
%B.1 = load float, ptr %gep.B.1, align 4		%B.1 = load float, ptr %gep.B.1, align 4
%mul.0 = fmul fast float %B.1, %A.0		%mul.0 = fmul fast float %B.1, %A.0
%B.0 = load float, ptr %B, align 4		%B.0 = load float, ptr %B, align 4
%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2		%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2
[10 lines not shown]
ret void		ret void
}		}

; FMA cannot be used due to missing fast-math flags, so SLP should kick in.		; FMA cannot be used due to missing fast-math flags, so SLP should kick in.
define void @slp_profitable_missing_fmf_on_fmul_fadd_fsub(ptr %A, ptr %B) {		define void @slp_profitable_missing_fmf_on_fmul_fadd_fsub(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_profitable_missing_fmf_on_fmul_fadd_fsub(		; CHECK-LABEL: @slp_profitable_missing_fmf_on_fmul_fadd_fsub(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1		; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4
		; CHECK-NEXT: [[B_1:%.*]] = load float, ptr [[GEP_B_1]], align 4
		; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[B_1]], [[A_0]]
; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4		; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_B_1]], align 4		; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i64 2
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[B_0]], i32 0		; CHECK-NEXT: [[B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[B_2]], [[B_0]]
; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x float> [[SHUFFLE1]], [[TMP1]]		; CHECK-NEXT: [[SUB:%.*]] = fsub float [[MUL_0]], [[MUL_1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[B_0]], [[B_1]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[A_0]], i32 0		; CHECK-NEXT: [[MUL_3:%.*]] = fmul float [[B_2]], [[A_0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[ADD:%.*]] = fadd float [[MUL_3]], [[MUL_2]]
; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: store float [[SUB]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds float, ptr [[A]], i64 1
; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: store float [[ADD]], ptr [[GEP_A_1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: store float [[B_2]], ptr [[B]], align 4
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP1]], i32 1
; CHECK-NEXT: store float [[TMP9]], ptr [[B]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1		%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1
%A.0 = load float, ptr %A, align 4		%A.0 = load float, ptr %A, align 4
%B.1 = load float, ptr %gep.B.1, align 4		%B.1 = load float, ptr %gep.B.1, align 4
%mul.0 = fmul float %B.1, %A.0		%mul.0 = fmul float %B.1, %A.0
%B.0 = load float, ptr %B, align 4		%B.0 = load float, ptr %B, align 4
%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2		%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2
[10 lines not shown]
ret void		ret void
}		}

; FMA cannot be used due to missing fast-math flags, so SLP should kick in.		; FMA cannot be used due to missing fast-math flags, so SLP should kick in.
define void @slp_profitable_missing_fmf_nnans_only(ptr %A, ptr %B) {		define void @slp_profitable_missing_fmf_nnans_only(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_profitable_missing_fmf_nnans_only(		; CHECK-LABEL: @slp_profitable_missing_fmf_nnans_only(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1		; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[A_0:%.*]] = load float, ptr [[A:%.*]], align 4
		; CHECK-NEXT: [[B_1:%.*]] = load float, ptr [[GEP_B_1]], align 4
		; CHECK-NEXT: [[MUL_0:%.*]] = fmul nnan float [[B_1]], [[A_0]]
; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4		; CHECK-NEXT: [[B_0:%.*]] = load float, ptr [[B]], align 4
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[GEP_B_1]], align 4		; CHECK-NEXT: [[GEP_B_2:%.*]] = getelementptr inbounds float, ptr [[B]], i64 2
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[B_0]], i32 0		; CHECK-NEXT: [[B_2:%.*]] = load float, ptr [[GEP_B_2]], align 4
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[MUL_1:%.*]] = fmul nnan float [[B_2]], [[B_0]]
; CHECK-NEXT: [[TMP3:%.*]] = fmul nnan <2 x float> [[SHUFFLE1]], [[TMP1]]		; CHECK-NEXT: [[SUB:%.*]] = fsub nnan float [[MUL_0]], [[MUL_1]]
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[MUL_2:%.*]] = fmul nnan float [[B_0]], [[B_1]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[A_0]], i32 0		; CHECK-NEXT: [[MUL_3:%.*]] = fmul nnan float [[B_2]], [[A_0]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[ADD:%.*]] = fadd nnan float [[MUL_3]], [[MUL_2]]
; CHECK-NEXT: [[TMP5:%.*]] = fmul nnan <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: store float [[SUB]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub nnan <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr inbounds float, ptr [[A]], i64 1
; CHECK-NEXT: [[TMP7:%.*]] = fadd nnan <2 x float> [[TMP5]], [[SHUFFLE2]]		; CHECK-NEXT: store float [[ADD]], ptr [[GEP_A_1]], align 4
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: store float [[B_2]], ptr [[B]], align 4
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[A]], align 4
; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP1]], i32 1
; CHECK-NEXT: store float [[TMP9]], ptr [[B]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1		%gep.B.1 = getelementptr inbounds float, ptr %B, i64 1
%A.0 = load float, ptr %A, align 4		%A.0 = load float, ptr %A, align 4
%B.1 = load float, ptr %gep.B.1, align 4		%B.1 = load float, ptr %gep.B.1, align 4
%mul.0 = fmul nnan float %B.1, %A.0		%mul.0 = fmul nnan float %B.1, %A.0
%B.0 = load float, ptr %B, align 4		%B.0 = load float, ptr %B, align 4
%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2		%gep.B.2 = getelementptr inbounds float, ptr %B, i64 2
[66 lines not shown]
}		}

define void @slp_profitable(ptr %A, ptr %B, float %0) {		define void @slp_profitable(ptr %A, ptr %B, float %0) {
; CHECK-LABEL: @slp_profitable(		; CHECK-LABEL: @slp_profitable(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[SUB_I1096:%.*]] = fsub fast float 1.000000e+00, [[TMP0:%.*]]		; CHECK-NEXT: [[SUB_I1096:%.*]] = fsub fast float 1.000000e+00, [[TMP0:%.*]]
; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[A:%.*]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load <2 x float>, ptr [[A:%.*]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP0]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP0]], i32 0
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: [[TMP3:%.*]] = fmul fast <2 x float> [[TMP1]], [[SHUFFLE]]		; CHECK-NEXT: [[TMP4:%.*]] = fmul fast <2 x float> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> poison, float [[SUB_I1096]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x float> poison, float [[SUB_I1096]], i32 0
; CHECK-NEXT: [[SHUFFLE2:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> zeroinitializer		; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: [[TMP5:%.*]] = fmul fast <2 x float> [[TMP1]], [[SHUFFLE2]]		; CHECK-NEXT: [[TMP8:%.*]] = fmul fast <2 x float> [[TMP1]], [[TMP7]]
; CHECK-NEXT: [[TMP6:%.*]] = fadd fast <2 x float> [[SHUFFLE1]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = fadd fast <2 x float> [[TMP5]], [[TMP8]]
; CHECK-NEXT: [[TMP7:%.*]] = fsub fast <2 x float> [[SHUFFLE1]], [[TMP5]]		; CHECK-NEXT: [[TMP10:%.*]] = fsub fast <2 x float> [[TMP5]], [[TMP8]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x float> [[TMP9]], <2 x float> [[TMP10]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: store <2 x float> [[TMP8]], ptr [[B:%.*]], align 4		; CHECK-NEXT: store <2 x float> [[TMP11]], ptr [[B:%.*]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%gep.A.1 = getelementptr inbounds float, ptr %A, i64 1		%gep.A.1 = getelementptr inbounds float, ptr %A, i64 1
%sub.i1096 = fsub fast float 1.000000e+00, %0		%sub.i1096 = fsub fast float 1.000000e+00, %0
%1 = load float, ptr %A, align 4		%1 = load float, ptr %A, align 4
%mul.i1100 = fmul fast float %1, %sub.i1096		%mul.i1100 = fmul fast float %1, %sub.i1096
%2 = load float, ptr %gep.A.1, align 4		%2 = load float, ptr %gep.A.1, align 4
[10 lines not shown]
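
For readers following the inline thread on slp-fma-loss.ll, the arithmetic in @slp_not_profitable_with_reassoc_fmf can be written out in plain C++. This is only an illustrative sketch and not part of the patch: the names `scalar_kernel`, `two_lane_kernel` and `Vec2` are hypothetical, and `Vec2` is an ordinary two-element array standing in for `<2 x float>`. The first function follows the scalar IR; the second follows the shape of the previously expected vectorised CHECK lines (two splats, a lane reversal, a lane blend and an extract), which is the overhead the thread weighs against the six scalar floating-point operations.

```cpp
#include <array>

// Hypothetical stand-in for <2 x float>; not an LLVM type.
using Vec2 = std::array<float, 2>;

// Scalar form, following the IR body: four fmul, one fsub, one fadd.
void scalar_kernel(float *A, float *B) {
  float a0 = A[0], b0 = B[0], b1 = B[1], b2 = B[2];
  A[0] = b1 * a0 - b2 * b0;   // %sub
  A[1] = b2 * a0 + b0 * b1;   // %add
  B[0] = b2;                  // store float %B.2, ptr %B
}

// Two-lane form, following the old vectorised CHECK lines.
void two_lane_kernel(float *A, float *B) {
  float a0 = A[0], b0 = B[0];
  Vec2 v = {B[1], B[2]};                       // load <2 x float> from B + 1
  Vec2 splat_b0 = {b0, b0};                    // insertelement + splat shuffle
  Vec2 splat_a0 = {a0, a0};                    // insertelement + splat shuffle
  Vec2 m0 = {splat_b0[0] * v[0], splat_b0[1] * v[1]};
  Vec2 rev = {m0[1], m0[0]};                   // shufflevector <1, 0>
  Vec2 m1 = {v[0] * splat_a0[0], v[1] * splat_a0[1]};
  Vec2 sub = {m1[0] - rev[0], m1[1] - rev[1]};
  Vec2 add = {m1[0] + rev[0], m1[1] + rev[1]};
  A[0] = sub[0];                               // shufflevector <0, 3>: lane 0 of sub
  A[1] = add[1];                               //                       lane 1 of add
  B[0] = v[1];                                 // extractelement lane 1
}
```

Both functions produce the same values for the same inputs. Each lane of the fsub and fadd in the two-lane form is computed but only one lane of each is kept, so half of that work is discarded before the store.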

Revision Contents

Diff 504668

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

llvm/test/Analysis/CostModel/AArch64/shuffle-load.ll

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll