This is an archive of the discontinued LLVM Phabricator instance.

[TTI][AArch64] Cost model vector INS instructions
Needs Review · Public

Authored by SjoerdMeijer on Jan 23 2023, 6:48 AM.

Details

Reviewers
dmgreen
david-arm
Summary

This makes the "ASIMD insert, element to element" instruction, INS, cheaper for the N1, N2, V1, V2 cores. The software optimisation guides specify a latency of 2 for this instruction, which makes it pretty cheap. I see a 2.5% perf uplift for x264 with this on the V1.
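
As a rough sketch of what this means in the code (the switch below follows the shape of AArch64Subtarget::initializeProperties(), but treat the exact structure as an assumption rather than the verbatim patch):

  switch (ARMProcFamily) {
  case NeoverseN1:
  case NeoverseN2:
  case NeoverseV1:
  case NeoverseV2:
    // The software optimisation guides give INS (ASIMD insert, element to
    // element) a latency of 2 on these cores, so model inserts/extracts
    // as cheaper than the default.
    VectorInsertExtractBaseCost = 2;
    break;
  default:
    break; // all other cores keep the default base cost of 3
  }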

Diff Detail

Event Timeline

SjoerdMeijer created this revision. · Jan 23 2023, 6:48 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 23 2023, 6:48 AM
SjoerdMeijer requested review of this revision. · Jan 23 2023, 6:48 AM

I see a 2.5% perf uplift for x264 with this on the V1.

I haven't analysed the reasons for this, but it's a nice bonus on top of making INS a bit cheaper, which seems more accurate anyway.

llvm/test/Analysis/CostModel/AArch64/insert-extract.ll
167

In D141602, we made the indexed LD1 a bit more expensive with ST->getVectorInsertExtractBaseCost() + 1, which resulted in this cost here going up from 3 to 4. But because this patch lowers getVectorInsertExtractBaseCost() to 2, the cost of the indexed LD1 is back to 3.
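
For reference, the arithmetic being described, paraphrased rather than quoted from the actual call site:

  // Sketch only: D141602 charges one extra unit for an insert fed by a
  // load (the indexed LD1 pattern). InsertIsFedByLoad is a hypothetical
  // stand-in for the real check.
  InstructionCost BaseCost = ST->getVectorInsertExtractBaseCost();
  if (InsertIsFedByLoad)
    return BaseCost + 1; // was 3 + 1 = 4; with this patch, 2 + 1 = 3
  return BaseCost;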

I don't think I am too unhappy with all of this: INS is a bit cheaper, which I think it is or should be, and LD1 is a bit more expensive.

There are some comments in https://reviews.llvm.org/D132185 about the last time this came up, although that was more aggressive than this.

A few points:

  • I'm not a huge fan of making this CPU specific if we can avoid it. Most of the optimization guides I have looked at, going back to the A57, have similar latencies for the relevant instruction, and making things very CPU specific increases the surface area for things to go wrong in a way that is not caught. There's not a big difference between Neoverse cores and Cortex-A cores in this regard, and if it's better we will likely want the advantages for -mcpu=generic too. I also think this might be more about how well the SLP vectorizer behaves than exactly how much the instructions cost on the CPU.
  • The default cost type in these functions (especially from vectorization) is TCK_RecipThroughput, not TCK_Latency. It is the relative throughputs that matter in most cases. (There can be times where the critical path is important, so it can be considered as part of the cost, but in general with a good out-of-order core or a well-scheduled in-order core, throughput is most pressing.)
  • This VectorInsertExtractBaseCost is really controlling the cost of inserting to a vector, extracting from a vector, or shuffling the vector around (either directly or through scalarization overhead). With vector shuffling or fp types this will be either "INS (ASIMD insert, element to element)" or "FMOV (FP move, register)". These two are usually quite efficient. With integer types it can involve a move between gpr and vpr, which comes under "FMOV (FP transfer, from gen to vec reg)" and "FMOV (FP transfer, from vec to gen reg)". These two have a throughput of 1 on all the cpus I looked at, which is a lot lower than other instructions. They often come in clumps too, so only being able to do 1 a cycle can be a real limiting factor (the small example after this list illustrates the integer-vs-fp difference).
  • The VectorInsertExtractBaseCost is also a lever on how much you are willing to put up with the SLP vectorizer producing lots of shuffling. I got a report just yesterday about the SLP vectorizer producing slower code because of all the shuffling it introduced around a loop (which I think may be to do with the relative cost of FMA, but it shows that low costs on vector inserts/extracts can cause problems).
  • The x264 improvement is probably https://reviews.llvm.org/D133441, where there are two mutual reductions on the same values (an add and a mla). With a lower vector extract cost it can vectorize one of the reductions with a lot of extracts, but then vectorize all the extracts with another in a second step. It doesn't actually have to pay the price of the extracts there, though; it's a bit of a hack to get the same result. The other one I've seen is a function called hadamard that I mentioned in the other ticket, but as far as I remember that uses ld1 instructions, so the inserts would be expensive if D141602 applies.
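
To make the integer-vs-fp distinction from the third bullet concrete, here is a small hand-written illustration using ACLE intrinsics; the instruction selection noted in the comments is typical rather than guaranteed:

  #include <arm_neon.h>

  int32x4_t insert_int(int32x4_t v, int x) {
    return vsetq_lane_s32(x, v, 1); // typically INS v0.s[1], w0: a gen-to-vec transfer
  }

  float32x4_t insert_fp(float32x4_t v, float x) {
    return vsetq_lane_f32(x, v, 1); // typically INS v0.s[1], v1.s[0]: stays in vector regs
  }

  int extract_int(int32x4_t v) {
    return vgetq_lane_s32(v, 1);    // typically UMOV/FMOV w0, v0.s[1]: a vec-to-gen transfer
  }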

I've also seen regressions in performance when running SPEC and other benchmarks in the past, and it has often outweighed the benefits. Perhaps keeping the load-insert cost high will help there? I think working towards reducing this default from 3 to 2 is a good thing to do; I'm just unsure about whether it is really a better value yet.

Thanks for all your thoughts on this!
I find a 2.5% uplift for x264 interesting enough to progress this, but wanted to check with you first whether you see any potential here or whether it is a non-starter for you. For example, I cherry-picked only a few SPEC apps and haven't done full INT + FP runs yet, which I can do. But do you know if a cost of 2 is not going to work for your benchmarks/workloads?

There are some comments in https://reviews.llvm.org/D132185 about the last time this came up, although that was more aggressive than this.

A few points:

  • I'm not a huge fan of making this CPU specific if we can avoid it. Most of the optimization guides I have looked at, going back to the A57, have similar latencies for the relevant instruction, and making things very CPU specific increases the surface area for things to go wrong in a way that is not caught. There's not a big difference between Neoverse cores and Cortex-A cores in this regard, and if it's better we will likely want the advantages for -mcpu=generic too. I also think this might be more about how well the SLP vectorizer behaves than exactly how much the instructions cost on the CPU.

Yeah, good point, I wanted to be "conservative" and enable this only for the newer cores just in case one of the older cores didn't have a good implementation for this.

  • The default cost type in these functions (especially from vectorization) is TCK_RecipThroughput, not TCK_Latency. It is the relative throughputs that matter in most cases. (There can be times where the critical path is important, so it can be considered as part of the cost, but in general with a good out-of-order core or a well-scheduled in-order core, throughput is most pressing.)

Yep, I got that. So 3 doesn't make much sense either, right? I just wanted to make it slightly less expensive.

  • This VectorInsertExtractBaseCost is really controlling the cost of inserting to a vector, extracting from a vector, or shuffling the vector around (either directly or through scalarization overhead). With vector shuffling or fp types this will be either "INS (ASIMD insert, element to element)" or "FMOV (FP move, register)". These two are usually quite efficient. With integer types it can involve a move between gpr and vpr, which comes under "FMOV (FP transfer, from gen to vec reg)" and "FMOV (FP transfer, from vec to gen reg)". These two have a throughput of 1 on all the cpus I looked at, which is a lot lower than other instructions. They often come in clumps too, so only being able to do 1 a cycle can be a real limiting factor.
  • The VectorInsertExtractBaseCost is also a lever on how much you are willing to put up with the SLP vectorizer producing lots of shuffling. I got a report just yesterday about the SLP vectorizer producing slower code because of all the shuffling it introduced around a loop (which I think may be to do with the relative cost of FMA, but it shows that low costs on vector inserts/extracts can cause problems).
  • The x264 improvement is probably https://reviews.llvm.org/D133441, where there are two mutual reductions on the same values (an add and a mla). With a lower vector extract cost it can vectorize one of the reductions with a lot of extracts, but then vectorize all the extracts with another in a second step. It doesn't actually have to pay the price of the extracts there, though; it's a bit of a hack to get the same result. The other one I've seen is a function called hadamard that I mentioned in the other ticket, but as far as I remember that uses ld1 instructions, so the inserts would be expensive if D141602 applies.

I've also seen regressions in performance when running SPEC and other benchmarks in the past, and it has often outweighed the benefits. Perhaps keeping the load-insert cost high will help there? I think working towards reducing this default from 3 to 2 is a good thing to do; I'm just unsure about whether it is really a better value yet.

Ok, I am going to do some more homework, kick off SPEC runs and check for regressions, and also see where the uplift comes from.
But yeah, in the mean time I am interested to know if you see any potential in this.

The internal embedded benchmarks I tried had some pretty wild swings in both directions. I think it is worth working towards this, if we can try and minimize the regressions in the process. Running more benchmarks from SPEC and perhaps the llvm-test-suite would be good (maybe try and see what is going on in salsa, for example? We might be getting the costs of scalar rotates/funnel shifts incorrect).
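
For what it's worth, the rotate pattern I would expect salsa-style kernels to be full of looks like this (my own example, not code from the benchmark):

  // Rotate-left by a constant amount (assumes 0 < k < 32, as in salsa20's
  // fixed rotations). LLVM canonicalises this pattern to the llvm.fshl.i32
  // intrinsic, so a wrong cost for funnel shifts would skew the
  // scalar-vs-vector comparison for such loops.
  static inline unsigned rotl32(unsigned x, unsigned k) {
    return (x << k) | (x >> (32 - k));
  }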

There might be quite a few other cases. I can try and provide some examples if I can extract them.

I have tried SPEC2017 INT and SPEC FP, and the LLVM test-suite:

  • As we already knew, there's only 1 change in SPEC INT, and that is x264, which is an uplift.
  • In SPEC FP, there is 1 change, a minor regression in 510.parest. It's small, but it's definitely there.
  • Nothing stands out in the llvm test-suite, and I don't see a regression in salsa.

For SPEC INT and FP, overall it is a tiny win, but I wouldn't object to calling it neutral: the uplift is 2.5% and the regression 1.5%.
So, the way I look at this at the moment is that this is more of an enabler.

About the regression: I looked at the 2 hottest functions in 510.parest, which together account for 55% of the runtime, and I see this pattern repeated in different places in both functions:

Before:

54c16c:       2f00e400        movi    d0, #0x0
54c170:       2f00e401        movi    d1, #0x0
54c18c:       6d7f8c02        ldp     d2, d3, [x0, #-8]
54c19c:       fc637964        ldr     d4, [x11, x3, lsl #3]
54c1a0:       fc647965        ldr     d5, [x11, x4, lsl #3]
54c1a4:       1f420080        fmadd   d0, d4, d2, d0
54c1a8:       1f4304a1        fmadd   d1, d5, d3, d1

After:

54e3c8:       6f00e400        movi    v0.2d, #0x0
54e3e4:       3cc10601        ldr     q1, [x16], #16
54e3f0:       fc627962        ldr     d2, [x11, x2, lsl #3]
54e3f4:       8b030d63        add     x3, x11, x3, lsl #3
54e3f8:       4d408462        ld1     {v2.d}[1], [x3]
54e3fc:       4e61cc40        fmla    v0.2d, v2.2d, v1.2d

I think this must be responsible for the minor regression. It is not terribly wrong, but I suspect it's just this high-latency LD1 variant that keeps this SLP-vectorised code from being faster. This is funny, because we looked at the cost modelling for this LD1 recently in D141602, but I had not spotted this LD1 here in 510.parest. I am curious to know why the SLP vectoriser thinks this is beneficial, so I will look into that.
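
To make the pattern concrete, the scalar loop plausibly looks something like this; this is a reconstruction from the disassembly above with made-up names, not the actual 510.parest source:

  // Two independent multiply-accumulate chains over a contiguous values[]
  // array and an indirectly indexed x[]. In the "After" code the SLP
  // vectoriser pairs them up, but the second x[] element has to be
  // inserted with the indexed LD1, which is the high-latency part.
  double acc0 = 0.0, acc1 = 0.0;
  for (unsigned k = 0; k + 1 < n; k += 2) {
    acc0 += values[k]     * x[cols[k]];     // scalar: ldr d + fmadd
    acc1 += values[k + 1] * x[cols[k + 1]]; // scalar: ldr d + fmadd
  }
  double sum = acc0 + acc1;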

In the meantime, I was curious if you had additional thoughts on this.

I was expecting that to be similar to the other fmul+fadd vs fma issue we have seen elsewhere, but I'm not sure it is. Does it reduce the value to a single element again?

As for a few examples, this case is a little worse. I'm not sure if it is using bad costs, but the SLP vectorization seems to reach through a phi, and the result when put through llc is mostly just more instructions: https://godbolt.org/z/z6hnEcPG1.
Another case, which is perhaps simpler, is this "distance" one from cmsisdsp: https://godbolt.org/z/1xz7GP3fM. It looks like something might be getting scalarized, but again I've not looked into the details.
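
For reference, I would expect the distance kernel to boil down to something like the following (a paraphrase of the usual Euclidean-distance shape, not the CMSIS-DSP source; the function name is made up):

  #include <math.h>

  float distance_f32(const float *a, const float *b, unsigned n) {
    float sum = 0.0f;
    for (unsigned i = 0; i < n; ++i) {
      float d = a[i] - b[i];
      sum += d * d; // if the vectorised reduction scalarises here, the
                    // insert/extract costs decide whether it pays off
    }
    return sqrtf(sum);
  }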

There are some nice improvements too, if we can get the regressions hopefully fixed.

Thank you so much for these reproducers. They are the best and most concrete reproducers I've got so far, so I will be looking into them.

ktkachov added inline comments.
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
216–217

Beyond David's comments, I'll note that there are other microarchitectures that are similar to the N1 and so should benefit from this, particularly the Cortex-A76 and later CPUs in that family.

dmgreen added inline comments. · Feb 20 2023, 4:44 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
216–217

Yeah - I would say that we should be changing this globally for all cores under aarch64, not just some. As in, we change the default value to 2. (In the tests I ran, the in-order cores actually did better than the out-of-order ones!) I believe this cost is more about controlling the code quality of SLP code than the actual cost of inserts/extracts. We need to get the performance better overall first, though.

Allen added a subscriber: Allen. · May 23 2023, 1:18 AM