This is an archive of the discontinued LLVM Phabricator instance.

[TTI][AArch64] Cost model vector INS instructions
Needs Review · Public

Authored by SjoerdMeijer on Jan 23 2023, 6:48 AM.

Details

Reviewers
dmgreen
david-arm
Summary

This makes the "ASIMD insert, element to element" instruction, INS, cheaper for the N1, N2, V1, V2 cores. The software optimisation guides specify a latency of 2 for this instruction, which makes it pretty cheap. I see a 2.5% perf uplift for x264 with this on the V1.
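
As a rough sketch of what this means in the code (the switch below follows the shape of AArch64Subtarget::initializeProperties(), but treat the exact structure as an assumption rather than the verbatim patch):

  switch (ARMProcFamily) {
  case NeoverseN1:
  case NeoverseN2:
  case NeoverseV1:
  case NeoverseV2:
    // The software optimisation guides give INS (ASIMD insert, element to
    // element) a latency of 2 on these cores, so model inserts/extracts
    // as cheaper than the default.
    VectorInsertExtractBaseCost = 2;
    break;
  default:
    break; // all other cores keep the default base cost of 3
  }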

Diff Detail

Event Timeline

SjoerdMeijer created this revision. · Jan 23 2023, 6:48 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 23 2023, 6:48 AM
SjoerdMeijer requested review of this revision. · Jan 23 2023, 6:48 AM

I see a 2.5% perf uplift for x264 with this on the V1.

I haven't analysed the reasons for this, but it's a nice bonus on top of making INS a bit cheaper, which seems more accurate anyway.

llvm/test/Analysis/CostModel/AArch64/insert-extract.ll
167

In D141602, we made the indexed LD1 a bit more expensive with ST->getVectorInsertExtractBaseCost() + 1, which resulted in this cost here going up from 3 to 4. But because this patch lowers getVectorInsertExtractBaseCost() to 2, the cost of the indexed LD1 is back to 3.
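
For reference, the arithmetic being described, paraphrased rather than quoted from the actual call site:

  // Sketch only: D141602 charges one extra unit for an insert fed by a
  // load (the indexed LD1 pattern). InsertIsFedByLoad is a hypothetical
  // stand-in for the real check.
  InstructionCost BaseCost = ST->getVectorInsertExtractBaseCost();
  if (InsertIsFedByLoad)
    return BaseCost + 1; // was 3 + 1 = 4; with this patch, 2 + 1 = 3
  return BaseCost;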

I don't think I am too unhappy with all of this: INS is a bit cheaper, which I think it is or should be, and LD1 is a bit more expensive.

There are some comments in https://reviews.llvm.org/D132185 about the last time this came up, although that was more aggressive than this.

A few points:

  • I'm not a huge fan of making this CPU specific if we can avoid it. Most of the optimization guides I have looked at, going back to the A57, have similar latencies for the relevant instruction, and making things very CPU specific increases the surface area for things to go wrong in a way that is not caught. There's not a big difference between Neoverse cores and Cortex-A cores in this regard, and if it's better we will likely want the advantages for -mcpu=generic too. I also think this might be more about how well the SLP vectorizer behaves than exactly how much the instructions cost on the CPU.
  • The default cost type in these functions (especially from vectorization) is TCK_RecipThroughput, not TCK_Latency. It is the relative throughputs that matter in most cases. (There can be times where the critical path is important, so it can be considered as part of the cost, but in general with a good out-of-order core or a well-scheduled in-order core, throughput is most pressing.)
  • This VectorInsertExtractBaseCost is really controlling the cost of inserting to a vector, extracting from a vector, or shuffling the vector around (either directly or through scalarization overhead). With vector shuffling or fp types this will be either "INS (ASIMD insert, element to element)" or "FMOV (FP move, register)". These two are usually quite efficient. With integer types it can involve a move between gpr and vpr, which comes under "FMOV (FP transfer, from gen to vec reg)" and "FMOV (FP transfer, from vec to gen reg)". These two have a throughput of 1 on all the cpus I looked at, which is a lot lower than other instructions. They often come in clumps too, so only being able to do 1 a cycle can be a real limiting factor (the small example after this list illustrates the integer-vs-fp difference).
  • The VectorInsertExtractBaseCost is also a lever on how much you are willing to put up with the SLP vectorizer producing lots of shuffling. I got a report just yesterday about the SLP vectorizer producing slower code because of all the shuffling it introduced around a loop (which I think may be to do with the relative cost of FMA, but it shows that low costs on vector inserts/extracts can cause problems).
  • The x264 improvement is probably https://reviews.llvm.org/D133441, where there are two mutual reductions on the same values (an add and a mla). With a lower vector extract cost it can vectorize one of the reductions with a lot of extracts, but then vectorize all the extracts with another in a second step. It doesn't actually have to pay the price of the extracts there, though; it's a bit of a hack to get the same result. The other one I've seen is a function called hadamard that I mentioned in the other ticket, but as far as I remember that uses ld1 instructions, so the inserts would be expensive if D141602 applies.
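
To make the integer-vs-fp distinction from the third bullet concrete, here is a small hand-written illustration using ACLE intrinsics; the instruction selection noted in the comments is typical rather than guaranteed:

  #include <arm_neon.h>

  int32x4_t insert_int(int32x4_t v, int x) {
    return vsetq_lane_s32(x, v, 1); // typically INS v0.s[1], w0: a gen-to-vec transfer
  }

  float32x4_t insert_fp(float32x4_t v, float x) {
    return vsetq_lane_f32(x, v, 1); // typically INS v0.s[1], v1.s[0]: stays in vector regs
  }

  int extract_int(int32x4_t v) {
    return vgetq_lane_s32(v, 1);    // typically UMOV/FMOV w0, v0.s[1]: a vec-to-gen transfer
  }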

I've also seen regressions in performance when running SPEC and other benchmarks in the past, and it has often outweighed the benefits. Perhaps keeping the load-insert cost high will help there? I think working towards reducing this default from 3 to 2 is a good thing to do; I'm just unsure about whether it is really a better value yet.

Thanks for all your thoughts on this!
I find a 2.5% uplift for x264 interesting enough to progress this, but wanted to check with you first whether you see any potential here or whether it is a non-starter for you. For example, I cherry-picked only a few SPEC apps and haven't done full INT + FP runs yet, which I can do. But do you know if a cost of 2 is not going to work for your benchmarks/workloads?

There are some comments in https://reviews.llvm.org/D132185 about the last time this came up, although that was more aggressive than this.

A few points:

  • I'm not a huge fan of making this CPU specific if we can avoid it. Most of the optimization guides I have looked at, going back to the A57, have similar latencies for the relevant instruction, and making things very CPU specific increases the surface area for things to go wrong in a way that is not caught. There's not a big difference between Neoverse cores and Cortex-A cores in this regard, and if it's better we will likely want the advantages for -mcpu=generic too. I also think this might be more about how well the SLP vectorizer behaves than exactly how much the instructions cost on the CPU.

Yeah, good point, I wanted to be "conservative" and enable this only for the newer cores just in case one of the older cores didn't have a good implementation for this.

  • The default cost type in these functions (especially from vectorization) is TCK_RecipThroughput, not TCK_Latency. It is the relative throughputs that matter in most cases. (There can be times where the critical path is important, so it can be considered as part of the cost, but in general with a good out-of-order core or a well-scheduled in-order core, throughput is most pressing.)

Yep, I got that. So 3 doesn't make much sense either, right? I just wanted to make it slightly less expensive.

  • This VectorInsertExtractBaseCost is really controlling the cost of inserting to a vector, extracting from a vector, or shuffling the vector around (either directly or through scalarization overhead). With vector shuffling or fp types this will be either "INS (ASIMD insert, element to element)" or "FMOV (FP move, register)". These two are usually quite efficient. With integer types it can involve a move between gpr and vpr, which comes under "FMOV (FP transfer, from gen to vec reg)" and "FMOV (FP transfer, from vec to gen reg)". These two have a throughput of 1 on all the cpus I looked at, which is a lot lower than other instructions. They often come in clumps too, so only being able to do 1 a cycle can be a real limiting factor.
  • The VectorInsertExtractBaseCost is also a lever on how much you are willing to put up with the SLP vectorizer producing lots of shuffling. I got a report just yesterday about the SLP vectorizer producing slower code because of all the shuffling it introduced around a loop (which I think may be to do with the relative cost of FMA, but it shows that low costs on vector inserts/extracts can cause problems).
  • The x264 improvement is probably https://reviews.llvm.org/D133441, where there are two mutual reductions on the same values (an add and a mla). With a lower vector extract cost it can vectorize one of the reductions with a lot of extracts, but then vectorize all the extracts with another in a second step. It doesn't actually have to pay the price of the extracts there, though; it's a bit of a hack to get the same result. The other one I've seen is a function called hadamard that I mentioned in the other ticket, but as far as I remember that uses ld1 instructions, so the inserts would be expensive if D141602 applies.

I've also seen regressions in performance when running SPEC and other benchmarks in the past, and it has often outweighed the benefits. Perhaps keeping the load-insert cost high will help there? I think working towards reducing this default from 3 to 2 is a good thing to do; I'm just unsure about whether it is really a better value yet.

Ok, I am going to do some more homework, kick off SPEC runs and check for regressions, and also see where the uplift comes from.
But yeah, in the mean time I am interested to know if you see any potential in this.

The internal embedded benchmarks I tried had some pretty wild swings in both directions. I think it is worth working towards this, if we can try and minimize the regressions in the process. Running more benchmarks from SPEC and perhaps the llvm-test-suite would be good (maybe try and see what is going on in salsa, for example? We might be getting the costs of scalar rotates/funnel shifts incorrect).
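
For what it's worth, the rotate pattern I would expect salsa-style kernels to be full of looks like this (my own example, not code from the benchmark):

  // Rotate-left by a constant amount (assumes 0 < k < 32, as in salsa20's
  // fixed rotations). LLVM canonicalises this pattern to the llvm.fshl.i32
  // intrinsic, so a wrong cost for funnel shifts would skew the
  // scalar-vs-vector comparison for such loops.
  static inline unsigned rotl32(unsigned x, unsigned k) {
    return (x << k) | (x >> (32 - k));
  }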

There might be quite a few other cases. I can try and provide some examples if I can extract them.

I have tried SPEC2017 INT and SPEC FP, and the LLVM test-suite:

  • As we already knew, there's only 1 change in SPEC INT, and that is x264, which is an uplift.
  • In SPEC FP, there is 1 change, a minor regression in 510.parest. It's small, but it's definitely there.
  • Nothing stands out in the llvm test-suite, and I don't see a regression in salsa.

For SPEC INT and FP, overall it is a tiny win, but I wouldn't object to calling it neutral: the uplift is 2.5% and the regression 1.5%.
So, the way I look at this at the moment is that this is more of an enabler.

About the regression: I looked at the 2 hottest functions in 510.parest, which together account for 55% of the runtime, and I see this pattern repeated in different places in both functions:

Before:

54c16c:       2f00e400        movi    d0, #0x0
54c170:       2f00e401        movi    d1, #0x0
54c18c:       6d7f8c02        ldp     d2, d3, [x0, #-8]
54c19c:       fc637964        ldr     d4, [x11, x3, lsl #3]
54c1a0:       fc647965        ldr     d5, [x11, x4, lsl #3]
54c1a4:       1f420080        fmadd   d0, d4, d2, d0
54c1a8:       1f4304a1        fmadd   d1, d5, d3, d1

After:

54e3c8:       6f00e400        movi    v0.2d, #0x0
54e3e4:       3cc10601        ldr     q1, [x16], #16
54e3f0:       fc627962        ldr     d2, [x11, x2, lsl #3]
54e3f4:       8b030d63        add     x3, x11, x3, lsl #3
54e3f8:       4d408462        ld1     {v2.d}[1], [x3]
54e3fc:       4e61cc40        fmla    v0.2d, v2.2d, v1.2d

I think this must be responsible for the minor regression. It is not terribly wrong, but I suspect it's just this high-latency LD1 variant that keeps this SLP-vectorised code from being faster. This is funny, because we looked at the cost modelling for this LD1 recently in D141602, but I had not spotted this LD1 here in 510.parest. I am curious to know why the SLP vectoriser thinks this is beneficial, so I will look into that.
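
To make the pattern concrete, the scalar loop plausibly looks something like this; this is a reconstruction from the disassembly above with made-up names, not the actual 510.parest source:

  // Two independent multiply-accumulate chains over a contiguous values[]
  // array and an indirectly indexed x[]. In the "After" code the SLP
  // vectoriser pairs them up, but the second x[] element has to be
  // inserted with the indexed LD1, which is the high-latency part.
  double acc0 = 0.0, acc1 = 0.0;
  for (unsigned k = 0; k + 1 < n; k += 2) {
    acc0 += values[k]     * x[cols[k]];     // scalar: ldr d + fmadd
    acc1 += values[k + 1] * x[cols[k + 1]]; // scalar: ldr d + fmadd
  }
  double sum = acc0 + acc1;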

In the meantime, I was curious if you had additional thoughts on this.

I was expecting that to be similar to the other fmul+fadd vs fma issue we have seen elsewhere, but I'm not sure it is. Does it reduce the value to a single element again?

As for a few examples, this case is a little worse. I'm not sure if it is using bad costs, but the SLP vectorization seems to reach through a phi, and the result when put through llc is mostly just more instructions: https://godbolt.org/z/z6hnEcPG1.
Another case, which is perhaps simpler, is this "distance" one from cmsisdsp: https://godbolt.org/z/1xz7GP3fM. It looks like something might be getting scalarized, but again I've not looked into the details.
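
For reference, I would expect the distance kernel to boil down to something like the following (a paraphrase of the usual Euclidean-distance shape, not the CMSIS-DSP source; the function name is made up):

  #include <math.h>

  float distance_f32(const float *a, const float *b, unsigned n) {
    float sum = 0.0f;
    for (unsigned i = 0; i < n; ++i) {
      float d = a[i] - b[i];
      sum += d * d; // if the vectorised reduction scalarises here, the
                    // insert/extract costs decide whether it pays off
    }
    return sqrtf(sum);
  }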

There are some nice improvements too, if we can get the regressions hopefully fixed.

Thank you so much for these reproducers. They are the best and most concrete reproducers I've got so far, so I will be looking into them.

ktkachov added inline comments.
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
216–217

Beyond David's comments, I'll note that there are other microarchitectures that are similar to the N1 and so should benefit from this, particularly the Cortex-A76 and later CPUs in that family.

dmgreen added inline comments. · Feb 20 2023, 4:44 AM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
216–217

Yeah - I would say that we should be changing this globally for all cores under aarch64, not just some. As in, we change the default value to 2. (In the tests I ran, the in-order cores actually did better than the out-of-order ones!) I believe this cost is more about controlling the code quality of SLP code than the actual cost of inserts/extracts. We need to get the performance better overall first, though.

Allen added a subscriber: Allen. · May 23 2023, 1:18 AM