This is an archive of the discontinued LLVM Phabricator instance.

[TTI][AArch64] Cost model insertelement and indexed LD1 instructions
ClosedPublic

Authored by SjoerdMeijer on Jan 12 2023, 4:54 AM.

Details

Summary

The insertelement IR instruction can lead to different codegen; there are quite a few applicable variants. One option is to generate an INS, the "ASIMD insert, element to element" instruction. This is actually a cheap instruction, as it only has a latency of 2 on modern cores like the N1, N2 and V1. Currently we model this with a cost of 3, which is perhaps slightly higher than needed, but that is for another time.

This patch is about another variant: an indexed LD1, the "ASIMD load, 1 element, one lane, B/H/S" instruction, which loads a value and inserts it as an element into a vector. This is an expensive instruction, with a latency of 8 on modern cores. We generate an indexed LD1 when an insertelement instruction has a load as an operand, and this patch recognises that, assigning a cost of 4 to this type of insertelement instruction and making it a bit more expensive than the 3 it was before. The new cost of 4 is fairly arbitrary, but the point is to make this variant more expensive.
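For illustration, here is a minimal C++ sketch of the pattern match described above. It is a hypothetical standalone helper, not the committed diff; the real change lives in the AArch64 TTI insert/extract cost logic, and the hardcoded base cost of 3 is an assumption taken from the description.

    #include "llvm/IR/Instructions.h"
    #include "llvm/Support/InstructionCost.h"
    using namespace llvm;

    // Hypothetical helper: if the scalar being inserted comes straight
    // from a load, codegen will likely fold the load + insert into an
    // indexed LD1, which is costlier than a plain INS on modern cores.
    static InstructionCost getInsertElementCostSketch(const Instruction *I) {
      const InstructionCost BaseCost = 3; // assumed insert/extract base cost
      if (I && I->getOpcode() == Instruction::InsertElement)
        // For `insertelement <vec>, <scalar>, <idx>`, operand 1 is the scalar.
        if (isa<LoadInst>(I->getOperand(1)))
          return BaseCost + 1; // load feeds the lane: expect an indexed LD1
      return BaseCost;         // plain element-to-element insert (INS)
    }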

Diff Detail

Unit Tests: Failed

Event Timeline

SjoerdMeijer created this revision.Jan 12 2023, 4:54 AM
SjoerdMeijer requested review of this revision.Jan 12 2023, 4:54 AM
Herald added a project: Restricted Project.Jan 12 2023, 4:54 AM

Do you have any performance results?

You want to increase the cost? I had actually tried to do the opposite a little while ago, as there are some routines that would benefit from doing single lane load/inserts, so that the rest of the routine could vectorize nicely. It is one instruction after all, but I couldn't justify it in general because of the high cost.
The costs these routines return measure reciprocal throughput by default, not latency. The operation seems to take two pipelines though, so a high cost would not be unreasonable.

Good questions! Let me provide a bit more context, which I indeed didn't provide.

I am looking at a few cases where the SLP vectoriser is too smart for its own good: it kicks in and generates vector code, but that tanks performance. It's exactly what you mentioned:

single lane load/inserts, so that the rest of the routine could vectorize nicely.

It enables vectorisation, and the result looks like well-vectorised code, but it executes slower than scalar code. There are two problems I have seen so far. First, some instructions are expensive; this indexed LD1 is an example of that (note: not the INS, whose cost, as I mentioned in the description, can probably be lowered). Second, the SLP-vectorised code creates dependency chains, resulting in a lot of backend stalls. A hypothetical reduced example of the single-lane load/insert pattern follows below.
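This is my own illustration, not a test case from this review: because the four loads below are not consecutive, if the SLP vectoriser decides to vectorise the adds it has to build the input vectors with per-lane load+insert steps, which can lower to indexed LD1 instructions on AArch64.

    // Hypothetical reduced example: gather-like loads feeding packed adds.
    void f(float *out, const float *b, const float *c, const int *idx) {
      out[0] = b[idx[0]] + c[idx[0]];
      out[1] = b[idx[1]] + c[idx[1]];
      out[2] = b[idx[2]] + c[idx[2]];
      out[3] = b[idx[3]] + c[idx[3]];
    }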

A recently raised SLP performance bug that is very similar to my problems is this one:

https://github.com/llvm/llvm-project/issues/59867

It's not exactly the same thing I was looking at, but it is very similar and a good proxy, I think. Perhaps I should raise a few reproducers as performance bugs upstream, but I have good hopes that solving that problem also solves mine. So, summarising, I am on a little mission to rein the SLP vectoriser back in, and this bit of cost modelling potentially helps with that. It does not solve PR59867; that seems to be something going wrong in the SLP vectoriser itself.

About performance results: hand-written micro-benchmarks clearly show the benefit of not generating this particular LD1 variant. I also ran some benchmarking that was neutral, but I am not sure this variant was generated or performance critical there, so I will see if I can do a bit more. Beyond that, this just seemed the right thing to do, because it is an expensive instruction.

I have in the past considered the cost from getVectorInstrCost to be less about the exact cost of moving values into vector lanes, and more about how much silliness you are willing to put up with from the SLP vectorizer.

If you have test cases that could be added for the SLP vectorizer, that would be useful too.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2222

It might be worth using ST->getVectorInsertExtractBaseCost() + 1
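For context, the suggestion amounts to something like the following fragment; this is a sketch that assumes I and ST are in scope, as they are in the surrounding cost helper, and is not quoted from the diff.

    // Derive the load+insert cost from the subtarget's tunable base cost
    // instead of hardcoding 4; the "+ 1" is the indexed-LD1 penalty.
    if (I && isa<LoadInst>(I->getOperand(1)))
      return ST->getVectorInsertExtractBaseCost() + 1;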

I am struggling with my motivating example for the SLP vectoriser. There are multiple things going on, and this change doesn't seem to make a difference there.
So for now, this "looks more correct" to me, as indexed LD1s are expensive. If you're not convinced, I am happy to park this for now.

I'm happy for the change to go in, I think. It changes the total cost of load+insert from 1+3 to 1+4, which isn't a huge increase. It sounds OK to me, although it was already fairly high before, and there is a chance we might want to adjust it in the future if we have good reason.

Can you change it to ST->getVectorInsertExtractBaseCost() + 1?

I'm happy for the change to go in, I think. It changes the total cost of load+insert from 1+3 to 1+4, which isn't a huge increase. It sounds OK to me, although it was already fairly high before, and there is a chance we might want to adjust it in the future if we have good reason.

Ok, cheers, sounds good.
I will follow up today/tomorrow, as I think we can lower the cost for a "MOV v1[1], v2[1]" type of element insertion; these are actually cheap instructions on modern cores. I think that would reflect things a lot better: it would make the indexed LD1 (cost 4) twice as expensive as a MOV/INS (cost 2, down from the current 3). But I will run SPEC today to make sure I am not missing anything.

Can you change it to ST->getVectorInsertExtractBaseCost() + 1?

Ah, sorry, I thought I had; I must have uploaded the wrong diff.

This time with ST->getVectorInsertExtractBaseCost() + 1

FYI: D142359, for the "ASIMD insert, element to element" instruction.

Matt added a subscriber: Matt.Jan 23 2023, 12:57 PM
dmgreen accepted this revision.Jan 24 2023, 7:43 AM

LGTM, thanks.

This revision is now accepted and ready to land.Jan 24 2023, 7:43 AM
This revision was landed with ongoing or failed builds.Feb 9 2023, 8:28 AM
This revision was automatically updated to reflect the committed changes.