This is an archive of the discontinued LLVM Phabricator instance.

[X86] Rewrite `getScalarizationOverhead()`
ClosedPublic

Authored by lebedev.ri on Nov 13 2022, 12:29 PM.

Details

Summary

All of our insert/extract ops work on 128-bit lanes.

For Insert, we need to extract affected 128-bit lane,
unless it's being fully overwritten (FIXME: do we need to be
careful about legalization-induced padding that we obviously don't demand?),
perform insertions, and then insert the 128-bit lane back.

But hold on. If we are operating on an 256-bit legal vector,
and thus have two 128-bit subvectors, and are fully overwriting them both,
we don't actually need to insert *both* subvectors,
only the second one, into the implicitly-widened first one.

Also, Insert wasn't actually querying the costs,
but just assuming them to be 1.

getShuffleCost(TTI::SK_ExtractSubvector) notes:

// Note that in general, the insertion starting at the beginning of a vector
// isn't free, because we need to preserve the rest of the wide vector.

... so as far as i can tell, we didn't account for that.

I was hoping this would allow vectorization at a higher VF at one case i looked at,
but the subvector insertion cost is still dis-advising that.

The change for Extract is NFC, and is for consistency only,
i wanted to get rid of of that weird explicit discounting of insertion of 0'th element,
since the general code should already deal with that.

Diff Detail

Event Timeline

lebedev.ri created this revision.Nov 13 2022, 12:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 13 2022, 12:29 PM
lebedev.ri requested review of this revision.Nov 13 2022, 12:29 PM
lebedev.ri edited the summary of this revision. (Show Details)Nov 13 2022, 12:32 PM

Sorry - running out of time, I'll have to come back to this tomorrow

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4433

Do we actually need to do a costly popcnt? Are we doing anything other than checking for zero / allones?

lebedev.ri marked an inline comment as done.Nov 13 2022, 12:45 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4433

This whole logic is structured around the fact that we might only insert *some* elements.

lebedev.ri marked an inline comment as not done.Nov 13 2022, 12:48 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4433

Err, actually, right, we don't. Let me fix that...

lebedev.ri marked an inline comment as done.

Thanks for taking a look!

A bit more refactoring...

please can you recreate the diffs with context?

RKSimon added inline comments.Nov 14 2022, 4:35 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4430–4431

This confused me - we might need a better term that CostValue now (NumLegalVectors?)

lebedev.ri marked an inline comment as done.

Upload with full context in tests too.
Thanks for taking a look!

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4430–4431

Yeah, i was wondering if i should make naming more correct, but made the wrong choice.

lebedev.ri added inline comments.Nov 14 2022, 6:38 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4460–4475

I'm a bit worried about legalization-induced padding,
that we obviously don't demand yet don't care about,
but i will look into that afterwards.

Thanks - almost there! Please can you tidyup the patch summary?

lebedev.ri edited the summary of this revision. (Show Details)Nov 14 2022, 8:24 AM

Thanks - almost there! Please can you tidyup the patch summary?

Is this better?

I think the FIXME would be more useful if it was added to the code than forgotten in a commit message?

Move FIXME into comment.

RKSimon accepted this revision.Nov 15 2022, 10:02 AM

LGTM - cheers

This revision is now accepted and ready to land.Nov 15 2022, 10:02 AM

LGTM - cheers

Thank you for the review!

This revision was automatically updated to reflect the committed changes.
vdmitrie added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

I just came across this patch and I'd like to ask some clarification question as I'd like to better understand the intent.
A quick remark before the question. In vectorizer term "lane" is typically mapped into individual vector element while when referencing memory/cache that is another thing. So in order to avoid confusion I will be referencing vector element explicitly and use "sub-vector" instead of "Lane" here.

Lets consider an example where we have 16 elements in of 32bits type each, like <16 x f32 >.
On avx512 architecture that will be just single legal vector which consists of four sub-vectors of < 4 x f32 > types.
Now consider two cases:

  1. we are building a full vector inserting each element of the legal vector
  2. we are inserting some of elements in every sub-vector in the legal vector (let it be each odd element for simplicity)

in the case 1, as we are building the entire vector, we don't have any upper elements to preserve so we don't really need insert there. So I assume that this is the case that supposed to be targeted (to be precise this should be not the entire vector build but rather all the upper sub-vectors build)

in case 2 we have to preserve some elements of upper sub-vectors. If I'm not mistaken It means that we still need to use insert for lowest sub-vector as well in order to preserve these elements (the upper part of the legal vector).
But as it done in this patch: AffectedLanes will be {1,1,1,1} as each sub-vector has a couple of elements to be inserted (preserving the other two). Derived form that FullyAffectedLegalVector[0] will also be '1' in that case and we will not account for insert of the lowest sub-vector. Is than done intentionally or this is just a mistake?
Should we instead to check that we have nothing to preserve in any of the upper sub-vectors?

lebedev.ri added inline comments.Dec 7 2022, 12:12 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

I just came across this patch and I'd like to ask some clarification question as I'd like to better understand the intent.

A quick remark before the question. In vectorizer term "lane" is typically mapped into individual vector element while when referencing memory/cache that is another thing. So in order to avoid confusion I will be referencing vector element explicitly and use "sub-vector" instead of "Lane" here.

Lets consider an example where we have 16 elements in of 32bits type each, like <16 x f32 >.

On avx512 architecture that will be just single legal vector which consists of four sub-vectors of < 4 x f32 > types.

Now consider two cases:

  1. we are building a full vector inserting each element of the legal vector
  2. we are inserting some of elements in every sub-vector in the legal vector (let it be each odd element for simplicity)

(How to boil water: pour water into kettle, boil.
how to boil water in half-full kettle: remove water, and see original recipe.)

These two cases are equivalent, or i do not understand the question.
Note that all of the logic here operates on XMM vectors.

As long as XMM subvector is touched by an element insertion,
and not elements are overridden, we need to first extract said subvector,
insert into it, and then insert the subvector back
into the original legal vector.

In case 2., you said that every XMM subvector receives some elements,
so what subvector would we need to preserve, if all of them are affected?

RKSimon added inline comments.Dec 7 2022, 12:34 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

@vdmitrie I'm struggling to follow to your description - maybe it would be easier if you create a godbolt link with code examples please?

vdmitrie added inline comments.Dec 7 2022, 12:45 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

(How to boil water: pour water into kettle, boil.
how to boil water in half-full kettle: remove water, and see original recipe.)

Let's be serious and put the kettle aside)

These two cases are equivalent, or i do not understand the question.

Not really. They are not equivalent IMO. See below.

Note that all of the logic here operates on XMM vectors.

As long as XMM subvector is touched by an element insertion,
and not elements are overridden, we need to first extract said subvector,
insert into it, and then insert the subvector back
into the original legal vector.

in case 1 we don't need to extract any of the upper sub-vectors. That's the point.

In case 2., you said that every XMM subvector receives some elements,
so what subvector would we need to preserve, if all of them are affected?

Look, the question here is whether we need to insert the lowest (0th) sub-vector back into the original vector. How you are going to write back to 0th sub-vector in the 512bits vector while preserving its upper part? I hope it is clear why we need to preserve the upper part of the vector while inserting sub-vector 0. We need to insert an xmm into lower part of a zmm. Right? That is case 2
EVEX vmovapd xmm,xmm seems to preserve the upper part, so I think you are right, but I just wanted this to be confirmed explicitly that is what is assumed in this cost modeling.

Now for case1, when you have all the upper elements inserted (all of 4th through 15th) it means that we can set elements in xmm0 for example and than insert all upper sub-vectors (i.e. 1,2,3) into zmm0. Right? So we really can omit insertion of the 0th sub-vector.

lebedev.ri added inline comments.Dec 7 2022, 12:47 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

+1 to what @RKSimon said.
Do we affect (insert into) every XMM subvector or not?

vdmitrie added inline comments.Dec 7 2022, 12:56 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

+1 to what @RKSimon said.
Do we affect (insert into) every XMM subvector or not?

yes. I'd like to avoid using term XMM for every sub-vector because an xmm register actually represents only sib-vector 0.

Here is what I was trying to describe:
{w, p, w, p} {w, p, w, p} {w, p, w, p} {w, p, w, p}
subv3 subv2 subv1 subv0

"w" - we write to a 512bits vector element, "p" - we preserve the original value of an element.

lebedev.ri marked 4 inline comments as done.Dec 7 2022, 1:00 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

As you noted, in the code here, lane means 128-bit subvector.
In that example, every 128-bit subvector of the legal vector is being affected,
which means we need to extract every every 128-bit subvector of the legal vector,
and afterwards the situation becomes the same as in the example 1.
Kettle example is relevant here.

vdmitrie added inline comments.Dec 7 2022, 1:09 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

As you noted, in the code here, lane means 128-bit subvector.
In that example, every 128-bit subvector of the legal vector is being affected,
which means we need to extract every every 128-bit subvector of the legal vector,
and afterwards the situation becomes the same as in the example 1.
Kettle example is relevant here.

So, you mean that we first do extracts and keep every sub-vector on it's own xmm register?
That assumes increased register pressure. So "the kettle" is irrelevant in this regard :)

vdmitrie added inline comments.Dec 7 2022, 1:53 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

@vdmitrie I'm struggling to follow to your description - maybe it would be easier if you create a godbolt link with code examples please?

https://godbolt.org/z/5qWv4TPq8

lebedev.ri marked 2 inline comments as done.Dec 7 2022, 1:55 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

So, you mean that we first do extracts and keep every sub-vector on it's own xmm register?

That is the cost model, yes.

vdmitrie added inline comments.Dec 7 2022, 3:10 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

That is the cost model, yes.

It does not seem to reflect actual code generation. So it is kind of cheating.

lebedev.ri marked 3 inline comments as done.Dec 7 2022, 3:32 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4470

Unpopular opinion: cost modelling in llvm is completely and utterly broken.
As long as we have more than a single source of truth, we will get it wrong.