This is an archive of the discontinued LLVM Phabricator instance.

[X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`
ClosedPublic

Authored by lebedev.ri on Oct 8 2021, 1:00 PM.

Details

Summary

We ask TTI.getAddressComputationCost() about the cost of computing a vector address,
and then multiply it by the vector width. This doesn't make any sense;
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR: we perform scalar GEPs.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because X86TTIImpl::getAddressComputationCost()
says that the cost of vector address computation is 10, as compared to 1 for scalar.

The computed costs are similar to the ones with D111222+D111220,
but this way we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck (D111363).
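
To illustrate, the problematic computation is roughly the following (a paraphrased sketch of LoopVectorizationCostModel::getMemInstScalarizationCost, not the exact upstream code; helper names are approximate):

// Paraphrased sketch of LoopVectorizationCostModel::getMemInstScalarizationCost.
Value *Ptr = getLoadStorePointerOperand(I);
// The *vector-of-pointers* type is what gets passed down to TTI ...
Type *PtrTy = ToVectorTy(Ptr->getType(), VF);
const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, PSE, TheLoop);

// ... and the resulting per-address cost is then scaled by VF, i.e. the cost
// of a *vector* address computation is charged once per scalar lane.
InstructionCost Cost =
    VF.getKnownMinValue() *
    TTI.getAddressComputationCost(PtrTy, PSE.getSE(), PtrSCEV);
// (plus the cost of the scalar memory ops and the insert/extract overhead)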

Diff Detail

Event Timeline

lebedev.ri created this revision.Oct 8 2021, 1:00 PM

This looks like the same patch as D93129. I don't believe that patch was correct and I don't really believe this is, but the code is super unintuitive and really not well explained anywhere. Let me try and describe it, as far as I understand things.

This just changes the type passed to getAddressComputationCost. That function is called from:

  • LoopVectorizationCostModel::getMemoryInstructionCost with a scalar type and no SCEV for the cost of a scalar load/store.
  • LoopVectorizationCostModel::getGatherScatterCost with a vector type and no SCEV for the cost of a gather/scatter
  • LoopVectorizationCostModel::getUniformMemOpCost with a scalar type and no SCEV for a splatted load/store.
  • LoopVectorizationCostModel::getMemInstScalarizationCost in this case, with a vector type and a SCEV. Multiplied by VF.

So "vector type plus SCEV" means scalarized load/stores. The actual implementation of getAddressComputationCost then is:

InstructionCost X86TTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  const unsigned NumVectorInstToHideOverhead = 10;

  // Cost modeling of Strided Access Computation is hidden by the indexing
  // modes of X86 regardless of the stride value. We dont believe that there
  // is a difference between constant strided access in gerenal and constant
  // strided value which is less than or equal to 64.
  // Even in the case of (loop invariant) stride whose value is not known at
  // compile time, the address computation will not incur more than one extra
  // ADD instruction.
  if (Ty->isVectorTy() && SE) {
    if (!BaseT::isStridedAccess(Ptr))
      return NumVectorInstToHideOverhead;
    if (!BaseT::getConstantStrideStep(SE, Ptr))
      return 1;
  }

  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}

So the NumVectorInstToHideOverhead is designed specifically for this case - to increase the cost for scalarized loads/stores that are unlikely to be worth vectorizing due to the "significantly decreased throughput". If you want to change that - changing how X86TTIImpl::getAddressComputationCost works is likely your best bet. Otherwise you are just turning that into dead code. The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.
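
For reference, the generic fallback is essentially just the following (paraphrased from BasicTTIImpl.h):

// BasicTTIImplBase::getAddressComputationCost, paraphrased: the generic
// implementation ignores both the type and the SCEV and treats address
// computation as free.
InstructionCost getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                          const SCEV *Ptr) {
  return 0;
}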

Like I say, this isn't explained very well, but as far as I can see it is "working as intended". The vectorizer asks the backend what the overhead of scalarized loads/stores is, and the backend has answered. Even if it is a strange way to ask.

I've not seen a lot of evidence that disabling the equivalent code for ARM/AArch64 is a generally good idea, even if there are individual cases where it is an improvement. There is no scev for vectors, and so no LSR for vector addresses and pulling address values out of vectors can be very slow. Perhaps that can be improved over time.

@dmgreen thank you for taking a look!

This looks like the same patch as D93129.

Ah, indeed.

I don't believe that patch was correct and I don't really believe this is, but the code is super unintuitive and really not well explained anywhere. Let me try and describe it, as far as I understand things.

This just changes the type passed to getAddressComputationCost. That function is called from:

  • LoopVectorizationCostModel::getMemoryInstructionCost with a scalar type and no SCEV for the cost of a scalar load/store.
  • LoopVectorizationCostModel::getGatherScatterCost with a vector type and no SCEV for the cost of a gather/scatter
  • LoopVectorizationCostModel::getUniformMemOpCost with a scalar type and no SCEV for a splatted load/store.
  • LoopVectorizationCostModel::getMemInstScalarizationCost in this case, with a vector type and a SCEV. Multiplied by VF.

So "vector type plus SCEV" means scalarized load/stores. The actual implementation of getAddressComputationCost then is:

InstructionCost X86TTIImpl::getAddressComputationCost(Type *Ty,
                                                      ScalarEvolution *SE,
                                                      const SCEV *Ptr) {
  // Address computations in vectorized code with non-consecutive addresses will
  // likely result in more instructions compared to scalar code where the
  // computation can more often be merged into the index mode. The resulting
  // extra micro-ops can significantly decrease throughput.
  const unsigned NumVectorInstToHideOverhead = 10;

  // Cost modeling of Strided Access Computation is hidden by the indexing
  // modes of X86 regardless of the stride value. We dont believe that there
  // is a difference between constant strided access in gerenal and constant
  // strided value which is less than or equal to 64.
  // Even in the case of (loop invariant) stride whose value is not known at
  // compile time, the address computation will not incur more than one extra
  // ADD instruction.
  if (Ty->isVectorTy() && SE) {
    if (!BaseT::isStridedAccess(Ptr))
      return NumVectorInstToHideOverhead;
    if (!BaseT::getConstantStrideStep(SE, Ptr))
      return 1;
  }

  return BaseT::getAddressComputationCost(Ty, SE, Ptr);
}

So the NumVectorInstToHideOverhead is designed specifically for this case - to increase the cost for scalarized loads/stores that are unlikely to be worth vectorizing due to the "significantly decreased throughput". If you want to change that - changing how X86TTIImpl::getAddressComputationCost works is likely your best bet. Otherwise you are just turning that into dead code. The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.

Like I say, this isn't explained very well, but as far as I can see it is "working as intended". The vectorizer asks the backend what the overhead of scalarized loads/stores is, and the backend has answered. Even if it is a strange way to ask.

I see. While it does really look that way, I maintain that this is not intentional and just a bug/leftover from the D27919 rework.
To know for sure we'd have to ask the patch's author (@delena), but there is no recent activity on that account, so I don't really expect to get an answer.

I've not seen a lot of evidence that disabling the equivalent code for ARM/AArch64 is a generally good idea, even if there are individual cases where it is an improvement.

There is no scev for vectors, and so no LSR for vector addresses and pulling address values out of vectors can be very slow. Perhaps that can be improved over time.

Sure. But I'm not really following. Where do we pull address values out of vectors?
This approach (as opposed to D111220) specifically does not result in any vector GEPs.

Matt added a subscriber: Matt.Oct 10 2021, 6:34 AM

I see. While it does really look that way, I maintain that this is not intentional and just a bug/leftover from the D27919 rework.
To know for sure we'd have to ask the patch's author (@delena), but there is no recent activity on that account, so I don't really expect to get an answer.

It sure has changed over time, but the use of getAddressComputationCost before and after that patch looks the same to me. Only the call inside "Legal->memoryInstructionMustBeScalarized(I, VF)" was passing a SCEV through to getAddressComputationCost.

There is no scev for vectors, and so no LSR for vector addresses and pulling address values out of vectors can be very slow. Perhaps that can be improved over time.

Sure. But I'm not really following. Where do we pull address values out of vectors?
This approach (as opposed to D111220) specifically does not result in any vector GEPs.

Something like this code is what I was looking at: https://godbolt.org/z/sj57EaGzc. Hopefully that does enough extra work to overcome the additional cost with this patch, but uses a pragma in that example (I believe it produces the same code, without legal gather/scatters). It does the address computation for the store in vectors, and then transfers those over to scalars.

I see. While it does really look that way, I maintain that this is not intentional and just a bug/leftover from the D27919 rework.
To know for sure we'd have to ask the patch's author (@delena), but there is no recent activity on that account, so I don't really expect to get an answer.

It sure has changed over time, but the use of getAddressComputationCost before and after that patch looks the same to me. Only the call inside "Legal->memoryInstructionMustBeScalarized(I, VF)" was passing a SCEV through to getAddressComputationCost.

There is no scev for vectors, and so no LSR for vector addresses and pulling address values out of vectors can be very slow. Perhaps that can be improved over time.

Sure. But I'm not really following. Where do we pull address values out of vectors?
This approach (as opposed to D111220) specifically does not result in any vector GEPs.

Something like this code is what I was looking at: https://godbolt.org/z/sj57EaGzc. Hopefully that does enough extra work to overcome the additional cost with this patch, but uses a pragma in that example (I believe it produces the same code, without legal gather/scatters). It does the address computation for the store in vectors, and then transfers those over to scalars.

Aha, that is a great point! Posted D111546.

lebedev.ri retitled this revision from [X86][LoopVectorize] Fix `LoopVectorizationCostModel::getMemInstScalarizationCost()` to [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`.
lebedev.ri edited the summary of this revision. (Show Details)

Rebased, ping. I believe this should now be obviously good.
I've decided to reduce the scope of the patch to only affect
X86 AVX2+, at least because we have a good model
for interleaving shuffles in AVX2 (hardcoded) and AVX512.

pengfei requested changes to this revision.Oct 16 2021, 5:22 AM
pengfei added reviewers: wxiao3, hsaito.

Sorry, I don't think favoring interleaved loads over gather is always correct. Adding more people who may be interested in this.

This revision now requires changes to proceed.Oct 16 2021, 5:23 AM
lebedev.ri requested review of this revision.Oct 16 2021, 5:24 AM

Sorry, I don't think favoring interleaved loads over gather is always correct.

Where am I doing that?

Adding more people who may be interested in this.

Sorry, I don't think favoring interleaved loads over gather is always correct.

Where am I doing that?

Don't the comments in X86TargetTransformInfo.cpp say this patch is a hack to do that? Or did I just understand it the other way around?

Sorry, I don't think favoring interleaved loads over gather is always correct.

Where am I doing that?

Don't the comments in X86TargetTransformInfo.cpp say this patch is a hack to do that? Or did I just understand it the other way around?

It's precisely the other way around.

Sorry, I don't think favoring interleaved loads over gather is always correct.

Where am I doing that?

Don't the comments in X86TargetTransformInfo.cpp say this patch is a hack to do that? Or did I just understand it the other way around?

It's precisely the other way around.

Awkward :( Sorry for that. Anyway, I want to hold for a while until I'm confident with this patch.

Sorry, I don't think favoring interleaved loads over gather is always correct.

Where am I doing that?

Don't the comments in X86TargetTransformInfo.cpp say this patch is a hack to do that? Or did I just understand it the other way around?

It's precisely the other way around.

Awkward :( Sorry for that.

No, I agree, this is really confusing.

Anyway, I want to hold for a while until I'm confident with this patch.

Rebased, NFC.

RKSimon added inline comments.Oct 17 2021, 8:20 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3858

et?

3860

"TODO: AVX2 is the current cut-off......"

lebedev.ri marked 2 inline comments as done.

Address nits.

No, but actually this time.

@lebedev.ri , I saw you mentioned this patch on D111942. Could you explain what's the relationship between this one and the interleaving cost patches?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

It would be more productive if it included the reason why we can't change it right away.
The lack of such an explanation led to the original version of this patch, which repeated D93129.

@lebedev.ri , I saw you mentioned this patch on D111942. Could you explain what's the relationship between this one and the interleaving cost patches?

It's mostly tangential; the longer this patch drags out, the more tests may need to be updated, which isn't fun.

The naive interleaving sequence is: wide load + extracts + inserts,
while the scalarized gather sequence is: narrow loads + inserts.

Right now the "gather" sequence is modelled as having an artificially high cost
(which is what this patch changes), effectively prohibiting its use for vectorization.
But afterwards, naturally, the cost of the scalarized gather sequence drops,
becoming more attractive for vectorization, sometimes even more attractive than
the naive interleaving sequence, as you can see in interleaved-load-i16-stride-5.ll.

But in general, we can assume that a sequence of shuffles may be better than
a sequence of extracts+inserts, so we may be better off with interleaving
rather than gather. But we can't really improve the cost modelling for the
naive interleaving sequence (because we can't really properly cost-model shuffles),
so what we have to do is hardcode the costs for the interesting interleaving tuples
in the cost tables, like I have done in those 90+ patches.
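
To give an idea of what that hardcoding looks like, the shape is roughly the following (illustrative only; the entry values are made up and the surrounding variables such as Factor, ETy, and MemOpCosts are assumed, this is not the real table from X86TargetTransformInfo.cpp):

// Each entry maps (interleave factor, legalized vector type) to a hardcoded
// cost for the whole load + deinterleave shuffle sequence.
// Values below are invented for illustration.
static const CostTblEntry AVX2InterleavedLoadTbl[] = {
    {2, MVT::v8i32, 4}, // hypothetical cost of a factor-2 deinterleave
    {3, MVT::v8i32, 9}, // hypothetical cost of a factor-3 deinterleave
};

// MemOpCosts is assumed to hold the cost of the wide vector load(s) feeding
// the shuffles; the table only adds the shuffling overhead on top of it.
if (const auto *Entry =
        CostTableLookup(AVX2InterleavedLoadTbl, Factor, ETy.getSimpleVT()))
  return MemOpCosts + Entry->Cost;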

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

I'm open to suggestions as to how this should be amended.

Thanks for the explanation! Apologies for my silly questions; I'm still learning in this realm :)

It's mostly tangential; the longer this patch drags out, the more tests may need to be updated, which isn't fun.

I don't get the point; why does a tangential patch result in test updates?

The naive interleaving sequence is: wide load + extracts + inserts,
while the scalarized gather sequence is: narrow loads + inserts.

Right now the "gather" sequence is modelled as having an artificially high cost
(which is what this patch changes), effectively prohibiting its use for vectorization.
But afterwards, naturally, the cost of the scalarized gather sequence drops,
becoming more attractive for vectorization, sometimes even more attractive than
the naive interleaving sequence, as you can see in interleaved-load-i16-stride-5.ll.

I think the high cost might have practical consideration, e.g., impact some benchmarks etc.
What's the value to give a cost smaller than naive interleaving sequence? IIUC, you are saying we prefer naive interleaving to scalarized gather, right?

But in general, we can assume that a sequence of shuffles may be better than
a sequence of extracts+inserts, so we may be better off with interleaving
rather than gather. But we can't really improve the cost modelling for the
naive interleaving sequence (because we can't really properly cost-model shuffles),
so what we have to do is hardcode the costs for the interesting interleaving tuples
in the cost tables, like I have done in those 90+ patches.

Do you mean this patch will enable the use of the hardcoded costs? Can we make sure we always generate the interleaving sequence, e.g., when we don't have a constant stride?

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3853–3855

Does this still fall under @dmgreen's previous comments?

The base getAddressComputationCost never returns anything other than 0 and doesn't distinguish between scalar and vector types.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

Sorry, I don't have a concrete suggestion. I have concerns because I saw disagreement from Dave and didn't see that consensus had been reached.

Thanks for the explanation! Apologies for my silly questions; I'm still learning in this realm :)

It's mostly tangential; the longer this patch drags out, the more tests may need to be updated, which isn't fun.

I don't get the point; why does a tangential patch result in test updates?

Sorry, I'm not really following the question.

The naive interleaving sequence is: wide load + extracts + inserts,
while the scalarized gather sequence is: narrow loads + inserts.

Right now the "gather" sequence is modelled as having an artificially high cost
(which is what this patch changes), effectively prohibiting its use for vectorization.
But afterwards, naturally, the cost of the scalarized gather sequence drops,
becoming more attractive for vectorization, sometimes even more attractive than
the naive interleaving sequence, as you can see in interleaved-load-i16-stride-5.ll.

I think the high cost might have practical consideration, e.g., impact some benchmarks etc.

Sure, any change might have practical considerations, in either direction.

What's the value to give a cost smaller than naive interleaving sequence?

Basically, we can't have our cake and eat it too.
We don't give the "scalarized gather sequence" a cost lower than the
cost of the "naive interleaving sequence"; we give it its real cost,
which happens to be lower than that of the "naive interleaving sequence".

IIUC, you are saying we prefer naive interleaving to scalarized gather, right?

Quote please? Define we prefer?

But in general, we can assume that a sequence of shuffles may be better than
a sequence of extracts+inserts, so we may be better off with interleaving
rather than gather. But we can't really improve the cost modelling for the
naive interleaving sequence (because we can't really properly cost-model shuffles),
so what we have to do is hardcode the costs for the interesting interleaving tuples
in the cost tables, like I have done in those 90+ patches.

Do you mean this patch will enable the use of the hardcoded costs?

Define use?

Can we make sure we always generate the interleaving sequence, e.g., when we don't have a constant stride?

Why? I don't think we should be doing anything like that. It is up to the vectorizer
to pick the best sequence, given the cost estimates provided by TTI for each one.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3853–3855

I do not understand the question.
This if-block is a hack, and the patch disables it for AVX2+.

dmgreen added inline comments.Oct 18 2021, 7:13 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

My comments were mostly for ARM/AArch64, which this patch no longer affects. I don't have as much knowledge on X86 matters and no evidence suggesting one way or another whether this is a good idea, I'm afraid.

I don't think this comment is particularly useful though. Personally, I would just remove this comment and leave the patch to the X86 backend.

I think the high cost might have practical consideration, e.g., impact some benchmarks etc.

Sure, any change might have practical considerations, in either direction.

I agree, but we should evaluate it carefully to make sure we get more good than bad, right?

IIUC, you are saying we prefer naive interleaving to scalarized gather, right?

Quote please? Define we prefer?

X86TargetTransformInfo.cpp:3853, "interleaved load is better in general in reality"

Can we make sure we always generate the interleaving sequence, e.g., when we don't have a constant stride?

Why? I don't think we should be doing anything like that. It is up to the vectorizer
to pick the best sequence, given the cost estimates provided by TTI for each one.

Only when the estimate is precise enough. Underestimating the cost of the scalarized gather sequence will fool the vectorizer in practice.
The IA optimization manual says gather/scatter is preferred; the emulated gather/scatter cost shouldn't be lower than that of real gather/scatter.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3853–3855

It is a question whether this is a hack or working as intended. It already checks both isStridedAccess and getConstantStrideStep. Shouldn't these be true for interleaved loads/stores and finally return 0, given they are constant-stride cases?
We still have to handle non-strided or variable-stride access, e.g.,

(%base, %index, scale)
add %base, %stride
(%base, %index, scale)
add %base, %stride
(%base, %index, scale)
add %base, %stride

Returning 10 for non-strided access might be too high, but returning 1 for variable stride still makes sense here.
Besides, checking for AVX2+ doesn't make sense either; many targets also have fast insert/extract support on SSE4.1 and above.

@RKSimon / @craig.topper thoughts on the diff?

I think the high cost might have practical consideration, e.g., impact some benchmarks etc.

Sure, any change might have practical considerations, in either direction.

I agree, but we should evaluate it carefully to make sure we get more good than bad, right?

Or that it is at least a step in the right direction.

IIUC, you are saying we prefer naive interleaving to scalarized gather, right?

Quote please? Define we prefer?

X86TargetTransformInfo.cpp:3853, "interleaved load is better in general in reality"

I mean, sure, a wide load + shuffles should be better than many narrow scalar loads + a chain of insertelements.
Emphasis on *should*; this is not an ultimate truth.

Can we make sure we always generate the interleaving sequence, e.g., when we don't have a constant stride?

Why? I don't think we should be doing anything like that. It is up to the vectorizer
to pick the best sequence, given the cost estimates provided by TTI for each one.

Only when the estimate is precise enough. Underestimating the cost of the scalarized gather sequence will fool the vectorizer in practice.
The IA optimization manual says gather/scatter is preferred; the emulated gather/scatter cost shouldn't be lower than that of real gather/scatter.

Yeah no, if this is what you believe we won't make progress here.

TBH I'd much prefer we focus on fixing X86 interleaved costs, writing a script and comparing different SSE levels vs CPUs like I did on D103695; even if the llvm-mca numbers aren't great, they will give us a better sense of magnitude.

Or maybe improve the fallback costs - for instance by generating costs for the BaseT getInterleavedMemoryOpCost for shuffle and scalarization patterns and returning the lowest.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

@fhahn What happened to D93129 that actually dealt with this?

TBH I'd much prefer we focus on fixing X86 interleaved costs, writing a script and comparing different SSE levels vs CPUs like I did on D103695; even if the llvm-mca numbers aren't great, they will give us a better sense of magnitude.

Or maybe improve the fallback costs - for instance by generating costs for the BaseT getInterleavedMemoryOpCost for shuffle and scalarization patterns and returning the lowest.

As I have said, these efforts are correlated, but not identical.
I'm not doing this (D111460) to affect the interleaving costs;
I'm doing this to stop reporting dumb costs for the naive gather sequence.

I would like to propose that we move on to a more productive part of the review process.
Does anyone have any concrete concerns with this?
If preferred, I could simply lift this AVX2+ restriction instead.

@pengfei like with D111546, are intel benchmarks impacted by this? in which way?

@lebedev.ri, we have been busy with other things recently. It needs more time to schedule such tests. Do you have data about the benchmark impact?

Adding some more people that may be interested in measuring performance impact before it happens.

@lebedev.ri, we have been busy with other things recently.

It needs more time to schedule such tests.

Any quantifiable estimate?

Do you have data about the benchmark impact?

Rebased, NFC.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3858–3859

This appears to be dead code: if I replace return 1; with assert(false); it does not fire on any of the tests.

pengfei resigned from this revision.Oct 27 2021, 1:42 AM

@pengfei like with D111546, are intel benchmarks impacted by this? in which way?

I checked cpu2017 on SKX, no obvious improvement/regression found.

I'll resign given I neither have strong objections nor stand for it.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3854–3859

The comments are easily misleading. I suggest we just remove the comments here, and maybe only explain why this works for targets prior to AVX2.

3858–3859

Neither does it on my tests.

3860–3861

I'm not doing this (D111460) to affect the interleaving costs,

The comments here make me think the interleaved cost is a substitute for the math here.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

Removing the comment sounds good to me.

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

I think masked gather/scatter/load/store etc. is definitely better than branches on AVX512. I think we should consider the cost of branches.

@pengfei like with D111546, are intel benchmarks impacted by this? in which way?

I checked cpu2017 on SKX, no obvious improvement/regression found.

Thank you for checking! This is mainly for the CPUs with no/poor gather support,
IOW, as it currently stands, those without AVX512. But I think SKX in LLVM
is Skylake-AVX512? So those results aren't really representative...

I'll resign given I neither have strong objections nor stand for it.

@pengfei like with D111546, are intel benchmarks impacted by this? in which way?

I checked cpu2017 on SKX, no obvious improvement/regression found.

Thank you for checking! This is mainly for the CPUs with no/poor gather support,
IOW, as it currently stands, those without AVX512. But I think SKX in LLVM
is Skylake-AVX512? So those results aren't really representative...

Yes, but I don't have a target without fast gather support.

lebedev.ri marked 11 inline comments as done.

I've added more exhaustive LV costmodel test coverage for masked mem ops; rebased.

Does anyone have further thoughts? @fhahn @RKSimon ?

@pengfei thank you for taking a look!

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3853–3855

Does this comment look better to everyone?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7058

Hm, I'd prefer to leave at least some mark here. How about this instead?

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

So, this is kind-of expected. Note, ENABLED_MASKED_STRIDED is the check prefix for the run line
with -enable-masked-interleaved-mem-accesses, which is an override
for TTI::enableMaskedInterleavedAccessVectorization; the latter returns false for X86.
This is really about vectorizing using masked memory ops when the memory ops are strided,
so further interleaving/deinterleaving is required. So we don't really support this on X86.

The actual problem here is in X86TTIImpl::getInterleavedMemoryOpCost():

if (UseMaskForCond || UseMaskForGaps)
  return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                           Alignment, AddressSpace, CostKind,
                                           UseMaskForCond, UseMaskForGaps);

So the moment we ask for the interleaving costs when there is masking involved,
the X86 cost model immediately falls back to the scalarized cost model,
while the only thing it needs to do is ask X86TTIImpl::getMaskedMemoryOpCost()
instead of X86TTIImpl::getMemoryOpCost().

I think this can be improved, but I'm not sure if this is a blocker or not.
Again, this used to work because without this patch,
the cost of the naive scalar gather is much higher than that of the naive interleaved load.
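
In other words, the shape of the improvement would be roughly this (a hypothetical sketch of the idea only, not the actual D112873 change; variable names such as SingleMemOpTy are assumed from the surrounding function):

// When the interleaved group is masked, cost the underlying wide memory
// operation as a masked one, instead of immediately bailing out to the
// scalarizing base implementation.
InstructionCost MemOpCost;
if (UseMaskForCond || UseMaskForGaps)
  MemOpCost = getMaskedMemoryOpCost(Opcode, SingleMemOpTy, Alignment,
                                    AddressSpace, CostKind);
else
  MemOpCost = getMemoryOpCost(Opcode, SingleMemOpTy, MaybeAlign(Alignment),
                              AddressSpace, CostKind);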

lebedev.ri added inline comments.Oct 28 2021, 8:15 AM
llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

I took a look, and it seems like we'd need to have yet another cost table,
and record the cost of the following 35 shuffle sequences: https://godbolt.org/z/Mv8MYGxqP

This can be done, but unless we consider it to be a prerequisite here
(note that TTI::enableMaskedInterleavedAccessVectorization() returns false for X86),
it's a bit on the edge of my interest, at least right now.

Did you ever manage to extend BaseT::getInterleavedMemoryOpCost to compare scalarization vs shuffle costs (and select the lowest)?

Did you ever manage to extend BaseT::getInterleavedMemoryOpCost to compare scalarization vs shuffle costs (and select the lowest)?

Hm, why would I/we do that?
LV already does that: https://github.com/llvm/llvm-project/blob/5d9318638e892143c7788191ca1d89445582880b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L7509-L7547
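
For context, the linked code does roughly the following (paraphrased and simplified, not the exact upstream code):

// For each non-consecutive load/store, compute the cost of every candidate
// lowering and record the cheapest one as the widening decision.
InstructionCost InterleaveCost = InstructionCost::getInvalid();
if (interleavedAccessCanBeWidened(&I, VF))
  InterleaveCost = getInterleaveGroupCost(&I, VF);

InstructionCost GatherScatterCost = isLegalGatherOrScatter(&I)
                                        ? getGatherScatterCost(&I, VF)
                                        : InstructionCost::getInvalid();

InstructionCost ScalarizationCost = getMemInstScalarizationCost(&I, VF);

// The cheapest of the three wins: CM_Interleave, CM_GatherScatter or CM_Scalarize.
InstWidening Decision;
if (InterleaveCost <= GatherScatterCost && InterleaveCost < ScalarizationCost)
  Decision = CM_Interleave;
else if (GatherScatterCost < ScalarizationCost)
  Decision = CM_GatherScatter;
else
  Decision = CM_Scalarize;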

pengfei added a subscriber: dorit.Oct 29 2021, 1:25 AM

Did you ever manage to extend BaseT::getInterleavedMemoryOpCost to compare scalarization vs shuffle costs (and select the lowest)?

Hm, why would I/we do that?
LV already does that: https://github.com/llvm/llvm-project/blob/5d9318638e892143c7788191ca1d89445582880b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L7509-L7547

I think the problem is that getMemInstScalarizationCost is considering the interleaved cases, which doesn't look totally independent to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7061–7063

I'm wondering why not, or whether we can improve this function. The comments say we can get the stride information here. If this is true, we don't need the change in X86TargetTransformInfo.cpp, right? After all, it already checks isStridedAccess and getConstantStrideStep there.

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

I think this can be improved, but i'm not sure if this is a blocker or not.

I think the ENABLED_MASKED_STRIDED checks are intentional for the masked interleave cases. See comments above the function. I would think this is a blocker. Add @dorit @Ayal who worked with the initial patches D53011 and D53559.

it seems like we'd need to have yet another cost table

Can we give a coarse-grained calculation first? I think we shouldn't break the dependencies of the existing code.

https://godbolt.org/z/Mv8MYGxqP

Why not use llvm.masked.load/store directly in the tests?

lebedev.ri added inline comments.Oct 29 2021, 1:29 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7061–7063

I don't have a clue as to what you mean here.

ping @dorit @Ayal @hsaito - for X86, is -enable-masked-interleaved-mem-accesses/TargetTransformInfo::enableMaskedInterleavedAccessVectorization()
considered to be a supported configuration that *must* not be regressed, even though it is not enabled on any upstream X86 CPU?

Rebased on top of a crude attempt at improving the AVX512-specific handling of cost modelling of masked interleaving.
It doesn't help much.

The more I look at it, the more I'm surprised it all currently appears to work somewhat.
Another obvious issue: X86TTIImpl::getGSVectorCost() does not account for the legalization cost,
especially if there is a variable mask. (X86TTIImpl::getMaskedMemoryOpCost(), OTOH, gets it right.)
Same for masked interleaved loads.

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

Attempted that in D112873; it doesn't really help all that much.

ping

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

ping @pengfei - does D112873 look satisfactory to you to unblock this? :)

pengfei added inline comments.Nov 3 2021, 2:37 AM
llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

Sorry, D112873 doesn't reduce my concerns given:

  1. I don't fully understand why the change needs to be done within getInterleavedMemoryOpCost, and what the relationship is between it and the change in getAddressComputationCost here. I'd expect the calculation of branches to happen in getAddressComputationCost, where it may have minimal side effects. Besides, the change in D112873 without a full test doesn't look good to me either.
  2. Your words "The more I look at it, the more I'm surprised it ..." make me feel like the code is fragile and one more brick will crush the tower :)

I'm OK with leaving these tests for a later patch if you don't have interest in AVX512. You can add a FIXME here and leave it to folks who have worked on this or may be interested in it in the future.

@pengfei thank you for taking a look.

llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
1450 ↗(On Diff #381590)

D112873 does exactly what you suggested in previous comments here:

it seems like we'd need to have yet another cost table

Can we give a coarse-grained calculation first? I think we shouldn't break the dependencies of the existing code.

It adds a coarse-grained cost model for the mask cost.

I don't fully understand why the change needs to be done within getInterleavedMemoryOpCost

The change in getInterleavedMemoryOpCost is obviously needed because without it,
getInterleavedMemoryOpCostAVX512() would never be called when there is masking.

and what the relationship is between it and the change in getAddressComputationCost here

Well, in principle they are both standalone.

I'd expect the calculation of branches to happen in getAddressComputationCost, where it may have minimal side effects.

I'm not really following.

Besides, the change in D112873 without a full test doesn't look good to me either.

There is an integration test for that - notice how the regression in llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll is gone in this patch.

Your words "The more i look at it, the more i'm surprised it ..." make me feel like the code is fragile and one more brick will crush the tower :)

True, but the alternative of not touching that ever is not really acceptable either.

lebedev.ri updated this revision to Diff 384452.Nov 3 2021, 8:24 AM

Rebasing (newly added test needed updating).

Okay!
It took some time, but the cost model for AVX512's masked interleaved accesses has been completed.
I believe this addresses all the review notes here.
Does anyone have any further closing remarks?
If not, I will be landing this shortly.

Rebased again.

I believe this addresses all the review notes here.
Does anyone have any further closing remarks?
If not, I will be landing this shortly.

I can't predict if the new shuffle patterns will be better or worse on each affected platform, least of all looking at IR, not ASM.

But if those changes have a positive impact on code generation (translating to better benchmark numbers), then this looks good to me.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3855

This makes a lot of sense to me. I expect none of the pre-AVX2 benchmark numbers to change at all and those for AVX2+ to change positively. If that's the case, then I don't see why this would be a bad choice.

lebedev.ri marked an inline comment as done.Nov 29 2021, 10:53 AM

I can't predict if the new shuffle patterns will be better or worse on each affected platform, least of all looking at IR, not ASM.

But if those changes have a positive impact on code generation (translating to better benchmark numbers), then this looks good to me.

Thank you for chiming in and confirming my understanding!
I'll land this tomorrow, unless there are further comments.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 29 2021, 11:48 PM
This revision was automatically updated to reflect the committed changes.