This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
ClosedPublic

Authored by lebedev.ri on Oct 5 2021, 11:55 AM.

Details

Summary

While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements [0, VF)*stride + 0,
and 1'th vector containing elements [0, VF)*stride + 1.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of last ~70 patches, (wow)
BaseT::getInterleavedMemoryOpCos() is absolutely bogus,
it is usually almost an order of magnitude overestimation,
so i would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.

Thoughts?

Diff Detail

Event Timeline

lebedev.ri created this revision.Oct 5 2021, 11:55 AM
lebedev.ri requested review of this revision.Oct 5 2021, 11:55 AM
lebedev.ri added a subscriber: jeroen.dobbelaere.
lebedev.ri added inline comments.
llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12

@jeroen.dobbelaere this test broke. any suggestions how it can be made less fragile? :)

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12

Any idea on how to convince (force) loop-vectorize to do the vectorization ?

lebedev.ri added inline comments.Oct 5 2021, 12:30 PM
llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12

Oh it did vectorize alright, it just decided to do the interleaved load instead of gather.

lebedev.ri marked an inline comment as done.

Avoid testcase regression.

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12

Never mind, got it.

lebedev.ri edited the summary of this revision. (Show Details)Oct 5 2021, 1:07 PM
Matt added a subscriber: Matt.Oct 6 2021, 9:13 AM
lebedev.ri planned changes to this revision.Oct 6 2021, 1:51 PM

Let's deal with D111220 first, i think that one is rather straight-forward (and the costs are obviously correct & won't require further fixes).

lebedev.ri requested review of this revision.Oct 13 2021, 5:40 AM

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

RKSimon accepted this revision.Oct 14 2021, 12:36 PM

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

Yes, it needs rebasing after D111822 but LGTM for committal.

This revision is now accepted and ready to land.Oct 14 2021, 12:36 PM

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

Yes, it needs rebasing after D111822 but LGTM for committal.

Thank you for the review!

This revision was landed with ongoing or failed builds.Oct 14 2021, 1:15 PM
This revision was automatically updated to reflect the committed changes.