This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs
ClosedPublic

Authored by lebedev.ri on Oct 4 2021, 12:42 PM.

Details

Summary

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1jfGddcre - for the Intel models, Block RThroughput: =36.0; for the Ryzens, Block RThroughput: =12.0
So we could pick a cost of 36.

For store we have:
https://godbolt.org/z/ao9srMT8r - for the Intel models, Block RThroughput: =30.0; for the Ryzens, Block RThroughput: =12.0
So we could pick a cost of 30.
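
The selection above appears to take the worst case over the relevant sched models. A minimal sketch of that reading, using only the Block RThroughput figures quoted in this summary; the `pick_cost` helper and the table layout are hypothetical and are not part of the patch or of LLVM:

```python
import math

# Block RThroughput figures as quoted in the summary (llvm-mca output),
# grouped the same way the summary groups them.
RTHROUGHPUT = {
    "load": {"intel": 36.0, "ryzen": 12.0},
    "store": {"intel": 30.0, "ryzen": 12.0},
}


def pick_cost(per_model):
    """Hypothetical helper: take the worst (largest) reciprocal
    throughput across the sched models and round up to an integer."""
    return math.ceil(max(per_model.values()))


costs = {op: pick_cost(models) for op, models in RTHROUGHPUT.items()}
print(costs)
```

Taking the maximum is a conservative choice: the cost is then not underestimated on any of the listed CPUs, at the price of being pessimistic for the Ryzens.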

I'm directly using the shuffling asm that llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
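
For reference, the access pattern being costed can be sketched as follows. This only illustrates what a Stride=6, VF=8 interleave group means at the element level; it does not model the actual x86 shuffle sequence, and the function names are made up for this example:

```python
STRIDE = 6  # interleave factor: number of members in the group
VF = 8      # vectorization factor: elements per member vector


def deinterleave(wide):
    """Split one wide load of STRIDE*VF elements into STRIDE vectors of
    VF elements each; member m takes elements m, m+STRIDE, m+2*STRIDE, ..."""
    assert len(wide) == STRIDE * VF
    return [wide[m::STRIDE] for m in range(STRIDE)]


def interleave(members):
    """Inverse operation, as needed before one wide interleaved store."""
    assert len(members) == STRIDE
    assert all(len(v) == VF for v in members)
    return [members[i % STRIDE][i // STRIDE] for i in range(STRIDE * VF)]


wide = list(range(STRIDE * VF))  # stand-in for 48 consecutive i64/f64 values
members = deinterleave(wide)
print(members[0])  # elements 0, 6, 12, ... of the wide vector
```

The costs in this patch cover the shuffles needed to perform this rearrangement in vector registers, once for the load direction and once for the store direction.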

Event Timeline

lebedev.ri created this revision. Oct 4 2021, 12:42 PM
RKSimon accepted this revision. Oct 5 2021, 6:21 AM

LGTM

This revision is now accepted and ready to land. Oct 5 2021, 6:21 AM

HURRAY! Thank you for the reviews!
Let's see how far this has gotten us.

This revision was landed with ongoing or failed builds. Oct 5 2021, 7:00 AM
This revision was automatically updated to reflect the committed changes.
lebedev.ri added a comment. Edited Oct 5 2021, 7:50 AM

So i've rechecked and as far as the full interleave groups go, this has fully covered my interest in rawspeed+darktable (and ended up vectorizing +12% (+113) more loops).
I've looked at tertiary relevant projects (i didn't before), and despite my best hopes,
stride=5/7/8 comes up in rawtherapee/babl/gegl/gimp :/
I don't really want to deal with that again right away, so i'll instead look into what's missing for non-fully-interleaved groups.