This is an archive of the discontinued LLVM Phabricator instance.

[X86][TTI] update costs of interleaved load\store of i64\double
ClosedPublic

Authored by magabari on Nov 14 2017, 12:32 AM.

Details

Summary

This patch contains more accurate cost of interelaved load\store of stride 2 for the types int64\double on AVX2.

Diff Detail

Repository
rL LLVM

Event Timeline

magabari created this revision.Nov 14 2017, 12:32 AM
magabari retitled this revision from [X86][TTI] update costs of interleaved load to [X86][TTI] update costs of interleaved load\store of i64\double.Nov 14 2017, 12:34 AM
magabari edited the summary of this revision. (Show Details)
magabari added a subscriber: llvm-commits.
dorit edited edge metadata.Nov 15 2017, 10:27 AM

I think it would be nice to make the testcases smaller; Right now you have something like this:
for (…) {
Dst[2*i] = Dst[2*i] + Src[2*i] * k
Dst[2*i+1] = Dst[2*i+1] + Src[2*i+1] * k
}
...which actually tests both strided loads and strided stores.
So you could either use one test to check both store and load costs (and even then you probably don't need both a mul and an add just to check memops costs).
Or if you want to separate the load and store cases, the Load test could be something like:
for (…) {
s += Src[2*i]
s += Src[2*i+1]
}
The Store test could be something like:
For(…){

Dst[2*i] = k1;
Dst[2*i+1] = k2;

}

test/Analysis/CostModel/X86/interleaved-store-i64.ll
2 ↗(On Diff #122791)

I see some of the interleave tests in this directory use -mcpu=core_avx2 and some use -mcpu=skylake. I wonder which one we want to use?

magabari updated this revision to Diff 123134.Nov 15 2017, 11:43 PM
magabari marked an inline comment as done.

fixed dorit notes

dorit accepted this revision.Nov 15 2017, 11:55 PM

You missed just one mcpu=skylake :)
LGTM with this change

This revision is now accepted and ready to land.Nov 15 2017, 11:55 PM
This revision was automatically updated to reflect the committed changes.
lebedev.ri added a subscriber: lebedev.ri.EditedApr 26 2021, 12:48 AM

@RKSimon @magabari I'd like to add some more tuples, but i have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's it cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5 from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2021, 12:48 AM
Herald added a subscriber: pengfei. · View Herald Transcript

@RKSimon @magabari I'd like to add some more tuples, but i have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's it cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5 from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

I believe they were taken from IACA probably with a Haswell CPU - a reciprocal throughput from llvm-mca should be similar.

Usually with cost tables we tend to compare numbers from similar spec CPUs (AVX2 - Haswell/Ryzen) and choose the worst.....

lebedev.ri added a comment.EditedApr 26 2021, 4:49 AM

@RKSimon @magabari I'd like to add some more tuples, but i have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's it cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5 from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

I believe they were taken from IACA probably with a Haswell CPU - a reciprocal throughput from llvm-mca should be similar.

Usually with cost tables we tend to compare numbers from similar spec CPUs (AVX2 - Haswell/Ryzen) and choose the worst.....

I see. So in this case we have:

therefore for that tuple we choose 9, correct? I'm not seeing any other sched models for AVX2 but not AVX512 CPU's.

And another question: now that we've established the rules, should i be submitting these changes through review,
or committing these directly? I fear former would either result in bulky patches that are hard to review,
or saturate the review queue.