This patch contains more accurate cost of interelaved load\store of stride 2 for the types int64\double on AVX2.
Details
Diff Detail
Event Timeline
I think it would be nice to make the testcases smaller; Right now you have something like this:
for (…) {
Dst[2*i] = Dst[2*i] + Src[2*i] * k
Dst[2*i+1] = Dst[2*i+1] + Src[2*i+1] * k
}
...which actually tests both strided loads and strided stores.
So you could either use one test to check both store and load costs (and even then you probably don't need both a mul and an add just to check memops costs).
Or if you want to separate the load and store cases, the Load test could be something like:
for (…) {
s += Src[2*i]
s += Src[2*i+1]
}
The Store test could be something like:
For(…){
Dst[2*i] = k1; Dst[2*i+1] = k2;
}
test/Analysis/CostModel/X86/interleaved-store-i64.ll | ||
---|---|---|
2 | I see some of the interleave tests in this directory use -mcpu=core_avx2 and some use -mcpu=skylake. I wonder which one we want to use? |
@RKSimon @magabari I'd like to add some more tuples, but i have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's it cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5 from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?
I believe they were taken from IACA probably with a Haswell CPU - a reciprocal throughput from llvm-mca should be similar.
Usually with cost tables we tend to compare numbers from similar spec CPUs (AVX2 - Haswell/Ryzen) and choose the worst.....
I see. So in this case we have:
- znver1/2 4.8 https://godbolt.org/z/W9x6GWdnh https://godbolt.org/z/dx7718YT9 (likely unreliable, awaiting zen3)
- haswell/broadwell/skylake 9 https://godbolt.org/z/bzG17drjn https://godbolt.org/z/frnEfeY6K https://godbolt.org/z/o7jK9M9hK
therefore for that tuple we choose 9, correct? I'm not seeing any other sched models for AVX2 but not AVX512 CPU's.
And another question: now that we've established the rules, should i be submitting these changes through review,
or committing these directly? I fear former would either result in bulky patches that are hard to review,
or saturate the review queue.
I see some of the interleave tests in this directory use -mcpu=core_avx2 and some use -mcpu=skylake. I wonder which one we want to use?