This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Load/store v2i16 VF=2 interleaving costs
ClosedPublic

Authored by lebedev.ri on May 26 2021, 2:38 AM.

Details

Summary

@RKSimon please can you check that i'm got the idea right, and not missing any steps?

The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/M8vEKs5jY - for intels Block RThroughput: =2.0; for ryzens, Block RThroughput: <=1.0
So pick cost of 2.

For store we have:
https://godbolt.org/z/Kx1nKz7je - for intels Block RThroughput: =1.0; for ryzens, Block RThroughput: <=0.5
So pick cost of 1.

I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution.

Diff Detail

Event Timeline

lebedev.ri created this revision.May 26 2021, 2:38 AM
lebedev.ri requested review of this revision.May 26 2021, 2:38 AM

In your analysis, don't you need to account for the load/stores as well? Although in these cases they don't affect the numbers: https://llvm.godbolt.org/z/4xeEf619Y

lebedev.ri added inline comments.May 26 2021, 9:52 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5062–5063

In your analysis, don't you need to account for the load/stores as well? Although in these cases they don't affect the numbers: https://llvm.godbolt.org/z/4xeEf619Y

I'm following this comment, only measuring the shuffle sequence, ignoring loads/stores.

lebedev.ri added a comment.EditedMay 26 2021, 12:54 PM

And one more example: store VF16 stride 6:

Thus, 48.

Sidenote, i'm still unsure whether we should specify that Zen 3 has FeatureFastVariableShuffle.
Byte-wise variable shuffle is fast, but some are ucoded.

@RKSimon to be most specific: what i want here, is to agree on the algorithm, namely:

  1. pick one IR sequence (from the codegen tests i've already added in llvm/test/CodeGen/X86/vector-interleaved-{load,store}-i16-stride-[2-6].ll)
  2. annotate the [de]interleaving shuffle block with MCA macros
  3. for each CPU (that has AVX2 but not AVX512) for which we have sched model:
    1. codegen the IR for given CPU
    2. do NO manual changes to the produced assembly whatsoever
    3. record Block RThroughput the MCA produces for that CPU/sched model
  4. pick largest recorded Block RThroughput (rounding 0.999->1, 1.5->1, 1.50001->2) as the cost for the tuple

... so i can proceed to add the missing costs without nagging on you to double-check the findings.
At least that is what i would like to happen. If that is unfeasible, i can submit them as reviews..

Sidenote, i'm still unsure whether we should specify that Zen 3 has FeatureFastVariableShuffle.
Byte-wise variable shuffle is fast, but some are ucoded.

Yes, in hindsight this feature flag needs splitting into FeatureFastVariableCrossLaneShuffle + FeatureFastVariablePerLaneShuffle,

Sidenote, i'm still unsure whether we should specify that Zen 3 has FeatureFastVariableShuffle.
Byte-wise variable shuffle is fast, but some are ucoded.

Yes, in hindsight this feature flag needs splitting into FeatureFastVariableCrossLaneShuffle + FeatureFastVariablePerLaneShuffle,

Hmm. Let me look into that first then, right now :)

Sidenote, i'm still unsure whether we should specify that Zen 3 has FeatureFastVariableShuffle.
Byte-wise variable shuffle is fast, but some are ucoded.

Yes, in hindsight this feature flag needs splitting into FeatureFastVariableCrossLaneShuffle + FeatureFastVariablePerLaneShuffle,

Hmm. Let me look into that first then, right now :)

Done, D103274.

RKSimon accepted this revision.May 28 2021, 8:42 AM

LGTM, and thanks for handling D103274

This revision is now accepted and ready to land.May 28 2021, 8:42 AM

Thank you for taking a look!

LGTM, and thanks for handling D103274

Could you please specify, is LGTM for this specific change;
or about the algorithm, in which case i can proceed with
adding the other missing costs in
X86TTIImpl::getInterleavedMemoryOpCostAVX2()
without going throught this review?

The LGTM was for this patch.

The algorithm makes sense, although in the script I'm slowly writing for cost-tables vs llvm-mca comparisons, I'm always using ceil() for fractional values, this was mainly to keep with the 'worst case' approach of the cost tables.

lebedev.ri abandoned this revision.May 30 2021, 5:12 AM

Thank you.
The more i look at this, the more i think i'm not okay with the state of things there.
Clearly, the generic/worst-case costs are sacred, and are not easily changed,
but in the end, i think i actually care more about specific CPU[s].
So let's see if i can extend my scripts to knee-jerkingly improve situation for non-generic costs.

lebedev.ri reclaimed this revision.Sep 26 2021, 9:04 AM

I've reconsidered.
If a patch spam is what it takes then i can surely do that.

This revision is now accepted and ready to land.Sep 26 2021, 9:04 AM
This revision was landed with ongoing or failed builds.Sep 26 2021, 9:14 AM
This revision was automatically updated to reflect the committed changes.