This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Load/store i64/f64 Stride=3 VF=16 interleaving costs
Closed, Public

Authored by lebedev.ri on Oct 3 2021, 11:59 AM.

Details

Summary

This required a huge amount of assembly surgery, but I think this is about right.

The only sched models for CPUs that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/z11crMEcj - for intels Block RThroughput: =20.0; for ryzens, Block RThroughput: <=18.0
So we could pick cost of 25.

For store we have:
https://godbolt.org/z/eqT4ze3j4 - for intels Block RThroughput: =24.0; for ryzens, Block RThroughput: <=16.0
So we could pick cost of 24.
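
As a rough sketch of how one number might be derived from several llvm-mca runs: the thread does not state a selection rule, so taking the worst-case Block RThroughput across CPU families is an assumption here, one that happens to reproduce the store figure above. The function name `pick_cost` and the dictionary keys are hypothetical.

```python
import math


def pick_cost(rthroughput_by_cpu):
    """Return the worst-case Block RThroughput, rounded up to an integer.

    Assumption: the cost is the maximum over the measured CPU families.
    """
    return math.ceil(max(rthroughput_by_cpu.values()))


# Store numbers quoted in the summary; the Ryzen figure is an upper bound.
store_rthroughput = {"intel": 24.0, "ryzen": 16.0}
store_cost = pick_cost(store_rthroughput)
```

Because the Ryzen value is only an upper bound, it cannot raise the result here unless it exceeds the Intel figure.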

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
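
For readers unfamiliar with the terminology, the following is a minimal sketch of the access pattern being costed, assuming the usual meaning of an interleaved group: with Stride=3 and VF=16, one vector iteration touches 3 * 16 = 48 consecutive 64-bit elements, which a load splits into three 16-element vectors and a store merges back. The helper names are hypothetical, and plain Python lists stand in for vector registers.

```python
def deinterleave(flat, stride):
    """Split a flat interleaved sequence into `stride` member sequences."""
    if len(flat) % stride != 0:
        raise ValueError("length must be a multiple of the stride")
    return [flat[k::stride] for k in range(stride)]


def interleave(members):
    """Inverse of deinterleave: zip the member sequences back together."""
    return [x for group in zip(*members) for x in group]


STRIDE = 3  # three fields per record
VF = 16     # vectorization factor: 16 records per vector iteration

memory = list(range(STRIDE * VF))       # 48 consecutive i64 elements
a, b, c = deinterleave(memory, STRIDE)  # load direction: three 16-element vectors
restored = interleave([a, b, c])        # store direction: back to memory order
```

The costs discussed in this review cover the shuffles that implement these two rearrangements on avx2 hardware.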

Event Timeline

lebedev.ri created this revision. Oct 3 2021, 11:59 AM

We are quite limited - even adding all xmm0 to the llvm-mca asm capture (~{xmm0} etc.) has very little effect - short of hand coding something, I think what you have here is at least a realistic magnitude value.

I'm happy to accept this if you want.

lebedev.ri edited the summary of this revision. Oct 3 2021, 2:07 PM

We are quite limited - even adding all xmm0 to the llvm-mca asm capture (~{xmm0} etc.) has very little effect - short of hand coding something, I think what you have here is at least a realistic magnitude value.

Right.

I'm happy to accept this if you want.

Hmm, what if we omit the epilogue stores in the load case: https://godbolt.org/z/43ao6qYvK -- does load interleaving cost of 18 seem more/less realistic?
It would be good to also ignore the prologue loads, since we'll cost them twice

And likewise for interleaved store case: https://godbolt.org/z/1EjY3fqPG -- does interleaving store cost of 18 seem more/less realistic?

Correction, wrong second link.

After messing around with it, i think i can sorta get rid of memory ops in the store case: https://godbolt.org/z/do6nTcY85
That results in interleaved store cost of 24, which seems to be in the same ballpark as with the memory ops,
and follows the trend of ~doubling the cost with double the VF. Looks better?

Likewise for load: https://godbolt.org/z/z11crMEcj
I guess i'll just update the review.

lebedev.ri updated this revision to Diff 376795. Oct 3 2021, 4:06 PM
lebedev.ri retitled this revision from [WIP][X86][Costmodel] Load/store i64/f64 Stride=3 VF=16 interleaving costs to [X86][Costmodel] Load/store i64/f64 Stride=3 VF=16 interleaving costs.
lebedev.ri edited the summary of this revision.

@RKSimon hopefully this looks right to you!
After this i think we are only missing {i32,i64}*{stride 4, stride 6}.

RKSimon accepted this revision. Oct 4 2021, 4:01 AM

LGTM

This revision is now accepted and ready to land. Oct 4 2021, 4:01 AM

LGTM

Alrighty!
Thank you for the review!