@RKSimon please can you check that I've got the idea right and am not missing any steps?
The only sched models for CPUs that support AVX2 but not AVX512 are: haswell, broadwell, skylake, zen1-3
For load we have:
https://godbolt.org/z/M8vEKs5jY - on Intel, Block RThroughput = 2.0; on Ryzen, Block RThroughput <= 1.0
So pick the worst case: a cost of 2.
For store we have:
https://godbolt.org/z/Kx1nKz7je - on Intel, Block RThroughput = 1.0; on Ryzen, Block RThroughput <= 0.5
So pick the worst case: a cost of 1.
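To make the selection rule explicit, here is a minimal sketch of how I read the two picks above: take the largest Block RThroughput across the CPU families and round it up to an integer. The grouping into "intel" and "ryzen", the dictionary names, and the `pick_cost` helper are hypothetical and only for illustration; the Ryzen figures are the upper bounds quoted above, not exact measurements.

```python
import math

# Block RThroughput figures quoted above (llvm-mca output).
# The "ryzen" entries are upper bounds ("<="), not exact values.
LOAD_RTHROUGHPUT = {"intel": 2.0, "ryzen": 1.0}
STORE_RTHROUGHPUT = {"intel": 1.0, "ryzen": 0.5}


def pick_cost(rthroughput_by_family):
    """Return the worst-case Block RThroughput, rounded up to an integer.

    Assumption: the cost should cover the slowest CPU family, so the
    maximum is used rather than an average.
    """
    worst = max(rthroughput_by_family.values())
    return math.ceil(worst)


if __name__ == "__main__":
    print("load cost:", pick_cost(LOAD_RTHROUGHPUT))    # load cost: 2
    print("store cost:", pick_cost(STORE_RTHROUGHPUT))  # store cost: 1
```

Using the maximum means the cost is pessimistic for Ryzen, which seems acceptable if the number is meant as a worst-case estimate across all AVX2-only targets.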
I'm directly using the shuffle asm that llc produced, without any manual fixups that might be needed to ensure sequential execution.
Following this comment, I'm only measuring the shuffle sequence and ignoring the loads/stores.