This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Load/store i8 Stride=4 VF=32 interleaving costs
ClosedPublic

Authored by lebedev.ri on Oct 1 2021, 1:33 PM.

Details

Summary

While we already model this tuple, the load cost is divergent from reality, so fix it.

The only sched models for CPUs that support AVX2
but not AVX512 are: haswell, broadwell, skylake, zen1-3.

For load we have:
https://godbolt.org/z/zWMhhnPYa - on Intel, Block RThroughput: =56.0; on Ryzen, Block RThroughput: <=24.0
So pick cost of 56.

For store we have:
https://godbolt.org/z/vnqqjWx51 - on Intel, Block RThroughput: =12.0; on Ryzen, Block RThroughput: <=4.0
So pick cost of 12.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.
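
The selection rule in the summary appears to be "take the worst case across the relevant sched models". A minimal Python sketch of that rule follows, using only the Block RThroughput figures quoted above. The dictionary layout and the `pick_cost` helper are hypothetical, and the Ryzen figures are upper bounds rather than exact values.

```python
import math

# Block RThroughput figures quoted in the summary (from llvm-mca).
# The Ryzen entries are upper bounds ("<="), not exact measurements.
LOAD_RTHROUGHPUT = {"intel": 56.0, "ryzen": 24.0}
STORE_RTHROUGHPUT = {"intel": 12.0, "ryzen": 4.0}


def pick_cost(rthroughput_by_family):
    """Hypothetical helper: worst-case reciprocal throughput, rounded up."""
    return math.ceil(max(rthroughput_by_family.values()))


load_cost = pick_cost(LOAD_RTHROUGHPUT)
store_cost = pick_cost(STORE_RTHROUGHPUT)
print(load_cost, store_cost)
```

Because the Ryzen numbers are only upper bounds, the maximum is decided by the Intel figures in both cases.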

Event Timeline

lebedev.ri created this revision. Oct 1 2021, 1:33 PM
RKSimon accepted this revision. Oct 2 2021, 3:23 AM

LGTM

This revision is now accepted and ready to land. Oct 2 2021, 3:23 AM

LGTM

Thank you so much for the reviews!

By now, i have to ask.
All but one of the preexisting entries in the AVX2InterleavedLoadTbl/AVX2InterleavedStoreTbl tables have now been redone.
This has covered my most immediate interest. Has a saturation point been reached with these reviews,
in the sense that reviewing them has become a painful, boring, tedious experience? If so, i could stop here.
If not, there are a few more i could add.

Hm, looks like i lied, i still need to add a few more, even when looking just at the darktable:

X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 16 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 2 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 4 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 8 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 3 VF = 2 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 3 VF = 4 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 3 VF = 8 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 4 VF = 2 Indices.size() = 4
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 4 VF = 4 Indices.size() = 4
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 4 VF = 8 Indices.size() = 4
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 6 VF = 2 Indices.size() = 6
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 6 VF = 4 Indices.size() = 6
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i32 Stride = 6 VF = 8 Indices.size() = 6
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i8 Stride = 6 VF = 2 Indices.size() = 6
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i8 Stride = 6 VF = 4 Indices.size() = 6
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i8 Stride = 6 VF = 8 Indices.size() = 6
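
A list like this can be reduced to unique tuples with a short script. The sketch below is only an illustration: the regular expression is inferred from the lines quoted above rather than from the LLVM sources, `parse_tuples` is a hypothetical helper, and the duplicated line in `sample` is added here to show the de-duplication.

```python
import re

# Pattern inferred from the debug lines quoted above (an assumption,
# not taken from the LLVM sources).
LINE_RE = re.compile(
    r"EltTy = (?P<elt>i\d+) Stride = (?P<stride>\d+) "
    r"VF = (?P<vf>\d+) Indices\.size\(\) = (?P<indices>\d+)"
)


def parse_tuples(log_text):
    """Return sorted unique (element_bits, stride, vf, num_indices) tuples."""
    found = set()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            found.add(
                (int(m["elt"][1:]), int(m["stride"]), int(m["vf"]), int(m["indices"]))
            )
    return sorted(found)


# Three lines copied from the list above, one of them repeated on purpose.
sample = """\
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 16 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 2 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i16 Stride = 3 VF = 2 Indices.size() = 3
X86TTIImpl::getInterleavedMemoryOpCostAVX2():  EltTy = i8 Stride = 6 VF = 8 Indices.size() = 6
"""

for entry in parse_tuples(sample):
    print(entry)
```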

I've no objections, although I think we'd gain more by ensuring we have cost numbers of the right magnitude for SSE2 first and adjust for later targets.

As you said, it's all rather tedious though - I don't know if we've passed the point where automating more of this with scripting (either the cost extraction or the CHECKs) would be useful?

I've no objections,

Oh, good! Then i basically intend to completely fill out the {i8,i16,i32,i64} x {2,3,4,6} permutation matrix, we are quite close actually.
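
For scale, the matrix can be enumerated directly. This is a minimal sketch that assumes the fourth element type is i64 and ignores the VF dimension, which multiplies the number of table entries further.

```python
from itertools import product

# Element types and strides named in the comment above
# (assumption: the fourth element type is i64).
ELEMENT_TYPES = ["i8", "i16", "i32", "i64"]
STRIDES = [2, 3, 4, 6]

# Every (element type, stride) combination to be covered.
matrix = list(product(ELEMENT_TYPES, STRIDES))

for elt_ty, stride in matrix:
    print(f"EltTy = {elt_ty} Stride = {stride}")
```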

although I think we'd gain more by ensuring we have cost numbers of the right magnitude for SSE2 first and adjust for later targets.

I agree that the baseline costs are bogus but, as discussed previously, to improve them we need better generic shuffle cost modelling.

As you said, it's all rather tedious though - I don't know if we've passed the point where automating more of this with scripting (either the cost extraction or the CHECKs) would be useful?

It would be good to have the update-check-lines script for this, yes.
Automatic cost extraction is somewhat more tedious than i had originally anticipated,
because most of the time i have to manually adjust the assembly to weed out the loads/stores.
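
The easy half of such automation is reading the number back out of the llvm-mca summary. A hedged sketch follows: `block_rthroughput` is a hypothetical helper, and the sample text is a trimmed illustration rather than real tool output. The manual step described above, removing the loads/stores from the assembly, is not addressed here.

```python
import re

# llvm-mca's summary view contains a "Block RThroughput:" line.
RTHROUGHPUT_RE = re.compile(r"^Block RThroughput:\s*([0-9.]+)", re.MULTILINE)


def block_rthroughput(mca_output):
    """Hypothetical helper: extract Block RThroughput from llvm-mca text."""
    m = RTHROUGHPUT_RE.search(mca_output)
    if m is None:
        raise ValueError("no Block RThroughput line found")
    return float(m.group(1))


# Trimmed, illustrative summary; only the RThroughput value comes from
# the figures quoted in this review.
sample_summary = "Iterations:        100\nBlock RThroughput: 56.0\n"

print(block_rthroughput(sample_summary))
```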

After i'm done with these costs, there are two general things i'll want to look into:

  • we have a gigantic elephant in the room: if some lanes aren't demanded, we should just scale the cost by the number of demanded lanes, not fall back to the baseline cost. This will have a monumental impact.
  • i suspect our legality-driven cost model for constant-masked loads/stores/scatters/gathers is misguided.
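
One possible reading of the first point, sketched in Python. It assumes that "lanes" means the members of the interleave group (Indices.size() out of Stride) and that the scaling is linear. Both are assumptions made for illustration, and `scaled_cost` is a hypothetical helper, not an LLVM function.

```python
import math


def scaled_cost(full_cost, stride, num_demanded):
    """Scale a full-group cost by the fraction of demanded members.

    Assumption: linear scaling, rounded up to a whole cost unit.
    """
    if not 0 < num_demanded <= stride:
        raise ValueError("num_demanded must be in 1..stride")
    return math.ceil(full_cost * num_demanded / stride)


# With the load cost of 56 chosen in this review for Stride=4:
print(scaled_cost(56, 4, 4))  # all members demanded
print(scaled_cost(56, 4, 2))  # half of the members demanded
```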