This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved store costs
ClosedPublic

Authored by RKSimon on Oct 16 2021, 9:59 AM.

Details

Summary

These all expand to 1 or 2 UNPCK shuffle ops. AVX1/AVX2 sometimes expand to a subvector-concat + permute pattern instead, but the costs turn out to be very similar, so this moves them from the AVX2 to the SSE2 cost table.
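
For illustration, here is a minimal sketch of what a stride-2 interleaved store of two 4 x i32 vectors can look like when written by hand with SSE2 intrinsics. The function name and the a/b/out parameters are hypothetical and are not taken from the patch or its tests. The code the backend actually emits for a given test may differ, which is the point discussed in the review comments.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: writes a0 b0 a1 b1 a2 b2 a3 b3 to out, i.e. a
// stride-2 interleaved store with vectorization factor 4 on i32 elements.
void store_i32_stride2_vf4_sketch(const int32_t *a, const int32_t *b,
                                  int32_t *out) {
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));

  // Two UNPCK shuffles produce the interleaved halves.
  __m128i lo = _mm_unpacklo_epi32(va, vb);  // a0 b0 a1 b1 (PUNPCKLDQ)
  __m128i hi = _mm_unpackhi_epi32(va, vb);  // a2 b2 a3 b3 (PUNPCKHDQ)

  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), lo);
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), hi);
}
```

Calling this with a = {0, 2, 4, 6} and b = {1, 3, 5, 7} fills out with the eight values in alternating order, using exactly two shuffle instructions for the interleave.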

Event Timeline

RKSimon created this revision. Oct 16 2021, 9:59 AM
RKSimon requested review of this revision. Oct 16 2021, 9:59 AM
Herald added a project: Restricted Project. Oct 16 2021, 9:59 AM
RKSimon retitled this revision from "[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved load costs" to "[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved store costs". Oct 17 2021, 8:21 AM
lebedev.ri accepted this revision. Oct 17 2021, 10:13 AM

I suppose this is a better ballpark, but I'm not really sold on the i64/i32-vf4 part.

llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-2.ll, line 13

LG

llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-2.ll, line 13

@store_i32_stride2_vf4's codegen looks really different; I'm not sure this is right:
https://godbolt.org/z/dojn9enWK
https://godbolt.org/z/zfEPrYovd

llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-2.ll, line 13

All of @store_i64_stride2_vf2/@store_i64_stride2_vf4's codegen looks really different.

llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-2.ll, line 13

LG

This revision is now accepted and ready to land. Oct 17 2021, 10:13 AM

I'll just address the i8/i16 cases first.