This is an archive of the discontinued LLVM Phabricator instance.

[CostModel][X86] add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)
ClosedPublic

Authored by yubing on Sep 18 2020, 12:17 AM.

Details

Summary

add CostModel for SK_Select(v8f64, v8i64, v16f32, v16i32, v32i16, v64i8)

Diff Detail

Event Timeline

yubing created this revision.Sep 18 2020, 12:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 18 2020, 12:17 AM
yubing requested review of this revision.Sep 18 2020, 12:17 AM

https://reviews.llvm.org/B72136 shows it fails when making HTTP Request. But I don't have permission to restart the build procedure.
Besides, I've check-all locally with success.

Please pre-commit the new tests against trunk and rebase the patch to show the diffs

yubing updated this revision to Diff 293049.Sep 20 2020, 7:07 PM

Just a rebase

Thanks for your comments, Simon. With your Comments, I found my new testcases are useless since they are for Instruction::Select instead of Instruction::ShuffleVector(SK_Select)

yubing updated this revision to Diff 293077.Sep 20 2020, 11:28 PM

Delete some testcases which was added previously since they are for Instruction::Select instead of Instruction::ShuffleVector(SK_Select)

@yubing I've added shuffle-select.ll which should have better test coverage - please can you rebase and check?

yubing updated this revision to Diff 293325.EditedSep 21 2020, 7:45 PM
This comment has been deleted.
yubing added a comment.EditedSep 21 2020, 8:44 PM

@yubing I've added shuffle-select.ll which should have better test coverage - please can you rebase and check?

Hi, Simon. Should we provide more precise cost for v32i16&v64i8 in avx512f? I think they should be at most 42, according to following code in AVX512ShuffleTbl:

{TTI::SK_PermuteTwoSrc,    MVT::v32i16, 42},
{TTI::SK_PermuteTwoSrc,    MVT::v64i8,  42},
yubing updated this revision to Diff 293342.Sep 21 2020, 10:44 PM

With avx512f, the cost SK_Select(v32i16 or v64i8) shoulde be 3(vmovdqa64 + vpternlogq)

With avx512f, the cost SK_Select(v32i16 or v64i8) shoulde be 3(vmovdqa64 + vpternlogq)

The moves probably don't really count since they can be eliminated during register renaming. So only the vpternlog executes.

With avx512f, the cost SK_Select(v32i16 or v64i8) shoulde be 3(vmovdqa64 + vpternlogq)

The moves probably don't really count since they can be eliminated during register renaming. So only the vpternlog executes.

Eh, Craig, why it has relationship with register renaming? I thought, vternlog's third operand should be provided by a vmovdqa64.
Besides, we can observe the following asm for v32i16's SK_Select:

vmovdqa64       .LCPI0_0(%rip), %zmm0   # zmm0 = [0,0,0,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535]
vpternlogq      $202, 144(%rbp), %zmm4, %zmm0

With avx512f, the cost SK_Select(v32i16 or v64i8) shoulde be 3(vmovdqa64 + vpternlogq)

The moves probably don't really count since they can be eliminated during register renaming. So only the vpternlog executes.

Eh, Craig, why it has relationship with register renaming? I thought, vternlog's third operand should be provided by a vmovdqa64.
Besides, we can observe the following asm for v32i16's SK_Select:

vmovdqa64       .LCPI0_0(%rip), %zmm0   # zmm0 = [0,0,0,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535]
vpternlogq      $202, 144(%rbp), %zmm4, %zmm0

Sorry I thought the vmovdqa you mentioned was due to the vpternlogq reading 3 sources and clobbering one of them. So sometimes it needs a register to register move to preserve a register.

I'm not sure if we usually cost the constant pool load since its loop invariant. Do we cost the load that vpermi2b/w/d/q would use for 2 source permute?

With avx512f, the cost SK_Select(v32i16 or v64i8) shoulde be 3(vmovdqa64 + vpternlogq)

The moves probably don't really count since they can be eliminated during register renaming. So only the vpternlog executes.

Eh, Craig, why it has relationship with register renaming? I thought, vternlog's third operand should be provided by a vmovdqa64.
Besides, we can observe the following asm for v32i16's SK_Select:

vmovdqa64       .LCPI0_0(%rip), %zmm0   # zmm0 = [0,0,0,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535,65535]
vpternlogq      $202, 144(%rbp), %zmm4, %zmm0

Sorry I thought the vmovdqa you mentioned was due to the vpternlogq reading 3 sources and clobbering one of them. So sometimes it needs a register to register move to preserve a register.

I'm not sure if we usually cost the constant pool load since its loop invariant. Do we cost the load that vpermi2b/w/d/q would use for 2 source permute?

You're right, we don't need to consider cost of vmovdaq64. In AVX2ShuffleTbl, although shuffle's index is provided by a MOV but we won't take it into consideration:

{TTI::SK_PermuteSingleSrc, MVT::v4f64, 1},  // vpermpd
yubing updated this revision to Diff 293360.EditedSep 22 2020, 12:27 AM

With avx512f, the cost of SK_Select(v32i16 or v64i8) should be 1(vpternlogq)

RKSimon accepted this revision.Sep 22 2020, 7:32 AM

LGTM with one minor

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1193

ignore clang-format - please can you add whitespace to align these columns - otherwise its very difficult to see costs at a glance.

This revision is now accepted and ready to land.Sep 22 2020, 7:32 AM
This revision was landed with ongoing or failed builds.Sep 22 2020, 7:29 PM
This revision was automatically updated to reflect the committed changes.