This is an archive of the discontinued LLVM Phabricator instance.

[X86] Disable masked UNPCKLPD/UNPCKHPD -> SHUFPS transformation
ClosedPublic

Authored by pengfei on Apr 3 2023, 11:48 PM.

Details

Summary

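UNPCKLPD/UNPCKHPD operate on 64-bit elements, so their masked versions
apply the mask per 64-bit element, whereas SHUFPS is masked per 32-bit
element. The masked versions therefore don't match SHUFPS in lanes.
This reverts part of D144763.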
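
The mismatch can be illustrated with a small scalar model. This is only a sketch under stated assumptions: it models a single 128-bit register with merge masking, and the names `Vec4x32`, `maskedUnpcklpd`, and `maskedShufps` are hypothetical helpers written for this illustration, not LLVM or intrinsic APIs.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model: a 128-bit register as four 32-bit lanes (lane 0 lowest).
using Vec4x32 = std::array<std::uint32_t, 4>;

// Merge-masked UNPCKLPD model. The unmasked result is {a.q0, b.q0}, i.e.
// 32-bit lanes {a0, a1, b0, b1}. Each mask bit guards one 64-bit element,
// so one bit covers a pair of 32-bit lanes.
Vec4x32 maskedUnpcklpd(const Vec4x32 &src, unsigned k, const Vec4x32 &a,
                       const Vec4x32 &b) {
  const Vec4x32 full = {a[0], a[1], b[0], b[1]};
  Vec4x32 out = src;
  for (unsigned q = 0; q < 2; ++q) {
    if (k & (1u << q)) {
      out[2 * q] = full[2 * q];
      out[2 * q + 1] = full[2 * q + 1];
    }
  }
  return out;
}

// Merge-masked SHUFPS model. The low two result lanes are selected from a,
// the high two from b, two immediate bits per lane. Each mask bit guards
// one 32-bit element.
Vec4x32 maskedShufps(const Vec4x32 &src, unsigned k, const Vec4x32 &a,
                     const Vec4x32 &b, unsigned imm) {
  const Vec4x32 full = {a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3],
                        b[(imm >> 6) & 3]};
  Vec4x32 out = src;
  for (unsigned d = 0; d < 4; ++d) {
    if (k & (1u << d))
      out[d] = full[d];
  }
  return out;
}
```

With all mask bits set (0x3 for the 64-bit form, 0xF for the 32-bit form), `maskedShufps` with immediate 0x44 produces the same lanes as `maskedUnpcklpd`. With the same mask value 0x1 passed to both, the UNPCKLPD model writes lanes 0 and 1 while the SHUFPS model writes only lane 0, so reusing the mask unchanged gives a different result.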

Diff Detail

Event Timeline

pengfei created this revision. · Apr 3 2023, 11:48 PM
Herald added a project: Restricted Project. · Apr 3 2023, 11:48 PM
Herald added a subscriber: hiraditya.
pengfei requested review of this revision. · Apr 3 2023, 11:48 PM
pengfei updated this revision to Diff 510711. · Apr 3 2023, 11:59 PM

Remove comments.

pengfei edited the summary of this revision. · Apr 4 2023, 12:01 AM
RKSimon accepted this revision. · Apr 4 2023, 12:31 AM

Good Catch! LGTM

llvm/test/CodeGen/X86/tuning-shuffle-unpckpd-avx512.ll
167–168

Add comments to the changed test cases saying that these are negative tests because the predicate masks don't match.

This revision is now accepted and ready to land. · Apr 4 2023, 12:31 AM
This revision was landed with ongoing or failed builds. · Apr 4 2023, 12:57 AM
This revision was automatically updated to reflect the committed changes.
pengfei marked an inline comment as done.

@goldstein.w.n You might want to investigate if it's worth using VSHUFPD instead?

@pengfei and @RKSimon what about using {VP}UNPCK{L|H}QDQ{...}? I tested on ICL and didn't see any domain penalty. I wasn't able to find the hardware to test hsw/skl/...., and I'm not sure whether it falls under no-shuffle hasNoDomainDelayShuffle or something else, but it is the ideal replacement from both a perf and a code-size perspective.

Why not both? We can try VSHUFPD first to see if it has better scheduling; otherwise, try the integer unpack if we don't have a domain penalty.

See: D147541