This is an archive of the discontinued LLVM Phabricator instance.

[X86][DAGISel] Don't widen shuffle element with AVX512
ClosedPublic

Authored by LuoYuanke on Jul 11 2022, 8:04 PM.

Details

Summary
Currently the X86 shuffle lowering would widen the element type for
shuffle if the mask element value is adjacent. For below example

  %t2 = add nsw <16 x i32> %t0, %t1
  %t3 = sub nsw <16 x i32> %t0, %t1
  %t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
                      <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
                       i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
                       i32 11, i32 12, i32 13, i32 14, i32 15>

  ret <16 x i32> %t4

Compiler would transform the shuffle to
  %t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
                      <8 x i64> <i32 8, i32 1, i32 2, i32 3, i32 4,
                                 i32 5, i32 6, i32 7>
This may lose the oppotunity to let ISel select mask instruction when
avx512 is enabled.

This patch is to prevent the tranform when avx512 feature is enabled.

Diff Detail

Event Timeline

LuoYuanke created this revision.Jul 11 2022, 8:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2022, 8:04 PM
LuoYuanke requested review of this revision.Jul 11 2022, 8:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2022, 8:04 PM
wxiao3 added a subscriber: wxiao3.Jul 14 2022, 7:57 PM
RKSimon added inline comments.Jul 18 2022, 6:08 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44372

This doesn't look right - shouldn't it be something like:

APInt Mask = APIntOps::ScaleBitMask(ConstCond->getAPIntValue(), NumElts * 2) ?

LuoYuanke added inline comments.Jul 18 2022, 7:50 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
44372

Thanks, Simon. ScaleBitMask perfectly fit this coputation.

LuoYuanke updated this revision to Diff 445514.Jul 18 2022, 8:08 AM

Address Simon's comments.

I think we can probably generalize this very easily to any legal widening, e.g. to handle: https://gcc.godbolt.org/z/Pjz5qfYT7 (v32i16 -> v16i32)

RKSimon added inline comments.Jul 18 2022, 1:20 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
44346

Check if N->getOpcode() == ISD::VSELECT

44350
auto *ConstCond = dyn_cast<ConstantSDNode>(Cond.getOperand(0));
if (!ConstCond)
  return SDValue();
LuoYuanke added inline comments.Jul 18 2022, 10:37 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
44346

Is it possible that CondVT is not vXi1 for ISD::VSELECT? I ask the question becasue the comments for ISD::VSELECT says "targets may change the condition type".

/// At first, the VSELECT condition is of vXi1 type. Later, targets may
/// change the condition type in order to match the VSELECT node using a
/// pattern. The condition follows the BooleanContent format of the target.

Address Simon's comments.

I think we can probably generalize this very easily to any legal widening, e.g. to handle: https://gcc.godbolt.org/z/Pjz5qfYT7 (v32i16 -> v16i32)

Good suggestion. Let me take a look at it.

LuoYuanke updated this revision to Diff 446075.Jul 20 2022, 1:43 AM

Address Simon's address to generalize the blend/select combine.

LuoYuanke added inline comments.Jul 20 2022, 1:48 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
94 ↗(On Diff #446075)

For this case not sure if it is worse than left side code (previous code).

I've added some additional test coverage to shuffle-blend.ll - please can you rebase?

llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
94 ↗(On Diff #446075)

Add a 128-bit vector limit?

Please can you update the patch title/summary?

LuoYuanke updated this revision to Diff 446109.Jul 20 2022, 4:25 AM

Rebase and update test case.

LuoYuanke retitled this revision from [X86][DAGISel] Combine select vXi64 with AVX512 target to [X86][DAGISel] Don't widen shuffle element with AVX512.Jul 20 2022, 4:25 AM
LuoYuanke edited the summary of this revision. (Show Details)
LuoYuanke added inline comments.Jul 20 2022, 4:48 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
94 ↗(On Diff #446075)

I'll add 128-bit vector limit. However in this case it is 128-bit vector.

LuoYuanke updated this revision to Diff 446114.Jul 20 2022, 5:03 AM

Limit the vector bit width >=128 and add test cases.

RKSimon added inline comments.Jul 20 2022, 5:16 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
5 ↗(On Diff #446114)

CHECK,AVX512BW,X86-AVX512BW
CHECK,AVX512BW,X64-AVX512BW

LuoYuanke updated this revision to Diff 446132.Jul 20 2022, 6:03 AM

Address Simon's comments.

LuoYuanke marked an inline comment as done.Jul 20 2022, 6:06 AM
LuoYuanke added inline comments.
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
94 ↗(On Diff #446132)

retl can be merged to retq by ret{{[l|q]}}. Not sure why utils/update_llc_test_checks.py doesn't merge.

RKSimon added inline comments.Jul 20 2022, 6:13 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
169 ↗(On Diff #446132)

regression? you might need to improve the 128-bit limit logic to account for vXi16 specifically

LuoYuanke added inline comments.Jul 20 2022, 7:49 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
169 ↗(On Diff #446132)

There is PBLENDW, but there is no PBLENDB instruction, so it is better to widen to v2Xi16 from vXi8. Besides there is more instruction for 16-bit element (e.g., movsh). I'll investigate more on this issue.

LuoYuanke updated this revision to Diff 446163.Jul 20 2022, 8:12 AM

Specially handling for vXi8, because vXi16 can be applied PBLENDW while vXi8 can't.

RKSimon added inline comments.Jul 20 2022, 8:16 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
243 ↗(On Diff #446163)

pre-commit these additional tests

94 ↗(On Diff #446132)

"kmovd %eax" vs "kmovq %rax"

LuoYuanke added inline comments.Jul 20 2022, 8:19 AM
llvm/test/CodeGen/X86/avx512-shuffles/shuffle-blend.ll
243 ↗(On Diff #446163)

Sure. I'll do it.

94 ↗(On Diff #446132)

Got it. :)

RKSimon added inline comments.Jul 22 2022, 2:54 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19326

This list is going to get longer, and we're likely to miss patterns that only fold to target nodes later on - I'm wondering whether we could consider accepting any TLI.isBinOp() case here?

RKSimon accepted this revision.Jul 25 2022, 6:08 AM

LGTM - with one minor comment for future work

llvm/lib/Target/X86/X86ISelLowering.cpp
19326

Please can you add a TODO about maybe converting this to TLI.isBinOp()?

This revision is now accepted and ready to land.Jul 25 2022, 6:08 AM
LuoYuanke updated this revision to Diff 447318.Jul 25 2022, 7:05 AM

Address Simon's comments.

LuoYuanke added inline comments.Jul 25 2022, 7:08 AM
llvm/test/CodeGen/X86/haddsub-undef.ll
1053 ↗(On Diff #447318)

This looks a regression, I'll take a look at it.

LuoYuanke added inline comments.Jul 25 2022, 7:09 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19326

Thank Simon for the suggestion. It seems there is regression on some cases, I'll take a look at the regression.

Yeah - that was what I saw as well - if you want to get this in for 15.x I'd recommend going back to the old switch statement - then investigate the binop general case later (if you solve it soon enough you request a merge)

LuoYuanke updated this revision to Diff 447540.Jul 25 2022, 8:16 PM

Revert to previous version and add TODO for checking TLI.isBinOp().

Yeah - that was what I saw as well - if you want to get this in for 15.x I'd recommend going back to the old switch statement - then investigate the binop general case later (if you solve it soon enough you request a merge)

Thank Simon for the review. All the comments are valuable. Let me land the old switch statement version first and then investigate the general binop.

This revision was landed with ongoing or failed builds.Jul 25 2022, 8:56 PM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.Jul 28 2022, 7:28 AM

This patch unfortunately causes crashes when building llvm-test-suite optimizing for AVX512.

Reproducer for llc:

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx"

define i32 @test(<32 x i32> %0) #0 {
entry:
  %1 = mul <32 x i32> %0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %2 = tail call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %1)
  ret i32 %2
}

; Function Attrs: nocallback nofree nosync nounwind readnone willreturn
declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>) #1

attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" }
attributes #1 = { nocallback nofree nosync nounwind readnone willreturn }

I've reverted the patch in the meantime to get current main back into a good state.