This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] canonicalize splat shuffle after cmp
ClosedPublic

Authored by spatel on Jan 28 2020, 12:34 PM.

Details

Summary

cmp (splat V1, M), SplatC --> splat (cmp V1, SplatC'), M

As discussed in PR44588:
https://bugs.llvm.org/show_bug.cgi?id=44588
...we try harder to push shuffles after binops than after compares.

This patch handles the special (but presumably most common case) of splat shuffles. If both operands are splats, then we can do the comparison on the non-splat inputs followed by splat of the compare. That should take care of the regression noted in D73411.

There's another potential fold requested in PR37463 to scalarize the compare, but that's another patch (and it's not clear if we can do that without the ability to undo it later):
https://bugs.llvm.org/show_bug.cgi?id=37463

Diff Detail

Event Timeline

spatel created this revision.Jan 28 2020, 12:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 28 2020, 12:34 PM
nikic accepted this revision.Jan 28 2020, 1:05 PM

LGTM

Does it make sense to extend this for arbitrary shuffles in the future? The mask construction logic in https://github.com/llvm/llvm-project/blob/4aa8cdfeebec115b928e0ccb452551b520d00f0b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp#L1612-L1660 can probably mostly be reused for the icmp case.

This revision is now accepted and ready to land.Jan 28 2020, 1:05 PM

LGTM

Does it make sense to extend this for arbitrary shuffles in the future? The mask construction logic in https://github.com/llvm/llvm-project/blob/4aa8cdfeebec115b928e0ccb452551b520d00f0b/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp#L1612-L1660 can probably mostly be reused for the icmp case.

Yes, I was debating if it was worth lifting that code or going for this more direct approach for splats-only. Based on the vector code that I've looked at so far, the splat case is by far the common case. I'll add a TODO here though in case we want to revisit that choice.

This revision was automatically updated to reflect the committed changes.
LuoYuanke added inline comments.
llvm/test/Transforms/InstCombine/gep-inbounds-null.ll
95

On X86 the result of the vector compare instruction would be in %k register, but there is no shuffle instruction for %k register. Here is the test case that was regressed due to this patch. We can duplicate it with "llc -mcpu=skylake-avx512". Any idea to improve it?

define void @before_canonicalization_i1(<16 x i1> %msk, i32 %in, <16 x i1>* %dst) {
entry:
  %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
  %splat = shufflevector <16 x i32> %insrt, <16 x i32> poison, <16 x i32> zeroinitializer
  %mul = mul <16 x i32> <i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789>, %splat
  %cmp1 = icmp eq <16 x i32> %mul, <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
  %and = and <16 x i1> %msk, %cmp1
  %bc = bitcast <16 x i1> %and to i16
  %cmp = icmp ne i16 %bc, 0
  br i1 %cmp, label %b1, label %b2
b1:
  store <16 x i1> %and, <16 x i1>* %dst, align 8
  ret void
b2:
  store <16 x i1> %cmp1, <16 x i1>* %dst, align 8
  ret void
}
define void @after_canonicalization_i1(<16 x i1> %msk, i32 %in, <16 x i1>* %dst) {
entry:
  %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
  %0 = mul <16 x i32> %insrt, <i32 789, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %1 = icmp eq <16 x i32> %0, zeroinitializer
  %cmp1 = shufflevector <16 x i1> %1, <16 x i1> poison, <16 x i32> zeroinitializer
  %and = and <16 x i1> %cmp1, %msk
  %bc = bitcast <16 x i1> %and to i16
  %cmp.not = icmp eq i16 %bc, 0
  br i1 %cmp.not, label %b2, label %b1
b1:
  store <16 x i1> %and, <16 x i1>* %dst, align 8
  ret void
b2:
  store <16 x i1> %cmp1, <16 x i1>* %dst, align 8
  ret void
}
spatel added inline comments.Nov 11 2021, 11:51 AM
llvm/test/Transforms/InstCombine/gep-inbounds-null.ll
95

This example (not sure if this was over-reduced from original code...) shows a few potential missed folds. I'm not sure yet why we failed all of them.

Reduced the vector length for readability:

define <2 x i1> @src(i32 %in) { 
  %insrt = insertelement <2 x i32> undef, i32 %in, i32 0
  %m = mul <2 x i32> %insrt, <i32 789, i32 poison>
  %r = icmp eq <2 x i32> %m, zeroinitializer
  ret <2 x i1> %r
}
  1. Scalarize the mul?
  2. Followed by scalarize the icmp?
  3. We also missed eliminating the mul - probably because we didn't recognize the constant with poison.

https://alive2.llvm.org/ce/z/TZyTU2

LuoYuanke added inline comments.Nov 12 2021, 6:32 AM
llvm/test/Transforms/InstCombine/gep-inbounds-null.ll
95

I mean we should avoid shuffle <X x i1> after icmp with AVX512 enabled, because there is no shuffle instruction for k register. Take below code as example.

cat shufvXi32.ll

define <16 x i1> @shuffle(<16 x i1> %msk, i32 %in) {
entry:
  %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
  %splat = shufflevector <16 x i32> %insrt, <16 x i32> poison, <16 x i32> zeroinitializer
  %mul = mul <16 x i32> <i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789>, %splat
  %cmp1 = icmp eq <16 x i32> %mul, <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
  %and = and <16 x i1> %msk, %cmp1
  ret <16 x i1> %and
}

opt -S < shufvxi32.ll -instcombine -o shufvXi1.ll

We get below transformed code.

define <16 x i1> @shuffle(<16 x i1> %msk, i32 %in) {
entry:
  %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
  %0 = mul <16 x i32> %insrt, <i32 789, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %1 = icmp eq <16 x i32> %0, zeroinitializer
  %cmp1 = shufflevector <16 x i1> %1, <16 x i1> poison, <16 x i32> zeroinitializer
  %and = and <16 x i1> %cmp1, %msk
  ret <16 x i1> %and
}

llc -mcpu=skylake-avx512 shufvXi32.l
We got below assembly

# %bb.0:                                # %entry
        vpsllw  $7, %xmm0, %xmm0
        vpmovb2m        %xmm0, %k1
        vpbroadcastd    %edi, %zmm0
        vpmulld .LCPI0_0(%rip){1to16}, %zmm0, %zmm0
        vptestnmd       %zmm0, %zmm0, %k0 {%k1}
        vpmovm2b        %k0, %xmm0
        vzeroupper
        retq

llc -mcpu=skylake-avx512 shufvXi1.ll
We got below assembly.

# %bb.0:                                # %entry
        vpsllw  $7, %xmm0, %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vmovd   %edi, %xmm2
        movl    $789, %eax                      # imm = 0x315
        vmovd   %eax, %xmm3
        vpmulld %xmm3, %xmm2, %xmm2
        vptestnmd       %zmm2, %zmm2, %k0
        vpmovm2w        %k0, %ymm2
        vpbroadcastw    %xmm2, %ymm2
        vpmovw2m        %ymm2, %k1
        vpcmpgtb        %xmm0, %xmm1, %k0 {%k1}
        vpmovm2b        %k0, %xmm0
        vzeroupper
        retq

You can see there is more instruction generated for shufvXi1.ll.

spatel added inline comments.Nov 12 2021, 9:13 AM
llvm/test/Transforms/InstCombine/gep-inbounds-null.ll
95

We can't really avoid the transform based on target constraints - it's a canonicalization, so it's up to some later pass to invert it if necessary.

I think the mul and maybe even the icmp are distractions from the real problem in these examples.

There's a question of should we transform a splat of bool to sext in IR:
https://alive2.llvm.org/ce/z/ra54Xe

But that doesn't appear to change codegen, so I think there needs to be backend fix that does that transform (or the inverse?). Is there a bug report for this example? If not, can you file it? Thanks!

LuoYuanke added inline comments.Nov 13 2021, 6:37 PM
llvm/test/Transforms/InstCombine/gep-inbounds-null.ll
95

I filed a bug (https://bugs.llvm.org/show_bug.cgi?id=52500) in Bugzilla. Thanks for the suggestion.