[X86][SSE] Add support for extending bool vectors bitcasted from scalars.
ClosedPublic

Authored by RKSimon on Jul 12 2017, 12:41 PM.

Details

Summary

This patch acts as the reverse of combineBitcastvxi1 - bitcasting a scalar integer to a boolean vector and extending it 'in place' to the requested legal type.

This doesn't handle AVX512 at all yet, but the existing mask-register approach falls short in some cases.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jul 12 2017, 12:41 PM
igorb added a subscriber: igorb.Jul 12 2017, 11:31 PM
delena edited edge metadata.Jul 15 2017, 12:07 PM

I'm thinking about another algorithm.
Let's assume that we have sext (bitcast i8 to <8 x i1>) to <8 x i16>.

I can do the following:

  1. broadcast the i8 over the xmm register
  2. "pand" with a vector of bit masks (1, 2, 4, 8, 16, ...)
  3. compare with zeroinitializer

If we have "zext" instead of "sext", we can add a shift-right at the end.

That makes sense. There may have to be special cases for older CPUs with weaker vector shift instructions, but I'll see what I can do.

RKSimon updated this revision to Diff 106865.Jul 17 2017, 6:55 AM

Moved to AND+CMPEQ pattern suggested by @delena.

This required a modification to LowerVSETCC to reverse a DAG combine: (X & Y) != 0 --> (X & Y) == Y iff Y is a power of 2. This avoids an inversion and the creation of 0/-1 bit vectors.
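The identity being reversed here holds because a power-of-2 mask Y selects a single bit, so (X & Y) can only be 0 or Y. A quick exhaustive sanity check (illustrative only; `check_identity` is a hypothetical name, not part of the patch):

```python
def check_identity(bits=8):
    """Verify (X & Y) != 0  <=>  (X & Y) == Y for every power-of-2 Y."""
    for y in (1 << i for i in range(bits)):   # every single-bit mask
        for x in range(1 << bits):
            assert ((x & y) != 0) == ((x & y) == y)
    return True
```

This is why the compare can be phrased as an equality against the mask itself, dodging the 0/-1 inversion the != 0 form would need.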

What should be returned at the end according to the calling convention - zext or sext? In some cases you add a shift-right in order to return a zext value. In the AVX-512 case the last instruction is vpmovm2*, which means sext.

test/CodeGen/X86/bitcast-int-to-vector-bool.ll
24 ↗(On Diff #106865)

One "mov" is enough here. you, probably, use "zext" instead of "anyext". (I did not look at the code yet)

120 ↗(On Diff #106865)

pshufb can be used to broadcast the i8.

133 ↗(On Diff #106865)

Why is the load not folded into the vpand?

RKSimon added inline comments.Jul 18 2017, 2:34 AM
test/CodeGen/X86/bitcast-int-to-vector-bool.ll
24 ↗(On Diff #106865)

The code requested an ANY_EXTEND, but for some reason it's choosing to do a movz - probably partial-register logic kicking in.

120 ↗(On Diff #106865)

Yes, we don't normally combine to pshufb until depth == 3 due to the size/load costs of a shuffle mask, but in this case we just need a (nearly free) zero shuffle mask, so it would be safe to support it. Any additional shuffle combine would then take us over the depth limit.

133 ↗(On Diff #106865)

xmm1 is used twice: first as the mask and then in the comparison below ((X & M) == M).

This required a modification to LowerVSETCC to reverse a DAG combine: (X & Y) != 0 --> (X & Y) == Y iff Y is a power of 2. This avoids an inversion and the creation of 0/-1 bit vectors.

Can you do this in a separate commit?

lib/Target/X86/X86ISelLowering.cpp
34419 ↗(On Diff #106865)

EltSizeInBits should be equal to NumElts.

test/CodeGen/X86/bitcast-int-to-vector-bool.ll
237 ↗(On Diff #106865)

I can give a better sequence here:
vmovd %edi, %xmm0
shrl $16, %edi
vmovd %edi, %xmm1
vinserti128 $1, %xmm1, %ymm0, %ymm0
vpshufb .LCPI0_0(%rip), %ymm0, %ymm0 # ymm0 = ymm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17]

It comes from

define <32 x i8> @foo(i32 %a) {
  %b = lshr i32 %a, 16
  %vec = insertelement <8 x i32> undef, i32 %a, i32 0
  %vec1 = insertelement <8 x i32> %vec, i32 %b, i32 4
  %nvec = bitcast <8 x i32> %vec1 to <32 x i8>
  %res = shufflevector <32 x i8> %nvec, <32 x i8> undef, <32 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0,
                                                                    i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1,
                                                                    i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16,
                                                                    i32 17, i32 17, i32 17, i32 17, i32 17, i32 17, i32 17, i32 17>
  ret <32 x i8> %res
}
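The net effect of the vinserti128+vpshufb sequence above is that byte k of the i32 ends up replicated across lanes 8k..8k+7, ready for the per-lane AND+CMPEQ step. A small Python model of just that byte-broadcast (illustrative only; `broadcast_i32_bytes` is a hypothetical name):

```python
def broadcast_i32_bytes(a):
    """Replicate each of the four bytes of a 32-bit value into
    eight consecutive byte lanes, giving 32 lanes total."""
    bytes_ = [(a >> (8 * k)) & 0xFF for k in range(4)]
    return [bytes_[k] for k in range(4) for _ in range(8)]
```

The lshr-by-16 plus insert at element 4 is what moves bytes 2 and 3 into the high 128-bit half, so a single in-lane vpshufb can finish the job without a cross-lane shuffle.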

But I agree that this optimization should not be part of this patch.

RKSimon updated this revision to Diff 116215.Sep 21 2017, 10:30 AM

Rebased before resuming work on this

RKSimon marked 2 inline comments as done.Sep 21 2017, 10:31 AM
RKSimon added inline comments.Sep 21 2017, 10:35 AM
lib/Target/X86/X86ISelLowering.cpp
34419 ↗(On Diff #106865)

EltSizeInBits means the extended vector element size (v8i16: NumElts == 8, EltSizeInBits == 16). SclVT.getSizeInBits() would equal NumElts.

RKSimon updated this revision to Diff 116259.Sep 21 2017, 1:43 PM

Added an assert that NumElts == SclVT.getSizeInBits() to make this more obvious.

delena accepted this revision.Sep 23 2017, 11:42 PM

LGTM. Thank you.

This revision is now accepted and ready to land.Sep 23 2017, 11:42 PM
This revision was automatically updated to reflect the committed changes.