
[InstCombine] shrink truncated insertelement with constant operand
Closed, Public

Authored by spatel on Feb 18 2017, 10:53 AM.

Details

Summary

This is the 2nd part of solving:
http://lists.llvm.org/pipermail/llvm-dev/2017-February/110293.html

D30123 would move the trunc ahead of the shuffle, and this patch moves the trunc ahead of the insertelement. I'm assuming that we're OK transforming insertelement instructions in general (unlike shufflevector), since those are simple operations that any target should handle.

I neglected to include FP truncate in D30123, so if this looks correct, I can update that patch to do a similar conversion or make a follow-up patch for that.
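
A minimal IR sketch of the intended fold (the value names here are illustrative, not taken from the patch): the trunc is pulled ahead of the insertelement, the vector and scalar are truncated separately, and the constant operand folds away at compile time:

  ; before
  %ins = insertelement <4 x i32> %v, i32 42, i32 0
  %r = trunc <4 x i32> %ins to <4 x i16>

  ; after
  %v.narrow = trunc <4 x i32> %v to <4 x i16>
  %r = insertelement <4 x i16> %v.narrow, i16 42, i32 0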

Diff Detail

Event Timeline

spatel created this revision. Feb 18 2017, 10:53 AM
zvi added a subscriber: zvi. Feb 21 2017, 6:45 AM
efriedma edited edge metadata. Mar 7 2017, 11:01 AM

On the integer side, I'm sort of worried this could make code generation worse on targets which don't have insert instructions for all relevant widths (for example, SSE2 has i16 insertelement, but not i8 insertelement).

For fptrunc, you could in theory hurt performance if the insert is overwriting a denormal float, but I guess that's unlikely to happen in practice.
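
A minimal IR sketch of the fptrunc concern (hypothetical values, with the constant on the scalar side): after the fold, the fptrunc also runs on the lane the insert was about to overwrite, which is where a denormal could hurt:

  ; before: lane 0 of %a never goes through fptrunc
  %ins = insertelement <4 x double> %a, double 1.0, i32 0
  %f = fptrunc <4 x double> %ins to <4 x float>

  ; after: lane 0 of %a (possibly a denormal) is truncated and then discarded
  %a.narrow = fptrunc <4 x double> %a to <4 x float>
  %f = insertelement <4 x float> %a.narrow, float 1.0, i32 0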

On the integer side, I'm sort of worried this could make code generation worse on targets which don't have insert instructions for all relevant widths (for example, SSE2 has i16 insertelement, but not i8 insertelement).

If it's any consolation, the current x86 codegen for tests similar to the ones in this patch -- even with newer features like AVX2 -- looks awful. I can file a bug report with this example and others:

define <8 x i16> @wide_insert_into_constant_vector(i32 %x) {
  %vec = insertelement <8 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 3, i32 -2, i32 65536>, i32 %x, i32 1
  %trunc = trunc <8 x i32> %vec to <8 x i16>
  ret <8 x i16> %trunc
}

$ ./llc -o - shrinkinselt.ll -mattr=avx2
...
	movl	$1, %eax
	vmovd	%eax, %xmm0
	vpinsrd	$1, %edi, %xmm0, %xmm0
	movl	$3, %eax
	vpinsrd	$2, %eax, %xmm0, %xmm0
	movl	$4, %eax
	vpinsrd	$3, %eax, %xmm0, %xmm0
	vinserti128	$1, LCPI0_0(%rip), %ymm0, %ymm0
	vpshufb	LCPI0_1(%rip), %ymm0, %ymm0 ## ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
	vpermq	$232, %ymm0, %ymm0      ## ymm0 = ymm0[0,2,2,3]
                                        ## kill: %XMM0<def> %XMM0<kill> %YMM0<kill>
	vzeroupper
	retq

Not sure how to explain this...

We can ignore deficiencies which are obviously bugs; I was thinking of something more like this:

define <16 x i8> @trunc_inselt1(<16 x i16> %a, i16 %x) {
  %vec = insertelement <16 x i16> %a, i16 %x, i32 1
  %trunc = trunc <16 x i16> %vec to <16 x i8>
  ret <16 x i8> %trunc
}

define <16 x i8> @trunc_inselt2(<16 x i16> %a, i8 %x) {
  %trunc = trunc <16 x i16> %a to <16 x i8>
  %vec = insertelement <16 x i8> %trunc, i8 %x, i32 1
  ret <16 x i8> %vec
}

For trunc_inselt2, the x86 backend produces:

trunc_inselt2:                          # @trunc_inselt2
        .cfi_startproc
# BB#0:
        movdqa  .LCPI1_0(%rip), %xmm2   # xmm2 = [255,255,255,255,255,255,255,255]
        pand    %xmm2, %xmm1
        pand    %xmm2, %xmm0
        packuswb        %xmm1, %xmm0
        movdqa  .LCPI1_1(%rip), %xmm1   # xmm1 = [255,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
        pand    %xmm1, %xmm0
        movd    %edi, %xmm2
        psllw   $8, %xmm2
        pandn   %xmm2, %xmm1
        por     %xmm1, %xmm0
        retq

The pand+movd+psllw+pandn+por sequence is clearly a lot worse than a pinsrw.

We can ignore deficiencies which are obviously bugs; I was thinking of something more like this: [the trunc_inselt1/trunc_inselt2 examples and SSE2 output quoted above]

This patch won't fire on your example because the scalar or the vector must be a constant. I'm trying to find an affected case that isn't blindingly bad already, but I haven't seen it yet. :)

Replace %x in my testcase with 3, and you get the same result. Granted, there are better ways to generate code than what the x86 backend currently does, but the end result would still be worse than the version with pinsrw.
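
A sketch of that modified testcase (the function name is made up); the vector operand %a is still arbitrary, so the constant scalar alone satisfies the constant-operand check:

  define <16 x i8> @trunc_inselt1_const(<16 x i16> %a) {
    %vec = insertelement <16 x i16> %a, i16 3, i32 1
    %trunc = trunc <16 x i16> %vec to <16 x i8>
    ret <16 x i8> %trunc
  }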

I'd be more comfortable special-casing insertelement into an undef vector (so there's an obvious efficient lowering on any architecture).

spatel added a comment. Mar 7 2017, 2:16 PM

I'd be more comfortable special-casing insertelement into an undef vector (so there's an obvious efficient lowering on any architecture).

Yes, I think that's a safe intermediate step that still handles the motivating case. I'll update the patch.
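
A minimal IR sketch of the narrowed fold (illustrative names), assuming it now fires only when the insert's vector operand is undef:

  ; before
  %ins = insertelement <4 x i32> undef, i32 %x, i32 0
  %r = trunc <4 x i32> %ins to <4 x i16>

  ; after
  %x.narrow = trunc i32 %x to i16
  %r = insertelement <4 x i16> undef, i16 %x.narrow, i32 0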

spatel updated this revision to Diff 90948. Mar 7 2017, 3:11 PM

Patch updated:

  1. Limit the transform to insertion into an undef vector to avoid backend problems.
  2. Add tests for undef.
  3. Add TODO comments for original tests.
efriedma accepted this revision. Mar 7 2017, 3:18 PM

LGTM

This revision is now accepted and ready to land. Mar 7 2017, 3:18 PM
This revision was automatically updated to reflect the committed changes.