This is an archive of the discontinued LLVM Phabricator instance.

Lower certain build_vectors to insertps instructions
ClosedPublic

Authored by filcab on Apr 26 2014, 10:54 PM.

Details

Summary

Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.

Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.

Added some tests for different cases where this optimization triggers.

Diff Detail

Event Timeline

filcab updated this revision to Diff 8868.Apr 26 2014, 10:54 PM
filcab retitled this revision from to Lower certain build_vectors to insertps instructions.
filcab updated this object.
filcab edited the test plan for this revision. (Show Details)
filcab added reviewers: nadav, delena, craig.topper.
filcab added a subscriber: Unknown Object (MLST).
delena edited edge metadata.Apr 26 2014, 11:44 PM

I added some comments, please see inside.

  • Elena

Hi Elena,

Did you forget to add the comments? I didn't get them nor see them in phabricator.

Thank you,
Filipe

delena added inline comments.Apr 26 2014, 11:49 PM
lib/Target/X86/X86ISelLowering.cpp
5418

CorrectIdx here is boolean - 1 or 0. Further you do ++.

5432

Looks like you are looking for a splat vector. But if you want to use INSERTPS, your build-vector should include only one non-zero element.

6203

What happens for 8x32 and 16x32 vectors here?

test/CodeGen/X86/sse41.ll
331

Could you, please, explain what code you expect to see here? Is it only one insertps instruction?
Usually, such extract-insert chain we have for matrix transpose. But in this case the elements are extracted from different vectors.

filcab updated this revision to Diff 8871.Apr 27 2014, 12:05 AM
filcab edited edge metadata.

Fixed a bug pointed by Elena: only optimize v4x32 vectors, not bigger ones.

Let me know if you want these changed.

Thanks,
Filipe

lib/Target/X86/X86ISelLowering.cpp
5418

CorrectIdx is an unsigned int. It gets initialized to 0 or 1 according to that test.

5432

Not necessarily a splat vector.
Right now I'm only doing the optimization for a build_vector of elements from one single vector. But for an INSERTPS we can have up to 3 (non-zero) elements from one vector (if their destination index is the same as the source index) and one (non-zero) vector from another.
We can also set any position to zero.

This test is here to bail early and serve as a start for further optimizations. But I think if I optimized for every case, this diff would be too big. Optimizing for every case will also take time, since the IR for this can vary wildly (extractelement + insertelement + vectorshuffle, etc).

test/CodeGen/X86/sse41.ll
331

Eventually all these shuf_???? should be reduced to a single insertps instruction.
Right now my patch doesn't accomplish this, since we need to match several other cases (and also match on lowershufflevector).

I can match the exact code that should be emitted, if you prefer.
I can also match more code in the functions that aren't yet fully reduced to an insertps instruction.
Let me know if you want me to do any of these.

I figured splitting this optimization would be easier to review and accept and it still improves our code generation gradually.

Let's assume that you need to insert 2 or 3 elements that extracted from vector X.
But you create an INSERPS node that puts one element to UNDEF.

return DAG.getNode(X86ISD::INSERTPS, dl, VT, V, DAG.getUNDEF(VT),

InsertpsMask);
  • Elena

I'm not following. Those still work. I have examples in the tests and they succeed.

I just ran them through llc, as well as additional examples (two slight modifications to shuffle the zeros around, described in the name of the functions) and here's is llc's output (including the comments for insertps, which show what we want:

_shuf_XYZ0:                             ## @shuf_XYZ0
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $8, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2],zero
  retq
  .cfi_endproc

  .globl    _shuf_0YZW
  .align    4, 0x90
_shuf_0YZW:                             ## @shuf_0YZW
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $81, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2,3]
  retq
  .cfi_endproc

  .globl    _shuf_XY00
  .align    4, 0x90
_shuf_XY00:                             ## @shuf_XY00
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $12, %xmm0, %xmm0 ## xmm0 = xmm0[0,1],zero,zero
  retq
  .cfi_endproc

  .globl    _shuf_X0Z0
  .align    4, 0x90
_shuf_X0Z0:                             ## @shuf_X0Z0
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $10, %xmm0, %xmm0 ## xmm0 = xmm0[0],zero,xmm0[2],zero
  retq
  .cfi_endproc

If I misunderstood, please provide additional details.

Thanks,
Filipe

You see the right code because it is a small test and %xmm0 is used
Try to run this test

define < 4 x float> @test(<4 x float> %x) {

%vecext = extractelement <4 x float> %x, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecext1 = extractelement <4 x float> %x, i32 1
%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
%vecext3 = extractelement <4 x float> %x, i32 2
%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2
%vecinit5 = insertelement <4 x float> %vecinit4, float 0.0, i32 3

%mask = fcmp olt <4 x float> %vecinit5, %x

%res = select  <4 x i1> %mask, <4 x float> %x, <4 x float>%vecinit5
ret <4 x float> %res

}

  • Elena
filcab updated this revision to Diff 8872.Apr 27 2014, 1:48 AM

Fix the undef bug pointed by Elena.

Thanks for the detailed report, I finally understood the bug's cause :-)

Filipe
delena added inline comments.Apr 27 2014, 2:06 AM
lib/Target/X86/X86ISelLowering.cpp
5423

If it is a zeroNode, somebody should take care for this. I don't understand how do your tests with zeroes work.

5447

You can't insert V into V. if you want to "copy" 3 elements and insert 1, you should write
(INSERTPS, dl, VT, V, scalar_to_vector(elt), index)

If you want to copy 2 elements and insert 2, you can't use INSERTPS at all

test/CodeGen/X86/sse41.ll
353

What code is generated here?

filcab added inline comments.Apr 27 2014, 11:42 AM
lib/Target/X86/X86ISelLowering.cpp
5423

This is testing if this element of the build_vector is a zeroNode.
If it is, we're still ok to do the optimization, since we can insert 0 wherever we want.

What I can do is simply not check for zero or undef. It won't change CorrectIdx nor change the comparison of CorrectIdx and NumNonZero.

5447

For now this optimization is only dealing with inserting 0 in vectors. For inserting 0 in the vectors, it is acceptable to insert V into V (with countD == countS == the index of one of the elements that won't be turned into 0).

In the future, it will have to be changed to insert a V0 into a V1 (or vice-versa), with a special case for when we're moving an element inside V0.
e.g: (x,y,z,w) -> (x,z,z,w) or (x,y,z,w) -> (x,0,0,x), etc.

We're only using insertps iff NumNonZero (which was counted in Lowerbuildvector and is the number of non-zero+non-undef elements) is equal to CorrectIdx (which is the number of elements from V that are inserted in the new vector with the same index they had in V). Since we know this, we can use insertps with V as both vector arguments.

test/CodeGen/X86/sse41.ll
353
_shuf_XY00:                             ## @shuf_XY00
  .cfi_startproc
## BB#0:                                ## %entry
  insertps    $12, %xmm0, %xmm0 ## encoding: [0x66,0x0f,0x3a,0x21,0xc0,0x0c]
                                        ## xmm0 = xmm0[0,1],zero,zero
  retq                            ## encoding: [0xc3]
  .cfi_endproc
filcab updated this revision to Diff 8901.Apr 28 2014, 8:20 PM

Perform the optimization when we're moving a single element to a different
place in the vector, while zeroing out some other elements.

filcab updated this revision to Diff 8902.Apr 28 2014, 8:25 PM

(sorry about the last revision. I misused arc diff)

Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.
Also optimize when we move one element in a vector to a different place
in that vector while zeroing out some of the other elements.

Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.

Added some tests for different cases where this optimization triggers.

delena added inline comments.Apr 28 2014, 11:23 PM
lib/Target/X86/X86ISelLowering.cpp
5461

Let's assume that the first element of the BUILD_VECTOR is "undef" and the first element of EXTRACT does not go to any place. FirstNonZeroIdx = 2, for example. The INSERTPS instruction you use copies the first element to another place, because this instruction can manipulate with the first element only.

filcab added inline comments.Apr 29 2014, 1:07 AM
lib/Target/X86/X86ISelLowering.cpp
5461

Hi Elena,

I'm not understanding what you mean, sorry.
I tried making examples from your comment, but they seem to work and do the correct modification.

When we find a non-zero element that has the correct index, we copy it to the same place, IF there's no element that has to change index (we use that index for countS and countD in the insertpsmask). If there is an element that changes index, then we use its old and new indices for the insertpsmask..

Here's a couple of examples:
ll file: https://gist.github.com/filcab/11392909
llc -debug output: https://gist.github.com/filcab/11392943

Thanks,
Filipe

Please ignore my last message. It's my mistake. I don't see any other problem in the code itself. I just think that such bunch of extracts and inserts may be converted to one shuffle.

  • Elena

Yes, some of these kinds of functions do get turned into shufflevectors instead of build_vectors. I will address these in subsequent patches.

Is that ok?

Filipe

Ping. I'm not sure if your last message was an LGTM, Elena.

filcab accepted this revision.May 7 2014, 5:53 PM
filcab added a reviewer: filcab.
This revision is now accepted and ready to land.May 7 2014, 5:53 PM
filcab closed this revision.May 7 2014, 5:53 PM