Page MenuHomePhabricator

Optimization for certain shufflevector by using insertps.

Authored by filcab on Apr 23 2014, 2:46 PM.



If we're doing a v4f32 shuffle on x86 with SSE4.1, we can lower certain
shufflevectors to an insertps instruction:
When most of the shufflevector result's elements come from one vector (and
keep their index), and one element comes from another vector or a memory

Added tests for insertps optimizations on shufflevector.

Diff Detail

Event Timeline

filcab updated this revision to Diff 8781.Apr 23 2014, 2:46 PM
filcab retitled this revision from to Optimization for certain shufflevector by using insertps..
filcab updated this object.
filcab edited the test plan for this revision. (Show Details)
filcab added a reviewer: nadav.
filcab added a subscriber: Unknown Object (MLST).
silvas added a subscriber: silvas.Apr 23 2014, 5:49 PM
silvas added inline comments.

Tiny style nit: the canonical LLVM counted loop is

for (int i = 0, e = Foo.size(); i != e; ++i)
filcab updated this revision to Diff 8785.Apr 23 2014, 6:06 PM

Fixed llvm-style nit pointed by Sean Silva.

nadav edited edge metadata.Apr 23 2014, 9:50 PM

I did not review the patch carefully but from a quick look it looks fine. Andrea, what do you say?

delena added a subscriber: delena.Apr 24 2014, 12:18 AM

Hi Filipe,

  1. you can check INSERTPS mask outside NormalizeVectorShuffle(), where all other masks have been checked.
  2. you can use insertps for v4i32 as well.
  1. I think that folding load in insertps is not fully correct

you translate this IR to (add + insertps) commands.

%0 = load <4 x float>* %pb, align 16
%vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>

insertps loads 4 bytes instead of 32. You lose exceptions. It is ok for OpenCL but other compilers can't ignore exceptions.

And in general, I'm not sure that
add + insertps-load-form is better than load + insertps

  1. About tests: Why do you check X32 and X64 separately?
  • Elena
andreadb added a subscriber: andreadb.

Hi filipe,

I think your patch in general looks good except for a few things (see comments below).


I would change this into:

if (VT != MVT::v4i32 && VT != MVT::v4f32)

It probably makes sense to add a comment explaining that it is safe to call this function only when the shuffle mask is a valid INSERTPS mask (according to your new isINSERTPSMask function). Otherwise, DestIndex would be wrongly computed.


I would remove the bool 'HasAVX' since it is not used.
That extra argument can be added in future once we decide to optimize for AVX cases too.


You should also support MVT::v4i32.


As Elena pointed out,
it is unsafe to transform the load+insertps into an add+insertps with memory operand. If you force that transformation, then the alignment constraint would not be honored.

According to the instruction set reference, no alignment is required for the insertps instruction (unless alignment checking is enabled). If alignment checking is enabled, the check would ensure that the address is aligned to 4 (and not 16).

The good thing is that, even without that if-stmt, your transformation would still improve all the test cases you added (i.e. for the load+shuffle case, you would get a movaps+insertps instead of movaps+shufps+shufps).


Without the 'load+insertps -> add+insertps' transformation, this test would still be improved by your patch.

With your patch (excluding the if-stmt between lines 7315:7336 ), the code generate is better than it was before applying your patch. We now get a movaps+insertps; before we produced instead a movaps followed by two shufps instructions.

I would change the test adding explicit checks for the aligned load (movaps) and the insertps.


Elena is right.
Since you are matching the same code for X64 and X32, I would suggest adding a common check-prefix (example: --check-prefix=CHECK) to both RUN lines.

You can use a single CHECK-LABEL in each test function (rather than having two identical checks [X32-LABEL and X64-LABEL] for every function).
Your new tests would only use CHECK (and not X32/X64).

I would add extra tests to verify that your lowering rule works fine with <4 x i32> vectors.


Your parch improves the code generation also in the following case:

define <4 x float> @foo(<4 x float> %a, float* %b) {
  %1 = load float* %b, align 4
  %2 = insertelement <4 x float> undef, float %1, i32 0
  %result = shufflevector <4 x float> %a, <4 x float> %2, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
  ret <4 x i32> %result

That is because your fix helps matching the pattern defined for def rm : SS4AI8 that selects an INSERTPS with a memory operand. (See - multiclass SS41I_insertf32).

With your patch, the backend would produce a single insertps for the test above. Before (i.e. without your patch), we produced the following long sequence of instructions (on X64):

movss  (%rdi), %xmm1
movaps %xmm0, %xmm2
movss %xmm1, %xmm2
shufps $36, %xmm2, %xmm0

I suggest to add that test as well (and another test to verify that the code is simplifed even if we have <4 x i32> vector types).

Hi Elena and Andrea,

1 - I've changed the check + change to NormalizeVectorShuffe(), I will update the patch soon;

2 - This won't be as easy as it looks (see below);

3 - I don't understand what the problem would be in making the load smaller. Especially since we do these kinds of load reductions in other places (including target independent opts like in DAGCombiner::ReduceLoadOpStoreWidth, and others). The alignment requirements don't get stricter, so we shouldn't output code that might break when the original wouldn't. And if I'm not mistaken, the alignment exceptions would fall under undefined behaviour, which wouldn't make this optimization invalid;

By the way, with the current transform, we would never generate an add instruction for the insertps. We just generate the insertps with the address calculation in it (just to be sure we're on the same page).

4 - I will change the test. After this fix I'll also change other tests in the same file that similarly check for the same text on X32 and X64.

About supporting i32 besides f32:

I can simply do the transform for MVT::v4i32 too, but it will not optimize as much as the current one does for f32.
It will optimize some cases but, for example, we might still get movsd+insertps when insertps should suffice.

To fix this problem, we add a bunch of new ones:

I don't know that much about the backend, but there doesn't seem to be a way to do it without creating a new SDNode type for INSERTPS (I don't know what it would be called, since INSERTPS is already used). (
I can't find a way to have, on the type profile, an OR condition, which would allow us to define X86insertps as getting a f32 or i32 from memory.
Without this, the pattern we generate in won't match due to its types (even if we define a whole new multiclass for the i32 case).

I think the best course of action would be to:

  • Add some more tests, for the i32 cases (and collapse the tests to a common check-prefix);
  • Do the transform on both i32 and f32, for now, even if the generated instructions aren't perfect;
  • Later figure out how to better express the insertps for i32 in the backed (either with a new instruction, or a simpler one-time hack for this one).

What do you think?


Just a small addition that I will investigate tomorrow:
The whole “get an i32 from memory into an xmm element” gets much easier if my transform emits a pinsrd for that case. I will change it to use pinsrd (for the i32 from mem case) tomorrow.

filcab updated this revision to Diff 8858.Apr 25 2014, 4:30 PM
filcab edited edge metadata.

Added the optimization for v4i32 (but we emit pinsrd instead of insertps when loading from memory).
Addresses the concerns in the review, except for the load minimization, due to discussion on IRC, existing optimizations that reduce load sizes, and lack of response to my question.

andreadb accepted this revision.Apr 25 2014, 4:51 PM
andreadb edited edge metadata.
This revision is now accepted and ready to land.Apr 25 2014, 4:51 PM
filcab closed this revision.Apr 25 2014, 7:09 PM