This is an archive of the discontinued LLVM Phabricator instance.

Added more insertps optimizations
ClosedPublic

Authored by filcab on Apr 30 2014, 8:25 PM.

Details

Summary

When inserting an element that's coming from a vector load or a broadcast
of a vector (or scalar) load, combine the load into the insertps
instruction.

Added test cases.

Diff Detail

Event Timeline

filcab updated this revision to Diff 9003.Apr 30 2014, 8:25 PM
filcab retitled this revision from to Added more insertps optimizations.
filcab updated this object.
filcab edited the test plan for this revision. (Show Details)
filcab added reviewers: delena, nadav, craig.topper.
filcab added a subscriber: Unknown Object (MLST).
filcab updated this revision to Diff 9459.May 15 2014, 6:08 PM

Pinging the patch, but also added a CHECK-NOT to a test.

Changed CHECK to CHECK-NOT instead of removing.

andreadb edited edge metadata.May 16 2014, 4:59 AM

Hi Filipe,

I think there is a better way to fix this bug.

For example, you can simply add the following ISel patterns to X86InstrSSE.td instead of adding new target combine rules in X86ISelLowering.cpp.

let Predicates = [UseSSE41] in {
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1), (loadv4f32 addr:$src2),
                imm:$src3)),
            (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1), (X86PShufd (v4f32 
                 (scalar_to_vector (loadf32 addr:$src2))), (i8 0)), imm:$src3)),
            (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
}

let Predicates = [UseAVX] in {
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1), (loadv4f32 addr:$src2),
                imm:$src3)),
            (VINSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
  def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
                (X86VBroadcast (loadf32 addr:$src2)), imm:$src3)),
            (VINSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
}

I am pretty sure that the four patterns above would cover all the interesting cases.
I personally prefer to have tablegen patterns instead of complicated target combine rules that would only trigger on the already legalized and immediately before instruction selection.

Also, you should add tests to verify the AVX codegen as well. Currently your new tests only verifies that we do the correct thing with SSE4.1. However, your patch also affects AVX (in fact, you specifically added combine rules for the case where the inserted element comes from a vbroadcast).

-Andrea

Hi Andrea,

Wouldn't the tablegen patterns be problematic when we use the load result
several times?
If we use it several times, then we shouldn't generate additional loads. Or
do you think it doesn't matter since they should be close together, since
we're generating for one BB only?

That was the reasoning behind doing it with code and guarding them with
MayFoldLoad, which includes hasOneUse().

I'll add the AVX tests today.

Filipe

Hi Andrea,

Wouldn't the tablegen patterns be problematic when we use the load result
several times?
If we use it several times, then we shouldn't generate additional loads. Or
do you think it doesn't matter since they should be close together, since
we're generating for one BB only?

The ISel pattern matcher will only reduce a dag sequence if the chain is "profitable to fold". A chain is never profitable, for example, if it contains nodes with more than one use.

In the case of your new test function 'insertps_from_vector_load', if I add another use for variable %1, then the following rule will no longer trigger.

def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1), (loadv4f32 addr:$src2),
                imm:$src3)),
            (INSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;

That is because eventually, method 'X86DAGtoDAGISel::IsProfitableToFold' will return false if the 'loadv4f32' has more than one use (see X86ISelDAGToDAG.cpp - around line 303). Basically, the matcher will try to verify that all the intermediate nodes between the X86Insertps (excluded) and the loadv4f32 (included), only have a single use.

Given how the ISel Matcher works, in case of a load result used several times, we will not generate additional loads.

That was the reasoning behind doing it with code and guarding them with
MayFoldLoad, which includes hasOneUse().

That reasoning is correct. However, the pattern Matcher will do the same and will never try to fold the loadv4f32 if it has more than one use. (P.s.: you can see it by yourself if you some experiments adding more uses for the load instruction and then you debug ISel passing flag -debug to llc).

I hope this helps.

Thanks Andrea. I will check that out and resubmit the patch.

Filipe

filcab updated this revision to Diff 9502.May 16 2014, 8:15 PM
filcab edited edge metadata.

Added the patterns that Andrea mentioned.

We still need to have PerformSELECTCombine for the case when we do a
vector load and extract one element of that vector. In this case we want
to narrow it to an element load.

The other cases are simplified by just dealing with them with patterns in X86InstrSSE.td

Added test cases for SSE4.1, AVX, and AVX2.

filcab updated this revision to Diff 9503.May 16 2014, 8:17 PM

Fixed the diff to not contain extra stuff that was in some trunk commits.

filcab updated this revision to Diff 9504.May 16 2014, 8:32 PM

Actually add AVX tests.
Sorry about the revisions. I'll wait for comments, now.

Hi Filipe,

I tested yout patch and it works for me.
If you address the (minor) comments below, then the patch looks good to me!

lib/Target/X86/X86ISelLowering.cpp
20287

It might be useful to have a comment here explaining why you need a shift.
When the source is a memory operand, the Count_S bits of the immediate operand are not used to select the floating point element from the source memory location.
That's why we have to extract the 'Count_S' bits from the immediate operand and use them as 'index' for a new load instruction.

lib/Target/X86/X86InstrSSE.td
6553

You forgot to enclose both patterns between curly braces.
It still works fine because we never produce an X86insertps dag node if we don't have SSE4.1 :-)

6564

Same here, you should enclose the following two patterns between curly braces.

test/CodeGen/X86/avx.ll
6–28 ↗(On Diff #9504)

These three tests are not part of this patch.
I think you should add those in a separate commit.

filcab closed this revision.May 19 2014, 12:53 PM
filcab updated this revision to Diff 9582.

Closed by commit rL209156 (authored by @filcab).

Thanks Andrea. Committed as r209156.

lib/Target/X86/X86InstrSSE.td
6553

The lack of {} is what you get when you add a second pattern and forget to add the {}.