This is an archive of the discontinued LLVM Phabricator instance.

[X86] tranform insertps to blendps when possible for better performance
AbandonedPublic

Authored by spatel on Feb 24 2015, 1:41 PM.

Details

Summary

This patch adds a target-specific combine to transform insertps nodes into blendi nodes. We just have to check to see if a translation of the immediate mask is possible.

Insertps has less potential throughput than blendps on all x86 chips that I have surveyed. For example on Haswell, we can execute blendps on 3 different ports, but insertps is limited to 1. On Sandybridge, PIledriver, and Bulldozer, it's 2 vs. 1.

Doing this transform also reduces the number of patterns we have to match when optimizing scalar SSE code.

Diff Detail

Event Timeline

spatel updated this revision to Diff 20619.Feb 24 2015, 1:41 PM
spatel retitled this revision from to [X86] tranform insertps to blendps when possible for better performance.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: mkuper, chandlerc, RKSimon.
spatel added a subscriber: Unknown Object (MLST).
spatel updated this revision to Diff 20621.Feb 24 2015, 1:45 PM

Updated patch: I had commented out the patterns in X86InstrSSE.td rather than deleting them.

I wrote a test program that executes 100 of each inst type to confirm that we do achieve double the throughput on SandyBridge using blendps:
blendps : 5406907580 cycles for 150000000 iterations (36.05 cycles/iter).
insertps: 10869956010 cycles for 150000000 iterations (72.47 cycles/iter).

I also found that on AMD Jaguar (btver2), blendps has half the latency of insertps, so it's an even bigger win there.

ab added a subscriber: ab.Feb 26 2015, 4:05 PM
chandlerc edited edge metadata.Feb 27 2015, 2:01 PM

I'm not really sold on doing this as a target combine. Is there some reason we can't just produce the desired insertps or blendps when lowering? This doesn't seem likely to come up only after doing some other shuffle lowering, but maybe I'm not seeing why.

lib/Target/X86/X86ISelLowering.cpp
22913

Please use an explicit type. Its short and removes questions.

22915

This use of auto hurts readability IMO.

22932–22935

I think we need to fix this always, and I think we should just handle it here in the "may fold load" case.

The primary reason to use insertps is to fold a scalar load into the shuffle. Switching to blendps is a huge mistake there because new we have to do some scalar -> vector first. In many cases, we may end up emitting insertps+blendps or movss+blendps which doesn't seem like the right lowering to me.

22937–22959

Can you use something like getTargetShuffleMask to handle all of this?

spatel added a comment.Mar 1 2015, 9:24 AM

I'm not really sold on doing this as a target combine. Is there some reason we can't just produce the desired insertps or blendps when lowering? This doesn't seem likely to come up only after doing some other shuffle lowering, but maybe I'm not seeing why.

Let me answer this question first before going to wrangle up some data on the other question:
I made this a target combine because I don't know how else to handle this case given our current intrinsic lowering:

define <4 x float> @blendps(<4 x float> %x, <4 x float> %y) {
  %0 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %x, <4 x float> %y, i8 0)
  ret <4 x float> %0
}

This is the IR produced when a programmer uses SSE intrinsics in C source. It directly becomes an INSERTPS node via:

X86_INTRINSIC_DATA(sse41_insertps,    INTR_TYPE_3OP, X86ISD::INSERTPS, 0)
spatel added inline comments.Mar 2 2015, 2:52 PM
lib/Target/X86/X86ISelLowering.cpp
22932–22935

I think I understand the concern now. You're worried about the cases where we're loading into one of the higher lanes. That could easily require a bonus shuffle instruction if converted to a blendps. Let me resubmit the patch to only handle the single case of the low 32-bit lane because that should just be a movss from memory.

I was focused on the low element case. Ie, which of these exact sequences is better:

vmovss   C0(%rip), %xmm1   <--- load into low lane; no shuffling before blendps
vblendps $1, $xmm1, $xmm0, $xmm0

vs.

vinsertps $0, C0(%rip), %xmm0, %xmm0

Sequences of movss+blendps have better throughput on SandyBridge and Haswell than insertps.

Here's what IACA shows for the load cases:
SandyBridge: 2x throughput
Haswell: 2x throughput (we're limited by the loads here; blendps has 3x throughput on its own)

I wrote a test program to confirm the load case performance on SandyBridge:
blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).

This is for a string of 100 independent shuffle ops like:

vmovss          ones(%rip), %xmm0
vblendps        $1, %xmm0, %xmm1, %xmm1
vmovss          ones(%rip), %xmm0
vblendps        $1, %xmm0, %xmm2, %xmm2
...

On AMD Jaguar, the independent strings of load op versions of blendps and insertps perform equivalently. For a string of dependent load+shuffle ops, I see the same 2x perf win for blendps due to the lower latency of the blendps instruction.

insertps is an extra special wart in the SSE instruction set: it can't be extended to longer vectors (AVX, AVX512), so it will never get any extra transistors thrown its way to improve its performance relative to better-defined vector instructions. We should be careful generating insertps...some day it may end up microcoded.

spatel updated this revision to Diff 21055.Mar 2 2015, 3:47 PM
spatel edited edge metadata.

Updated patch to:

  1. Only handle the low element to low element insertps case (immediate == 0)
  2. Added test case to confirm that we're only transforming the low element insertps case
  3. Removed use of 'auto'

Sorry for the delays...

spatel added a comment.Mar 4 2015, 7:55 AM

@chandlerc wrote:

I think we just need to change the SSE intrinsics to use generic shuffle IR rather than intrinsics. We shouldn't be worrying about
re-combining the LLVM instruction intrinsics in the backend to speed things up. We should insist that code use generic IR as input if
they want this kind of combining.

That's the 2nd suggestion I've gotten to rework the intrinsics in a week, so I guess I can't ignore that angle any longer. :)

So yes, let's put this patch on hold and see what happens when we start sending more generic shuffles down the line.

spatel abandoned this revision.Mar 13 2015, 2:55 PM

Abandoning:
Attempting to address the FIXME in D8332.
Patching the C header files to generate more generic shuffles is also underway.