This is an archive of the discontinued LLVM Phabricator instance.

Prefer blendps over insertps codegen for one special case [X86]
ClosedPublic

Authored by spatel on Mar 13 2015, 2:49 PM.

Details

Summary

I had originally made this a FIXME in D7866, but we're attacking the problem from different angles now. If we don't have a target-specific combine on insertps, we need to generate the right code in the first place.

With this patch, for this one exact case, we'll generate:

blendps %xmm0, %xmm1, $1

instead of:

insertps %xmm0, %xmm1, $0

If there's a memory operand available for load folding and we're optimizing for size, we'll still generate the insertps.

The detailed performance data motivation for this may be found in D7866; in summary, blendps has 2-3x throughput vs. insertps on widely used chips.

Diff Detail

Event Timeline

spatel updated this revision to Diff 21953.Mar 13 2015, 2:49 PM
spatel retitled this revision from to Prefer blendps over insertps codegen for one special case [X86].
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: chandlerc, qcolombet, mkuper, ab.
spatel added a subscriber: Unknown Object (MLST).
qcolombet added inline comments.Mar 20 2015, 11:23 AM
lib/Target/X86/X86ISelLowering.cpp
10520

Instead of checking for OptimizeForSize, I would check for MinSize or both.

10521

As soon as there is a folding opportunity, shouldn’t it be better to use it?
Could you check that with IACA?

spatel added inline comments.Mar 20 2015, 12:25 PM
lib/Target/X86/X86ISelLowering.cpp
10520

Hi Quentin -

Thanks for looking at the patch. I had not seen MinSize used before. That corresponds to -Oz?

10521

I checked this with real code running on SandyBridge, Haswell, and Jaguar. Load folding does not improve performance here. The usage of insertps is the limiting factor because it can only execute on one port.

Here's the SB result from the earlier patch for a microbenchmark including loads:

blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).

qcolombet edited edge metadata.Mar 20 2015, 12:40 PM

LGTM with the MinSize fix.

Q.

lib/Target/X86/X86ISelLowering.cpp
10520

Yes, it is Oz.

10521

Thanks for checking.

spatel updated this revision to Diff 22374.Mar 20 2015, 1:45 PM
spatel edited edge metadata.

Patch updated based on feedback from Quentin:
Changed function attribute check from 'optsize' to 'minsize' (-Os vs. -Oz)

qcolombet accepted this revision.Mar 20 2015, 1:51 PM
qcolombet edited edge metadata.

Thanks Sanjay!

Q.

This revision is now accepted and ready to land.Mar 20 2015, 1:51 PM
This revision was automatically updated to reflect the committed changes.