[x86] use vperm2f128 rather than vinsertf128 when there's a chance to fold a 32-byte load
Closed, Public

Authored by spatel on Jun 6 2017, 6:43 AM.

Details

Summary

While looking closer at the x86 test diffs in D33866, I noticed that the first change shouldn't happen in the first place. This patch tries to resolve that.

According to Agner's tables and AMD docs, vperm2f128 and vinsertf128 have identical timing on any given CPU model, so we should be able to interchange them without affecting performance. But as some of the diffs here show, using vperm2f128 allows load folding, so we should take that opportunity to reduce code size and register pressure.
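
To make the load-folding point concrete, here is a minimal standalone sketch using AVX intrinsics (illustrative only; none of these function names come from the patch). Both functions build a YMM value whose high lane comes from memory, but only the vperm2f128 form consumes a full 256-bit source that the compiler can fold into a memory operand:

```
#include <immintrin.h>

// vinsertf128 form: the inserted operand is a 128-bit value, so only a
// 16-byte load can feed it directly.
__m256 insert_hi(__m256 a, const float *p) {
  __m128 lo = _mm_loadu_ps(p);               // 16-byte load
  return _mm256_insertf128_ps(a, lo, 1);     // result = { a.lo, p[0..3] }
}

// vperm2f128 form: both sources are 256-bit, so a 32-byte load from 'p' may
// be folded straight into the instruction's memory operand. Note that this
// form reads all 32 bytes at 'p' even though only the low half is used.
__m256 perm_hi(__m256 a, const float *p) {
  __m256 b = _mm256_loadu_ps(p);             // 32-byte load, foldable
  return _mm256_permute2f128_ps(a, b, 0x20); // result = { a.lo, b.lo }
}
```

When the shuffle's source is already a 32-byte load, selecting vperm2f128 lets that load fold away entirely, which is the code-size and register-pressure win described above.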

A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128 was introduced with AVX1, we should be using it in all of the same situations that we would with AVX2. If there's some reason that an AVX1 CPU would not want to use this instruction, I think that should be fixed up in a later pass.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Jun 6 2017, 6:43 AM
RKSimon added inline comments. Jun 6 2017, 7:14 AM
test/CodeGen/X86/avx-vperm2x128.ll
55 (On Diff #101552)

I wonder what's preventing this from using VBROADCASTF128?

spatel added inline comments. Jun 6 2017, 7:24 AM
test/CodeGen/X86/avx-vperm2x128.ll
55 (On Diff #101552)

I think it's just that we don't have the code to do the load shrinking + address offset. I.e., this is a 32-byte load even though we're only using half of it.
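
For reference, a rough sketch of what the shrinking would look like, written with AVX intrinsics rather than the actual lowering code:

```
#include <immintrin.h>

// Roughly what the current codegen does: a full 32-byte load, then a splat
// of the low 128-bit lane via vperm2f128.
__m256 splat_lo_wide(const float *p) {
  __m256 v = _mm256_loadu_ps(p);             // 32-byte load
  return _mm256_permute2f128_ps(v, v, 0x00); // result = { v.lo, v.lo }
}

// The shrunken form: load only the 16 bytes that are actually used and splat
// them with vbroadcastf128. Splatting the high lane instead would mean
// offsetting the address by 16 bytes.
__m256 splat_lo_narrow(const float *p) {
  return _mm256_broadcast_ps((const __m128 *)p); // 16-byte load + splat
}
```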

spatel added inline comments. Jun 6 2017, 7:44 AM
test/CodeGen/X86/avx-vperm2x128.ll
55 (On Diff #101552)

On second thought, it's more likely that we don't recognize this as a splat because we didn't see the canWidenShuffleElements() opportunity. I.e., these are 32-bit elements, so the mask isn't a simple splat.
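
For context, the widening idea in rough form. This is a hypothetical helper modeled on the concept behind canWidenShuffleElements(), not LLVM's actual implementation (which also handles undef mask entries):

```
#include <vector>

// Try to rewrite a shuffle mask in terms of elements twice as wide. Each
// pair of mask entries must address a consecutive, pair-aligned run of the
// narrower elements.
static bool widenShuffleMask(const std::vector<int> &Mask,
                             std::vector<int> &WideMask) {
  if (Mask.size() % 2 != 0)
    return false;
  WideMask.clear();
  for (size_t i = 0; i < Mask.size(); i += 2) {
    int Lo = Mask[i], Hi = Mask[i + 1];
    // The pair must be {2k, 2k+1} for some k.
    if (Lo < 0 || Hi != Lo + 1 || (Lo % 2) != 0)
      return false;
    WideMask.push_back(Lo / 2);
  }
  return true;
}
```

Applied twice, the v8f32 mask {0,1,2,3,0,1,2,3} widens to {0,1,0,1} and then to {0,0}, exposing the 128-bit lane splat that vbroadcastf128 could implement.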

RKSimon accepted this revision. Jun 10 2017, 9:39 AM

LGTM.

Looking at lowerV2X128VectorShuffle, shuffle combining will have a much easier time if we keep to 256-bit vectors (blends / X86ISD::VPERM2X128) as much as possible, since subvector extract/insert chains make combining really tricky. This patch's handling of the memory cases looks like a good first step.

This revision is now accepted and ready to land. Jun 10 2017, 9:39 AM

A couple of notes for reference:

  1. There's another potential case for trying harder to recognize a splat mask in PR32007:

https://bugs.llvm.org/show_bug.cgi?id=32007

  2. I looked at adding a VPERM2X128 case to combineTargetShuffle() that would turn this into X86ISD::SUBV_BROADCAST. It actually produced the expected vbroadcastf128 instruction, but I'm not sure how that matched because I didn't do anything to shrink the loaded value (!).

This revision was automatically updated to reflect the committed changes.