This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX2] Prefer VPERMQ/VPERMPD over VINSERTI128/VINSERTF128 for unary shuffles
Closed, Public

Authored by RKSimon on Apr 18 2016, 9:52 AM.

Details

Summary

Using VPERMQ/VPERMPD allows the (repeated) input to be folded from memory, which VINSERTI128/VINSERTF128 cannot do.
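
As an illustration of the folding win (not from the original summary; the function name and the assembly in the comments are hypothetical), consider a unary shuffle that splats the lower 128-bit lane of a loaded vector, written in the style of the LLVM IR tests this patch touches:

  ; Splat the low 128-bit lane of a loaded <4 x i64>. With VPERMQ the
  ; load can fold straight into the shuffle, roughly:
  ;   vpermq $68, (%rdi), %ymm0
  ; A VINSERTI128 lowering must first materialize the whole input in a
  ; register, since it reuses it as both the base and the inserted half.
  define <4 x i64> @splat_lo_lane(<4 x i64>* %p) {
    %v = load <4 x i64>, <4 x i64>* %p
    %s = shufflevector <4 x i64> %v, <4 x i64> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
    ret <4 x i64> %s
  }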

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 54076. Apr 18 2016, 9:52 AM
RKSimon retitled this revision to [X86][AVX2] Prefer VPERMQ/VPERMPD over VPERM2I128/VPERM2F128 for unary shuffles.
RKSimon updated this object.
RKSimon added reviewers: mkuper, andreadb, spatel.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
spatel edited edge metadata. Apr 18 2016, 11:08 AM
lib/Target/X86/X86ISelLowering.cpp
Line 10584 (On Diff #54076)

I may have missed it, but the advantage shown in the test changes is just that we get to use an instruction with a single input operand. Add a test to show the load folding win?

RKSimon updated this revision to Diff 54107. Apr 18 2016, 1:34 PM
RKSimon retitled this revision from [X86][AVX2] Prefer VPERMQ/VPERMPD over VPERM2I128/VPERM2F128 for unary shuffles to [X86][AVX2] Prefer VPERMQ/VPERMPD over VINSERTI128/VINSERTF128 for unary shuffles.
RKSimon updated this object.
RKSimon edited edge metadata.

Cheers Sanjay, my previous explanation was wrong: the actual problem is with the lowerV2X128VectorShuffle cases that use vinsertf128/vinserti128; the perm2f128/perm2i128 cases already fold correctly for unary shuffles. I've added tests to demonstrate this.

I've updated the patch title/description accordingly.
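
For contrast with the summary example above (again illustrative, not one of the review's tests), a unary 128-bit lane swap is a case handled by the perm2f128/perm2i128 path, where the repeated input already folded correctly before this patch:

  ; Swap the two 128-bit lanes of a loaded <4 x i64>. This lowers via
  ; vperm2i128 (rather than vinserti128), and its unary form could
  ; already fold the loaded input into the shuffle's memory operand.
  define <4 x i64> @swap_lanes(<4 x i64>* %p) {
    %v = load <4 x i64>, <4 x i64>* %p
    %s = shufflevector <4 x i64> %v, <4 x i64> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
    ret <4 x i64> %s
  }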

spatel added inline comments. Apr 18 2016, 2:04 PM
test/CodeGen/X86/avx-vperm2x128.ll
Lines 65–66 (On Diff #54107)

So this one could be 'vperm2f128' with a memop, couldn't it? Any idea why that didn't happen?

RKSimon added inline comments. Apr 18 2016, 2:49 PM
test/CodeGen/X86/avx-vperm2x128.ll
Lines 65–66 (On Diff #54107)

The insertf128 pattern is used instead for cases where we're inserting the lower half (so no extract) and the other half is already in place; according to Agner's instruction tables this is the better choice on pre-AVX2 targets, especially AMD targets, which are weak at 128-bit lane crossings.

Fixing this in the memory fold code would be tricky: the folding logic sees the input split in two and assumes it can't be folded, so it never reaches foldMemoryOperandImpl.
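
A sketch of the situation being described, assuming an AVX1-only target (the function name is hypothetical and the assembly in the comments is illustrative): because the insertf128 lowering reuses the loaded value both as the base YMM register and as the inserted XMM half, the value has two uses, and the folding logic gives up before foldMemoryOperandImpl is ever consulted:

  ; On a pre-AVX2 target this splat still lowers along the lines of:
  ;   vmovapd (%rdi), %ymm0                ; load stays in a register...
  ;   vinsertf128 $1, %xmm0, %ymm0, %ymm0  ; ...since it is used twice
  define <4 x double> @splat_lo_pd(<4 x double>* %p) {
    %v = load <4 x double>, <4 x double>* %p
    %s = shufflevector <4 x double> %v, <4 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
    ret <4 x double> %s
  }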

spatel accepted this revision. Apr 18 2016, 3:20 PM
spatel edited edge metadata.

LGTM. Thanks!

This revision is now accepted and ready to land. Apr 18 2016, 3:20 PM
This revision was automatically updated to reflect the committed changes.

Thanks Sanjay. In the commit I was able to move the change inside the insertf128 lowering code, which means that AVX2 targets still use perm2f128/perm2i128 for some unary shuffles.