Using VPERMQ/VPERMPD allows memory folding of the (repeated) input, which VINSERTI128/VINSERTF128 cannot do.
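As a rough illustration of why the two instruction forms are interchangeable for this shuffle, the sketch below models both on plain arrays. It is not taken from the patch: the helper names (`vpermpd`, `vinsertf128_low`, `splat_low_half_matches`) are hypothetical, and the models assume the usual immediate encodings (two selector bits per element for VPERMPD, one lane-select bit for VINSERTF128). The point is that VPERMPD reads its single source once, so that source can be a memory operand, whereas the insert form needs the same value both as the base register and as the inserted half.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<double, 4>;

// Scalar model of VPERMPD: destination element i takes the source element
// selected by bits [2*i+1 : 2*i] of the immediate.
V4 vpermpd(const V4 &src, std::uint8_t imm) {
  V4 dst{};
  for (int i = 0; i < 4; ++i)
    dst[i] = src[(imm >> (2 * i)) & 3];
  return dst;
}

// Scalar model of VINSERTF128 where the inserted 128-bit value is the low
// half of `sub`; `lane` (0 or 1) picks which half of `base` is overwritten.
V4 vinsertf128_low(const V4 &base, const V4 &sub, int lane) {
  assert(lane == 0 || lane == 1);
  V4 dst = base;
  dst[2 * lane] = sub[0];
  dst[2 * lane + 1] = sub[1];
  return dst;
}

// The unary shuffle <0, 1, 0, 1> expressed both ways. The immediate 0x44
// encodes the element selectors 0, 1, 0, 1.
bool splat_low_half_matches(const V4 &x) {
  return vpermpd(x, 0x44) == vinsertf128_low(x, x, 1);
}
```

Note that `x` appears twice in the insert form and only once in the permute form, which is the property the description relies on.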
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
10584 ↗ | (On Diff #54076) | I may have missed it, but the advantage shown in the test changes is just that we get to use an instruction with a single input operand. Add a test to show the load folding win? |
Cheers Sanjay, my previous explanation was wrong - the actual problem is with the lowerV2X128VectorShuffle cases that use vinsertf128/vinserti128; the vperm2f128/vperm2i128 cases correctly fold for unary shuffles. I've added tests to demonstrate this.
I've updated the patch title/description accordingly.
test/CodeGen/X86/avx-vperm2x128.ll | ||
---|---|---|
65–66 ↗ | (On Diff #54107) | So this one could be 'vperm2f128' with a memop, couldn't it? Any idea why that didn't happen? |
test/CodeGen/X86/avx-vperm2x128.ll | ||
---|---|---|
65–66 ↗ | (On Diff #54107) | The insertf128 pattern is used instead for cases where we're inserting the lower half (so no extract) and the other half is already in place - this is the better thing to do on pre-AVX2 targets according to Agner's lists (especially on AMD targets, which are weak on 128-bit lane crossings). Fixing this in the memory fold code would be tricky, as the folding logic will see the input split in two and will assume it can't be folded, so it'll never arrive at foldMemoryOperandImpl. |
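The condition described in the comment above can be sketched as a mask check. This is a simplified illustration, not the LLVM lowering code: the helper name `is_low_half_insert` is hypothetical, and it assumes a 4 x 64-bit shuffle mask in which indices 0-3 select from the first input and 4-7 from the second.

```cpp
#include <array>

using Mask4 = std::array<int, 4>;

// Hypothetical helper: true when the low half of the result is the low half
// of the first input (already in place) and the high half is filled with the
// low 128 bits of either input, so no extract is needed before the insert.
bool is_low_half_insert(const Mask4 &m) {
  bool low_in_place = m[0] == 0 && m[1] == 1;
  bool high_from_v1_low = m[2] == 0 && m[3] == 1;
  bool high_from_v2_low = m[2] == 4 && m[3] == 5;
  return low_in_place && (high_from_v1_low || high_from_v2_low);
}
```

Under these assumptions the unary mask `<0, 1, 0, 1>` and the binary mask `<0, 1, 4, 5>` both qualify, while a lane swap such as `<2, 3, 0, 1>` does not.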
Thanks Sanjay. In the commit I was able to move the patch inside the insertf128 lowering code, which means that AVX2 targets still use perm2f128/perm2i128 in some unary shuffles.