[x86] allow vector load narrowing with multi-use values

Authored by spatel on Sat, Nov 3, 10:26 AM.



This is a long-awaited follow-up suggested in D33578. Since then, I think we've picked up even more opportunities for vector narrowing from changes like D53784, so there are a lot of test diffs. Apart from 2-3 strange cases, I think these are all wins.

I've structured this to be no-functional-change-intended for any target except for x86 because I couldn't tell if AArch64, ARM, and AMDGPU would improve or not. All of those targets have existing regression tests (4, 4, 10 files respectively) that would be affected. Also, Hexagon overrides the shouldReduceLoadWidth() hook, but doesn't show any regression test diffs. The trade-off is deciding if an extra vector load is better than a single wide load + extract_subvector.

For x86, this is almost always better (on paper at least) because we often can fold loads into subsequent ops and not increase the official instruction count. There's also some unknown -- but potentially large -- benefit from using narrower vector ops if wide ops are implemented with multiple uops and/or frequency throttling is avoided.
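To make the trade-off concrete, here is a hand-written IR sketch of the multi-use pattern in question (an illustration I wrote for this summary, not a test from the patch): a wide load whose value is consumed only through subvector extracts. Before this change, the multi-use load would block narrowing; with it, x86 may prefer two 128-bit loads over one 256-bit load plus an extract_subvector.

```llvm
; The <8 x float> load has two uses (both halves extracted via
; shuffles). Narrowing turns this into two 128-bit loads that can
; fold directly into the fadd, instead of a 256-bit load plus an
; extract of the high half.
define <4 x float> @both_halves(<8 x float>* %p) {
  %w = load <8 x float>, <8 x float>* %p, align 32
  %lo = shufflevector <8 x float> %w, <8 x float> undef,
                      <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %hi = shufflevector <8 x float> %w, <8 x float> undef,
                      <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %r = fadd <4 x float> %lo, %hi
  ret <4 x float> %r
}
```

Whether the extra load is a win depends on the target's load-port throughput and whether the narrow loads fold into their users; that is what the shouldReduceLoadWidth() hook lets each target decide.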

spatel created this revision.Sat, Nov 3, 10:26 AM
RKSimon added inline comments.Thu, Nov 8, 5:38 AM
493 ↗(On Diff #172493)

(Unrelated) We should investigate fixing this - <48 x i8> is unlikely to occur but tweaking the AVG patterns + SplitOpsAndApply to handle 'uneven split' cases should be relatively trivial.

24 ↗(On Diff #172493)

NFC changes - pull out?

1290 ↗(On Diff #172493)

@craig.topper Is this OK?

277 ↗(On Diff #172493)

This doesn't look great?

186 ↗(On Diff #172493)

Codesize regression?

spatel added inline comments.Thu, Nov 8, 7:36 AM
277 ↗(On Diff #172493)

Right - we can see this even in the existing codegen, because LowerINSERT_VECTOR_ELT does this:

// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.

But is there a better way to get a scalar into the high part of a wide vector?

Eg, AVX1 can't even splat from a register to a 256-bit vector, so a shuffle-based alternative would be:

	vmovd	%edi, %xmm0
	vpshufd	$36, %xmm0, %xmm0       ## xmm0 = xmm0[0,1,2,0]
	vinsertf128	$1, %xmm0, %ymm0, %ymm0
	vblendps	$127, LCPI0_0(%rip), %ymm0, %ymm0 ## ymm0 = mem[0,1,2,3,4,5,6],ymm0[7]
craig.topper added inline comments.Thu, Nov 8, 9:15 AM
1290 ↗(On Diff #172493)

I think so. kshift is 3 cycles. kmovd is one cycle. So this is probably better unless we have some terrible crossing penalty.

spatel updated this revision to Diff 173204.Thu, Nov 8, 11:52 AM

Patch updated:
No code changes; rebased test diffs after rL346433 (improve buildvector lowering) and rL346298 (cosmetic diffs in CHECK lines).
I think the only remaining open question is for pr34653.ll, so I'll take a look at that. IR shows a <38 x double> vector type on an AVX512F target...

spatel added inline comments.Thu, Nov 8, 1:33 PM
2 ↗(On Diff #173204)

This is a -O0 test, so it doesn't matter? With optimization, everything disappears:

pushq	%rbp
movq	%rsp, %rbp
andq	$-512, %rsp             ## imm = 0xFE00
subq	$1024, %rsp             ## imm = 0x400
movq	%rsp, %rdi
callq	_test
movq	%rbp, %rsp
popq	%rbp
nhaehnle removed a subscriber: nhaehnle.Fri, Nov 9, 5:45 AM
RKSimon accepted this revision.Fri, Nov 9, 6:25 AM

LGTM - thanks

2 ↗(On Diff #173204)

Agreed - it's annoying but doesn't matter.

This revision is now accepted and ready to land.Fri, Nov 9, 6:25 AM
This revision was automatically updated to reflect the committed changes.