This is an archive of the discontinued LLVM Phabricator instance.

optimize merging of scalar loads for 32-byte vectors [X86, AVX] (PR21710)
ClosedPublic

Authored by spatel on Dec 4 2014, 1:04 PM.

Details

Summary

This patch fixes the poor codegen seen in PR21710 ( http://llvm.org/bugs/show_bug.cgi?id=21710 ). Before we crack 32-byte build vectors into smaller chunks (and then glue them back together), we should look for the easy case where we can just load all of the elements in a single op.
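For reference, the shape of IR that hits this path looks roughly like the following (a hypothetical reduction in the typed-pointer IR syntax of the time; names are illustrative and not copied from the bug report or the committed test):

define <8 x float> @merge_8_floats(float* %ptr) {
  ; Eight consecutive scalar loads from the same base pointer...
  %e0 = load float* %ptr, align 4
  %p1 = getelementptr float* %ptr, i64 1
  %e1 = load float* %p1, align 4
  %p2 = getelementptr float* %ptr, i64 2
  %e2 = load float* %p2, align 4
  %p3 = getelementptr float* %ptr, i64 3
  %e3 = load float* %p3, align 4
  %p4 = getelementptr float* %ptr, i64 4
  %e4 = load float* %p4, align 4
  %p5 = getelementptr float* %ptr, i64 5
  %e5 = load float* %p5, align 4
  %p6 = getelementptr float* %ptr, i64 6
  %e6 = load float* %p6, align 4
  %p7 = getelementptr float* %ptr, i64 7
  %e7 = load float* %p7, align 4
  ; ...assembled into a 32-byte build vector. With this patch, the whole thing
  ; should become a single 32-byte load instead of xmm pieces glued together
  ; with vinsertps/vinsertf128.
  %v0 = insertelement <8 x float> undef, float %e0, i32 0
  %v1 = insertelement <8 x float> %v0, float %e1, i32 1
  %v2 = insertelement <8 x float> %v1, float %e2, i32 2
  %v3 = insertelement <8 x float> %v2, float %e3, i32 3
  %v4 = insertelement <8 x float> %v3, float %e4, i32 4
  %v5 = insertelement <8 x float> %v4, float %e5, i32 5
  %v6 = insertelement <8 x float> %v5, float %e6, i32 6
  %v7 = insertelement <8 x float> %v6, float %e7, i32 7
  ret <8 x float> %v7
}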

The codegen change for the latter two test cases (derived from the bug report examples) is:

vmovss	16(%rdi), %xmm1
vmovups	(%rdi), %xmm0
vinsertps	$16, 20(%rdi), %xmm1, %xmm1
vinsertps	$32, 24(%rdi), %xmm1, %xmm1
vinsertps	$48, 28(%rdi), %xmm1, %xmm1
vinsertf128	$1, %xmm1, %ymm0, %ymm0
retq

To:

vmovups	(%rdi), %ymm0
retq

And:

vmovsd	16(%rdi), %xmm1
vmovupd	(%rdi), %xmm0
vmovhpd	24(%rdi), %xmm1, %xmm1
vinsertf128	$1, %xmm1, %ymm0, %ymm0
retq

To:

vmovups	(%rdi), %ymm0
retq

I think it's benign that we generate 'vmovups' in that second case rather than 'vmovupd' because we're not using the result here. I confirmed that we will use a double instruction if we actually use the load result in this function.
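A variant of the same pattern that actually consumes the merged value illustrates what "using the load result" means here (again a hypothetical sketch, not one of the committed tests):

define <4 x double> @merge_4_doubles_and_use(double* %ptr, <4 x double> %x) {
  ; Four consecutive scalar loads merged into a 32-byte vector, as above.
  %d0 = load double* %ptr, align 8
  %p1 = getelementptr double* %ptr, i64 1
  %d1 = load double* %p1, align 8
  %p2 = getelementptr double* %ptr, i64 2
  %d2 = load double* %p2, align 8
  %p3 = getelementptr double* %ptr, i64 3
  %d3 = load double* %p3, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v1 = insertelement <4 x double> %v0, double %d1, i32 1
  %v2 = insertelement <4 x double> %v1, double %d2, i32 2
  %v3 = insertelement <4 x double> %v2, double %d3, i32 3
  ; Because the merged value feeds an FP op rather than being returned
  ; directly, the load is expected to stay in the double domain (per the
  ; observation above).
  %r = fadd <4 x double> %v3, %x
  ret <4 x double> %r
}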

I've also updated the existing load merge test to use FileCheck and added a v4f32 test for completeness.

Diff Detail

Event Timeline

spatel updated this revision to Diff 16945.Dec 4 2014, 1:04 PM
spatel retitled this revision from to optimize merging of scalar loads for 32-byte vectors [X86, AVX] (PR21710).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: qcolombet, andreadb, RKSimon.
RKSimon edited edge metadata.Dec 4 2014, 4:07 PM

Can you use FileCheck prefixes so that we can demonstrate on which targets we merge loads to ymm and on which to 2 * xmm (SlowUAMem32)?

spatel updated this revision to Diff 16972.Dec 4 2014, 5:49 PM
spatel edited edge metadata.

Hi Simon -

Updated the test file to include a run for a machine with slow 32-byte memory accesses (currently that's just SandyBridge / IvyBridge), but I've used the target attributes directly to make the difference explicit.
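The RUN lines for that end up looking something like this (a sketch only; the triple, the attribute spelling, and the prefix names are assumptions rather than a copy of the committed test):

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=ALL --check-prefix=FAST32
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+slow-unaligned-mem-32 | FileCheck %s --check-prefix=ALL --check-prefix=SLOW32
;
; Checks common to both modes go under ALL; FAST32 then checks for the single
; 32-byte vmovups, while SLOW32 checks for the 2 * xmm split.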

RKSimon accepted this revision.Dec 5 2014, 11:03 AM
RKSimon edited edge metadata.

Tested with some basic internal tests - no problems encountered (and folding seems to work well too).

I also tested integer (i64, i32, i16, i8) sequential loads and they optimized as expected too - not sure if it's worth adding tests for these or not?

Come to think of it, I should have tested D6492 as well while I was doing these...

LGTM

This revision is now accepted and ready to land.Dec 5 2014, 11:03 AM
andreadb accepted this revision.Dec 5 2014, 11:55 AM
andreadb edited edge metadata.

Hi Sanjay,

The patch LGTM too.
Thanks!

spatel closed this revision.Dec 5 2014, 1:29 PM
spatel updated this revision to Diff 16996.

Closed by commit rL223518 (authored by @spatel).

spatel added a comment.Dec 5 2014, 1:33 PM
In D6536#7, @RKSimon wrote:

> Tested with some basic internal tests - no problems encountered (and folding seems to work well too).
>
> I also tested integer (i64, i32, i16, i8) sequential loads and they optimized as expected too - not sure if it's worth adding tests for these or not?

Thanks - committed at r223518.

I think we're ok without testing every type, but this does raise a potential corner case for an AVX-only machine: is it worse for performance to use a 32-byte FP store when dealing with ints? I.e., is there a domain-crossing penalty for a store of the 'wrong' type? Would we ever have a 32-byte vector of ints incoming to this code on an AVX-only machine?

In D6536#15, @spatel wrote:

> I think we're ok without testing every type, but this does raise a potential corner case for an AVX-only machine: is it worse for performance to use a 32-byte FP store when dealing with ints? I.e., is there a domain-crossing penalty for a store of the 'wrong' type? Would we ever have a 32-byte vector of ints incoming to this code on an AVX-only machine?

The get/setExecutionDomain code should deal with domain crossing of loads/stores as well as bitwise ops. If the incoming AVX1 code has gone to the trouble of wanting to load integers into 256-bit vectors, then we have to assume that it knows what it's doing - hopefully performing float-domain-only ops, although shuffles might be an issue.
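The integer shape being discussed would look something like this on an AVX1-only target (hypothetical IR for illustration only, not a test from this patch):

define <4 x i64> @merge_4_i64s(i64* %ptr) {
  ; Four consecutive scalar integer loads merged into a 32-byte vector.
  %e0 = load i64* %ptr, align 8
  %p1 = getelementptr i64* %ptr, i64 1
  %e1 = load i64* %p1, align 8
  %p2 = getelementptr i64* %ptr, i64 2
  %e2 = load i64* %p2, align 8
  %p3 = getelementptr i64* %ptr, i64 3
  %e3 = load i64* %p3, align 8
  ; An AVX1-only target has to materialize this 32-byte integer vector somehow;
  ; the open question above is whether using a float-domain 32-byte memory op
  ; for integer data carries a domain-crossing penalty.
  %v0 = insertelement <4 x i64> undef, i64 %e0, i32 0
  %v1 = insertelement <4 x i64> %v0, i64 %e1, i32 1
  %v2 = insertelement <4 x i64> %v1, i64 %e2, i32 2
  %v3 = insertelement <4 x i64> %v2, i64 %e3, i32 3
  ret <4 x i64> %v3
}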