This is an archive of the discontinued LLVM Phabricator instance.

optimize merging of scalar loads for 32-byte vectors [X86, AVX] (PR21710)
ClosedPublic

Authored by spatel on Dec 4 2014, 1:04 PM.

Details

Summary

This patch fixes the poor codegen seen in PR21710 ( http://llvm.org/bugs/show_bug.cgi?id=21710 ). Before we crack 32-byte build vectors into smaller chunks (and then glue them back together), we should look for the easy case where we can just load all of the elements in a single op.
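For reference, the shape of IR that hits this path looks roughly like the following (a hypothetical reduction in the typed-pointer IR syntax of the time; names are illustrative and not copied from the bug report or the committed test):

define <8 x float> @merge_8_floats(float* %ptr) {
  ; Eight consecutive scalar loads from the same base pointer...
  %e0 = load float* %ptr, align 4
  %p1 = getelementptr float* %ptr, i64 1
  %e1 = load float* %p1, align 4
  %p2 = getelementptr float* %ptr, i64 2
  %e2 = load float* %p2, align 4
  %p3 = getelementptr float* %ptr, i64 3
  %e3 = load float* %p3, align 4
  %p4 = getelementptr float* %ptr, i64 4
  %e4 = load float* %p4, align 4
  %p5 = getelementptr float* %ptr, i64 5
  %e5 = load float* %p5, align 4
  %p6 = getelementptr float* %ptr, i64 6
  %e6 = load float* %p6, align 4
  %p7 = getelementptr float* %ptr, i64 7
  %e7 = load float* %p7, align 4
  ; ...assembled into a 32-byte build vector. With this patch, the whole thing
  ; should become a single 32-byte load instead of xmm pieces glued together
  ; with vinsertps/vinsertf128.
  %v0 = insertelement <8 x float> undef, float %e0, i32 0
  %v1 = insertelement <8 x float> %v0, float %e1, i32 1
  %v2 = insertelement <8 x float> %v1, float %e2, i32 2
  %v3 = insertelement <8 x float> %v2, float %e3, i32 3
  %v4 = insertelement <8 x float> %v3, float %e4, i32 4
  %v5 = insertelement <8 x float> %v4, float %e5, i32 5
  %v6 = insertelement <8 x float> %v5, float %e6, i32 6
  %v7 = insertelement <8 x float> %v6, float %e7, i32 7
  ret <8 x float> %v7
}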

The codegen change for the latter two test cases (derived from the bug report examples) is:

vmovss	16(%rdi), %xmm1
vmovups	(%rdi), %xmm0
vinsertps	$16, 20(%rdi), %xmm1, %xmm1
vinsertps	$32, 24(%rdi), %xmm1, %xmm1
vinsertps	$48, 28(%rdi), %xmm1, %xmm1
vinsertf128	$1, %xmm1, %ymm0, %ymm0
retq

To:

vmovups	(%rdi), %ymm0
retq

And:

vmovsd	16(%rdi), %xmm1
vmovupd	(%rdi), %xmm0
vmovhpd	24(%rdi), %xmm1, %xmm1
vinsertf128	$1, %xmm1, %ymm0, %ymm0
retq

To:

vmovups	(%rdi), %ymm0
retq

I think it's benign that we generate 'vmovups' in that second case rather than 'vmovupd' because we're not using the result here. I confirmed that we will use a double instruction if we actually use the load result in this function.
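A variant of the same pattern that actually consumes the merged value illustrates what "using the load result" means here (again a hypothetical sketch, not one of the committed tests):

define <4 x double> @merge_4_doubles_and_use(double* %ptr, <4 x double> %x) {
  ; Four consecutive scalar loads merged into a 32-byte vector, as above.
  %d0 = load double* %ptr, align 8
  %p1 = getelementptr double* %ptr, i64 1
  %d1 = load double* %p1, align 8
  %p2 = getelementptr double* %ptr, i64 2
  %d2 = load double* %p2, align 8
  %p3 = getelementptr double* %ptr, i64 3
  %d3 = load double* %p3, align 8
  %v0 = insertelement <4 x double> undef, double %d0, i32 0
  %v1 = insertelement <4 x double> %v0, double %d1, i32 1
  %v2 = insertelement <4 x double> %v1, double %d2, i32 2
  %v3 = insertelement <4 x double> %v2, double %d3, i32 3
  ; Because the merged value feeds an FP op rather than being returned
  ; directly, the load is expected to stay in the double domain (per the
  ; observation above).
  %r = fadd <4 x double> %v3, %x
  ret <4 x double> %r
}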

I've also updated the existing load merge test to use FileCheck and added a v4f32 test for completeness.

Diff Detail

Event Timeline

spatel updated this revision to Diff 16945.Dec 4 2014, 1:04 PM
spatel retitled this revision from to optimize merging of scalar loads for 32-byte vectors [X86, AVX] (PR21710).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: qcolombet, andreadb, RKSimon.
RKSimon edited edge metadata.Dec 4 2014, 4:07 PM

Can you use FileCheck prefixes so that we can demonstrate on which targets we merge loads to ymm and on which to 2 * xmm (SlowUAMem32)?

spatel updated this revision to Diff 16972.Dec 4 2014, 5:49 PM
spatel edited edge metadata.

Hi Simon -

Updated the test file to include a run for a machine with slow 32-byte memory accesses (currently that's just SandyBridge / IvyBridge), but I've used the target attributes directly to make the difference explicit.
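The RUN lines for that end up looking something like this (a sketch only; the triple, the attribute spelling, and the prefix names are assumptions rather than a copy of the committed test):

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=ALL --check-prefix=FAST32
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+slow-unaligned-mem-32 | FileCheck %s --check-prefix=ALL --check-prefix=SLOW32
;
; Checks common to both modes go under ALL; FAST32 then checks for the single
; 32-byte vmovups, while SLOW32 checks for the 2 * xmm split.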

RKSimon accepted this revision.Dec 5 2014, 11:03 AM
RKSimon edited edge metadata.

Tested with some basic internal tests - no problems encountered (and folding seems to work well too).

I also tested integer (i64, i32, i16, i8) sequential loads and they optimized as expected too - not sure if it's worth adding tests for these or not?

Come to think of it, I should have tested D6492 as well while I was doing these...

LGTM

This revision is now accepted and ready to land.Dec 5 2014, 11:03 AM
andreadb accepted this revision.Dec 5 2014, 11:55 AM
andreadb edited edge metadata.

Hi Sanjay,

The patch LGTM too.
Thanks!

spatel closed this revision.Dec 5 2014, 1:29 PM
spatel updated this revision to Diff 16996.

Closed by commit rL223518 (authored by @spatel).

spatel added a comment.Dec 5 2014, 1:33 PM
In D6536#7, @RKSimon wrote:

> Tested with some basic internal tests - no problems encountered (and folding seems to work well too).
>
> I also tested integer (i64, i32, i16, i8) sequential loads and they optimized as expected too - not sure if it's worth adding tests for these or not?

Thanks - committed at r223518.

I think we're ok without testing every type, but this does raise a potential corner case for an AVX-only machine: is it worse for performance to use a 32-byte FP store when dealing with ints? I.e., is there a domain-crossing penalty for a store of the 'wrong' type? Would we ever have a 32-byte vector of ints incoming to this code on an AVX-only machine?

In D6536#15, @spatel wrote:

> I think we're ok without testing every type, but this does raise a potential corner case for an AVX-only machine: is it worse for performance to use a 32-byte FP store when dealing with ints? I.e., is there a domain-crossing penalty for a store of the 'wrong' type? Would we ever have a 32-byte vector of ints incoming to this code on an AVX-only machine?

The get/setExecutionDomain code should deal with domain crossing of loads/stores as well as bitwise ops. If the incoming AVX1 code has gone to the trouble of wanting to load integers into 256-bit vectors, then we have to assume that it knows what it's doing - hopefully performing float-domain-only ops, although shuffles might be an issue.
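The integer shape being discussed would look something like this on an AVX1-only target (hypothetical IR for illustration only, not a test from this patch):

define <4 x i64> @merge_4_i64s(i64* %ptr) {
  ; Four consecutive scalar integer loads merged into a 32-byte vector.
  %e0 = load i64* %ptr, align 8
  %p1 = getelementptr i64* %ptr, i64 1
  %e1 = load i64* %p1, align 8
  %p2 = getelementptr i64* %ptr, i64 2
  %e2 = load i64* %p2, align 8
  %p3 = getelementptr i64* %ptr, i64 3
  %e3 = load i64* %p3, align 8
  ; An AVX1-only target has to materialize this 32-byte integer vector somehow;
  ; the open question above is whether using a float-domain 32-byte memory op
  ; for integer data carries a domain-crossing penalty.
  %v0 = insertelement <4 x i64> undef, i64 %e0, i32 0
  %v1 = insertelement <4 x i64> %v0, i64 %e1, i32 1
  %v2 = insertelement <4 x i64> %v1, i64 %e2, i32 2
  %v3 = insertelement <4 x i64> %v2, i64 %e3, i32 3
  ret <4 x i64> %v3
}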