This is an archive of the discontinued LLVM Phabricator instance.

[X86] Make EltsFromConsecutiveLoads code path work only on vectors it expects
Closed, Public

Authored by mkuper on Dec 9 2014, 1:21 AM.


Details

Reviewers

Commits

rG0104ff65294a: [X86] Make a code path in EltsFromConsecutiveLoads work only on vectors it…
rL223922: [X86] Make a code path in EltsFromConsecutiveLoads work only on vectors it…

Summary

EltsFromConsecutiveLoads was apparently only ever called for 128-bit vectors and implicitly assumed that width.
r223518 started calling it for 256-bit (AVX-sized) vectors, which made the code path that relied on this assumption crash.

This patch makes the assumption explicit by adding a check.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 17074. Dec 9 2014, 1:21 AM

mkuper retitled this revision to [X86] Make EltsFromConsecutiveLoads code path work only on vectors it expects.

mkuper updated this object.

mkuper edited the test plan for this revision.

mkuper added a reviewer: spatel.

mkuper added a subscriber: Unknown Object (MLST).

Hi Michael,
Thanks for notifying me about this bug. While your patch will prevent the crash, it will also prevent vector loads for i8 / i16, right? Is it possible to fix the condition to avoid the crash while still generating a vector load for all data types?

For reference - while stepping through this path, I noticed another bug:
http://llvm.org/bugs/show_bug.cgi?id=21790

For your testcase with i64, it looks like we'll end up with the optimal codegen, but only because we'll crack the 256-bit vector in half and then the two i64 loads will be treated like a full vector load the next time we call EltsFromConsecutiveLoads().
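The splitting behavior described above can be sketched with a toy model. This is not LLVM code: the bitmask representation and the helper names (lowHalf, highHalf, isFullVectorLoad) are assumptions made purely for illustration. Bit i of the mask is set when element i is filled by a load from consecutive memory.

```cpp
// Toy model (hypothetical helpers, not part of LLVM): bit i of Mask is set
// when vector element i is filled by a load from Base + i.

// Elements that land in the low half after splitting the vector in two.
constexpr unsigned lowHalf(unsigned Mask, unsigned NumElems) {
  return Mask & ((1u << (NumElems / 2)) - 1u);
}

// Elements that land in the high half after splitting the vector in two.
constexpr unsigned highHalf(unsigned Mask, unsigned NumElems) {
  return Mask >> (NumElems / 2);
}

// True when every element of a NumElems-wide vector comes from a load,
// so the whole vector can be a single consecutive load.
constexpr bool isFullVectorLoad(unsigned Mask, unsigned NumElems) {
  return Mask == (1u << NumElems) - 1u;
}

// <4 x i64> with only elements 0 and 1 loaded: not a full 256-bit load.
static_assert(!isFullVectorLoad(0x3u, 4), "partial 256-bit vector");
// After splitting, the low <2 x i64> half is fully covered by the two loads.
static_assert(isFullVectorLoad(lowHalf(0x3u, 4), 2), "low half is a full load");
// The high half carries no loads at all.
static_assert(!isFullVectorLoad(highHalf(0x3u, 4), 2), "high half is not loaded");
```

In this model the 256-bit case only ends up as one vector load because the second visit sees a 128-bit vector whose elements are all loaded.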

Can you add your testcase to test/CodeGen/X86/vec_loadsingles.ll (name change is probably in order at this point), so we have all of the related testcases in one file? See inline comments for the testcase itself.

test/CodeGen/X86/elts-from-loads-256.ll
Line 1 (on Diff #17074): Use FileCheck for expected output.
Lines 7–9 (on Diff #17074): Remove unnecessary test elements: addrspace, attributes.
Line 23 (on Diff #17074): Add newline.

Thanks for the review, Sanjay.

You're right about the test case: it was bugpoint-reduced and I didn't clean it up enough. Thanks for pointing it out.
I will fix it and put it in the right place.

Regarding vector loads of i8 / i16 - I believe that didn't work to begin with.
The existing code fires only when NumElems == 4, and creates a ResNode of type v2i64. So for the bitcast in the end to be valid, the element size must be 32 bit. Or did I miss something?

Regarding the PR - you're right, it looks like unneeded domain crossing. I'd rather resolve the crash first, though.
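The bitcast argument above comes down to bit-width arithmetic, which the following standalone sketch tries to make concrete. The function names are hypothetical and the isTypeLegal(MVT::v2i64) part of the real condition is left out; only the NumElems, LastLoadedElt, and element-size terms from the diff are modeled.

```cpp
// Standalone sketch with hypothetical names; it models only the arithmetic,
// not the SelectionDAG code itself.

// The path builds a v2i64 node, which is 2 * 64 = 128 bits wide.
constexpr unsigned kV2I64Bits = 2 * 64;

// A bitcast from v2i64 back to the requested type needs equal total width.
constexpr bool bitcastFromV2I64IsValid(unsigned NumElems, unsigned EltBits) {
  return NumElems * EltBits == kV2I64Bits;
}

// Condition before the patch (legality check omitted).
constexpr bool oldGuard(unsigned NumElems, unsigned LastLoadedElt) {
  return NumElems == 4 && LastLoadedElt == 1;
}

// Condition after the patch (legality check omitted).
constexpr bool newGuard(unsigned NumElems, unsigned LastLoadedElt,
                        unsigned EltBits) {
  return NumElems == 4 && LastLoadedElt == 1 && EltBits == 32;
}

// <4 x i64>: the old condition fires although the bitcast would be invalid.
static_assert(oldGuard(4, 1) && !bitcastFromV2I64IsValid(4, 64),
              "old condition accepts a 256-bit vector");
// The new condition rejects that case and still accepts v4i32 / v4f32.
static_assert(!newGuard(4, 1, 64), "256-bit case is rejected");
static_assert(newGuard(4, 1, 32) && bitcastFromV2I64IsValid(4, 32),
              "128-bit case with 32-bit elements is still handled");
```

With NumElems fixed at 4, a 32-bit element size is the only one that yields a 128-bit result, which matches the reasoning in the comment.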

Test-case updated.

In D6579#5, @mkuper wrote:

Regarding vector loads of i8 / i16 - I believe that didn't work to begin with.
The existing code fires only when NumElems == 4, and creates a ResNode of type v2i64. So for the bitcast in the end to be valid, the element size must be 32 bit. Or did I miss something?

Ah, ok. Please add a 'TODO' comment there that we should handle other types and/or file a PR. I don't think there's any reason to limit the optimization to just 32-bit elements.
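One way the requested generalization might look is sketched below. This is purely hypothetical and is not what the patch or any later LLVM change does: it assumes the path would stay valid whenever the result is 128 bits wide and the loaded prefix covers exactly the low 64 bits. The name generalizedGuard is made up for this sketch.

```cpp
// Hypothetical generalization, not taken from LLVM: accept any element size
// as long as the vector is 128 bits wide and the loaded prefix is 64 bits.
constexpr bool generalizedGuard(unsigned NumElems, unsigned LastLoadedElt,
                                unsigned EltBits) {
  return NumElems * EltBits == 128 && (LastLoadedElt + 1) * EltBits == 64;
}

// The case handled today: low two elements of v4i32 / v4f32.
static_assert(generalizedGuard(4, 1, 32), "v4i32 with 2 elements loaded");
// Cases the TODO asks about: low half of v8i16 and of v16i8.
static_assert(generalizedGuard(8, 3, 16), "v8i16 with 4 elements loaded");
static_assert(generalizedGuard(16, 7, 8), "v16i8 with 8 elements loaded");
// The crashing case from this review stays rejected.
static_assert(!generalizedGuard(4, 1, 64), "v4i64 is 256 bits wide");
```

Whether the rest of the lowering would be correct for these extra types is not something this sketch checks.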

This revision is now accepted and ready to land. Dec 9 2014, 1:34 PM

Closed by commit rL223922 (authored by @mkuper).

Revision Contents

Path                                              Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp     5 lines
llvm/trunk/test/CodeGen/X86/vec_loadsingles.ll    19 lines

Diff 17115

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[Preceding lines not shown; enclosing context: if (LDBase->hasAnyUseOfValue(1)) {]

                                      SDValue(NewLd.getNode(), 1));
       DAG.ReplaceAllUsesOfValueWith(SDValue(LDBase, 1), NewChain);
       DAG.UpdateNodeOperands(NewChain.getNode(), SDValue(LDBase, 1),
                              SDValue(NewLd.getNode(), 1));
     }

     return NewLd;
   }
-  if (NumElems == 4 && LastLoadedElt == 1 &&
+  //TODO: The code below fires only for loading the low v2i32 / v2f32
+  //of a v4i32 / v4f32. It's probably worth generalizing.
+  if (NumElems == 4 && LastLoadedElt == 1 && (EltVT.getSizeInBits() == 32) &&
       DAG.getTargetLoweringInfo().isTypeLegal(MVT::v2i64)) {
     SDVTList Tys = DAG.getVTList(MVT::v2i64, MVT::Other);
     SDValue Ops[] = { LDBase->getChain(), LDBase->getBasePtr() };
     SDValue ResNode =
         DAG.getMemIntrinsicNode(X86ISD::VZEXT_LOAD, DL, Tys, Ops, MVT::i64,
                                 LDBase->getPointerInfo(),
                                 LDBase->getAlignment(),
                                 false/*isVolatile*/, true/*ReadMem*/,

[Remaining lines not shown]

llvm/trunk/test/CodeGen/X86/vec_loadsingles.ll

 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,-slow-unaligned-mem-32 | FileCheck %s --check-prefix=ALL --check-prefix=FAST32
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+slow-unaligned-mem-32 | FileCheck %s --check-prefix=ALL --check-prefix=SLOW32

 define <4 x float> @merge_2_floats(float* nocapture %p) nounwind readonly {
   %tmp1 = load float* %p
   %vecins = insertelement <4 x float> undef, float %tmp1, i32 0
   %add.ptr = getelementptr float* %p, i32 1
   %tmp5 = load float* %add.ptr
   %vecins7 = insertelement <4 x float> %vecins, float %tmp5, i32 1
   ret <4 x float> %vecins7

 ; ALL-LABEL: merge_2_floats
 ; ALL: vmovq
 ; ALL-NEXT: retq
 }

+; Test-case generated due to a crash when trying to treat loading the first
+; two i64s of a <4 x i64> as a load of two i32s.
+define <4 x i64> @merge_2_floats_into_4() {
+  %1 = load i64** undef, align 8
+  %2 = getelementptr inbounds i64* %1, i64 0
+  %3 = load i64* %2
+  %4 = insertelement <4 x i64> undef, i64 %3, i32 0
+  %5 = load i64** undef, align 8
+  %6 = getelementptr inbounds i64* %5, i64 1
+  %7 = load i64* %6
+  %8 = insertelement <4 x i64> %4, i64 %7, i32 1
+  %9 = shufflevector <4 x i64> %8, <4 x i64> undef, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
+  ret <4 x i64> %9
+
+; ALL-LABEL: merge_2_floats_into_4
+; ALL: vmovups
+; ALL-NEXT: retq
+}
+
 define <4 x float> @merge_4_floats(float* %ptr) {
   %a = load float* %ptr, align 8
   %vec = insertelement <4 x float> undef, float %a, i32 0
   %idx1 = getelementptr inbounds float* %ptr, i64 1
   %b = load float* %idx1, align 8
   %vec2 = insertelement <4 x float> %vec, float %b, i32 1
   %idx3 = getelementptr inbounds float* %ptr, i64 2
   %c = load float* %idx3, align 8

[Remaining lines not shown]