This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add general 32-bit LOAD + VZEXT_MOVL support to EltsFromConsecutiveLoads
ClosedPublic

Authored by RKSimon on Jan 29 2016, 10:33 AM.


Details

Reviewers

spatel
qcolombet
delena
mkuper

Commits

rG6788f33cf2d8: [X86][SSE] Add general 32-bit LOAD + VZEXT_MOVL support to…
rL259796: [X86][SSE] Add general 32-bit LOAD + VZEXT_MOVL support to…

Summary

This patch adds support for consecutive (load/undef elements) 32-bit loads, followed by trailing undef/zero elements to be combined to a single MOVSS load.

Follow up to D16217

Note: I've been looking into correcting the domain for both the MOVSS/MOVD and the MOVSD/MOVQ loads/stores, but I am concerned about the number of test changes. Is this something that people think is worthwhile? I'd probably have to change many of the tests to ensure that they keep to the intended domain.
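
As a rough illustration of the combine described in the summary, the standalone C++ sketch below classifies a list of build-vector elements (loads, undefs, zeros) using conditions modelled on the ones visible in the X86ISelLowering.cpp diff. It uses no LLVM API: `Elt`, `MergeKind`, and `classifyMerge` are hypothetical names invented for this sketch, and the legality checks (`TLI.isTypeLegal`, `isAfterLegalize`) are deliberately left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one build-vector element; not an LLVM type.
enum class EltKind { Load, Undef, Zero };

struct Elt {
  EltKind Kind;
  int ByteOffset; // offset from the base pointer; only meaningful for Load
};

enum class MergeKind { None, FullLoad, ZExtLoad64, ZExtMovL32 };

MergeKind classifyMerge(const std::vector<Elt> &Elts, int EltBits) {
  assert(EltBits > 0 && EltBits % 8 == 0 && "element size must be whole bytes");
  const int NumElems = static_cast<int>(Elts.size());
  const int EltBytes = EltBits / 8;

  // Find the first and last loaded elements and note any zero elements.
  int First = -1, Last = -1;
  bool AnyZero = false;
  for (int I = 0; I < NumElems; ++I) {
    if (Elts[I].Kind == EltKind::Load) {
      if (First < 0)
        First = I;
      Last = I;
    } else if (Elts[I].Kind == EltKind::Zero) {
      AnyZero = true;
    }
  }
  // Every case modelled here requires the loaded run to start at element 0.
  if (First != 0)
    return MergeKind::None;

  // Between the first and last load, each load must sit at its expected
  // offset and no zero element may interrupt the run (undefs are allowed).
  const int Base = Elts[0].ByteOffset;
  for (int I = First; I <= Last; ++I) {
    if (Elts[I].Kind == EltKind::Load) {
      if (Elts[I].ByteOffset != Base + I * EltBytes)
        return MergeKind::None;
    } else if (Elts[I].Kind == EltKind::Zero) {
      return MergeKind::None;
    }
  }

  // LOAD: the run covers the whole vector and nothing is zeroed.
  if (Last == NumElems - 1 && !AnyZero)
    return MergeKind::FullLoad;

  // Otherwise the trailing elements are undef/zero; pick by loaded width.
  const int LoadBits = (1 + Last - First) * EltBits;
  const int VecBits = NumElems * EltBits;
  const bool IsSimdWidth = VecBits == 128 || VecBits == 256 || VecBits == 512;
  if (IsSimdWidth && LoadBits == 64)
    return MergeKind::ZExtLoad64;
  if (IsSimdWidth && LoadBits == 32)
    return MergeKind::ZExtMovL32; // the case this patch adds
  return MergeKind::None;
}
```

For example, the `merge_8i16_i16_34uuuuuu` test loads two i16 values at byte offsets 6 and 8 into an otherwise undef <8 x i16>, which this model classifies as `ZExtMovL32`.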

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 46397. Jan 29 2016, 10:33 AM

RKSimon retitled this revision from to [X86][SSE] Add general 32-bit LOAD + VZEXT_MOVL support to EltsFromConsecutiveLoads.

RKSimon updated this object.

RKSimon added reviewers: qcolombet, spatel, mkuper.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

RKSimon mentioned this in D16768: [X86][AVX] Add support for 64-bit VZEXT_LOAD of 256-bit vectors to EltsFromConsecutiveLoads. Feb 1 2016, 6:28 AM

Add 512-bit vector support

delena added inline comments. Feb 2 2016, 12:10 PM

test/CodeGen/X86/merge-consecutive-loads-128.ll
335	I think that on these architectures we pay an additional cycle for switching from INT to FP. Can we use movd?
test/CodeGen/X86/merge-consecutive-loads-256.ll
515	This instruction (movss) reads 4 bytes from memory. Does it require 4-byte alignment?

RKSimon added inline comments. Feb 2 2016, 3:07 PM

test/CodeGen/X86/merge-consecutive-loads-128.ll
335	This was something I mentioned in the summary - adding domain support for MOVSS/MOVD is straightforward but has a knock-on effect on a lot of tests: some tests would need modifying to keep to the original domain, while others we'd let switch. If you think it's worthwhile, I'll start looking at this more seriously.
test/CodeGen/X86/merge-consecutive-loads-256.ll
515	Not unless SSE/AVX alignment checks are enabled - AFAICT llvm assumes they aren't. We are using the alignment of the base pointer, so lowering of the consecutive load is being driven from that.
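
To make the alignment point concrete, here is a minimal portable C++ sketch of what the `xmm0 = mem[0],zero,zero,zero` annotation in the CHECK lines denotes: 32 bits are read from memory into lane 0 and the remaining lanes are zeroed. It is only a model of the value semantics, not of the hardware instruction or of any alignment-check mode; `loadLow32ZeroRest` is a hypothetical helper name, and `std::memcpy` is used because it makes no alignment assumption about the source pointer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: models "mem[0],zero,zero,zero" on a 4 x i32 vector.
std::array<std::uint32_t, 4> loadLow32ZeroRest(const unsigned char *Ptr) {
  assert(Ptr != nullptr && "need a readable 4-byte source");
  std::array<std::uint32_t, 4> Lanes{}; // value-initialised: every lane is zero
  // Copy exactly 4 bytes into lane 0; Ptr may have any alignment.
  std::memcpy(&Lanes[0], Ptr, sizeof(std::uint32_t));
  return Lanes;
}
```

Calling it with a pointer such as `Buf + 1` (an odd address) is well defined in this model, which mirrors the reply above that the 4-byte read does not by itself demand 4-byte alignment.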

RKSimon mentioned this in rL259635: [X86][AVX] Add support for 64-bit VZEXT_LOAD of 256/512-bit vectors to…. Feb 3 2016, 1:46 AM

Converted loads to the integer domain - as Elena said, it's the more sensible option for consecutive 32-bit loads.

LGTM

lib/Target/X86/X86ISelLowering.cpp
5652	What happens here in 32-bit mode, where i64 is illegal?
5675	I think that all these checks are not necessary. (VT.getSizeInBits() >= 128) should be enough.
5679	What happens if you don't put hardcoded MVT::f32, but choose between f32 and i32 according to VT ?
lib/Target/X86/X86InstrAVX512.td
3054	Why is v8i64 not handled in these patterns?

This revision is now accepted and ready to land. Feb 3 2016, 11:47 PM

Closed by commit rL259796: [X86][SSE] Add general 32-bit LOAD + VZEXT_MOVL support to… (authored by RKSimon). Feb 4 2016, 8:17 AM

This revision was automatically updated to reflect the committed changes.

Thanks Elena - I've addressed your comments in a couple of additional commits:

Added 32-bit target tests to make sure that the 64-bit integer loads still happen (this is the point of the VZEXT_LOAD op anyway).

Added float/integer domain handling and removed the over-zealous IsTypeLegal tests.

Note: the v8i64 VZEXT_LOAD patterns are already handled in another part of X86InstrAVX512.td
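
The float/integer domain handling mentioned above can be pictured with a small standalone C++ sketch: the scalar type of the 32-bit load follows the vector's element domain, so an integer vector gets an i32 load (MOVD) and a floating-point vector gets an f32 load (MOVSS). This is an assumption-laden model, not the committed code: `VecTy`, `ScalarTy`, `LoadChoice`, and `chooseScalarLoad` are hypothetical names, and the width check simply echoes the review suggestion that a size of at least 128 bits is the relevant condition.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for a vector type and a scalar load type.
struct VecTy {
  bool IsFloatingPoint;
  int NumElts;
  int EltBits;
};

enum class ScalarTy { I32, F32 };

struct LoadChoice {
  ScalarTy Ty;
  std::string Mnemonic; // SSE mnemonic associated with that domain
};

LoadChoice chooseScalarLoad(const VecTy &VT) {
  // Assumed precondition for this sketch: a full SSE/AVX-sized vector.
  assert(VT.NumElts * VT.EltBits >= 128 && "expected at least a 128-bit vector");
  if (VT.IsFloatingPoint)
    return {ScalarTy::F32, "movss"}; // stay in the floating-point domain
  return {ScalarTy::I32, "movd"};    // stay in the integer domain
}
```

Under this model, an <8 x i16> result selects `movd`, matching the updated CHECK lines in the test diffs, while a <4 x float> result would select `movss`.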

RKSimon mentioned this in rL259991: [X86][SSE] Don't replace an existing 32-bit load with its duplicate. Feb 6 2016, 7:41 AM

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp (revision 259654)	49 lines
lib/Target/X86/X86InstrAVX512.td (revision 259654)	12 lines
test/CodeGen/X86/merge-consecutive-loads-128.ll (revision 259654)	35 lines
test/CodeGen/X86/merge-consecutive-loads-256.ll (revision 259654)	61 lines
test/CodeGen/X86/merge-consecutive-loads-512.ll (revision 259654)	30 lines

Diff 46790

lib/Target/X86/X86ISelLowering.cpp


break;		break;
}		}
} else if (ZeroMask[i]) {		} else if (ZeroMask[i]) {
IsConsecutiveLoad = false;		IsConsecutiveLoad = false;
break;		break;
}		}
}		}

		auto CreateLoad = [&DAG, &DL](EVT VT, LoadSDNode *LDBase) {
		SDValue NewLd = DAG.getLoad(VT, DL, LDBase->getChain(),
		LDBase->getBasePtr(), LDBase->getPointerInfo(),
		LDBase->isVolatile(), LDBase->isNonTemporal(),
		LDBase->isInvariant(), LDBase->getAlignment());

		if (LDBase->hasAnyUseOfValue(1)) {
		SDValue NewChain =
		DAG.getNode(ISD::TokenFactor, DL, MVT::Other, SDValue(LDBase, 1),
		SDValue(NewLd.getNode(), 1));
		DAG.ReplaceAllUsesOfValueWith(SDValue(LDBase, 1), NewChain);
		DAG.UpdateNodeOperands(NewChain.getNode(), SDValue(LDBase, 1),
		SDValue(NewLd.getNode(), 1));
		}

		return NewLd;
		};

// LOAD - all consecutive load/undefs (must start/end with a load).		// LOAD - all consecutive load/undefs (must start/end with a load).
// If we have found an entire vector of loads and undefs, then return a large		// If we have found an entire vector of loads and undefs, then return a large
// load of the entire vector width starting at the base pointer.		// load of the entire vector width starting at the base pointer.
if (IsConsecutiveLoad && FirstLoadedElt == 0 &&		if (IsConsecutiveLoad && FirstLoadedElt == 0 &&
LastLoadedElt == (int)(NumElems - 1) && ZeroMask.none()) {		LastLoadedElt == (int)(NumElems - 1) && ZeroMask.none()) {
assert(LDBase && "Did not find base load for merging consecutive loads");		assert(LDBase && "Did not find base load for merging consecutive loads");
EVT EltVT = LDBase->getValueType(0);		EVT EltVT = LDBase->getValueType(0);
// Ensure that the input vector size for the merged loads matches the		// Ensure that the input vector size for the merged loads matches the
// cumulative size of the input elements.		// cumulative size of the input elements.
if (VT.getSizeInBits() != EltVT.getSizeInBits() * NumElems)		if (VT.getSizeInBits() != EltVT.getSizeInBits() * NumElems)
return SDValue();		return SDValue();

if (isAfterLegalize && !TLI.isOperationLegal(ISD::LOAD, VT))		if (isAfterLegalize && !TLI.isOperationLegal(ISD::LOAD, VT))
return SDValue();		return SDValue();

SDValue NewLd = SDValue();		return CreateLoad(VT, LDBase);

NewLd = DAG.getLoad(VT, DL, LDBase->getChain(), LDBase->getBasePtr(),
LDBase->getPointerInfo(), LDBase->isVolatile(),
LDBase->isNonTemporal(), LDBase->isInvariant(),
LDBase->getAlignment());

if (LDBase->hasAnyUseOfValue(1)) {
SDValue NewChain =
DAG.getNode(ISD::TokenFactor, DL, MVT::Other, SDValue(LDBase, 1),
SDValue(NewLd.getNode(), 1));
DAG.ReplaceAllUsesOfValueWith(SDValue(LDBase, 1), NewChain);
DAG.UpdateNodeOperands(NewChain.getNode(), SDValue(LDBase, 1),
SDValue(NewLd.getNode(), 1));
}

return NewLd;
}		}

int LoadSize =		int LoadSize =
(1 + LastLoadedElt - FirstLoadedElt) * LDBaseVT.getStoreSizeInBits();		(1 + LastLoadedElt - FirstLoadedElt) * LDBaseVT.getStoreSizeInBits();

// VZEXT_LOAD - consecutive load/undefs followed by zeros/undefs.		// VZEXT_LOAD - consecutive load/undefs followed by zeros/undefs.
if (IsConsecutiveLoad && FirstLoadedElt == 0 && LoadSize == 64 &&		if (IsConsecutiveLoad && FirstLoadedElt == 0 && LoadSize == 64 &&
((VT.is128BitVector() && TLI.isTypeLegal(MVT::v2i64)) ||		((VT.is128BitVector() && TLI.isTypeLegal(MVT::v2i64)) ||
(VT.is256BitVector() && TLI.isTypeLegal(MVT::v4i64)) ||		(VT.is256BitVector() && TLI.isTypeLegal(MVT::v4i64)) ||
(VT.is512BitVector() && TLI.isTypeLegal(MVT::v8i64)))) {		(VT.is512BitVector() && TLI.isTypeLegal(MVT::v8i64)))) {
MVT VecVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);		MVT VecVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);
SDVTList Tys = DAG.getVTList(VecVT, MVT::Other);		SDVTList Tys = DAG.getVTList(VecVT, MVT::Other);
SDValue Ops[] = { LDBase->getChain(), LDBase->getBasePtr() };		SDValue Ops[] = { LDBase->getChain(), LDBase->getBasePtr() };
SDValue ResNode =		SDValue ResNode =
DAG.getMemIntrinsicNode(X86ISD::VZEXT_LOAD, DL, Tys, Ops, MVT::i64,		DAG.getMemIntrinsicNode(X86ISD::VZEXT_LOAD, DL, Tys, Ops, MVT::i64,
LDBase->getPointerInfo(),		LDBase->getPointerInfo(),
LDBase->getAlignment(),		LDBase->getAlignment(),
false/*isVolatile*/, true/*ReadMem*/,		false/*isVolatile*/, true/*ReadMem*/,
false/*WriteMem*/);		false/*WriteMem*/);

// Make sure the newly-created LOAD is in the same position as LDBase in		// Make sure the newly-created LOAD is in the same position as LDBase in
// terms of dependency. We create a TokenFactor for LDBase and ResNode, and		// terms of dependency. We create a TokenFactor for LDBase and ResNode, and
// update uses of LDBase's output chain to use the TokenFactor.		// update uses of LDBase's output chain to use the TokenFactor.
if (LDBase->hasAnyUseOfValue(1)) {		if (LDBase->hasAnyUseOfValue(1)) {
SDValue NewChain =		SDValue NewChain =
DAG.getNode(ISD::TokenFactor, DL, MVT::Other, SDValue(LDBase, 1),		DAG.getNode(ISD::TokenFactor, DL, MVT::Other, SDValue(LDBase, 1),
SDValue(ResNode.getNode(), 1));		SDValue(ResNode.getNode(), 1));
DAG.ReplaceAllUsesOfValueWith(SDValue(LDBase, 1), NewChain);		DAG.ReplaceAllUsesOfValueWith(SDValue(LDBase, 1), NewChain);
DAG.UpdateNodeOperands(NewChain.getNode(), SDValue(LDBase, 1),		DAG.UpdateNodeOperands(NewChain.getNode(), SDValue(LDBase, 1),
SDValue(ResNode.getNode(), 1));		SDValue(ResNode.getNode(), 1));
}		}

return DAG.getBitcast(VT, ResNode);		return DAG.getBitcast(VT, ResNode);
}		}

		// VZEXT_MOVL - consecutive 32-bit load/undefs followed by zeros/undefs.
		if (IsConsecutiveLoad && FirstLoadedElt == 0 && LoadSize == 32 &&
		((VT.is128BitVector() && TLI.isTypeLegal(MVT::v4i32)) ||
		(VT.is256BitVector() && TLI.isTypeLegal(MVT::v8i32)) ||
		(VT.is512BitVector() && TLI.isTypeLegal(MVT::v16i32)))) {
		MVT VecVT = MVT::getVectorVT(MVT::i32, VT.getSizeInBits() / 32);
		SDValue V = CreateLoad(MVT::i32, LDBase);
		V = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, V);
		V = DAG.getNode(X86ISD::VZEXT_MOVL, DL, VecVT, V);
		return DAG.getBitcast(VT, V);
		}

return SDValue();		return SDValue();
}		}

/// LowerVectorBroadcast - Attempt to use the vbroadcast instruction		/// LowerVectorBroadcast - Attempt to use the vbroadcast instruction
/// to generate a splat value for the following cases:		/// to generate a splat value for the following cases:
/// 1. A splat BUILD_VECTOR which uses a single scalar load, or a constant.		/// 1. A splat BUILD_VECTOR which uses a single scalar load, or a constant.
/// 2. A splat shuffle which uses a scalar_to_vector node which comes from		/// 2. A splat shuffle which uses a scalar_to_vector node which comes from
/// a scalar load, or a constant.		/// a scalar load, or a constant.

lib/Target/X86/X86InstrAVX512.td


(v4i32 (scalar_to_vector (loadi32 addr:$src))), (iPTR 0)))),		(v4i32 (scalar_to_vector (loadi32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;		(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,		def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
(v4f32 (scalar_to_vector (loadf32 addr:$src))), (iPTR 0)))),		(v4f32 (scalar_to_vector (loadf32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;		(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,		def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,
(v2f64 (scalar_to_vector (loadf64 addr:$src))), (iPTR 0)))),		(v2f64 (scalar_to_vector (loadf64 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;		(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;

		// Represent the same patterns above but in the form they appear for
		// 512-bit types
		def : Pat<(v16i32 (X86vzmovl (insert_subvector undef,
		(v4i32 (scalar_to_vector (loadi32 addr:$src))), (iPTR 0)))),
		(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
		def : Pat<(v16f32 (X86vzmovl (insert_subvector undef,
		(v4f32 (scalar_to_vector (loadf32 addr:$src))), (iPTR 0)))),
		(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
		def : Pat<(v8f64 (X86vzmovl (insert_subvector undef,
		(v2f64 (scalar_to_vector (loadf64 addr:$src))), (iPTR 0)))),
		(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;
}		}
def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,		def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
(v4f32 (scalar_to_vector FR32X:$src)), (iPTR 0)))),		(v4f32 (scalar_to_vector FR32X:$src)), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (v4f32 (VMOVSSZrr (v4f32 (V_SET0)),		(SUBREG_TO_REG (i32 0), (v4f32 (VMOVSSZrr (v4f32 (V_SET0)),
FR32X:$src)), sub_xmm)>;		FR32X:$src)), sub_xmm)>;
def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,		def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,
(v2f64 (scalar_to_vector FR64X:$src)), (iPTR 0)))),		(v2f64 (scalar_to_vector FR64X:$src)), (iPTR 0)))),
(SUBREG_TO_REG (i64 0), (v2f64 (VMOVSDZrr (v2f64 (V_SET0)),		(SUBREG_TO_REG (i64 0), (v2f64 (VMOVSDZrr (v2f64 (V_SET0)),

test/CodeGen/X86/merge-consecutive-loads-128.ll

%res5 = insertelement <8 x i16> %res4, i16 %val5, i32 5		%res5 = insertelement <8 x i16> %res4, i16 %val5, i32 5
%res7 = insertelement <8 x i16> %res5, i16 %val7, i32 7		%res7 = insertelement <8 x i16> %res5, i16 %val7, i32 7
ret <8 x i16> %res7		ret <8 x i16> %res7
}		}

define <8 x i16> @merge_8i16_i16_34uuuuuu(i16* %ptr) nounwind uwtable noinline ssp {		define <8 x i16> @merge_8i16_i16_34uuuuuu(i16* %ptr) nounwind uwtable noinline ssp {
; SSE-LABEL: merge_8i16_i16_34uuuuuu:		; SSE-LABEL: merge_8i16_i16_34uuuuuu:
; SSE: # BB#0:		; SSE: # BB#0:
; SSE-NEXT: pinsrw $0, 6(%rdi), %xmm0		; SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE-NEXT: pinsrw $1, 8(%rdi), %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: merge_8i16_i16_34uuuuuu:		; AVX-LABEL: merge_8i16_i16_34uuuuuu:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vpinsrw $0, 6(%rdi), %xmm0, %xmm0		; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vpinsrw $1, 8(%rdi), %xmm0, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 3		%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 3
%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 4		%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 4
%val0 = load i16, i16* %ptr0		%val0 = load i16, i16* %ptr0
%val1 = load i16, i16* %ptr1		%val1 = load i16, i16* %ptr1
%res0 = insertelement <8 x i16> undef, i16 %val0, i32 0		%res0 = insertelement <8 x i16> undef, i16 %val0, i32 0
%res1 = insertelement <8 x i16> %res0, i16 %val1, i32 1		%res1 = insertelement <8 x i16> %res0, i16 %val1, i32 1
ret <8 x i16> %res1		ret <8 x i16> %res1
[... 76 lines not shown ...]
%resB = insertelement <16 x i8> %resA, i8 %valB, i32 11		%resB = insertelement <16 x i8> %resA, i8 %valB, i32 11
%resC = insertelement <16 x i8> %resB, i8 %valC, i32 12		%resC = insertelement <16 x i8> %resB, i8 %valC, i32 12
%resD = insertelement <16 x i8> %resC, i8 %valD, i32 13		%resD = insertelement <16 x i8> %resC, i8 %valD, i32 13
%resF = insertelement <16 x i8> %resD, i8 %valF, i32 15		%resF = insertelement <16 x i8> %resD, i8 %valF, i32 15
ret <16 x i8> %resF		ret <16 x i8> %resF
}		}

define <16 x i8> @merge_16i8_i8_01u3uuzzuuuuuzzz(i8* %ptr) nounwind uwtable noinline ssp {		define <16 x i8> @merge_16i8_i8_01u3uuzzuuuuuzzz(i8* %ptr) nounwind uwtable noinline ssp {
; SSE2-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:		; SSE-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:
; SSE2: # BB#0:		; SSE: # BB#0:
; SSE2-NEXT: movzbl (%rdi), %eax		; SSE-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE2-NEXT: movzbl 1(%rdi), %ecx		; SSE-NEXT: retq
; SSE2-NEXT: shll $8, %ecx
; SSE2-NEXT: orl %eax, %ecx
; SSE2-NEXT: pxor %xmm0, %xmm0
; SSE2-NEXT: pinsrw $0, %ecx, %xmm0
; SSE2-NEXT: movzbl 3(%rdi), %eax
; SSE2-NEXT: shll $8, %eax
; SSE2-NEXT: pinsrw $1, %eax, %xmm0
; SSE2-NEXT: retq
;
; SSE41-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:
; SSE41: # BB#0:
; SSE41-NEXT: pxor %xmm0, %xmm0
; SSE41-NEXT: pinsrb $0, (%rdi), %xmm0
; SSE41-NEXT: pinsrb $1, 1(%rdi), %xmm0
; SSE41-NEXT: pinsrb $3, 3(%rdi), %xmm0
; SSE41-NEXT: retq
;		;
; AVX-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:		; AVX-LABEL: merge_16i8_i8_01u3uuzzuuuuuzzz:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vpinsrb $0, (%rdi), %xmm0, %xmm0
; AVX-NEXT: vpinsrb $1, 1(%rdi), %xmm0, %xmm0
; AVX-NEXT: vpinsrb $3, 3(%rdi), %xmm0, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0		%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 0
%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 1		%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 1
%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 3		%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 3
%val0 = load i8, i8* %ptr0		%val0 = load i8, i8* %ptr0
%val1 = load i8, i8* %ptr1		%val1 = load i8, i8* %ptr1
%val3 = load i8, i8* %ptr3		%val3 = load i8, i8* %ptr3
%res0 = insertelement <16 x i8> undef, i8 %val0, i32 0		%res0 = insertelement <16 x i8> undef, i8 %val0, i32 0

test/CodeGen/X86/merge-consecutive-loads-256.ll

%res2 = insertelement <8 x i32> %res0, i32 %val2, i32 2		%res2 = insertelement <8 x i32> %res0, i32 %val2, i32 2
%res4 = insertelement <8 x i32> %res2, i32 %val4, i32 4		%res4 = insertelement <8 x i32> %res2, i32 %val4, i32 4
%res5 = insertelement <8 x i32> %res4, i32 0, i32 5		%res5 = insertelement <8 x i32> %res4, i32 0, i32 5
%res7 = insertelement <8 x i32> %res5, i32 %val7, i32 7		%res7 = insertelement <8 x i32> %res5, i32 %val7, i32 7
ret <8 x i32> %res7		ret <8 x i32> %res7
}		}

define <16 x i16> @merge_16i16_i16_89zzzuuuuuuuuuuuz(i16* %ptr) nounwind uwtable noinline ssp {		define <16 x i16> @merge_16i16_i16_89zzzuuuuuuuuuuuz(i16* %ptr) nounwind uwtable noinline ssp {
; AVX1-LABEL: merge_16i16_i16_89zzzuuuuuuuuuuuz:		; AVX-LABEL: merge_16i16_i16_89zzzuuuuuuuuuuuz:
; AVX1: # BB#0:		; AVX: # BB#0:
; AVX1-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX1-NEXT: vpinsrw $0, 16(%rdi), %xmm0, %xmm1		; AVX-NEXT: retq
; AVX1-NEXT: vpinsrw $1, 18(%rdi), %xmm1, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: retq
;
; AVX2-LABEL: merge_16i16_i16_89zzzuuuuuuuuuuuz:
; AVX2: # BB#0:
; AVX2-NEXT: vpxor %xmm0, %xmm0, %xmm0
; AVX2-NEXT: vpinsrw $0, 16(%rdi), %xmm0, %xmm1
; AVX2-NEXT: vpinsrw $1, 18(%rdi), %xmm1, %xmm1
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: retq
;
; AVX512F-LABEL: merge_16i16_i16_89zzzuuuuuuuuuuuz:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
; AVX512F-NEXT: vpinsrw $0, 16(%rdi), %xmm0, %xmm1
; AVX512F-NEXT: vpinsrw $1, 18(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: retq
%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 8		%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 8
%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 9		%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 9
%val0 = load i16, i16* %ptr0		%val0 = load i16, i16* %ptr0
%val1 = load i16, i16* %ptr1		%val1 = load i16, i16* %ptr1
%res0 = insertelement <16 x i16> undef, i16 %val0, i16 0		%res0 = insertelement <16 x i16> undef, i16 %val0, i16 0
%res1 = insertelement <16 x i16> %res0, i16 %val1, i16 1		%res1 = insertelement <16 x i16> %res0, i16 %val1, i16 1
%res2 = insertelement <16 x i16> %res1, i16 0, i16 2		%res2 = insertelement <16 x i16> %res1, i16 0, i16 2
%res3 = insertelement <16 x i16> %res2, i16 0, i16 3		%res3 = insertelement <16 x i16> %res2, i16 0, i16 3
[... 91 lines not shown ...]
%resE = insertelement <16 x i16> %resD, i16 %valE, i16 14		%resE = insertelement <16 x i16> %resD, i16 %valE, i16 14
%resF = insertelement <16 x i16> %resE, i16 %valF, i16 15		%resF = insertelement <16 x i16> %resE, i16 %valF, i16 15
ret <16 x i16> %resF		ret <16 x i16> %resF
}		}

define <32 x i8> @merge_32i8_i8_45u7uuuuuuuuuuuuuuuuuuuuuuuuuuuu(i8* %ptr) nounwind uwtable noinline ssp {		define <32 x i8> @merge_32i8_i8_45u7uuuuuuuuuuuuuuuuuuuuuuuuuuuu(i8* %ptr) nounwind uwtable noinline ssp {
; AVX-LABEL: merge_32i8_i8_45u7uuuuuuuuuuuuuuuuuuuuuuuuuuuu:		; AVX-LABEL: merge_32i8_i8_45u7uuuuuuuuuuuuuuuuuuuuuuuuuuuu:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vpinsrb $0, 4(%rdi), %xmm0, %xmm0		; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vpinsrb $1, 5(%rdi), %xmm0, %xmm0
; AVX-NEXT: vpinsrb $3, 7(%rdi), %xmm0, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 4		%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 4
%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 5		%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 5
%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 7		%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 7
%val0 = load i8, i8* %ptr0		%val0 = load i8, i8* %ptr0
%val1 = load i8, i8* %ptr1		%val1 = load i8, i8* %ptr1
%val3 = load i8, i8* %ptr3		%val3 = load i8, i8* %ptr3
%res0 = insertelement <32 x i8> undef, i8 %val0, i8 0		%res0 = insertelement <32 x i8> undef, i8 %val0, i8 0
%res1 = insertelement <32 x i8> %res0, i8 %val1, i8 1		%res1 = insertelement <32 x i8> %res0, i8 %val1, i8 1
%res3 = insertelement <32 x i8> %res1, i8 %val3, i8 3		%res3 = insertelement <32 x i8> %res1, i8 %val3, i8 3
ret <32 x i8> %res3		ret <32 x i8> %res3
}		}

define <32 x i8> @merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu(i8* %ptr) nounwind uwtable noinline ssp {		define <32 x i8> @merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu(i8* %ptr) nounwind uwtable noinline ssp {
; AVX1-LABEL: merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu:		; AVX-LABEL: merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu:
; AVX1: # BB#0:		; AVX: # BB#0:
; AVX1-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX1-NEXT: vpinsrb $0, 2(%rdi), %xmm0, %xmm1		; AVX-NEXT: retq
; AVX1-NEXT: vpinsrb $1, 3(%rdi), %xmm1, %xmm1
; AVX1-NEXT: vpinsrb $3, 5(%rdi), %xmm1, %xmm1
; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
; AVX1-NEXT: retq
;
; AVX2-LABEL: merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu:
; AVX2: # BB#0:
; AVX2-NEXT: vpxor %xmm0, %xmm0, %xmm0
; AVX2-NEXT: vpinsrb $0, 2(%rdi), %xmm0, %xmm1
; AVX2-NEXT: vpinsrb $1, 3(%rdi), %xmm1, %xmm1
; AVX2-NEXT: vpinsrb $3, 5(%rdi), %xmm1, %xmm1
; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX2-NEXT: retq
;
; AVX512F-LABEL: merge_32i8_i8_23u5uuuuuuuuuuzzzzuuuuuuuuuuuuuu:
; AVX512F: # BB#0:
; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
; AVX512F-NEXT: vpinsrb $0, 2(%rdi), %xmm0, %xmm1
; AVX512F-NEXT: vpinsrb $1, 3(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vpinsrb $3, 5(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: retq
%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 2		%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 2
%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 3		%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 3
%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 5		%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 5
%val0 = load i8, i8* %ptr0		%val0 = load i8, i8* %ptr0
%val1 = load i8, i8* %ptr1		%val1 = load i8, i8* %ptr1
%val3 = load i8, i8* %ptr3		%val3 = load i8, i8* %ptr3
%res0 = insertelement <32 x i8> undef, i8 %val0, i8 0		%res0 = insertelement <32 x i8> undef, i8 %val0, i8 0
%res1 = insertelement <32 x i8> %res0, i8 %val1, i8 1		%res1 = insertelement <32 x i8> %res0, i8 %val1, i8 1
%res3 = insertelement <32 x i8> %res1, i8 %val3, i8 3		%res3 = insertelement <32 x i8> %res1, i8 %val3, i8 3
%resE = insertelement <32 x i8> %res3, i8 0, i8 14		%resE = insertelement <32 x i8> %res3, i8 0, i8 14
%resF = insertelement <32 x i8> %resE, i8 0, i8 15		%resF = insertelement <32 x i8> %resE, i8 0, i8 15
%resG = insertelement <32 x i8> %resF, i8 0, i8 16		%resG = insertelement <32 x i8> %resF, i8 0, i8 16
%resH = insertelement <32 x i8> %resG, i8 0, i8 17		%resH = insertelement <32 x i8> %resG, i8 0, i8 17
ret <32 x i8> %resH		ret <32 x i8> %resH
}		}

test/CodeGen/X86/merge-consecutive-loads-512.ll

%res1 = insertelement <32 x i16> %res0, i16 %val1, i16 1		%res1 = insertelement <32 x i16> %res0, i16 %val1, i16 1
%res3 = insertelement <32 x i16> %res1, i16 %val3, i16 3		%res3 = insertelement <32 x i16> %res1, i16 %val3, i16 3
ret <32 x i16> %res3		ret <32 x i16> %res3
}		}

define <32 x i16> @merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu(i16* %ptr) nounwind uwtable noinline ssp {		define <32 x i16> @merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu(i16* %ptr) nounwind uwtable noinline ssp {
; AVX512F-LABEL: merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu:		; AVX512F-LABEL: merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu:
; AVX512F: # BB#0:		; AVX512F: # BB#0:
; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512F-NEXT: vpinsrw $0, 4(%rdi), %xmm0, %xmm1		; AVX512F-NEXT: vxorps %ymm1, %ymm1, %ymm1
; AVX512F-NEXT: vpinsrw $1, 6(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: vpxor %ymm1, %ymm1, %ymm1
; AVX512F-NEXT: retq		; AVX512F-NEXT: retq
;		;
; AVX512BW-LABEL: merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu:		; AVX512BW-LABEL: merge_32i16_i16_23uzuuuuuuuuuuzzzzuuuuuuuuuuuuuu:
; AVX512BW: # BB#0:		; AVX512BW: # BB#0:
; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512BW-NEXT: vpinsrw $0, 4(%rdi), %xmm0, %xmm1
; AVX512BW-NEXT: vpinsrw $1, 6(%rdi), %xmm1, %xmm1
; AVX512BW-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512BW-NEXT: vpxor %ymm1, %ymm1, %ymm1
; AVX512BW-NEXT: vinserti64x4 $1, %ymm1, %zmm0, %zmm0
; AVX512BW-NEXT: retq		; AVX512BW-NEXT: retq
%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 2		%ptr0 = getelementptr inbounds i16, i16* %ptr, i64 2
%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 3		%ptr1 = getelementptr inbounds i16, i16* %ptr, i64 3
%val0 = load i16, i16* %ptr0		%val0 = load i16, i16* %ptr0
%val1 = load i16, i16* %ptr1		%val1 = load i16, i16* %ptr1
%res0 = insertelement <32 x i16> undef, i16 %val0, i16 0		%res0 = insertelement <32 x i16> undef, i16 %val0, i16 0
%res1 = insertelement <32 x i16> %res0, i16 %val1, i16 1		%res1 = insertelement <32 x i16> %res0, i16 %val1, i16 1
%res3 = insertelement <32 x i16> %res1, i16 0, i16 3		%res3 = insertelement <32 x i16> %res1, i16 0, i16 3
[... 33 lines not shown ...]
%res17 = insertelement <64 x i8> %res16, i8 0, i8 17		%res17 = insertelement <64 x i8> %res16, i8 0, i8 17
%res63 = insertelement <64 x i8> %res17, i8 0, i8 63		%res63 = insertelement <64 x i8> %res17, i8 0, i8 63
ret <64 x i8> %res63		ret <64 x i8> %res63
}		}

define <64 x i8> @merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz(i8* %ptr) nounwind uwtable noinline ssp {		define <64 x i8> @merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz(i8* %ptr) nounwind uwtable noinline ssp {
; AVX512F-LABEL: merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz:		; AVX512F-LABEL: merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz:
; AVX512F: # BB#0:		; AVX512F: # BB#0:
; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX512F-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512F-NEXT: vpinsrb $0, 1(%rdi), %xmm0, %xmm1		; AVX512F-NEXT: vxorps %ymm1, %ymm1, %ymm1
; AVX512F-NEXT: vpinsrb $1, 2(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vpinsrb $3, 4(%rdi), %xmm1, %xmm1
; AVX512F-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512F-NEXT: vpxor %ymm1, %ymm1, %ymm1
; AVX512F-NEXT: retq		; AVX512F-NEXT: retq
;		;
; AVX512BW-LABEL: merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz:		; AVX512BW-LABEL: merge_64i8_i8_12u4uuuuuuuuuuzzzzuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuz:
; AVX512BW: # BB#0:		; AVX512BW: # BB#0:
; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0		; AVX512BW-NEXT: vmovd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; AVX512BW-NEXT: vpinsrb $0, 1(%rdi), %xmm0, %xmm1
; AVX512BW-NEXT: vpinsrb $1, 2(%rdi), %xmm1, %xmm1
; AVX512BW-NEXT: vpinsrb $3, 4(%rdi), %xmm1, %xmm1
; AVX512BW-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
; AVX512BW-NEXT: vpxor %ymm1, %ymm1, %ymm1
; AVX512BW-NEXT: vinserti64x4 $1, %ymm1, %zmm0, %zmm0
; AVX512BW-NEXT: retq		; AVX512BW-NEXT: retq
%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 1		%ptr0 = getelementptr inbounds i8, i8* %ptr, i64 1
%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 2		%ptr1 = getelementptr inbounds i8, i8* %ptr, i64 2
%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 4		%ptr3 = getelementptr inbounds i8, i8* %ptr, i64 4
%val0 = load i8, i8* %ptr0		%val0 = load i8, i8* %ptr0
%val1 = load i8, i8* %ptr1		%val1 = load i8, i8* %ptr1
%val3 = load i8, i8* %ptr3		%val3 = load i8, i8* %ptr3
%res0 = insertelement <64 x i8> undef, i8 %val0, i8 0		%res0 = insertelement <64 x i8> undef, i8 %val0, i8 0