This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/
-
trunk/
-
lib/Target/X86/
-
Target/
-
X86/
-
X86ISelLowering.cpp
-
test/CodeGen/X86/
-
CodeGen/
-
X86/
-
clear_upper_vector_element_bits.ll
2/7
load-partial.ll

Differential D64551

[X86] EltsFromConsecutiveLoads - support common source loads
ClosedPublic

Authored by RKSimon on Jul 11 2019, 3:00 AM.

Download Raw Diff

Details

Reviewers

craig.topper
spatel
niravd
deadalnix

Commits

rGb3d719e1cf02: [X86] EltsFromConsecutiveLoads - support common source loads (REAPPLIED)
rL366681: [X86] EltsFromConsecutiveLoads - support common source loads (REAPPLIED)
rG48104ef7c9c6: [X86] EltsFromConsecutiveLoads - support common source loads
rL366441: [X86] EltsFromConsecutiveLoads - support common source loads

Summary

This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, FindEltLoadSrc, recurses through to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision.Jul 11 2019, 3:00 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2019, 3:00 AM

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

In D64551#1580187, @lebedev.ri wrote:

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

That's what areNonVolatileConsecutiveLoads handles no?

In D64551#1580222, @RKSimon wrote:

In D64551#1580187, @lebedev.ri wrote:

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

That's what areNonVolatileConsecutiveLoads handles no?

It isn't fully obvious whether that only checks that the loads are from sequential locations in memory,
or whether it also checks that it is *legal* to perform those loads as one, at least to me.
I.e. i'm expecting this doesn't fold:

define <4 x float> @load_float4_float3_as_float2_float__with_write(<4 x float>* nocapture readonly dereferenceable(16)) {
  %2 = bitcast <4 x float>* %0 to <2 x float>*
  %3 = load <2 x float>, <2 x float>* %2, align 4
  %4 = extractelement <2 x float> %3, i32 0
  %5 = insertelement <4 x float> undef, float %4, i32 0
  %6 = extractelement <2 x float> %3, i32 1
  %7 = insertelement <4 x float> %5, float %6, i32 1
  %8 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2
  store float 42.0, float* %8 ; !!!
  %9 = load float, float* %8, align 4
  %10 = insertelement <4 x float> %7, float %9, i32 2
  ret <4 x float> %10
}

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Yes, chains will handle these kinds of dependencies - do you want me to add that test ?

In D64551#1580320, @RKSimon wrote:

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Yes, chains will handle these kinds of dependencies

Great to know! No other comments from me.

do you want me to add that test ?

Hmm, i'm not fully sure it's really useful as-is - in *that* case we can just reorder that store before these loads, so *that* case could still be folded.

Any more comments? I'm keen to try and get PR16739 fixed before the release branch.

ping

It's worth noting here in the review that this patch depends on the dereferenceable attribute (see D64205), and that attribute could change meaning as part of the larger changes related to the Attributor pass (D63243).
Based on current definitions, I think this is correct and allowable, so LGTM.

lib/Target/X86/X86ISelLowering.cpp
7505 ↗	(On Diff #209154)	formatting: FindElt... --> findElt
7519–7521 ↗	(On Diff #209154)	Inconsistent methods for getting the constant (APInt vs. uin64_t). I don't think a logic difference is possible given other limits of LLVM, so go with getConstantOperandVal() in both places?

This revision is now accepted and ready to land.Jul 18 2019, 5:41 AM

Closed by commit rL366441: [X86] EltsFromConsecutiveLoads - support common source loads (authored by RKSimon). · Explain WhyJul 18 2019, 7:34 AM

This revision was automatically updated to reflect the committed changes.

This broke compilation for me, with https://martin.st/temp/simd_cmp_avx2.cpp, built with clang -target i686-w64-mingw32 -c -O3 -mavx2 simd_cmp_avx2.cpp, triggering failed asserts. I can file a proper bug later if necessary.

In D64551#1597144, @mstorsjo wrote:

This broke compilation for me, with https://martin.st/temp/simd_cmp_avx2.cpp, built with clang -target i686-w64-mingw32 -c -O3 -mavx2 simd_cmp_avx2.cpp, triggering failed asserts. I can file a proper bug later if necessary.

Filed as a bug at https://bugs.llvm.org/show_bug.cgi?id=42727.

Hi, Simon. This patch has produced a bug in llvm:

t.ll1 KBDownload

the attached t.ll can reproduce this bug:
For t.ll, llvm without this patch produces the correct asm while llvm with this patch produces bad asm:
llvm with this patch:

vmovd   xmm0, dword ptr [rdi + 260] # xmm0 = mem[0],zero,zero,zero
vmovd   xmm1, dword ptr [rdi + 252] # xmm1 = mem[0],zero,zero,zero
vpunpcklqdq     xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
vxorps  xmm1, xmm1, xmm1
vinserti128     ymm0, ymm1, xmm0, 1
vmovdqa ymmword ptr [rsi + 672], ymm0

llvm without this patch:

vmovq   xmm0, qword ptr [rdi + 252] # xmm0 = mem[0],zero
movabs  rax, offset .LCPI0_0
vmovdqa xmm1, xmmword ptr [rax]
vpshufb xmm0, xmm0, xmm1
vmovd   xmm1, dword ptr [rdi + 260] # xmm1 = mem[0],zero,zero,zero
vpunpcklqdq     xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
vxorps  xmm1, xmm1, xmm1
vinserti128     ymm0, ymm1, xmm0, 1
vmovdqa ymmword ptr [rsi + 672], ymm0

When I compared the ‘llc -debug’ output, I found the procedure of Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43 are different:
llvm without this patch:

Legalizing: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Legal node: nothing to do

Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Creating new node: t55: v2i64 = undef
Creating constant: t56: i8 = Constant<4>
Creating constant: t57: i8 = Constant<5>
Creating constant: t58: i8 = Constant<6>
Creating constant: t59: i8 = Constant<7>
Creating constant: t60: i8 = Constant<-1>

llvm with this patch:

Legalizing: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Legal node: nothing to do

Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43

Then I used ‘dbg llc’ and found that EltsFromConsecutiveLoads will return SDValue() while llvm with this patch won’t.

The return SDValue() was deleted by your patch:

if (!ISD::isNON_EXTLoad(Elt.getNode()))
   return SDValue();

So, I added these two lines back to llvm with this patch, then output asm is correct and the same with llvm without this patch:

Thanks @yubing

I've submit a patch to solve the bug which I commented yesterday.
https://reviews.llvm.org/D67210

llvm/trunk/test/CodeGen/X86/load-partial.ll
64	This is also not correct, according to IR.
86	This is not correct, since we are moving 3 float numbers to %xmm0, according to IR, instead of 4 float. Before your patch, the testcase's CHECK is correct: ; AVX: # %bb.0: ; AVX-NEXT: vmovss (%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero ; AVX-NEXT: vinsertps $16, 4(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0],mem[0],xmm0[2,3] ; AVX-NEXT: vinsertps $32, 8(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0,1],mem[0],xmm0[3] ; AVX-NEXT: retq

RKSimon marked 2 inline comments as done.Sep 5 2019, 4:28 AM

RKSimon added inline comments.

llvm/trunk/test/CodeGen/X86/load-partial.ll
64	why?
86	The pointer is 16 byte dereferencable - loading all 4 floats is safe.

I've raised https://bugs.llvm.org/show_bug.cgi?id=43227 to handle this

RKSimon mentioned this in D67210: [x86] bug fix for https://reviews.llvm.org/D64551.Sep 5 2019, 5:40 AM

RKSimon mentioned this in rG29361c704dfa: [X86][SSE] EltsFromConsecutiveLoads - ignore non-zero offset base loads….Sep 5 2019, 8:08 AM

RKSimon mentioned this in rL371078: [X86][SSE] EltsFromConsecutiveLoads - ignore non-zero offset base loads….

yubing added inline comments.Sep 5 2019, 9:51 AM

llvm/trunk/test/CodeGen/X86/load-partial.ll
86	Sorry, It seems I am not familiar with 'dereferencable'. Could you please explain it in detail? All I see is that moving only 3 float numbers to %xmm0, according to IR.

lebedev.ri added inline comments.Sep 5 2019, 10:06 AM

llvm/trunk/test/CodeGen/X86/load-partial.ll
86	... moving 3 floats to %xmm, with 4'th channel being `undef`. As per `dereferenceable` attribute, we can load 16 bytes here, and since we are allowed to replace `undef` with anything, we can just simply load the entire %xmm0. So i'm not seeing an issue here. Can you explain what issue do you see?

yubing added inline comments.Sep 5 2019, 7:05 PM

llvm/trunk/test/CodeGen/X86/load-partial.ll
86	You're right. Now I get your point.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Target/

X86/

X86ISelLowering.cpp

68 lines

test/

CodeGen/

X86/

clear_upper_vector_element_bits.ll

300 lines

load-partial.ll

60 lines

Diff 210563

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 7,498 Lines • ▼ Show 20 Lines	if (LoadSDNode *LD = dyn_cast<LoadSDNode>(SrcOp)) {
SmallVector<int, 8> Mask(NumElems, EltNo);		SmallVector<int, 8> Mask(NumElems, EltNo);

return DAG.getVectorShuffle(NVT, dl, V1, DAG.getUNDEF(NVT), Mask);		return DAG.getVectorShuffle(NVT, dl, V1, DAG.getUNDEF(NVT), Mask);
}		}

return SDValue();		return SDValue();
}		}

		// Recurse to find a LoadSDNode source and the accumulated ByteOffest.
		static bool findEltLoadSrc(SDValue Elt, LoadSDNode *&Ld, int64_t &ByteOffset) {
		if (ISD::isNON_EXTLoad(Elt.getNode())) {
		Ld = cast<LoadSDNode>(Elt);
		ByteOffset = 0;
		return true;
		}

		switch (Elt.getOpcode()) {
		case ISD::BITCAST:
		case ISD::TRUNCATE:
		case ISD::SCALAR_TO_VECTOR:
		return findEltLoadSrc(Elt.getOperand(0), Ld, ByteOffset);
		case ISD::SRL:
		if (isa<ConstantSDNode>(Elt.getOperand(1))) {
		uint64_t Idx = Elt.getConstantOperandVal(1);
		if ((Idx % 8) == 0 && findEltLoadSrc(Elt.getOperand(0), Ld, ByteOffset)) {
		ByteOffset += Idx / 8;
		return true;
		}
		}
		break;
		case ISD::EXTRACT_VECTOR_ELT:
		if (isa<ConstantSDNode>(Elt.getOperand(1))) {
		SDValue Src = Elt.getOperand(0);
		unsigned SrcSizeInBits = Src.getScalarValueSizeInBits();
		unsigned DstSizeInBits = Elt.getScalarValueSizeInBits();
		if (DstSizeInBits == SrcSizeInBits && (SrcSizeInBits % 8) == 0 &&
		findEltLoadSrc(Src, Ld, ByteOffset)) {
		uint64_t Idx = Elt.getConstantOperandVal(1);
		ByteOffset += Idx * (SrcSizeInBits / 8);
		return true;
		}
		}
		break;
		}

		return false;
		}

/// Given the initializing elements 'Elts' of a vector of type 'VT', see if the		/// Given the initializing elements 'Elts' of a vector of type 'VT', see if the
/// elements can be replaced by a single large load which has the same value as		/// elements can be replaced by a single large load which has the same value as
/// a build_vector or insert_subvector whose loaded operands are 'Elts'.		/// a build_vector or insert_subvector whose loaded operands are 'Elts'.
///		///
/// Example: <load i32 a, load i32 a+4, zero, undef> -> zextload a		/// Example: <load i32 a, load i32 a+4, zero, undef> -> zextload a
static SDValue EltsFromConsecutiveLoads(EVT VT, ArrayRef<SDValue> Elts,		static SDValue EltsFromConsecutiveLoads(EVT VT, ArrayRef<SDValue> Elts,
const SDLoc &DL, SelectionDAG &DAG,		const SDLoc &DL, SelectionDAG &DAG,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
bool isAfterLegalize) {		bool isAfterLegalize) {
unsigned NumElems = Elts.size();		unsigned NumElems = Elts.size();

int LastLoadedElt = -1;		int LastLoadedElt = -1;
APInt LoadMask = APInt::getNullValue(NumElems);		APInt LoadMask = APInt::getNullValue(NumElems);
APInt ZeroMask = APInt::getNullValue(NumElems);		APInt ZeroMask = APInt::getNullValue(NumElems);
APInt UndefMask = APInt::getNullValue(NumElems);		APInt UndefMask = APInt::getNullValue(NumElems);

SmallVector<LoadSDNode*, 8> Loads(NumElems, nullptr);		SmallVector<LoadSDNode*, 8> Loads(NumElems, nullptr);
		SmallVector<int64_t, 8> ByteOffsets(NumElems, 0);

// For each element in the initializer, see if we've found a load, zero or an		// For each element in the initializer, see if we've found a load, zero or an
// undef.		// undef.
for (unsigned i = 0; i < NumElems; ++i) {		for (unsigned i = 0; i < NumElems; ++i) {
SDValue Elt = peekThroughBitcasts(Elts[i]);		SDValue Elt = peekThroughBitcasts(Elts[i]);
if (!Elt.getNode())		if (!Elt.getNode())
return SDValue();		return SDValue();
if (Elt.isUndef()) {		if (Elt.isUndef()) {
UndefMask.setBit(i);		UndefMask.setBit(i);
continue;		continue;
}		}
if (X86::isZeroNode(Elt) \|\| ISD::isBuildVectorAllZeros(Elt.getNode())) {		if (X86::isZeroNode(Elt) \|\| ISD::isBuildVectorAllZeros(Elt.getNode())) {
ZeroMask.setBit(i);		ZeroMask.setBit(i);
continue;		continue;
}		}

// Each loaded element must be the correct fractional portion of the		// Each loaded element must be the correct fractional portion of the
// requested vector load.		// requested vector load.
if ((NumElems * Elt.getValueSizeInBits()) != VT.getSizeInBits())		unsigned EltSizeInBits = Elt.getValueSizeInBits();
		if ((NumElems * EltSizeInBits) != VT.getSizeInBits())
return SDValue();		return SDValue();

if (!ISD::isNON_EXTLoad(Elt.getNode()))		if (!findEltLoadSrc(Elt, Loads[i], ByteOffsets[i]))
return SDValue();		return SDValue();
		assert(0 <= ByteOffsets[i] &&
		((ByteOffsets[i] * 8) + EltSizeInBits) <=
		Loads[i]->getValueSizeInBits(0) &&
		"Element offset outside of load bounds");

Loads[i] = cast<LoadSDNode>(Elt);
LoadMask.setBit(i);		LoadMask.setBit(i);
LastLoadedElt = i;		LastLoadedElt = i;
}		}
assert((ZeroMask.countPopulation() + UndefMask.countPopulation() +		assert((ZeroMask.countPopulation() + UndefMask.countPopulation() +
LoadMask.countPopulation()) == NumElems &&		LoadMask.countPopulation()) == NumElems &&
"Incomplete element masks");		"Incomplete element masks");

// Handle Special Cases - all undef or undef/zero.		// Handle Special Cases - all undef or undef/zero.
Show All 13 Lines	assert(EltBaseVT.getSizeInBits() == EltBaseVT.getStoreSizeInBits() &&
"Register/Memory size mismatch");		"Register/Memory size mismatch");
LoadSDNode *LDBase = Loads[FirstLoadedElt];		LoadSDNode *LDBase = Loads[FirstLoadedElt];
assert(LDBase && "Did not find base load for merging consecutive loads");		assert(LDBase && "Did not find base load for merging consecutive loads");
unsigned BaseSizeInBits = EltBaseVT.getStoreSizeInBits();		unsigned BaseSizeInBits = EltBaseVT.getStoreSizeInBits();
unsigned BaseSizeInBytes = BaseSizeInBits / 8;		unsigned BaseSizeInBytes = BaseSizeInBits / 8;
int LoadSizeInBits = (1 + LastLoadedElt - FirstLoadedElt) * BaseSizeInBits;		int LoadSizeInBits = (1 + LastLoadedElt - FirstLoadedElt) * BaseSizeInBits;
assert((BaseSizeInBits % 8) == 0 && "Sub-byte element loads detected");		assert((BaseSizeInBits % 8) == 0 && "Sub-byte element loads detected");

		// Check to see if the element's load is consecutive to the base load
		// or offset from a previous (already checked) load.
		auto CheckConsecutiveLoad = [&](LoadSDNode *Base, int EltIdx) {
		LoadSDNode *Ld = Loads[EltIdx];
		int64_t ByteOffset = ByteOffsets[EltIdx];
		if (ByteOffset && (ByteOffset % BaseSizeInBytes) == 0) {
		int64_t BaseIdx = EltIdx - (ByteOffset / BaseSizeInBytes);
		return (0 <= BaseIdx && BaseIdx < (int)NumElems && LoadMask[BaseIdx] &&
		Loads[BaseIdx] == Ld && ByteOffsets[BaseIdx] == 0);
		}
		return DAG.areNonVolatileConsecutiveLoads(Ld, Base, BaseSizeInBytes,
		EltIdx - FirstLoadedElt);
		};

// Consecutive loads can contain UNDEFS but not ZERO elements.		// Consecutive loads can contain UNDEFS but not ZERO elements.
// Consecutive loads with UNDEFs and ZEROs elements require a		// Consecutive loads with UNDEFs and ZEROs elements require a
// an additional shuffle stage to clear the ZERO elements.		// an additional shuffle stage to clear the ZERO elements.
bool IsConsecutiveLoad = true;		bool IsConsecutiveLoad = true;
bool IsConsecutiveLoadWithZeros = true;		bool IsConsecutiveLoadWithZeros = true;
for (int i = FirstLoadedElt + 1; i <= LastLoadedElt; ++i) {		for (int i = FirstLoadedElt + 1; i <= LastLoadedElt; ++i) {
if (LoadMask[i]) {		if (LoadMask[i]) {
if (!DAG.areNonVolatileConsecutiveLoads(Loads[i], LDBase, BaseSizeInBytes,		if (!CheckConsecutiveLoad(LDBase, i)) {
i - FirstLoadedElt)) {
IsConsecutiveLoad = false;		IsConsecutiveLoad = false;
IsConsecutiveLoadWithZeros = false;		IsConsecutiveLoadWithZeros = false;
break;		break;
}		}
} else if (ZeroMask[i]) {		} else if (ZeroMask[i]) {
IsConsecutiveLoad = false;		IsConsecutiveLoad = false;
}		}
}		}
▲ Show 20 Lines • Show All 37,922 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/clear_upper_vector_element_bits.ll

	Show First 20 Lines • Show All 979 Lines • ▼ Show 20 Lines
	; SSE42-NEXT: movq %rdx, %xmm0			; SSE42-NEXT: movq %rdx, %xmm0
	; SSE42-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]			; SSE42-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE42-NEXT: popq %rbx			; SSE42-NEXT: popq %rbx
	; SSE42-NEXT: retq			; SSE42-NEXT: retq
	;			;
	; AVX1-LABEL: _clearupper32xi8b:			; AVX1-LABEL: _clearupper32xi8b:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)			; AVX1-NEXT: vmovaps %xmm0, -{{[0-9]+}}(%rsp)
	; AVX1-NEXT: movq -{{[0-9]+}}(%rsp), %r9			; AVX1-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; AVX1-NEXT: movq -{{[0-9]+}}(%rsp), %rcx			; AVX1-NEXT: movq %rax, %rcx
	; AVX1-NEXT: movq %r9, %r8			; AVX1-NEXT: movq %rax, %rdx
	; AVX1-NEXT: shrq $56, %r8
	; AVX1-NEXT: andl $15, %r8d
	; AVX1-NEXT: movq %rcx, %rsi
	; AVX1-NEXT: movq %rcx, %rdi
	; AVX1-NEXT: movq %rcx, %rdx
	; AVX1-NEXT: movq %rcx, %rax
	; AVX1-NEXT: shrq $32, %rax
	; AVX1-NEXT: andl $15, %eax
	; AVX1-NEXT: shlq $32, %rax
	; AVX1-NEXT: andl $252645135, %ecx # imm = 0xF0F0F0F
	; AVX1-NEXT: orq %rax, %rcx
	; AVX1-NEXT: movq %r9, %rax
	; AVX1-NEXT: shrq $48, %rax
	; AVX1-NEXT: andl $15, %eax
	; AVX1-NEXT: shrq $40, %rdx
	; AVX1-NEXT: andl $15, %edx
	; AVX1-NEXT: shlq $40, %rdx
	; AVX1-NEXT: orq %rcx, %rdx
	; AVX1-NEXT: movq %r9, %rcx
	; AVX1-NEXT: shrq $40, %rcx
	; AVX1-NEXT: andl $15, %ecx
	; AVX1-NEXT: shrq $48, %rdi
	; AVX1-NEXT: andl $15, %edi
	; AVX1-NEXT: shlq $48, %rdi
	; AVX1-NEXT: orq %rdx, %rdi
	; AVX1-NEXT: movq %r9, %rdx
	; AVX1-NEXT: shrq $32, %rdx
	; AVX1-NEXT: andl $15, %edx
	; AVX1-NEXT: shrq $56, %rsi
	; AVX1-NEXT: andl $15, %esi
	; AVX1-NEXT: shlq $56, %rsi
	; AVX1-NEXT: orq %rdi, %rsi
	; AVX1-NEXT: movq %rsi, -{{[0-9]+}}(%rsp)
	; AVX1-NEXT: shlq $32, %rdx
	; AVX1-NEXT: andl $252645135, %r9d # imm = 0xF0F0F0F
	; AVX1-NEXT: orq %rdx, %r9
	; AVX1-NEXT: shlq $40, %rcx
	; AVX1-NEXT: orq %r9, %rcx
	; AVX1-NEXT: shlq $48, %rax
	; AVX1-NEXT: orq %rcx, %rax
	; AVX1-NEXT: shlq $56, %r8
	; AVX1-NEXT: orq %rax, %r8
	; AVX1-NEXT: movq %r8, -{{[0-9]+}}(%rsp)
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
	; AVX1-NEXT: vmovq %xmm0, %rax
	; AVX1-NEXT: movq %rax, %r8
	; AVX1-NEXT: movq %rax, %r9
	; AVX1-NEXT: movq %rax, %rsi			; AVX1-NEXT: movq %rax, %rsi
	; AVX1-NEXT: movq %rax, %rdi			; AVX1-NEXT: movq %rax, %rdi
	; AVX1-NEXT: movl %eax, %ecx
	; AVX1-NEXT: movl %eax, %edx
	; AVX1-NEXT: vmovd %eax, %xmm1
	; AVX1-NEXT: shrl $8, %eax
	; AVX1-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1
	; AVX1-NEXT: shrl $16, %edx
	; AVX1-NEXT: vpinsrb $2, %edx, %xmm1, %xmm1
	; AVX1-NEXT: shrl $24, %ecx
	; AVX1-NEXT: vpinsrb $3, %ecx, %xmm1, %xmm1
	; AVX1-NEXT: shrq $32, %rdi			; AVX1-NEXT: shrq $32, %rdi
	; AVX1-NEXT: vpinsrb $4, %edi, %xmm1, %xmm1			; AVX1-NEXT: andl $15, %edi
				; AVX1-NEXT: shlq $32, %rdi
				; AVX1-NEXT: andl $252645135, %eax # imm = 0xF0F0F0F
				; AVX1-NEXT: orq %rdi, %rax
				; AVX1-NEXT: movq -{{[0-9]+}}(%rsp), %rdi
	; AVX1-NEXT: shrq $40, %rsi			; AVX1-NEXT: shrq $40, %rsi
	; AVX1-NEXT: vpinsrb $5, %esi, %xmm1, %xmm1			; AVX1-NEXT: andl $15, %esi
	; AVX1-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm2			; AVX1-NEXT: shlq $40, %rsi
	; AVX1-NEXT: shrq $48, %r9			; AVX1-NEXT: orq %rax, %rsi
	; AVX1-NEXT: vpinsrb $6, %r9d, %xmm1, %xmm1			; AVX1-NEXT: movq %rdi, %rax
	; AVX1-NEXT: vpextrq $1, %xmm0, %rax			; AVX1-NEXT: shrq $48, %rdx
	; AVX1-NEXT: shrq $56, %r8			; AVX1-NEXT: andl $15, %edx
	; AVX1-NEXT: vpinsrb $7, %r8d, %xmm1, %xmm0			; AVX1-NEXT: shlq $48, %rdx
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: orq %rsi, %rdx
	; AVX1-NEXT: shrl $8, %ecx			; AVX1-NEXT: movq %rdi, %rsi
	; AVX1-NEXT: vpinsrb $8, %eax, %xmm0, %xmm0			; AVX1-NEXT: shrq $56, %rcx
	; AVX1-NEXT: vpinsrb $9, %ecx, %xmm0, %xmm0			; AVX1-NEXT: andl $15, %ecx
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: shlq $56, %rcx
	; AVX1-NEXT: shrl $16, %ecx			; AVX1-NEXT: orq %rdx, %rcx
	; AVX1-NEXT: vpinsrb $10, %ecx, %xmm0, %xmm0			; AVX1-NEXT: movq %rdi, %rdx
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: movq %rcx, -{{[0-9]+}}(%rsp)
	; AVX1-NEXT: shrl $24, %ecx			; AVX1-NEXT: movq %rdi, %rcx
	; AVX1-NEXT: vpinsrb $11, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movq %rax, %rcx
	; AVX1-NEXT: shrq $32, %rcx			; AVX1-NEXT: shrq $32, %rcx
	; AVX1-NEXT: vpinsrb $12, %ecx, %xmm0, %xmm0			; AVX1-NEXT: andl $15, %ecx
	; AVX1-NEXT: movq %rax, %rcx			; AVX1-NEXT: shlq $32, %rcx
	; AVX1-NEXT: shrq $40, %rcx			; AVX1-NEXT: andl $252645135, %edi # imm = 0xF0F0F0F
	; AVX1-NEXT: vpinsrb $13, %ecx, %xmm0, %xmm0			; AVX1-NEXT: orq %rcx, %rdi
	; AVX1-NEXT: movq %rax, %rcx			; AVX1-NEXT: shrq $40, %rdx
	; AVX1-NEXT: shrq $48, %rcx			; AVX1-NEXT: andl $15, %edx
	; AVX1-NEXT: vpinsrb $14, %ecx, %xmm0, %xmm0			; AVX1-NEXT: shlq $40, %rdx
	; AVX1-NEXT: vmovq %xmm2, %rcx			; AVX1-NEXT: orq %rdi, %rdx
				; AVX1-NEXT: shrq $48, %rsi
				; AVX1-NEXT: andl $15, %esi
				; AVX1-NEXT: shlq $48, %rsi
				; AVX1-NEXT: orq %rdx, %rsi
	; AVX1-NEXT: shrq $56, %rax			; AVX1-NEXT: shrq $56, %rax
	; AVX1-NEXT: vpinsrb $15, %eax, %xmm0, %xmm0			; AVX1-NEXT: andl $15, %eax
				; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
				; AVX1-NEXT: shlq $56, %rax
				; AVX1-NEXT: orq %rsi, %rax
				; AVX1-NEXT: vmovq %xmm0, %rcx
				; AVX1-NEXT: movq %rax, -{{[0-9]+}}(%rsp)
	; AVX1-NEXT: movl %ecx, %eax			; AVX1-NEXT: movl %ecx, %eax
	; AVX1-NEXT: shrl $8, %eax			; AVX1-NEXT: shrl $8, %eax
	; AVX1-NEXT: vmovd %ecx, %xmm1			; AVX1-NEXT: vmovd %ecx, %xmm1
	; AVX1-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1
	; AVX1-NEXT: movl %ecx, %eax			; AVX1-NEXT: movl %ecx, %eax
	; AVX1-NEXT: shrl $16, %eax			; AVX1-NEXT: shrl $16, %eax
	; AVX1-NEXT: vpinsrb $2, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $2, %eax, %xmm1, %xmm1
	; AVX1-NEXT: movl %ecx, %eax			; AVX1-NEXT: movl %ecx, %eax
	; AVX1-NEXT: shrl $24, %eax			; AVX1-NEXT: shrl $24, %eax
	; AVX1-NEXT: vpinsrb $3, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $3, %eax, %xmm1, %xmm1
	; AVX1-NEXT: movq %rcx, %rax			; AVX1-NEXT: movq %rcx, %rax
	; AVX1-NEXT: shrq $32, %rax			; AVX1-NEXT: shrq $32, %rax
	; AVX1-NEXT: vpinsrb $4, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $4, %eax, %xmm1, %xmm1
	; AVX1-NEXT: movq %rcx, %rax			; AVX1-NEXT: movq %rcx, %rax
	; AVX1-NEXT: shrq $40, %rax			; AVX1-NEXT: shrq $40, %rax
	; AVX1-NEXT: vpinsrb $5, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $5, %eax, %xmm1, %xmm1
	; AVX1-NEXT: movq %rcx, %rax			; AVX1-NEXT: movq %rcx, %rax
	; AVX1-NEXT: shrq $48, %rax			; AVX1-NEXT: shrq $48, %rax
	; AVX1-NEXT: vpinsrb $6, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $6, %eax, %xmm1, %xmm1
	; AVX1-NEXT: vpextrq $1, %xmm2, %rax			; AVX1-NEXT: vpextrq $1, %xmm0, %rax
	; AVX1-NEXT: shrq $56, %rcx			; AVX1-NEXT: shrq $56, %rcx
	; AVX1-NEXT: vpinsrb $7, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $7, %ecx, %xmm1, %xmm0
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: movl %eax, %ecx
	; AVX1-NEXT: shrl $8, %ecx			; AVX1-NEXT: shrl $8, %ecx
	; AVX1-NEXT: vpinsrb $8, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $8, %eax, %xmm0, %xmm0
	; AVX1-NEXT: vpinsrb $9, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $9, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: movl %eax, %ecx
	; AVX1-NEXT: shrl $16, %ecx			; AVX1-NEXT: shrl $16, %ecx
	; AVX1-NEXT: vpinsrb $10, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $10, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movl %eax, %ecx			; AVX1-NEXT: movl %eax, %ecx
	; AVX1-NEXT: shrl $24, %ecx			; AVX1-NEXT: shrl $24, %ecx
	; AVX1-NEXT: vpinsrb $11, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $11, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movq %rax, %rcx			; AVX1-NEXT: movq %rax, %rcx
	; AVX1-NEXT: shrq $32, %rcx			; AVX1-NEXT: shrq $32, %rcx
	; AVX1-NEXT: vpinsrb $12, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $12, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movq %rax, %rcx			; AVX1-NEXT: movq %rax, %rcx
	; AVX1-NEXT: shrq $40, %rcx			; AVX1-NEXT: shrq $40, %rcx
	; AVX1-NEXT: vpinsrb $13, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $13, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: movq %rax, %rcx			; AVX1-NEXT: movq %rax, %rcx
	; AVX1-NEXT: shrq $48, %rcx			; AVX1-NEXT: shrq $48, %rcx
	; AVX1-NEXT: vpinsrb $14, %ecx, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $14, %ecx, %xmm0, %xmm0
	; AVX1-NEXT: shrq $56, %rax			; AVX1-NEXT: shrq $56, %rax
	; AVX1-NEXT: vpinsrb $15, %eax, %xmm1, %xmm1			; AVX1-NEXT: vpinsrb $15, %eax, %xmm0, %xmm0
				; AVX1-NEXT: vmovaps -{{[0-9]+}}(%rsp), %xmm1
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: _clearupper32xi8b:			; AVX2-LABEL: _clearupper32xi8b:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp)			; AVX2-NEXT: vmovdqa %xmm0, -{{[0-9]+}}(%rsp)
	; AVX2-NEXT: movq -{{[0-9]+}}(%rsp), %r9			; AVX2-NEXT: movq -{{[0-9]+}}(%rsp), %rax
	; AVX2-NEXT: movq -{{[0-9]+}}(%rsp), %rcx			; AVX2-NEXT: movq %rax, %rcx
	; AVX2-NEXT: movq %r9, %r8			; AVX2-NEXT: movq %rax, %rdx
	; AVX2-NEXT: shrq $56, %r8
	; AVX2-NEXT: andl $15, %r8d
	; AVX2-NEXT: movq %rcx, %rsi
	; AVX2-NEXT: movq %rcx, %rdi
	; AVX2-NEXT: movq %rcx, %rdx
	; AVX2-NEXT: movq %rcx, %rax
	; AVX2-NEXT: shrq $32, %rax
	; AVX2-NEXT: andl $15, %eax
	; AVX2-NEXT: shlq $32, %rax
	; AVX2-NEXT: andl $252645135, %ecx # imm = 0xF0F0F0F
	; AVX2-NEXT: orq %rax, %rcx
	; AVX2-NEXT: movq %r9, %rax
	; AVX2-NEXT: shrq $48, %rax
	; AVX2-NEXT: andl $15, %eax
	; AVX2-NEXT: shrq $40, %rdx
	; AVX2-NEXT: andl $15, %edx
	; AVX2-NEXT: shlq $40, %rdx
	; AVX2-NEXT: orq %rcx, %rdx
	; AVX2-NEXT: movq %r9, %rcx
	; AVX2-NEXT: shrq $40, %rcx
	; AVX2-NEXT: andl $15, %ecx
	; AVX2-NEXT: shrq $48, %rdi
	; AVX2-NEXT: andl $15, %edi
	; AVX2-NEXT: shlq $48, %rdi
	; AVX2-NEXT: orq %rdx, %rdi
	; AVX2-NEXT: movq %r9, %rdx
	; AVX2-NEXT: shrq $32, %rdx
	; AVX2-NEXT: andl $15, %edx
	; AVX2-NEXT: shrq $56, %rsi
	; AVX2-NEXT: andl $15, %esi
	; AVX2-NEXT: shlq $56, %rsi
	; AVX2-NEXT: orq %rdi, %rsi
	; AVX2-NEXT: movq %rsi, -{{[0-9]+}}(%rsp)
	; AVX2-NEXT: shlq $32, %rdx
	; AVX2-NEXT: andl $252645135, %r9d # imm = 0xF0F0F0F
	; AVX2-NEXT: orq %rdx, %r9
	; AVX2-NEXT: shlq $40, %rcx
	; AVX2-NEXT: orq %r9, %rcx
	; AVX2-NEXT: shlq $48, %rax
	; AVX2-NEXT: orq %rcx, %rax
	; AVX2-NEXT: shlq $56, %r8
	; AVX2-NEXT: orq %rax, %r8
	; AVX2-NEXT: movq %r8, -{{[0-9]+}}(%rsp)
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0
	; AVX2-NEXT: vmovq %xmm0, %rax
	; AVX2-NEXT: movq %rax, %r8
	; AVX2-NEXT: movq %rax, %r9
	; AVX2-NEXT: movq %rax, %rsi			; AVX2-NEXT: movq %rax, %rsi
	; AVX2-NEXT: movq %rax, %rdi			; AVX2-NEXT: movq %rax, %rdi
	; AVX2-NEXT: movl %eax, %ecx
	; AVX2-NEXT: movl %eax, %edx
	; AVX2-NEXT: vmovd %eax, %xmm1
	; AVX2-NEXT: shrl $8, %eax
	; AVX2-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1
	; AVX2-NEXT: shrl $16, %edx
	; AVX2-NEXT: vpinsrb $2, %edx, %xmm1, %xmm1
	; AVX2-NEXT: shrl $24, %ecx
	; AVX2-NEXT: vpinsrb $3, %ecx, %xmm1, %xmm1
	; AVX2-NEXT: shrq $32, %rdi			; AVX2-NEXT: shrq $32, %rdi
	; AVX2-NEXT: vpinsrb $4, %edi, %xmm1, %xmm1			; AVX2-NEXT: andl $15, %edi
				; AVX2-NEXT: shlq $32, %rdi
				; AVX2-NEXT: andl $252645135, %eax # imm = 0xF0F0F0F
				; AVX2-NEXT: orq %rdi, %rax
				; AVX2-NEXT: movq -{{[0-9]+}}(%rsp), %rdi
	; AVX2-NEXT: shrq $40, %rsi			; AVX2-NEXT: shrq $40, %rsi
	; AVX2-NEXT: vpinsrb $5, %esi, %xmm1, %xmm1			; AVX2-NEXT: andl $15, %esi
	; AVX2-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm2			; AVX2-NEXT: shlq $40, %rsi
	; AVX2-NEXT: shrq $48, %r9			; AVX2-NEXT: orq %rax, %rsi
	; AVX2-NEXT: vpinsrb $6, %r9d, %xmm1, %xmm1			; AVX2-NEXT: movq %rdi, %rax
	; AVX2-NEXT: vpextrq $1, %xmm0, %rax			; AVX2-NEXT: shrq $48, %rdx
	; AVX2-NEXT: shrq $56, %r8			; AVX2-NEXT: andl $15, %edx
	; AVX2-NEXT: vpinsrb $7, %r8d, %xmm1, %xmm0			; AVX2-NEXT: shlq $48, %rdx
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: orq %rsi, %rdx
	; AVX2-NEXT: shrl $8, %ecx			; AVX2-NEXT: movq %rdi, %rsi
	; AVX2-NEXT: vpinsrb $8, %eax, %xmm0, %xmm0			; AVX2-NEXT: shrq $56, %rcx
	; AVX2-NEXT: vpinsrb $9, %ecx, %xmm0, %xmm0			; AVX2-NEXT: andl $15, %ecx
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: shlq $56, %rcx
	; AVX2-NEXT: shrl $16, %ecx			; AVX2-NEXT: orq %rdx, %rcx
	; AVX2-NEXT: vpinsrb $10, %ecx, %xmm0, %xmm0			; AVX2-NEXT: movq %rdi, %rdx
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: movq %rcx, -{{[0-9]+}}(%rsp)
	; AVX2-NEXT: shrl $24, %ecx			; AVX2-NEXT: movq %rdi, %rcx
	; AVX2-NEXT: vpinsrb $11, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movq %rax, %rcx
	; AVX2-NEXT: shrq $32, %rcx			; AVX2-NEXT: shrq $32, %rcx
	; AVX2-NEXT: vpinsrb $12, %ecx, %xmm0, %xmm0			; AVX2-NEXT: andl $15, %ecx
	; AVX2-NEXT: movq %rax, %rcx			; AVX2-NEXT: shlq $32, %rcx
	; AVX2-NEXT: shrq $40, %rcx			; AVX2-NEXT: andl $252645135, %edi # imm = 0xF0F0F0F
	; AVX2-NEXT: vpinsrb $13, %ecx, %xmm0, %xmm0			; AVX2-NEXT: orq %rcx, %rdi
	; AVX2-NEXT: movq %rax, %rcx			; AVX2-NEXT: shrq $40, %rdx
	; AVX2-NEXT: shrq $48, %rcx			; AVX2-NEXT: andl $15, %edx
	; AVX2-NEXT: vpinsrb $14, %ecx, %xmm0, %xmm0			; AVX2-NEXT: shlq $40, %rdx
	; AVX2-NEXT: vmovq %xmm2, %rcx			; AVX2-NEXT: orq %rdi, %rdx
				; AVX2-NEXT: shrq $48, %rsi
				; AVX2-NEXT: andl $15, %esi
				; AVX2-NEXT: shlq $48, %rsi
				; AVX2-NEXT: orq %rdx, %rsi
	; AVX2-NEXT: shrq $56, %rax			; AVX2-NEXT: shrq $56, %rax
	; AVX2-NEXT: vpinsrb $15, %eax, %xmm0, %xmm0			; AVX2-NEXT: andl $15, %eax
				; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0
				; AVX2-NEXT: shlq $56, %rax
				; AVX2-NEXT: orq %rsi, %rax
				; AVX2-NEXT: vmovq %xmm0, %rcx
				; AVX2-NEXT: movq %rax, -{{[0-9]+}}(%rsp)
	; AVX2-NEXT: movl %ecx, %eax			; AVX2-NEXT: movl %ecx, %eax
	; AVX2-NEXT: shrl $8, %eax			; AVX2-NEXT: shrl $8, %eax
	; AVX2-NEXT: vmovd %ecx, %xmm1			; AVX2-NEXT: vmovd %ecx, %xmm1
	; AVX2-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $1, %eax, %xmm1, %xmm1
	; AVX2-NEXT: movl %ecx, %eax			; AVX2-NEXT: movl %ecx, %eax
	; AVX2-NEXT: shrl $16, %eax			; AVX2-NEXT: shrl $16, %eax
	; AVX2-NEXT: vpinsrb $2, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $2, %eax, %xmm1, %xmm1
	; AVX2-NEXT: movl %ecx, %eax			; AVX2-NEXT: movl %ecx, %eax
	; AVX2-NEXT: shrl $24, %eax			; AVX2-NEXT: shrl $24, %eax
	; AVX2-NEXT: vpinsrb $3, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $3, %eax, %xmm1, %xmm1
	; AVX2-NEXT: movq %rcx, %rax			; AVX2-NEXT: movq %rcx, %rax
	; AVX2-NEXT: shrq $32, %rax			; AVX2-NEXT: shrq $32, %rax
	; AVX2-NEXT: vpinsrb $4, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $4, %eax, %xmm1, %xmm1
	; AVX2-NEXT: movq %rcx, %rax			; AVX2-NEXT: movq %rcx, %rax
	; AVX2-NEXT: shrq $40, %rax			; AVX2-NEXT: shrq $40, %rax
	; AVX2-NEXT: vpinsrb $5, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $5, %eax, %xmm1, %xmm1
	; AVX2-NEXT: movq %rcx, %rax			; AVX2-NEXT: movq %rcx, %rax
	; AVX2-NEXT: shrq $48, %rax			; AVX2-NEXT: shrq $48, %rax
	; AVX2-NEXT: vpinsrb $6, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $6, %eax, %xmm1, %xmm1
	; AVX2-NEXT: vpextrq $1, %xmm2, %rax			; AVX2-NEXT: vpextrq $1, %xmm0, %rax
	; AVX2-NEXT: shrq $56, %rcx			; AVX2-NEXT: shrq $56, %rcx
	; AVX2-NEXT: vpinsrb $7, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $7, %ecx, %xmm1, %xmm0
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: movl %eax, %ecx
	; AVX2-NEXT: shrl $8, %ecx			; AVX2-NEXT: shrl $8, %ecx
	; AVX2-NEXT: vpinsrb $8, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $8, %eax, %xmm0, %xmm0
	; AVX2-NEXT: vpinsrb $9, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $9, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: movl %eax, %ecx
	; AVX2-NEXT: shrl $16, %ecx			; AVX2-NEXT: shrl $16, %ecx
	; AVX2-NEXT: vpinsrb $10, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $10, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movl %eax, %ecx			; AVX2-NEXT: movl %eax, %ecx
	; AVX2-NEXT: shrl $24, %ecx			; AVX2-NEXT: shrl $24, %ecx
	; AVX2-NEXT: vpinsrb $11, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $11, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movq %rax, %rcx			; AVX2-NEXT: movq %rax, %rcx
	; AVX2-NEXT: shrq $32, %rcx			; AVX2-NEXT: shrq $32, %rcx
	; AVX2-NEXT: vpinsrb $12, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $12, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movq %rax, %rcx			; AVX2-NEXT: movq %rax, %rcx
	; AVX2-NEXT: shrq $40, %rcx			; AVX2-NEXT: shrq $40, %rcx
	; AVX2-NEXT: vpinsrb $13, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $13, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: movq %rax, %rcx			; AVX2-NEXT: movq %rax, %rcx
	; AVX2-NEXT: shrq $48, %rcx			; AVX2-NEXT: shrq $48, %rcx
	; AVX2-NEXT: vpinsrb $14, %ecx, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $14, %ecx, %xmm0, %xmm0
	; AVX2-NEXT: shrq $56, %rax			; AVX2-NEXT: shrq $56, %rax
	; AVX2-NEXT: vpinsrb $15, %eax, %xmm1, %xmm1			; AVX2-NEXT: vpinsrb $15, %eax, %xmm0, %xmm0
				; AVX2-NEXT: vmovdqa -{{[0-9]+}}(%rsp), %xmm1
	; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0			; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%x4 = bitcast <32 x i8> %0 to <64 x i4>			%x4 = bitcast <32 x i8> %0 to <64 x i4>
	%r0 = insertelement <64 x i4> %x4, i4 zeroinitializer, i32 1			%r0 = insertelement <64 x i4> %x4, i4 zeroinitializer, i32 1
	%r1 = insertelement <64 x i4> %r0, i4 zeroinitializer, i32 3			%r1 = insertelement <64 x i4> %r0, i4 zeroinitializer, i32 3
	%r2 = insertelement <64 x i4> %r1, i4 zeroinitializer, i32 5			%r2 = insertelement <64 x i4> %r1, i4 zeroinitializer, i32 5
	%r3 = insertelement <64 x i4> %r2, i4 zeroinitializer, i32 7			%r3 = insertelement <64 x i4> %r2, i4 zeroinitializer, i32 7
	%r4 = insertelement <64 x i4> %r3, i4 zeroinitializer, i32 9			%r4 = insertelement <64 x i4> %r3, i4 zeroinitializer, i32 9
	▲ Show 20 Lines • Show All 185 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/load-partial.ll

Show First 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	; AVX-NEXT: retq
%ld2 = load float, float* %p2, align 4		%ld2 = load float, float* %p2, align 4
%r0 = insertelement <8 x float> undef, float %ld0, i32 0		%r0 = insertelement <8 x float> undef, float %ld0, i32 0
%r1 = insertelement <8 x float> %r0, float %ld1, i32 1		%r1 = insertelement <8 x float> %r0, float %ld1, i32 1
%r2 = insertelement <8 x float> %r1, float %ld2, i32 2		%r2 = insertelement <8 x float> %r1, float %ld2, i32 2
ret <8 x float> %r2		ret <8 x float> %r2
}		}

define <4 x float> @load_float4_float3_as_float2_float(<4 x float>* nocapture readonly dereferenceable(16)) {		define <4 x float> @load_float4_float3_as_float2_float(<4 x float>* nocapture readonly dereferenceable(16)) {
; SSE2-LABEL: load_float4_float3_as_float2_float:		; SSE-LABEL: load_float4_float3_as_float2_float:
; SSE2: # %bb.0:		; SSE: # %bb.0:
; SSE2-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero		; SSE-NEXT: movups (%rdi), %xmm0
; SSE2-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero		; SSE-NEXT: retq
; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[3,0]
; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,2]
; SSE2-NEXT: retq
;
; SSSE3-LABEL: load_float4_float3_as_float2_float:
; SSSE3: # %bb.0:
; SSSE3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
; SSSE3-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSSE3-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[3,0]
; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,2]
; SSSE3-NEXT: retq
;
; SSE41-LABEL: load_float4_float3_as_float2_float:
; SSE41: # %bb.0:
; SSE41-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
; SSE41-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; SSE41-NEXT: retq
;		;
; AVX-LABEL: load_float4_float3_as_float2_float:		; AVX-LABEL: load_float4_float3_as_float2_float:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero		; AVX-NEXT: vmovups (%rdi), %xmm0
		yubingUnsubmitted Not Done Reply Inline Actions This is also not correct, according to IR. yubing: This is also not correct, according to IR.
		RKSimonAuthorUnsubmitted Done Reply Inline Actions why? RKSimon: why?
; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%2 = bitcast <4 x float>* %0 to <2 x float>*		%2 = bitcast <4 x float>* %0 to <2 x float>*
%3 = load <2 x float>, <2 x float>* %2, align 4		%3 = load <2 x float>, <2 x float>* %2, align 4
%4 = extractelement <2 x float> %3, i32 0		%4 = extractelement <2 x float> %3, i32 0
%5 = insertelement <4 x float> undef, float %4, i32 0		%5 = insertelement <4 x float> undef, float %4, i32 0
%6 = extractelement <2 x float> %3, i32 1		%6 = extractelement <2 x float> %3, i32 1
%7 = insertelement <4 x float> %5, float %6, i32 1		%7 = insertelement <4 x float> %5, float %6, i32 1
%8 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2		%8 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2
%9 = load float, float* %8, align 4		%9 = load float, float* %8, align 4
%10 = insertelement <4 x float> %7, float %9, i32 2		%10 = insertelement <4 x float> %7, float %9, i32 2
ret <4 x float> %10		ret <4 x float> %10
}		}

define <4 x float> @load_float4_float3_trunc(<4 x float>* nocapture readonly dereferenceable(16)) {		define <4 x float> @load_float4_float3_trunc(<4 x float>* nocapture readonly dereferenceable(16)) {
; SSE2-LABEL: load_float4_float3_trunc:		; SSE-LABEL: load_float4_float3_trunc:
; SSE2: # %bb.0:		; SSE: # %bb.0:
; SSE2-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero		; SSE-NEXT: movaps (%rdi), %xmm0
; SSE2-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero		; SSE-NEXT: retq
; SSE2-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; SSE2-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE2-NEXT: movlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; SSE2-NEXT: retq
;
; SSSE3-LABEL: load_float4_float3_trunc:
; SSSE3: # %bb.0:
; SSSE3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSSE3-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSSE3-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; SSSE3-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSSE3-NEXT: movlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; SSSE3-NEXT: retq
;
; SSE41-LABEL: load_float4_float3_trunc:
; SSE41: # %bb.0:
; SSE41-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE41-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
; SSE41-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; SSE41-NEXT: retq
;		;
; AVX-LABEL: load_float4_float3_trunc:		; AVX-LABEL: load_float4_float3_trunc:
; AVX: # %bb.0:		; AVX: # %bb.0:
; AVX-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero		; AVX-NEXT: vmovaps (%rdi), %xmm0
		yubingUnsubmitted Not Done Reply Inline Actions This is not correct, since we are moving 3 float numbers to %xmm0, according to IR, instead of 4 float. Before your patch, the testcase's CHECK is correct: ; AVX: # %bb.0: ; AVX-NEXT: vmovss (%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero ; AVX-NEXT: vinsertps $16, 4(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0],mem[0],xmm0[2,3] ; AVX-NEXT: vinsertps $32, 8(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0,1],mem[0],xmm0[3] ; AVX-NEXT: retq yubing: This is not correct, since we are moving 3 float numbers to %xmm0, according to IR, instead of…
		RKSimonAuthorUnsubmitted Done Reply Inline Actions The pointer is 16 byte dereferencable - loading all 4 floats is safe. RKSimon: The pointer is 16 byte dereferencable - loading all 4 floats is safe.
		yubingUnsubmitted Not Done Reply Inline Actions Sorry, It seems I am not familiar with 'dereferencable'. Could you please explain it in detail? All I see is that moving only 3 float numbers to %xmm0, according to IR. yubing: Sorry, It seems I am not familiar with 'dereferencable'. Could you please explain it in detail?
		lebedev.riUnsubmitted Not Done Reply Inline Actions ... moving 3 floats to %xmm, with 4'th channel being `undef`. As per `dereferenceable` attribute, we can load 16 bytes here, and since we are allowed to replace `undef` with anything, we can just simply load the entire %xmm0. So i'm not seeing an issue here. Can you explain what issue do you see? lebedev.ri: ... moving 3 floats to %xmm, with 4'th channel being `undef`. As per `dereferenceable`…
		yubingUnsubmitted Not Done Reply Inline Actions You're right. Now I get your point. yubing: You're right. Now I get your point.
; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[2,3]
; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],mem[0],xmm0[3]
; AVX-NEXT: retq		; AVX-NEXT: retq
%2 = bitcast <4 x float>* %0 to i64*		%2 = bitcast <4 x float>* %0 to i64*
%3 = load i64, i64* %2, align 16		%3 = load i64, i64* %2, align 16
%4 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2		%4 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2
%5 = bitcast float* %4 to i64*		%5 = bitcast float* %4 to i64*
%6 = load i64, i64* %5, align 8		%6 = load i64, i64* %5, align 8
%7 = trunc i64 %3 to i32		%7 = trunc i64 %3 to i32
%8 = bitcast i32 %7 to float		%8 = bitcast i32 %7 to float
▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines