This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Detect zeroable shuffle elements from different value types
Closed · Public

Authored by RKSimon on Nov 2 2015, 3:02 PM.

Details

Reviewers

spatel
qcolombet
chandlerc
andreadb

Commits

rGc44472a5bc53: [X86][SSE] Detect zeroable shuffle elements from different value types
rL263906: [X86][SSE] Detect zeroable shuffle elements from different value types

Summary

Improve computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.
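As a rough illustration of the "peek through bitcasts" idea for the case where the BUILD_VECTOR elements are wider than the shuffle lanes, here is a minimal standalone sketch. It uses plain integers rather than SelectionDAG nodes; the helper name `zeroLanesOfWiderConstants` and its parameters are hypothetical, and little-endian lane order is assumed. The sample values mirror the constants in the insertps-combine.ll tests of this revision.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an LLVM API. Each entry of `elts` is one constant
// build-vector element that is `eltBits` wide. The vector is viewed through a
// bitcast as lanes of `laneBits` each (little-endian lane order assumed).
// Returns one flag per narrow lane: true if that lane is all zero bits.
std::vector<bool> zeroLanesOfWiderConstants(const std::vector<uint64_t> &elts,
                                            unsigned eltBits,
                                            unsigned laneBits) {
  assert(eltBits <= 64 && laneBits > 0 && eltBits % laneBits == 0);
  const unsigned scale = eltBits / laneBits;
  const uint64_t laneMask =
      laneBits == 64 ? ~uint64_t{0} : ((uint64_t{1} << laneBits) - 1);
  std::vector<bool> zeroable;
  for (uint64_t elt : elts)
    for (unsigned j = 0; j < scale; ++j)
      // Shift the wanted sub-element down and keep only its low bits.
      zeroable.push_back(((elt >> (j * laneBits)) & laneMask) == 0);
  return zeroable;
}
```

For `<2 x i64> <i64 1, i64 -2>` viewed as four 32-bit lanes, only lane 1 (the upper half of the constant 1) comes out as zero under this model.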

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38992. Nov 2 2015, 3:02 PM

RKSimon retitled this revision to [X86][SSE] Recursive search for zeroable shuffle elements.

RKSimon updated this object.

RKSimon added reviewers: chandlerc, qcolombet, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

Hi Simon,

lib/Target/X86/X86ISelLowering.cpp
6732–6734 ↗	(On Diff #38992)	Do you think it would make sense to limit the recursion level?
6746–6749 ↗	(On Diff #38992)	I don't think you need to compute V1IsZero and V2IsZero anymore. The two calls to 'isBuildVectorAllZeros' are now made redundant by the calls to 'GetSubZeroable'. You can also simplify the if statement at line 6754. In particular, you would only need to check for the presence of an undef index. All the remaining cases should be taken care of by the checks after line 6760.
6777–6787 ↗	(On Diff #38992)	At line 6777, you don't need to check if 'Size < SubSize'. If control reaches line 6777, then Size can never be bigger than or equal to SubSize. The motivation is that the check for (Size < SubSize) is dominated by the checks for (Size == SubSize) and the check for (Size > SubSize). You can also remove the llvm_unreachable at line 6787 as it is not needed.

Added recursion limit and other recommendations from Andrea.

I'm really not convinced this is the correct approach.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

In D14261#289720, @chandlerc wrote:

I'm really not convinced this is the correct approach.

Is it mainly the recursion that is concerning you? There is more that we could do at the DAGCombiner or X86 level to merge shuffles, but it would still be recursion. Other targets have shuffle instructions that set elements to constants (zero and allbits being the most common) as well as input permutations, but I don't see much that makes use of this at all.

I've been trying to think of ways to canonicalize shuffles with zeros/constants inputs to reduce this but can't think of anything that would really help.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

The X86 shuffle combining could do some of this, but it would still be recursive and it would be repeating a lot of what is done at lowering already.

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?
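To make the "2 inputs + zero" remark concrete, here is a small scalar model of the register form of INSERTPS. This is a sketch for illustration, not code from the patch. The type alias `V4` and the function name are hypothetical. The immediate 0x91 in the usage note is my own derivation of a value that reproduces the `zero,xmm0[2,2,3]` pattern seen in the updated tests.

```cpp
#include <array>
#include <cassert>

using V4 = std::array<float, 4>;

// Scalar model of the SSE4.1 INSERTPS register form:
//   imm8[7:6] selects the source lane of `b`,
//   imm8[5:4] selects the destination lane in `a`,
//   imm8[3:0] is a mask of result lanes that are forced to zero.
V4 insertps(V4 a, const V4 &b, unsigned imm8) {
  assert(imm8 <= 0xFF && "immediate must fit in 8 bits");
  const unsigned srcLane = (imm8 >> 6) & 3;
  const unsigned dstLane = (imm8 >> 4) & 3;
  const unsigned zeroMask = imm8 & 0xF;
  a[dstLane] = b[srcLane];
  for (unsigned i = 0; i < 4; ++i)
    if (zeroMask & (1u << i))
      a[i] = 0.0f; // zeroing is applied after the insertion
  return a;
}
```

With both operands equal to the same vector, `insertps(a, a, 0x91)` yields `{0, a[2], a[2], a[3]}`. One instruction therefore covers a permutation, an insertion, and zeroing, which is why knowing the zeroable lanes matters for matching it.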

In D14261#289721, @RKSimon wrote:

In D14261#289720, @chandlerc wrote:

I'm really not convinced this is the correct approach.

Is it mainly the recursion that is concerning you? There is more that we could do at the DAGCombiner or X86 level to merge shuffles, but it would still be recursion.

It's not the same at all though. When we combine N shuffles into 1 shuffle, we can do that in O(N) because we can combine as we go. When we instead recursively match during lowering here we can do O(N^2) work.
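The "combine as we go" argument can be sketched on plain index masks. The code below is a simplified model that only handles single-input shuffles. The names `composeMasks` and `foldChain` are hypothetical and this is not how DAGCombiner is structured. Each step folds one more mask into an accumulated mask, so a chain of N shuffles costs N compositions rather than a fresh walk of the chain per query.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<int>; // -1 marks an undef lane

// Fold `outer` applied on top of `inner` into a single mask, so that
// shuffle(shuffle(x, inner), outer) == shuffle(x, composed).
Mask composeMasks(const Mask &inner, const Mask &outer) {
  Mask composed(outer.size(), -1);
  for (std::size_t i = 0; i < outer.size(); ++i) {
    if (outer[i] < 0)
      continue; // undef stays undef
    assert(static_cast<std::size_t>(outer[i]) < inner.size());
    composed[i] = inner[outer[i]];
  }
  return composed;
}

// Fold a chain of single-input shuffles (applied first to last) into one
// mask, doing one composition per shuffle in the chain.
Mask foldChain(const std::vector<Mask> &chain, std::size_t lanes) {
  Mask acc(lanes);
  for (std::size_t i = 0; i < lanes; ++i)
    acc[i] = static_cast<int>(i); // start from the identity mask
  for (const Mask &m : chain)
    acc = composeMasks(acc, m);
  return acc;
}
```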

Other targets have shuffle instructions that set elements to constants (zero and allbits being the most common) as well as input permutations, but I don't see much that makes use of this at all.

I'm not really concerned about helping other targets here, I'm concerned about making this use the correct model.

I've been trying to think of ways to canonicalize shuffles with zeros/constants inputs to reduce this but can't think of anything that would really help.

Can you describe why not? I'm not seeing it.

Specifically, I can't come up with any way that we would need to look through more than N-1 VECTOR_SHUFFLE nodes to find N inputs.

Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

The X86 shuffle combining could do some of this, but it would still be recursive and it would be repeating a lot of what is done at lowering already.

As above, the point is to do *less* of this at lowering time. Even if we still need some of this logic during lowering in order to handle iterative decomposition, we should minimize it and always avoid unbounded recursion.

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

Ah, I think this is really the critical thing to realize.

Because of the simplification this provides, we should essentially re-associate shuffles so that all constant inputs come from a single build_vector, and so that the build_vector of constants is the last input shuffled:

For example:

S1 = (shuffle A, (build constant, constant, ...))
S2 = (shuffle B, S1)

should combine to:

S1 = (shuffle A, B)
S2 = (shuffle S1, (build constant, constant, ...))

So that we can always identify constant inputs at the bottom of the shuffle tree. We can still indicate *unused* inputs in S1 with undef shuffle indices.

Does this make sense to you? Is there some *other* thing broken by this style of reassociation?
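The reassociation described above can be checked on a toy model. The sketch below assumes all vectors and masks have the same length and that mask indices address the concatenation of the two inputs. The names `shuffle`, `sinkConstants` and `kUndef` are hypothetical. It rewrites `S2 = shuffle(B, shuffle(A, C, m1), m2)` as `shuffle(shuffle(A, B, t1), C, t2)`, so the constant vector C ends up as the last input.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<int>;
using Mask = std::vector<int>;  // index into concat(lhs, rhs); -1 = undef
constexpr int kUndef = INT_MIN; // sentinel value standing in for an undef lane

Vec shuffle(const Vec &lhs, const Vec &rhs, const Mask &mask) {
  const int n = static_cast<int>(lhs.size());
  Vec out(mask.size(), kUndef);
  for (std::size_t i = 0; i < mask.size(); ++i) {
    const int m = mask[i];
    if (m < 0)
      continue;
    assert(m < 2 * n);
    out[i] = m < n ? lhs[m] : rhs[m - n];
  }
  return out;
}

// Given S1 = shuffle(A, C, m1) and S2 = shuffle(B, S1, m2), return masks
// (t1, t2) such that shuffle(shuffle(A, B, t1), C, t2) equals S2.
std::pair<Mask, Mask> sinkConstants(const Mask &m1, const Mask &m2) {
  assert(m1.size() == m2.size());
  const int n = static_cast<int>(m2.size());
  Mask t1(n, -1), t2(n, -1);
  for (int i = 0; i < n; ++i) {
    if (m2[i] < 0)
      continue;               // undef stays undef
    if (m2[i] < n) {          // lane comes from B
      t1[i] = n + m2[i];
      t2[i] = i;
      continue;
    }
    const int k = m1[m2[i] - n]; // lane comes from S1
    if (k < 0)
      continue;
    if (k < n) {              // originally from A
      t1[i] = k;
      t2[i] = i;
    } else {                  // originally from the constant vector C
      t2[i] = k;
    }
  }
  return {t1, t2};
}
```

Lanes of the first shuffle that are fed only by constants are left as undef indices in `t1`, which matches the remark about marking unused inputs with undef shuffle indices.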

In D14261#289722, @chandlerc wrote:
In D14261#289721, @RKSimon wrote:

insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

Ah, I think this is really the critical thing to realize.

Because of the simplification this provides, we should essentially re-associate shuffles so that all constant inputs come from a single build_vector, and so that the build_vector of constants is the last input shuffled:

For example:
S1 = (shuffle A, (build constant, constant, ...))
S2 = (shuffle B, S1)
should combine to:
S1 = (shuffle A, B)
S2 = (shuffle S1, (build constant, constant, ...))
So that we can always identify constant inputs at the bottom of the shuffle tree. We can still indicate *unused* inputs in S1 with undef shuffle indices.

Does this make sense to you? Is there some *other* thing broken by this style of reassociation?

The approach you propose is the canonicalization step I mentioned. My main concern with it is making sure our existing shuffle lowering can be adapted to 'peek' through a bottom shuffle with constants (in this case zeroable vectors). This will affect insertps and the perm2f128 instructions, and possibly others. It's certainly doable, but it may involve reordering some of the lowerings so that they are attempted earlier. This will probably affect build_vector lowering as well, but I won't know the scope until I investigate this more.

Do you think this should be limited to the X86? Large parts of the code, such as combining multiple constant build_vector shuffle inputs into one, would almost certainly be useful in the DAGCombiner and I can't think of reasons why pushing down the shuffles with constants would cause any notable regressions in other targets' shuffle codegen.

That would leave this patch without the recursion element - I think reducing it to just allowing it to check for zeroable build_vectors of different element counts would still be useful, I'll see if I can update this patch.

Cheers, Simon.

RKSimon mentioned this in D15378: [X86] Determine if target shuffle contains zero elements. Dec 9 2015, 6:43 AM

RKSimon mentioned this in rL256706: [X86][SSE41] Added test cases for improving insertps shuffles. Jan 3 2016, 9:17 AM

RKSimon mentioned this in rL256992: [X86] Determine if target shuffle can contain zero elements. Jan 6 2016, 3:28 PM

Hi Simon,

Is this patch still alive?
(I.e., does it need review.)

Thanks,
-Quentin

It needs redoing (and reducing in scope) now that more of the target shuffle combine work is completed. We should be able to reduce it to a test for zero on BUILD_VECTORs with a different number of elements to the mask (the same as what has been done in setTargetShuffleZeroElements).

It just hasn't been high enough on my todo list......

RKSimon mentioned this in rL263606: [X86][SSE41] Additional tests for extracting zeroable shuffle elements. Mar 15 2016, 5:18 PM

As recommended by Chandler, I've adjusted this patch to avoid recursion (we now do this in setTargetShuffleZeroElements for target shuffle combining) and instead focus on improving computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.
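For the opposite direction, where the BUILD_VECTOR has more (narrower) elements than the shuffle mask, a lane can only be treated as zeroable if every sub-element covering it is zero or undef. The following standalone sketch models that rule. The `Elt` struct and the function name are hypothetical stand-ins for SDValue operands, not LLVM types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a build-vector operand: either undef or a constant.
struct Elt {
  bool isUndef;
  int value;
};

// `elts` holds the narrow build-vector operands; `lanes` is the shuffle mask
// size. Each wide lane covers elts.size() / lanes consecutive operands and is
// zeroable only if all of them are undef or zero.
std::vector<bool> zeroLanesOfNarrowerElements(const std::vector<Elt> &elts,
                                              std::size_t lanes) {
  assert(lanes > 0 && elts.size() % lanes == 0);
  const std::size_t scale = elts.size() / lanes;
  std::vector<bool> zeroable(lanes, true);
  for (std::size_t lane = 0; lane < lanes; ++lane)
    for (std::size_t j = 0; j < scale; ++j) {
      const Elt &e = elts[lane * scale + j];
      if (!e.isUndef && e.value != 0)
        zeroable[lane] = false;
    }
  return zeroable;
}
```

Applied to the `<8 x i16> <0, 0, 1, 1, 2, 2, 3, 3>` constant from the tests, viewed as four lanes, only lane 0 is reported as zeroable.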

FWIW, this looks really nice now. Nits below, LGTM with suggested changes.

lib/Target/X86/X86ISelLowering.cpp
7249–7250 ↗	(On Diff #50792)	Please just use 'int' unless you need modular arithmetic...
7272 ↗	(On Diff #50792)	int here as well.
7293–7295 ↗	(On Diff #50792)	int here, and j < Scale.
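One plausible reason for preferring 'int' here (my reading, not stated in the review) is that shuffle masks use negative values for undef lanes, and unsigned arithmetic silently wraps them. The sketch below uses hypothetical function names to contrast the two.

```cpp
#include <cassert>

// Classify shuffle mask entry `m` for inputs of `size` lanes each.
// Returns 0 for undef, 1 for the first input, 2 for the second input.
int classifySigned(int m, int size) {
  assert(size > 0 && m < 2 * size);
  if (m < 0)
    return 0;
  return m < size ? 1 : 2;
}

// The same range test on unsigned values. An undef index of -1 converts to a
// very large unsigned number, so it is misread as a second-input lane.
int classifyUnsigned(unsigned m, unsigned size) {
  return m < size ? 1 : 2;
}
```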

This revision is now accepted and ready to land. Mar 16 2016, 7:53 AM

Closed by commit rL263906: [X86][SSE] Detect zeroable shuffle elements from different value types (authored by RKSimon). Mar 20 2016, 8:51 AM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in rL263911: [X86][SSE] Tidyup setTargetShuffleZeroElements to match…. Mar 20 2016, 10:48 AM

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	50 lines
llvm/trunk/test/CodeGen/X86/insertps-combine.ll	61 lines
llvm/trunk/test/CodeGen/X86/widen_load-2.ll	28 lines

Diff 51129

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

Context: static SmallBitVector computeZeroableShuffleElements(ArrayRef<int> Mask,

   while (V1.getOpcode() == ISD::BITCAST)
     V1 = V1->getOperand(0);
   while (V2.getOpcode() == ISD::BITCAST)
     V2 = V2->getOperand(0);

   bool V1IsZero = ISD::isBuildVectorAllZeros(V1.getNode());
   bool V2IsZero = ISD::isBuildVectorAllZeros(V2.getNode());

+  int VectorSizeInBits = V1.getValueType().getSizeInBits();
+  int ScalarSizeInBits = VectorSizeInBits / Mask.size();
+  assert(!(VectorSizeInBits % ScalarSizeInBits) && "Illegal shuffle mask size");
+
   for (int i = 0, Size = Mask.size(); i < Size; ++i) {
     int M = Mask[i];
     // Handle the easy cases.
     if (M < 0 || (M >= 0 && M < Size && V1IsZero) || (M >= Size && V2IsZero)) {
       Zeroable[i] = true;
       continue;
     }

-    // If this is an index into a build_vector node (which has the same number
-    // of elements), dig out the input value and use it.
+    // Determine shuffle input and normalize the mask.
     SDValue V = M < Size ? V1 : V2;
-    if (V.getOpcode() != ISD::BUILD_VECTOR || Size != (int)V.getNumOperands())
+    M %= Size;
+
+    // Currently we can only search BUILD_VECTOR for UNDEF/ZERO elements.
+    if (V.getOpcode() != ISD::BUILD_VECTOR)
       continue;

-    SDValue Input = V.getOperand(M % Size);
-    // The UNDEF opcode check really should be dead code here, but not quite
-    // worth asserting on (it isn't invalid, just unexpected).
-    if (Input.isUndef() || X86::isZeroNode(Input))
-      Zeroable[i] = true;
+    // If the BUILD_VECTOR has fewer elements then the bitcasted portion of
+    // the (larger) source element must be UNDEF/ZERO.
+    if ((Size % V.getNumOperands()) == 0) {
+      int Scale = Size / V->getNumOperands();
+      SDValue Op = V.getOperand(M / Scale);
+      if (Op.isUndef() || X86::isZeroNode(Op))
+        Zeroable[i] = true;
+      else if (ConstantSDNode *Cst = dyn_cast<ConstantSDNode>(Op)) {
+        APInt Val = Cst->getAPIntValue();
+        Val = Val.lshr((M % Scale) * ScalarSizeInBits);
+        Val = Val.getLoBits(ScalarSizeInBits);
+        Zeroable[i] = (Val == 0);
+      } else if (ConstantFPSDNode *Cst = dyn_cast<ConstantFPSDNode>(Op)) {
+        APInt Val = Cst->getValueAPF().bitcastToAPInt();
+        Val = Val.lshr((M % Scale) * ScalarSizeInBits);
+        Val = Val.getLoBits(ScalarSizeInBits);
+        Zeroable[i] = (Val == 0);
+      }
+      continue;
+    }
+
+    // If the BUILD_VECTOR has more elements then all the (smaller) source
+    // elements must be UNDEF or ZERO.
+    if ((V.getNumOperands() % Size) == 0) {
+      int Scale = V->getNumOperands() / Size;
+      bool AllZeroable = true;
+      for (int j = 0; j < Scale; ++j) {
+        SDValue Op = V.getOperand((M * Scale) + j);
+        AllZeroable &= (Op.isUndef() || X86::isZeroNode(Op));
+      }
+      Zeroable[i] = AllZeroable;
+      continue;
+    }
   }

   return Zeroable;
 }

 // X86 has dedicated unpack instructions that can handle specific blend
 // operations: UNPCKH and UNPCKL.
 static SDValue lowerVectorShuffleWithUNPCK(SDLoc DL, MVT VT, ArrayRef<int> Mask,

llvm/trunk/test/CodeGen/X86/insertps-combine.ll

Context: ; AVX-NEXT: retq

   %res1 = call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a0, <4 x float> %res0, i8 21)
   %res2 = shufflevector <4 x float> %res1, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 2, i32 3>
   ret <4 x float> %res2
 }

 define <4 x float> @insertps_zero_from_v2f64(<4 x float> %a0, <2 x double>* %a1) nounwind {
 ; SSE-LABEL: insertps_zero_from_v2f64:
 ; SSE: # BB#0:
-; SSE-NEXT: movapd {{.*#+}} xmm1 = [1.000000e+00,2.000000e+00]
-; SSE-NEXT: movapd (%rdi), %xmm2
-; SSE-NEXT: addpd %xmm1, %xmm2
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm0[2,0]
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[2,3]
-; SSE-NEXT: movapd %xmm2, (%rdi)
-; SSE-NEXT: movaps %xmm1, %xmm0
+; SSE-NEXT: movapd (%rdi), %xmm1
+; SSE-NEXT: addpd {{.*}}(%rip), %xmm1
+; SSE-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; SSE-NEXT: movapd %xmm1, (%rdi)
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: insertps_zero_from_v2f64:
 ; AVX: # BB#0:
-; AVX-NEXT: vmovapd {{.*#+}} xmm1 = [1.000000e+00,2.000000e+00]
-; AVX-NEXT: vaddpd (%rdi), %xmm1, %xmm2
-; AVX-NEXT: vshufps {{.*#+}} xmm1 = xmm1[2,0],xmm0[2,0]
-; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm1[0,2],xmm0[2,3]
-; AVX-NEXT: vmovapd %xmm2, (%rdi)
+; AVX-NEXT: vmovapd (%rdi), %xmm1
+; AVX-NEXT: vaddpd {{.*}}(%rip), %xmm1, %xmm1
+; AVX-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; AVX-NEXT: vmovapd %xmm1, (%rdi)
 ; AVX-NEXT: retq
   %1 = load <2 x double>, <2 x double>* %a1
   %2 = bitcast <2 x double> <double 1.0, double 2.0> to <4 x float>
   %3 = fadd <2 x double> %1, <double 1.0, double 2.0>
   %4 = shufflevector <4 x float> %a0, <4 x float> %2, <4 x i32> <i32 6, i32 2, i32 2, i32 3>
   store <2 x double> %3, <2 x double> *%a1
   ret <4 x float> %4
 }

 define <4 x float> @insertps_zero_from_v2i64(<4 x float> %a0, <2 x i64>* %a1) nounwind {
 ; SSE-LABEL: insertps_zero_from_v2i64:
 ; SSE: # BB#0:
-; SSE-NEXT: movdqa {{.*#+}} xmm1 = [1,18446744073709551614]
-; SSE-NEXT: movdqa (%rdi), %xmm2
-; SSE-NEXT: paddq %xmm1, %xmm2
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm0[2,0]
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[2,3]
-; SSE-NEXT: movdqa %xmm2, (%rdi)
-; SSE-NEXT: movaps %xmm1, %xmm0
+; SSE-NEXT: movdqa (%rdi), %xmm1
+; SSE-NEXT: paddq {{.*}}(%rip), %xmm1
+; SSE-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; SSE-NEXT: movdqa %xmm1, (%rdi)
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: insertps_zero_from_v2i64:
 ; AVX: # BB#0:
-; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [1,18446744073709551614]
-; AVX-NEXT: vpaddq (%rdi), %xmm1, %xmm2
-; AVX-NEXT: vshufps {{.*#+}} xmm1 = xmm1[2,0],xmm0[2,0]
-; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm1[0,2],xmm0[2,3]
-; AVX-NEXT: vmovdqa %xmm2, (%rdi)
+; AVX-NEXT: vmovdqa (%rdi), %xmm1
+; AVX-NEXT: vpaddq {{.*}}(%rip), %xmm1, %xmm1
+; AVX-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; AVX-NEXT: vmovdqa %xmm1, (%rdi)
 ; AVX-NEXT: retq
   %1 = load <2 x i64>, <2 x i64>* %a1
   %2 = bitcast <2 x i64> <i64 1, i64 -2> to <4 x float>
   %3 = add <2 x i64> %1, <i64 1, i64 -2>
-  %4 = shufflevector <4 x float> %a0, <4 x float> %2, <4 x i32> <i32 6, i32 2, i32 2, i32 3>
+  %4 = shufflevector <4 x float> %a0, <4 x float> %2, <4 x i32> <i32 5, i32 2, i32 2, i32 3>
   store <2 x i64> %3, <2 x i64> *%a1
   ret <4 x float> %4
 }

 define <4 x float> @insertps_zero_from_v8i16(<4 x float> %a0, <8 x i16>* %a1) nounwind {
 ; SSE-LABEL: insertps_zero_from_v8i16:
 ; SSE: # BB#0:
-; SSE-NEXT: movdqa {{.*#+}} xmm1 = [0,0,1,1,2,2,3,3]
-; SSE-NEXT: movdqa (%rdi), %xmm2
-; SSE-NEXT: paddw %xmm1, %xmm2
-; SSE-NEXT: blendpd {{.*#+}} xmm0 = xmm1[0],xmm0[1]
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,2,3]
-; SSE-NEXT: movdqa %xmm2, (%rdi)
+; SSE-NEXT: movdqa (%rdi), %xmm1
+; SSE-NEXT: paddw {{.*}}(%rip), %xmm1
+; SSE-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; SSE-NEXT: movdqa %xmm1, (%rdi)
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: insertps_zero_from_v8i16:
 ; AVX: # BB#0:
-; AVX-NEXT: vmovdqa {{.*#+}} xmm1 = [0,0,1,1,2,2,3,3]
-; AVX-NEXT: vpaddw (%rdi), %xmm1, %xmm2
-; AVX-NEXT: vblendpd {{.*#+}} xmm0 = xmm1[0],xmm0[1]
-; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,2,2,3]
-; AVX-NEXT: vmovdqa %xmm2, (%rdi)
+; AVX-NEXT: vmovdqa (%rdi), %xmm1
+; AVX-NEXT: vpaddw {{.*}}(%rip), %xmm1, %xmm1
+; AVX-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm0[2,2,3]
+; AVX-NEXT: vmovdqa %xmm1, (%rdi)
 ; AVX-NEXT: retq
   %1 = load <8 x i16>, <8 x i16>* %a1
   %2 = bitcast <8 x i16> <i16 0, i16 0, i16 1, i16 1, i16 2, i16 2, i16 3, i16 3> to <4 x float>
   %3 = add <8 x i16> %1, <i16 0, i16 0, i16 1, i16 1, i16 2, i16 2, i16 3, i16 3>
   %4 = shufflevector <4 x float> %a0, <4 x float> %2, <4 x i32> <i32 4, i32 2, i32 2, i32 3>
   store <8 x i16> %3, <8 x i16> *%a1
   ret <4 x float> %4
 }

llvm/trunk/test/CodeGen/X86/widen_load-2.ll

Context: ; CHECK-NEXT: retq

   ret void
 }


 %i8vec3pack = type { <3 x i8>, i8 }
 define void @rot(%i8vec3pack* nocapture sret %result, %i8vec3pack* %X, %i8vec3pack* %rot) nounwind {
 ; CHECK-LABEL: rot:
 ; CHECK: # BB#0: # %entry
-; CHECK-NEXT: movdqa {{.*#+}} xmm0 = <158,158,158,u>
-; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
-; CHECK-NEXT: pshufb %xmm1, %xmm0
-; CHECK-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
-; CHECK-NEXT: movd %xmm0, %eax
+; CHECK-NEXT: movdqa {{.*#+}} xmm0 = <0,4,8,128,u,u,u,u,u,u,u,u,u,u,u,u>
+; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <158,158,158,u>
+; CHECK-NEXT: pshufb %xmm0, %xmm1
+; CHECK-NEXT: pmovzxwq {{.*#+}} xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
+; CHECK-NEXT: movd %xmm1, %eax
 ; CHECK-NEXT: movw %ax, (%rsi)
 ; CHECK-NEXT: movb $-98, 2(%rsi)
-; CHECK-NEXT: movdqa {{.*#+}} xmm0 = <1,1,1,u>
-; CHECK-NEXT: pshufb %xmm1, %xmm0
-; CHECK-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
+; CHECK-NEXT: movdqa {{.*#+}} xmm1 = <1,1,1,u>
+; CHECK-NEXT: pshufb %xmm0, %xmm1
+; CHECK-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
 ; CHECK-NEXT: movd %xmm0, %eax
 ; CHECK-NEXT: movw %ax, (%rdx)
 ; CHECK-NEXT: movb $1, 2(%rdx)
 ; CHECK-NEXT: pmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
-; CHECK-NEXT: movdqa %xmm0, %xmm2
-; CHECK-NEXT: psrld $1, %xmm2
-; CHECK-NEXT: pblendw {{.*#+}} xmm2 = xmm2[0,1,2,3,4,5],xmm0[6,7]
-; CHECK-NEXT: pextrb $8, %xmm2, 2(%rdi)
-; CHECK-NEXT: pshufb %xmm1, %xmm2
-; CHECK-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero
+; CHECK-NEXT: movdqa %xmm0, %xmm1
+; CHECK-NEXT: psrld $1, %xmm1
+; CHECK-NEXT: pblendw {{.*#+}} xmm1 = xmm1[0,1,2,3,4,5],xmm0[6,7]
+; CHECK-NEXT: pextrb $8, %xmm1, 2(%rdi)
+; CHECK-NEXT: pshufb {{.*#+}} xmm1 = xmm1[0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u]
+; CHECK-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero
 ; CHECK-NEXT: movd %xmm0, %eax
 ; CHECK-NEXT: movw %ax, (%rdi)
 ; CHECK-NEXT: movq %rdi, %rax
 ; CHECK-NEXT: retq
 entry:
   %storetmp = bitcast %i8vec3pack* %X to <3 x i8>*
   store <3 x i8> <i8 -98, i8 -98, i8 -98>, <3 x i8>* %storetmp
   %storetmp1 = bitcast %i8vec3pack* %rot to <3 x i8>*