This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] reduce extract subvector of concat
Closed · Public

Authored by spatel on Jan 7 2020, 1:58 PM.


Details

Reviewers

craig.topper
RKSimon
lebedev.ri

Commits

rGcb5612e2df89: [DAGCombiner] reduce extract subvector of concat

Summary

If we are extracting a chunk that is smaller than one of the operands of a concatenated vector, we can extract directly from one of those original operands.
This is another suggestion from PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024#c2

But I'm not sure yet if it will make any difference on those patterns. It seems to help a few existing AVX512 tests though.

Most of the code diff here is refactoring, so I can make that a preliminary commit if preferred.
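
The index arithmetic behind this fold can be sketched outside of LLVM. The following is a minimal standalone model, not the DAGCombiner code itself: `ReducedExtract` and `reduceExtractOfConcat` are hypothetical names, vectors are described only by their element counts, and every concat operand is assumed to have the same length. The sketch folds only when the operand length is a multiple of the extract length, which is the condition used in the committed diff.

```cpp
#include <cassert>

// Hypothetical model of the fold result: which concat operand to extract
// from, and at which element index inside that operand.
struct ReducedExtract {
  bool Folded;          // false if this sketch declines to fold
  unsigned ConcatOpIdx; // index of the concat operand to read from
  unsigned NewExtIdx;   // element index within that operand
};

// ExtIdx:           element index of the extract into the concat result
// ExtNumElts:       number of elements in the extracted subvector
// ConcatSrcNumElts: number of elements in each concat operand
// Assumption: ExtIdx is within the concat result (not checked here because
// the number of operands is not modeled).
inline ReducedExtract reduceExtractOfConcat(unsigned ExtIdx,
                                            unsigned ExtNumElts,
                                            unsigned ConcatSrcNumElts) {
  const ReducedExtract NoFold = {false, 0, 0};
  if (ExtNumElts == 0 || ConcatSrcNumElts == 0)
    return NoFold;
  // The extract index must be a multiple of the result length.
  if (ExtIdx % ExtNumElts != 0)
    return NoFold;
  // Only fold when each operand splits evenly into extract-sized chunks.
  if (ConcatSrcNumElts % ExtNumElts != 0)
    return NoFold;

  unsigned ConcatOpIdx = ExtIdx / ConcatSrcNumElts;
  unsigned NewExtIdx = ExtIdx - ConcatOpIdx * ConcatSrcNumElts;
  // With the guards above, the chunk cannot cross an operand boundary.
  assert(NewExtIdx + ExtNumElts <= ConcatSrcNumElts);
  const ReducedExtract Result = {true, ConcatOpIdx, NewExtIdx};
  return Result;
}
```

For the example in the committed code comment (a v2i8 extract at index 14 from a concat of two v8i8 operands), `reduceExtractOfConcat(14, 2, 8)` selects operand 1 (Y) with new index 6.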

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Jan 7 2020, 1:58 PM

Herald added a project: Restricted Project. Jan 7 2020, 1:58 PM

Herald added subscribers: hiraditya, mcrosier.

IMHO some NFC cleanup wouldn't decrease readability of the diff

LGTM - some of NFC refactoring as pre-commits would make sense to separate it from the new features.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18576	If we're peeking into any concat can we guarantee this or should we just wrap in an if() instead?

This revision is now accepted and ready to land. Jan 8 2020, 1:21 AM

In D72361#1809619, @RKSimon wrote:

LGTM - some of NFC refactoring as pre-commits would make sense to separate it from the new features.

Thanks - yes, I'll push the cosmetic cleanup and update here.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18576	This is an edit of the existing assert, and IIUC, it's independent of the concat. We are asserting that the extract_subvector conforms with its definition:
/// EXTRACT_SUBVECTOR(VECTOR, IDX) - Returns a subvector from VECTOR (an
/// vector value) starting with the element number IDX, which must be a
/// constant multiple of the result vector length.
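
The rule in that definition can be written as a small predicate. This is a standalone sketch that models only element counts; `isLegalExtractSubvectorIndex` is a hypothetical helper, not an LLVM API, and its in-bounds check is an extra assumption of the sketch rather than part of the quoted text.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): models the EXTRACT_SUBVECTOR rule
// that IDX must be a constant multiple of the result vector length.
// The in-bounds check is an additional assumption made by this sketch.
inline bool isLegalExtractSubvectorIndex(unsigned SrcNumElts,
                                         unsigned ResultNumElts,
                                         unsigned Idx) {
  if (ResultNumElts == 0 || ResultNumElts > SrcNumElts)
    return false;
  // IDX must be a multiple of the result vector length.
  if (Idx % ResultNumElts != 0)
    return false;
  // Assumed: the extracted chunk must lie inside the source vector.
  return Idx + ResultNumElts <= SrcNumElts;
}
```

Under this model, a 2-element extract from a 16-element source is accepted at index 6 and rejected at index 7, because 7 is not a multiple of 2.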

spatel mentioned this in rG780ba1f22b53: [DAGCombiner] clean up extract-of-concat fold; NFC. Jan 8 2020, 7:22 AM

Patch updated:
Rebased after pre-committing the cleanup with rG780ba1f22b53.

lebedev.ri added inline comments. Jan 8 2020, 7:42 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18591	I'm having a hard time parsing this. Did we ensure that the original extract only extracts from a single concatenated vector? I.e., that it does not take a few elements from the end of the first concatenated vector and a few elements from the beginning of the second concatenated vector? (That case could be implemented as a shuffle if we are concatenating the same vector.)

spatel marked an inline comment as done. Jan 8 2020, 8:03 AM

spatel added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18591	I think this goes back to the assert mentioned above for extract_subvector. It's not legal to have something like this:
v2i8 extract_subvec (v16i8 X), 7
because the index operand must be a multiple of the final vector length (7 % 2 != 0). So that means it is impossible to extract from more than 1 operand of the concat (because the concat operands are guaranteed to be longer than this extract result). I can add the above as a code comment if it helps. We could also assert that ExtNumElts % NewExtIdx == 0?

spatel marked an inline comment as done. Jan 8 2020, 8:30 AM

spatel added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18591	Aha - I wasn't being imaginative enough. My assumption was that the concat vector ops are some multiple of the final vector size rather than just larger. So yes, it is possible to straddle concat ops if we have something like this:
v4i8 extract_subvec (v12i8 concat (v6i8 X), (v6i8 Y)), 4
I'll fix the predicate and add an assert.
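
The straddling case can be checked with a few lines of arithmetic. The sketch below is a standalone model over element counts only; `straddlesConcatOperands` is a hypothetical name and nothing here is taken from the patch itself.

```cpp
#include <cassert>

// Hypothetical helper: returns true if elements
// [ExtIdx, ExtIdx + ExtNumElts) of a concat result come from more than one
// concat operand, where every operand has ConcatSrcNumElts elements.
inline bool straddlesConcatOperands(unsigned ExtIdx, unsigned ExtNumElts,
                                    unsigned ConcatSrcNumElts) {
  assert(ExtNumElts > 0 && ConcatSrcNumElts > 0);
  unsigned FirstOp = ExtIdx / ConcatSrcNumElts;                   // operand of first lane
  unsigned LastOp = (ExtIdx + ExtNumElts - 1) / ConcatSrcNumElts; // operand of last lane
  return FirstOp != LastOp;
}
```

With 6-element operands and a 4-element extract, index 4 covers lanes 4 through 7 and therefore straddles X and Y, while indices 0 and 8 each stay inside one operand. With 8-element operands and a 2-element extract at index 14, both lanes fall in the second operand.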

lebedev.ri requested changes to this revision. Jan 8 2020, 8:52 AM

This revision now requires changes to proceed. Jan 8 2020, 8:52 AM

Patch updated:
Made predicate more restrictive and added asserts.

In D72361#1810243, @spatel wrote:

Patch updated:
Made predicate more restrictive and added asserts.

Can a test case be constructed for that?

(That case could be implemented as a shuffle, if we don't do that already)

spatel mentioned this in rG31992a69b808: [x86] add test for concat-extract corner case; NFC. Jan 8 2020, 11:45 AM

In D72361#1810410, @lebedev.ri wrote:

In D72361#1810243, @spatel wrote:

Patch updated:
Made predicate more restrictive and added asserts.

Can a test case be constructed for that?

(That case could be implemented as a shuffle, if we don't do that already)

This takes some creativity because we won't generate the needed DAG nodes if the types are too weird and/or mapped to registers in a strange way, but I've found a case that is legal enough to show what would have been a miscompile:
define <4 x i32> @cat_ext_straddle(<6 x i32>* %px, <6 x i32>* %py) {
  %x = load <6 x i32>, <6 x i32>* %px
  %y = load <6 x i32>, <6 x i32>* %py
  %cat = shufflevector <6 x i32> %x, <6 x i32> %y, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
  %ext = shufflevector <12 x i32> %cat, <12 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  ret <4 x i32> %ext
}

Should be:

movaps	16(%rdi), %xmm0
unpcklpd	(%rsi), %xmm0   ## xmm0 = xmm0[0],mem[0]

But would be miscompiled by the earlier rev of this patch:

movaps	16(%rdi), %xmm0
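
To see which lanes `cat_ext_straddle` is expected to return, the concat and extract can be modeled on plain arrays. This is only an illustrative sketch: `concatLanes` and `extractLanes` are hypothetical helpers that mimic the two shuffles at the element level, and the lane values are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models the first shufflevector: all lanes of X followed by all lanes of Y.
inline std::vector<int> concatLanes(const std::vector<int> &X,
                                    const std::vector<int> &Y) {
  std::vector<int> Cat(X);
  Cat.insert(Cat.end(), Y.begin(), Y.end());
  return Cat;
}

// Models the second shufflevector: NumElts consecutive lanes starting at Idx.
inline std::vector<int> extractLanes(const std::vector<int> &V,
                                     std::size_t Idx, std::size_t NumElts) {
  assert(Idx + NumElts <= V.size());
  return std::vector<int>(V.begin() + Idx, V.begin() + Idx + NumElts);
}
```

With X = {0, 1, 2, 3, 4, 5} and Y = {10, 11, 12, 13, 14, 15}, `extractLanes(concatLanes(X, Y), 4, 4)` gives {4, 5, 10, 11}: the last two lanes of X followed by the first two lanes of Y. That matches the expected code, which combines data loaded from both pointers, whereas reading four lanes from X alone at index 4 would run past its six elements.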

rG31992a69b808

In D72361#1810621, @spatel wrote:

In D72361#1810410, @lebedev.ri wrote:

...

Thank you!

This revision is now accepted and ready to land. Jan 8 2020, 11:58 AM

Closed by commit rGcb5612e2df89: [DAGCombiner] reduce extract subvector of concat (authored by spatel). Jan 9 2020, 6:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	18 lines
llvm/test/CodeGen/X86/avg.ll	12 lines
llvm/test/CodeGen/X86/pr34657.ll	13 lines
llvm/test/CodeGen/X86/x86-interleaved-access.ll	76 lines

Diff 237067

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


       if ((DestNumElts % SrcNumElts) == 0) {
           SDLoc DL(N);
           SDValue NewIndex = DAG.getIntPtrConstant(IndexValScaled, DL);
           SDValue NewExtract = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NewExtVT,
                                            V.getOperand(0), NewIndex);
           return DAG.getBitcast(NVT, NewExtract);
         }
       }
     }
   }

   if (V.getOpcode() == ISD::CONCAT_VECTORS && isa<ConstantSDNode>(Index)) {
     EVT ConcatSrcVT = V.getOperand(0).getValueType();
     assert(ConcatSrcVT.getVectorElementType() == NVT.getVectorElementType() &&
            "Concat and extract subvector do not change element type");

     unsigned ExtIdx = N->getConstantOperandVal(1);
     unsigned ExtNumElts = NVT.getVectorNumElements();
     assert(ExtIdx % ExtNumElts == 0 &&
            "Extract index is not a multiple of the input vector length.");

     unsigned ConcatSrcNumElts = ConcatSrcVT.getVectorNumElements();
     unsigned ConcatOpIdx = ExtIdx / ConcatSrcNumElts;

     // If the concatenated source types match this extract, it's a direct
     // simplification:
     // extract_subvec (concat V1, V2, ...), i --> Vi
     if (ConcatSrcNumElts == ExtNumElts)
       return V.getOperand(ConcatOpIdx);

-    // TODO: Handle the case where the concat operands are larger than the
-    // result of this extract by extracting directly from a concat op.
+    // If the concatenated source vectors are a multiple length of this extract,
+    // then extract a fraction of one of those source vectors directly from a
+    // concat operand. Example:
+    // v2i8 extract_subvec (v16i8 concat (v8i8 X), (v8i8 Y), 14 -->
+    // v2i8 extract_subvec v8i8 Y, 6
+    if (ConcatSrcNumElts % ExtNumElts == 0) {
+      SDLoc DL(N);
+      unsigned NewExtIdx = ExtIdx - ConcatOpIdx * ConcatSrcNumElts;
+      assert(NewExtIdx + ExtNumElts <= ConcatSrcNumElts &&
+             "Trying to extract from >1 concat operand?");
+      assert(NewExtIdx % ExtNumElts == 0 &&
+             "Extract index is not a multiple of the input vector length.");
+      SDValue NewIndexC = DAG.getIntPtrConstant(NewExtIdx, DL);
+      return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NVT,
+                         V.getOperand(ConcatOpIdx), NewIndexC);
+    }
   }

   V = peekThroughBitcasts(V);

   // If the input is a build vector. Try to make a smaller build vector.
   if (V.getOpcode() == ISD::BUILD_VECTOR) {
     if (auto *IdxC = dyn_cast<ConstantSDNode>(Index)) {
       EVT InVT = V.getValueType();
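
As a sanity check on the two asserts added in this hunk, the guard `ConcatSrcNumElts % ExtNumElts == 0` can be tested exhaustively over small sizes. The sketch below is a standalone brute-force model, not part of the patch: `guardImpliesAsserts` is a hypothetical name, only element counts are modeled, and a two-operand concat is assumed.

```cpp
#include <cassert>

// Hypothetical brute-force check: for every pair of sizes up to MaxElts that
// passes the guard from the diff, and for every extract index that is a
// multiple of the extract length, verify the two asserted conditions.
// Assumption: the concat has exactly two operands of equal length.
inline bool guardImpliesAsserts(unsigned MaxElts) {
  for (unsigned ExtNumElts = 1; ExtNumElts <= MaxElts; ++ExtNumElts) {
    for (unsigned ConcatSrcNumElts = ExtNumElts; ConcatSrcNumElts <= MaxElts;
         ++ConcatSrcNumElts) {
      // Guard from the diff: operand length is a multiple of extract length.
      if (ConcatSrcNumElts % ExtNumElts != 0)
        continue;
      unsigned TotalElts = 2 * ConcatSrcNumElts;
      for (unsigned ExtIdx = 0; ExtIdx + ExtNumElts <= TotalElts;
           ExtIdx += ExtNumElts) {
        unsigned ConcatOpIdx = ExtIdx / ConcatSrcNumElts;
        unsigned NewExtIdx = ExtIdx - ConcatOpIdx * ConcatSrcNumElts;
        // First assert: the chunk stays inside a single concat operand.
        if (NewExtIdx + ExtNumElts > ConcatSrcNumElts)
          return false;
        // Second assert: the new index is still a multiple of the result length.
        if (NewExtIdx % ExtNumElts != 0)
          return false;
      }
    }
  }
  return true;
}
```

The reason this holds in general: if the extract index is k * ExtNumElts and each operand holds m * ExtNumElts elements, the new index is (k mod m) * ExtNumElts, which is at most (m - 1) * ExtNumElts, so the extracted chunk ends at or before the operand boundary.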

llvm/test/CodeGen/X86/avg.ll

	Show First 20 Lines • Show All 456 Lines • ▼ Show 20 Lines
	; AVX512F-NEXT: vmovdqu %xmm2, (%rax)			; AVX512F-NEXT: vmovdqu %xmm2, (%rax)
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: avg_v48i8:			; AVX512BW-LABEL: avg_v48i8:
	; AVX512BW: # %bb.0:			; AVX512BW: # %bb.0:
	; AVX512BW-NEXT: vmovdqa (%rdi), %xmm0			; AVX512BW-NEXT: vmovdqa (%rdi), %xmm0
	; AVX512BW-NEXT: vmovdqa 16(%rdi), %xmm1			; AVX512BW-NEXT: vmovdqa 16(%rdi), %xmm1
	; AVX512BW-NEXT: vmovdqa 32(%rdi), %xmm2			; AVX512BW-NEXT: vmovdqa 32(%rdi), %xmm2
	; AVX512BW-NEXT: vpavgb 16(%rsi), %xmm1, %xmm1			; AVX512BW-NEXT: vpavgb 32(%rsi), %xmm2, %xmm2
	; AVX512BW-NEXT: vpavgb (%rsi), %xmm0, %xmm0			; AVX512BW-NEXT: vpavgb (%rsi), %xmm0, %xmm0
	; AVX512BW-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0			; AVX512BW-NEXT: vpavgb 16(%rsi), %xmm1, %xmm1
	; AVX512BW-NEXT: vpavgb 32(%rsi), %xmm2, %xmm1			; AVX512BW-NEXT: vmovdqu %xmm1, (%rax)
	; AVX512BW-NEXT: vinserti64x4 $1, %ymm1, %zmm0, %zmm1			; AVX512BW-NEXT: vmovdqu %xmm0, (%rax)
	; AVX512BW-NEXT: vmovdqu %ymm0, (%rax)			; AVX512BW-NEXT: vmovdqu %xmm2, (%rax)
	; AVX512BW-NEXT: vextracti32x4 $2, %zmm1, (%rax)
	; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	%1 = load <48 x i8>, <48 x i8>* %a			%1 = load <48 x i8>, <48 x i8>* %a
	%2 = load <48 x i8>, <48 x i8>* %b			%2 = load <48 x i8>, <48 x i8>* %b
	%3 = zext <48 x i8> %1 to <48 x i32>			%3 = zext <48 x i8> %1 to <48 x i32>
	%4 = zext <48 x i8> %2 to <48 x i32>			%4 = zext <48 x i8> %2 to <48 x i32>
	%5 = add nuw nsw <48 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			%5 = add nuw nsw <48 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
	%6 = add nuw nsw <48 x i32> %5, %4			%6 = add nuw nsw <48 x i32> %5, %4
	%7 = lshr <48 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			%7 = lshr <48 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
	▲ Show 20 Lines • Show All 2,513 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/pr34657.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512bw | FileCheck %s

 define <112 x i8> @pr34657(<112 x i8>* %src) local_unnamed_addr {
 ; CHECK-LABEL: pr34657:
 ; CHECK: # %bb.0: # %entry
 ; CHECK-NEXT: movq %rdi, %rax
-; CHECK-NEXT: vmovups 64(%rsi), %ymm0
-; CHECK-NEXT: vbroadcastf128 {{.*#+}} ymm1 = mem[0,1,0,1]
-; CHECK-NEXT: vinsertf64x4 $1, %ymm1, %zmm0, %zmm1
-; CHECK-NEXT: vmovups (%rsi), %zmm2
-; CHECK-NEXT: vmovaps %ymm0, 64(%rdi)
-; CHECK-NEXT: vmovaps %zmm2, (%rdi)
-; CHECK-NEXT: vextractf32x4 $2, %zmm1, 96(%rdi)
+; CHECK-NEXT: vmovups (%rsi), %zmm0
+; CHECK-NEXT: vmovups 64(%rsi), %ymm1
+; CHECK-NEXT: vmovups 96(%rsi), %xmm2
+; CHECK-NEXT: vmovaps %xmm2, 96(%rdi)
+; CHECK-NEXT: vmovaps %ymm1, 64(%rdi)
+; CHECK-NEXT: vmovaps %zmm0, (%rdi)
 ; CHECK-NEXT: vzeroupper
 ; CHECK-NEXT: retq
 entry:
   %wide.vec51 = load <112 x i8>, <112 x i8>* %src, align 2
   ret <112 x i8> %wide.vec51
 }

llvm/test/CodeGen/X86/x86-interleaved-access.ll

   %1 = shufflevector <8 x i8> %a, <8 x i8> %b, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
   %2 = shufflevector <8 x i8> %c, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
   %interleaved.vec = shufflevector <16 x i8> %1, <16 x i8> %2, <24 x i32> <i32 0, i32 8, i32 16, i32 1, i32 9, i32 17, i32 2, i32 10, i32 18, i32 3, i32 11, i32 19, i32 4, i32 12, i32 20, i32 5, i32 13, i32 21, i32 6, i32 14, i32 22, i32 7, i32 15, i32 23>
   store <24 x i8> %interleaved.vec, <24 x i8>* %p, align 1
   ret void
 }

 define void @interleaved_store_vf16_i8_stride3(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <48 x i8>* %p) {
-; AVX1-LABEL: interleaved_store_vf16_i8_stride3:
-; AVX1: # %bb.0:
-; AVX1-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5]
-; AVX1-NEXT: vpalignr {{.*#+}} xmm3 = xmm1[11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10]
-; AVX1-NEXT: vpalignr {{.*#+}} xmm4 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX1-NEXT: vpalignr {{.*#+}} xmm0 = xmm3[5,6,7,8,9,10,11,12,13,14,15],xmm0[0,1,2,3,4]
-; AVX1-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm3[0,1,2,3,4]
-; AVX1-NEXT: vpalignr {{.*#+}} xmm1 = xmm4[5,6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4]
-; AVX1-NEXT: vmovdqa {{.*#+}} xmm3 = [0,11,6,1,12,7,2,13,8,3,14,9,4,15,10,5]
-; AVX1-NEXT: vpshufb %xmm3, %xmm1, %xmm1
-; AVX1-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX1-NEXT: vpshufb %xmm3, %xmm0, %xmm0
-; AVX1-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm4[0,1,2,3,4]
-; AVX1-NEXT: vpshufb %xmm3, %xmm2, %xmm2
-; AVX1-NEXT: vmovdqu %xmm0, 16(%rdi)
-; AVX1-NEXT: vmovdqu %xmm1, (%rdi)
-; AVX1-NEXT: vmovdqu %xmm2, 32(%rdi)
-; AVX1-NEXT: retq
-;
-; AVX2-LABEL: interleaved_store_vf16_i8_stride3:
-; AVX2: # %bb.0:
-; AVX2-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5]
-; AVX2-NEXT: vpalignr {{.*#+}} xmm3 = xmm1[11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10]
-; AVX2-NEXT: vpalignr {{.*#+}} xmm4 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX2-NEXT: vpalignr {{.*#+}} xmm0 = xmm3[5,6,7,8,9,10,11,12,13,14,15],xmm0[0,1,2,3,4]
-; AVX2-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm3[0,1,2,3,4]
-; AVX2-NEXT: vpalignr {{.*#+}} xmm1 = xmm4[5,6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4]
-; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = [0,11,6,1,12,7,2,13,8,3,14,9,4,15,10,5]
-; AVX2-NEXT: vpshufb %xmm3, %xmm1, %xmm1
-; AVX2-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX2-NEXT: vpshufb %xmm3, %xmm0, %xmm0
-; AVX2-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm4[0,1,2,3,4]
-; AVX2-NEXT: vpshufb %xmm3, %xmm2, %xmm2
-; AVX2-NEXT: vmovdqu %xmm0, 16(%rdi)
-; AVX2-NEXT: vmovdqu %xmm1, (%rdi)
-; AVX2-NEXT: vmovdqu %xmm2, 32(%rdi)
-; AVX2-NEXT: retq
-;
-; AVX512-LABEL: interleaved_store_vf16_i8_stride3:
-; AVX512: # %bb.0:
-; AVX512-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5]
-; AVX512-NEXT: vpalignr {{.*#+}} xmm3 = xmm1[11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10]
-; AVX512-NEXT: vpalignr {{.*#+}} xmm4 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX512-NEXT: vpalignr {{.*#+}} xmm0 = xmm3[5,6,7,8,9,10,11,12,13,14,15],xmm0[0,1,2,3,4]
-; AVX512-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm3[0,1,2,3,4]
-; AVX512-NEXT: vpalignr {{.*#+}} xmm1 = xmm4[5,6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4]
-; AVX512-NEXT: vmovdqa {{.*#+}} xmm3 = [0,11,6,1,12,7,2,13,8,3,14,9,4,15,10,5]
-; AVX512-NEXT: vpshufb %xmm3, %xmm1, %xmm1
-; AVX512-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
-; AVX512-NEXT: vpshufb %xmm3, %xmm0, %xmm0
-; AVX512-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm4[0,1,2,3,4]
-; AVX512-NEXT: vpshufb %xmm3, %xmm2, %xmm2
-; AVX512-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
-; AVX512-NEXT: vinserti64x4 $1, %ymm2, %zmm0, %zmm1
-; AVX512-NEXT: vmovdqu %ymm0, (%rdi)
-; AVX512-NEXT: vextracti32x4 $2, %zmm1, 32(%rdi)
-; AVX512-NEXT: vzeroupper
-; AVX512-NEXT: retq
+; AVX-LABEL: interleaved_store_vf16_i8_stride3:
+; AVX: # %bb.0:
+; AVX-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15,0,1,2,3,4,5]
+; AVX-NEXT: vpalignr {{.*#+}} xmm3 = xmm1[11,12,13,14,15,0,1,2,3,4,5,6,7,8,9,10]
+; AVX-NEXT: vpalignr {{.*#+}} xmm4 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
+; AVX-NEXT: vpalignr {{.*#+}} xmm0 = xmm3[5,6,7,8,9,10,11,12,13,14,15],xmm0[0,1,2,3,4]
+; AVX-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm3[0,1,2,3,4]
+; AVX-NEXT: vpalignr {{.*#+}} xmm1 = xmm4[5,6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4]
+; AVX-NEXT: vmovdqa {{.*#+}} xmm3 = [0,11,6,1,12,7,2,13,8,3,14,9,4,15,10,5]
+; AVX-NEXT: vpshufb %xmm3, %xmm1, %xmm1
+; AVX-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[5,6,7,8,9,10,11,12,13,14,15],xmm2[0,1,2,3,4]
+; AVX-NEXT: vpshufb %xmm3, %xmm0, %xmm0
+; AVX-NEXT: vpalignr {{.*#+}} xmm2 = xmm2[5,6,7,8,9,10,11,12,13,14,15],xmm4[0,1,2,3,4]
+; AVX-NEXT: vpshufb %xmm3, %xmm2, %xmm2
+; AVX-NEXT: vmovdqu %xmm0, 16(%rdi)
+; AVX-NEXT: vmovdqu %xmm1, (%rdi)
+; AVX-NEXT: vmovdqu %xmm2, 32(%rdi)
+; AVX-NEXT: retq
   %1 = shufflevector <16 x i8> %a, <16 x i8> %b, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
   %2 = shufflevector <16 x i8> %c, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
   %interleaved.vec = shufflevector <32 x i8> %1, <32 x i8> %2, <48 x i32> <i32 0, i32 16, i32 32, i32 1, i32 17, i32 33, i32 2, i32 18, i32 34, i32 3, i32 19, i32 35, i32 4, i32 20, i32 36, i32 5, i32 21, i32 37, i32 6, i32 22, i32 38, i32 7, i32 23, i32 39, i32 8, i32 24, i32 40, i32 9, i32 25, i32 41, i32 10, i32 26, i32 42, i32 11, i32 27, i32 43, i32 12, i32 28, i32 44, i32 13, i32 29, i32 45, i32 14, i32 30, i32 46, i32 15, i32 31, i32 47>
   store <48 x i8> %interleaved.vec, <48 x i8>* %p, align 1
   ret void
 }

 define void @interleaved_store_vf32_i8_stride3(<32 x i8> %a, <32 x i8> %b, <32 x i8> %c, <96 x i8>* %p) {