This is an archive of the discontinued LLVM Phabricator instance.

[X86] Replace avx2.pbroadcast intrinsics with native IR.
ClosedPublic

Authored by ab on Jun 18 2015, 4:45 PM.

Details

Reviewers

spatel
RKSimon
chandlerc
silvas
andreadb
craig.topper

Commits

rG1a498705e4b4: [X86] Replace avx2 broadcast intrinsics with native IR.
rL245606: [X86] Replace avx2 broadcast intrinsics with native IR.

Summary

While working on PR23464, the broadcast intrinsics annoyed me. Let's remove them: a broadcast is one of the simplest kinds of shuffle, and native IR is good enough to express it.

Diff Detail

Repository: rL LLVM

Event Timeline

ab updated this revision to Diff 27972. Jun 18 2015, 4:45 PM

ab retitled this revision to [X86] Replace avx2.pbroadcast intrinsics with native IR.

ab updated this object.

ab edited the test plan for this revision.

ab added a reviewer: spatel.

ab added a subscriber: Unknown Object (MLST).

ab mentioned this in D10556: [Headers][X86] Replace avx2.pbroadcast intrinsics with native IR. Jun 18 2015, 4:49 PM

Hi Ahmed -

I had a similar patch in D8276, but then it was pointed out that this may not be the best solution. The concern is that doing it this way may alter/optimize code at -O0 (debugging vector code is hard enough without the damn compiler getting in the way!).

So for my next patch on this front, D8486, I used instcombine so there would be no changes in the front-end or -O0.

Adding some other folks who might have an opinion about the best way forward.

In D10555#191088, @spatel wrote:

Hi Ahmed -

I had a similar patch in D8276, but then it was pointed out that this may not be the best solution. The concern is that doing it this way may alter/optimize code at -O0 (debugging vector code is hard enough without the damn compiler getting in the way!).

So for my next patch on this front, D8486, I used instcombine so there would be no changes in the front-end or -O0.

Ah, I saw the vinsert/vextract commits and figured "this is fine"; I missed the other half!

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I guess there's a reason not to, but would it make sense to avoid inlining them at -O0? That oughta give you the best debuggability we can give, right?

-Ahmed

In D10555#191124, @ab wrote:

To make sure I understand: this is only a problem because of DAGCombines running at -O0, right? (and perhaps some of the lowering being too smart? though without combines I'd find that surprising)
And this in turn is only a problem because the C intrinsics (_mm_*) are always inlined, and thus can be combined, right?

I think the problem is independent of inlining and DAGCombines. As an example, consider this:

__m128 foo(__m256 a) {
  return _mm256_extractf128_ps(a, 0);
}

After D8275, this becomes a shufflevector in clang, and there's not much hope of turning it back into a vextractf128. It becomes an ISD::EXTRACT_SUBVECTOR in the DAG without any combiner opts AFAICT. Then, it turns into an EXTRACT_SUBREG machine instruction. Then, it's either just a move or nothing at all in x86.
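To make the shufflevector form concrete, here is a minimal scalar sketch of what that IR computes. It is only an illustration under stated assumptions: `shuffle_f32` and `extract_low_128` are hypothetical names, not clang or LLVM APIs, and the second shuffle operand (undef in real IR) is simply passed as `a` again.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of LLVM's shufflevector on float lanes (hypothetical helper):
   result lane i is selected by mask[i] from the concatenation of a and b,
   each of which has n lanes. */
static void shuffle_f32(const float *a, const float *b, size_t n,
                        const int *mask, size_t mask_len, float *out) {
  for (size_t i = 0; i < mask_len; ++i) {
    assert(mask[i] >= 0 && (size_t)mask[i] < 2 * n);
    size_t m = (size_t)mask[i];
    out[i] = m < n ? a[m] : b[m - n];
  }
}

/* _mm256_extractf128_ps(a, 0) modelled as a shuffle with mask <0, 1, 2, 3>:
   the low four lanes of the eight-lane input. */
static void extract_low_128(const float a[8], float out[4]) {
  static const int mask[4] = {0, 1, 2, 3};
  shuffle_f32(a, a, 8, mask, 4, out);
}
```

Nothing in this form names a specific instruction, which is why the backend is free to lower it to a subregister copy or to nothing at all.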

In D10555#191178, @spatel wrote:
I think the problem is independent of inlining and DAGCombines. As an example, consider this:
__m128 foo(__m256 a) {
  return _mm256_extractf128_ps(a, 0);
}

If _mm256_extractf128_ps is a proper function instead of a macro (using the enable_if trick if necessary), would Ahmed's suggestion work for keeping these debuggable?

In D10555#191295, @silvas wrote:
If _mm256_extractf128_ps is a proper function instead of a macro (using the enable_if trick if necessary), would Ahmed's suggestion work for keeping these debuggable?

I tried an experiment with:

__m128i foo(__m128i x) {
  return _mm_add_epi32(x, _mm_set1_epi32(0));  // so easy to optimize, but...must...resist!
}

...because that's defined as a proper function:

static __inline__ __m128i DEFAULT_FN_ATTRS
_mm_add_epi32(__m128i __a, __m128i __b)
{
  return (__m128i)((__v4si)__a + (__v4si)__b);
}

The add is present in the unoptimized IR, but it's gone in the asm. Removing the 'inline' didn't appear to change anything in this example.
Removing 'inline' could cause a different problem - vector coders really don't want those header files showing up in profiles or stepping in/out while debugging. IIRC, that happened for some reason with gcc about 10 years ago and had to be worked around.

So I revisited this as I've been messing with instcombiner reduction of intrinsics a lot recently.

Looking at the O0/O1/O2 codegen, the pbroadcast (and the broadcastss/broadcastsd register variants) are well behaved and keep to the expected instructions - we're not doing anything different here to how many of the other shuffle intrinsics are already implemented in the headers. The only one that has problems is _mm256_broadcastsi128_si256 (vbroadcasti128) which isn't being proposed here.

Along with an update of avx2intrin.h to call __builtin_shufflevector directly (and suitable tests to ensure that debug code doesn't change in the future) I'd say that this should be a win, but if people are still hesitant we should at least push forward with support in instcombiner now instead of putting it off.
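As a rough sketch of what calling `__builtin_shufflevector` directly looks like, the snippet below broadcasts lane 0 of a four-lane integer vector. It is not the contents of avx2intrin.h or of any follow-up patch: `v4si` and `broadcast_lane0` are hypothetical names, the vector type uses the GCC/Clang `vector_size` extension, and a lane-by-lane fallback is included on the assumption that some compilers lack the builtin.

```c
/* Four 32-bit integer lanes, via the GCC/Clang vector extension. */
typedef int v4si __attribute__((vector_size(16)));

/* Compilers without __has_builtin take the fallback path below. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Broadcast lane 0 of a to all four lanes (hypothetical helper). */
static v4si broadcast_lane0(v4si a) {
#if __has_builtin(__builtin_shufflevector)
  /* Every result lane selects index 0 of the first operand. */
  return __builtin_shufflevector(a, a, 0, 0, 0, 0);
#else
  /* Fallback assumption: build the result lane by lane. */
  v4si r = {a[0], a[0], a[0], a[0]};
  return r;
#endif
}
```

Whether this still produces the expected broadcast instruction at -O0 is exactly what the proposed tests would need to pin down.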

In D10555#225105, @RKSimon wrote:

Along with an update of avx2intrin.h to call __builtin_shufflevector directly (and suitable tests to ensure that debug code doesn't change in the future) I'd say that this should be a win, but if people are still hesitant we should at least push forward with support in instcombiner now instead of putting it off.

My vote is for an instcombine solution just for the sake of implementation consistency...of course, I don't think there's been any consensus that the instcombine solution is the right way forward.

Possibly related: I just hit the same problem as https://llvm.org/bugs/show_bug.cgi?id=24125 , so anything we can do to make that fix easier should be considered. The current crashing is really unpleasant.

FWIW, I really like this patch. Is there anything we can do to make this work?

In D10555#225988, @chandlerc wrote:

FWIW, I really like this patch. Is there anything we can do to make this work?

It appears we have a few things that need to be decided before going any further:

1 - When is it permissible to replace a (sub)target-specific intrinsic with a non-specific implementation in the headers (e.g. using __builtin_shufflevector for these broadcasts)?

As long as the expected instruction remains in debug code I'm keen for this to be encouraged - we can add suitable tests, move those builtin intrinsics to AutoUpgrade.cpp until 4.0 and get much cleaner headers.

2 - When is it permissible to replace a (sub)target-specific intrinsic in IR/DAG creation, and should that occur in InstCombine or in the target ISel code someplace?

I'd vote for InstCombine as we already appear to have a critical mass of intrinsics here.

3 - What are we going to do to fix the issue introduced by the header refactor removing the target guards, causing a tricky-to-decipher 'Cannot select: intrinsic %llvm.x86.vcvtps2ph.128' style backend error for intrinsics that are implemented as macros?

A quick+nasty solution would be to add header guards at least around each of those macros.

In D10555#226023, @RKSimon wrote:

It appears we have a few things that need to be decided before going any further:

1 - When is it permissible to replace a (sub)target-specific intrinsic with a non-specific implementation in the headers (e.g. using __builtin_shufflevector for these broadcasts)?

As long as the expected instruction remains in debug code I'm keen for this to be encouraged - we can add suitable tests, move those builtin intrinsics to AutoUpgrade.cpp until 4.0 and get much cleaner headers.

2 - When is it permissible to replace a (sub)target-specific intrinsic in IR/DAG creation, and should that occur in InstCombine or in the target ISel code someplace?

I'd vote for InstCombine as we already appear to have a critical mass of intrinsics here.

3 - What are we going to do to fix the issue introduced by the header refactor removing the target guards, causing a tricky-to-decipher 'Cannot select: intrinsic %llvm.x86.vcvtps2ph.128' style backend error for intrinsics that are implemented as macros?

A quick+nasty solution would be to add header guards at least around each of those macros.

At some point while working on one of these, Andrea told me about one more place where we do some builtin/intrinsic handling. I think that was CGBuiltin.cpp in clang's CodeGen. It seems messy that we have at least 3 ways of dealing with these things, but there are probably good reasons for each.

I don't want to hold up progress, so I don't object to this patch going in as-is (especially since Simon confirmed that -O0 code looks fine for these cases). But it would be great to answer the design questions that Simon has raised here for our collective future reference. There will surely be more intrinsics where these came from. :)

This revision is now accepted and ready to land. Aug 18 2015, 8:38 AM

Hi all,

RKSimon mentioned this in D12143: [X86][AVX2] Replace avx2.pbroadcast / avx2.vbroadcast intrinsics usage in avx2intrin.h with __builtin_shufflevector. Aug 19 2015, 4:13 AM

ab mentioned this in D12212: [Headers][X86] Add -O0 assembly tests for intrinsics. Aug 20 2015, 1:35 PM

Closed by commit rL245606: [X86] Replace avx2 broadcast intrinsics with native IR. (authored by ab). Aug 20 2015, 1:37 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
llvm/trunk/include/llvm/IR/IntrinsicsX86.td                   33 lines
llvm/trunk/lib/IR/AutoUpgrade.cpp                             10 lines
llvm/trunk/lib/Target/X86/X86InstrSSE.td                      116 lines
llvm/trunk/test/CodeGen/X86/avx2-intrinsics-x86-upgrade.ll    120 lines
llvm/trunk/test/CodeGen/X86/avx2-intrinsics-x86.ll            88 lines
llvm/trunk/test/CodeGen/X86/stack-folding-int-avx2.ll         6 lines

Diff 32735

llvm/trunk/include/llvm/IR/IntrinsicsX86.td

 let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".
   def int_x86_avx2_pblendvb : GCCBuiltin<"__builtin_ia32_pblendvb256">,
       Intrinsic<[llvm_v32i8_ty], [llvm_v32i8_ty, llvm_v32i8_ty,
       llvm_v32i8_ty], [IntrNoMem]>;
 }

 // Vector load with broadcast
 let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".
-  def int_x86_avx2_vbroadcast_ss_ps :
-      GCCBuiltin<"__builtin_ia32_vbroadcastss_ps">,
-      Intrinsic<[llvm_v4f32_ty], [llvm_v4f32_ty], [IntrNoMem]>;
-  def int_x86_avx2_vbroadcast_sd_pd_256 :
-      GCCBuiltin<"__builtin_ia32_vbroadcastsd_pd256">,
-      Intrinsic<[llvm_v4f64_ty], [llvm_v2f64_ty], [IntrNoMem]>;
-  def int_x86_avx2_vbroadcast_ss_ps_256 :
-      GCCBuiltin<"__builtin_ia32_vbroadcastss_ps256">,
-      Intrinsic<[llvm_v8f32_ty], [llvm_v4f32_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastb_128 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastb128">,
-      Intrinsic<[llvm_v16i8_ty], [llvm_v16i8_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastb_256 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastb256">,
-      Intrinsic<[llvm_v32i8_ty], [llvm_v16i8_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastw_128 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastw128">,
-      Intrinsic<[llvm_v8i16_ty], [llvm_v8i16_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastw_256 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastw256">,
-      Intrinsic<[llvm_v16i16_ty], [llvm_v8i16_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastd_128 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastd128">,
-      Intrinsic<[llvm_v4i32_ty], [llvm_v4i32_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastd_256 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastd256">,
-      Intrinsic<[llvm_v8i32_ty], [llvm_v4i32_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastq_128 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastq128">,
-      Intrinsic<[llvm_v2i64_ty], [llvm_v2i64_ty], [IntrNoMem]>;
-  def int_x86_avx2_pbroadcastq_256 :
-      GCCBuiltin<"__builtin_ia32_pbroadcastq256">,
-      Intrinsic<[llvm_v4i64_ty], [llvm_v2i64_ty], [IntrNoMem]>;
   def int_x86_avx512_mask_pbroadcast_d_gpr_512 :
       GCCBuiltin<"__builtin_ia32_pbroadcastd512_gpr_mask">,
       Intrinsic<[llvm_v16i32_ty], [llvm_i32_ty, llvm_v16i32_ty,
       llvm_i16_ty], [IntrNoMem]>;
   def int_x86_avx512_mask_pbroadcast_q_gpr_512 :
       GCCBuiltin<"__builtin_ia32_pbroadcastq512_gpr_mask">,
       Intrinsic<[llvm_v8i64_ty], [llvm_i64_ty, llvm_v8i64_ty,
       llvm_i8_ty], [IntrNoMem]>;

llvm/trunk/lib/IR/AutoUpgrade.cpp

@@ case 'o':
     }
     break;

   case 'x': {
     if (Name.startswith("x86.sse2.pcmpeq.") ||
         Name.startswith("x86.sse2.pcmpgt.") ||
         Name.startswith("x86.avx2.pcmpeq.") ||
         Name.startswith("x86.avx2.pcmpgt.") ||
+        Name.startswith("x86.avx2.vbroadcast") ||
+        Name.startswith("x86.avx2.pbroadcast") ||
         Name.startswith("x86.avx.vpermil.") ||
         Name == "x86.avx.vinsertf128.pd.256" ||
         Name == "x86.avx.vinsertf128.ps.256" ||
         Name == "x86.avx.vinsertf128.si.256" ||
         Name == "x86.avx2.vinserti128" ||
         Name == "x86.avx.vextractf128.pd.256" ||
         Name == "x86.avx.vextractf128.ps.256" ||
         Name == "x86.avx.vextractf128.si.256" ||
@@ if (Name.startswith("llvm.x86.sse2.pcmpeq.") ||
       // Replace vbroadcasts with a vector shuffle.
       Type *VT = VectorType::get(Type::getInt64Ty(C), 2);
       Value *Op = Builder.CreatePointerCast(CI->getArgOperand(0),
                                             PointerType::getUnqual(VT));
       Value *Load = Builder.CreateLoad(VT, Op);
       const int Idxs[4] = { 0, 1, 0, 1 };
       Rep = Builder.CreateShuffleVector(Load, UndefValue::get(Load->getType()),
                                         Idxs);
+    } else if (Name.startswith("llvm.x86.avx2.pbroadcast") ||
+               Name.startswith("llvm.x86.avx2.vbroadcast")) {
+      // Replace vp?broadcasts with a vector shuffle.
+      Value *Op = CI->getArgOperand(0);
+      unsigned NumElts = CI->getType()->getVectorNumElements();
+      Type *MaskTy = VectorType::get(Type::getInt32Ty(C), NumElts);
+      Rep = Builder.CreateShuffleVector(Op, UndefValue::get(Op->getType()),
+                                        Constant::getNullValue(MaskTy));
     } else if (Name == "llvm.x86.sse2.psll.dq") {
       // 128-bit shift left specified in bits.
       unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
       Rep = UpgradeX86PSLLDQIntrinsics(Builder, C, CI->getArgOperand(0), 1,
                                        Shift / 8); // Shift is in bits.
     } else if (Name == "llvm.x86.sse2.psrl.dq") {
       // 128-bit shift right specified in bits.
       unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
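
For readers less familiar with shuffle masks, the new AutoUpgrade branch can be read as follows: the lane count comes from the call's result type, the mask is an all-zero vector of that length, and so every result lane reads lane 0 of the operand. The scalar sketch below models only that behaviour for 32-bit lanes. `upgrade_broadcast_i32` is a hypothetical name, not part of LLVM, and it does not use the IRBuilder API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of the upgraded broadcast (hypothetical helper).
   num_elts plays the role of NumElts, which the patch reads from the
   result type, so the result may be wider than the operand. */
static void upgrade_broadcast_i32(const int32_t *op, size_t op_elts,
                                  int32_t *res, size_t num_elts) {
  assert(op_elts > 0);
  for (size_t i = 0; i < num_elts; ++i) {
    /* The mask is Constant::getNullValue(MaskTy), i.e. all zeros,
       so each result lane selects operand lane 0. */
    const size_t mask_i = 0;
    res[i] = op[mask_i];
  }
}
```

Calling it with a four-lane operand and `num_elts` of 8 mirrors the 128-bit to 256-bit variants such as pbroadcastd.256.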

llvm/trunk/lib/Target/X86/X86InstrSSE.td

 //===----------------------------------------------------------------------===//
 // AVX Instructions
 //===----------------------------------------------------------------------===//

 //===----------------------------------------------------------------------===//
 // VBROADCAST - Load from memory and broadcast to all elements of the
 // destination operand
 //
-class avx_broadcast<bits<8> opc, string OpcodeStr, RegisterClass RC,
-    X86MemOperand x86memop, Intrinsic Int, SchedWrite Sched> :
-  AVX8I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src),
-    !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set RC:$dst, (Int addr:$src))]>, Sched<[Sched]>, VEX;
-
-class avx_broadcast_no_int<bits<8> opc, string OpcodeStr, RegisterClass RC,
+class avx_broadcast_rm<bits<8> opc, string OpcodeStr, RegisterClass RC,
     X86MemOperand x86memop, ValueType VT,
     PatFrag ld_frag, SchedWrite Sched> :
   AVX8I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
     [(set RC:$dst, (VT (X86VBroadcast (ld_frag addr:$src))))]>,
     Sched<[Sched]>, VEX {
   let mayLoad = 1;
 }

 // AVX2 adds register forms
-class avx2_broadcast_reg<bits<8> opc, string OpcodeStr, RegisterClass RC,
-    Intrinsic Int, SchedWrite Sched> :
+class avx2_broadcast_rr<bits<8> opc, string OpcodeStr, RegisterClass RC,
+    ValueType ResVT, ValueType OpVT, SchedWrite Sched> :
   AVX28I<opc, MRMSrcReg, (outs RC:$dst), (ins VR128:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set RC:$dst, (Int VR128:$src))]>, Sched<[Sched]>, VEX;
+    [(set RC:$dst, (ResVT (X86VBroadcast (OpVT VR128:$src))))]>,
+    Sched<[Sched]>, VEX;

 let ExeDomain = SSEPackedSingle in {
-  def VBROADCASTSSrm : avx_broadcast_no_int<0x18, "vbroadcastss", VR128,
+  def VBROADCASTSSrm : avx_broadcast_rm<0x18, "vbroadcastss", VR128,
     f32mem, v4f32, loadf32, WriteLoad>;
-  def VBROADCASTSSYrm : avx_broadcast_no_int<0x18, "vbroadcastss", VR256,
+  def VBROADCASTSSYrm : avx_broadcast_rm<0x18, "vbroadcastss", VR256,
     f32mem, v8f32, loadf32,
     WriteFShuffleLd>, VEX_L;
 }
 let ExeDomain = SSEPackedDouble in
-def VBROADCASTSDYrm : avx_broadcast_no_int<0x19, "vbroadcastsd", VR256, f64mem,
+def VBROADCASTSDYrm : avx_broadcast_rm<0x19, "vbroadcastsd", VR256, f64mem,
     v4f64, loadf64, WriteFShuffleLd>, VEX_L;
-def VBROADCASTF128 : avx_broadcast<0x1A, "vbroadcastf128", VR256, f128mem,
-    int_x86_avx_vbroadcastf128_pd_256,
-    WriteFShuffleLd>, VEX_L;

 let ExeDomain = SSEPackedSingle in {
-  def VBROADCASTSSrr : avx2_broadcast_reg<0x18, "vbroadcastss", VR128,
-    int_x86_avx2_vbroadcast_ss_ps,
-    WriteFShuffle>;
-  def VBROADCASTSSYrr : avx2_broadcast_reg<0x18, "vbroadcastss", VR256,
-    int_x86_avx2_vbroadcast_ss_ps_256,
-    WriteFShuffle256>, VEX_L;
+  def VBROADCASTSSrr : avx2_broadcast_rr<0x18, "vbroadcastss", VR128,
+    v4f32, v4f32, WriteFShuffle>;
+  def VBROADCASTSSYrr : avx2_broadcast_rr<0x18, "vbroadcastss", VR256,
+    v8f32, v4f32, WriteFShuffle256>, VEX_L;
 }
 let ExeDomain = SSEPackedDouble in
-def VBROADCASTSDYrr : avx2_broadcast_reg<0x19, "vbroadcastsd", VR256,
-    int_x86_avx2_vbroadcast_sd_pd_256,
-    WriteFShuffle256>, VEX_L;
+def VBROADCASTSDYrr : avx2_broadcast_rr<0x19, "vbroadcastsd", VR256,
+    v4f64, v2f64, WriteFShuffle256>, VEX_L;

 let mayLoad = 1, Predicates = [HasAVX2] in
 def VBROADCASTI128 : AVX8I<0x5A, MRMSrcMem, (outs VR256:$dst),
     (ins i128mem:$src),
     "vbroadcasti128\t{$src, $dst|$dst, $src}", []>,
     Sched<[WriteLoad]>, VEX, VEX_L;

+def VBROADCASTF128 : AVX8I<0x1A, MRMSrcMem, (outs VR256:$dst),
+    (ins f128mem:$src),
+    "vbroadcastf128\t{$src, $dst|$dst, $src}",
+    [(set VR256:$dst,
+      (int_x86_avx_vbroadcastf128_pd_256 addr:$src))]>,
+    Sched<[WriteFShuffleLd]>, VEX, VEX_L;
+
 let Predicates = [HasAVX] in
 def : Pat<(int_x86_avx_vbroadcastf128_ps_256 addr:$src),
     (VBROADCASTF128 addr:$src)>;


 //===----------------------------------------------------------------------===//
 // VINSERTF128 - Insert packed floating-point values
 //
@@ defm VPBLENDDY : AVX2_binop_rmi<0x02, "vpblendd", X86Blendi, v8i32,
     VR256, loadv4i64, i256mem>, VEX_L;

 //===----------------------------------------------------------------------===//
 // VPBROADCAST - Load from memory and broadcast to all elements of the
 // destination operand
 //
 multiclass avx2_broadcast<bits<8> opc, string OpcodeStr,
     X86MemOperand x86memop, PatFrag ld_frag,
-    Intrinsic Int128, Intrinsic Int256> {
+    ValueType OpVT128, ValueType OpVT256> {
   def rr : AVX28I<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set VR128:$dst, (Int128 VR128:$src))]>,
+    [(set VR128:$dst, (OpVT128 (X86VBroadcast (OpVT128 VR128:$src))))]>,
     Sched<[WriteShuffle]>, VEX;
   def rm : AVX28I<opc, MRMSrcMem, (outs VR128:$dst), (ins x86memop:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set VR128:$dst,
-      (Int128 (scalar_to_vector (ld_frag addr:$src))))]>,
+    [(set VR128:$dst, (OpVT128 (X86VBroadcast (ld_frag addr:$src))))]>,
     Sched<[WriteLoad]>, VEX;
   def Yrr : AVX28I<opc, MRMSrcReg, (outs VR256:$dst), (ins VR128:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set VR256:$dst, (Int256 VR128:$src))]>,
+    [(set VR256:$dst, (OpVT256 (X86VBroadcast (OpVT128 VR128:$src))))]>,
     Sched<[WriteShuffle256]>, VEX, VEX_L;
   def Yrm : AVX28I<opc, MRMSrcMem, (outs VR256:$dst), (ins x86memop:$src),
     !strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
-    [(set VR256:$dst,
-      (Int256 (scalar_to_vector (ld_frag addr:$src))))]>,
+    [(set VR256:$dst, (OpVT256 (X86VBroadcast (ld_frag addr:$src))))]>,
     Sched<[WriteLoad]>, VEX, VEX_L;
 }

-defm VPBROADCASTB : avx2_broadcast<0x78, "vpbroadcastb", i8mem, loadi8,
-    int_x86_avx2_pbroadcastb_128,
-    int_x86_avx2_pbroadcastb_256>;
-defm VPBROADCASTW : avx2_broadcast<0x79, "vpbroadcastw", i16mem, loadi16,
-    int_x86_avx2_pbroadcastw_128,
-    int_x86_avx2_pbroadcastw_256>;
-defm VPBROADCASTD : avx2_broadcast<0x58, "vpbroadcastd", i32mem, loadi32,
-    int_x86_avx2_pbroadcastd_128,
-    int_x86_avx2_pbroadcastd_256>;
-defm VPBROADCASTQ : avx2_broadcast<0x59, "vpbroadcastq", i64mem, loadi64,
-    int_x86_avx2_pbroadcastq_128,
-    int_x86_avx2_pbroadcastq_256>;
+defm VPBROADCASTB : avx2_broadcast<0x78, "vpbroadcastb", i8mem, loadi8, v16i8, v32i8>;
+defm VPBROADCASTW : avx2_broadcast<0x79, "vpbroadcastw", i16mem, loadi16, v8i16, v16i16>;
+defm VPBROADCASTD : avx2_broadcast<0x58, "vpbroadcastd", i32mem, loadi32, v4i32, v8i32>;
+defm VPBROADCASTQ : avx2_broadcast<0x59, "vpbroadcastq", i64mem, loadi64, v2i64, v4i64>;

 let Predicates = [HasAVX2] in {
-  def : Pat<(v16i8 (X86VBroadcast (loadi8 addr:$src))),
-    (VPBROADCASTBrm addr:$src)>;
-  def : Pat<(v32i8 (X86VBroadcast (loadi8 addr:$src))),
-    (VPBROADCASTBYrm addr:$src)>;
-  def : Pat<(v8i16 (X86VBroadcast (loadi16 addr:$src))),
-    (VPBROADCASTWrm addr:$src)>;
-  def : Pat<(v16i16 (X86VBroadcast (loadi16 addr:$src))),
-    (VPBROADCASTWYrm addr:$src)>;
-  def : Pat<(v4i32 (X86VBroadcast (loadi32 addr:$src))),
-    (VPBROADCASTDrm addr:$src)>;
-  def : Pat<(v8i32 (X86VBroadcast (loadi32 addr:$src))),
-    (VPBROADCASTDYrm addr:$src)>;
-  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
-    (VPBROADCASTQrm addr:$src)>;
-  def : Pat<(v4i64 (X86VBroadcast (loadi64 addr:$src))),
-    (VPBROADCASTQYrm addr:$src)>;
-
-  def : Pat<(v16i8 (X86VBroadcast (v16i8 VR128:$src))),
-    (VPBROADCASTBrr VR128:$src)>;
-  def : Pat<(v32i8 (X86VBroadcast (v16i8 VR128:$src))),
-    (VPBROADCASTBYrr VR128:$src)>;
-  def : Pat<(v8i16 (X86VBroadcast (v8i16 VR128:$src))),
-    (VPBROADCASTWrr VR128:$src)>;
-  def : Pat<(v16i16 (X86VBroadcast (v8i16 VR128:$src))),
-    (VPBROADCASTWYrr VR128:$src)>;
-  def : Pat<(v4i32 (X86VBroadcast (v4i32 VR128:$src))),
-    (VPBROADCASTDrr VR128:$src)>;
-  def : Pat<(v8i32 (X86VBroadcast (v4i32 VR128:$src))),
-    (VPBROADCASTDYrr VR128:$src)>;
-  def : Pat<(v2i64 (X86VBroadcast (v2i64 VR128:$src))),
-    (VPBROADCASTQrr VR128:$src)>;
-  def : Pat<(v4i64 (X86VBroadcast (v2i64 VR128:$src))),
-    (VPBROADCASTQYrr VR128:$src)>;
-  def : Pat<(v4f32 (X86VBroadcast (v4f32 VR128:$src))),
-    (VBROADCASTSSrr VR128:$src)>;
-  def : Pat<(v8f32 (X86VBroadcast (v4f32 VR128:$src))),
-    (VBROADCASTSSYrr VR128:$src)>;
-  def : Pat<(v2f64 (X86VBroadcast (v2f64 VR128:$src))),
-    (VPBROADCASTQrr VR128:$src)>;
-  def : Pat<(v4f64 (X86VBroadcast (v2f64 VR128:$src))),
-    (VBROADCASTSDYrr VR128:$src)>;
-
   // Provide aliases for broadcast from the same register class that
   // automatically does the extract.
   def : Pat<(v32i8 (X86VBroadcast (v32i8 VR256:$src))),
     (VPBROADCASTBYrr (v16i8 (EXTRACT_SUBREG (v32i8 VR256:$src),
     sub_xmm)))>;
   def : Pat<(v16i16 (X86VBroadcast (v16i16 VR256:$src))),
     (VPBROADCASTWYrr (v8i16 (EXTRACT_SUBREG (v16i16 VR256:$src),
     sub_xmm)))>;

llvm/trunk/test/CodeGen/X86/avx2-intrinsics-x86-upgrade.ll

 ; CHECK-LABEL: test_x86_avx2_vinserti128:
 ; CHECK: vinserti128

   %res = call <4 x i64> @llvm.x86.avx2.vinserti128(<4 x i64> %a0, <2 x i64> %a1, i8 7)
   ret <4 x i64> %res
 }
 declare <4 x i64> @llvm.x86.avx2.vinserti128(<4 x i64>, <2 x i64>, i8) nounwind readnone
+
+
+define <4 x double> @test_x86_avx2_vbroadcast_sd_pd_256(<2 x double> %a0) {
+; CHECK-LABEL: test_x86_avx2_vbroadcast_sd_pd_256:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vbroadcastsd %xmm0, %ymm0
+; CHECK-NEXT: retl
+  %res = call <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double> %a0)
+  ret <4 x double> %res
+}
+declare <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double>) nounwind readonly
+
+
+define <4 x float> @test_x86_avx2_vbroadcast_ss_ps(<4 x float> %a0) {
+; CHECK-LABEL: test_x86_avx2_vbroadcast_ss_ps:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vbroadcastss %xmm0, %xmm0
+; CHECK-NEXT: retl
+  %res = call <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float> %a0)
+  ret <4 x float> %res
+}
+declare <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float>) nounwind readonly
+
+
+define <8 x float> @test_x86_avx2_vbroadcast_ss_ps_256(<4 x float> %a0) {
+; CHECK-LABEL: test_x86_avx2_vbroadcast_ss_ps_256:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vbroadcastss %xmm0, %ymm0
+; CHECK-NEXT: retl
+  %res = call <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float> %a0)
+  ret <8 x float> %res
+}
+declare <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float>) nounwind readonly
+
+
+define <16 x i8> @test_x86_avx2_pbroadcastb_128(<16 x i8> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastb_128:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vpbroadcastb %xmm0, %xmm0
+; CHECK-NEXT: retl
+  %res = call <16 x i8> @llvm.x86.avx2.pbroadcastb.128(<16 x i8> %a0)
+  ret <16 x i8> %res
+}
+declare <16 x i8> @llvm.x86.avx2.pbroadcastb.128(<16 x i8>) nounwind readonly
+
+
+define <32 x i8> @test_x86_avx2_pbroadcastb_256(<16 x i8> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastb_256:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vpbroadcastb %xmm0, %ymm0
+; CHECK-NEXT: retl
+  %res = call <32 x i8> @llvm.x86.avx2.pbroadcastb.256(<16 x i8> %a0)
+  ret <32 x i8> %res
+}
+declare <32 x i8> @llvm.x86.avx2.pbroadcastb.256(<16 x i8>) nounwind readonly
+
+
+define <8 x i16> @test_x86_avx2_pbroadcastw_128(<8 x i16> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastw_128:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vpbroadcastw %xmm0, %xmm0
+; CHECK-NEXT: retl
+  %res = call <8 x i16> @llvm.x86.avx2.pbroadcastw.128(<8 x i16> %a0)
+  ret <8 x i16> %res
+}
+declare <8 x i16> @llvm.x86.avx2.pbroadcastw.128(<8 x i16>) nounwind readonly
+
+
+define <16 x i16> @test_x86_avx2_pbroadcastw_256(<8 x i16> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastw_256:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vpbroadcastw %xmm0, %ymm0
+; CHECK-NEXT: retl
+  %res = call <16 x i16> @llvm.x86.avx2.pbroadcastw.256(<8 x i16> %a0)
+  ret <16 x i16> %res
+}
+declare <16 x i16> @llvm.x86.avx2.pbroadcastw.256(<8 x i16>) nounwind readonly
+
+
+define <4 x i32> @test_x86_avx2_pbroadcastd_128(<4 x i32> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastd_128:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vbroadcastss %xmm0, %xmm0
+; CHECK-NEXT: retl
+  %res = call <4 x i32> @llvm.x86.avx2.pbroadcastd.128(<4 x i32> %a0)
+  ret <4 x i32> %res
+}
+declare <4 x i32> @llvm.x86.avx2.pbroadcastd.128(<4 x i32>) nounwind readonly
+
+
+define <8 x i32> @test_x86_avx2_pbroadcastd_256(<4 x i32> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastd_256:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vbroadcastss %xmm0, %ymm0
+; CHECK-NEXT: retl
+  %res = call <8 x i32> @llvm.x86.avx2.pbroadcastd.256(<4 x i32> %a0)
+  ret <8 x i32> %res
+}
+declare <8 x i32> @llvm.x86.avx2.pbroadcastd.256(<4 x i32>) nounwind readonly
+
+
+define <2 x i64> @test_x86_avx2_pbroadcastq_128(<2 x i64> %a0) {
+; CHECK-LABEL: test_x86_avx2_pbroadcastq_128:
+; CHECK: ## BB#0:
+; CHECK-NEXT: vpbroadcastq %xmm0, %xmm0
+; CHECK-NEXT: retl
+  %res = call <2 x i64> @llvm.x86.avx2.pbroadcastq.128(<2 x i64> %a0)
+  ret <2 x i64> %res
+}
+declare <2 x i64> @llvm.x86.avx2.pbroadcastq.128(<2 x i64>) nounwind readonly


				define <4 x i64> @test_x86_avx2_pbroadcastq_256(<2 x i64> %a0) {
				; CHECK-LABEL: test_x86_avx2_pbroadcastq_256:
				; CHECK: ## BB#0:
				; CHECK-NEXT: vbroadcastsd %xmm0, %ymm0
				; CHECK-NEXT: retl
				%res = call <4 x i64> @llvm.x86.avx2.pbroadcastq.256(<2 x i64> %a0)
				ret <4 x i64> %res
				}
				declare <4 x i64> @llvm.x86.avx2.pbroadcastq.256(<2 x i64>) nounwind readonly

llvm/trunk/test/CodeGen/X86/avx2-intrinsics-x86.ll

	define <4 x i64> @test_x86_avx2_pmul.dq(<8 x i32> %a0, <8 x i32> %a1) {			define <4 x i64> @test_x86_avx2_pmul.dq(<8 x i32> %a0, <8 x i32> %a1) {
	; CHECK: vpmuldq			; CHECK: vpmuldq
	%res = call <4 x i64> @llvm.x86.avx2.pmul.dq(<8 x i32> %a0, <8 x i32> %a1) ; <<2 x i64>> [#uses=1]			%res = call <4 x i64> @llvm.x86.avx2.pmul.dq(<8 x i32> %a0, <8 x i32> %a1) ; <<2 x i64>> [#uses=1]
	ret <4 x i64> %res			ret <4 x i64> %res
	}			}
	declare <4 x i64> @llvm.x86.avx2.pmul.dq(<8 x i32>, <8 x i32>) nounwind readnone			declare <4 x i64> @llvm.x86.avx2.pmul.dq(<8 x i32>, <8 x i32>) nounwind readnone


	define <4 x double> @test_x86_avx2_vbroadcast_sd_pd_256(<2 x double> %a0) {
	; CHECK: vbroadcastsd
	%res = call <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double> %a0) ; <<4 x double>> [#uses=1]
	ret <4 x double> %res
	}
	declare <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double>) nounwind readonly


	define <4 x float> @test_x86_avx2_vbroadcast_ss_ps(<4 x float> %a0) {
	; CHECK: vbroadcastss
	%res = call <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float> %a0) ; <<4 x float>> [#uses=1]
	ret <4 x float> %res
	}
	declare <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float>) nounwind readonly


	define <8 x float> @test_x86_avx2_vbroadcast_ss_ps_256(<4 x float> %a0) {
	; CHECK: vbroadcastss
	%res = call <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float> %a0) ; <<8 x float>> [#uses=1]
	ret <8 x float> %res
	}
	declare <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float>) nounwind readonly


	define <4 x i32> @test_x86_avx2_pblendd_128(<4 x i32> %a0, <4 x i32> %a1) {			define <4 x i32> @test_x86_avx2_pblendd_128(<4 x i32> %a0, <4 x i32> %a1) {
	; CHECK: vpblendd			; CHECK: vpblendd
	%res = call <4 x i32> @llvm.x86.avx2.pblendd.128(<4 x i32> %a0, <4 x i32> %a1, i8 7) ; <<4 x i32>> [#uses=1]			%res = call <4 x i32> @llvm.x86.avx2.pblendd.128(<4 x i32> %a0, <4 x i32> %a1, i8 7) ; <<4 x i32>> [#uses=1]
	ret <4 x i32> %res			ret <4 x i32> %res
	}			}
	declare <4 x i32> @llvm.x86.avx2.pblendd.128(<4 x i32>, <4 x i32>, i8) nounwind readnone			declare <4 x i32> @llvm.x86.avx2.pblendd.128(<4 x i32>, <4 x i32>, i8) nounwind readnone


	define <8 x i32> @test_x86_avx2_pblendd_256(<8 x i32> %a0, <8 x i32> %a1) {			define <8 x i32> @test_x86_avx2_pblendd_256(<8 x i32> %a0, <8 x i32> %a1) {
	; CHECK: vpblendd			; CHECK: vpblendd
	%res = call <8 x i32> @llvm.x86.avx2.pblendd.256(<8 x i32> %a0, <8 x i32> %a1, i8 7) ; <<8 x i32>> [#uses=1]			%res = call <8 x i32> @llvm.x86.avx2.pblendd.256(<8 x i32> %a0, <8 x i32> %a1, i8 7) ; <<8 x i32>> [#uses=1]
	ret <8 x i32> %res			ret <8 x i32> %res
	}			}
	declare <8 x i32> @llvm.x86.avx2.pblendd.256(<8 x i32>, <8 x i32>, i8) nounwind readnone			declare <8 x i32> @llvm.x86.avx2.pblendd.256(<8 x i32>, <8 x i32>, i8) nounwind readnone


	define <16 x i8> @test_x86_avx2_pbroadcastb_128(<16 x i8> %a0) {
	; CHECK: vpbroadcastb
	%res = call <16 x i8> @llvm.x86.avx2.pbroadcastb.128(<16 x i8> %a0) ; <<16 x i8>> [#uses=1]
	ret <16 x i8> %res
	}
	declare <16 x i8> @llvm.x86.avx2.pbroadcastb.128(<16 x i8>) nounwind readonly


	define <32 x i8> @test_x86_avx2_pbroadcastb_256(<16 x i8> %a0) {
	; CHECK: vpbroadcastb
	%res = call <32 x i8> @llvm.x86.avx2.pbroadcastb.256(<16 x i8> %a0) ; <<32 x i8>> [#uses=1]
	ret <32 x i8> %res
	}
	declare <32 x i8> @llvm.x86.avx2.pbroadcastb.256(<16 x i8>) nounwind readonly


	define <8 x i16> @test_x86_avx2_pbroadcastw_128(<8 x i16> %a0) {
	; CHECK: vpbroadcastw
	%res = call <8 x i16> @llvm.x86.avx2.pbroadcastw.128(<8 x i16> %a0) ; <<8 x i16>> [#uses=1]
	ret <8 x i16> %res
	}
	declare <8 x i16> @llvm.x86.avx2.pbroadcastw.128(<8 x i16>) nounwind readonly


	define <16 x i16> @test_x86_avx2_pbroadcastw_256(<8 x i16> %a0) {
	; CHECK: vpbroadcastw
	%res = call <16 x i16> @llvm.x86.avx2.pbroadcastw.256(<8 x i16> %a0) ; <<16 x i16>> [#uses=1]
	ret <16 x i16> %res
	}
	declare <16 x i16> @llvm.x86.avx2.pbroadcastw.256(<8 x i16>) nounwind readonly


	define <4 x i32> @test_x86_avx2_pbroadcastd_128(<4 x i32> %a0) {
	; CHECK: vbroadcastss
	%res = call <4 x i32> @llvm.x86.avx2.pbroadcastd.128(<4 x i32> %a0) ; <<4 x i32>> [#uses=1]
	ret <4 x i32> %res
	}
	declare <4 x i32> @llvm.x86.avx2.pbroadcastd.128(<4 x i32>) nounwind readonly


	define <8 x i32> @test_x86_avx2_pbroadcastd_256(<4 x i32> %a0) {
	; CHECK: vbroadcastss {{[^,]+}}, %ymm{{[0-9]+}}
	%res = call <8 x i32> @llvm.x86.avx2.pbroadcastd.256(<4 x i32> %a0) ; <<8 x i32>> [#uses=1]
	ret <8 x i32> %res
	}
	declare <8 x i32> @llvm.x86.avx2.pbroadcastd.256(<4 x i32>) nounwind readonly


	define <2 x i64> @test_x86_avx2_pbroadcastq_128(<2 x i64> %a0) {
	; CHECK: vpbroadcastq
	%res = call <2 x i64> @llvm.x86.avx2.pbroadcastq.128(<2 x i64> %a0) ; <<2 x i64>> [#uses=1]
	ret <2 x i64> %res
	}
	declare <2 x i64> @llvm.x86.avx2.pbroadcastq.128(<2 x i64>) nounwind readonly


	define <4 x i64> @test_x86_avx2_pbroadcastq_256(<2 x i64> %a0) {
	; CHECK: vbroadcastsd {{[^,]+}}, %ymm{{[0-9]+}}
	%res = call <4 x i64> @llvm.x86.avx2.pbroadcastq.256(<2 x i64> %a0) ; <<4 x i64>> [#uses=1]
	ret <4 x i64> %res
	}
	declare <4 x i64> @llvm.x86.avx2.pbroadcastq.256(<2 x i64>) nounwind readonly


	define <8 x i32> @test_x86_avx2_permd(<8 x i32> %a0, <8 x i32> %a1) {			define <8 x i32> @test_x86_avx2_permd(<8 x i32> %a0, <8 x i32> %a1) {
	; Check that the arguments are swapped between the intrinsic definition			; Check that the arguments are swapped between the intrinsic definition
	; and its lowering. Indeed, the offsets are the first source in			; and its lowering. Indeed, the offsets are the first source in
	; the instruction.			; the instruction.
	; CHECK: vpermd %ymm0, %ymm1, %ymm0			; CHECK: vpermd %ymm0, %ymm1, %ymm0
	%res = call <8 x i32> @llvm.x86.avx2.permd(<8 x i32> %a0, <8 x i32> %a1) ; <<8 x i32>> [#uses=1]			%res = call <8 x i32> @llvm.x86.avx2.permd(<8 x i32> %a0, <8 x i32> %a1) ; <<8 x i32>> [#uses=1]
	ret <8 x i32> %res			ret <8 x i32> %res
	}			}

llvm/trunk/test/CodeGen/X86/stack-folding-int-avx2.ll

	; RUN: llc -O3 -disable-peephole -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 < %s | FileCheck %s			; RUN: llc -O3 -disable-peephole -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=+avx2 < %s | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-unknown"			target triple = "x86_64-unknown-unknown"

	; Stack reload folding tests.			; Stack reload folding tests.
	;			;
	; By including a nop call with sideeffects we can force a partial register spill of the			; By including a nop call with sideeffects we can force a partial register spill of the
	; relevant registers and check that the reload is correctly folded into the instruction.			; relevant registers and check that the reload is correctly folded into the instruction.

	define <4 x double> @stack_fold_broadcastsd_ymm(<2 x double> %a0) {			define <4 x double> @stack_fold_broadcastsd_ymm(<2 x double> %a0) {
	;CHECK-LABEL: stack_fold_broadcastsd_ymm			;CHECK-LABEL: stack_fold_broadcastsd_ymm
	;CHECK: vbroadcastsd {{-?[0-9]}}(%rsp), {{%ymm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload			;CHECK: vbroadcastsd {{-?[0-9]}}(%rsp), {{%ymm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload
	%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()			%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
	%2 = call <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double> %a0)			%2 = shufflevector <2 x double> %a0, <2 x double> undef, <4 x i32> zeroinitializer
	; fadd forces execution domain			; fadd forces execution domain
	%3 = fadd <4 x double> %2, <double 0x0, double 0x0, double 0x0, double 0x0>			%3 = fadd <4 x double> %2, <double 0x0, double 0x0, double 0x0, double 0x0>
	ret <4 x double> %3			ret <4 x double> %3
	}			}
	declare <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double>) nounwind readonly			declare <4 x double> @llvm.x86.avx2.vbroadcast.sd.pd.256(<2 x double>) nounwind readonly

	define <4 x float> @stack_fold_broadcastss(<4 x float> %a0) {			define <4 x float> @stack_fold_broadcastss(<4 x float> %a0) {
	;CHECK-LABEL: stack_fold_broadcastss			;CHECK-LABEL: stack_fold_broadcastss
	;CHECK: vbroadcastss {{-?[0-9]}}(%rsp), {{%xmm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload			;CHECK: vbroadcastss {{-?[0-9]}}(%rsp), {{%xmm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload
	%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()			%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
	%2 = call <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float> %a0)			%2 = shufflevector <4 x float> %a0, <4 x float> undef, <4 x i32> zeroinitializer
	; fadd forces execution domain			; fadd forces execution domain
	%3 = fadd <4 x float> %2, <float 0x0, float 0x0, float 0x0, float 0x0>			%3 = fadd <4 x float> %2, <float 0x0, float 0x0, float 0x0, float 0x0>
	ret <4 x float> %3			ret <4 x float> %3
	}			}
	declare <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float>) nounwind readonly			declare <4 x float> @llvm.x86.avx2.vbroadcast.ss.ps(<4 x float>) nounwind readonly

	define <8 x float> @stack_fold_broadcastss_ymm(<4 x float> %a0) {			define <8 x float> @stack_fold_broadcastss_ymm(<4 x float> %a0) {
	;CHECK-LABEL: stack_fold_broadcastss_ymm			;CHECK-LABEL: stack_fold_broadcastss_ymm
	;CHECK: vbroadcastss {{-?[0-9]}}(%rsp), {{%ymm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload			;CHECK: vbroadcastss {{-?[0-9]}}(%rsp), {{%ymm[0-9][0-9]}} {{.*#+}} 16-byte Folded Reload
	%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()			%1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
	%2 = call <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float> %a0)			%2 = shufflevector <4 x float> %a0, <4 x float> undef, <8 x i32> zeroinitializer
	; fadd forces execution domain			; fadd forces execution domain
	%3 = fadd <8 x float> %2, <float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0>			%3 = fadd <8 x float> %2, <float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0, float 0x0>
	ret <8 x float> %3			ret <8 x float> %3
	}			}
	declare <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float>) nounwind readonly			declare <8 x float> @llvm.x86.avx2.vbroadcast.ss.ps.256(<4 x float>) nounwind readonly
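
For reference, the right-hand side of the three tests above replaces each broadcast intrinsic call with a plain shufflevector whose mask is zeroinitializer, so every result lane reads element 0 of the source vector. A minimal standalone sketch of the 256-bit float case, assuming nothing beyond what the updated test bodies show (the function name is hypothetical and not part of the patch):

```llvm
; Hypothetical function name, used only for illustration.
define <8 x float> @broadcast_ss_ps_256_sketch(<4 x float> %a0) {
  ; An all-zero mask selects element 0 of %a0 for each of the 8 result lanes.
  ; The second operand is never referenced by the mask, so it can be undef.
  %res = shufflevector <4 x float> %a0, <4 x float> undef, <8 x i32> zeroinitializer
  ret <8 x float> %res
}
```

Written this way, no declare for the intrinsic is needed, and per the CHECK lines above the shuffle is expected to select the same vbroadcastss instruction.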

	define <4 x i32> @stack_fold_extracti128(<8 x i32> %a0, <8 x i32> %a1) {			define <4 x i32> @stack_fold_extracti128(<8 x i32> %a0, <8 x i32> %a1) {
	;CHECK-LABEL: stack_fold_extracti128			;CHECK-LABEL: stack_fold_extracti128