This is an archive of the discontinued LLVM Phabricator instance.

[X86] replace vinsertf128 intrinsics with generic shuffles
ClosedPublic

Authored by spatel on Mar 5 2015, 12:14 PM.


Details

Reviewers

qcolombet
chandlerc
craig.topper

Commits

rG19792fb27044: [X86, AVX] replace vinsertf128 intrinsics with generic shuffles
rL231794: [X86, AVX] replace vinsertf128 intrinsics with generic shuffles

Summary

This is my first hack at getting us out of the custom x86 shuffle intrinsic business.
This was suggested in: http://reviews.llvm.org/D7866

Let's start with the vinsertf128 troika. I'll post a clang sibling patch shortly.

Please let me know if I've missed anything. I'm basing these changes on:
http://llvm.org/viewvc/llvm-project?view=revision&revision=230860

If anyone can explain why vinsertf128 with a 0 immediate exists, I'm curious...

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 21295. Mar 5 2015, 12:14 PM

spatel retitled this revision to [X86] replace vinsertf128 intrinsics with generic shuffles.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: chandlerc, craig.topper, qcolombet.

spatel added a subscriber: Unknown Object (MLST).

chandlerc added inline comments. Mar 5 2015, 12:51 PM

lib/IR/AutoUpgrade.cpp
644–649 ↗	(On Diff #21295)	I think you should teach the CreateShuffleVector method to also accept an array ref of ints and to use a -1 -> undef mapping to produce the constants for you. Then you can use std::iota to remove all the loops in this code.
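
The suggestion above can be illustrated outside of LLVM. What follows is a minimal standalone sketch, not the IRBuilder API: `identityMask` and `maskToIR` are hypothetical helper names introduced only for this illustration, and the convention that -1 stands for an undef lane is taken from the comment above. It shows how std::iota can fill an index mask without a hand-written loop, and how an integer mask might be mapped to shuffle-mask constants.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not an LLVM API): build the mask <0, 1, ..., numElts-1>
// with std::iota instead of an explicit push_back loop.
std::vector<int> identityMask(unsigned numElts) {
  std::vector<int> mask(numElts);
  std::iota(mask.begin(), mask.end(), 0);
  return mask;
}

// Hypothetical helper: render an integer mask the way a shufflevector mask is
// printed in IR, mapping -1 to an undef lane.
std::string maskToIR(const std::vector<int> &mask) {
  std::string out = "<";
  for (std::size_t i = 0; i != mask.size(); ++i) {
    assert(mask[i] >= -1 && "only -1 is accepted as a negative index");
    if (i != 0)
      out += ", ";
    out += mask[i] < 0 ? "i32 undef" : "i32 " + std::to_string(mask[i]);
  }
  return out + ">";
}
```

With helpers of this shape, `maskToIR(identityMask(4))` would yield `<i32 0, i32 1, i32 2, i32 3>`, and any lane set to -1 would print as `i32 undef`.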

Hi Sanjay,

lib/IR/AutoUpgrade.cpp
639 ↗	(On Diff #21295)	What happens if Imm is, for example, 2 or 4? According to the Intel documentation: "The high 7 bits of the immediate are ignored". So only the lowest bit of 'Imm' has any meaning in this context. However, your code doesn't clear the upper bits of 'Imm'. So (unless I misread the code) the for loops between lines 665 and 674 would propagate the wrong indices if Imm is an even number bigger than zero. Can you add a test for the case where Imm is a non-zero even number?

If anyone can explain why vinsertf128 with a 0 immediate exists, I'm curious...

IIRC its only real use is for loading xmm into the ymm - using a blend would reference a whole 256-bit vector of memory that might not be valid.

In D8086#135205, @RKSimon wrote:

If anyone can explain why vinsertf128 with a 0 immediate exists, I'm curious...

IIRC its only real use is for loading xmm into the ymm - using a blend would reference a whole 256-bit vector of memory that might not be valid.

Yes, the memop form is different. I see only one regression test with a memop vinsertf128 (stack-folding-fp-avx1.ll), and it has an immediate 1 operand. From what I've seen so far, we're always going to replace the reg/reg form with imm 0 with a blendi after this change. This should be good because vblendps/pd have better perf on every chip that I checked. Will have to see what happens in the load fold case.
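
Why the immediate-0 form can be lowered as a blend may be easier to see with a small check. The sketch below is standalone C++, not LLVM's lowering code; `isBlendMask` is a hypothetical predicate introduced for illustration. It assumes the usual two-source shuffle numbering, where indices 0..N-1 pick from the first source and N..2N-1 pick from the second, and it treats a shuffle as a blend when every result lane i takes lane i of one of the two sources.

```cpp
#include <cassert>
#include <vector>

// Hypothetical predicate (not LLVM's implementation): returns true when every
// result lane i reads lane i of either source, so no element changes position.
bool isBlendMask(const std::vector<int> &mask) {
  const int n = static_cast<int>(mask.size());
  for (int i = 0; i != n; ++i) {
    assert(mask[i] >= 0 && mask[i] < 2 * n && "index out of range");
    if (mask[i] % n != i)
      return false;
  }
  return true;
}
```

Applied to the 32-bit-element example masks given in the AutoUpgrade.cpp comments of this diff, the immediate-0 mask `<8, 9, 10, 11, 4, 5, 6, 7>` passes this check, while the immediate-1 mask `<0, 1, 2, 3, 8, 9, 10, 11>` does not, because its upper lanes read the low elements of the second source.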

spatel added inline comments. Mar 5 2015, 5:25 PM

lib/IR/AutoUpgrade.cpp
639 ↗	(On Diff #21295)	Yes, that would be a bug. We need to ignore the high bits. The software intrinsic doesn't match the hardware either; it shows an 'int' rather than a 'char'.
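
The bug discussed here can be reproduced with a standalone model of the mask computation. This is a sketch in plain C++ that mirrors the index arithmetic of the patch's loops without any LLVM types; `insertMask` is a hypothetical name, and the vector of unsigned values stands in for the constant shuffle mask.

```cpp
#include <cassert>
#include <vector>

// Standalone model of the second shuffle mask built in the patch. Indices
// [0, numElts) select from the first operand; indices [numElts, 2 * numElts)
// select from the second operand after it has been widened to numElts lanes.
std::vector<unsigned> insertMask(unsigned numElts, unsigned imm) {
  assert(numElts % 2 == 0 && "expects an even number of lanes");
  imm &= 1; // Only the low bit of the immediate is meaningful.
  std::vector<unsigned> mask;
  mask.reserve(numElts);
  // Low half: the first operand's low half (imm == 1) or the inserted vector.
  for (unsigned i = 0; i != numElts / 2; ++i)
    mask.push_back(imm ? i : i + numElts);
  // High half: the inserted vector (imm == 1) or the first operand's high half.
  for (unsigned i = numElts / 2; i != numElts; ++i)
    mask.push_back(imm ? i + numElts / 2 : i);
  return mask;
}
```

Without the `imm &= 1` line, an immediate of 2 would be treated as nonzero and would produce the immediate-1 mask; with it, 2 behaves like 0 and 7 behaves like 1.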

RKSimon mentioned this in D8088: [X86] replace vinsertf128 intrinsics with generic shuffles. Mar 6 2015, 12:27 AM

spatel added inline comments. Mar 6 2015, 9:15 AM

lib/IR/AutoUpgrade.cpp
644–649 ↗	(On Diff #21295)	I had no idea about std::iota. This would apparently be the first use of it anywhere in LLVM! But if using it is an NFC, do you mind if I do that in a follow-on patch? Although this raises another question that I had: what is the life expectancy of the AutoUpgrade code? Once we turn off the tap to create these intrinsics in clang, I assume this will be an even colder code path than it already is. Do we make any guarantees on backward support for deprecated intrinsics? Could we delete this after the next release?

Patch updated:

Mask off all but the low bit of the immediate
Added test case to confirm masking

I believe I was previously told that we had to keep auto-upgrade support for everything from 3.0 onwards (see comment from Chris in r154778 back in April 2012). Maybe they can go away at 4.0?

In D8086#136096, @craig.topper wrote:

I believe I was previously told that we had to keep auto-upgrade support for everything from 3.0 onwards (see comment from Chris in r154778 back in April 2012). Maybe they can go away at 4.0?

I thought 4.1 (so 4.0 can read all 3.x input).

In D8086#136822, @probinson wrote:

In D8086#136096, @craig.topper wrote:

I believe I was previously told that we had to keep auto-upgrade support for everything from 3.0 onwards (see comment from Chris in r154778 back in April 2012). Maybe they can go away at 4.0?

I thought 4.1 (so 4.0 can read all 3.x input).

So the plan is for all of the auto-upgrade intrinsics to be dropped at once at 4.1? Would a deprecation warning be useful here (either now or at 4.0) or at least a description in the comments in AutoUpgrade.cpp?

spatel mentioned this in D8184: IRBuilder - add a CreateShuffleVector function that takes an ArrayRef of int. Mar 9 2015, 2:48 PM

LGTM

This revision is now accepted and ready to land. Mar 9 2015, 8:19 PM

Closed by commit rL231794: [X86, AVX] replace vinsertf128 intrinsics with generic shuffles (authored by spatel). Mar 10 2015, 9:11 AM

This revision was automatically updated to reflect the committed changes.

Thanks all for the patch feedback - checked in at r231794 (the Clang half is at r231792).

I didn't add a comment about the expected lifetime of AutoUpgrade code because I wasn't sure what the final answer is. But I agree that we should add a comment once we have consensus. That way anyone looking in that file will have some idea about the policy.

spatel mentioned this in D8276: [X86] replace vextractf128 intrinsics with generic shuffles (LLVM). Mar 11 2015, 4:47 PM

spatel mentioned this in rL232045: [X86, AVX] replace vextractf128 intrinsics with generic shuffles. Mar 12 2015, 8:20 AM

Revision Contents

Path	Size
llvm/trunk/include/llvm/IR/IntrinsicsX86.td	13 lines
llvm/trunk/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	3 lines
llvm/trunk/lib/IR/AutoUpgrade.cpp	52 lines
llvm/trunk/test/CodeGen/X86/avx-intrinsics-x86-upgrade.ll	36 lines
llvm/trunk/test/CodeGen/X86/avx-intrinsics-x86.ll	24 lines
llvm/trunk/test/CodeGen/X86/avx-vinsertf128.ll	42 lines
llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll	52 lines

Diff 21590

llvm/trunk/include/llvm/IR/IntrinsicsX86.td

[1,177 lines hidden; context: def int_x86_avx_vextractf128_pd_256 :]
GCCBuiltin<"__builtin_ia32_vextractf128_pd256">,		GCCBuiltin<"__builtin_ia32_vextractf128_pd256">,
Intrinsic<[llvm_v2f64_ty], [llvm_v4f64_ty, llvm_i8_ty], [IntrNoMem]>;		Intrinsic<[llvm_v2f64_ty], [llvm_v4f64_ty, llvm_i8_ty], [IntrNoMem]>;
def int_x86_avx_vextractf128_ps_256 :		def int_x86_avx_vextractf128_ps_256 :
GCCBuiltin<"__builtin_ia32_vextractf128_ps256">,		GCCBuiltin<"__builtin_ia32_vextractf128_ps256">,
Intrinsic<[llvm_v4f32_ty], [llvm_v8f32_ty, llvm_i8_ty], [IntrNoMem]>;		Intrinsic<[llvm_v4f32_ty], [llvm_v8f32_ty, llvm_i8_ty], [IntrNoMem]>;
def int_x86_avx_vextractf128_si_256 :		def int_x86_avx_vextractf128_si_256 :
GCCBuiltin<"__builtin_ia32_vextractf128_si256">,		GCCBuiltin<"__builtin_ia32_vextractf128_si256">,
Intrinsic<[llvm_v4i32_ty], [llvm_v8i32_ty, llvm_i8_ty], [IntrNoMem]>;		Intrinsic<[llvm_v4i32_ty], [llvm_v8i32_ty, llvm_i8_ty], [IntrNoMem]>;

def int_x86_avx_vinsertf128_pd_256 :
GCCBuiltin<"__builtin_ia32_vinsertf128_pd256">,
Intrinsic<[llvm_v4f64_ty], [llvm_v4f64_ty,
llvm_v2f64_ty, llvm_i8_ty], [IntrNoMem]>;
def int_x86_avx_vinsertf128_ps_256 :
GCCBuiltin<"__builtin_ia32_vinsertf128_ps256">,
Intrinsic<[llvm_v8f32_ty], [llvm_v8f32_ty,
llvm_v4f32_ty, llvm_i8_ty], [IntrNoMem]>;
def int_x86_avx_vinsertf128_si_256 :
GCCBuiltin<"__builtin_ia32_vinsertf128_si256">,
Intrinsic<[llvm_v8i32_ty], [llvm_v8i32_ty,
llvm_v4i32_ty, llvm_i8_ty], [IntrNoMem]>;
}		}

// Vector convert		// Vector convert
let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".		let TargetPrefix = "x86" in { // All intrinsics start with "llvm.x86.".
def int_x86_avx_cvtdq2_pd_256 : GCCBuiltin<"__builtin_ia32_cvtdq2pd256">,		def int_x86_avx_cvtdq2_pd_256 : GCCBuiltin<"__builtin_ia32_cvtdq2pd256">,
Intrinsic<[llvm_v4f64_ty], [llvm_v4i32_ty], [IntrNoMem]>;		Intrinsic<[llvm_v4f64_ty], [llvm_v4i32_ty], [IntrNoMem]>;
def int_x86_avx_cvtdq2_ps_256 : GCCBuiltin<"__builtin_ia32_cvtdq2ps256">,		def int_x86_avx_cvtdq2_ps_256 : GCCBuiltin<"__builtin_ia32_cvtdq2ps256">,
Intrinsic<[llvm_v8f32_ty], [llvm_v8i32_ty], [IntrNoMem]>;		Intrinsic<[llvm_v8f32_ty], [llvm_v8i32_ty], [IntrNoMem]>;
[2,674 lines hidden]

llvm/trunk/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp


[4,950 lines hidden; context: case Intrinsic::x86_mmx_psrai_d: {]
EVT DestVT = TLI.getValueType(I.getType());		EVT DestVT = TLI.getValueType(I.getType());
ShAmt = DAG.getNode(ISD::BITCAST, sdl, DestVT, ShAmt);		ShAmt = DAG.getNode(ISD::BITCAST, sdl, DestVT, ShAmt);
Res = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, sdl, DestVT,		Res = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, sdl, DestVT,
DAG.getConstant(NewIntrinsic, MVT::i32),		DAG.getConstant(NewIntrinsic, MVT::i32),
getValue(I.getArgOperand(0)), ShAmt);		getValue(I.getArgOperand(0)), ShAmt);
setValue(&I, Res);		setValue(&I, Res);
return nullptr;		return nullptr;
}		}
case Intrinsic::x86_avx_vinsertf128_pd_256:
case Intrinsic::x86_avx_vinsertf128_ps_256:
case Intrinsic::x86_avx_vinsertf128_si_256:
case Intrinsic::x86_avx2_vinserti128: {		case Intrinsic::x86_avx2_vinserti128: {
EVT DestVT = TLI.getValueType(I.getType());		EVT DestVT = TLI.getValueType(I.getType());
EVT ElVT = TLI.getValueType(I.getArgOperand(1)->getType());		EVT ElVT = TLI.getValueType(I.getArgOperand(1)->getType());
uint64_t Idx = (cast<ConstantInt>(I.getArgOperand(2))->getZExtValue() & 1) *		uint64_t Idx = (cast<ConstantInt>(I.getArgOperand(2))->getZExtValue() & 1) *
ElVT.getVectorNumElements();		ElVT.getVectorNumElements();
Res =		Res =
DAG.getNode(ISD::INSERT_SUBVECTOR, sdl, DestVT,		DAG.getNode(ISD::INSERT_SUBVECTOR, sdl, DestVT,
getValue(I.getArgOperand(0)), getValue(I.getArgOperand(1)),		getValue(I.getArgOperand(0)), getValue(I.getArgOperand(1)),
[2,863 lines hidden]

llvm/trunk/lib/IR/AutoUpgrade.cpp

//===-- AutoUpgrade.cpp - Implement auto-upgrade helper functions ---------===//		//===-- AutoUpgrade.cpp - Implement auto-upgrade helper functions ---------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements the auto-upgrade helper functions		// This file implements the auto-upgrade helper functions.
		// This is where deprecated IR intrinsics and other IR features are updated to
		// current specifications.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/IR/AutoUpgrade.h"		#include "llvm/IR/AutoUpgrade.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/CallSite.h"		#include "llvm/IR/CallSite.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
#include "llvm/IR/DIBuilder.h"		#include "llvm/IR/DIBuilder.h"
[132 lines hidden; context: case 'o':]
break;		break;

case 'x': {		case 'x': {
if (Name.startswith("x86.sse2.pcmpeq.") ||		if (Name.startswith("x86.sse2.pcmpeq.") ||
Name.startswith("x86.sse2.pcmpgt.") ||		Name.startswith("x86.sse2.pcmpgt.") ||
Name.startswith("x86.avx2.pcmpeq.") ||		Name.startswith("x86.avx2.pcmpeq.") ||
Name.startswith("x86.avx2.pcmpgt.") ||		Name.startswith("x86.avx2.pcmpgt.") ||
Name.startswith("x86.avx.vpermil.") ||		Name.startswith("x86.avx.vpermil.") ||
		Name == "x86.avx.vinsertf128.pd.256" ||
		Name == "x86.avx.vinsertf128.ps.256" ||
		Name == "x86.avx.vinsertf128.si.256" ||
Name == "x86.avx.movnt.dq.256" ||		Name == "x86.avx.movnt.dq.256" ||
Name == "x86.avx.movnt.pd.256" ||		Name == "x86.avx.movnt.pd.256" ||
Name == "x86.avx.movnt.ps.256" ||		Name == "x86.avx.movnt.ps.256" ||
Name == "x86.sse42.crc32.64.8" ||		Name == "x86.sse42.crc32.64.8" ||
Name == "x86.avx.vbroadcast.ss" ||		Name == "x86.avx.vbroadcast.ss" ||
Name == "x86.avx.vbroadcast.ss.256" ||		Name == "x86.avx.vbroadcast.ss.256" ||
Name == "x86.avx.vbroadcast.sd.256" ||		Name == "x86.avx.vbroadcast.sd.256" ||
Name == "x86.sse2.psll.dq" ||		Name == "x86.sse2.psll.dq" ||
[454 lines hidden; context: if (Name.startswith("llvm.x86.sse2.pcmpeq.") ||]

SmallVector<Constant*, 16> Idxs;		SmallVector<Constant*, 16> Idxs;
for (unsigned i = 0; i != NumElts; ++i) {		for (unsigned i = 0; i != NumElts; ++i) {
unsigned Idx = ((Imm >> (i%8)) & 1) ? i + NumElts : i;		unsigned Idx = ((Imm >> (i%8)) & 1) ? i + NumElts : i;
Idxs.push_back(Builder.getInt32(Idx));		Idxs.push_back(Builder.getInt32(Idx));
}		}

Rep = Builder.CreateShuffleVector(Op0, Op1, ConstantVector::get(Idxs));		Rep = Builder.CreateShuffleVector(Op0, Op1, ConstantVector::get(Idxs));
		} else if (Name == "llvm.x86.avx.vinsertf128.pd.256" ||
		Name == "llvm.x86.avx.vinsertf128.ps.256" ||
		Name == "llvm.x86.avx.vinsertf128.si.256") {
		Value *Op0 = CI->getArgOperand(0);
		Value *Op1 = CI->getArgOperand(1);
		unsigned Imm = cast<ConstantInt>(CI->getArgOperand(2))->getZExtValue();
		VectorType *VecTy = cast<VectorType>(CI->getType());
		unsigned NumElts = VecTy->getNumElements();

		// Mask off the high bits of the immediate value; hardware ignores those.
		Imm = Imm & 1;

		// Extend the second operand into a vector that is twice as big.
		Value *UndefV = UndefValue::get(Op1->getType());
		SmallVector<Constant*, 8> Idxs;
		for (unsigned i = 0; i != NumElts; ++i) {
		Idxs.push_back(Builder.getInt32(i));
		}
		Rep = Builder.CreateShuffleVector(Op1, UndefV, ConstantVector::get(Idxs));

		// Insert the second operand into the first operand.

		// Note that there is no guarantee that instruction lowering will actually
		// produce a vinsertf128 instruction for the created shuffles. In
		// particular, the 0 immediate case involves no lane changes, so it can
		// be handled as a blend.

		// Example of shuffle mask for 32-bit elements:
		// Imm = 1 <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
		// Imm = 0 <i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7 >

		SmallVector<Constant*, 8> Idxs2;
		// The low half of the result is either the low half of the 1st operand
		// or the low half of the 2nd operand (the inserted vector).
		for (unsigned i = 0; i != NumElts / 2; ++i) {
		unsigned Idx = Imm ? i : (i + NumElts);
		Idxs2.push_back(Builder.getInt32(Idx));
		}
		// The high half of the result is either the low half of the 2nd operand
		// (the inserted vector) or the high half of the 1st operand.
		for (unsigned i = NumElts / 2; i != NumElts; ++i) {
		unsigned Idx = Imm ? (i + NumElts / 2) : i;
		Idxs2.push_back(Builder.getInt32(Idx));
		}
		Rep = Builder.CreateShuffleVector(Op0, Rep, ConstantVector::get(Idxs2));
} else {		} else {
bool PD128 = false, PD256 = false, PS128 = false, PS256 = false;		bool PD128 = false, PD256 = false, PS128 = false, PS256 = false;
if (Name == "llvm.x86.avx.vpermil.pd.256")		if (Name == "llvm.x86.avx.vpermil.pd.256")
PD256 = true;		PD256 = true;
else if (Name == "llvm.x86.avx.vpermil.pd")		else if (Name == "llvm.x86.avx.vpermil.pd")
PD128 = true;		PD128 = true;
else if (Name == "llvm.x86.avx.vpermil.ps.256")		else if (Name == "llvm.x86.avx.vpermil.ps.256")
PS256 = true;		PS256 = true;
[270 lines hidden]

llvm/trunk/test/CodeGen/X86/avx-intrinsics-x86-upgrade.ll

	; RUN: llc < %s -mtriple=x86_64-apple-darwin -march=x86 -mcpu=corei7-avx | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-apple-darwin -march=x86 -mcpu=corei7-avx | FileCheck %s

				; We don't check any vinsertf128 variant with immediate 0 because that's just a blend.

				define <4 x double> @test_x86_avx_vinsertf128_pd_256_1(<4 x double> %a0, <2 x double> %a1) {
				; CHECK-LABEL: test_x86_avx_vinsertf128_pd_256_1:
				; CHECK: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				%res = call <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double> %a0, <2 x double> %a1, i8 1)
				ret <4 x double> %res
				}
				declare <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double>, <2 x double>, i8) nounwind readnone

				define <8 x float> @test_x86_avx_vinsertf128_ps_256_1(<8 x float> %a0, <4 x float> %a1) {
				; CHECK-LABEL: test_x86_avx_vinsertf128_ps_256_1:
				; CHECK: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				%res = call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> %a0, <4 x float> %a1, i8 1)
				ret <8 x float> %res
				}
				declare <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float>, <4 x float>, i8) nounwind readnone

				define <8 x i32> @test_x86_avx_vinsertf128_si_256_1(<8 x i32> %a0, <4 x i32> %a1) {
				; CHECK-LABEL: test_x86_avx_vinsertf128_si_256_1:
				; CHECK: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> %a0, <4 x i32> %a1, i8 1)
				ret <8 x i32> %res
				}

				; Verify that high bits of the immediate are masked off. This should be the equivalent
				; of a vinsertf128 $0 which should be optimized into a blend, so just check that it's
				; not a vinsertf128 $1.
				define <8 x i32> @test_x86_avx_vinsertf128_si_256_2(<8 x i32> %a0, <4 x i32> %a1) {
				; CHECK-LABEL: test_x86_avx_vinsertf128_si_256_2:
				; CHECK-NOT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> %a0, <4 x i32> %a1, i8 2)
				ret <8 x i32> %res
				}
				declare <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32>, <4 x i32>, i8) nounwind readnone

	define <4 x double> @test_x86_avx_blend_pd_256(<4 x double> %a0, <4 x double> %a1) {			define <4 x double> @test_x86_avx_blend_pd_256(<4 x double> %a0, <4 x double> %a1) {
	; CHECK: vblendpd			; CHECK: vblendpd
	%res = call <4 x double> @llvm.x86.avx.blend.pd.256(<4 x double> %a0, <4 x double> %a1, i32 7) ; <<4 x double>> [#uses=1]			%res = call <4 x double> @llvm.x86.avx.blend.pd.256(<4 x double> %a0, <4 x double> %a1, i32 7) ; <<4 x double>> [#uses=1]
	ret <4 x double> %res			ret <4 x double> %res
	}			}
	declare <4 x double> @llvm.x86.avx.blend.pd.256(<4 x double>, <4 x double>, i32) nounwind readnone			declare <4 x double> @llvm.x86.avx.blend.pd.256(<4 x double>, <4 x double>, i32) nounwind readnone


	[54 lines hidden]

llvm/trunk/test/CodeGen/X86/avx-intrinsics-x86.ll

	[2,181 lines hidden]
	define <4 x i32> @test_x86_avx_vextractf128_si_256(<8 x i32> %a0) {			define <4 x i32> @test_x86_avx_vextractf128_si_256(<8 x i32> %a0) {
	; CHECK: vextractf128			; CHECK: vextractf128
	%res = call <4 x i32> @llvm.x86.avx.vextractf128.si.256(<8 x i32> %a0, i8 7) ; <<4 x i32>> [#uses=1]			%res = call <4 x i32> @llvm.x86.avx.vextractf128.si.256(<8 x i32> %a0, i8 7) ; <<4 x i32>> [#uses=1]
	ret <4 x i32> %res			ret <4 x i32> %res
	}			}
	declare <4 x i32> @llvm.x86.avx.vextractf128.si.256(<8 x i32>, i8) nounwind readnone			declare <4 x i32> @llvm.x86.avx.vextractf128.si.256(<8 x i32>, i8) nounwind readnone


	define <4 x double> @test_x86_avx_vinsertf128_pd_256(<4 x double> %a0, <2 x double> %a1) {
	; CHECK: vinsertf128
	%res = call <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double> %a0, <2 x double> %a1, i8 7) ; <<4 x double>> [#uses=1]
	ret <4 x double> %res
	}
	declare <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double>, <2 x double>, i8) nounwind readnone


	define <8 x float> @test_x86_avx_vinsertf128_ps_256(<8 x float> %a0, <4 x float> %a1) {
	; CHECK: vinsertf128
	%res = call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> %a0, <4 x float> %a1, i8 7) ; <<8 x float>> [#uses=1]
	ret <8 x float> %res
	}
	declare <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float>, <4 x float>, i8) nounwind readnone


	define <8 x i32> @test_x86_avx_vinsertf128_si_256(<8 x i32> %a0, <4 x i32> %a1) {
	; CHECK: vinsertf128
	%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> %a0, <4 x i32> %a1, i8 7) ; <<8 x i32>> [#uses=1]
	ret <8 x i32> %res
	}
	declare <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32>, <4 x i32>, i8) nounwind readnone


	define <4 x double> @test_x86_avx_vperm2f128_pd_256(<4 x double> %a0, <4 x double> %a1) {			define <4 x double> @test_x86_avx_vperm2f128_pd_256(<4 x double> %a0, <4 x double> %a1) {
	; CHECK: vperm2f128			; CHECK: vperm2f128
	%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 7) ; <<4 x double>> [#uses=1]			%res = call <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double> %a0, <4 x double> %a1, i8 7) ; <<4 x double>> [#uses=1]
	ret <4 x double> %res			ret <4 x double> %res
	}			}
	declare <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double>, <4 x double>, i8) nounwind readnone			declare <4 x double> @llvm.x86.avx.vperm2f128.pd.256(<4 x double>, <4 x double>, i8) nounwind readnone


	[320 lines hidden]

llvm/trunk/test/CodeGen/X86/avx-vinsertf128.ll

	; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7-avx -mattr=+avx | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-apple-darwin -mattr=+avx | FileCheck %s
	; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7-avx -mattr=+avx | FileCheck -check-prefix=CHECK-SSE %s

				; CHECK-LABEL: A:
	; CHECK-NOT: vunpck			; CHECK-NOT: vunpck
	; CHECK: vinsertf128 $1			; CHECK: vinsertf128 $1
	define <8 x float> @A(<8 x float> %a) nounwind uwtable readnone ssp {			define <8 x float> @A(<8 x float> %a) nounwind uwtable readnone ssp {
	entry:			entry:
	%shuffle = shufflevector <8 x float> %a, <8 x float> undef, <8 x i32> <i32 8, i32 8, i32 8, i32 8, i32 0, i32 1, i32 2, i32 3>			%shuffle = shufflevector <8 x float> %a, <8 x float> undef, <8 x i32> <i32 8, i32 8, i32 8, i32 8, i32 0, i32 1, i32 2, i32 3>
	ret <8 x float> %shuffle			ret <8 x float> %shuffle
	}			}

				; CHECK-LABEL: B:
	; CHECK-NOT: vunpck			; CHECK-NOT: vunpck
	; CHECK: vinsertf128 $1			; CHECK: vinsertf128 $1
	define <4 x double> @B(<4 x double> %a) nounwind uwtable readnone ssp {			define <4 x double> @B(<4 x double> %a) nounwind uwtable readnone ssp {
	entry:			entry:
	%shuffle = shufflevector <4 x double> %a, <4 x double> undef, <4 x i32> <i32 4, i32 4, i32 0, i32 1>			%shuffle = shufflevector <4 x double> %a, <4 x double> undef, <4 x i32> <i32 4, i32 4, i32 0, i32 1>
	ret <4 x double> %shuffle			ret <4 x double> %shuffle
	}			}

	declare <2 x double> @llvm.x86.sse2.min.pd(<2 x double>, <2 x double>) nounwind readnone			declare <2 x double> @llvm.x86.sse2.min.pd(<2 x double>, <2 x double>) nounwind readnone

	declare <2 x double> @llvm.x86.sse2.min.sd(<2 x double>, <2 x double>) nounwind readnone			declare <2 x double> @llvm.x86.sse2.min.sd(<2 x double>, <2 x double>) nounwind readnone

	; Just check that no crash happens			; Just check that no crash happens
	; CHECK-SSE: _insert_crash			; CHECK-LABEL: _insert_crash:
	define void @insert_crash() nounwind {			define void @insert_crash() nounwind {
	allocas:			allocas:
	%v1.i.i451 = shufflevector <4 x double> zeroinitializer, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			%v1.i.i451 = shufflevector <4 x double> zeroinitializer, <4 x double> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
	%ret_0a.i.i.i452 = shufflevector <4 x double> %v1.i.i451, <4 x double> undef, <2 x i32> <i32 0, i32 1>			%ret_0a.i.i.i452 = shufflevector <4 x double> %v1.i.i451, <4 x double> undef, <2 x i32> <i32 0, i32 1>
	%vret_0.i.i.i454 = tail call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %ret_0a.i.i.i452, <2 x double> undef) nounwind			%vret_0.i.i.i454 = tail call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %ret_0a.i.i.i452, <2 x double> undef) nounwind
	%ret_val.i.i.i463 = tail call <2 x double> @llvm.x86.sse2.min.sd(<2 x double> %vret_0.i.i.i454, <2 x double> undef) nounwind			%ret_val.i.i.i463 = tail call <2 x double> @llvm.x86.sse2.min.sd(<2 x double> %vret_0.i.i.i454, <2 x double> undef) nounwind
	%ret.i1.i.i464 = extractelement <2 x double> %ret_val.i.i.i463, i32 0			%ret.i1.i.i464 = extractelement <2 x double> %ret_val.i.i.i463, i32 0
	%double2float = fptrunc double %ret.i1.i.i464 to float			%double2float = fptrunc double %ret.i1.i.i464 to float
	%smearinsert50 = insertelement <4 x float> undef, float %double2float, i32 3			%smearinsert50 = insertelement <4 x float> undef, float %double2float, i32 3
	%blendAsInt.i503 = bitcast <4 x float> %smearinsert50 to <4 x i32>			%blendAsInt.i503 = bitcast <4 x float> %smearinsert50 to <4 x i32>
	store <4 x i32> %blendAsInt.i503, <4 x i32>* undef, align 4			store <4 x i32> %blendAsInt.i503, <4 x i32>* undef, align 4
	ret void			ret void
	}			}

	;; DAG Combine must remove useless vinsertf128 instructions			;; DAG Combine must remove useless vinsertf128 instructions

	; CHECK: DAGCombineA			; CHECK-LABEL: DAGCombineA:
	; CHECK-NOT: vinsertf128 $1			; CHECK-NOT: vinsertf128 $1
	define <4 x i32> @DAGCombineA(<4 x i32> %v1) nounwind readonly {			define <4 x i32> @DAGCombineA(<4 x i32> %v1) nounwind readonly {
	%1 = shufflevector <4 x i32> %v1, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%1 = shufflevector <4 x i32> %v1, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%2 = shufflevector <8 x i32> %1, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%2 = shufflevector <8 x i32> %1, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	ret <4 x i32> %2			ret <4 x i32> %2
	}			}

	; CHECK: DAGCombineB			; CHECK-LABEL: DAGCombineB:
	; CHECK: vpaddd %xmm			; CHECK: vpaddd %xmm
	; CHECK-NOT: vinsertf128 $1			; CHECK-NOT: vinsertf128 $1
	; CHECK: vpaddd %xmm			; CHECK: vpaddd %xmm
	define <8 x i32> @DAGCombineB(<8 x i32> %v1, <8 x i32> %v2) nounwind readonly {			define <8 x i32> @DAGCombineB(<8 x i32> %v1, <8 x i32> %v2) nounwind readonly {
	%1 = add <8 x i32> %v1, %v2			%1 = add <8 x i32> %v1, %v2
	%2 = add <8 x i32> %1, %v1			%2 = add <8 x i32> %1, %v1
	ret <8 x i32> %2			ret <8 x i32> %2
	}			}

	; CHECK: insert_pd			; CHECK-LABEL: insert_undef_pd:
	define <4 x double> @insert_pd(<4 x double> %a0, <2 x double> %a1) {
	; CHECK: vinsertf128
	%res = call <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double> %a0, <2 x double> %a1, i8 0)
	ret <4 x double> %res
	}

	; CHECK: insert_undef_pd
	define <4 x double> @insert_undef_pd(<4 x double> %a0, <2 x double> %a1) {			define <4 x double> @insert_undef_pd(<4 x double> %a0, <2 x double> %a1) {
	; CHECK: vmovaps %ymm1, %ymm0			; CHECK: vmovaps %ymm1, %ymm0
	%res = call <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double> undef, <2 x double> %a1, i8 0)			%res = call <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double> undef, <2 x double> %a1, i8 0)
	ret <4 x double> %res			ret <4 x double> %res
	}			}
	declare <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double>, <2 x double>, i8) nounwind readnone			declare <4 x double> @llvm.x86.avx.vinsertf128.pd.256(<4 x double>, <2 x double>, i8) nounwind readnone


	; CHECK: insert_ps			; CHECK-LABEL: insert_undef_ps:
	define <8 x float> @insert_ps(<8 x float> %a0, <4 x float> %a1) {
	; CHECK: vinsertf128
	%res = call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> %a0, <4 x float> %a1, i8 0)
	ret <8 x float> %res
	}

	; CHECK: insert_undef_ps
	define <8 x float> @insert_undef_ps(<8 x float> %a0, <4 x float> %a1) {			define <8 x float> @insert_undef_ps(<8 x float> %a0, <4 x float> %a1) {
	; CHECK: vmovaps %ymm1, %ymm0			; CHECK: vmovaps %ymm1, %ymm0
	%res = call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %a1, i8 0)			%res = call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %a1, i8 0)
	ret <8 x float> %res			ret <8 x float> %res
	}			}
	declare <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float>, <4 x float>, i8) nounwind readnone			declare <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float>, <4 x float>, i8) nounwind readnone


	; CHECK: insert_si			; CHECK-LABEL: insert_undef_si:
	define <8 x i32> @insert_si(<8 x i32> %a0, <4 x i32> %a1) {
	; CHECK: vinsertf128
	%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> %a0, <4 x i32> %a1, i8 0)
	ret <8 x i32> %res
	}

	; CHECK: insert_undef_si
	define <8 x i32> @insert_undef_si(<8 x i32> %a0, <4 x i32> %a1) {			define <8 x i32> @insert_undef_si(<8 x i32> %a0, <4 x i32> %a1) {
	; CHECK: vmovaps %ymm1, %ymm0			; CHECK: vmovaps %ymm1, %ymm0
	%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> undef, <4 x i32> %a1, i8 0)			%res = call <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32> undef, <4 x i32> %a1, i8 0)
	ret <8 x i32> %res			ret <8 x i32> %res
	}			}
	declare <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32>, <4 x i32>, i8) nounwind readnone			declare <8 x i32> @llvm.x86.avx.vinsertf128.si.256(<8 x i32>, <4 x i32>, i8) nounwind readnone

	; rdar://10643481			; rdar://10643481
	; CHECK: vinsertf128_combine			; CHECK-LABEL: vinsertf128_combine:
	define <8 x float> @vinsertf128_combine(float* nocapture %f) nounwind uwtable readonly ssp {			define <8 x float> @vinsertf128_combine(float* nocapture %f) nounwind uwtable readonly ssp {
	; CHECK-NOT: vmovaps			; CHECK-NOT: vmovaps
	; CHECK: vinsertf128			; CHECK: vinsertf128
	entry:			entry:
	%add.ptr = getelementptr inbounds float, float* %f, i64 4			%add.ptr = getelementptr inbounds float, float* %f, i64 4
	%0 = bitcast float* %add.ptr to <4 x float>*			%0 = bitcast float* %add.ptr to <4 x float>*
	%1 = load <4 x float>, <4 x float>* %0, align 16			%1 = load <4 x float>, <4 x float>* %0, align 16
	%2 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %1, i8 1)			%2 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %1, i8 1)
	ret <8 x float> %2			ret <8 x float> %2
	}			}

	; rdar://11076953			; rdar://11076953
	; CHECK: vinsertf128_ucombine			; CHECK-LABEL: vinsertf128_ucombine:
	define <8 x float> @vinsertf128_ucombine(float* nocapture %f) nounwind uwtable readonly ssp {			define <8 x float> @vinsertf128_ucombine(float* nocapture %f) nounwind uwtable readonly ssp {
	; CHECK-NOT: vmovups			; CHECK-NOT: vmovups
	; CHECK: vinsertf128			; CHECK: vinsertf128
	entry:			entry:
	%add.ptr = getelementptr inbounds float, float* %f, i64 4			%add.ptr = getelementptr inbounds float, float* %f, i64 4
	%0 = bitcast float* %add.ptr to <4 x float>*			%0 = bitcast float* %add.ptr to <4 x float>*
	%1 = load <4 x float>, <4 x float>* %0, align 8			%1 = load <4 x float>, <4 x float>* %0, align 8
	%2 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %1, i8 1)			%2 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> undef, <4 x float> %1, i8 1)
	ret <8 x float> %2			ret <8 x float> %2
	}			}

llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll

[42 lines hidden; context: define void @store32bytes(<8 x float> %A, <8 x float>* %P) {]

store <8 x float> %A, <8 x float>* %P, align 16		store <8 x float> %A, <8 x float>* %P, align 16
ret void		ret void
}		}

; Merge two consecutive 16-byte subvector loads into a single 32-byte load		; Merge two consecutive 16-byte subvector loads into a single 32-byte load
; if it's faster.		; if it's faster.

declare <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float>, <4 x float>, i8)

; Use the vinsertf128 intrinsic to model source code
; that explicitly uses AVX intrinsics.
define <8 x float> @combine_16_byte_loads(<4 x float>* %ptr) {
; CHECK-LABEL: combine_16_byte_loads

; SANDYB: vmovups
; SANDYB-NEXT: vinsertf128
; SANDYB-NEXT: retq

; BTVER2: vmovups
; BTVER2-NEXT: retq

; HASWELL: vmovups
; HASWELL-NEXT: retq

%ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 1
%ptr2 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 2
%v1 = load <4 x float>, <4 x float>* %ptr1, align 1
%v2 = load <4 x float>, <4 x float>* %ptr2, align 1
%shuffle = shufflevector <4 x float> %v1, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
%v3 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> %shuffle, <4 x float> %v2, i8 1)
ret <8 x float> %v3
}

; Swap the operands of the shufflevector and vinsertf128 to ensure that the
; pattern still matches.
define <8 x float> @combine_16_byte_loads_swap(<4 x float>* %ptr) {
; CHECK-LABEL: combine_16_byte_loads_swap

; SANDYB: vmovups
; SANDYB-NEXT: vinsertf128
; SANDYB-NEXT: retq

; BTVER2: vmovups
; BTVER2-NEXT: retq

; HASWELL: vmovups
; HASWELL-NEXT: retq

%ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 2
%ptr2 = getelementptr inbounds <4 x float>, <4 x float>* %ptr, i64 3
%v1 = load <4 x float>, <4 x float>* %ptr1, align 1
%v2 = load <4 x float>, <4 x float>* %ptr2, align 1
%shuffle = shufflevector <4 x float> %v2, <4 x float> undef, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 1, i32 2, i32 3>
%v3 = tail call <8 x float> @llvm.x86.avx.vinsertf128.ps.256(<8 x float> %shuffle, <4 x float> %v1, i8 0)
ret <8 x float> %v3
}

; Replace the vinsertf128 intrinsic with a shufflevector as might be
; expected from auto-vectorized code.
define <8 x float> @combine_16_byte_loads_no_intrinsic(<4 x float>* %ptr) {		define <8 x float> @combine_16_byte_loads_no_intrinsic(<4 x float>* %ptr) {
; CHECK-LABEL: combine_16_byte_loads_no_intrinsic		; CHECK-LABEL: combine_16_byte_loads_no_intrinsic

; SANDYB: vmovups		; SANDYB: vmovups
; SANDYB-NEXT: vinsertf128		; SANDYB-NEXT: vinsertf128
; SANDYB-NEXT: retq		; SANDYB-NEXT: retq

; BTVER2: vmovups		; BTVER2: vmovups
[178 lines hidden]