This is an archive of the discontinued LLVM Phabricator instance.

Scalarization for masked gather/scatter intrinsics
ClosedPublic

Authored by delena on Oct 14 2015, 5:27 AM.

Details

Reviewers

qcolombet
mzolotukhin
mkuper
hfinkel

Commits

rG092858588a1c: Scalarizer for masked.gather and masked.scatter intrinsics. When the target…
rL251237: Scalarizer for masked.gather and masked.scatter intrinsics.

Summary

Masked gather/scatter intrinsics are supported natively only on AVX-512 targets, and only for vectors of 32-bit and 64-bit integer and FP elements.
In all other cases, each intrinsic should be expanded into a chain of basic blocks containing a sequence of scalar load/store operations.

Example:
<16 x i32> @llvm.masked.gather.v16i32(<16 x i32*> %Ptrs, i32 4, <16 x i1> %Mask, <16 x i32> %Src)
is translated to:

%Mask0 = extractelement <16 x i1> %Mask, i32 0
%ToLoad0 = icmp eq i1 %Mask0, true
br i1 %ToLoad0, label %cond.load, label %else

cond.load:
%Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
%Load0 = load i32, i32* %Ptr0, align 4
%Res0 = insertelement <16 x i32> undef, i32 %Load0, i32 0
br label %else

else:
%res.phi.else = phi <16 x i32> [%Res0, %cond.load], [undef, %0]
%Mask1 = extractelement <16 x i1> %Mask, i32 1
%ToLoad1 = icmp eq i1 %Mask1, true
...
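
For readers who prefer a scalar model over IR, the semantics that the expansion above has to preserve can be sketched in plain C++. This is only an illustration, not code from the patch: `maskedGatherModel` is a hypothetical name, and `std::array` stands in for the LLVM vector types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar model of llvm.masked.gather: lane I is loaded from Ptrs[I] only when
// Mask[I] is set; otherwise the lane keeps the pass-through value Src[I].
template <typename T, std::size_t N>
std::array<T, N> maskedGatherModel(const std::array<const T *, N> &Ptrs,
                                   const std::array<bool, N> &Mask,
                                   const std::array<T, N> &Src) {
  std::array<T, N> Result = Src;
  for (std::size_t I = 0; I < N; ++I) {
    if (Mask[I]) { // plays the role of "br i1 %ToLoadI, ..."
      assert(Ptrs[I] && "an active lane must have a valid pointer");
      Result[I] = *Ptrs[I]; // plays the role of the cond.load block
    }
  }
  return Result;
}
```

Inactive lanes never dereference their pointer, which is why the expansion needs one conditional block per lane rather than unconditional loads followed by a select.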

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 37337. Oct 14 2015, 5:27 AM

delena retitled this revision from to Scalarization for masked gather/scatter intrinsics.

delena updated this object.

delena added reviewers: qcolombet, hfinkel, mzolotukhin.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

delena added a reviewer: mkuper. Oct 15 2015, 2:47 AM

Hi Elena,
Some comments inline.

../include/llvm/Analysis/TargetTransformInfo.h
580 ↗	(On Diff #37337)	How is this different from isLegalMaskedStore() with Consecutive == 0?
../lib/CodeGen/CodeGenPrepare.cpp
1123 ↗	(On Diff #37337)	Perhaps these changes belong in a separate patch from the addition of ScalarizeMaskedGather()?
1128 ↗	(On Diff #37337)	Why is this a dyn_cast, if the result is assumed to be non-null in line 1131? If this is just for the assert, I'd make this a cast, and make the assert assert(isa<VectorType>(CI->getType())).
1155 ↗	(On Diff #37337)	This is a bit odd. If AlignVal >= VecType->getScalarSizeInBits(), then this is a nop. So, let's say originally AlignVal < VecType->getScalarSizeInBits(). This means that you're raising the alignment for the store of the first element. Are you sure that's what you want?
1252 ↗	(On Diff #37337)	Same comments apply as for the MaskedLoad.
1358 ↗	(On Diff #37337)	This looks very similar to ScalarizeMaskedLoad, except for the all-ones case. Can they perhaps be combined?
1387 ↗	(On Diff #37337)	Let's say the mask is a constant, but not an all-ones constant. We could just iterate over the bits of the mask, and place only the loads where the bit is actually 1, instead of creating all of the branchy code. Normally it wouldn't matter, since SimplifyCFG (I think) would clean it up - but I don't think there's a SimplifyCFG run between here and ISel. Does something else clean up the mess for the non-all-ones constant case?
../lib/Target/X86/X86TargetTransformInfo.cpp
1173 ↗	(On Diff #37337)	In case you're going to touch this code anyway: isn't it enough to check hasAVX2()? (That is, doesn't hasAVX512() imply hasAVX2()?)
1190 ↗	(On Diff #37337)	Shouldn't there be an upper limit to the DataWidth too? (Or to the vector element count, for that matter?)
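
The constant-mask suggestion in the comment on line 1387 can be illustrated with a small planning model. It is a sketch under stated assumptions, not the patch's code: `Op` and `planGather` are hypothetical names, and the "plan" is just a list of the operations the expansion would emit. With a known mask only the active lanes get a load; with an unknown mask every lane gets a branch and a load.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operation kinds that the scalarized expansion would emit.
enum class Op { Branch, Load };

// ConstMask is null when the mask is not a compile-time constant.
std::vector<Op> planGather(std::size_t VectorWidth,
                           const std::vector<bool> *ConstMask) {
  std::vector<Op> Plan;
  for (std::size_t I = 0; I < VectorWidth; ++I) {
    if (ConstMask) {
      assert(ConstMask->size() == VectorWidth && "mask/vector width mismatch");
      // Known mask bit: emit a straight-line load, or nothing at all.
      if ((*ConstMask)[I])
        Plan.push_back(Op::Load);
      continue;
    }
    // Unknown mask bit: test it at run time, then load conditionally.
    Plan.push_back(Op::Branch);
    Plan.push_back(Op::Load);
  }
  return Plan;
}
```

For a 4-lane gather with the constant mask {1, 0, 0, 1}, this plan contains two loads and no branches, instead of four branch/load pairs.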

I'll split this into separate patches:

Remove the "Consecutive" argument from isLegalMaskedLoad() / isLegalMaskedStore()
Alignments and constant masks for the masked load / store scalarization procedure
Gather/Scatter scalarization procedure

../include/llvm/Analysis/TargetTransformInfo.h
580 ↗	(On Diff #37337)	You are right. I forgot about "Consecutive". I decided to remove "Consecutive" from load/store.
../lib/CodeGen/CodeGenPrepare.cpp
1123 ↗	(On Diff #37337)	I added alignment handling for gather/scatter, and here as well. I can put it in a separate patch.
1128 ↗	(On Diff #37337)	The form is not so important. I use VecType later and this form is more convenient for me.
1155 ↗	(On Diff #37337)	Thanks! I changed it to "min" (that is what I initially meant). If a masked operation has alignment 16 (it is a vector op), I can't put 16 on the scalar access, so I reduce it to the scalar type's size.
1358 ↗	(On Diff #37337)	It is similar but not the same. I also extract a pointer from the vector of pointers. I don't want to mix scatter and store, gather and load.
1387 ↗	(On Diff #37337)	Yes, this optimization is possible. For load/store as well. I'll add it.
../lib/Target/X86/X86TargetTransformInfo.cpp
1173 ↗	(On Diff #37337)	This function should be changed anyway. We have masked load/store on AVX, not only AVX2. We also have masked load/store for i16 and i8 vectors on SKX, but I need to add CodeGen support for this. I'll change it to hasAVX2 in the meantime.
1190 ↗	(On Diff #37337)	If the vector is too wide, the type legalizer will split it. I don't know what happens if the vector width is not a power of 2; I can reject this case.
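
The "min" fix described for line 1155 can be sketched as follows. This is a minimal illustration, not the code from the patch: `scalarAlignment` is a hypothetical helper, and it assumes the alignment is given in bytes while the element width is given in bits, so the width is converted to bytes before taking the minimum.

```cpp
#include <algorithm>
#include <cassert>

// Alignment to use for each scalar access when a masked vector load/store
// is scalarized. AlignVal is the vector operation's alignment in bytes;
// ScalarSizeInBits is the element width in bits (both units are assumptions
// of this sketch).
unsigned scalarAlignment(unsigned AlignVal, unsigned ScalarSizeInBits) {
  assert(ScalarSizeInBits % 8 == 0 && "sketch assumes byte-sized elements");
  // Element I sits at offset I * (element size), so only the first element
  // is guaranteed the full vector alignment; the rest are only guaranteed
  // min(vector alignment, element size).
  return std::min(AlignVal, ScalarSizeInBits / 8);
}
```

With a 16-byte-aligned masked operation on i32 elements, each scalar access gets alignment 4; an under-aligned operation (alignment 1) keeps alignment 1.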

I moved the load-store related changes to another patch. In this patch I have:

1. isLegalGather() / isLegalScatter() interface
2. code that scalarizes masked.gather and masked.scatter intrinsics when the interface (1) returns false.

LGTM with some nits.

../lib/CodeGen/CodeGenPrepare.cpp
1129 ↗	(On Diff #37724)	That's ok, I'm just saying I would prefer not to have an unchecked dyn_cast<>.
1361 ↗	(On Diff #37724)	Don't you need AlignVal = std::min(AlignVal, VecType->getScalarSizeInBits()), like for the load/store case?
1367 ↗	(On Diff #37724)	This gets eventually cleaned up if Src0 is undef, right?
1410 ↗	(On Diff #37724)	LoadInst* Load -> LoadInst *Load
1493 ↗	(On Diff #37724)	Same as above re AlignVal
../lib/Target/X86/X86TargetTransformInfo.cpp
1212 ↗	(On Diff #37724)	So, say, a <32 x i32> gather will get split into two <16 x i32> gathers by the legalizer? In any case: a) We should probably reject the not-power-of-2 case, unless we know the legalizer can handle it. If we don't reject it, then I think there should be a test for it. b) I guess the name is a bit confusing: it's not really "isLegal" (because some things it accepts may not actually be legal on the target), it's more like "isLegalizeable". I don't think we should change the name, but it may be worth noting this in the declaration in TargetTransformInfo.h.

This revision is now accepted and ready to land. Oct 22 2015, 4:16 AM

Hi Michael, I'm not committing the code yet. I want to hear your opinion about rejecting vectors with a non-power-of-2 number of elements.

../lib/CodeGen/CodeGenPrepare.cpp
1361 ↗	(On Diff #37724)	No. Gather/scatter alignment means the alignment of one element, not of the whole vector. If the specified alignment is bigger than the element size, we can't handle it properly on X86. The only option I see here is to scalarize the masked gather/scatter into a chain of scalar loads with the required alignment.
1367 ↗	(On Diff #37724)	Yes, the codegen will clean it up anyway.
../lib/Target/X86/X86TargetTransformInfo.cpp
1212 ↗	(On Diff #37724)	a) I can add rejection of the not-power-of-2 cases, and a test. But I want to know your opinion about vector and scalar types in this function. This is the comment I want to put inside:
// This function is called now in two cases: from the Loop Vectorizer
// and from the Scalarizer.
// When the Loop Vectorizer asks about legality of the feature,
// the vectorization factor is not calculated yet. The Loop Vectorizer
// sends a scalar type and the decision is based on the width of the
// scalar element.
// Later on, the cost model will estimate usage this intrinsic based on
// the vector type.
// The Scalarizer asks again about legality. It sends a vector type.
// In this case we can reject non-power-of-2 vectors.
if (isa<VectorType>(DataTy) && !isPowerOf2_32(DataTy->getVectorNumElements()))
return false;
Type *ScalarTy = DataTy->getScalarType();
...
b) "IsLegal" comes from TypeLegalizer and it is equal to "IsSupported". I wrote in the comments in TargetTransformInfo.h "Return true if the target supports.."
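
The legality rule discussed here can be modeled outside LLVM to make the two call patterns concrete. This is a hedged sketch: `DataTypeInfo`, `isPowerOf2`, and `isLegalMaskedGatherModel` are hypothetical stand-ins for `Type`, `isPowerOf2_32`, and the target hook, and the AVX-512 check is reduced to a boolean parameter.

```cpp
#include <cassert>

// Hypothetical stand-in for the parts of llvm::Type that the check uses.
struct DataTypeInfo {
  bool IsVector;
  unsigned NumElements;      // 1 for a scalar type
  unsigned ScalarSizeInBits; // width of the (element) scalar type
};

bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Models the rule from the thread: the Loop Vectorizer passes a scalar type
// (only the element width matters), while the scalarizer passes a vector
// type (the element count must also be a power of 2).
bool isLegalMaskedGatherModel(const DataTypeInfo &Ty, bool HasAVX512) {
  assert((Ty.IsVector || Ty.NumElements == 1) && "scalar has one element");
  if (Ty.IsVector && !isPowerOf2(Ty.NumElements))
    return false;
  return Ty.ScalarSizeInBits >= 32 && HasAVX512;
}
```

Under this model a scalar i32 query and a <16 x i32> query both succeed on an AVX-512 target, while a <3 x i32> query is rejected and would therefore be scalarized.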

mkuper added inline comments. Oct 25 2015, 7:16 AM

../lib/CodeGen/CodeGenPrepare.cpp
1361 ↗	(On Diff #37724)	Ah, right, of course.
../lib/Target/X86/X86TargetTransformInfo.cpp
1212 ↗	(On Diff #37724)	SGTM.

Closed by commit rL251237: Scalarizer for masked.gather and masked.scatter intrinsics. (authored by delena). Oct 25 2015, 8:40 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h        14 lines
llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h    4 lines
llvm/trunk/lib/Analysis/TargetTransformInfo.cpp               8 lines
llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp                     262 lines
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.h            2 lines
llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp          27 lines
llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll          85 lines

Diff 38347

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

[... 310 lines hidden ...] bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
unsigned AddrSpace = 0) const;

/// \brief Return true if the target supports masked load/store
/// AVX2 and AVX-512 targets allow masks for consecutive load and store for
/// 32 and 64 bit elements.
bool isLegalMaskedStore(Type *DataType) const;
bool isLegalMaskedLoad(Type *DataType) const;

		/// \brief Return true if the target supports masked gather/scatter
		/// AVX-512 fully supports gather and scatter for vectors with 32 and 64
		/// bits scalar type.
		bool isLegalMaskedScatter(Type *DataType) const;
		bool isLegalMaskedGather(Type *DataType) const;

/// \brief Return the cost of the scaling factor used in the addressing
/// mode represented by AM for this target, for a load/store
/// of the specified type.
/// If the AM is supported, the return value must be >= 0.
/// If the AM is not supported, it returns a negative value.
/// TODO: Handle pre/postinc as well.
int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale,
[... 237 lines hidden ...] public:
virtual bool isLegalAddImmediate(int64_t Imm) = 0;
virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
int64_t BaseOffset, bool HasBaseReg,
int64_t Scale,
unsigned AddrSpace) = 0;
virtual bool isLegalMaskedStore(Type *DataType) = 0;
virtual bool isLegalMaskedLoad(Type *DataType) = 0;
		virtual bool isLegalMaskedScatter(Type *DataType) = 0;
		virtual bool isLegalMaskedGather(Type *DataType) = 0;
virtual int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
int64_t BaseOffset, bool HasBaseReg,
int64_t Scale, unsigned AddrSpace) = 0;
virtual bool isTruncateFree(Type *Ty1, Type *Ty2) = 0;
virtual bool isZExtFree(Type *Ty1, Type *Ty2) = 0;
virtual bool isProfitableToHoist(Instruction *I) = 0;
virtual bool isTypeLegal(Type *Ty) = 0;
virtual unsigned getJumpBufAlignment() = 0;
[... 113 lines hidden ...] return Impl.isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);
}
bool isLegalMaskedStore(Type *DataType) override {
return Impl.isLegalMaskedStore(DataType);
}
bool isLegalMaskedLoad(Type *DataType) override {
return Impl.isLegalMaskedLoad(DataType);
}
		bool isLegalMaskedScatter(Type *DataType) override {
		return Impl.isLegalMaskedScatter(DataType);
		}
		bool isLegalMaskedGather(Type *DataType) override {
		return Impl.isLegalMaskedGather(DataType);
		}
int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale,
unsigned AddrSpace) override {
return Impl.getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);
}
bool isTruncateFree(Type *Ty1, Type *Ty2) override {
return Impl.isTruncateFree(Ty1, Ty2);
[... 224 lines hidden ...]

llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h

[... 207 lines hidden ...] bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
// taken from the implementation of LSR.
return !BaseGV && BaseOffset == 0 && (Scale == 0 || Scale == 1);
}

bool isLegalMaskedStore(Type *DataType) { return false; }

bool isLegalMaskedLoad(Type *DataType) { return false; }

		bool isLegalMaskedScatter(Type *DataType) { return false; }

		bool isLegalMaskedGather(Type *DataType) { return false; }

int getScalingFactorCost(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale, unsigned AddrSpace) {
// Guess that all legal addressing mode are free.
if (isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace))
return 0;
return -1;
}
[... 272 lines hidden ...]

llvm/trunk/lib/Analysis/TargetTransformInfo.cpp

[... 115 lines hidden ...]
bool TargetTransformInfo::isLegalMaskedStore(Type *DataType) const {
return TTIImpl->isLegalMaskedStore(DataType);
}

bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType) const {
return TTIImpl->isLegalMaskedLoad(DataType);
}

				bool TargetTransformInfo::isLegalMaskedGather(Type *DataType) const {
				return TTIImpl->isLegalMaskedGather(DataType);
				}

				bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType) const {
				return TTIImpl->isLegalMaskedScatter(DataType);
				}

int TargetTransformInfo::getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
int64_t BaseOffset,
bool HasBaseReg,
int64_t Scale,
unsigned AddrSpace) const {
int Cost = TTIImpl->getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);
assert(Cost >= 0 && "TTI should not produce negative costs!");
[... 255 lines hidden ...]

llvm/trunk/lib/CodeGen/CodeGenPrepare.cpp

[... 1,209 lines hidden ...] for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
// %Elt = load i32* %EltAddr
// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx
//
CondBlock = IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.load");
Builder.SetInsertPoint(InsertPt);

Value *Gep =
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
LoadInst* Load = Builder.CreateAlignedLoad(Gep, AlignVal);		LoadInst *Load = Builder.CreateAlignedLoad(Gep, AlignVal);
VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx));

// Create "else" block, fill it in the next iteration
BasicBlock *NewIfBlock =
CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");
Builder.SetInsertPoint(InsertPt);
Instruction *OldBr = IfBlock->getTerminator();
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
[... 121 lines hidden ...] for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
Instruction *OldBr = IfBlock->getTerminator();
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
OldBr->eraseFromParent();
IfBlock = NewIfBlock;
}
CI->eraseFromParent();
}

		// Translate a masked gather intrinsic like
		// <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*> %Ptrs, i32 4,
		// <16 x i1> %Mask, <16 x i32> %Src)
		// to a chain of basic blocks, with loading element one-by-one if
		// the appropriate mask bit is set
		//
		// %Ptrs = getelementptr i32, i32* %base, <16 x i64> %ind
		// %Mask0 = extractelement <16 x i1> %Mask, i32 0
		// %ToLoad0 = icmp eq i1 %Mask0, true
		// br i1 %ToLoad0, label %cond.load, label %else
		//
		// cond.load:
		// %Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
		// %Load0 = load i32, i32* %Ptr0, align 4
		// %Res0 = insertelement <16 x i32> undef, i32 %Load0, i32 0
		// br label %else
		//
		// else:
		// %res.phi.else = phi <16 x i32> [%Res0, %cond.load], [undef, %0]
		// %Mask1 = extractelement <16 x i1> %Mask, i32 1
		// %ToLoad1 = icmp eq i1 %Mask1, true
		// br i1 %ToLoad1, label %cond.load1, label %else2
		//
		// cond.load1:
		// %Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// %Load1 = load i32, i32* %Ptr1, align 4
		// %Res1 = insertelement <16 x i32> %res.phi.else, i32 %Load1, i32 1
		// br label %else2
		// . . .
		// %Result = select <16 x i1> %Mask, <16 x i32> %res.phi.select, <16 x i32> %Src
		// ret <16 x i32> %Result
		static void ScalarizeMaskedGather(CallInst *CI) {
		Value *Ptrs = CI->getArgOperand(0);
		Value *Alignment = CI->getArgOperand(1);
		Value *Mask = CI->getArgOperand(2);
		Value *Src0 = CI->getArgOperand(3);

		VectorType *VecType = dyn_cast<VectorType>(CI->getType());

		assert(VecType && "Unexpected return type of masked load intrinsic");

		IRBuilder<> Builder(CI->getContext());
		Instruction *InsertPt = CI;
		BasicBlock *IfBlock = CI->getParent();
		BasicBlock *CondBlock = nullptr;
		BasicBlock *PrevIfBlock = CI->getParent();
		Builder.SetInsertPoint(InsertPt);
		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();

		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		Value *UndefVal = UndefValue::get(VecType);

		// The result vector
		Value *VResult = UndefVal;
		unsigned VectorWidth = VecType->getNumElements();

		// Shorten the way if the mask is a vector of constants.
		bool IsConstMask = isa<ConstantVector>(Mask);

		if (IsConstMask) {
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		if (cast<ConstantVector>(Mask)->getOperand(Idx)->isNullValue())
		continue;
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		LoadInst *Load = Builder.CreateAlignedLoad(Ptr, AlignVal,
		"Load" + Twine(Idx));
		VResult = Builder.CreateInsertElement(VResult, Load,
		Builder.getInt32(Idx),
		"Res" + Twine(Idx));
		}
		Value *NewI = Builder.CreateSelect(Mask, VResult, Src0);
		CI->replaceAllUsesWith(NewI);
		CI->eraseFromParent();
		return;
		}

		PHINode *Phi = nullptr;
		Value *PrevPhi = UndefVal;

		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {

		// Fill the "else" block, created in the previous iteration
		//
		// %Mask1 = extractelement <16 x i1> %Mask, i32 1
		// %ToLoad1 = icmp eq i1 %Mask1, true
		// br i1 %ToLoad1, label %cond.load, label %else
		//
		if (Idx > 0) {
		Phi = Builder.CreatePHI(VecType, 2, "res.phi.else");
		Phi->addIncoming(VResult, CondBlock);
		Phi->addIncoming(PrevPhi, PrevIfBlock);
		PrevPhi = Phi;
		VResult = Phi;
		}

		Value *Predicate = Builder.CreateExtractElement(Mask,
		Builder.getInt32(Idx),
		"Mask" + Twine(Idx));
		Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
		ConstantInt::get(Predicate->getType(), 1),
		"ToLoad" + Twine(Idx));

		// Create "cond" block
		//
		// %EltAddr = getelementptr i32* %1, i32 0
		// %Elt = load i32* %EltAddr
		// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx
		//
		CondBlock = IfBlock->splitBasicBlock(InsertPt, "cond.load");
		Builder.SetInsertPoint(InsertPt);

		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		LoadInst *Load = Builder.CreateAlignedLoad(Ptr, AlignVal,
		"Load" + Twine(Idx));
		VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx),
		"Res" + Twine(Idx));

		// Create "else" block, fill it in the next iteration
		BasicBlock *NewIfBlock = CondBlock->splitBasicBlock(InsertPt, "else");
		Builder.SetInsertPoint(InsertPt);
		Instruction *OldBr = IfBlock->getTerminator();
		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
		OldBr->eraseFromParent();
		PrevIfBlock = IfBlock;
		IfBlock = NewIfBlock;
		}

		Phi = Builder.CreatePHI(VecType, 2, "res.phi.select");
		Phi->addIncoming(VResult, CondBlock);
		Phi->addIncoming(PrevPhi, PrevIfBlock);
		Value *NewI = Builder.CreateSelect(Mask, Phi, Src0);
		CI->replaceAllUsesWith(NewI);
		CI->eraseFromParent();
		}

		// Translate a masked scatter intrinsic, like
		// void @llvm.masked.scatter.v16i32(<16 x i32> %Src, <16 x i32*> %Ptrs, i32 4,
		// <16 x i1> %Mask)
		// to a chain of basic blocks, that stores element one-by-one if
		// the appropriate mask bit is set.
		//
		// %Ptrs = getelementptr i32, i32* %ptr, <16 x i64> %ind
		// %Mask0 = extractelement <16 x i1> %Mask, i32 0
		// %ToStore0 = icmp eq i1 %Mask0, true
		// br i1 %ToStore0, label %cond.store, label %else
		//
		// cond.store:
		// %Elt0 = extractelement <16 x i32> %Src, i32 0
		// %Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
		// store i32 %Elt0, i32* %Ptr0, align 4
		// br label %else
		//
		// else:
		// %Mask1 = extractelement <16 x i1> %Mask, i32 1
		// %ToStore1 = icmp eq i1 %Mask1, true
		// br i1 %ToStore1, label %cond.store1, label %else2
		//
		// cond.store1:
		// %Elt1 = extractelement <16 x i32> %Src, i32 1
		// %Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// store i32 %Elt1, i32* %Ptr1, align 4
		// br label %else2
		// . . .
		static void ScalarizeMaskedScatter(CallInst *CI) {
		Value *Src = CI->getArgOperand(0);
		Value *Ptrs = CI->getArgOperand(1);
		Value *Alignment = CI->getArgOperand(2);
		Value *Mask = CI->getArgOperand(3);

		assert(isa<VectorType>(Src->getType()) &&
		"Unexpected data type in masked scatter intrinsic");
		assert(isa<VectorType>(Ptrs->getType()) &&
		isa<PointerType>(Ptrs->getType()->getVectorElementType()) &&
		"Vector of pointers is expected in masked scatter intrinsic");

		IRBuilder<> Builder(CI->getContext());
		Instruction *InsertPt = CI;
		BasicBlock *IfBlock = CI->getParent();
		Builder.SetInsertPoint(InsertPt);
		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
		unsigned VectorWidth = Src->getType()->getVectorNumElements();

		// Shorten the way if the mask is a vector of constants.
		bool IsConstMask = isa<ConstantVector>(Mask);

		if (IsConstMask) {
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		if (cast<ConstantVector>(Mask)->getOperand(Idx)->isNullValue())
		continue;
		Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx),
		"Elt" + Twine(Idx));
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		Builder.CreateAlignedStore(OneElt, Ptr, AlignVal);
		}
		CI->eraseFromParent();
		return;
		}
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		// Fill the "else" block, created in the previous iteration
		//
		// %Mask1 = extractelement <16 x i1> %Mask, i32 Idx
		// %ToStore = icmp eq i1 %Mask1, true
		// br i1 %ToStore, label %cond.store, label %else
		//
		Value *Predicate = Builder.CreateExtractElement(Mask,
		Builder.getInt32(Idx),
		"Mask" + Twine(Idx));
		Value *Cmp =
		Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
		ConstantInt::get(Predicate->getType(), 1),
		"ToStore" + Twine(Idx));

		// Create "cond" block
		//
		// %Elt1 = extractelement <16 x i32> %Src, i32 1
		// %Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// store i32 %Elt1, i32* %Ptr1
		//
		BasicBlock *CondBlock = IfBlock->splitBasicBlock(InsertPt, "cond.store");
		Builder.SetInsertPoint(InsertPt);

		Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx),
		"Elt" + Twine(Idx));
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		Builder.CreateAlignedStore(OneElt, Ptr, AlignVal);

		// Create "else" block, fill it in the next iteration
		BasicBlock *NewIfBlock = CondBlock->splitBasicBlock(InsertPt, "else");
		Builder.SetInsertPoint(InsertPt);
		Instruction *OldBr = IfBlock->getTerminator();
		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
		OldBr->eraseFromParent();
		IfBlock = NewIfBlock;
		}
		CI->eraseFromParent();
		}

bool CodeGenPrepare::optimizeCallInst(CallInst *CI, bool& ModifiedDT) {
BasicBlock *BB = CI->getParent();

// Lower inline assembly if we can.
// If we found an inline asm expession, and if the target knows how to
// lower it to normal LLVM code, do so now.
if (TLI && isa<InlineAsm>(CI->getCalledValue())) {
if (TLI->ExpandInlineAsm(CI)) {
[... 91 lines hidden ...] if (II) {
case Intrinsic::masked_store: {
if (!TTI->isLegalMaskedStore(CI->getArgOperand(0)->getType())) {
ScalarizeMaskedStore(CI);
ModifiedDT = true;
return true;
}
return false;
}
		case Intrinsic::masked_gather: {
		if (!TTI->isLegalMaskedGather(CI->getType())) {
		ScalarizeMaskedGather(CI);
		ModifiedDT = true;
		return true;
		}
		return false;
		}
		case Intrinsic::masked_scatter: {
		if (!TTI->isLegalMaskedScatter(CI->getArgOperand(0)->getType())) {
		ScalarizeMaskedScatter(CI);
		ModifiedDT = true;
		return true;
		}
		return false;
		}
case Intrinsic::aarch64_stlxr:
case Intrinsic::aarch64_stxr: {
ZExtInst *ExtVal = dyn_cast<ZExtInst>(CI->getArgOperand(0));
if (!ExtVal || !ExtVal->hasOneUse() ||
ExtVal->getParent() == CI->getParent())
return false;
// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.
ExtVal->moveBefore(CI);
[... 3,504 lines hidden ...]

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.h

[... 84 lines hidden ...] public:

int getIntImmCost(const APInt &Imm, Type *Ty);

int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
Type *Ty);
bool isLegalMaskedLoad(Type *DataType);
bool isLegalMaskedStore(Type *DataType);
		bool isLegalMaskedGather(Type *DataType);
		bool isLegalMaskedScatter(Type *DataType);
bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;

/// @}
};

} // end namespace llvm

#endif

llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp

[... 1,197 lines hidden ...] bool X86TTIImpl::isLegalMaskedLoad(Type *DataTy) {

return (DataWidth >= 32 && ST->hasAVX2());
}

bool X86TTIImpl::isLegalMaskedStore(Type *DataType) {
return isLegalMaskedLoad(DataType);
}

		bool X86TTIImpl::isLegalMaskedGather(Type *DataTy) {
		// This function is called now in two cases: from the Loop Vectorizer
		// and from the Scalarizer.
		// When the Loop Vectorizer asks about legality of the feature,
		// the vectorization factor is not calculated yet. The Loop Vectorizer
		// sends a scalar type and the decision is based on the width of the
		// scalar element.
		// Later on, the cost model will estimate usage this intrinsic based on
		// the vector type.
		// The Scalarizer asks again about legality. It sends a vector type.
		// In this case we can reject non-power-of-2 vectors.
		if (isa<VectorType>(DataTy) && !isPowerOf2_32(DataTy->getVectorNumElements()))
		return false;
		Type *ScalarTy = DataTy->getScalarType();
		// TODO: Pointers should also be legal,
		// but it requires additional support in composing intrinsics name.
		// getPrimitiveSizeInBits() returns 0 for PointerType
		int DataWidth = ScalarTy->getPrimitiveSizeInBits();

		// AVX-512 allows gather and scatter
		return DataWidth >= 32 && ST->hasAVX512();
		}

		bool X86TTIImpl::isLegalMaskedScatter(Type *DataType) {
		return isLegalMaskedGather(DataType);
		}

bool X86TTIImpl::areInlineCompatible(const Function *Caller,		bool X86TTIImpl::areInlineCompatible(const Function *Caller,
const Function *Callee) const {		const Function *Callee) const {
const TargetMachine &TM = getTLI()->getTargetMachine();		const TargetMachine &TM = getTLI()->getTargetMachine();

// Work this as a subsetting of subtarget features.		// Work this as a subsetting of subtarget features.
const FeatureBitset &CallerBits =		const FeatureBitset &CallerBits =
TM.getSubtargetImpl(*Caller)->getFeatureBits();		TM.getSubtargetImpl(*Caller)->getFeatureBits();
const FeatureBitset &CalleeBits =		const FeatureBitset &CalleeBits =
TM.getSubtargetImpl(*Callee)->getFeatureBits();		TM.getSubtargetImpl(*Callee)->getFeatureBits();

// FIXME: This is likely too limiting as it will include subtarget features		// FIXME: This is likely too limiting as it will include subtarget features
// that we might not care about for inlining, but it is conservatively		// that we might not care about for inlining, but it is conservatively
// correct.		// correct.
return (CallerBits & CalleeBits) == CalleeBits;		return (CallerBits & CalleeBits) == CalleeBits;
}		}
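
The legality rule added in isLegalMaskedGather() can be summarized outside of LLVM as a small predicate. The following is only a sketch under stated assumptions: DataTypeInfo, isPowerOf2 and isLegalMaskedGatherSketch are hypothetical stand-ins invented for illustration (they are not LLVM APIs), and HasAVX512 models the result of ST->hasAVX512(). A ScalarBits value of 0 models a pointer element, for which getPrimitiveSizeInBits() returns 0 according to the comment in the patch.

```cpp
// Hypothetical, simplified view of the information the hook reads from llvm::Type.
struct DataTypeInfo {
  bool IsVector;        // true when a vector type is passed (Scalarizer case)
  unsigned NumElements; // only meaningful when IsVector is true
  unsigned ScalarBits;  // width of the scalar element; 0 models a pointer
};

// Hypothetical helper mirroring the intent of isPowerOf2_32().
inline bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Sketch of the decision made in the patch:
//  1. reject vectors whose element count is not a power of 2;
//  2. otherwise require an element width of at least 32 bits and AVX-512.
inline bool isLegalMaskedGatherSketch(const DataTypeInfo &Ty, bool HasAVX512) {
  if (Ty.IsVector && !isPowerOf2(Ty.NumElements))
    return false;
  return Ty.ScalarBits >= 32 && HasAVX512;
}
```

Under this model a <3 x i32> gather is rejected even with AVX-512 available, which matches the intent of the non-power-of-2 test at the end of the test file, and a scalar i32 query from the Loop Vectorizer is accepted only when AVX-512 is present.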

llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll

; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=knl < %s | FileCheck %s -check-prefix=KNL		; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=knl < %s | FileCheck %s -check-prefix=KNL
		; RUN: opt -mtriple=x86_64-apple-darwin -codegenprepare -mcpu=corei7-avx -S < %s | FileCheck %s -check-prefix=SCALAR


target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"		target triple = "x86_64-unknown-linux-gnu"

; KNL-LABEL: test1		; KNL-LABEL: test1
; KNL: kxnorw %k1, %k1, %k1		; KNL: kxnorw %k1, %k1, %k1
; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}		; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}

		; SCALAR-LABEL: test1
		; SCALAR: extractelement <16 x float*>
		; SCALAR-NEXT: load float
		; SCALAR-NEXT: insertelement <16 x float>
		; SCALAR-NEXT: extractelement <16 x float*>
		; SCALAR-NEXT: load float

define <16 x float> @test1(float* %base, <16 x i32> %ind) {		define <16 x float> @test1(float* %base, <16 x i32> %ind) {

%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0		%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0
%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer

%sext_ind = sext <16 x i32> %ind to <16 x i64>		%sext_ind = sext <16 x i32> %ind to <16 x i64>
%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind

%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)		%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)
ret <16 x float>%res		ret <16 x float>%res
}		}

declare <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*>, i32, <16 x i1>, <16 x i32>)		declare <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*>, i32, <16 x i1>, <16 x i32>)
declare <16 x float> @llvm.masked.gather.v16f32(<16 x float*>, i32, <16 x i1>, <16 x float>)		declare <16 x float> @llvm.masked.gather.v16f32(<16 x float*>, i32, <16 x i1>, <16 x float>)
declare <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> , i32, <8 x i1> , <8 x i32> )		declare <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> , i32, <8 x i1> , <8 x i32> )

; KNL-LABEL: test2		; KNL-LABEL: test2
; KNL: kmovw %esi, %k1		; KNL: kmovw %esi, %k1
; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}		; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}

		; SCALAR-LABEL: test2
		; SCALAR: extractelement <16 x float*>
		; SCALAR-NEXT: load float
		; SCALAR-NEXT: insertelement <16 x float>
		; SCALAR-NEXT: br label %else
		; SCALAR: else:
		; SCALAR-NEXT: %res.phi.else = phi
		; SCALAR-NEXT: %Mask1 = extractelement <16 x i1> %imask, i32 1
		; SCALAR-NEXT: %ToLoad1 = icmp eq i1 %Mask1, true
		; SCALAR-NEXT: br i1 %ToLoad1, label %cond.load1, label %else2

define <16 x float> @test2(float* %base, <16 x i32> %ind, i16 %mask) {		define <16 x float> @test2(float* %base, <16 x i32> %ind, i16 %mask) {

%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0		%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0
%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer

%sext_ind = sext <16 x i32> %ind to <16 x i64>		%sext_ind = sext <16 x i32> %ind to <16 x i64>
%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind
%imask = bitcast i16 %mask to <16 x i1>		%imask = bitcast i16 %mask to <16 x i1>
...
define <16 x i32> @test4(i32* %base, <16 x i32> %ind, i16 %mask) {
...
ret <16 x i32> %res		ret <16 x i32> %res
}		}

; KNL-LABEL: test5		; KNL-LABEL: test5
; KNL: kmovw %k1, %k2		; KNL: kmovw %k1, %k2
; KNL: vpscatterdd {{.*}}%k2		; KNL: vpscatterdd {{.*}}%k2
; KNL: vpscatterdd {{.*}}%k1		; KNL: vpscatterdd {{.*}}%k1

		; SCALAR-LABEL: test5
		; SCALAR: %Mask0 = extractelement <16 x i1> %imask, i32 0
		; SCALAR-NEXT: %ToStore0 = icmp eq i1 %Mask0, true
		; SCALAR-NEXT: br i1 %ToStore0, label %cond.store, label %else
		; SCALAR: cond.store:
		; SCALAR-NEXT: %Elt0 = extractelement <16 x i32> %val, i32 0
		; SCALAR-NEXT: %Ptr0 = extractelement <16 x i32*> %gep.random, i32 0
		; SCALAR-NEXT: store i32 %Elt0, i32* %Ptr0, align 4
		; SCALAR-NEXT: br label %else
		; SCALAR: else:
		; SCALAR-NEXT: %Mask1 = extractelement <16 x i1> %imask, i32 1
		; SCALAR-NEXT: %ToStore1 = icmp eq i1 %Mask1, true
		; SCALAR-NEXT: br i1 %ToStore1, label %cond.store1, label %else2

define void @test5(i32* %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {		define void @test5(i32* %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {

%broadcast.splatinsert = insertelement <16 x i32*> undef, i32* %base, i32 0		%broadcast.splatinsert = insertelement <16 x i32*> undef, i32* %base, i32 0
%broadcast.splat = shufflevector <16 x i32*> %broadcast.splatinsert, <16 x i32*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x i32*> %broadcast.splatinsert, <16 x i32*> undef, <16 x i32> zeroinitializer

%gep.random = getelementptr i32, <16 x i32*> %broadcast.splat, <16 x i32> %ind		%gep.random = getelementptr i32, <16 x i32*> %broadcast.splat, <16 x i32> %ind
%imask = bitcast i16 %mask to <16 x i1>		%imask = bitcast i16 %mask to <16 x i1>
call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)		call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)
call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)		call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)
ret void		ret void
}		}
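
The SCALAR checks for test5 describe a chain of "test mask bit, conditionally store, fall through to else" blocks, one per lane. As a rough model of what that chain computes, here is a minimal C++ sketch. The function name scalarizedScatter and its signature are hypothetical and exist only for illustration; the real transformation works on LLVM IR inside CodeGenPrepare, not on std::array.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of the scalarized masked scatter: lanes are visited in
// increasing index order, and a lane is stored only when its mask bit is set.
template <typename T, std::size_t N>
void scalarizedScatter(const std::array<T, N> &Val,
                       const std::array<T *, N> &Ptrs,
                       const std::array<bool, N> &Mask) {
  for (std::size_t I = 0; I < N; ++I) {
    // Models: %MaskI = extractelement ..., %ToStoreI = icmp eq i1 %MaskI, true
    if (Mask[I]) {
      // Models the cond.store block: extract element and pointer, then store.
      *Ptrs[I] = Val[I];
    }
    // Falling out of the 'if' models the branch to the next else block.
  }
}
```

Because the lanes are processed in order in this model, when two active lanes point at the same address the higher lane's value is the one left in memory.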

declare void @llvm.masked.scatter.v8i32(<8 x i32> , <8 x i32*> , i32 , <8 x i1> )		declare void @llvm.masked.scatter.v8i32(<8 x i32> , <8 x i32*> , i32 , <8 x i1> )
declare void @llvm.masked.scatter.v16i32(<16 x i32> , <16 x i32*> , i32 , <16 x i1> )		declare void @llvm.masked.scatter.v16i32(<16 x i32> , <16 x i32*> , i32 , <16 x i1> )

; KNL-LABEL: test6		; KNL-LABEL: test6
; KNL: kxnorw %k1, %k1, %k1		; KNL: kxnorw %k1, %k1, %k1
; KNL: kxnorw %k2, %k2, %k2		; KNL: kxnorw %k2, %k2, %k2
; KNL: vpgatherqd (,%zmm{{.}}), %ymm{{.}} {%k2}		; KNL: vpgatherqd (,%zmm{{.}}), %ymm{{.}} {%k2}
; KNL: vpscatterqd %ymm{{.}}, (,%zmm{{.}}) {%k1}		; KNL: vpscatterqd %ymm{{.}}, (,%zmm{{.}}) {%k1}

		; SCALAR-LABEL: test6
		; SCALAR: store i32 %Elt0, i32* %Ptr01, align 4
		; SCALAR-NEXT: %Elt1 = extractelement <8 x i32> %a1, i32 1
		; SCALAR-NEXT: %Ptr12 = extractelement <8 x i32*> %ptr, i32 1
		; SCALAR-NEXT: store i32 %Elt1, i32* %Ptr12, align 4
		; SCALAR-NEXT: %Elt2 = extractelement <8 x i32> %a1, i32 2
		; SCALAR-NEXT: %Ptr23 = extractelement <8 x i32*> %ptr, i32 2
		; SCALAR-NEXT: store i32 %Elt2, i32* %Ptr23, align 4

define <8 x i32> @test6(<8 x i32>%a1, <8 x i32*> %ptr) {		define <8 x i32> @test6(<8 x i32>%a1, <8 x i32*> %ptr) {

%a = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)		%a = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)

call void @llvm.masked.scatter.v8i32(<8 x i32> %a1, <8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>)		call void @llvm.masked.scatter.v8i32(<8 x i32> %a1, <8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>)
ret <8 x i32>%a		ret <8 x i32>%a
}		}

...
define <16 x float> @test14(float* %base, i32 %ind, <16 x float*> %vec) {
...

%gep.random = getelementptr float, <16 x float*> %broadcast.splat, i32 %ind		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, i32 %ind

%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> undef, <16 x float> undef)		%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> undef, <16 x float> undef)
ret <16 x float>%res		ret <16 x float>%res
}		}


		; KNL-LABEL: test15
		; KNL: kmovw %eax, %k1
		; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}

		; SCALAR-LABEL: test15
		; SCALAR: extractelement <16 x float*>
		; SCALAR-NEXT: load float
		; SCALAR-NEXT: insertelement <16 x float>
		; SCALAR-NEXT: extractelement <16 x float*>
		; SCALAR-NEXT: load float

		define <16 x float> @test15(float* %base, <16 x i32> %ind) {

		%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0
		%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer

		%sext_ind = sext <16 x i32> %ind to <16 x i64>
		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind

		%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 false, i1 false, i1 true, i1 true, i1 false, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>, <16 x float> undef)
		ret <16 x float>%res
		}

		; Check non-power-of-2 case. It should be scalarized.
		declare <3 x i32> @llvm.masked.gather.v3i32(<3 x i32*>, i32, <3 x i1>, <3 x i32>)
		; KNL-LABEL: test16
		; KNL: testb
		; KNL: je
		; KNL: testb
		; KNL: je
		; KNL: testb
		; KNL: je
		define <3 x i32> @test16(<3 x i32*> %base, <3 x i32> %ind, <3 x i1> %mask, <3 x i32> %src0) {
		%sext_ind = sext <3 x i32> %ind to <3 x i64>
		%gep.random = getelementptr i32, <3 x i32*> %base, <3 x i64> %sext_ind
		%res = call <3 x i32> @llvm.masked.gather.v3i32(<3 x i32*> %gep.random, i32 4, <3 x i1> %mask, <3 x i32> %src0)
		ret <3 x i32>%res
		}
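
For readers who want the semantics of the scalarized gather (as in test16, where a mask and a pass-through value %src0 are both supplied) without reading the IR, the following C++ sketch may help. It is an illustrative model only: scalarizedGather is a hypothetical name, and the code assumes that lanes with a cleared mask bit keep the pass-through value and that their pointers are never dereferenced.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of the scalarized masked gather. The result starts as the
// pass-through vector; each active lane is then replaced by the loaded value.
template <typename T, std::size_t N>
std::array<T, N> scalarizedGather(const std::array<const T *, N> &Ptrs,
                                  const std::array<bool, N> &Mask,
                                  const std::array<T, N> &PassThru) {
  std::array<T, N> Res = PassThru;
  for (std::size_t I = 0; I < N; ++I) {
    // Models: br i1 %ToLoadI, label %cond.load, label %else
    if (Mask[I]) {
      // Models the cond.load block: extractelement, load, insertelement.
      Res[I] = *Ptrs[I];
    }
  }
  return Res;
}
```

In this model an inactive lane may hold any pointer, including a null one, since it is skipped entirely; this is the reason the expansion uses a branch per lane rather than loading unconditionally and selecting afterwards.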