This is an archive of the discontinued LLVM Phabricator instance.

Scalarization for masked gather/scatter intrinsics
ClosedPublic

Authored by delena on Oct 14 2015, 5:27 AM.

Details

Reviewers

qcolombet
mzolotukhin
mkuper
hfinkel

Commits

rG092858588a1c: Scalarizer for masked.gather and masked.scatter intrinsics. When the target…
rL251237: Scalarizer for masked.gather and masked.scatter intrinsics.

Summary

Masked gather/scatter intrinsics are supported on AVX-512 targets, and only for 32/64-bit integer and FP types.
In all other cases these intrinsics should be split into a chain of basic blocks containing a sequence of scalar load/store operations.

Example:
<16 x i32> @llvm.masked.gather.v16i32(<16 x i32*> %Ptrs, i32 4, <16 x i1> %Mask, <16 x i32> %Src)
is translated to:

%Mask0 = extractelement <16 x i1> %Mask, i32 0
%ToLoad0 = icmp eq i1 %Mask0, true
br i1 %ToLoad0, label %cond.load, label %else

cond.load:
%Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
%Load0 = load i32, i32* %Ptr0, align 4
%Res0 = insertelement <16 x i32> undef, i32 %Load0, i32 0
br label %else

else:
%res.phi.else = phi <16 x i32> [ %Res0, %cond.load ], [ undef, %0 ]
%Mask1 = extractelement <16 x i1> %Mask, i32 1
%ToLoad1 = icmp eq i1 %Mask1, true
...
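
The per-lane chain above can be modelled in plain C++ to show what result the scalarized code is expected to produce. This is only a sketch of the semantics, not the pass itself: the name maskedGatherModel and the use of std::array are assumptions made for illustration, and the model starts from the pass-through vector instead of building the result from undef and selecting against %Src at the end, which should give the same lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference model of a masked gather over N lanes (hypothetical helper).
// A lane whose mask bit is set is loaded through its own pointer; every other
// lane keeps the pass-through value and its pointer is never dereferenced.
template <std::size_t N>
std::array<int32_t, N>
maskedGatherModel(const std::array<const int32_t *, N> &Ptrs,
                  const std::array<bool, N> &Mask,
                  const std::array<int32_t, N> &PassThru) {
  std::array<int32_t, N> Result = PassThru;
  for (std::size_t Idx = 0; Idx < N; ++Idx) {
    if (Mask[Idx]) {
      // Corresponds to the "cond.load" block: reached only when the bit is set.
      assert(Ptrs[Idx] != nullptr && "active lane needs a valid pointer");
      Result[Idx] = *Ptrs[Idx];
    }
    // Corresponds to the "else" block: continue with the next lane.
  }
  return Result;
}
```

Inactive lanes may carry any pointer value, including null, because the branch guarding the load is never taken for them.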

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 37337. Oct 14 2015, 5:27 AM

delena retitled this revision to Scalarization for masked gather/scatter intrinsics.

delena updated this object.

delena added reviewers: qcolombet, hfinkel, mzolotukhin.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

delena added a reviewer: mkuper. Oct 15 2015, 2:47 AM

Hi Elena,
Some comments inline.

../include/llvm/Analysis/TargetTransformInfo.h
580	How is this different from isLegalMaskedStore() with Consecutive == 0?
../lib/CodeGen/CodeGenPrepare.cpp
1123	Perhaps these changes belong in a separate patch from the addition of ScalarizeMaskedGather()?
1128	Why is this a dyn_cast, if the result is assumed to be non-null in line 1131? If this is just for the assert, I'd make this a cast, and make the assert assert(isa<VectorType>(CI->getType())).
1155	This is a bit odd. If AlignVal >= VecType->getScalarSizeInBits(), then this is a nop. So, let's say originally AlignVal < VecType->getScalarSizeInBits(). This means that you're raising the alignment for the store of the first element. Are you sure that's what you want?
1252	Same comments apply as for the MaskedLoad.
1358	This looks very similar to ScalarizeMaskedLoad, except for the all-ones case. Can they perhaps be combined?
1387	Let's say the mask is a constant, but not an all-ones constant. We could just iterate over the bits of the mask, and place only the loads where the bit is actually 1, instead of creating all of the branchy code. Normally it wouldn't matter, since SimplifyCFG (I think) would clean it up - but I don't think there's a SimplifyCFG run between here and ISel. Does something else clean up the mess for the non-all-ones constant case?
../lib/Target/X86/X86TargetTransformInfo.cpp
1173	In case you're going to touch this code anyway - isn't it enough to check hasAVX2()? (That is, doesn't hasAVX512() imply hasAVX2()?)
1190	Shouldn't there be an upper limit to the DataWidth too? (Or to the vector element count, for that matter?)
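
The constant-mask idea raised for line 1387 can be sketched without any LLVM types: when every mask bit is known up front, the set of lanes that need a load can be computed directly, so no per-lane compare and branch is required. The helper name lanesToLoad and the std::vector<bool> mask representation are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for a mask whose bits are all known constants, list the
// lanes that need a scalar load. Lanes with a clear bit get neither a load nor
// a branch; they simply keep the pass-through element.
std::vector<std::size_t> lanesToLoad(const std::vector<bool> &ConstMask,
                                     std::size_t VectorWidth) {
  assert(ConstMask.size() == VectorWidth && "mask must have one bit per lane");
  std::vector<std::size_t> Lanes;
  for (std::size_t Idx = 0; Idx < VectorWidth; ++Idx)
    if (ConstMask[Idx])
      Lanes.push_back(Idx);
  return Lanes;
}
```

A scalarizer following this approach would emit one load and one insertelement per returned lane, all in the original basic block.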

I'll split this into separate patches:

Remove the "Consecutive" argument from isLegalMaskedLoad() / isLegalMaskedStore()
Alignments and constant masks for the masked load / store scalarization procedure
Gather/Scatter scalarization procedure

../include/llvm/Analysis/TargetTransformInfo.h
580	You are right. I forgot about "Consecutive". I decided to remove "Consecutive" from load/store.
../lib/CodeGen/CodeGenPrepare.cpp
1123	I added alignment for gather/scatter, and here as well. I can put it in a separate patch.
1128	The form is not so important. I use VecType later and this form is more convenient for me.
1155	Thanks! I changed it to "min" (that is what I initially meant). If the masked operation has alignment 16 (it is a vector op), I can't put 16 on the scalar. I reduce it to the scalar type.
1358	It is similar but not the same. I also extract a pointer from the vector of pointers. I don't want to mix scatter and store, gather and load.
1387	Yes, this optimization is possible. For load/store as well. I'll add it.
../lib/Target/X86/X86TargetTransformInfo.cpp
1173	This function should be changed anyway. We have masked load/store on AVX, not only AVX2. We have masked load/store for i16 and i8 vectors on SKX, but I need to add CodeGen support for this. I'll change it to hasAVX2 in the meantime.
1190	If the vector is too wide, the type legalizer will split it. I don't know what will happen if the vector width is not a power of 2; I can reject this case.
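
The "min" adjustment discussed for line 1155 can be illustrated with a small sketch. It assumes both quantities are expressed in bytes and are powers of two; the diff as posted passes getScalarSizeInBits(), which is a bit count, so the units here are an assumption of this example. The helper name scalarAccessAlignment is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: alignment that is safe to claim for each scalar access
// produced from a masked vector load/store. Element Idx lives at byte offset
// Idx * ElementSizeBytes from the vector base, so lanes other than lane 0 are
// only guaranteed the smaller of the two values.
unsigned scalarAccessAlignment(unsigned VectorAlignBytes,
                               unsigned ElementSizeBytes) {
  assert(ElementSizeBytes != 0 && "element size must be non-zero");
  return std::min(VectorAlignBytes, ElementSizeBytes);
}
```

With a 16-byte-aligned vector of 4-byte elements this yields 4, which matches the intent stated in the reply: the vector alignment cannot be carried over to the scalar accesses unchanged.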

I moved the load-store related changes to another patch. In this patch I have:

1. The isLegalMaskedGather() / isLegalMaskedScatter() interface.
2. Code that scalarizes masked.gather and masked.scatter intrinsics when the interface (1) returns false.

LGTM with some nits.

../lib/CodeGen/CodeGenPrepare.cpp
1130–1133	That's ok, I'm just saying I would prefer not to have an unchecked dyn_cast<>.
1392	Don't you need AlignVal = std::min(AlignVal, VecType->getScalarSizeInBits()), like for the load/store case?
1398	This eventually gets cleaned up if Src0 is undef, right?
1441	LoadInst* Load -> LoadInst *Load
1524	Same as above re AlignVal
../lib/Target/X86/X86TargetTransformInfo.cpp
1190	So, say, a <32 x i32> gather will get split into two <16 x i32> gathers by the legalizer? In any case:
a) We should probably reject the not-power-of-2 case, unless we know the legalizer can handle it. If we don't reject it, then I think there should be a test for it.
b) I guess the name is a bit confusing - it's not really "isLegal" (because some things it accepts may not actually be legal on the target), it's more like "isLegalizeable". I don't think we should change the name, but it may be worth noting this in the declaration in TargetTransformInfo.h.

This revision is now accepted and ready to land. Oct 22 2015, 4:16 AM

Hi Michael, I'm not committing the code yet. I want to hear your opinion about rejecting non-power-of-2 element counts.

../lib/CodeGen/CodeGenPrepare.cpp
1392	No. Gather/scatter alignment means alignment for one element, not for a vector. If the specified alignment is bigger than the element size, we can't handle it properly on X86. The only option that I see here is to scalarize the masked gather/scatter to a chain of scalar loads with the required alignment.
1398	Yes, the codegen will clean it up anyway.
../lib/Target/X86/X86TargetTransformInfo.cpp
1190	a) I can add rejection of the not-power-of-2 cases, and a test. But I want to know your opinion about vector and scalar types in this function. (This is the comment I want to put inside.)
// This function is called now in two cases: from the Loop Vectorizer
// and from the Scalarizer.
// When the Loop Vectorizer asks about legality of the feature,
// the vectorization factor is not calculated yet. The Loop Vectorizer
// sends a scalar type and the decision is based on the width of the
// scalar element.
// Later on, the cost model will estimate usage of this intrinsic based on
// the vector type.
// The Scalarizer asks again about legality. It sends a vector type.
// In this case we can reject non-power-of-2 vectors.
if (isa<VectorType>(DataTy) && !isPowerOf2_32(DataTy->getVectorNumElements()))
return false;
Type *ScalarTy = DataTy->getScalarType();
...
b) "IsLegal" comes from TypeLegalizer and it is equal to "IsSupported". I wrote in the comments in TargetTransformInfo.h "Return true if the target supports.."
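
The legality rule described in this thread can be sketched as a standalone predicate. Everything here is an illustrative assumption rather than the committed X86 hook: DataTypeInfo stands in for llvm::Type, HasAVX512 stands in for the subtarget query, and the 32/64-bit restriction is taken from the revision summary.

```cpp
#include <cassert>

// Hypothetical stand-in for the type information the hook inspects.
struct DataTypeInfo {
  bool IsVector;        // false when the Loop Vectorizer passes a scalar type
  unsigned NumElements; // 1 for scalars
  unsigned ScalarBits;  // width of one element in bits
};

bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Sketch of the decision: only AVX-512, only power-of-2 element counts when a
// vector type is passed, and only 32-bit or 64-bit elements.
bool isLegalMaskedGatherSketch(const DataTypeInfo &Ty, bool HasAVX512) {
  assert(Ty.NumElements != 0 && "a type has at least one element");
  if (!HasAVX512)
    return false;
  if (Ty.IsVector && !isPowerOf2(Ty.NumElements))
    return false;
  return Ty.ScalarBits == 32 || Ty.ScalarBits == 64;
}
```

A scalar query such as {false, 1, 64} is accepted on the element width alone, which mirrors the Loop Vectorizer case where the vectorization factor is not yet known.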

mkuper added inline comments. Oct 25 2015, 7:16 AM

../lib/CodeGen/CodeGenPrepare.cpp
1392	Ah, right, of course.
../lib/Target/X86/X86TargetTransformInfo.cpp
1190	SGTM.

Closed by commit rL251237: Scalarizer for masked.gather and masked.scatter intrinsics. (authored by delena). Oct 25 2015, 8:40 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
../include/llvm/Analysis/TargetTransformInfo.h	15 lines
../include/llvm/Analysis/TargetTransformInfoImpl.h	4 lines
../lib/Analysis/TargetTransformInfo.cpp	8 lines
../lib/CodeGen/CodeGenPrepare.cpp	308 lines
../lib/Target/X86/X86TargetTransformInfo.h	2 lines
../lib/Target/X86/X86TargetTransformInfo.cpp	16 lines
../test/CodeGen/X86/masked_gather_scatter.ll	46 lines

Diff 37337

../include/llvm/Analysis/TargetTransformInfo.h

[... 311 lines not shown ...]	public:

/// \brief Return true if the target works with masked instruction		/// \brief Return true if the target works with masked instruction
/// AVX2 allows masks for consecutive load and store for i32 and i64 elements.		/// AVX2 allows masks for consecutive load and store for i32 and i64 elements.
/// AVX-512 architecture will also allow masks for non-consecutive memory		/// AVX-512 architecture will also allow masks for non-consecutive memory
/// accesses.		/// accesses.
bool isLegalMaskedStore(Type *DataType, int Consecutive) const;		bool isLegalMaskedStore(Type *DataType, int Consecutive) const;
bool isLegalMaskedLoad(Type *DataType, int Consecutive) const;		bool isLegalMaskedLoad(Type *DataType, int Consecutive) const;

		/// \brief Return true if the target works with masked gather/scatter
		/// instruction. AVX2 supports only gathers without masks.
		/// AVX-512 architecture has full support for masked gather and
		/// scatter operations.
		bool isLegalMaskedScatter(Type *DataType) const;
		bool isLegalMaskedGather(Type *DataType) const;

/// \brief Return the cost of the scaling factor used in the addressing		/// \brief Return the cost of the scaling factor used in the addressing
/// mode represented by AM for this target, for a load/store		/// mode represented by AM for this target, for a load/store
/// of the specified type.		/// of the specified type.
/// If the AM is supported, the return value must be >= 0.		/// If the AM is supported, the return value must be >= 0.
/// If the AM is not supported, it returns a negative value.		/// If the AM is not supported, it returns a negative value.
/// TODO: Handle pre/postinc as well.		/// TODO: Handle pre/postinc as well.
int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,		int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale,		bool HasBaseReg, int64_t Scale,
[... 237 lines not shown ...]	public:
virtual bool isLegalAddImmediate(int64_t Imm) = 0;		virtual bool isLegalAddImmediate(int64_t Imm) = 0;
virtual bool isLegalICmpImmediate(int64_t Imm) = 0;		virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
virtual bool isLegalAddressingMode(Type Ty, GlobalValue BaseGV,		virtual bool isLegalAddressingMode(Type Ty, GlobalValue BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale,		int64_t Scale,
unsigned AddrSpace) = 0;		unsigned AddrSpace) = 0;
virtual bool isLegalMaskedStore(Type *DataType, int Consecutive) = 0;		virtual bool isLegalMaskedStore(Type *DataType, int Consecutive) = 0;
virtual bool isLegalMaskedLoad(Type *DataType, int Consecutive) = 0;		virtual bool isLegalMaskedLoad(Type *DataType, int Consecutive) = 0;
		virtual bool isLegalMaskedScatter(Type *DataType) = 0;
		mkuper: How is this different from isLegalMaskedStore() with Consecutive == 0?
		delena: You are right. I forgot about "Consecutive". I decided to remove "Consecutive" from load/store.
		virtual bool isLegalMaskedGather(Type *DataType) = 0;
virtual int getScalingFactorCost(Type Ty, GlobalValue BaseGV,		virtual int getScalingFactorCost(Type Ty, GlobalValue BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale, unsigned AddrSpace) = 0;		int64_t Scale, unsigned AddrSpace) = 0;
virtual bool isTruncateFree(Type Ty1, Type Ty2) = 0;		virtual bool isTruncateFree(Type Ty1, Type Ty2) = 0;
virtual bool isZExtFree(Type Ty1, Type Ty2) = 0;		virtual bool isZExtFree(Type Ty1, Type Ty2) = 0;
virtual bool isProfitableToHoist(Instruction *I) = 0;		virtual bool isProfitableToHoist(Instruction *I) = 0;
virtual bool isTypeLegal(Type *Ty) = 0;		virtual bool isTypeLegal(Type *Ty) = 0;
virtual unsigned getJumpBufAlignment() = 0;		virtual unsigned getJumpBufAlignment() = 0;
[... 113 lines not shown ...]	return Impl.isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);		Scale, AddrSpace);
}		}
bool isLegalMaskedStore(Type *DataType, int Consecutive) override {		bool isLegalMaskedStore(Type *DataType, int Consecutive) override {
return Impl.isLegalMaskedStore(DataType, Consecutive);		return Impl.isLegalMaskedStore(DataType, Consecutive);
}		}
bool isLegalMaskedLoad(Type *DataType, int Consecutive) override {		bool isLegalMaskedLoad(Type *DataType, int Consecutive) override {
return Impl.isLegalMaskedLoad(DataType, Consecutive);		return Impl.isLegalMaskedLoad(DataType, Consecutive);
}		}
		bool isLegalMaskedScatter(Type *DataType) override {
		return Impl.isLegalMaskedScatter(DataType);
		}
		bool isLegalMaskedGather(Type *DataType) override {
		return Impl.isLegalMaskedGather(DataType);
		}
int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,		int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale,		bool HasBaseReg, int64_t Scale,
unsigned AddrSpace) override {		unsigned AddrSpace) override {
return Impl.getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,		return Impl.getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);		Scale, AddrSpace);
}		}
bool isTruncateFree(Type Ty1, Type Ty2) override {		bool isTruncateFree(Type Ty1, Type Ty2) override {
return Impl.isTruncateFree(Ty1, Ty2);		return Impl.isTruncateFree(Ty1, Ty2);
[... 224 lines not shown ...]

../include/llvm/Analysis/TargetTransformInfoImpl.h

[... 207 lines not shown ...]	bool isLegalAddressingMode(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,
// taken from the implementation of LSR.		// taken from the implementation of LSR.
return !BaseGV && BaseOffset == 0 && (Scale == 0 \|\| Scale == 1);		return !BaseGV && BaseOffset == 0 && (Scale == 0 \|\| Scale == 1);
}		}

bool isLegalMaskedStore(Type *DataType, int Consecutive) { return false; }		bool isLegalMaskedStore(Type *DataType, int Consecutive) { return false; }

bool isLegalMaskedLoad(Type *DataType, int Consecutive) { return false; }		bool isLegalMaskedLoad(Type *DataType, int Consecutive) { return false; }

		bool isLegalMaskedScatter(Type *DataType) { return false; }

		bool isLegalMaskedGather(Type *DataType) { return false; }

int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,		int getScalingFactorCost(Type Ty, GlobalValue BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale, unsigned AddrSpace) {		bool HasBaseReg, int64_t Scale, unsigned AddrSpace) {
// Guess that all legal addressing mode are free.		// Guess that all legal addressing mode are free.
if (isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,		if (isLegalAddressingMode(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace))		Scale, AddrSpace))
return 0;		return 0;
return -1;		return -1;
}		}
[... 272 lines not shown ...]

../lib/Analysis/TargetTransformInfo.cpp

[... 117 lines not shown ...]	bool TargetTransformInfo::isLegalMaskedStore(Type *DataType,
return TTIImpl->isLegalMaskedStore(DataType, Consecutive);		return TTIImpl->isLegalMaskedStore(DataType, Consecutive);
}		}

bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType,		bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType,
int Consecutive) const {		int Consecutive) const {
return TTIImpl->isLegalMaskedLoad(DataType, Consecutive);		return TTIImpl->isLegalMaskedLoad(DataType, Consecutive);
}		}

		bool TargetTransformInfo::isLegalMaskedGather(Type *DataType) const {
		return TTIImpl->isLegalMaskedGather(DataType);
		}

		bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType) const {
		return TTIImpl->isLegalMaskedScatter(DataType);
		}

int TargetTransformInfo::getScalingFactorCost(Type Ty, GlobalValue BaseGV,		int TargetTransformInfo::getScalingFactorCost(Type Ty, GlobalValue BaseGV,
int64_t BaseOffset,		int64_t BaseOffset,
bool HasBaseReg,		bool HasBaseReg,
int64_t Scale,		int64_t Scale,
unsigned AddrSpace) const {		unsigned AddrSpace) const {
int Cost = TTIImpl->getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,		int Cost = TTIImpl->getScalingFactorCost(Ty, BaseGV, BaseOffset, HasBaseReg,
Scale, AddrSpace);		Scale, AddrSpace);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
[... 255 lines not shown ...]

../lib/CodeGen/CodeGenPrepare.cpp

[... 1,114 lines not shown ...]
//else2: ; preds = %else, %cond.load1		//else2: ; preds = %else, %cond.load1
// %res.phi.else3 = phi <16 x i32> [ %11, %cond.load1 ], [ %res.phi.else, %else ]		// %res.phi.else3 = phi <16 x i32> [ %11, %cond.load1 ], [ %res.phi.else, %else ]
// %12 = extractelement <16 x i1> %mask, i32 2		// %12 = extractelement <16 x i1> %mask, i32 2
// %13 = icmp eq i1 %12, true		// %13 = icmp eq i1 %12, true
// br i1 %13, label %cond.load4, label %else5		// br i1 %13, label %cond.load4, label %else5
//		//
static void ScalarizeMaskedLoad(CallInst *CI) {		static void ScalarizeMaskedLoad(CallInst *CI) {
Value *Ptr = CI->getArgOperand(0);		Value *Ptr = CI->getArgOperand(0);
Value *Src0 = CI->getArgOperand(3);		Value *Alignment = CI->getArgOperand(1);
		mkuper: Perhaps these changes belong in a separate patch from the addition of ScalarizeMaskedGather()?
		delena: I added alignment for gather/scatter, and here as well. I can put it in a separate patch.
Value *Mask = CI->getArgOperand(2);		Value *Mask = CI->getArgOperand(2);
VectorType *VecType = dyn_cast<VectorType>(CI->getType());		Value *Src0 = CI->getArgOperand(3);
Type *EltTy = VecType->getElementType();

		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
		VectorType *VecType = dyn_cast<VectorType>(CI->getType());
		mkuper: Why is this a dyn_cast, if the result is assumed to be non-null in line 1131? If this is just for the assert, I'd make this a cast, and make the assert assert(isa<VectorType>(CI->getType())).
		delena: The form is not so important. I use VecType later and this form is more convenient for me.
assert(VecType && "Unexpected return type of masked load intrinsic");		assert(VecType && "Unexpected return type of masked load intrinsic");

		Type *EltTy = VecType->getElementType();
		AlignVal = std::max(AlignVal, VecType->getScalarSizeInBits());

		mkuper: That's ok, I'm just saying I would prefer not to have an unchecked dyn_cast<>.
IRBuilder<> Builder(CI->getContext());		IRBuilder<> Builder(CI->getContext());
Instruction *InsertPt = CI;		Instruction *InsertPt = CI;
BasicBlock *IfBlock = CI->getParent();		BasicBlock *IfBlock = CI->getParent();
BasicBlock *CondBlock = nullptr;		BasicBlock *CondBlock = nullptr;
BasicBlock *PrevIfBlock = CI->getParent();		BasicBlock *PrevIfBlock = CI->getParent();
Builder.SetInsertPoint(InsertPt);

		Builder.SetInsertPoint(InsertPt);
Builder.SetCurrentDebugLocation(CI->getDebugLoc());		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		// Shorten the way if the mask is all-true.
		bool IsAllOnesMask = isa<Constant>(Mask) &&
		cast<Constant>(Mask)->isAllOnesValue();

		if (IsAllOnesMask) {
		Value *NewI = Builder.CreateAlignedLoad(Ptr, AlignVal);
		CI->replaceAllUsesWith(NewI);
		CI->eraseFromParent();
		return;
		}

		// Adjust alignment for the scalar instruction.
		AlignVal = std::max(AlignVal, VecType->getScalarSizeInBits());
		mkuper: This is a bit odd. If AlignVal >= VecType->getScalarSizeInBits(), then this is a nop. So, let's say originally AlignVal < VecType->getScalarSizeInBits(). This means that you're raising the alignment for the store of the first element. Are you sure that's what you want?
		delena: Thanks! I changed it to "min" (that is what I initially meant). If the masked operation has alignment 16 (it is a vector op), I can't put 16 on the scalar. I reduce it to the scalar type.
// Bitcast %addr fron i8* to EltTy*		// Bitcast %addr fron i8* to EltTy*
Type *NewPtrType =		Type *NewPtrType =
EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());		EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());
Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);		Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);
Value *UndefVal = UndefValue::get(VecType);		Value *UndefVal = UndefValue::get(VecType);

// The result vector		// The result vector
Value *VResult = UndefVal;		Value *VResult = UndefVal;
[... 29 lines not shown ...]	for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
// %Elt = load i32* %EltAddr		// %Elt = load i32* %EltAddr
// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx		// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx
//		//
CondBlock = IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.load");		CondBlock = IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.load");
Builder.SetInsertPoint(InsertPt);		Builder.SetInsertPoint(InsertPt);

Value *Gep =		Value *Gep =
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));		Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
LoadInst* Load = Builder.CreateLoad(Gep, false);		LoadInst* Load = Builder.CreateAlignedLoad(Gep, AlignVal);
VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx));		VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx));

// Create "else" block, fill it in the next iteration		// Create "else" block, fill it in the next iteration
BasicBlock *NewIfBlock =		BasicBlock *NewIfBlock =
CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");		CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");
Builder.SetInsertPoint(InsertPt);		Builder.SetInsertPoint(InsertPt);
Instruction *OldBr = IfBlock->getTerminator();		Instruction *OldBr = IfBlock->getTerminator();
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
[... 34 lines not shown ...]
//		//
// cond.store1: ; preds = %else		// cond.store1: ; preds = %else
// %8 = extractelement <16 x i32> %val, i32 1		// %8 = extractelement <16 x i32> %val, i32 1
// %9 = getelementptr i32* %1, i32 1		// %9 = getelementptr i32* %1, i32 1
// store i32 %8, i32* %9		// store i32 %8, i32* %9
// br label %else2		// br label %else2
// . . .		// . . .
static void ScalarizeMaskedStore(CallInst *CI) {		static void ScalarizeMaskedStore(CallInst *CI) {
Value *Ptr = CI->getArgOperand(1);
Value *Src = CI->getArgOperand(0);		Value *Src = CI->getArgOperand(0);
		mkuper: Same comments apply as for the MaskedLoad.
		Value *Ptr = CI->getArgOperand(1);
		Value *Alignment = CI->getArgOperand(2);
Value *Mask = CI->getArgOperand(3);		Value *Mask = CI->getArgOperand(3);

		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
VectorType *VecType = dyn_cast<VectorType>(Src->getType());		VectorType *VecType = dyn_cast<VectorType>(Src->getType());
Type *EltTy = VecType->getElementType();

assert(VecType && "Unexpected data type in masked store intrinsic");		assert(VecType && "Unexpected data type in masked store intrinsic");

		Type *EltTy = VecType->getElementType();

IRBuilder<> Builder(CI->getContext());		IRBuilder<> Builder(CI->getContext());
Instruction *InsertPt = CI;		Instruction *InsertPt = CI;
BasicBlock *IfBlock = CI->getParent();		BasicBlock *IfBlock = CI->getParent();
Builder.SetInsertPoint(InsertPt);		Builder.SetInsertPoint(InsertPt);
Builder.SetCurrentDebugLocation(CI->getDebugLoc());		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		// Shorten the way if the mask is all-true.
		bool IsAllOnesMask = isa<Constant>(Mask) &&
		cast<Constant>(Mask)->isAllOnesValue();

		if (IsAllOnesMask) {
		Builder.CreateAlignedStore(Src, Ptr, AlignVal);
		CI->eraseFromParent();
		return;
		}

		// Adjust alignment for the scalar instruction.
		AlignVal = std::max(AlignVal, VecType->getScalarSizeInBits());
// Bitcast %addr fron i8* to EltTy*		// Bitcast %addr fron i8* to EltTy*
Type *NewPtrType =		Type *NewPtrType =
EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());		EltTy->getPointerTo(cast<PointerType>(Ptr->getType())->getAddressSpace());
Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);		Value *FirstEltPtr = Builder.CreateBitCast(Ptr, NewPtrType);

unsigned VectorWidth = VecType->getNumElements();		unsigned VectorWidth = VecType->getNumElements();
for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {

// Fill the "else" block, created in the previous iteration		// Fill the "else" block, created in the previous iteration
//		//
// %mask_1 = extractelement <16 x i1> %mask, i32 Idx		// %mask_1 = extractelement <16 x i1> %mask, i32 Idx
// %to_store = icmp eq i1 %mask_1, true		// %to_store = icmp eq i1 %mask_1, true
// br i1 %to_load, label %cond.store, label %else		// br i1 %to_store, label %cond.store, label %else
//		//
Value *Predicate = Builder.CreateExtractElement(Mask, Builder.getInt32(Idx));		Value *Predicate = Builder.CreateExtractElement(Mask, Builder.getInt32(Idx));
Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,		Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
ConstantInt::get(Predicate->getType(), 1));		ConstantInt::get(Predicate->getType(), 1));

// Create "cond" block		// Create "cond" block
//		//
// %OneElt = extractelement <16 x i32> %Src, i32 Idx		// %OneElt = extractelement <16 x i32> %Src, i32 Idx
// %EltAddr = getelementptr i32* %1, i32 0		// %EltAddr = getelementptr i32* %1, i32 0
// %store i32 %OneElt, i32* %EltAddr		// %store i32 %OneElt, i32* %EltAddr
//		//
BasicBlock *CondBlock =		BasicBlock *CondBlock =
IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.store");		IfBlock->splitBasicBlock(InsertPt->getIterator(), "cond.store");
Builder.SetInsertPoint(InsertPt);		Builder.SetInsertPoint(InsertPt);

Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx));		Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx));
Value *Gep =		Value *Gep =
Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));		Builder.CreateInBoundsGEP(EltTy, FirstEltPtr, Builder.getInt32(Idx));
Builder.CreateStore(OneElt, Gep);		Builder.CreateAlignedStore(OneElt, Gep, AlignVal);

// Create "else" block, fill it in the next iteration		// Create "else" block, fill it in the next iteration
BasicBlock *NewIfBlock =		BasicBlock *NewIfBlock =
CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");		CondBlock->splitBasicBlock(InsertPt->getIterator(), "else");
Builder.SetInsertPoint(InsertPt);		Builder.SetInsertPoint(InsertPt);
Instruction *OldBr = IfBlock->getTerminator();		Instruction *OldBr = IfBlock->getTerminator();
BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
OldBr->eraseFromParent();		OldBr->eraseFromParent();
IfBlock = NewIfBlock;		IfBlock = NewIfBlock;
}		}
CI->eraseFromParent();		CI->eraseFromParent();
}		}

		// Translate a masked gather intrinsic like
		// <16 x i32 > @llvm.masked.gather.v16i32( <16 x i32*> %Ptrs, i32 4,
		// <16 x i1> %Mask, <16 x i32> %Src)
		// to a chain of basic blocks, with loading element one-by-one if
		// the appropriate mask bit is set
		//
		// %Ptrs = getelementptr i32, i32* %base, <16 x i64> %ind
		// %Mask0 = extractelement <16 x i1> %Mask, i32 0
		// % ToLoad0 = icmp eq i1 % Mask0, true
		// br i1 % 2, label %cond.load, label %else
		//
		// cond.load:
		// % Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
		// % Load0 = load i32, i32* % Ptr0, align 4
		// % Res0 = insertelement <16 x i32> undef, i32 % Load0, i32 0
		// br label %else
		//
		// else:
		// %res.phi.else = phi <16 x i32>[% Res0, %cond.load], [undef, % 0]
		// % Mask1 = extractelement <16 x i1> %Mask, i32 1
		// % ToLoad1 = icmp eq i1 % Mask1, true
		// br i1 % ToLoad1, label %cond.load1, label %else2
		//
		// cond.load1:
		// % Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// % Load1 = load i32, i32* % Ptr1, align 4
		// % Res1 = insertelement <16 x i32> %res.phi.else, i32 % Load1, i32 1
		// br label %else2
		// . . .
		// %81 = select <16 x i1> %Mask, <16 x i32> %res.phi.select, <16 x i32> %Src
		// ret <16 x i32> %81
		static void ScalarizeMaskedGather(CallInst *CI) {
		Value *Ptrs = CI->getArgOperand(0);
		mkuper: This looks very similar to ScalarizeMaskedLoad, except for the all-ones case. Can they perhaps be combined?
		delena: It is similar but not the same. I also extract a pointer from the vector of pointers. I don't want to mix scatter and store, gather and load.
		Value *Alignment = CI->getArgOperand(1);
		Value *Mask = CI->getArgOperand(2);
		Value *Src0 = CI->getArgOperand(3);

		VectorType *VecType = dyn_cast<VectorType>(CI->getType());

		assert(VecType && "Unexpected return type of masked load intrinsic");

		IRBuilder<> Builder(CI->getContext());
		Instruction *InsertPt = CI;
		BasicBlock *IfBlock = CI->getParent();
		BasicBlock *CondBlock = nullptr;
		BasicBlock *PrevIfBlock = CI->getParent();
		Builder.SetInsertPoint(InsertPt);
		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();

		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		Value *UndefVal = UndefValue::get(VecType);

		// The result vector
		Value *VResult = UndefVal;
		unsigned VectorWidth = VecType->getNumElements();

		// Shorten the way if the mask is all-true.
		bool IsAllOnesMask = isa<Constant>(Mask) &&
		cast<Constant>(Mask)->isAllOnesValue();

		if (IsAllOnesMask) {
		mkuper: Let's say the mask is a constant, but not an all-ones constant. We could just iterate over the bits of the mask, and place only the loads where the bit is actually 1, instead of creating all of the branchy code. Normally it wouldn't matter, since SimplifyCFG (I think) would clean it up - but I don't think there's a SimplifyCFG run between here and ISel. Does something else clean up the mess for the non-all-ones constant case?
		delena: Yes, this optimization is possible. For load/store as well. I'll add it.
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		LoadInst* Load = Builder.CreateAlignedLoad(Ptr, AlignVal,
		"Load" + Twine(Idx));
		mkuper: Don't you need AlignVal = std::min(AlignVal, VecType->getScalarSizeInBits()), like for the load/store case?
		delena: No. Gather / scatter alignment means alignment for one element, not for a vector. If the specified alignment is bigger than the element size, we can't handle it properly on X86. The only option that I see here is to scalarize masked gather/scatter to a chain of scalar loads with the required alignment.
		mkuper: Ah, right, of course.
		VResult = Builder.CreateInsertElement(VResult, Load,
		Builder.getInt32(Idx),
		"Res" + Twine(Idx));
		}
		CI->replaceAllUsesWith(VResult);
		CI->eraseFromParent();
		mkuper: This gets eventually cleaned up if Src0 is undef, right?
		delena: Yes, the codegen will clean it up anyway.
		return;
		}

		PHINode *Phi = nullptr;
		Value *PrevPhi = UndefVal;

		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {

		// Fill the "else" block, created in the previous iteration
		//
		// %Mask1 = extractelement <16 x i1> %Mask, i32 1
		// %ToLoad1 = icmp eq i1 %Mask1, true
		// br i1 %ToLoad1, label %cond.load, label %else
		//
		if (Idx > 0) {
		Phi = Builder.CreatePHI(VecType, 2, "res.phi.else");
		Phi->addIncoming(VResult, CondBlock);
		Phi->addIncoming(PrevPhi, PrevIfBlock);
		PrevPhi = Phi;
		VResult = Phi;
		}

		Value *Predicate = Builder.CreateExtractElement(Mask,
		Builder.getInt32(Idx),
		"Mask" + Twine(Idx));
		Value *Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
		ConstantInt::get(Predicate->getType(), 1),
		"ToLoad" + Twine(Idx));

		// Create "cond" block
		//
		// %EltAddr = getelementptr i32* %1, i32 0
		// %Elt = load i32* %EltAddr
		// VResult = insertelement <16 x i32> VResult, i32 %Elt, i32 Idx
		//
		CondBlock = IfBlock->splitBasicBlock(InsertPt, "cond.load");
		Builder.SetInsertPoint(InsertPt);

		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		LoadInst* Load = Builder.CreateAlignedLoad(Ptr, AlignVal,
		"Load" + Twine(Idx));
		VResult = Builder.CreateInsertElement(VResult, Load, Builder.getInt32(Idx),
		mkuper: LoadInst* Load -> LoadInst *Load (marked Done)
		"Res" + Twine(Idx));

		// Create "else" block, fill it in the next iteration
		BasicBlock *NewIfBlock = CondBlock->splitBasicBlock(InsertPt, "else");
		Builder.SetInsertPoint(InsertPt);
		Instruction *OldBr = IfBlock->getTerminator();
		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
		OldBr->eraseFromParent();
		PrevIfBlock = IfBlock;
		IfBlock = NewIfBlock;
		}

		Phi = Builder.CreatePHI(VecType, 2, "res.phi.select");
		Phi->addIncoming(VResult, CondBlock);
		Phi->addIncoming(PrevPhi, PrevIfBlock);
		Value *NewI = Builder.CreateSelect(Mask, Phi, Src0);
		CI->replaceAllUsesWith(NewI);
		CI->eraseFromParent();
		}
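
A minimal sketch of what the scalarized gather computes, written as plain C++ rather than against the LLVM API. The names `gatherModel` and `lanesToLoad` are hypothetical and do not appear in the patch. The first function mirrors the chain of cond.load blocks followed by the final select against Src0; the second illustrates the constant-mask suggestion from the inline comments above, where only the lanes whose mask bit is set would need a load and no branches would be required.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of a masked gather (not LLVM code).
// Lanes whose mask bit is set are loaded through their pointer;
// all other lanes keep the pass-through value (Src0 in the patch).
template <typename T, std::size_t N>
std::array<T, N> gatherModel(const std::array<const T *, N> &Ptrs,
                             const std::array<bool, N> &Mask,
                             const std::array<T, N> &Src0) {
  std::array<T, N> Result = Src0;
  for (std::size_t Idx = 0; Idx < N; ++Idx) {
    if (!Mask[Idx])
      continue; // masked-off lane: its pointer is never dereferenced
    assert(Ptrs[Idx] != nullptr && "enabled lane needs a valid pointer");
    Result[Idx] = *Ptrs[Idx]; // body of one cond.load block
  }
  return Result;
}

// For a mask known at compile time, list the lanes that need a load.
// A scalarizer could emit straight-line loads for exactly these lanes.
template <std::size_t N>
std::vector<std::size_t> lanesToLoad(const std::array<bool, N> &Mask) {
  std::vector<std::size_t> Lanes;
  for (std::size_t Idx = 0; Idx < N; ++Idx)
    if (Mask[Idx])
      Lanes.push_back(Idx);
  return Lanes;
}
```

In this model a masked-off lane may carry a null pointer, which matches the reason the patch guards each load with its own branch instead of loading unconditionally.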

		// Translate a masked scatter intrinsic, like
		// void @llvm.masked.scatter.v16i32(<16 x i32> %Src, <16 x i32> %Ptrs, i32 4,
		// <16 x i1> %Mask)
		// to a chain of basic blocks, that stores element one-by-one if
		// the appropriate mask bit is set.
		//
		// % Ptrs = getelementptr i32, i32* %ptr, <16 x i64> %ind
		// % Mask0 = extractelement <16 x i1> % Mask, i32 0
		// % ToStore0 = icmp eq i1 % Mask0, true
		// br i1 %ToStore0, label %cond.store, label %else
		//
		// cond.store:
		// % Elt0 = extractelement <16 x i32> %Src, i32 0
		// % Ptr0 = extractelement <16 x i32*> %Ptrs, i32 0
		// store i32 %Elt0, i32* % Ptr0, align 4
		// br label %else
		//
		// else:
		// % Mask1 = extractelement <16 x i1> % Mask, i32 1
		// % ToStore1 = icmp eq i1 % Mask1, true
		// br i1 % ToStore1, label %cond.store1, label %else2
		//
		// cond.store1:
		// % Elt1 = extractelement <16 x i32> %Src, i32 1
		// % Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// store i32 % Elt1, i32* % Ptr1, align 4
		// br label %else2
		// . . .
		static void ScalarizeMaskedScatter(CallInst *CI) {
		Value *Src = CI->getArgOperand(0);
		Value *Ptrs = CI->getArgOperand(1);
		Value *Alignment = CI->getArgOperand(2);
		Value *Mask = CI->getArgOperand(3);

		assert(isa<VectorType>(Src->getType()) &&
		"Unexpected data type in masked scatter intrinsic");
		assert(isa<VectorType>(Ptrs->getType()) &&
		isa<PointerType>(Ptrs->getType()->getVectorElementType()) &&
		"Vector of pointers is expected in masked scatter intrinsic");

		IRBuilder<> Builder(CI->getContext());
		Instruction *InsertPt = CI;
		BasicBlock *IfBlock = CI->getParent();
		Builder.SetInsertPoint(InsertPt);
		Builder.SetCurrentDebugLocation(CI->getDebugLoc());

		unsigned AlignVal = cast<ConstantInt>(Alignment)->getZExtValue();
		unsigned VectorWidth = Src->getType()->getVectorNumElements();
		bool IsAllOnesMask = isa<Constant>(Mask) &&
		cast<Constant>(Mask)->isAllOnesValue();

		if (IsAllOnesMask) {
		// Simple case, just store all elements from the data vector,
		// one by one.
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx),
		"Elt" + Twine(Idx));
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		Builder.CreateAlignedStore(OneElt, Ptr, AlignVal);
		}
		CI->eraseFromParent();
		return;
		mkuper: Same as above re AlignVal
		}
		for (unsigned Idx = 0; Idx < VectorWidth; ++Idx) {
		// Fill the "else" block, created in the previous iteration
		//
		// % Mask1 = extractelement <16 x i1> % Mask, i32 Idx
		// % ToStore = icmp eq i1 % Mask1, true
		// br i1 % ToStore, label %cond.store, label %else
		//
		Value *Predicate = Builder.CreateExtractElement(Mask,
		Builder.getInt32(Idx),
		"Mask" + Twine(Idx));
		Value *Cmp =
		Builder.CreateICmp(ICmpInst::ICMP_EQ, Predicate,
		ConstantInt::get(Predicate->getType(), 1),
		"ToStore" + Twine(Idx));

		// Create "cond" block
		//
		// % Elt1 = extractelement <16 x i32> %Src, i32 1
		// % Ptr1 = extractelement <16 x i32*> %Ptrs, i32 1
		// %store i32 % Elt1, i32* % Ptr1
		//
		BasicBlock *CondBlock = IfBlock->splitBasicBlock(InsertPt, "cond.store");
		Builder.SetInsertPoint(InsertPt);

		Value *OneElt = Builder.CreateExtractElement(Src, Builder.getInt32(Idx),
		"Elt" + Twine(Idx));
		Value *Ptr = Builder.CreateExtractElement(Ptrs, Builder.getInt32(Idx),
		"Ptr" + Twine(Idx));
		Builder.CreateAlignedStore(OneElt, Ptr, AlignVal);

		// Create "else" block, fill it in the next iteration
		BasicBlock *NewIfBlock = CondBlock->splitBasicBlock(InsertPt, "else");
		Builder.SetInsertPoint(InsertPt);
		Instruction *OldBr = IfBlock->getTerminator();
		BranchInst::Create(CondBlock, NewIfBlock, Cmp, OldBr);
		OldBr->eraseFromParent();
		IfBlock = NewIfBlock;
		}
		CI->eraseFromParent();
		}
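
The scatter lowering above can be modelled the same way. This is a sketch in plain C++ under the assumption that pointers of enabled lanes are valid; `scatterModel` is a hypothetical name, not part of the patch. Because the loop visits lanes in ascending order, as the patch's loop over Idx does, when two enabled lanes point at the same address the value from the higher lane is the one left in memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of a masked scatter (not LLVM code).
// Each enabled lane stores its element through its own pointer;
// disabled lanes touch no memory at all.
template <typename T, std::size_t N>
void scatterModel(const std::array<T, N> &Src,
                  const std::array<T *, N> &Ptrs,
                  const std::array<bool, N> &Mask) {
  for (std::size_t Idx = 0; Idx < N; ++Idx) {
    if (!Mask[Idx])
      continue; // corresponds to branching straight to the "else" block
    assert(Ptrs[Idx] != nullptr && "enabled lane needs a valid pointer");
    *Ptrs[Idx] = Src[Idx]; // body of one cond.store block
  }
}
```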

bool CodeGenPrepare::optimizeCallInst(CallInst *CI, bool& ModifiedDT) {		bool CodeGenPrepare::optimizeCallInst(CallInst *CI, bool& ModifiedDT) {
BasicBlock *BB = CI->getParent();		BasicBlock *BB = CI->getParent();

// Lower inline assembly if we can.		// Lower inline assembly if we can.
// If we found an inline asm expession, and if the target knows how to		// If we found an inline asm expession, and if the target knows how to
// lower it to normal LLVM code, do so now.		// lower it to normal LLVM code, do so now.
if (TLI && isa<InlineAsm>(CI->getCalledValue())) {		if (TLI && isa<InlineAsm>(CI->getCalledValue())) {
if (TLI->ExpandInlineAsm(CI)) {		if (TLI->ExpandInlineAsm(CI)) {
...	if (II) {
case Intrinsic::masked_store: {		case Intrinsic::masked_store: {
if (!TTI->isLegalMaskedStore(CI->getArgOperand(0)->getType(), 1)) {		if (!TTI->isLegalMaskedStore(CI->getArgOperand(0)->getType(), 1)) {
ScalarizeMaskedStore(CI);		ScalarizeMaskedStore(CI);
ModifiedDT = true;		ModifiedDT = true;
return true;		return true;
}		}
return false;		return false;
}		}
		case Intrinsic::masked_gather: {
		if (!TTI->isLegalMaskedGather(CI->getType())) {
		ScalarizeMaskedGather(CI);
		ModifiedDT = true;
		return true;
		}
		return false;
		}
		case Intrinsic::masked_scatter: {
		if (!TTI->isLegalMaskedScatter(CI->getArgOperand(0)->getType())) {
		ScalarizeMaskedScatter(CI);
		ModifiedDT = true;
		return true;
		}
		return false;
		}
case Intrinsic::aarch64_stlxr:		case Intrinsic::aarch64_stlxr:
case Intrinsic::aarch64_stxr: {		case Intrinsic::aarch64_stxr: {
ZExtInst *ExtVal = dyn_cast<ZExtInst>(CI->getArgOperand(0));		ZExtInst *ExtVal = dyn_cast<ZExtInst>(CI->getArgOperand(0));
if (!ExtVal || !ExtVal->hasOneUse() ||		if (!ExtVal || !ExtVal->hasOneUse() ||
ExtVal->getParent() == CI->getParent())		ExtVal->getParent() == CI->getParent())
return false;		return false;
// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.		// Sink a zext feeding stlxr/stxr before it, so it can be folded into it.
ExtVal->moveBefore(CI);		ExtVal->moveBefore(CI);

../lib/Target/X86/X86TargetTransformInfo.h

...	public:

int getIntImmCost(const APInt &Imm, Type *Ty);		int getIntImmCost(const APInt &Imm, Type *Ty);

int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);		int getIntImmCost(unsigned Opcode, unsigned Idx, const APInt &Imm, Type *Ty);
int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,		int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
Type *Ty);		Type *Ty);
bool isLegalMaskedLoad(Type *DataType, int Consecutive);		bool isLegalMaskedLoad(Type *DataType, int Consecutive);
bool isLegalMaskedStore(Type *DataType, int Consecutive);		bool isLegalMaskedStore(Type *DataType, int Consecutive);
		bool isLegalMaskedGather(Type *DataType);
		bool isLegalMaskedScatter(Type *DataType);
bool areInlineCompatible(const Function *Caller,		bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;		const Function *Callee) const;

/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

../lib/Target/X86/X86TargetTransformInfo.cpp

...	case Intrinsic::experimental_patchpoint_i64:
break;		break;
}		}
return X86TTIImpl::getIntImmCost(Imm, Ty);		return X86TTIImpl::getIntImmCost(Imm, Ty);
}		}

bool X86TTIImpl::isLegalMaskedLoad(Type *DataTy, int Consecutive) {		bool X86TTIImpl::isLegalMaskedLoad(Type *DataTy, int Consecutive) {
int DataWidth = DataTy->getPrimitiveSizeInBits();		int DataWidth = DataTy->getPrimitiveSizeInBits();

// Todo: AVX512 allows gather/scatter, works with strided and random as well
if ((DataWidth < 32) || (Consecutive == 0))		if ((DataWidth < 32) || (Consecutive == 0))
return false;		return false;
if (ST->hasAVX512() || ST->hasAVX2())		if (ST->hasAVX512() || ST->hasAVX2())
		mkuper: In case you're going to touch this code anyway - isn't it enough to check hasAVX2()? (That is, doesn't hasAVX512() imply hasAVX2()?)
		delena: This function should be changed anyway. 1. We have masked load/store on AVX, not only AVX2. 2. We have masked load/store for i16 and i8 vectors on SKX. But I need to add CodeGen support for this. I'll change to hasAVX2 meanwhile.
return true;		return true;
return false;		return false;
}		}

bool X86TTIImpl::isLegalMaskedStore(Type *DataType, int Consecutive) {		bool X86TTIImpl::isLegalMaskedStore(Type *DataType, int Consecutive) {
return isLegalMaskedLoad(DataType, Consecutive);		return isLegalMaskedLoad(DataType, Consecutive);
}		}

		bool X86TTIImpl::isLegalMaskedGather(Type *DataTy) {
		if (DataTy->isVectorTy())
		DataTy = cast<VectorType>(DataTy)->getVectorElementType();

		unsigned DataWidth = DataTy->isPointerTy() ? DL.getPointerSizeInBits() :
		DataTy->getPrimitiveSizeInBits();

		// AVX-512 allows gather and scatter
		return DataWidth >= 32 && ST->hasAVX512();
		mkuper: Shouldn't there be an upper limit to the DataWidth too? (Or to the vector element count, for that matter?)
		delena: If the vector is too wide, the type legalizer will split it. I don't know what will happen if the vector width is not a power of 2; I can reject this case.
		mkuper: So, say, a <32 x i32> gather will get split into two <16 x i32> gathers by the legalizer? In any case:
		a) We should probably reject the not-power-of-2 case, unless we know the legalizer can handle it. If we don't reject it, then I think there should be a test for it.
		b) I guess the name is a bit confusing - it's not really "isLegal" (because some things it accepts may not actually be legal on the target), it's more like "isLegalizeable". I don't think we should change the name, but it may be worth noting this in the declaration in TargetTransformInfo.h.
		delena: a) I can add rejection of the not-power-of-2 cases, and a test. But I want to know your opinion about vector and scalar types in this function. This is the comment I want to put inside:
		// This function is called now in two cases: from the Loop Vectorizer
		// and from the Scalarizer.
		// When the Loop Vectorizer asks about legality of the feature,
		// the vectorization factor is not calculated yet. The Loop Vectorizer
		// sends a scalar type and the decision is based on the width of the
		// scalar element.
		// Later on, the cost model will estimate usage of this intrinsic based on
		// the vector type.
		// The Scalarizer asks again about legality. It sends a vector type.
		// In this case we can reject non-power-of-2 vectors.
		if (isa<VectorType>(DataTy) && !isPowerOf2_32(DataTy->getVectorNumElements()))
		return false;
		Type *ScalarTy = DataTy->getScalarType();
		...
		b) "IsLegal" comes from TypeLegalizer and it is equal to "IsSupported". I wrote in the comments in TargetTransformInfo.h "Return true if the target supports.."
		mkuper: SGTM.
		}

		bool X86TTIImpl::isLegalMaskedScatter(Type *DataType) {
		return isLegalMaskedGather(DataType);
		}
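
The legality rule discussed in the inline comments above (element width of at least 32 bits, AVX-512 available, and rejection of vectors whose element count is not a power of 2) can be sketched as a standalone predicate. This is an illustration only: `DataTypeModel`, `isPowerOf2` and `isLegalMaskedGatherModel` are hypothetical names, the type is reduced to two integers, and the power-of-2 check reflects the proposal in the thread rather than the code shown in this diff.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM type: NumElements == 0 means a
// scalar type (the Loop Vectorizer case described in the thread).
struct DataTypeModel {
  unsigned NumElements; // 0 for scalars, lane count for vectors
  unsigned ScalarBits;  // width of one element in bits
};

inline bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Sketch of the proposed check: vectors with a non-power-of-2 lane count
// are rejected; otherwise the decision depends on element width and on
// whether the subtarget has AVX-512.
inline bool isLegalMaskedGatherModel(const DataTypeModel &Ty, bool HasAVX512) {
  assert(Ty.ScalarBits > 0 && "element width must be known");
  if (Ty.NumElements != 0 && !isPowerOf2(Ty.NumElements))
    return false;
  return Ty.ScalarBits >= 32 && HasAVX512;
}
```

A scatter predicate would simply forward to the same function, as isLegalMaskedScatter does in the patch.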

bool X86TTIImpl::areInlineCompatible(const Function *Caller,		bool X86TTIImpl::areInlineCompatible(const Function *Caller,
const Function *Callee) const {		const Function *Callee) const {
const TargetMachine &TM = getTLI()->getTargetMachine();		const TargetMachine &TM = getTLI()->getTargetMachine();

// Work this as a subsetting of subtarget features.		// Work this as a subsetting of subtarget features.
const FeatureBitset &CallerBits =		const FeatureBitset &CallerBits =
TM.getSubtargetImpl(*Caller)->getFeatureBits();		TM.getSubtargetImpl(*Caller)->getFeatureBits();
const FeatureBitset &CalleeBits =		const FeatureBitset &CalleeBits =
TM.getSubtargetImpl(*Callee)->getFeatureBits();		TM.getSubtargetImpl(*Callee)->getFeatureBits();

// FIXME: This is likely too limiting as it will include subtarget features		// FIXME: This is likely too limiting as it will include subtarget features
// that we might not care about for inlining, but it is conservatively		// that we might not care about for inlining, but it is conservatively
// correct.		// correct.
return (CallerBits & CalleeBits) == CalleeBits;		return (CallerBits & CalleeBits) == CalleeBits;
}		}

../test/CodeGen/X86/masked_gather_scatter.ll

; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=knl < %s | FileCheck %s -check-prefix=KNL		; RUN: llc -mtriple=x86_64-apple-darwin -mcpu=knl < %s | FileCheck %s -check-prefix=KNL
		; RUN: opt -mtriple=x86_64-apple-darwin -codegenprepare -mcpu=corei7-avx -S < %s | FileCheck %s -check-prefix=SCALAR


target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"		target triple = "x86_64-unknown-linux-gnu"

; KNL-LABEL: test1		; KNL-LABEL: test1
; KNL: kxnorw %k1, %k1, %k1		; KNL: kxnorw %k1, %k1, %k1
; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}		; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}

		; SCALAR-LABEL: test1
		; SCALAR: extractelement <16 x float*>
		; SCALAR-NEXT: load float
		; SCALAR-NEXT: insertelement <16 x float>
		; SCALAR-NEXT: extractelement <16 x float*>
		; SCALAR-NEXT: load float

define <16 x float> @test1(float* %base, <16 x i32> %ind) {		define <16 x float> @test1(float* %base, <16 x i32> %ind) {

%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0		%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0
%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer

%sext_ind = sext <16 x i32> %ind to <16 x i64>		%sext_ind = sext <16 x i32> %ind to <16 x i64>
%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind

%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)		%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)
ret <16 x float>%res		ret <16 x float>%res
}		}

declare <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*>, i32, <16 x i1>, <16 x i32>)		declare <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*>, i32, <16 x i1>, <16 x i32>)
declare <16 x float> @llvm.masked.gather.v16f32(<16 x float*>, i32, <16 x i1>, <16 x float>)		declare <16 x float> @llvm.masked.gather.v16f32(<16 x float*>, i32, <16 x i1>, <16 x float>)
declare <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> , i32, <8 x i1> , <8 x i32> )		declare <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> , i32, <8 x i1> , <8 x i32> )

; KNL-LABEL: test2		; KNL-LABEL: test2
; KNL: kmovw %esi, %k1		; KNL: kmovw %esi, %k1
; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}		; KNL: vgatherdps (%rdi,%zmm0,4), %zmm1 {%k1}

		; SCALAR-LABEL: test2
		; SCALAR: extractelement <16 x float*>
		; SCALAR-NEXT: load float
		; SCALAR-NEXT: insertelement <16 x float>
		; SCALAR-NEXT: br label %else
		; SCALAR: else:
		; SCALAR-NEXT: %res.phi.else = phi
		; SCALAR-NEXT: %Mask1 = extractelement <16 x i1> %imask, i32 1
		; SCALAR-NEXT: %ToLoad1 = icmp eq i1 %Mask1, true
		; SCALAR-NEXT: br i1 %ToLoad1, label %cond.load1, label %else2

define <16 x float> @test2(float* %base, <16 x i32> %ind, i16 %mask) {		define <16 x float> @test2(float* %base, <16 x i32> %ind, i16 %mask) {

%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0		%broadcast.splatinsert = insertelement <16 x float*> undef, float* %base, i32 0
%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x float*> %broadcast.splatinsert, <16 x float*> undef, <16 x i32> zeroinitializer

%sext_ind = sext <16 x i32> %ind to <16 x i64>		%sext_ind = sext <16 x i32> %ind to <16 x i64>
%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind		%gep.random = getelementptr float, <16 x float*> %broadcast.splat, <16 x i64> %sext_ind
%imask = bitcast i16 %mask to <16 x i1>		%imask = bitcast i16 %mask to <16 x i1>
...	define <16 x i32> @test4(i32* %base, <16 x i32> %ind, i16 %mask) {
ret <16 x i32> %res		ret <16 x i32> %res
}		}

; KNL-LABEL: test5		; KNL-LABEL: test5
; KNL: kmovw %k1, %k2		; KNL: kmovw %k1, %k2
; KNL: vpscatterdd {{.*}}%k2		; KNL: vpscatterdd {{.*}}%k2
; KNL: vpscatterdd {{.*}}%k1		; KNL: vpscatterdd {{.*}}%k1

		; SCALAR-LABEL: test5
		; SCALAR: %Mask0 = extractelement <16 x i1> %imask, i32 0
		; SCALAR-NEXT: %ToStore0 = icmp eq i1 %Mask0, true
		; SCALAR-NEXT: br i1 %ToStore0, label %cond.store, label %else
		; SCALAR: cond.store:
		; SCALAR-NEXT: %Elt0 = extractelement <16 x i32> %val, i32 0
		; SCALAR-NEXT: %Ptr0 = extractelement <16 x i32*> %gep.random, i32 0
		; SCALAR-NEXT: store i32 %Elt0, i32* %Ptr0, align 4
		; SCALAR-NEXT: br label %else
		; SCALAR: else:
		; SCALAR-NEXT: %Mask1 = extractelement <16 x i1> %imask, i32 1
		; SCALAR-NEXT: %ToStore1 = icmp eq i1 %Mask1, true
		; SCALAR-NEXT: br i1 %ToStore1, label %cond.store1, label %else2

define void @test5(i32* %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {		define void @test5(i32* %base, <16 x i32> %ind, i16 %mask, <16 x i32>%val) {

%broadcast.splatinsert = insertelement <16 x i32*> undef, i32* %base, i32 0		%broadcast.splatinsert = insertelement <16 x i32*> undef, i32* %base, i32 0
%broadcast.splat = shufflevector <16 x i32*> %broadcast.splatinsert, <16 x i32*> undef, <16 x i32> zeroinitializer		%broadcast.splat = shufflevector <16 x i32*> %broadcast.splatinsert, <16 x i32*> undef, <16 x i32> zeroinitializer

%gep.random = getelementptr i32, <16 x i32*> %broadcast.splat, <16 x i32> %ind		%gep.random = getelementptr i32, <16 x i32*> %broadcast.splat, <16 x i32> %ind
%imask = bitcast i16 %mask to <16 x i1>		%imask = bitcast i16 %mask to <16 x i1>
call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)		call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)
call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)		call void @llvm.masked.scatter.v16i32(<16 x i32>%val, <16 x i32*> %gep.random, i32 4, <16 x i1> %imask)
ret void		ret void
}		}

declare void @llvm.masked.scatter.v8i32(<8 x i32> , <8 x i32*> , i32 , <8 x i1> )		declare void @llvm.masked.scatter.v8i32(<8 x i32> , <8 x i32*> , i32 , <8 x i1> )
declare void @llvm.masked.scatter.v16i32(<16 x i32> , <16 x i32*> , i32 , <16 x i1> )		declare void @llvm.masked.scatter.v16i32(<16 x i32> , <16 x i32*> , i32 , <16 x i1> )

; KNL-LABEL: test6		; KNL-LABEL: test6
; KNL: kxnorw %k1, %k1, %k1		; KNL: kxnorw %k1, %k1, %k1
; KNL: kxnorw %k2, %k2, %k2		; KNL: kxnorw %k2, %k2, %k2
; KNL: vpgatherqd (,%zmm{{.}}), %ymm{{.}} {%k2}		; KNL: vpgatherqd (,%zmm{{.}}), %ymm{{.}} {%k2}
; KNL: vpscatterqd %ymm{{.}}, (,%zmm{{.}}) {%k1}		; KNL: vpscatterqd %ymm{{.}}, (,%zmm{{.}}) {%k1}

		; SCALAR-LABEL: test6
		; SCALAR: store i32 %Elt0, i32* %Ptr01, align 4
		; SCALAR-NEXT: %Elt1 = extractelement <8 x i32> %a1, i32 1
		; SCALAR-NEXT: %Ptr12 = extractelement <8 x i32*> %ptr, i32 1
		; SCALAR-NEXT: store i32 %Elt1, i32* %Ptr12, align 4
		; SCALAR-NEXT: %Elt2 = extractelement <8 x i32> %a1, i32 2
		; SCALAR-NEXT: %Ptr23 = extractelement <8 x i32*> %ptr, i32 2
		; SCALAR-NEXT: store i32 %Elt2, i32* %Ptr23, align 4

define <8 x i32> @test6(<8 x i32>%a1, <8 x i32*> %ptr) {		define <8 x i32> @test6(<8 x i32>%a1, <8 x i32*> %ptr) {

%a = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)		%a = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)

call void @llvm.masked.scatter.v8i32(<8 x i32> %a1, <8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>)		call void @llvm.masked.scatter.v8i32(<8 x i32> %a1, <8 x i32*> %ptr, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>)
ret <8 x i32>%a		ret <8 x i32>%a
}		}
