This is an archive of the discontinued LLVM Phabricator instance.

[x86, InstCombine] transform x86 AVX masked stores to LLVM intrinsics
ClosedPublic

Authored by spatel on Feb 20 2016, 4:42 PM.

Details

Reviewers

RKSimon
delena
craig.topper

Commits

rG1ace99351fdf: [x86, InstCombine] transform x86 AVX masked stores to LLVM intrinsics
rL262064: [x86, InstCombine] transform x86 AVX masked stores to LLVM intrinsics

Summary

The intended effect of this patch in conjunction with:
http://reviews.llvm.org/rL259392
http://reviews.llvm.org/rL260145

is that customers using the AVX intrinsics in C will benefit from combines when the store mask is constant:

void mstore_zero_mask(float *f, __m128 v) {
  _mm_maskstore_ps(f, _mm_set1_epi32(0), v);
}

void mstore_fake_ones_mask(float *f, __m128 v) {
  _mm_maskstore_ps(f, _mm_set1_epi32(1), v);
}

void mstore_ones_mask(float *f, __m128 v) {
  _mm_maskstore_ps(f, _mm_set1_epi32(0x80000000), v);
}

void mstore_one_set_elt_mask(float *f, __m128 v) {
  _mm_maskstore_ps(f, _mm_set_epi32(0x80000000, 0, 0, 0), v);
}

...so none of the above will actually generate a masked store for optimized code.
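
To make the four cases concrete, here is a minimal scalar sketch of the store semantics the summary relies on. It assumes, as the test comments further down state, that only the sign bit of each mask element matters. `model_maskstore_ps` and `lanes_written` are hypothetical names used only for illustration, and masks are listed in memory order, so `_mm_set_epi32(0x80000000, 0, 0, 0)` corresponds to `{0, 0, 0, 0x80000000}`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Mask4 = std::array<std::uint32_t, 4>;
using Vec4 = std::array<float, 4>;

// Hypothetical scalar model of _mm_maskstore_ps (not the real intrinsic):
// lane i of v is written to f[i] only if the sign bit of mask[i] is set.
inline void model_maskstore_ps(float *f, const Mask4 &mask, const Vec4 &v) {
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i] & 0x80000000u)
      f[i] = v[i];
}

// Number of lanes a constant mask would write: 0 means the call is a no-op,
// 4 means it behaves like an ordinary vector store.
inline int lanes_written(const Mask4 &mask) {
  int n = 0;
  for (std::uint32_t m : mask)
    n += static_cast<int>(m >> 31);
  return n;
}
```

Under this model the masks in the four functions above write 0, 0, 4, and 1 lanes respectively, which is why a constant mask leaves room for something cheaper than a real masked store.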

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 48602. Feb 20 2016, 4:42 PM

spatel retitled this revision to [x86, InstCombine] transform x86 AVX masked stores to LLVM intrinsics.

spatel updated this object.

spatel added reviewers: RKSimon, delena, craig.topper.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Feb 20 2016, 4:42 PM

Couple of minor queries.

lib/Transforms/InstCombine/InstCombineCalls.cpp
824 ↗	(On Diff #48602)	Not a priority but can the blendv instcombines share this helper (or something very similar)?
1609 ↗	(On Diff #48602)	If this returns nullptr shouldn't the case break?

spatel added inline comments. Feb 21 2016, 8:23 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
824 ↗	(On Diff #48602)	Yes - I'll do that as a precursor or follow-up patch.
1609 ↗	(On Diff #48602)	This is interesting - simplifyX86MaskedStore() always returns nullptr because eraseInstFromFunction() always returns nullptr. I'll change it to return a bool to indicate if a change was made.
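
The problem described in the 1609 comment can be shown with a toy model. This is only a sketch with hypothetical stand-in names (`Inst`, `erase_inst`, `simplify_ptr`, `simplify_bool`), not LLVM code: when the erase routine always returns null, a pointer-returning helper gives the caller the same result whether or not it changed anything, while a bool return separates the two cases.

```cpp
#include <cassert>

// Toy stand-in for an IR instruction.
struct Inst {
  bool erased = false;
};

// Stand-in for an erase routine that performs the erase and always returns
// null, as the comment above says eraseInstFromFunction() does.
inline Inst *erase_inst(Inst &I) {
  I.erased = true;
  return nullptr;
}

// Pointer-returning helper: the result is null both when it erased the
// instruction and when it did nothing, so the caller cannot tell them apart.
inline Inst *simplify_ptr(Inst &I, bool maskIsConstant) {
  if (maskIsConstant)
    return erase_inst(I);
  return nullptr;
}

// Bool-returning helper: true means a change was made, false means the
// caller should break out of the case and keep trying other combines.
inline bool simplify_bool(Inst &I, bool maskIsConstant) {
  if (!maskIsConstant)
    return false;
  erase_inst(I);
  return true;
}
```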

Patch updated:

I checked in the sign bit -> bool helper function in http://reviews.llvm.org/rL261483
Updated simplifyX86MaskedStore() to return 'true' when it makes a change; this allows the caller's 'case' to break and try other optimizations.
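
As a rough illustration of the sign bit -> bool conversion mentioned above, the following sketch models it on plain integers. `negative_is_true`, `StoreKind`, and `classify` are hypothetical names, not the helper checked in as rL261483. The three-way classification reflects what the test expectations in this revision imply once the target-independent masked-store folds have run; it is not the logic of this patch itself.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the sign bit -> bool conversion: lane i is true when
// the signed mask element is negative, i.e. its most significant bit is set.
inline std::vector<bool> negative_is_true(const std::vector<std::int64_t> &mask) {
  std::vector<bool> bools;
  bools.reserve(mask.size());
  for (std::int64_t m : mask)
    bools.push_back(m < 0);
  return bools;
}

enum class StoreKind { NoOp, PlainStore, MaskedStore };

// All-false masks store nothing, all-true masks are ordinary vector stores,
// and anything else still needs a masked store.
inline StoreKind classify(const std::vector<bool> &bools) {
  bool any = false;
  bool all = true;
  for (bool b : bools) {
    any = any || b;
    all = all && b;
  }
  if (!any)
    return StoreKind::NoOp;
  if (all)
    return StoreKind::PlainStore;
  return StoreKind::MaskedStore;
}
```

For example, the mask `<1, 2, 3, 2147483647>` has no sign bits set and classifies as a no-op, while `<0, 0, 0, -1>` still needs a masked store.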

I'm happy with this - does anyone else have any comments?

delena added inline comments. Feb 24 2016, 11:26 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
854 ↗	(On Diff #48627)	I think you lost the address space here. I know that address spaces are not handled properly in masked operations, and there is an open discussion on llvm-dev right now. I think it will be resolved sooner or later. I suggest keeping the original address space anyway.

Patch updated:
Fixed to propagate address space of original pointer argument.

spatel marked an inline comment as done. Feb 25 2016, 8:11 AM

spatel added inline comments.

lib/Transforms/InstCombine/InstCombineCalls.cpp
855 ↗	(On Diff #49067)	Thanks - I didn't think it was possible to specify the address space with the AVX intrinsic, but we should do the right thing here anyway.

LGTM

This revision is now accepted and ready to land. Feb 26 2016, 11:26 AM

Closed by commit rL262064: [x86, InstCombine] transform x86 AVX masked stores to LLVM intrinsics (authored by spatel). Feb 26 2016, 1:08 PM

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp	49 lines
llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll	94 lines

Diff 49225

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

@@ ... 822 lines hidden ... @@ static Instruction *simplifyMaskedScatter(IntrinsicInst &II, InstCombiner &IC) {
   // If the mask is all zeros, a scatter does nothing.
   auto *ConstMask = dyn_cast<Constant>(II.getArgOperand(3));
   if (ConstMask && ConstMask->isNullValue())
     return IC.eraseInstFromFunction(II);

   return nullptr;
 }

+// TODO: If the x86 backend knew how to convert a bool vector mask back to an
+// XMM register mask efficiently, we could transform all x86 masked intrinsics
+// to LLVM masked intrinsics and remove the x86 masked intrinsic defs.
+static bool simplifyX86MaskedStore(IntrinsicInst &II, InstCombiner &IC) {
+  Value *Ptr = II.getOperand(0);
+  Value *Mask = II.getOperand(1);
+  Value *Vec = II.getOperand(2);
+
+  // Special case a zero mask since that's not a ConstantDataVector:
+  // this masked store instruction does nothing.
+  if (isa<ConstantAggregateZero>(Mask)) {
+    IC.eraseInstFromFunction(II);
+    return true;
+  }
+
+  auto *ConstMask = dyn_cast<ConstantDataVector>(Mask);
+  if (!ConstMask)
+    return false;
+
+  // The mask is constant. Convert this x86 intrinsic to the LLVM intrinsic
+  // to allow target-independent optimizations.
+
+  // First, cast the x86 intrinsic scalar pointer to a vector pointer to match
+  // the LLVM intrinsic definition for the pointer argument.
+  unsigned AddrSpace = cast<PointerType>(Ptr->getType())->getAddressSpace();
+  PointerType *VecPtrTy = PointerType::get(Vec->getType(), AddrSpace);
+
+  Value *PtrCast = IC.Builder->CreateBitCast(Ptr, VecPtrTy, "castvec");
+
+  // Second, convert the x86 XMM integer vector mask to a vector of bools based
+  // on each element's most significant bit (the sign bit).
+  Constant *BoolMask = getNegativeIsTrueBoolVec(ConstMask);
+
+  IC.Builder->CreateMaskedStore(Vec, PtrCast, 1, BoolMask);
+
+  // 'Replace uses' doesn't work for stores. Erase the original masked store.
+  IC.eraseInstFromFunction(II);
+  return true;
+}
+
 /// CallInst simplification. This mostly only handles folding of intrinsic
 /// instructions. For normal calls, it allows visitCallSite to do the heavy
 /// lifting.
 Instruction *InstCombiner::visitCallInst(CallInst &CI) {
   auto Args = CI.arg_operands();
   if (Value *V = SimplifyCall(CI.getCalledValue(), Args.begin(), Args.end(), DL,
                               TLI, DT, AC))
     return replaceInstUsesWith(CI, V);
@@ ... 746 lines hidden ... @@ Instruction *InstCombiner::visitCallInst(CallInst &CI) {
   case Intrinsic::x86_avx_vperm2f128_pd_256:
   case Intrinsic::x86_avx_vperm2f128_ps_256:
   case Intrinsic::x86_avx_vperm2f128_si_256:
   case Intrinsic::x86_avx2_vperm2i128:
     if (Value *V = simplifyX86vperm2(*II, *Builder))
       return replaceInstUsesWith(*II, V);
     break;

+  case Intrinsic::x86_avx_maskstore_ps:
+  case Intrinsic::x86_avx_maskstore_pd:
+  case Intrinsic::x86_avx_maskstore_ps_256:
+  case Intrinsic::x86_avx_maskstore_pd_256:
+    // TODO: The AVX2 integer variants can go here too.
+    if (simplifyX86MaskedStore(*II, *this))
+      return nullptr;
+    break;
+
   case Intrinsic::x86_xop_vpcomb:
   case Intrinsic::x86_xop_vpcomd:
   case Intrinsic::x86_xop_vpcomq:
   case Intrinsic::x86_xop_vpcomw:
     if (Value *V = simplifyX86vpcom(*II, *Builder, true))
       return replaceInstUsesWith(*II, V);
     break;

@@ ... 993 lines hidden ... @@

llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll

; RUN: opt < %s -instcombine -S | FileCheck %s

; If the mask isn't constant, do nothing.

define void @mstore(i8* %f, <4 x i32> %mask, <4 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> %mask, <4 x float> %v)
  ret void

; CHECK-LABEL: @mstore(
; CHECK-NEXT: tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> %mask, <4 x float> %v)
; CHECK-NEXT: ret void
}

; Zero mask is a nop.

define void @mstore_zeros(i8* %f, <4 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> zeroinitializer, <4 x float> %v)
  ret void

; CHECK-LABEL: @mstore_zeros(
; CHECK-NEXT: ret void
}

; Only the sign bit matters.

define void @mstore_fake_ones(i8* %f, <4 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> <i32 1, i32 2, i32 3, i32 2147483647>, <4 x float> %v)
  ret void

; CHECK-LABEL: @mstore_fake_ones(
; CHECK-NEXT: ret void
}

; All mask bits are set, so this is just a vector store.

define void @mstore_real_ones(i8* %f, <4 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> <i32 -1, i32 -2, i32 -3, i32 -2147483648>, <4 x float> %v)
  ret void

; CHECK-LABEL: @mstore_real_ones(
; CHECK-NEXT: %castvec = bitcast i8* %f to <4 x float>*
; CHECK-NEXT: store <4 x float> %v, <4 x float>* %castvec
; CHECK-NEXT: ret void
}

; It's a constant mask, so convert to an LLVM intrinsic. The backend should optimize further.

define void @mstore_one_one(i8* %f, <4 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps(i8* %f, <4 x i32> <i32 0, i32 0, i32 0, i32 -1>, <4 x float> %v)
  ret void

; CHECK-LABEL: @mstore_one_one(
; CHECK-NEXT: %castvec = bitcast i8* %f to <4 x float>*
; CHECK-NEXT: call void @llvm.masked.store.v4f32(<4 x float> %v, <4 x float>* %castvec, i32 1, <4 x i1> <i1 false, i1 false, i1 false, i1 true>)
; CHECK-NEXT: ret void
}

; Try doubles.

define void @mstore_one_one_double(i8* %f, <2 x double> %v) {
  tail call void @llvm.x86.avx.maskstore.pd(i8* %f, <2 x i64> <i64 -1, i64 0>, <2 x double> %v)
  ret void

; CHECK-LABEL: @mstore_one_one_double(
; CHECK-NEXT: %castvec = bitcast i8* %f to <2 x double>*
; CHECK-NEXT: call void @llvm.masked.store.v2f64(<2 x double> %v, <2 x double>* %castvec, i32 1, <2 x i1> <i1 true, i1 false>)
; CHECK-NEXT: ret void
}

define void @mstore_v8f32(i8* %f, <8 x float> %v) {
  tail call void @llvm.x86.avx.maskstore.ps.256(i8* %f, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 -1, i32 -2, i32 -3, i32 -4>, <8 x float> %v)
  ret void

; CHECK-LABEL: @mstore_v8f32(
; CHECK-NEXT: %castvec = bitcast i8* %f to <8 x float>*
; CHECK-NEXT: call void @llvm.masked.store.v8f32(<8 x float> %v, <8 x float>* %castvec, i32 1, <8 x i1> <i1 false, i1 false, i1 false, i1 false, i1 true, i1 true, i1 true, i1 true>)
; CHECK-NEXT: ret void
}

define void @mstore_v4f64(i8* %f, <4 x double> %v) {
  tail call void @llvm.x86.avx.maskstore.pd.256(i8* %f, <4 x i64> <i64 -1, i64 0, i64 1, i64 2>, <4 x double> %v)
  ret void

; CHECK-LABEL: @mstore_v4f64(
; CHECK-NEXT: %castvec = bitcast i8* %f to <4 x double>*
; CHECK-NEXT: call void @llvm.masked.store.v4f64(<4 x double> %v, <4 x double>* %castvec, i32 1, <4 x i1> <i1 true, i1 false, i1 false, i1 false>)
; CHECK-NEXT: ret void
}

declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)
declare void @llvm.x86.avx.maskstore.pd(i8*, <2 x i64>, <2 x double>)
declare void @llvm.x86.avx.maskstore.ps.256(i8*, <8 x i32>, <8 x float>)
declare void @llvm.x86.avx.maskstore.pd.256(i8*, <4 x i64>, <4 x double>)