This is an archive of the discontinued LLVM Phabricator instance.

Differential D112877

[BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if mask for gap will be used
Closed, Public

Authored by lebedev.ri on Oct 30 2021, 4:08 PM.

Details

Reviewers

RKSimon
pengfei
dorit
Ayal
hsaito
fhahn

Commits

rGa4b64f772711: [BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if…

Summary

As can be seen in InnerLoopVectorizer::vectorizeInterleaveGroup(),
in some cases (reported by UseMaskForGaps) the gaps in an interleaved load/store group
are masked away by a separate constant mask, so there is no need to
account for the cost of replicating the mask into those gap positions.
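
To make the summary concrete, here is a minimal sketch of which lanes of the wide mask actually need an inserted element. It is not the LLVM code: `demandedMemberLanes` and `maskInsertLanes` are hypothetical names, `std::vector<bool>` stands in for `llvm::APInt`, and the count of demanded lanes is used as a stand-in for the real target-dependent insertion cost.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not LLVM API): marks the lanes of the wide
// interleaved vector that belong to an actual group member. The loop follows
// the DemandedLoadStoreElts computation shown in the patch.
std::vector<bool> demandedMemberLanes(unsigned Factor, unsigned NumSubElts,
                                      const std::vector<unsigned> &Indices) {
  std::vector<bool> Demanded(Factor * NumSubElts, false);
  for (unsigned Index : Indices) {
    assert(Index < Factor && "Invalid index for interleaved memory op");
    for (unsigned Elm = 0; Elm < NumSubElts; ++Elm)
      Demanded[Index + Elm * Factor] = true;
  }
  return Demanded;
}

// Hypothetical helper: number of lanes of the replicated mask that must be
// populated. Assumption: when a separate gaps mask is used, the gap lanes
// need no inserted mask element, so only member lanes are counted.
unsigned maskInsertLanes(unsigned Factor, unsigned NumSubElts,
                         const std::vector<unsigned> &Indices,
                         bool UseMaskForGaps) {
  if (!UseMaskForGaps)
    return Factor * NumSubElts; // every lane of the wide mask is demanded
  unsigned Count = 0;
  for (bool Lane : demandedMemberLanes(Factor, NumSubElts, Indices))
    Count += Lane ? 1 : 0;
  return Count;
}
```

For the factor-3 store with members at indices 0 and 1 and VF=4 used in the patch's own comment, `maskInsertLanes(3, 4, {0, 1}, true)` counts 8 lanes instead of all 12.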

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri requested review of this revision. (Oct 30 2021, 4:08 PM)

lebedev.ri created this revision.

lebedev.ri mentioned this in D112873: [X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask. (Oct 30 2021, 4:12 PM)

lebedev.ri added a child revision: D112873: [X86] `X86TTIImpl::getInterleavedMemoryOpCostAVX512()`: fallback to scalarization cost computation for mask.

Harbormaster completed remote builds in B131606: Diff 383624. (Oct 30 2021, 4:50 PM)

ping

LGTM (by inspection) - I tend to prefer to use the getScalarizationOverhead wrapper when we're using all the elements of the type, but I think it's better for consistency here to always specify the demanded elts.

Ideally we'd have at least some test coverage for this, but I understand mask gaps codegen can be tricky.
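
The wrapper-versus-explicit-mask point above can be illustrated with a toy model. This sketch assumes that the wrapper form behaves as if every element were demanded, and that each demanded element costs one unit per insert and one unit per extract; both function names are hypothetical and the unit costs are not LLVM's.

```cpp
#include <cassert>
#include <vector>

// Toy cost model (assumption, not the LLVM implementation): each demanded
// lane contributes one unit for an insert and one unit for an extract.
unsigned scalarizationOverhead(const std::vector<bool> &DemandedElts,
                               bool Insert, bool Extract) {
  unsigned Cost = 0;
  for (bool Demanded : DemandedElts) {
    if (!Demanded)
      continue;
    Cost += Insert ? 1 : 0;
    Cost += Extract ? 1 : 0;
  }
  return Cost;
}

// Wrapper form: no explicit mask is passed, so all lanes are treated as
// demanded by building an all-ones mask and forwarding to the function above.
unsigned scalarizationOverheadAll(unsigned NumElts, bool Insert, bool Extract) {
  return scalarizationOverhead(std::vector<bool>(NumElts, true), Insert,
                               Extract);
}
```

Under these assumptions, passing an explicit all-ones mask and calling the wrapper give the same result, so always spelling out the demanded elements changes consistency at the call sites and not the computed value.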

llvm/include/llvm/CodeGen/BasicTTIImpl.h
Lines 1241–1244: Worth pulling these out?
const APInt DemandedAllSubElts = APInt::getAllOnes(NumSubElts);
const APInt DemandedAllResultElts = APInt::getAllOnes(NumElts);

@RKSimon thank you for the review!
Applied nit suggestion.

Yeah, it would indeed be really great to have test coverage for this.

Harbormaster completed remote builds in B132197: Diff 384420. (Nov 3 2021, 7:18 AM)

lebedev.ri mentioned this in rGc6b3da1d663a: [NFC][X86] Duplicate LV test into a costmodel test. (Nov 3 2021, 7:31 AM)

And we have a winner!
Test coverage added, will land immediately.

This revision was not accepted when it landed; it landed in state Needs Review. (Nov 3 2021, 7:33 AM)

This revision was landed with ongoing or failed builds.

Closed by commit rGa4b64f772711: [BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if… (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri added a commit: rGa4b64f772711: [BasicTTI] getInterleavedMemoryOpCost(): discount unused members of mask if….

Thanks!

Harbormaster completed remote builds in B132203: Diff 384428. (Nov 3 2021, 8:15 AM)

Revision Contents

Path                                                                        Size
llvm/include/llvm/CodeGen/BasicTTIImpl.h                                    16 lines
llvm/test/Analysis/CostModel/X86/interleaved-store-accesses-with-gaps.ll    8 lines

Diff 384429

llvm/include/llvm/CodeGen/BasicTTIImpl.h

(1,232 lines not shown; enclosing context: if (Cost.isValid() && VecTySize > VecTyLTSize) {)
       // will be used.
       Cost = divideCeil(UsedInsts.count() * Cost.getValue().getValue(),
                         NumLegalInsts);
     }

     // Then plus the cost of interleave operation.
     assert(Indices.size() <= Factor &&
            "Interleaved memory op has too many members");

+    const APInt DemandedAllSubElts = APInt::getAllOnes(NumSubElts);
+    const APInt DemandedAllResultElts = APInt::getAllOnes(NumElts);
+
     APInt DemandedLoadStoreElts = APInt::getZero(NumElts);
     for (unsigned Index : Indices) {
       assert(Index < Factor && "Invalid index for interleaved memory op");
       for (unsigned Elm = 0; Elm < NumSubElts; Elm++)
         DemandedLoadStoreElts.setBit(Index + Elm * Factor);
     }

     if (Opcode == Instruction::Load) {
       // The interleave cost is similar to extract sub vectors' elements
       // from the wide vector, and insert them into sub vectors.
       //
       // E.g. An interleaved load of factor 2 (with one member of index 0):
       // %vec = load <8 x i32>, <8 x i32>* %ptr
       // %v0 = shuffle %vec, undef, <0, 2, 4, 6> ; Index 0
       // The cost is estimated as extract elements at 0, 2, 4, 6 from the
       // <8 x i32> vector and insert them into a <4 x i32> vector.
       InstructionCost InsSubCost =
-          getScalarizationOverhead(SubVT, /*Insert*/ true, /*Extract*/ false);
+          thisT()->getScalarizationOverhead(SubVT, DemandedAllSubElts,
+                                            /*Insert*/ true, /*Extract*/ false);
       Cost += Indices.size() * InsSubCost;
       Cost +=
           thisT()->getScalarizationOverhead(VT, DemandedLoadStoreElts,
                                             /*Insert*/ false, /*Extract*/ true);
     } else {
       // The interleave cost is extract elements from sub vectors, and
       // insert them into the wide vector.
       //
       // E.g. An interleaved store of factor 3 with 2 members at indices 0,1:
       // (using VF=4):
       // %v0_v1 = shuffle %v0, %v1, <0,4,undef,1,5,undef,2,6,undef,3,7,undef>
       // %gaps.mask = <true, true, false, true, true, false,
       // true, true, false, true, true, false>
       // call llvm.masked.store <12 x i32> %v0_v1, <12 x i32>* %ptr,
       // i32 Align, <12 x i1> %gaps.mask
       // The cost is estimated as extract all elements (of actual members,
       // excluding gaps) from both <4 x i32> vectors and insert into the <12 x
       // i32> vector.
       InstructionCost ExtSubCost =
-          getScalarizationOverhead(SubVT, /*Insert*/ false, /*Extract*/ true);
+          thisT()->getScalarizationOverhead(SubVT, DemandedAllSubElts,
+                                            /*Insert*/ false, /*Extract*/ true);
       Cost += ExtSubCost * Indices.size();
       Cost += thisT()->getScalarizationOverhead(VT, DemandedLoadStoreElts,
                                                 /*Insert*/ true,
                                                 /*Extract*/ false);
     }

     if (!UseMaskForCond)
       return Cost;

     Type *I8Type = Type::getInt8Ty(VT->getContext());
     auto *MaskVT = FixedVectorType::get(I8Type, NumElts);
     SubVT = FixedVectorType::get(I8Type, NumSubElts);

     // The Mask shuffling cost is extract all the elements of the Mask
     // and insert each of them Factor times into the wide vector:
     //
     // E.g. an interleaved group with factor 3:
     // %mask = icmp ult <8 x i32> %vec1, %vec2
     // %interleaved.mask = shufflevector <8 x i1> %mask, <8 x i1> undef,
     // <24 x i32> <0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7>
     // The cost is estimated as extract all mask elements from the <8xi1> mask
     // vector and insert them factor times into the <24xi1> shuffled mask
     // vector.
-    Cost += getScalarizationOverhead(SubVT, /*Insert*/ false, /*Extract*/ true);
     Cost +=
-        getScalarizationOverhead(MaskVT, /*Insert*/ true, /*Extract*/ false);
+        thisT()->getScalarizationOverhead(SubVT, DemandedAllSubElts,
+                                          /*Insert*/ false, /*Extract*/ true);
+    Cost += thisT()->getScalarizationOverhead(
+        MaskVT, UseMaskForGaps ? DemandedLoadStoreElts : DemandedAllResultElts,
+        /*Insert*/ true, /*Extract*/ false);

     // The Gaps mask is invariant and created outside the loop, therefore the
     // cost of creating it is not accounted for here. However if we have both
     // a MaskForGaps and some other mask that guards the execution of the
     // memory access, we need to account for the cost of And-ing the two masks
     // inside the loop.
     if (UseMaskForGaps)
       Cost += thisT()->getArithmeticInstrCost(BinaryOperator::And, MaskVT,
(910 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-store-accesses-with-gaps.ll

(101 lines not shown)
 ; DISABLED_MASKED_STRIDED: LV: Found an estimated cost of 3000000 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2

 ; ENABLED_MASKED_STRIDED: LV: Checking a loop in "test2"
 ;
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %0, i16* %arrayidx2, align 2
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %2, i16* %arrayidx7, align 2
 ;
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 2 For instruction: store i16 %0, i16* %arrayidx2, align 2
-; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 20 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
+; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 16 for VF 2 For instruction: store i16 %2, i16* %arrayidx7, align 2
 ;
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 4 For instruction: store i16 %0, i16* %arrayidx2, align 2
-; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 41 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
+; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 33 for VF 4 For instruction: store i16 %2, i16* %arrayidx7, align 2
 ;
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 8 For instruction: store i16 %0, i16* %arrayidx2, align 2
-; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 83 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
+; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 68 for VF 8 For instruction: store i16 %2, i16* %arrayidx7, align 2
 ;
 ; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 0 for VF 16 For instruction: store i16 %0, i16* %arrayidx2, align 2
-; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 181 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2
+; ENABLED_MASKED_STRIDED: LV: Found an estimated cost of 152 for VF 16 For instruction: store i16 %2, i16* %arrayidx7, align 2

 define void @test2(i16* noalias nocapture %points, i32 %numPoints, i16* noalias nocapture readonly %x, i16* noalias nocapture readonly %y) {
 entry:
   %cmp15 = icmp sgt i32 %numPoints, 0
   br i1 %cmp15, label %for.body.preheader, label %for.end

 for.body.preheader:
   %wide.trip.count = zext i32 %numPoints to i64
(77 lines not shown)