This is an archive of the discontinued LLVM Phabricator instance.


Differential D111174

[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
ClosedPublic

Authored by lebedev.ri on Oct 5 2021, 11:55 AM.

Details

Reviewers

RKSimon

Commits

rG3d7bf6625a6e: [X86][Costmodel] Improve cost modelling for not-fully-interleaved load

Summary

While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, an interleaved load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with the 0'th vector containing elements [0, VF)*stride + 0,
the 1'th vector containing elements [0, VF)*stride + 1, and so on.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)
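
The definition above can be sketched in scalar C++ roughly as follows. This is only an illustration of the element mapping, not code from the patch; the function name `deinterleave` and its parameters are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a flat buffer of stride*vf elements into `stride` vectors of `vf`
// elements each, so that result[k][i] == wide[i * stride + k].
template <typename T>
std::vector<std::vector<T>> deinterleave(const std::vector<T> &wide,
                                         std::size_t stride, std::size_t vf) {
  assert(wide.size() == stride * vf && "expected exactly stride*vf elements");
  std::vector<std::vector<T>> result(stride, std::vector<T>(vf));
  for (std::size_t k = 0; k < stride; ++k)   // which output vector
    for (std::size_t i = 0; i < vf; ++i)     // which lane of that vector
      result[k][i] = wide[i * stride + k];
  return result;
}
```

For stride 4 and VF 2, eight consecutive elements 0..7 would be split into {0, 4}, {1, 5}, {2, 6} and {3, 7}.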

Now, a not-fully-interleaved load is one where not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for a not-fully-interleaved group should be no greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)
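
A minimal sketch of that conservative bound, assuming a simplified table keyed on the stride only (the real lookup in X86TargetTransformInfo.cpp is also keyed on the vector type). `FullGroupCost`, `conservativeLoadCost` and `fallbackCost` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one cost-table row: the cost of a
// fully-interleaved load group with the given stride (factor).
struct FullGroupCost {
  unsigned factor;
  unsigned cost;
};

// Conservative estimate: ignore how many members are demanded and charge the
// cost of the fully-interleaved group with the same factor. If the table has
// no entry for this factor, return the caller-supplied fallback estimate.
inline unsigned conservativeLoadCost(const std::vector<FullGroupCost> &table,
                                     unsigned factor, std::size_t numDemanded,
                                     unsigned fallbackCost) {
  assert(numDemanded >= 1 && numDemanded <= factor &&
         "a group demands between 1 and `factor` members");
  for (const FullGroupCost &entry : table)
    if (entry.factor == factor)
      return entry.cost;
  return fallbackCost;
}
```

With the i64 stride 4 VF 2 figure quoted above (full cost 6), every partial group is charged 6, which is no less than the 4, 2 and 1 quoted for the partial groups.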

As we have established over the course of the last ~70 patches (wow),
BaseT::getInterleavedMemoryOpCost() is absolutely bogus:
it usually overestimates by almost an order of magnitude,
so i would claim that we should at least use the hardcoded costs
of fully-interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.

Thoughts?
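
For illustration only, one possible shape of the discount mentioned above is a linear scaling by the number of demanded indices, rounded up. This is not what the patch implements; `discountedLoadCost` is a hypothetical helper, and the rounding choice is an assumption.

```cpp
#include <cassert>

// Hypothetical discount, not part of the patch: scale the fully-interleaved
// cost by numDemanded / factor, rounding up so the estimate errs high.
inline unsigned discountedLoadCost(unsigned fullCost, unsigned numDemanded,
                                   unsigned factor) {
  assert(factor != 0 && numDemanded >= 1 && numDemanded <= factor &&
         "a group demands between 1 and `factor` members");
  return (fullCost * numDemanded + factor - 1) / factor;
}
```

For the i64 stride 4 VF 2 case from the summary (full cost 6), this gives 5, 3 and 2 for three, two and one demanded indices, against the quoted 4, 2 and 1. It stays above the quoted costs for that one tuple; the sketch says nothing about whether that holds elsewhere.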

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri created this revision. Oct 5 2021, 11:55 AM

Herald added subscribers: pengfei, hiraditya. Oct 5 2021, 11:55 AM

lebedev.ri requested review of this revision. Oct 5 2021, 11:55 AM

lebedev.ri added a subscriber: jeroen.dobbelaere.

lebedev.ri added inline comments.

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12	@jeroen.dobbelaere this test broke. any suggestions how it can be made less fragile? :)

jeroen.dobbelaere added inline comments. Oct 5 2021, 12:28 PM

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12	Any idea on how to convince (force) loop-vectorize to do the vectorization ?

lebedev.ri added inline comments. Oct 5 2021, 12:30 PM

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12	Oh it did vectorize alright, it just decided to do the interleaved load instead of gather.

Avoid testcase regression.

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll
12	Never mind, got it.

lebedev.ri edited the summary of this revision. Oct 5 2021, 1:07 PM

Harbormaster completed remote builds in B127139: Diff 377332. Oct 5 2021, 1:25 PM

Matt added a subscriber: Matt. Oct 6 2021, 9:13 AM

Let's deal with D111220 first, i think that one is rather straight-forward (and the costs are obviously correct & won't require further fixes).

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

In D111174#3061110, @lebedev.ri wrote:

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

Yes, it needs rebasing after D111822 but LGTM for committal.

This revision is now accepted and ready to land. Oct 14 2021, 12:36 PM

In D111174#3065041, @RKSimon wrote:

In D111174#3061110, @lebedev.ri wrote:

While i would love/prefer to get D111546+D111460 merged first, i suppose this doesn't strictly have to wait for that.

@RKSimon does this seem like an okay intermediate step before we decide on discounting non-fully-interleaved groups?

Yes, it needs rebasing after D111822 but LGTM for committal.

Thank you for the review!

This revision was landed with ongoing or failed builds. Oct 14 2021, 1:15 PM

Closed by commit rG3d7bf6625a6e: [X86][Costmodel] Improve cost modelling for not-fully-interleaved load (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri added a commit: rG3d7bf6625a6e: [X86][Costmodel] Improve cost modelling for not-fully-interleaved load.

Revision Contents

Path                                                                                   Size
llvm/lib/Target/X86/X86TargetTransformInfo.cpp                                         10 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll           8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll          8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll          8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll         8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll         8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll         8 lines
llvm/test/Transforms/LoopVectorize/X86/pr48340.ll                                      23 lines

Diff 379814

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

(5,159 lines not shown)
     return getInterleavedMemoryOpCostAVX512(
         AddressSpace, CostKind, UseMaskForCond, UseMaskForGaps);

   // Get estimation for interleaved load/store operations for SSE-AVX2.
   // As opposed to AVX-512, SSE-AVX2 do not have generic shuffles that allow
   // computing the cost using a generic formula as a function of generic
   // shuffles. We therefore use a lookup table instead, filled according to
   // the instruction sequences that codegen currently generates.

-  // We currently support only fully-interleaved groups, with no gaps.
-  // TODO: Support also strided loads (interleaved-groups with gaps).
-  if (Indices.size() && Indices.size() != Factor)
-    return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
-                                             Alignment, AddressSpace, CostKind);
-
   // VecTy for interleave memop is <VF*Factor x Elt>.
   // So, for VF=4, Interleave Factor = 3, Element type = i32 we have
   // VecTy = <12 x i32>.
   MVT LegalVT = getTLI()->getTypeLegalizationCost(DL, VecTy).second;

   // This function can be called with VecTy=<6xi128>, Factor=3, in which case
   // the VF=2, while v2i128 is an unsupported MVT vector type
   // (see MachineValueType.h::getVectorVT()).
(199 lines not shown)
   static const CostTblEntry AVX2InterleavedStoreTbl[] = {
       {6, MVT::v16i32, 66}, // interleave 6 x 16i32 into 96i32 (and store)

       {6, MVT::v2i64, 8}, // interleave 6 x 2i64 into 12i64 (and store)
       {6, MVT::v4i64, 15}, // interleave 6 x 4i64 into 24i64 (and store)
       {6, MVT::v8i64, 30}, // interleave 6 x 8i64 into 48i64 (and store)
   };

   if (Opcode == Instruction::Load) {
+    // FIXME: if we have a partially-interleaved groups, with gaps,
+    //        should we discount the not-demanded indicies?
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedLoadTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;
   } else {
     assert(Opcode == Instruction::Store &&
            "Expected Store Instruction at this point");
+    assert((!Indices.size() || Indices.size() == Factor) &&
+           "Interleaved store only supports fully-interleaved groups.");
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedStoreTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;
   }

   return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                            Alignment, AddressSpace, CostKind,
                                            UseMaskForCond, UseMaskForGaps);
 }

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 11 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 24 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 48 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 11 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 24 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 48 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 12 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 2 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 13 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 50 for VF 64 For instruction: %v0 = load i32, i32* %in0, align 4
(30 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 12 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 21 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 47 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 94 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 12 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 21 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 47 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 94 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 5 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 10 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 20 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 5 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 9 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 36 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 144 for VF 64 For instruction: %v0 = load i32, i32* %in0, align 4
(34 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 7 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 11 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 25 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 50 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 7 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 11 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 25 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 50 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 5 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 10 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 20 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 2 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 3 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 21 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 78 for VF 64 For instruction: %v0 = load i32, i32* %in0, align 4
(31 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 16 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 32 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 70 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 140 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 16 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 32 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 70 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 140 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 17 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 71 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
(37 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 11 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 22 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 48 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 96 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 11 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 22 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 48 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 96 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 5 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 13 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 50 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 160 for VF 64 For instruction: %v0 = load i32, i32* %in0, align 4
(35 lines not shown)

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll

(18 lines not shown)
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 12 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 26 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 52 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 12 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 26 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX2: LV: Found an estimated cost of 52 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 20 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX2: LV: Found an estimated cost of 40 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX512: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 1 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 2 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 5 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 29 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX512: LV: Found an estimated cost of 80 for VF 64 For instruction: %v0 = load i32, i32* %in0, align 4
(32 lines not shown)

llvm/test/Transforms/LoopVectorize/X86/pr48340.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -loop-vectorize --force-vector-width=4 --force-vector-interleave=0 -S -o - < %s \| FileCheck %s			; RUN: opt -loop-vectorize --force-vector-width=4 --force-vector-interleave=0 -S -o - < %s \| FileCheck %s

	target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	%0 = type { i32 }			%0 = type { i32 }
	%1 = type { i64 }			%1 = type { i64 }

	define void @foo(i64* %p, i64* %p.last) unnamed_addr #0 {			define void @foo(i64* %p, i64* %p.last) unnamed_addr #0 {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: [[WIDE_MASKED_GATHER0:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP5:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)			; CHECK: [[WIDE_MASKED_GATHER:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP11:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)
	lebedev.ri (Author): @jeroen.dobbelaere this test broke. any suggestions how it can be made less fragile? :)
	jeroen.dobbelaere: Any idea on how to convince (force) loop-vectorize to do the vectorization ?
	lebedev.ri (Author): Oh it did vectorize alright, it just decided to do the interleaved load instead of gather.
	lebedev.ri (Author): Never mind, got it.
	; CHECK-NEXT: [[WIDE_MASKED_GATHER1:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP6:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER5:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP12:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)
	; CHECK-NEXT: [[WIDE_MASKED_GATHER2:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP7:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER6:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP13:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)
	; CHECK-NEXT: [[WIDE_MASKED_GATHER3:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP8:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER7:%.*]] = call <4 x %0*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.0(<4 x %0**> [[TMP14:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %0*> undef)
				;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%p2 = phi i64* [ %p, %entry ], [ %p.inc, %loop ]			%p2 = phi i64* [ %p, %entry ], [ %p.inc, %loop ]
	%p.inc = getelementptr inbounds i64, i64* %p2, i64 2			%p.inc = getelementptr inbounds i64, i64* %p2, i64 4
	%p3 = bitcast i64* %p2 to %0**			%p3 = bitcast i64* %p2 to %0**
	%v = load %0*, %0** %p3, align 8			%v = load %0*, %0** %p3, align 8
	%b = icmp eq i64* %p.inc, %p.last			%b = icmp eq i64* %p.inc, %p.last
	br i1 %b, label %exit, label %loop			br i1 %b, label %exit, label %loop

	exit:			exit:
	ret void			ret void
	}			}

	define void @bar(i64* %p, i64* %p.last) unnamed_addr #0 {			define void @bar(i64* %p, i64* %p.last) unnamed_addr #0 {
	; CHECK-LABEL: @bar(			; CHECK-LABEL: @bar(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: [[WIDE_MASKED_GATHER0:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP5:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)			; CHECK: [[WIDE_MASKED_GATHER:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP11:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)
	; CHECK-NEXT: [[WIDE_MASKED_GATHER1:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP6:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER5:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP12:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)
	; CHECK-NEXT: [[WIDE_MASKED_GATHER2:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP7:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER6:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP13:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)
	; CHECK-NEXT: [[WIDE_MASKED_GATHER3:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP8:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)			; CHECK-NEXT: [[WIDE_MASKED_GATHER7:%.*]] = call <4 x %1*> @llvm.masked.gather.v4p0s_s.v4p0p0s_s.1(<4 x %1**> [[TMP14:%.*]], i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x %1*> undef)
				;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%p2 = phi i64* [ %p, %entry ], [ %p.inc, %loop ]			%p2 = phi i64* [ %p, %entry ], [ %p.inc, %loop ]
	%p.inc = getelementptr inbounds i64, i64* %p2, i64 2			%p.inc = getelementptr inbounds i64, i64* %p2, i64 4
	%p3 = bitcast i64* %p2 to %1**			%p3 = bitcast i64* %p2 to %1**
	%v = load %1*, %1** %p3, align 8			%v = load %1*, %1** %p3, align 8
	%b = icmp eq i64* %p.inc, %p.last			%b = icmp eq i64* %p.inc, %p.last
	br i1 %b, label %exit, label %loop			br i1 %b, label %exit, label %loop

	exit:			exit:
	ret void			ret void
	}			}

	attributes #0 = { "target-cpu"="skylake" }			attributes #0 = { "target-cpu"="skylake" }

Diff 379814
