This is an archive of the discontinued LLVM Phabricator instance.

Differential D111941

[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved store costs
Closed, Public

Authored by RKSimon on Oct 16 2021, 9:59 AM.

Details

Reviewers

lebedev.ri

Commits

rGf04133815360: [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
rGc850d5c5c8a1: [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs

Summary

These all expand to 1 or 2 UNPCK shuffle ops. AVX1/AVX2 sometimes expand to a subvector-concat + permute pattern instead, but the costs turn out to be very similar, so move the entries from the AVX2 cost table to the SSE2 cost table.
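
To make the shuffle pattern concrete, here is a minimal scalar sketch of a stride-2 interleave. It is not code from this revision: `interleave2` is a hypothetical helper that models the element order only. For two 8 x i8 inputs, the result is the byte order a single PUNPCKLBW of the two sources produces.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a stride-2 interleave: out = {a0, b0, a1, b1, ...}.
// Hypothetical helper for illustration; it describes element order only,
// not the instructions the backend actually selects.
template <typename T, std::size_t N>
std::array<T, 2 * N> interleave2(const std::array<T, N> &a,
                                 const std::array<T, N> &b) {
  std::array<T, 2 * N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    out[2 * i] = a[i];     // even slots come from the first source
    out[2 * i + 1] = b[i]; // odd slots come from the second source
  }
  return out;
}
```

Calling `interleave2` on two `std::array<std::uint8_t, 8>` values corresponds to the "interleave 2 x 8i8 into 16i8" table entry in the diff.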

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RKSimon created this revision. Oct 16 2021, 9:59 AM

Herald added subscribers: pengfei, hiraditya. Oct 16 2021, 9:59 AM

RKSimon requested review of this revision. Oct 16 2021, 9:59 AM

Herald added a project: Restricted Project. Oct 16 2021, 9:59 AM

Harbormaster completed remote builds in B129197: Diff 380193. Oct 16 2021, 10:45 AM

RKSimon retitled this revision from "[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved load costs" to "[X86][Costmodel] Add SSE2 sub-128bit vXi8/16/32 and 128/256-bit vXi32/64 stride 2 interleaved store costs". Oct 17 2021, 8:21 AM

I suppose this is a better ballpark, but I'm not really sold on the i64/i32-vf4 part.

llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-2.ll
Line 13: LG
llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-2.ll
Line 13 (on Diff #380193): `@store_i32_stride2_vf4`'s codegen looks really different; I'm not sure this is right: https://godbolt.org/z/dojn9enWK https://godbolt.org/z/zfEPrYovd
llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-2.ll
Line 13 (on Diff #380193): The codegen for both `@store_i64_stride2_vf2` and `@store_i64_stride2_vf4` looks really different.
llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-2.ll
Line 13: LG

This revision is now accepted and ready to land. Oct 17 2021, 10:13 AM

I'll just address the i8/i16 cases first

This revision was landed with ongoing or failed builds. Oct 18 2021, 5:54 AM

Closed by commit rGc850d5c5c8a1: [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs (authored by RKSimon).

This revision was automatically updated to reflect the committed changes.

RKSimon added a commit: rGc850d5c5c8a1: [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs.

RKSimon added a commit: rGf04133815360: [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86TargetTransformInfo.cpp	19 lines
llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-2.ll	8 lines
llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-2.ll	12 lines

Diff 380359

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

@@ 5,312 lines not shown @@ static const CostTblEntry SSE2InterleavedLoadTbl[] = {
       {2, MVT::v2i32, 2}, // (load 4i32 and) deinterleave into 2 x 2i32
       {2, MVT::v4i32, 2}, // (load 8i32 and) deinterleave into 2 x 4i32

       {2, MVT::v2i64, 2}, // (load 4i64 and) deinterleave into 2 x 2i64
   };

   static const CostTblEntry AVX2InterleavedStoreTbl[] = {
-      {2, MVT::v2i8, 1}, // interleave 2 x 2i8 into 4i8 (and store)
-      {2, MVT::v4i8, 1}, // interleave 2 x 4i8 into 8i8 (and store)
-      {2, MVT::v8i8, 1}, // interleave 2 x 8i8 into 16i8 (and store)
       {2, MVT::v16i8, 3}, // interleave 2 x 16i8 into 32i8 (and store)
       {2, MVT::v32i8, 4}, // interleave 2 x 32i8 into 64i8 (and store)

-      {2, MVT::v2i16, 1}, // interleave 2 x 2i16 into 4i16 (and store)
-      {2, MVT::v4i16, 1}, // interleave 2 x 4i16 into 8i16 (and store)
       {2, MVT::v8i16, 3}, // interleave 2 x 8i16 into 16i16 (and store)
       {2, MVT::v16i16, 4}, // interleave 2 x 16i16 into 32i16 (and store)
       {2, MVT::v32i16, 8}, // interleave 2 x 32i16 into 64i16 (and store)

       {2, MVT::v2i32, 1}, // interleave 2 x 2i32 into 4i32 (and store)
       {2, MVT::v4i32, 2}, // interleave 2 x 4i32 into 8i32 (and store)
       {2, MVT::v8i32, 4}, // interleave 2 x 8i32 into 16i32 (and store)
       {2, MVT::v16i32, 8}, // interleave 2 x 16i32 into 32i32 (and store)
@@ 68 lines not shown @@ static const CostTblEntry AVX2InterleavedStoreTbl[] = {
       {6, MVT::v8i32, 33}, // interleave 6 x 8i32 into 48i32 (and store)
       {6, MVT::v16i32, 66}, // interleave 6 x 16i32 into 96i32 (and store)

       {6, MVT::v2i64, 8}, // interleave 6 x 2i64 into 12i64 (and store)
       {6, MVT::v4i64, 15}, // interleave 6 x 4i64 into 24i64 (and store)
       {6, MVT::v8i64, 30}, // interleave 6 x 8i64 into 48i64 (and store)
   };

+  static const CostTblEntry SSE2InterleavedStoreTbl[] = {
+      {2, MVT::v2i8, 1}, // interleave 2 x 2i8 into 4i8 (and store)
+      {2, MVT::v4i8, 1}, // interleave 2 x 4i8 into 8i8 (and store)
+      {2, MVT::v8i8, 1}, // interleave 2 x 8i8 into 16i8 (and store)
+
+      {2, MVT::v2i16, 1}, // interleave 2 x 2i16 into 4i16 (and store)
+      {2, MVT::v4i16, 1}, // interleave 2 x 4i16 into 8i16 (and store)
+  };
+
   if (Opcode == Instruction::Load) {
     // FIXME: if we have a partially-interleaved groups, with gaps,
     //        should we discount the not-demanded indicies?
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedLoadTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;

@@ 10 lines not shown @@ if (Opcode == Instruction::Load) {
     assert(Opcode == Instruction::Store &&
            "Expected Store Instruction at this point");
     assert((!Indices.size() || Indices.size() == Factor) &&
            "Interleaved store only supports fully-interleaved groups.");
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedStoreTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;
+
+    if (ST->hasSSE2())
+      if (const auto *Entry = CostTableLookup(SSE2InterleavedStoreTbl, Factor,
+                                              ETy.getSimpleVT()))
+        return MemOpCosts + Entry->Cost;
   }

   return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                            Alignment, AddressSpace, CostKind,
                                            UseMaskForCond, UseMaskForGaps);
 }
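
The store path above falls through from the AVX2 table to the new SSE2 table, which is why moving the sub-128-bit entries does not change the AVX2 results. A self-contained sketch of that fall-through, under stated assumptions: `VT`, `Entry`, `lookup` and `interleavedStoreCost` are hypothetical stand-ins for `MVT`, `CostTblEntry`, `CostTableLookup` and the X86 hook, and only the stride-2 i8/i16 entries visible in the diff are reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-ins for LLVM's MVT and CostTblEntry (illustration only).
enum class VT { v2i8, v4i8, v8i8, v16i8, v2i16, v4i16, v8i16 };

struct Entry {
  unsigned Factor;
  VT Type;
  unsigned Cost;
};

// Subset of the stride-2 store entries as they stand after this change.
constexpr Entry AVX2StoreTbl[] = {
    {2, VT::v16i8, 3},
    {2, VT::v8i16, 3},
};
constexpr Entry SSE2StoreTbl[] = {
    {2, VT::v2i8, 1},  {2, VT::v4i8, 1}, {2, VT::v8i8, 1},
    {2, VT::v2i16, 1}, {2, VT::v4i16, 1},
};

// Linear search for a (Factor, Type) match; nullptr when there is none.
template <std::size_t N>
const Entry *lookup(const Entry (&Tbl)[N], unsigned Factor, VT Ty) {
  for (const Entry &E : Tbl)
    if (E.Factor == Factor && E.Type == Ty)
      return &E;
  return nullptr;
}

// Tries the AVX2 table first, then the SSE2 table. std::nullopt models
// deferring to the generic BaseT implementation.
std::optional<unsigned> interleavedStoreCost(bool HasAVX2, bool HasSSE2,
                                             unsigned Factor, VT Ty,
                                             unsigned MemOpCosts) {
  assert(Factor >= 2 && "an interleave group needs at least two members");
  if (HasAVX2)
    if (const Entry *E = lookup(AVX2StoreTbl, Factor, Ty))
      return MemOpCosts + E->Cost;
  if (HasSSE2)
    if (const Entry *E = lookup(SSE2StoreTbl, Factor, Ty))
      return MemOpCosts + E->Cost;
  return std::nullopt;
}
```

With `MemOpCosts` taken as 1 (an assumption, chosen because the test expectations report a cost of 2 for entries whose table cost is 1), `interleavedStoreCost(true, true, 2, VT::v4i16, 1)` and `interleavedStoreCost(false, true, 2, VT::v4i16, 1)` both yield 2, while `VT::v8i16` without AVX2 yields no table hit and is left to the generic model.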

llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-2.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x i8] zeroinitializer, align 128
 @B = global [1024 x i16] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
[inline comment] lebedev.ri: LG
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %v1, i16* %out1, align 2
-; SSE2: LV: Found an estimated cost of 9 for VF 2 For instruction: store i16 %v1, i16* %out1, align 2
-; SSE2: LV: Found an estimated cost of 17 for VF 4 For instruction: store i16 %v1, i16* %out1, align 2
+; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %v1, i16* %out1, align 2
+; SSE2: LV: Found an estimated cost of 2 for VF 4 For instruction: store i16 %v1, i16* %out1, align 2
 ; SSE2: LV: Found an estimated cost of 34 for VF 8 For instruction: store i16 %v1, i16* %out1, align 2
 ; SSE2: LV: Found an estimated cost of 68 for VF 16 For instruction: store i16 %v1, i16* %out1, align 2
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %v1, i16* %out1, align 2
-; AVX1: LV: Found an estimated cost of 9 for VF 2 For instruction: store i16 %v1, i16* %out1, align 2
-; AVX1: LV: Found an estimated cost of 17 for VF 4 For instruction: store i16 %v1, i16* %out1, align 2
+; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %v1, i16* %out1, align 2
+; AVX1: LV: Found an estimated cost of 2 for VF 4 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX1: LV: Found an estimated cost of 35 for VF 8 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX1: LV: Found an estimated cost of 86 for VF 16 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX1: LV: Found an estimated cost of 172 for VF 32 For instruction: store i16 %v1, i16* %out1, align 2
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX2: LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX2: LV: Found an estimated cost of 2 for VF 4 For instruction: store i16 %v1, i16* %out1, align 2
 ; AVX2: LV: Found an estimated cost of 4 for VF 8 For instruction: store i16 %v1, i16* %out1, align 2
@@ 44 lines not shown @@

llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-2.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x i8] zeroinitializer, align 128
 @B = global [1024 x i8] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
[inline comment] lebedev.ri: LG
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: store i8 %v1, i8* %out1, align 1
-; SSE2: LV: Found an estimated cost of 14 for VF 2 For instruction: store i8 %v1, i8* %out1, align 1
-; SSE2: LV: Found an estimated cost of 30 for VF 4 For instruction: store i8 %v1, i8* %out1, align 1
-; SSE2: LV: Found an estimated cost of 62 for VF 8 For instruction: store i8 %v1, i8* %out1, align 1
+; SSE2: LV: Found an estimated cost of 2 for VF 2 For instruction: store i8 %v1, i8* %out1, align 1
+; SSE2: LV: Found an estimated cost of 2 for VF 4 For instruction: store i8 %v1, i8* %out1, align 1
+; SSE2: LV: Found an estimated cost of 2 for VF 8 For instruction: store i8 %v1, i8* %out1, align 1
 ; SSE2: LV: Found an estimated cost of 126 for VF 16 For instruction: store i8 %v1, i8* %out1, align 1
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: store i8 %v1, i8* %out1, align 1
-; AVX1: LV: Found an estimated cost of 9 for VF 2 For instruction: store i8 %v1, i8* %out1, align 1
-; AVX1: LV: Found an estimated cost of 17 for VF 4 For instruction: store i8 %v1, i8* %out1, align 1
-; AVX1: LV: Found an estimated cost of 33 for VF 8 For instruction: store i8 %v1, i8* %out1, align 1
+; AVX1: LV: Found an estimated cost of 2 for VF 2 For instruction: store i8 %v1, i8* %out1, align 1
+; AVX1: LV: Found an estimated cost of 2 for VF 4 For instruction: store i8 %v1, i8* %out1, align 1
+; AVX1: LV: Found an estimated cost of 2 for VF 8 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX1: LV: Found an estimated cost of 67 for VF 16 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX1: LV: Found an estimated cost of 166 for VF 32 For instruction: store i8 %v1, i8* %out1, align 1

 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX2: LV: Found an estimated cost of 2 for VF 2 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX2: LV: Found an estimated cost of 2 for VF 4 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX2: LV: Found an estimated cost of 2 for VF 8 For instruction: store i8 %v1, i8* %out1, align 1
 ; AVX2: LV: Found an estimated cost of 4 for VF 16 For instruction: store i8 %v1, i8* %out1, align 1
@@ 41 lines not shown @@