This is an archive of the discontinued LLVM Phabricator instance.


Differential D40008

[X86][TTI] update costs of interleaved load\store of i64\double
Closed · Public

Authored by magabari on Nov 14 2017, 12:32 AM.

Details

Reviewers

RKSimon
dorit
delena
craig.topper

Commits

rG6e6d5326a13b: [TTI][X86] update costs of interleaved load\store of i64\double
rL318385: [TTI][X86] update costs of interleaved load\store of i64\double

Summary

This patch provides more accurate costs for interleaved load\store of stride 2 for the types int64\double on AVX2.

Diff Detail

Event Timeline

magabari created this revision. Nov 14 2017, 12:32 AM

magabari retitled this revision from [X86][TTI] update costs of interleaved load to [X86][TTI] update costs of interleaved load\store of i64\double. Nov 14 2017, 12:34 AM

magabari edited the summary of this revision.

magabari added reviewers: RKSimon, dorit, delena, craig.topper.

magabari added a subscriber: llvm-commits.

I think it would be nice to make the testcases smaller. Right now you have something like this:
for (…) {
  Dst[2*i] = Dst[2*i] + Src[2*i] * k
  Dst[2*i+1] = Dst[2*i+1] + Src[2*i+1] * k
}
...which actually tests both strided loads and strided stores.
So you could either use one test to check both store and load costs (and even then you probably don't need both a mul and an add just to check memop costs),
or, if you want to separate the load and store cases, the load test could be something like:
for (…) {
  s += Src[2*i]
  s += Src[2*i+1]
}
The store test could be something like:
for (…) {
  Dst[2*i] = k1;
  Dst[2*i+1] = k2;
}
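
As an illustration only, the two reduced kernels suggested above might look like this in C++ source form. The function names, the use of std::vector, and the explicit element count n are assumptions made for the sketch; the review leaves the loop bounds elided and the committed tests are written in LLVM IR, not C++.

```cpp
#include <cstddef>
#include <vector>

// Load-only kernel: reads both members of every stride-2 group and
// accumulates them, so the only memory operations are interleaved loads.
// Assumes src holds at least 2 * n elements.
double sumInterleaved(const std::vector<double> &src, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    s += src[2 * i];
    s += src[2 * i + 1];
  }
  return s;
}

// Store-only kernel: writes both members of every stride-2 group with
// loop-invariant values, so the only memory operations are interleaved stores.
// Assumes dst holds at least 2 * n elements.
void fillInterleaved(std::vector<double> &dst, std::size_t n, double k1,
                     double k2) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[2 * i] = k1;
    dst[2 * i + 1] = k2;
  }
}
```

Splitting the kernels this way keeps each test focused on one memory-op cost, without the mul and add that the original loop carries.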

test/Analysis/CostModel/X86/interleaved-store-i64.ll (inline comment on line 2)
I see some of the interleave tests in this directory use -mcpu=core_avx2 and some use -mcpu=skylake. I wonder which one we want to use?

Fixed dorit's notes.

You missed just one mcpu=skylake :)
LGTM with this change

This revision is now accepted and ready to land. Nov 15 2017, 11:55 PM

Closed by commit rL318385: [TTI][X86] update costs of interleaved load\store of i64\double (authored by magabari). Nov 16 2017, 1:38 AM

This revision was automatically updated to reflect the committed changes.

@RKSimon @magabari I'd like to add some more tuples, but I have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's its cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5, from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

Herald added a project: Restricted Project. Apr 26 2021, 12:48 AM

Herald added a subscriber: pengfei.

In D40008#2715994, @lebedev.ri wrote:

@RKSimon @magabari I'd like to add some more tuples, but I have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's its cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5, from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

I believe they were taken from IACA, probably with a Haswell CPU; a reciprocal throughput from llvm-mca should be similar.

Usually with cost tables we tend to compare numbers from similar-spec CPUs (AVX2: Haswell/Ryzen) and choose the worst.

In D40008#2716332, @RKSimon wrote:

I believe they were taken from IACA, probably with a Haswell CPU; a reciprocal throughput from llvm-mca should be similar.

Usually with cost tables we tend to compare numbers from similar-spec CPUs (AVX2: Haswell/Ryzen) and choose the worst.

I see. So in this case we have:

znver1/2: 4.8 https://godbolt.org/z/W9x6GWdnh https://godbolt.org/z/dx7718YT9 (likely unreliable, awaiting zen3)
haswell/broadwell/skylake: 9 https://godbolt.org/z/bzG17drjn https://godbolt.org/z/frnEfeY6K https://godbolt.org/z/o7jK9M9hK

Therefore for that tuple we choose 9, correct? I'm not seeing any other sched models for CPUs that have AVX2 but not AVX512.

And another question: now that we've established the rules, should I be submitting these changes through review,
or committing them directly? I fear the former would either result in bulky patches that are hard to review,
or saturate the review queue.
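
The rule discussed above (collect a reciprocal-throughput estimate per comparable CPU, then keep the worst) can be sketched as follows. This is a hedged illustration, not code from LLVM: the helper name worstCaseCost is hypothetical, and rounding fractional throughputs up to the next integer is an assumption, since the thread only asks whether 4.8 should become 5.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: given reciprocal-throughput estimates for one
// (stride, type) tuple on several CPUs, return the most pessimistic value
// as an integer table cost. Rounding up is an assumption of this sketch.
int worstCaseCost(const std::vector<double> &rthroughputs) {
  double worst = 0.0;
  for (double r : rthroughputs)
    worst = std::max(worst, r);
  return static_cast<int>(std::ceil(worst));
}
```

With the numbers quoted in the thread, worstCaseCost({4.8, 9.0}) selects the Haswell-family figure of 9 for that tuple.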

lebedev.ri mentioned this in D101924: [X86] Improve costmodel for scalar byte swaps. May 6 2021, 2:02 PM

Revision Contents

Path    Size
lib/Target/X86/X86TargetTransformInfo.cpp    6 lines
test/Analysis/CostModel/X86/interleaved-load-double.ll    45 lines
test/Analysis/CostModel/X86/interleaved-load-i64.ll    45 lines
test/Analysis/CostModel/X86/interleaved-store-double.ll    45 lines
test/Analysis/CostModel/X86/interleaved-store-i64.ll    45 lines
Diff 122791

lib/Target/X86/X86TargetTransformInfo.cpp

int X86TTIImpl::getInterleavedMemoryOpCostAVX2(unsigned Opcode, Type *VecTy,
  // TODO: Complete for other data-types and strides.
  // Each combination of Stride, ElementTy and VF results in a different
  // sequence; The cost tables are therefore accessed with:
  // Factor (stride) and VectorType=VFxElemType.
  // The Cost accounts only for the shuffle sequence;
  // The cost of the loads/stores is accounted for separately.
  //
  static const CostTblEntry AVX2InterleavedLoadTbl[] = {
+   { 2, MVT::v4i64, 6 }, //(load 8i64 and) deinterleave into 2 x 4i64
+   { 2, MVT::v4f64, 6 }, //(load 8f64 and) deinterleave into 2 x 4f64
+
    { 3, MVT::v2i8, 10 }, //(load 6i8 and) deinterleave into 3 x 2i8
    { 3, MVT::v4i8, 4 }, //(load 12i8 and) deinterleave into 3 x 4i8
    { 3, MVT::v8i8, 9 }, //(load 24i8 and) deinterleave into 3 x 8i8
    { 3, MVT::v16i8, 11}, //(load 48i8 and) deinterleave into 3 x 16i8
    { 3, MVT::v32i8, 13}, //(load 96i8 and) deinterleave into 3 x 32i8
    { 3, MVT::v8f32, 17 }, //(load 24f32 and)deinterleave into 3 x 8f32

    { 4, MVT::v2i8, 12 }, //(load 8i8 and) deinterleave into 4 x 2i8
    { 4, MVT::v4i8, 4 }, //(load 16i8 and) deinterleave into 4 x 4i8
    { 4, MVT::v8i8, 20 }, //(load 32i8 and) deinterleave into 4 x 8i8
    { 4, MVT::v16i8, 39 }, //(load 64i8 and) deinterleave into 4 x 16i8
    { 4, MVT::v32i8, 80 }, //(load 128i8 and) deinterleave into 4 x 32i8

    { 8, MVT::v8f32, 40 } //(load 64f32 and)deinterleave into 8 x 8f32
  };

  static const CostTblEntry AVX2InterleavedStoreTbl[] = {
+   { 2, MVT::v4i64, 6 }, //interleave into 2 x 4i64 into 8i64 (and store)
+   { 2, MVT::v4f64, 6 }, //interleave into 2 x 4f64 into 8f64 (and store)
+
    { 3, MVT::v2i8, 7 }, //interleave 3 x 2i8 into 6i8 (and store)
    { 3, MVT::v4i8, 8 }, //interleave 3 x 4i8 into 12i8 (and store)
    { 3, MVT::v8i8, 11 }, //interleave 3 x 8i8 into 24i8 (and store)
    { 3, MVT::v16i8, 11 }, //interleave 3 x 16i8 into 48i8 (and store)
    { 3, MVT::v32i8, 13 }, //interleave 3 x 32i8 into 96i8 (and store)

    { 4, MVT::v2i8, 12 }, //interleave 4 x 2i8 into 8i8 (and store)
    { 4, MVT::v4i8, 9 }, //interleave 4 x 4i8 into 16i8 (and store)
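
The table comments state that each entry covers only the shuffle sequence and that the loads/stores are costed separately. The sketch below shows one way such a table could be consulted and combined with a memory-op cost. Everything in it is a simplified stand-in: the VT enum, the CostEntry struct, and both functions are hypothetical, not the LLVM types. The reading that the tests' expected cost of 8 for VF 4 comes from the shuffle cost 6 plus two 256-bit memory operations at cost 1 each is an assumption consistent with the numbers in this revision, not something the visible diff states.

```cpp
#include <cstddef>

// Simplified stand-ins for LLVM's value types and cost-table entries.
enum class VT { v4i64, v4f64, v8f32 };

struct CostEntry {
  unsigned Factor;      // interleave factor (stride)
  VT Type;              // per-member vector type, VF x element type
  unsigned ShuffleCost; // cost of the shuffle sequence only
};

// A few entries mirroring the load table above.
static const CostEntry LoadTbl[] = {
    {2, VT::v4i64, 6},
    {2, VT::v4f64, 6},
    {3, VT::v8f32, 17},
};

// Linear search keyed on (Factor, Type); returns nullptr when no entry exists.
const CostEntry *lookupShuffleCost(unsigned Factor, VT Type) {
  for (const CostEntry &E : LoadTbl)
    if (E.Factor == Factor && E.Type == Type)
      return &E;
  return nullptr;
}

// Total cost = shuffle cost from the table + separately costed memory ops.
// Returns -1 when the tuple is missing; a real cost model would fall back
// to a generic estimate instead (assumption of this sketch).
int interleavedLoadCost(unsigned Factor, VT Type, unsigned NumMemOps,
                        unsigned MemOpCost) {
  if (const CostEntry *E = lookupShuffleCost(Factor, Type))
    return static_cast<int>(E->ShuffleCost + NumMemOps * MemOpCost);
  return -1;
}
```

Under those assumptions, interleavedLoadCost(2, VT::v4i64, 2, 1) yields 8, matching the value the added tests check for.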

test/Analysis/CostModel/X86/interleaved-load-double.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -S -loop-vectorize -debug-only=loop-vectorize -mcpu=skylake %s 2>&1 | FileCheck %s
				target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
				target triple = "i386-unknown-linux-gnu"

				@doublesrc = common local_unnamed_addr global [120 x double] zeroinitializer, align 4
				@doubledst = common local_unnamed_addr global [120 x double] zeroinitializer, align 4

				; Function Attrs: norecurse nounwind
				define void @stride2double(double %k, i32 %width_) {
				entry:

				; CHECK: Found an estimated cost of 8 for VF 4 For instruction: %0 = load double

				%cmp27 = icmp sgt i32 %width_, 0
				br i1 %cmp27, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body.lr.ph, %for.body
				%i.028 = phi i32 [ 0, %for.body.lr.ph ], [ %add16, %for.body ]
				%arrayidx = getelementptr inbounds [120 x double], [120 x double]* @doublesrc, i32 0, i32 %i.028
				%0 = load double, double* %arrayidx, align 4
				%mul = fmul fast double %0, %k
				%arrayidx2 = getelementptr inbounds [120 x double], [120 x double]* @doubledst, i32 0, i32 %i.028
				%1 = load double, double* %arrayidx2, align 4
				%add3 = fadd fast double %1, %mul
				store double %add3, double* %arrayidx2, align 4
				%add4 = add nuw nsw i32 %i.028, 1
				%arrayidx5 = getelementptr inbounds [120 x double], [120 x double]* @doublesrc, i32 0, i32 %add4
				%2 = load double, double* %arrayidx5, align 4
				%mul6 = fmul fast double %2, %k
				%arrayidx8 = getelementptr inbounds [120 x double], [120 x double]* @doubledst, i32 0, i32 %add4
				%3 = load double, double* %arrayidx8, align 4
				%add9 = fadd fast double %3, %mul6
				store double %add9, double* %arrayidx8, align 4
				%add16 = add nuw nsw i32 %i.028, 2
				%cmp = icmp slt i32 %add16, %width_
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

test/Analysis/CostModel/X86/interleaved-load-i64.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -S -loop-vectorize -debug-only=loop-vectorize -mcpu=skylake %s 2>&1 | FileCheck %s
				target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
				target triple = "i386-unknown-linux-gnu"

				@i64src = common local_unnamed_addr global [120 x i64] zeroinitializer, align 4
				@i64dst = common local_unnamed_addr global [120 x i64] zeroinitializer, align 4

				; Function Attrs: norecurse nounwind
				define void @stride2i64(i64 %k, i32 %width_) {
				entry:

				; CHECK: Found an estimated cost of 8 for VF 4 For instruction: %0 = load i64

				%cmp27 = icmp sgt i32 %width_, 0
				br i1 %cmp27, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body:
				%i.028 = phi i32 [ 0, %for.body.lr.ph ], [ %add16, %for.body ]
				%arrayidx = getelementptr inbounds [120 x i64], [120 x i64]* @i64src, i32 0, i32 %i.028
				%0 = load i64, i64* %arrayidx, align 4
				%mul = mul i64 %0, %k
				%arrayidx2 = getelementptr inbounds [120 x i64], [120 x i64]* @i64dst, i32 0, i32 %i.028
				%1 = load i64, i64* %arrayidx2, align 4
				%add3 = add i64 %1, %mul
				store i64 %add3, i64* %arrayidx2, align 4
				%add4 = add nuw nsw i32 %i.028, 1
				%arrayidx5 = getelementptr inbounds [120 x i64], [120 x i64]* @i64src, i32 0, i32 %add4
				%2 = load i64, i64* %arrayidx5, align 4
				%mul6 = mul i64 %2, %k
				%arrayidx8 = getelementptr inbounds [120 x i64], [120 x i64]* @i64dst, i32 0, i32 %add4
				%3 = load i64, i64* %arrayidx8, align 4
				%add9 = add i64 %3, %mul6
				store i64 %add9, i64* %arrayidx8, align 4
				%add16 = add nuw nsw i32 %i.028, 2
				%cmp = icmp slt i32 %add16, %width_
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

test/Analysis/CostModel/X86/interleaved-store-double.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -S -loop-vectorize -debug-only=loop-vectorize -mcpu=skylake %s 2>&1 | FileCheck %s
				target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
				target triple = "i386-unknown-linux-gnu"

				@doublesrc = common local_unnamed_addr global [120 x double] zeroinitializer, align 4
				@doubledst = common local_unnamed_addr global [120 x double] zeroinitializer, align 4

				; Function Attrs: norecurse nounwind
				define void @stride2double(double %k, i32 %width_) {
				entry:

				; CHECK: Found an estimated cost of 8 for VF 4 For instruction: store double

				%cmp27 = icmp sgt i32 %width_, 0
				br i1 %cmp27, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body: ; preds = %for.body.lr.ph, %for.body
				%i.028 = phi i32 [ 0, %for.body.lr.ph ], [ %add16, %for.body ]
				%arrayidx = getelementptr inbounds [120 x double], [120 x double]* @doublesrc, i32 0, i32 %i.028
				%0 = load double, double* %arrayidx, align 4
				%mul = fmul fast double %0, %k
				%arrayidx2 = getelementptr inbounds [120 x double], [120 x double]* @doubledst, i32 0, i32 %i.028
				%1 = load double, double* %arrayidx2, align 4
				%add3 = fadd fast double %1, %mul
				store double %add3, double* %arrayidx2, align 4
				%add4 = add nuw nsw i32 %i.028, 1
				%arrayidx5 = getelementptr inbounds [120 x double], [120 x double]* @doublesrc, i32 0, i32 %add4
				%2 = load double, double* %arrayidx5, align 4
				%mul6 = fmul fast double %2, %k
				%arrayidx8 = getelementptr inbounds [120 x double], [120 x double]* @doubledst, i32 0, i32 %add4
				%3 = load double, double* %arrayidx8, align 4
				%add9 = fadd fast double %3, %mul6
				store double %add9, double* %arrayidx8, align 4
				%add16 = add nuw nsw i32 %i.028, 2
				%cmp = icmp slt i32 %add16, %width_
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

test/Analysis/CostModel/X86/interleaved-store-i64.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -S -loop-vectorize -debug-only=loop-vectorize -mcpu=skylake %s 2>&1 | FileCheck %s
				target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
				target triple = "i386-unknown-linux-gnu"

				@i64src = common local_unnamed_addr global [120 x i64] zeroinitializer, align 4
				@i64dst = common local_unnamed_addr global [120 x i64] zeroinitializer, align 4

				; Function Attrs: norecurse nounwind
				define void @stride2i64(i64 %k, i32 %width_) {
				entry:

				; CHECK: Found an estimated cost of 8 for VF 4 For instruction: store i64

				%cmp27 = icmp sgt i32 %width_, 0
				br i1 %cmp27, label %for.body.lr.ph, label %for.cond.cleanup

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.cond.cleanup: ; preds = %for.body, %entry
				ret void

				for.body:
				%i.028 = phi i32 [ 0, %for.body.lr.ph ], [ %add16, %for.body ]
				%arrayidx = getelementptr inbounds [120 x i64], [120 x i64]* @i64src, i32 0, i32 %i.028
				%0 = load i64, i64* %arrayidx, align 4
				%mul = mul i64 %0, %k
				%arrayidx2 = getelementptr inbounds [120 x i64], [120 x i64]* @i64dst, i32 0, i32 %i.028
				%1 = load i64, i64* %arrayidx2, align 4
				%add3 = add i64 %1, %mul
				store i64 %add3, i64* %arrayidx2, align 4
				%add4 = add nuw nsw i32 %i.028, 1
				%arrayidx5 = getelementptr inbounds [120 x i64], [120 x i64]* @i64src, i32 0, i32 %add4
				%2 = load i64, i64* %arrayidx5, align 4
				%mul6 = mul i64 %2, %k
				%arrayidx8 = getelementptr inbounds [120 x i64], [120 x i64]* @i64dst, i32 0, i32 %add4
				%3 = load i64, i64* %arrayidx8, align 4
				%add9 = add i64 %3, %mul6
				store i64 %add9, i64* %arrayidx8, align 4
				%add16 = add nuw nsw i32 %i.028, 2
				%cmp = icmp slt i32 %add16, %width_
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}
