This is an archive of the discontinued LLVM Phabricator instance.

Differential D92208

[AArch64][CostModel] Fixed costs for mul <2 x i64>
Closed, Public

Authored by SjoerdMeijer on Nov 27 2020, 12:12 AM.

Details

Reviewers

fhahn
dmgreen
sanwou01
NickGuy

Commits

rG5110ff08176f: [AArch64][CostModel] Fix cost for mul <2 x i64>

Summary

This was modeled to have a cost of 1, but since we do not have a MUL.2d this is scalarized into 2 instructions. I precommitted test/Analysis/CostModel/AArch64/mul.ll in rGa3b1fcbc0cf5.

The reason that regression test

Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

needs changing is documented in LoopVectorize.cpp:6855:

// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.
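
As a rough illustration of why the extractvalue estimate moves with the mul cost: the test's own comment splits the old VF 2 estimate of 5 into 3 for scalarizing the result plus 2 for executing the op. Assuming the fallback is computed as VF times the vector mul cost plus the scalarization overhead (an assumption inferred from the comment above and the test numbers, not copied from LLVM), raising the mul <2 x i64> cost from 1 to 8 gives 19. The function and parameter names below are hypothetical.

```cpp
#include <cassert>

// Hypothetical model of the loop vectorizer's fallback for an unknown opcode:
// VF copies of an instruction priced like 'mul', plus scalarization overhead.
// This reproduces the numbers in the test; it is not the LLVM implementation.
constexpr unsigned fallbackCost(unsigned VF, unsigned MulCost,
                                unsigned ScalarizationOverhead) {
  return VF * MulCost + ScalarizationOverhead;
}

// Before this change: mul <2 x i64> was modeled with cost 1.
static_assert(fallbackCost(2, 1, 3) == 5, "old estimate for VF 2");
// After this change: mul <2 x i64> is modeled with cost 8.
static_assert(fallbackCost(2, 8, 3) == 19, "new estimate for VF 2");
```

Under this model the extractvalue itself did not become more expensive; only the stand-in mul cost did, which is why the test expectation changes.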

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Nov 27 2020, 12:12 AM

Herald added a project: Restricted Project. Nov 27 2020, 12:12 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls.

SjoerdMeijer requested review of this revision. Nov 27 2020, 12:12 AM

dmgreen added inline comments. Nov 27 2020, 5:05 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 648: Formatting I think is usually done differently here?
Line 652: I think this can just use LT.second == MVT::v2i64? (Otherwise it needs to account for scalable vectors, but dealing with the MVT is probably simpler).
Line 654: Hmm. According to this it should have a cost around 8: https://godbolt.org/z/fjjEc7

LT.first is the cost factor to get it to the MVT::v2i64 type. getScalarizationOverhead could be used to get that overhead. What do you think of something like LT.first * (2 + 2*getScalarizationOverhead(extract) + getScalarizationOverhead(insert))? I'm not sure what cost that would give.

SjoerdMeijer added inline comments. Nov 27 2020, 5:16 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 654:
> Hmm. According to this it should have a cost around 8:
> https://godbolt.org/z/fjjEc7

I excluded the movs. In that link/example, the last two movs are for returning the vector, and the first two shuffle the arguments into place. Thus, the instruction cost I think is: 1 instruction for the lane extract, and 1 for the scalar mul. Thus, for a <2 x i64> we would get 1 * 2 + 2 = 4, which is what I was trying to model here. What do you think?

SjoerdMeijer added inline comments. Nov 27 2020, 5:21 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 654: I meant this is what we do for one lane:
> Thus, the instruction cost I think are: 1 instruction for the lane extract, and 1 for scalar mul

so this * 2 for both lanes.

dmgreen added inline comments. Nov 27 2020, 6:20 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 654: It would still have to get the vector over to integer registers for both inputs and put it back after. For vectors it would make sense to assume the values will be in vector regs (which in this case means two cross register bank copies). Something like this is acting the same: https://godbolt.org/z/M9h73n.

The cost should probably be high, as far as I understand. It's often worse to vectorize then scalarize, as opposed to just keeping the original scalar code. And 8 would be OK for the number of instructions. Even if they are MOVs, cross-register bank copies are often expensive. The extractvalue cost using mul is a little unfortunate, but that should probably be fixed separately if needed.

There is also smull and umull which can handle 2 x i64 muls, but it looks like isWideningInstruction does not handle them properly yet (and is always 0 by the look of it). Again they can be fixed by detecting the extends if needed.

SjoerdMeijer added inline comments. Nov 27 2020, 6:34 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 654:
> It would still have to get the vector over to integer registers for both inputs and put it back after. For vectors it would make sense to assume the values will be in vector regs (which in this case means two cross register bank copies). Something like this is acting the same: https://godbolt.org/z/M9h73n.

Ok, agreed. I was playing a bit more with examples too: https://godbolt.org/z/z593Ws. If the muls are chained, there's less overhead so the cost is variable, but agreed in general that the cost is high and adding extra costs for the movs is more accurate.

> The cost should probably be high, as far as I understand. It's often worse to vectorize then scalarize, as opposed to just keeping the original scalar code. And 8 would be OK for the number of instructions. Even if they are MOVs, cross-register bank copies are often expensive. The extractvalue cost using mul is a little unfortunate, but that should probably be fixed separately if needed.

Yep, that was my plan.

> There is also smull and umull which can handle 2 x i64 muls, but it looks like isWideningInstruction does not handle them properly yet (and is always 0 by the look of it). Again they can be fixed by detecting the extends if needed.

Will look in a follow up.
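
The two instruction counts discussed in this thread can be written out as a small sketch. The struct and function names are hypothetical; the counts come from the comments above. The first estimate charges one lane extract and one scalar mul per lane, while the revised estimate also charges the extracts for the second operand and the inserts that rebuild the result vector, which matches the four extracts, two inserts, and two muls listed in the committed code comment.

```cpp
#include <cassert>

// Hypothetical breakdown of a scalarized vector multiply, per the review thread.
struct ScalarizedMulCount {
  unsigned Extracts; // vector lane -> integer register moves
  unsigned Muls;     // scalar multiplies
  unsigned Inserts;  // integer register -> vector lane moves
  constexpr unsigned total() const { return Extracts + Muls + Inserts; }
};

// First estimate: one extract and one mul per lane, other movs excluded.
constexpr ScalarizedMulCount firstEstimate(unsigned Lanes) {
  return {Lanes, Lanes, 0};
}

// Revised estimate: both operands are extracted for every lane, and every
// result lane is inserted back into a vector register.
constexpr ScalarizedMulCount revisedEstimate(unsigned Lanes) {
  return {2 * Lanes, Lanes, Lanes};
}

static_assert(firstEstimate(2).total() == 4, "1 * 2 + 2 = 4");
static_assert(revisedEstimate(2).total() == 8, "4 extracts + 2 muls + 2 inserts");
```

This only counts instructions; as noted in the thread, cross-register bank copies may cost more than one unit each on real hardware, and chained muls may need fewer copies.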

Thanks for the comments Dave, comments addressed.
This now shows changes in test/Transforms/SLPVectorizer/AArch64/mul.ll, which I had precommitted, and is actually the goal of this exercise.

SjoerdMeijer added inline comments. Nov 30 2020, 1:19 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 652: I have kept the check as it was because it checks for FixedVectorType, so looks correct to me, and with checking for MVT::v2i64 we would miss out on MVT::v4i64, etc.

dmgreen added inline comments. Nov 30 2020, 2:20 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 652: getTypeLegalizationCost will return two things: the cost factor needed to convert to a legal type and the legal type it would convert it to. So for a v4i64 this would be a cost of 2 and legal type of v2i64. I.e. I think this cost can just be `if (LT.second == MVT::v2i64)`, and that should capture all the cases we are interested in. (And something odd like a v2i40, which would be legalized to a v2i64, would also get the high cost, as far as I understand).
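
A minimal sketch of the check described here, under simplifying assumptions: only fixed vectors of 64-bit elements with a power-of-two element count of at least 2 are modeled, and the pair returned by getTypeLegalizationCost is reduced to a split factor plus a flag for "the legal type is v2i64". The names LegalizationResult, legalizeI64Vector, and mulCost are hypothetical and are not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for the pair returned by getTypeLegalizationCost.
struct LegalizationResult {
  unsigned First;     // cost factor needed to reach the legal type
  bool SecondIsV2I64; // whether the legal type is v2i64
};

// Models only <N x i64> with N a power of two and N >= 2: the vector is
// split into N / 2 parts, each of which is a 128-bit v2i64.
constexpr LegalizationResult legalizeI64Vector(unsigned NumElts) {
  return {NumElts / 2, true};
}

// Cost of a vector mul in this model: a v2i64 legal type pays the scalarized
// cost of 8 per legal part; anything else is charged 1 per part here
// (an assumption standing in for the generic path).
constexpr unsigned mulCost(LegalizationResult LT) {
  return LT.SecondIsV2I64 ? LT.First * 8 : LT.First * 1;
}

static_assert(mulCost(legalizeI64Vector(2)) == 8, "v2i64");
static_assert(mulCost(legalizeI64Vector(4)) == 16, "v4i64 splits into two v2i64");
```

Checking the legal type rather than the IR type is what lets a single comparison cover v2i64, v4i64, and wider vectors: the split factor carries the multiplicity.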

Thanks, that's a nice simplification.

Thanks. LGTM

This revision is now accepted and ready to land. Nov 30 2020, 3:09 AM

Closed by commit rG5110ff08176f: [AArch64][CostModel] Fix cost for mul <2 x i64> (authored by SjoerdMeijer). Nov 30 2020, 3:37 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG5110ff08176f: [AArch64][CostModel] Fix cost for mul <2 x i64>.

SjoerdMeijer mentioned this in D92317: [LV] ExtractValue instruction costs. Nov 30 2020, 6:12 AM

SjoerdMeijer mentioned this in rGf44ba251354f: ExtractValue instruction costs. Dec 1 2020, 2:43 AM

dmgreen mentioned this in D123007: [AArch64] Increase cost of v2i64 multiplies. Apr 3 2022, 1:56 PM

dmgreen mentioned this in rG750bf3582a6d: [AArch64] Increase cost of v2i64 multiplies. Apr 4 2022, 9:42 AM

Revision Contents

Path    Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp    14 lines
llvm/test/Analysis/CostModel/AArch64/mul.ll    4 lines
llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll    4 lines
llvm/test/Transforms/SLPVectorizer/AArch64/mul.ll    46 lines

Diff 308310

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ ... @@ if (Ty->isVectorTy()) {
         Opd1Info, Opd2Info, Opd1PropInfo,
         Opd2PropInfo);
       // TODO: if one of the arguments is scalar, then it's not necessary to
       // double the cost of handling the vector elements.
       Cost += Cost;
     }
     return Cost;

-  case ISD::ADD:
   case ISD::MUL:
+    if (LT.second != MVT::v2i64)
+      return (Cost + 1) * LT.first;
+    // Since we do not have a MUL.2d instruction, a mul <2 x i64> is expensive
+    // as elements are extracted from the vectors and the muls scalarized.
+    // As getScalarizationOverhead is a bit too pessimistic, we estimate the
+    // cost for a i64 vector directly here, which is:
+    // - four i64 extracts,
+    // - two i64 inserts, and
+    // - two muls.
+    // So, for a v2i64 with LT.First = 1 the cost is 8, and for a v4i64 with
+    // LT.first = 2 the cost is 16.
+    return LT.first * 8;
+  case ISD::ADD:
   case ISD::XOR:
   case ISD::OR:
   case ISD::AND:
     // These nodes are marked as 'custom' for combining purposes only.
     // We know that they are legal. See LowerAdd in ISelLowering.
     return (Cost + 1) * LT.first;

   case ISD::FADD:

llvm/test/Analysis/CostModel/AArch64/mul.ll

@@ ... @@
 ; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %1
 ;
   %1 = mul <8 x i32> %a, %b
   ret <8 x i32> %1
 }

 define <2 x i64> @t13(<2 x i64> %a, <2 x i64> %b) {
 ; THROUGHPUT-LABEL: 't13'
-; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %1 = mul nsw <2 x i64> %a, %b
+; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %1 = mul nsw <2 x i64> %a, %b
 ; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %1
 ;
   %1 = mul nsw <2 x i64> %a, %b
   ret <2 x i64> %1
 }

 define <4 x i64> @t14(<4 x i64> %a, <4 x i64> %b) {
 ; THROUGHPUT-LABEL: 't14'
-; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %1 = mul nsw <4 x i64> %a, %b
+; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 16 for instruction: %1 = mul nsw <4 x i64> %a, %b
 ; THROUGHPUT-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %1
 ;
   %1 = mul nsw <4 x i64> %a, %b
   ret <4 x i64> %1
 }

 define <2 x float> @t15(<2 x float> %a, <2 x float> %b) {
 ; THROUGHPUT-LABEL: 't15'

llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

 ; REQUIRES: asserts

 ; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios %s -S -debug -disable-output 2>&1 | FileCheck --check-prefix=CM %s
 ; RUN: opt -loop-vectorize -force-vector-width=2 -force-vector-interleave=1 %s -S | FileCheck --check-prefix=FORCED %s

 ; Test case from PR41294.

 ; Check scalar cost for extractvalue. The constant and loop invariant operands are free,
 ; leaving cost 3 for scalarizing the result + 2 for executing the op with VF 2.

 ; CM: LV: Scalar loop costs: 7.
-; CM: LV: Found an estimated cost of 5 for VF 2 For instruction: %a = extractvalue { i64, i64 } %sv, 0
-; CM-NEXT: LV: Found an estimated cost of 5 for VF 2 For instruction: %b = extractvalue { i64, i64 } %sv, 1
+; CM: LV: Found an estimated cost of 19 for VF 2 For instruction: %a = extractvalue { i64, i64 } %sv, 0
+; CM-NEXT: LV: Found an estimated cost of 19 for VF 2 For instruction: %b = extractvalue { i64, i64 } %sv, 1

 ; Check that the extractvalue operands are actually free in vector code.

 ; FORCED-LABEL: vector.body: ; preds = %vector.body, %vector.ph
 ; FORCED-NEXT: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
 ; FORCED-NEXT: %0 = add i32 %index, 0
 ; FORCED-NEXT: %1 = extractvalue { i64, i64 } %sv, 0
 ; FORCED-NEXT: %2 = extractvalue { i64, i64 } %sv, 0

llvm/test/Transforms/SLPVectorizer/AArch64/mul.ll

@@ ... @@
 ; mov x11, v1.d[1]
 ; mul x8, x9, x8
 ; mul x9, x11, x10
 ; fmov d0, x8
 ; mov v0.d[1], x9
 ; str q0, [x0]
 ; ret
 ;
-; but if we don't SLP vectorise these examples we get this which is smaller
-; and faster:
+; If we don't SLP vectorise but scalarize this we get this instead:
 ;
 ; ldp x8, x9, [x1]
 ; ldp x10, x11, [x0]
 ; mul x9, x11, x9
 ; mul x8, x10, x8
 ; stp x8, x9, [x0]
 ; ret
 ;
-; FIXME: don't SLP vectorise this.

 define void @mul(i64* noalias nocapture %a, i64* noalias nocapture readonly %b) {
 ; CHECK-LABEL: @mul(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[B:%.*]], i64 1
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[B]] to <2 x i64>*
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8
-; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i64, i64* [[A:%.*]], i64 1
-; CHECK-NEXT: [[TMP2:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 8
-; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <2 x i64> [[TMP3]], [[TMP1]]
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT: store <2 x i64> [[TMP4]], <2 x i64>* [[TMP5]], align 8
+; CHECK-NEXT: [[TMP0:%.*]] = load i64, i64* [[B:%.*]], align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load i64, i64* [[A:%.*]], align 8
+; CHECK-NEXT: [[MUL:%.*]] = mul nsw i64 [[TMP1]], [[TMP0]]
+; CHECK-NEXT: store i64 [[MUL]], i64* [[A]], align 8
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[B]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = load i64, i64* [[ARRAYIDX2]], align 8
+; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i64, i64* [[A]], i64 1
+; CHECK-NEXT: [[TMP3:%.*]] = load i64, i64* [[ARRAYIDX3]], align 8
+; CHECK-NEXT: [[MUL4:%.*]] = mul nsw i64 [[TMP3]], [[TMP2]]
+; CHECK-NEXT: store i64 [[MUL4]], i64* [[ARRAYIDX3]], align 8
 ; CHECK-NEXT: ret void
 ;
 entry:
   %0 = load i64, i64* %b, align 8
   %1 = load i64, i64* %a, align 8
   %mul = mul nsw i64 %1, %0
   store i64 %mul, i64* %a, align 8
   %arrayidx2 = getelementptr inbounds i64, i64* %b, i64 1
@@ ... @@
 ; a[1] *= b[1];
 ; a[0] += b[0];
 ; a[1] += b[1];
 ; }
 ;
 define void @mac(i64* noalias nocapture %a, i64* noalias nocapture readonly %b) {
 ; CHECK-LABEL: @mac(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[B:%.*]], i64 1
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast i64* [[B]] to <2 x i64>*
-; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i64>, <2 x i64>* [[TMP0]], align 8
-; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i64, i64* [[A:%.*]], i64 1
-; CHECK-NEXT: [[TMP2:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 8
-; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <2 x i64> [[TMP3]], [[TMP1]]
-; CHECK-NEXT: [[TMP5:%.*]] = add nsw <2 x i64> [[TMP4]], [[TMP1]]
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT: store <2 x i64> [[TMP5]], <2 x i64>* [[TMP6]], align 8
+; CHECK-NEXT: [[TMP0:%.*]] = load i64, i64* [[B:%.*]], align 8
+; CHECK-NEXT: [[TMP1:%.*]] = load i64, i64* [[A:%.*]], align 8
+; CHECK-NEXT: [[MUL:%.*]] = mul nsw i64 [[TMP1]], [[TMP0]]
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i64, i64* [[B]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = load i64, i64* [[ARRAYIDX2]], align 8
+; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i64, i64* [[A]], i64 1
+; CHECK-NEXT: [[TMP3:%.*]] = load i64, i64* [[ARRAYIDX3]], align 8
+; CHECK-NEXT: [[MUL4:%.*]] = mul nsw i64 [[TMP3]], [[TMP2]]
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i64 [[MUL]], [[TMP0]]
+; CHECK-NEXT: store i64 [[ADD]], i64* [[A]], align 8
+; CHECK-NEXT: [[ADD9:%.*]] = add nsw i64 [[MUL4]], [[TMP2]]
+; CHECK-NEXT: store i64 [[ADD9]], i64* [[ARRAYIDX3]], align 8
 ; CHECK-NEXT: ret void
 ;
 entry:
   %0 = load i64, i64* %b, align 8
   %1 = load i64, i64* %a, align 8
   %mul = mul nsw i64 %1, %0
   %arrayidx2 = getelementptr inbounds i64, i64* %b, i64 1
   %2 = load i64, i64* %arrayidx2, align 8
