This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/AArch64/
-
Target/
-
AArch64/
8/8
AArch64TargetTransformInfo.cpp
-
test/Analysis/CostModel/AArch64/
-
Analysis/
-
CostModel/
-
AArch64/
3/7
reduce-fadd.ll

Differential D131028

[AArch64] Fix cost model for FADD vector reduction
AcceptedPublic

Authored by virnarula on Aug 2 2022, 3:55 PM.

Download Raw Diff

Details

Reviewers

fhahn
dmgreen
t.p.northover

Summary

Fix the cost model for FADD vector reduction. Cost was being over-estimated from BaseT::getArithmeticReductionCost function. Add special case to AArch64TTIImpl::getArithmeticReductionCost. Reflects a lowering where vector through element-wise vector adds until it is less than or equal to the size of a vector register. Then it is reduced with a pairwise add.

Correction also enables a more optimal lowering of dot product as shown through tests. Originally, the cost model was erroneously preventing this special lowering.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	60,040 ms	x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test
	60,030 ms	x64 debian > libFuzzer.libFuzzer::value-profile-load.test

Event Timeline

virnarula created this revision.Aug 2 2022, 3:55 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 3:55 PM

Herald added subscribers: hiraditya, kristof.beyls. · View Herald Transcript

virnarula requested review of this revision.Aug 2 2022, 3:55 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 3:55 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B178889: Diff 449467.Aug 2 2022, 3:55 PM

virnarula edited the summary of this revision. (Show Details)Aug 2 2022, 4:01 PM

virnarula added a reviewer: fhahn.

Fix test files

Harbormaster completed remote builds in B178891: Diff 449470.Aug 2 2022, 4:06 PM

Thanks for the patch! The updated costs look like a great improvement.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2712–2713	Those don't appear in the current version on `main`. Is it possible that the patch has been applied on top of some other local changes?
2738	the formatting seems a bit off here, could you make sure it's formatted with `clang-format-diff`?
2744	nit: `std::pow`? It is Laos common to just use `unsigned` instead of `unsigned int`.
2749	nit: reflow comment and lower case `w`. Maybe say `we will use element-wise vector adds to reduce the elements.`
2757	nit: Start with uppercase `I`. Maybe say something like `Once the remaining elements fit into a single vector register they will be reduced using pairwise adds (faddp).`
llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
55	I might have missed it, but could you make sure there's a test with `fp128` as element type?

Fix naming and comments. Add additional test cases.

Remove dot product test changes (will be committed alongside dot product changes).

Harbormaster completed remote builds in B179082: Diff 449735.Aug 3 2022, 12:56 PM

Fix formatting

fhahn added inline comments.Aug 3 2022, 1:18 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2746	It would probably be slightly cleaner to use `getRegisterBitWidth` instead of hardcoding the register size.
llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
0–2	nit: why change this?
5	nit: It would be good to have those committed separately, and then have their cost unchanged in this diff.
86	Does the efficient lowering rely on `-0.0` to be the initial value? With any other start value, we would need at least one additional add, right? Not sure if it is worth to include this in the cost, but we might as well.

Harbormaster completed remote builds in B179103: Diff 449756.Aug 3 2022, 2:26 PM

I think the new costs are a lot better, but I worry about the implications of setting the fadd reduction costs. The SLP vectorizer will be the main user and whilst they might be correct in isolation, I worry about the comparative cost vs scalar fma. D125987 for example was trying to fix the same thing.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2740	HalfTy requires FullFP16, otherwise it gets unrolled. For example: https://godbolt.org/z/57WEb99q3

fhahn added a child revision: D131125: [Matrix] Add special case dot product lowering.Aug 4 2022, 6:31 AM

In D131028#3698795, @dmgreen wrote:

I think the new costs are a lot better, but I worry about the implications of setting the fadd reduction costs. The SLP vectorizer will be the main user and whilst they might be correct in isolation, I worry about the comparative cost vs scalar fma. D125987 for example was trying to fix the same thing.

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

virnarula marked 4 inline comments as done.Aug 4 2022, 10:18 AM

virnarula added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2712–2713	Yes that's what it seems like.
llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
0–2	undoing
86	So in the getReductionCost function, we don't seem to have access to any of the arguments which makes this difficult. In the test cases, I was thinking we might just want to make all the 0.0 and -0.0 into undefs to avoid confusion for anyone reading the test cases. what do you think?

Fix order of commits for tests

Harbormaster completed remote builds in B179337: Diff 450056.Aug 4 2022, 12:25 PM

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

Yeah - I worry that this might come up quite a lot. Adding floats together is pretty common, and multiplying them beforehand seems just as prevalent. I have this example, although it's maybe a little odd due to the extra shuffling in the loop: https://godbolt.org/z/3oqT1b58f.

Add fp16 attribute test cases

Harbormaster completed remote builds in B179538: Diff 450319.Aug 5 2022, 10:19 AM

fhahn added a parent revision: D131269: [CostModel] Add bfloat and fp128 reduction tests.Aug 5 2022, 10:45 AM

In D131028#3702132, @dmgreen wrote:

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

Yeah - I worry that this might come up quite a lot. Adding floats together is pretty common, and multiplying them beforehand seems just as prevalent. I have this example, although it's maybe a little odd due to the extra shuffling in the loop: https://godbolt.org/z/3oqT1b58f.

Thanks for sharing the example! In this particular example with the patch we will use a vector fmul feeding an fadd reduction, but on a first glance this doesn't seem worse and maybe even slightly better overall. Here's the diff between the example with and without the patch (generated by diff base.s patch.s)

diff  a.s b.s
17,21c17,19
< 	ldp	s0, s1, [x10]
< 	ldp	s2, s3, [x10, #8]
< 	ldr	s4, [x10, #16]
< 	ldp	s6, s18, [x11]
< 	ldp	s5, s17, [x11, #8]
---
> 	ldr	s1, [x10]
> 	ldur	q2, [x10, #4]
> 	ldr	q5, [x11]
25c23
< 	fmov	s7, s6
---
> 	mov.16b	v0, v5
27,34c25,32
< 	ldr	s6, [x1, x13]
< 	fmov	s16, s5
< 	fmul	s5, s7, s1
< 	fmadd	s5, s18, s2, s5
< 	fmadd	s5, s16, s3, s5
< 	fmadd	s5, s17, s4, s5
< 	fmadd	s5, s6, s0, s5
< 	str	s5, [x2, x13]
---
> 	ldr	s4, [x1, x13]
> 	fmul.4s	v3, v5, v2
> 	faddp.4s	v3, v3, v3
> 	faddp.2s	s3, v3
> 	fmadd	s3, s4, s1, s3
> 	str	s3, [x2, x13]
> 	trn1.4s	v5, v4, v5
> 	mov.s	v5[2], v3[0]
36,37d33
< 	fmov	s17, s16
< 	fmov	s18, s7
43,46c39,41
< 	str	s6, [x11]
< 	str	s7, [x11, #4]
< 	str	s5, [x11, #8]
< 	str	s16, [x11, #12]
---
> 	stp	s4, s0, [x11]
> 	add	x12, x11, #12
> 	str	s3, [x11, #8]
47a43
> 	st1.s	{ v0 }[2], [x12]

virnarula marked an inline comment as done.Aug 5 2022, 2:57 PM

Style changes and check for fp16 being enabled

Harbormaster completed remote builds in B179620: Diff 450423.Aug 5 2022, 4:28 PM

Use clang-format

Harbormaster completed remote builds in B180037: Diff 450965.Aug 8 2022, 7:22 PM

I ran some perf sanity checks including SPEC2006 & SPEC2017 and a few internal test suites over the weekend and the change looks neutral, with all changes being in the noise level for my test system.

This patch LGTM, as it more accurately reflects the actual cost of fadd vector reductions on AArch64.

@dmgreen It would be great if you could let us know if you still have concerns about the case you shared!

This revision is now accepted and ready to land.Aug 15 2022, 6:27 AM

It was worse on every CPU I tried it on. I did take some time last week looking at if we could adjust the cost of fma, but it looked like it had issues in itself and I wasn't even sure it would fix the problems with SLP vectorization.

From the results I have, it looks like this patch causes more problems than it solves, and the stated reason for doing it (Matrix dotproduct lowering) seems niche compared to the amount of SLP vectorization this will enable. If there are known issues in the SLP vectorizer, in my opinion it would make sense to address those first. Maybe we won't get anywhere and we will end up having to enable this as-is, but it seems prudent to try.

In D131028#3723097, @dmgreen wrote:

It was worse on every CPU I tried it on. I did take some time last week looking at if we could adjust the cost of fma, but it looked like it had issues in itself and I wasn't even sure it would fix the problems with SLP vectorization.

By it, do you mean the example you shared?

Out of the results I have seen - there are a number that are a little better that have 4x manually unrolled float summations. They look OK, they seem to improve by 5-10%.

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

In D131028#3725222, @dmgreen wrote:

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

If you are able to share any of those benchmarks that regress in build- & run-able form, I could also verify that assumption.

In D131028#3725287, @fhahn wrote:

In D131028#3725222, @dmgreen wrote:

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

I don't think this is about micro-architecture differences, just different benchmarks. I happen to have a lot of tests that have the awkward SLP case and 4x unrolled loops, so show where this can perform worse. The same thing doesn't happen to come up in Spec or the other benchmarks you ran, so you didn't see any differences. The manually unrolled loops I think are less important in the grand scheme of things.

The problem isn't FADD vs FADDP. I agree that in isolation this is a decent improvement to the cost of fadd reduction. The problem is FMADD vs vector FMUL + FADDP's. Unless a micro-arch will be doing something very strange, back-to-back fmadd should be pretty efficient. There will not be a very large difference between 4xscalar fmadd and vector fmul+faddp+faddp, and yet the cost model will currently cost them as 8 vs 2+2 I think. With anything else going on (like the shuffles in the arm_biquad_cascade_df1_f32 case, the cost can easily be wrong enough to give worse performance.

But our current FP costs are all set to 2 (which isn't necessarily deliberate, just the default), and any modifications I've tried so far have only really made things look worse. I was attempting to mark the cost of a fadd that used a fmul as free, but it didn't help in the arm_biquad_cascade_df1_f32 case and seemed to cause other performance issues on its own. Perhaps it's best to move forward with this, accepting the regressions because it also gives improvements, and work on improving the other costs as we go forward.

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
31	Can we split the fp16 into their own functions, so that we can still use the script to generate the check lines. We do the same in a few of the other costmodel tests.

If you are able to share any of those benchmarks that regress in build- & run-able form, I could also verify that assumption.

For the arm_biquad case, I think it can use any old data as input. The blockSize we usually test with 16, 64 and 256 to give a range of values. numStages shouldn't matter too much, it is the outer loop. Something like 4 should be fine so long as the function is called in a loop to get the total runtime up.

fhahn mentioned this in D132872: [SLP] Account for loss due to FMAs when estimating fmul costs..Aug 29 2022, 11:01 AM

In D131028#3728441, @dmgreen wrote:

In D131028#3725287, @fhahn wrote:

In D131028#3725222, @dmgreen wrote:

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

I don't think this is about micro-architecture differences, just different benchmarks. I happen to have a lot of tests that have the awkward SLP case and 4x unrolled loops, so show where this can perform worse. The same thing doesn't happen to come up in Spec or the other benchmarks you ran, so you didn't see any differences. The manually unrolled loops I think are less important in the grand scheme of things.

The problem isn't FADD vs FADDP. I agree that in isolation this is a decent improvement to the cost of fadd reduction. The problem is FMADD vs vector FMUL + FADDP's. Unless a micro-arch will be doing something very strange, back-to-back fmadd should be pretty efficient. There will not be a very large difference between 4xscalar fmadd and vector fmul+faddp+faddp, and yet the cost model will currently cost them as 8 vs 2+2 I think. With anything else going on (like the shuffles in the arm_biquad_cascade_df1_f32 case, the cost can easily be wrong enough to give worse performance.

But our current FP costs are all set to 2 (which isn't necessarily deliberate, just the default), and any modifications I've tried so far have only really made things look worse. I was attempting to mark the cost of a fadd that used a fmul as free, but it didn't help in the arm_biquad_cascade_df1_f32 case and seemed to cause other performance issues on its own. Perhaps it's best to move forward with this, accepting the regressions because it also gives improvements, and work on improving the other costs as we go forward.

I spent some time investigating another issue with ignoring scalar FMA costs and put up a potential fix for the non-horizontal reduction case: D132872. Depending on how this shakes out, I'll see if the same thing can be applied to horizontal reductions before going forward with the cost-model change.

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64TargetTransformInfo.cpp

28 lines

test/

Analysis/

CostModel/

AArch64/

reduce-fadd.ll

77 lines

Diff 450965

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

//===-- AArch64TargetTransformInfo.cpp - AArch64 specific TTI -------------===//		//===-- AArch64TargetTransformInfo.cpp - AArch64 specific TTI -------------===//
		Lint: Lint Inline Actions clang-format not found in user’s local PATH; not linting file. Lint: Lint: clang-format not found in user’s local PATH; not linting file.
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

▲ Show 20 Lines • Show All 2,695 Lines • ▼ Show 20 Lines	AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
// XOR: llvm/test/CodeGen/AArch64/reduce-xor.ll		// XOR: llvm/test/CodeGen/AArch64/reduce-xor.ll
// AND: llvm/test/CodeGen/AArch64/reduce-and.ll		// AND: llvm/test/CodeGen/AArch64/reduce-and.ll
static const CostTblEntry CostTblNoPairwise[]{		static const CostTblEntry CostTblNoPairwise[]{
{ISD::ADD, MVT::v8i8, 2},		{ISD::ADD, MVT::v8i8, 2},
{ISD::ADD, MVT::v16i8, 2},		{ISD::ADD, MVT::v16i8, 2},
{ISD::ADD, MVT::v4i16, 2},		{ISD::ADD, MVT::v4i16, 2},
{ISD::ADD, MVT::v8i16, 2},		{ISD::ADD, MVT::v8i16, 2},
{ISD::ADD, MVT::v4i32, 2},		{ISD::ADD, MVT::v4i32, 2},
{ISD::ADD, MVT::v2i64, 2},		{ISD::ADD, MVT::v2i64, 2},
{ISD::OR, MVT::v8i8, 15},		{ISD::OR, MVT::v8i8, 15},
fhahnUnsubmitted Done Reply Inline Actions Those don't appear in the current version on `main`. Is it possible that the patch has been applied on top of some other local changes? fhahn: Those don't appear in the current version on `main`. Is it possible that the patch has been…
virnarulaAuthorUnsubmitted Done Reply Inline Actions Yes that's what it seems like. virnarula: Yes that's what it seems like.
{ISD::OR, MVT::v16i8, 17},		{ISD::OR, MVT::v16i8, 17},
{ISD::OR, MVT::v4i16, 7},		{ISD::OR, MVT::v4i16, 7},
{ISD::OR, MVT::v8i16, 9},		{ISD::OR, MVT::v8i16, 9},
{ISD::OR, MVT::v2i32, 3},		{ISD::OR, MVT::v2i32, 3},
{ISD::OR, MVT::v4i32, 5},		{ISD::OR, MVT::v4i32, 5},
{ISD::OR, MVT::v2i64, 3},		{ISD::OR, MVT::v2i64, 3},
{ISD::XOR, MVT::v8i8, 15},		{ISD::XOR, MVT::v8i8, 15},
{ISD::XOR, MVT::v16i8, 17},		{ISD::XOR, MVT::v16i8, 17},
{ISD::XOR, MVT::v4i16, 7},		{ISD::XOR, MVT::v4i16, 7},
{ISD::XOR, MVT::v8i16, 9},		{ISD::XOR, MVT::v8i16, 9},
{ISD::XOR, MVT::v2i32, 3},		{ISD::XOR, MVT::v2i32, 3},
{ISD::XOR, MVT::v4i32, 5},		{ISD::XOR, MVT::v4i32, 5},
{ISD::XOR, MVT::v2i64, 3},		{ISD::XOR, MVT::v2i64, 3},
{ISD::AND, MVT::v8i8, 15},		{ISD::AND, MVT::v8i8, 15},
{ISD::AND, MVT::v16i8, 17},		{ISD::AND, MVT::v16i8, 17},
{ISD::AND, MVT::v4i16, 7},		{ISD::AND, MVT::v4i16, 7},
{ISD::AND, MVT::v8i16, 9},		{ISD::AND, MVT::v8i16, 9},
{ISD::AND, MVT::v2i32, 3},		{ISD::AND, MVT::v2i32, 3},
{ISD::AND, MVT::v4i32, 5},		{ISD::AND, MVT::v4i32, 5},
{ISD::AND, MVT::v2i64, 3},		{ISD::AND, MVT::v2i64, 3},
};		};
switch (ISD) {		switch (ISD) {
default:		default:
break;		break;
		case ISD::FADD: {
		fhahnUnsubmitted Done Reply Inline Actions the formatting seems a bit off here, could you make sure it's formatted with `clang-format-diff`? fhahn: the formatting seems a bit off here, could you make sure it's formatted with `clang-format…
		Type *EltType = ValTy->getElementType();
		if (!EltType->isDoubleTy() && !EltType->isFloatTy() &&
		dmgreenUnsubmitted Done Reply Inline Actions HalfTy requires FullFP16, otherwise it gets unrolled. For example: https://godbolt.org/z/57WEb99q3 dmgreen: HalfTy requires FullFP16, otherwise it gets unrolled. For example: https://godbolt.
		!(EltType->isHalfTy() && ST->hasFullFP16()))
		break;

		unsigned NumVecElts = cast<FixedVectorType>(ValTy)->getNumElements();
		fhahnUnsubmitted Done Reply Inline Actions nit: `std::pow`? It is Laos common to just use `unsigned` instead of `unsigned int`. fhahn: nit: `std::pow`? It is Laos common to just use `unsigned` instead of `unsigned int`.
		unsigned RoundedNumVecElts = std::pow(2, Log2_32_Ceil(NumVecElts));
		unsigned BitsPerElement = EltType->getScalarSizeInBits();
		fhahnUnsubmitted Done Reply Inline Actions It would probably be slightly cleaner to use `getRegisterBitWidth` instead of hardcoding the register size. fhahn: It would probably be slightly cleaner to use `getRegisterBitWidth` instead of hardcoding the…
		auto VectorRegBitWidth =
		getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector);
		unsigned ElementsPerRegister = VectorRegBitWidth / BitsPerElement;
		fhahnUnsubmitted Done Reply Inline Actions nit: reflow comment and lower case `w`. Maybe say `we will use element-wise vector adds to reduce the elements.` fhahn: nit: reflow comment and lower case `w`. Maybe say `we will use element-wise vector adds to…

		// While there are more elements than fit into a single register,
		// we will use element-wise vector adds to reduce the elements.
		InstructionCost VectorFADD;
		unsigned ElementsLeft = RoundedNumVecElts;
		while (ElementsLeft > ElementsPerRegister) {
		VectorFADD += ElementsLeft / (ElementsPerRegister * 2);
		ElementsLeft /= 2;
		fhahnUnsubmitted Done Reply Inline Actions nit: Start with uppercase `I`. Maybe say something like `Once the remaining elements fit into a single vector register they will be reduced using pairwise adds (faddp).` fhahn: nit: Start with uppercase `I`. Maybe say something like `Once the remaining elements fit into a…
		}

		// Once the remaining elements fit into a single vector register,
		// they will be reduced using pairwise adds (faddp).
		InstructionCost FADDPCosts(Log2_32_Ceil(ElementsLeft));

		return FADDPCosts + VectorFADD;
		}
case ISD::ADD:		case ISD::ADD:
if (const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy))		if (const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy))
return (LT.first - 1) + Entry->Cost;		return (LT.first - 1) + Entry->Cost;
break;		break;
case ISD::XOR:		case ISD::XOR:
case ISD::AND:		case ISD::AND:
case ISD::OR:		case ISD::OR:
const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy);		const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy);
▲ Show 20 Lines • Show All 309 Lines • Show Last 20 Lines

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll

	; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py			; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s \| FileCheck --check-prefixes=COMMON,NOFP16 %s
	; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s \| FileCheck %s			; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+fullfp16 < %s \| FileCheck --check-prefixes=COMMON,FP16 %s
				fhahnUnsubmitted Done Reply Inline Actions nit: why change this? fhahn: nit: why change this?
				virnarulaAuthorUnsubmitted Not Done Reply Inline Actions undoing virnarula: undoing
	; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+fullfp16 < %s \| FileCheck %s

	define void @strict_fp_reductions() {			define void @strict_fp_reductions() {
	; CHECK-LABEL: 'strict_fp_reductions'			; COMMON-LABEL: 'strict_fp_reductions'
				fhahnUnsubmitted Not Done Reply Inline Actions nit: It would be good to have those committed separately, and then have their cost unchanged in this diff. fhahn: nit: It would be good to have those committed separately, and then have their cost unchanged in…
	; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 45 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 45 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 21 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 42 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; COMMON-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)			%fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
	%fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)			%fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)
	%fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)			%fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)
	%fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)			%fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)
	%fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)			%fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)
	%fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)			%fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)
	%fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4f8(bfloat 0.0, <4 x bfloat> undef)			%fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4f8(bfloat 0.0, <4 x bfloat> undef)
	%fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)			%fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)

	ret void			ret void
	}			}


	define void @fast_fp_reductions() {			define void @fast_fp_reductions() {
	; CHECK-LABEL: 'fast_fp_reductions'			; COMMON-LABEL: 'fast_fp_reductions'
	; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
				dmgreenUnsubmitted Not Done Reply Inline Actions Can we split the fp16 into their own functions, so that we can still use the script to generate the check lines. We do the same in a few of the other costmodel tests. dmgreen: Can we split the fp16 into their own functions, so that we can still use the script to generate…
	; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 46 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 46 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)			; NOFP16-NEXT: Cost Model: Found an estimated cost of 52 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 50 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)			; FP16-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 27 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)			; COMMON-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
	; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void			; COMMON-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
				; COMMON-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
	;			;
	%fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)			%fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
	%fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)			%fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
	fhahnUnsubmitted Done Reply Inline Actions I might have missed it, but could you make sure there's a test with `fp128` as element type? fhahn: I might have missed it, but could you make sure there's a test with `fp128` as element type?

	%fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)			%fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)
	%fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)			%fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)

	%fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0.0, <11 x half> undef)			%fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0.0, <11 x half> undef)
	%fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0.0, <13 x half> undef)			%fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0.0, <13 x half> undef)

	%fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)			%fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)
	%fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)			%fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> undef)

	%fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)			%fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)
	%fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)			%fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> undef)

	%fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.0, <13 x float> undef)			%fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.0, <13 x float> undef)
	%fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.0, <5 x float> undef)			%fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.0, <5 x float> undef)

	%fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)			%fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)
	%fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)			%fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.0, <2 x double> undef)

	%fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)			%fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)
	%fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)			%fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.0, <4 x double> undef)

	%fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.0, <7 x double> undef)			%fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.0, <7 x double> undef)
	%fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.0, <9 x double> undef)			%fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.0, <9 x double> undef)
				fhahnUnsubmitted Not Done Reply Inline Actions Does the efficient lowering rely on `-0.0` to be the initial value? With any other start value, we would need at least one additional add, right? Not sure if it is worth to include this in the cost, but we might as well. fhahn: Does the efficient lowering rely on `-0.0` to be the initial value? With any other start value…
				virnarulaAuthorUnsubmitted Done Reply Inline Actions So in the getReductionCost function, we don't seem to have access to any of the arguments which makes this difficult. In the test cases, I was thinking we might just want to make all the 0.0 and -0.0 into undefs to avoid confusion for anyone reading the test cases. what do you think? virnarula: So in the getReductionCost function, we don't seem to have access to any of the arguments which…

	%fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4f8(bfloat -0.0, <4 x bfloat> undef)			%fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4f8(bfloat -0.0, <4 x bfloat> undef)
	%fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)			%fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)

	ret void			ret void
	}			}

	declare bfloat @llvm.vector.reduce.fadd.v4f8(bfloat, <4 x bfloat>)			declare bfloat @llvm.vector.reduce.fadd.v4f8(bfloat, <4 x bfloat>)
	Show All 16 Lines