This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix cost model for FADD vector reduction
AcceptedPublic

Authored by virnarula on Aug 2 2022, 3:55 PM.

Details

Summary

Fix the cost model for FADD vector reduction. Cost was being over-estimated from BaseT::getArithmeticReductionCost function. Add special case to AArch64TTIImpl::getArithmeticReductionCost. Reflects a lowering where vector through element-wise vector adds until it is less than or equal to the size of a vector register. Then it is reduced with a pairwise add.

Correction also enables a more optimal lowering of dot product as shown through tests. Originally, the cost model was erroneously preventing this special lowering.

Diff Detail

Event Timeline

virnarula created this revision.Aug 2 2022, 3:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 3:55 PM
virnarula requested review of this revision.Aug 2 2022, 3:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2022, 3:55 PM
virnarula edited the summary of this revision. (Show Details)Aug 2 2022, 4:01 PM
virnarula added a reviewer: fhahn.
virnarula updated this revision to Diff 449470.Aug 2 2022, 4:06 PM

Fix test files

Thanks for the patch! The updated costs look like a great improvement.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2591

Those don't appear in the current version on main. Is it possible that the patch has been applied on top of some other local changes?

2617

the formatting seems a bit off here, could you make sure it's formatted with clang-format-diff?

2623

nit: std::pow? It is Laos common to just use unsigned instead of unsigned int.

2628

nit: reflow comment and lower case w. Maybe say we will use element-wise vector adds to reduce the elements.

2636

nit: Start with uppercase I. Maybe say something like Once the remaining elements fit into a single vector register they will be reduced using pairwise adds (faddp).

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
48

I might have missed it, but could you make sure there's a test with fp128 as element type?

virnarula updated this revision to Diff 449720.Aug 3 2022, 11:06 AM
virnarula marked 3 inline comments as done.

Fix naming and comments. Add additional test cases.

virnarula updated this revision to Diff 449735.Aug 3 2022, 12:12 PM
virnarula marked an inline comment as done.

Remove dot product test changes (will be committed alongside dot product changes).

virnarula updated this revision to Diff 449756.Aug 3 2022, 1:06 PM

Fix formatting

fhahn added inline comments.Aug 3 2022, 1:18 PM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2624

It would probably be slightly cleaner to use getRegisterBitWidth instead of hardcoding the register size.

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
2

nit: why change this?

12

nit: It would be good to have those committed separately, and then have their cost unchanged in this diff.

72

Does the efficient lowering rely on -0.0 to be the initial value? With any other start value, we would need at least one additional add, right? Not sure if it is worth to include this in the cost, but we might as well.

I think the new costs are a lot better, but I worry about the implications of setting the fadd reduction costs. The SLP vectorizer will be the main user and whilst they might be correct in isolation, I worry about the comparative cost vs scalar fma. D125987 for example was trying to fix the same thing.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2618

HalfTy requires FullFP16, otherwise it gets unrolled. For example: https://godbolt.org/z/57WEb99q3

fhahn added a comment.Aug 4 2022, 6:36 AM

I think the new costs are a lot better, but I worry about the implications of setting the fadd reduction costs. The SLP vectorizer will be the main user and whilst they might be correct in isolation, I worry about the comparative cost vs scalar fma. D125987 for example was trying to fix the same thing.

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

virnarula marked 4 inline comments as done.Aug 4 2022, 10:18 AM
virnarula added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2591

Yes that's what it seems like.

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
2

undoing

72

So in the getReductionCost function, we don't seem to have access to any of the arguments which makes this difficult. In the test cases, I was thinking we might just want to make all the 0.0 and -0.0 into undefs to avoid confusion for anyone reading the test cases. what do you think?

virnarula updated this revision to Diff 450056.Aug 4 2022, 10:38 AM
virnarula marked an inline comment as done.

Fix order of commits for tests

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

Yeah - I worry that this might come up quite a lot. Adding floats together is pretty common, and multiplying them beforehand seems just as prevalent. I have this example, although it's maybe a little odd due to the extra shuffling in the loop: https://godbolt.org/z/3oqT1b58f.

virnarula updated this revision to Diff 450319.Aug 5 2022, 10:18 AM

Add fp16 attribute test cases

fhahn added a comment.Aug 5 2022, 2:14 PM

Yeah this is an unfortunate potential impact on the SLP vectorizer :(

I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer add reduction and mla IIUC. Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.

It might also increase the incentives to properly addressing the issue :)

The motivating use case for those improvements is using more accurate costs in other passes, like D131125

Yeah - I worry that this might come up quite a lot. Adding floats together is pretty common, and multiplying them beforehand seems just as prevalent. I have this example, although it's maybe a little odd due to the extra shuffling in the loop: https://godbolt.org/z/3oqT1b58f.

Thanks for sharing the example! In this particular example with the patch we will use a vector fmul feeding an fadd reduction, but on a first glance this doesn't seem worse and maybe even slightly better overall. Here's the diff between the example with and without the patch (generated by diff base.s patch.s)

diff  a.s b.s
17,21c17,19
< 	ldp	s0, s1, [x10]
< 	ldp	s2, s3, [x10, #8]
< 	ldr	s4, [x10, #16]
< 	ldp	s6, s18, [x11]
< 	ldp	s5, s17, [x11, #8]
---
> 	ldr	s1, [x10]
> 	ldur	q2, [x10, #4]
> 	ldr	q5, [x11]
25c23
< 	fmov	s7, s6
---
> 	mov.16b	v0, v5
27,34c25,32
< 	ldr	s6, [x1, x13]
< 	fmov	s16, s5
< 	fmul	s5, s7, s1
< 	fmadd	s5, s18, s2, s5
< 	fmadd	s5, s16, s3, s5
< 	fmadd	s5, s17, s4, s5
< 	fmadd	s5, s6, s0, s5
< 	str	s5, [x2, x13]
---
> 	ldr	s4, [x1, x13]
> 	fmul.4s	v3, v5, v2
> 	faddp.4s	v3, v3, v3
> 	faddp.2s	s3, v3
> 	fmadd	s3, s4, s1, s3
> 	str	s3, [x2, x13]
> 	trn1.4s	v5, v4, v5
> 	mov.s	v5[2], v3[0]
36,37d33
< 	fmov	s17, s16
< 	fmov	s18, s7
43,46c39,41
< 	str	s6, [x11]
< 	str	s7, [x11, #4]
< 	str	s5, [x11, #8]
< 	str	s16, [x11, #12]
---
> 	stp	s4, s0, [x11]
> 	add	x12, x11, #12
> 	str	s3, [x11, #8]
47a43
> 	st1.s	{ v0 }[2], [x12]
virnarula marked an inline comment as done.Aug 5 2022, 2:57 PM
virnarula updated this revision to Diff 450423.Aug 5 2022, 2:58 PM

Style changes and check for fp16 being enabled

virnarula updated this revision to Diff 450965.Aug 8 2022, 2:56 PM

Use clang-format

fhahn accepted this revision.Aug 15 2022, 6:27 AM

I ran some perf sanity checks including SPEC2006 & SPEC2017 and a few internal test suites over the weekend and the change looks neutral, with all changes being in the noise level for my test system.

This patch LGTM, as it more accurately reflects the actual cost of fadd vector reductions on AArch64.

@dmgreen It would be great if you could let us know if you still have concerns about the case you shared!

This revision is now accepted and ready to land.Aug 15 2022, 6:27 AM

It was worse on every CPU I tried it on. I did take some time last week looking at if we could adjust the cost of fma, but it looked like it had issues in itself and I wasn't even sure it would fix the problems with SLP vectorization.

From the results I have, it looks like this patch causes more problems than it solves, and the stated reason for doing it (Matrix dotproduct lowering) seems niche compared to the amount of SLP vectorization this will enable. If there are known issues in the SLP vectorizer, in my opinion it would make sense to address those first. Maybe we won't get anywhere and we will end up having to enable this as-is, but it seems prudent to try.

fhahn added a comment.Aug 15 2022, 7:58 AM

It was worse on every CPU I tried it on. I did take some time last week looking at if we could adjust the cost of fma, but it looked like it had issues in itself and I wasn't even sure it would fix the problems with SLP vectorization.

By it, do you mean the example you shared?

Out of the results I have seen - there are a number that are a little better that have 4x manually unrolled float summations. They look OK, they seem to improve by 5-10%.

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

fhahn added a comment.Aug 16 2022, 1:21 AM

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

fhahn added a comment.Aug 16 2022, 1:23 AM

If you are able to share any of those benchmarks that regress in build- & run-able form, I could also verify that assumption.

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

I don't think this is about micro-architecture differences, just different benchmarks. I happen to have a lot of tests that have the awkward SLP case and 4x unrolled loops, so show where this can perform worse. The same thing doesn't happen to come up in Spec or the other benchmarks you ran, so you didn't see any differences. The manually unrolled loops I think are less important in the grand scheme of things.

The problem isn't FADD vs FADDP. I agree that in isolation this is a decent improvement to the cost of fadd reduction. The problem is FMADD vs vector FMUL + FADDP's. Unless a micro-arch will be doing something very strange, back-to-back fmadd should be pretty efficient. There will not be a very large difference between 4xscalar fmadd and vector fmul+faddp+faddp, and yet the cost model will currently cost them as 8 vs 2+2 I think. With anything else going on (like the shuffles in the arm_biquad_cascade_df1_f32 case, the cost can easily be wrong enough to give worse performance.

But our current FP costs are all set to 2 (which isn't necessarily deliberate, just the default), and any modifications I've tried so far have only really made things look worse. I was attempting to mark the cost of a fadd that used a fmul as free, but it didn't help in the arm_biquad_cascade_df1_f32 case and seemed to cause other performance issues on its own. Perhaps it's best to move forward with this, accepting the regressions because it also gives improvements, and work on improving the other costs as we go forward.

llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
46

Can we split the fp16 into their own functions, so that we can still use the script to generate the check lines. We do the same in a few of the other costmodel tests.

If you are able to share any of those benchmarks that regress in build- & run-able form, I could also verify that assumption.

For the arm_biquad case, I think it can use any old data as input. The blockSize we usually test with 16, 64 and 256 to give a range of values. numStages shouldn't matter too much, it is the outer loop. Something like 4 should be fine so long as the function is called in a loop to get the total runtime up.

The example I shared was the most obviously worse, even if it is wrapped up in awkward SLP codegen. It is 20%-40% worse depending on the CPU. There are a few other cases that get worse that have the 4x manual unrolling, including a f64 matrix multiply and something called iir_lattice. As far as I can see all the example that get worse have multiplies into a reduction.

Ok, I checked the public A75 optimization guide and it looks like FMADD has a throughput of 2 while FADDP (Q form) only has a throughput of 1 and worse latency. I guess that would explain the issue or do you think the assembly diff is also worse assuming an implementation of FADDP that has the same latency/throughput as FMADD?

If the issue is the FADDP implementation on particular uarchs, then we should probably bump the FADDP cost on those uarchs.

I don't think this is about micro-architecture differences, just different benchmarks. I happen to have a lot of tests that have the awkward SLP case and 4x unrolled loops, so show where this can perform worse. The same thing doesn't happen to come up in Spec or the other benchmarks you ran, so you didn't see any differences. The manually unrolled loops I think are less important in the grand scheme of things.

The problem isn't FADD vs FADDP. I agree that in isolation this is a decent improvement to the cost of fadd reduction. The problem is FMADD vs vector FMUL + FADDP's. Unless a micro-arch will be doing something very strange, back-to-back fmadd should be pretty efficient. There will not be a very large difference between 4xscalar fmadd and vector fmul+faddp+faddp, and yet the cost model will currently cost them as 8 vs 2+2 I think. With anything else going on (like the shuffles in the arm_biquad_cascade_df1_f32 case, the cost can easily be wrong enough to give worse performance.

But our current FP costs are all set to 2 (which isn't necessarily deliberate, just the default), and any modifications I've tried so far have only really made things look worse. I was attempting to mark the cost of a fadd that used a fmul as free, but it didn't help in the arm_biquad_cascade_df1_f32 case and seemed to cause other performance issues on its own. Perhaps it's best to move forward with this, accepting the regressions because it also gives improvements, and work on improving the other costs as we go forward.

I spent some time investigating another issue with ignoring scalar FMA costs and put up a potential fix for the non-horizontal reduction case: D132872. Depending on how this shakes out, I'll see if the same thing can be applied to horizontal reductions before going forward with the cost-model change.