This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][TTI] Cost model FADD/FSUB/FNEG
ClosedPublic

Authored by SjoerdMeijer on Mar 14 2023, 4:51 AM.

Details

Summary

This lowers the cost of FADD, FSUB, and FNEG. The motivation is to avoid over-eager SLP vectorisation: the current costs make SLP vectorisation look profitable in cases where it results in significant slowdowns. Lowering the scalar FADD and FSUB costs helps the profitability decision favour the scalar version where vectorisation isn't beneficial. Performance results show a 7% improvement for Imagick from SPEC FP 2017, a small improvement in Blender, and unchanged results for the other SPEC apps. RAJAPerf is neutral (mostly no changes).

For a bit more context, this is related to over-eager vectorisation in https://github.com/llvm/llvm-project/issues/61047 but the motivating test case is slightly different.
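For readers without the diff open, a rough sketch of the kind of change this makes in AArch64TTIImpl::getArithmeticInstrCost, where LT is the result of getTypeLegalizationCost (illustrative only; the exact cases and conditions are in the diff):

  // Before: FP add/sub shared a doubled cost with FMUL/FDIV.
  case ISD::FADD:
  case ISD::FSUB:
  case ISD::FMUL:
  case ISD::FDIV:
    return LT.first * 2;

  // After: FADD/FSUB (and, later in the review, FNEG) return the base
  // legalization cost instead, so they are no longer twice as expensive.
  case ISD::FADD:
  case ISD::FSUB:
    return LT.first;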

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Mar 14 2023, 4:51 AM
SjoerdMeijer requested review of this revision.Mar 14 2023, 4:51 AM
Herald added a project: Restricted Project.Mar 14 2023, 4:51 AM

I have now benchmarked RAJAPerf too, which is a loop-based, FP-heavy benchmark suite. The results are neutral, mostly no changes.
So overall I am only seeing the uplift from this, with no regressions.

SjoerdMeijer added a subscriber: sushgokh.

Added context, fixed the test cases.

SjoerdMeijer retitled this revision from [AArch64][TTI] Cost model FADD/FSUB. WIP. to [AArch64][TTI] Cost model FADD/FSUB.Mar 16 2023, 8:15 AM
SjoerdMeijer edited the summary of this revision. (Show Details)

I think this makes a lot of sense. Especially if we are treating many shuffles as having a cost of 1, floating-point operations should not be twice the cost. Judging from the software optimization guides, we could consider doing the same for fmul, but the changes for fadd/fsub already have a high likelihood of causing some large changes. Adding fneg is worth it though, as that should be a simple operation.

This might lead to a fair number of changes. The main regression I found, for example, was a case of identical code being produced by the loop vectorizer, but the scalar and vector costs being closer led it to require a higher minimum trip count for entering the vector body (from D109368). The real problem there is aliasing preventing the LD4 loads from being recognized, though, so it's hard to pin the blame on this change. There are a number of improvements too, to make up for that regression.

@fhahn do you have any thoughts on the change from your side, or any benchmark results that could be helpful?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2414

It is probably worth adding ISD::FNEG, and we might need to be more careful about the types. FP128 shouldn't be cheap, for example. Possibly a higher cost with fp16/bf16 too, when they are not available?
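A sketch of what those guards could look like in the getArithmeticInstrCost switch (illustrative; the guards, costs, and exact subtarget queries here are my reading of the suggestion, not the committed code):

  case ISD::FNEG:
  case ISD::FADD:
  case ISD::FSUB:
    // fp16/bf16 without native support are legalized via promotion, so
    // charge them more than the base cost.
    if ((Ty->getScalarType()->isHalfTy() && !ST->hasFullFP16()) ||
        (Ty->getScalarType()->isBFloatTy() && !ST->hasBF16()))
      return 2 * LT.first;
    // FP128 is a libcall, so it should not get the cheap cost; let it
    // fall through to the more expensive handling below.
    if (!Ty->getScalarType()->isFP128Ty())
      return LT.first;
    [[fallthrough]];
  case ISD::FMUL:
  case ISD::FDIV:
    return LT.first * 2;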

llvm/test/Transforms/SLPVectorizer/AArch64/matmul.ll
7

This source file, when compiled, is not being SLP vectorized already, so I don't think things change in practice. When I compile it, it uses a number of llvm.fmuladd intrinsics, which already seem to have a cost of 1.

Thanks a lot for looking at this, Dave. I completely agree about giving other FP operations like fmul the same treatment. I forgot to mention why I chose to change only FADD/FSUB, but it's exactly for the reason you mentioned: I was cautious about causing too many changes at once and am hoping to limit the fallout by changing FADD/FSUB first before doing the others. So I am more than happy to add FNEG too if we think that won't increase the risk of regressions.

I will also add the extra checks for the more exotic FP types, as you suggested.

SjoerdMeijer retitled this revision from [AArch64][TTI] Cost model FADD/FSUB to [AArch64][TTI] Cost model FADD/FSUB/FNEG.
SjoerdMeijer edited the summary of this revision. (Show Details)

Added FNEG, and checks for BF16 and FP16.

Gentle ping. Any further opinions on this?

dmgreen accepted this revision.Apr 9 2023, 8:09 AM

I think this looks OK. It has quite a high chance of causing some issues somewhere, but the only two I have seen appear to be unrelated, seemingly caused by the costs of non-legal interleaving groups.

LGTM, but please keep an eye out for any reported issues. Thanks.

This revision is now accepted and ready to land.Apr 9 2023, 8:09 AM

Agreed, and will keep an eye on how this goes.

Thanks for your help with this.

This revision was landed with ongoing or failed builds.Apr 11 2023, 1:46 AM
This revision was automatically updated to reflect the committed changes.

FYI I got some warnings from this new code, using clang 14.0.5:

/home/david.spickett/llvm-project/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2421:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
  case ISD::FMUL:
  ^
/home/david.spickett/llvm-project/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2421:3: note: insert 'LLVM_FALLTHROUGH;' to silence this warning
  case ISD::FMUL:
  ^
  LLVM_FALLTHROUGH;
/home/david.spickett/llvm-project/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2421:3: note: insert 'break;' to avoid fall-through
  case ISD::FMUL:
  ^
  break;

Though the existing code directly above doesn't annotate it either, so if you feel like fixing it, maybe do that in another patch.

Yep, thanks, I noticed it too and will recommit soon with the annotation included, because there is a bot that errors on it:

https://lab.llvm.org/buildbot/#/builders/57/builds/25973
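
For reference, the fix is a one-line annotation that makes the fall-through into the FMUL/FDIV handling explicit, along these lines (sketched; the recommit linked at the end of the thread is authoritative, and the LLVM_FALLTHROUGH macro that clang 14 suggests would silence the warning equally well):

    if (!Ty->getScalarType()->isFP128Ty())
      return LT.first;
    [[fallthrough]]; // FP128 deliberately shares the FMUL/FDIV cost
  case ISD::FMUL:
  case ISD::FDIV:
    return LT.first * 2;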

MaskRay added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2420

Also an "error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]" to address before you reland the patch :)

SjoerdMeijer added inline comments.Apr 11 2023, 10:56 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2420

Included/fixed in the recommit:
https://reviews.llvm.org/rGd827865e9f77