This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Add generic fnmsub intrinsic
ClosedPublic

Authored by qiucf on Dec 19 2021, 8:48 PM.

Details

Reviewers
rzurob
jsji
nemanjai
shchenz
Group Reviewers
Restricted Project
Commits
rGb2497e54356d: [PowerPC] Add generic fnmsub intrinsic
Summary

Currently in Clang, we have various builtins for the fnmsub operation:

  • __builtin_vsx_xvnmsubasp/__builtin_vsx_xvnmsubadp for float/double vectors; they are transformed into -fma(a, b, -c) in LLVM IR
  • __builtin_ppc_fnmsubs/__builtin_ppc_fnmsub for float/double scalars; they generate the corresponding intrinsics in IR

But for the vector versions of the builtins, the three-op chain may be recognized as expensive by some passes (such as early CSE). We need some way to keep the fnmsub form until code generation.

This patch introduces the ppc.fnmsub.* intrinsic to unify the four fnmsub intrinsics.

Diff Detail

Unit Tests: Failed

Event Timeline

qiucf created this revision.Dec 19 2021, 8:48 PM
qiucf requested review of this revision.Dec 19 2021, 8:48 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 19 2021, 8:48 PM

Converting more generic code to target-specific intrinsics is sometimes necessary to ensure the generic IR doesn't get transformed in a way that is disadvantageous. I believe the description of this review claims that to be the case for these negated FMAs. The obvious disadvantage of producing target-specific intrinsics is that the optimizer knows nothing about them, so no advantageous transformations can happen either (i.e. hiding the semantics from the optimizer is sometimes a good thing and sometimes a bad thing).

The description of this patch includes no test case that shows the optimizer performing an undesirable transformation. So the motivation for making the front end produce more opaque code is not at all clear.

qiucf added a comment.Jan 19 2022, 8:30 PM

Converting more generic code to target-specific intrinsics is sometimes necessary to ensure the generic IR doesn't get transformed in a way that is disadvantageous. I believe the description of this review claims that to be the case for these negated FMAs. The obvious disadvantage of producing target-specific intrinsics is that the optimizer knows nothing about them, so no advantageous transformations can happen either (i.e. hiding the semantics from the optimizer is sometimes a good thing and sometimes a bad thing).

The description of this patch includes no test case that shows the optimizer performing an undesirable transformation. So the motivation for making the front end produce more opaque code is not at all clear.

Here's a pretty simple case: vector float foo(vector float a, vector float b, vector float c, vector float d) { return __builtin_vsx_xvnmsubasp(c, d, a*b); }

It currently produces xvnegsp+xvmulsp+xvnmaddasp; after this patch it produces xvmulsp+xvnmsubasp. In more complicated cases, we see many more unexpected instructions generated.

qiucf added a comment.Feb 9 2022, 9:37 PM

gentle ping

hiding the semantics from the optimizer is sometimes a good thing and sometimes a bad thing).

Agreed. Imagine a case where the neg and fma (from the fnmsub) can both be CSE'd with another neg and fma, so we can eliminate the fnmsub entirely. But after we convert it to an intrinsic, we may lose the opportunity to CSE the fnmsub.

Here's a pretty simple case: vector float foo(vector float a, vector float b, vector float c, vector float d) { return __builtin_vsx_xvnmsubasp(c, d, a*b); }
It currently produces xvnegsp+xvmulsp+xvnmaddasp; after this patch it produces xvmulsp+xvnmsubasp. In more complicated cases, we see many more unexpected instructions generated.

This is narrowed down from a real-world case. After CSE'ing part of the fnmsub, it is hard to optimize it back to a single hardware fnmsub instruction, since we normally check the use count of a register during the combine and may bail out if it is not 1.

Is it possible to get some performance data for floating-point workloads with this patch? @qiucf

llvm/include/llvm/IR/IntrinsicsPowerPC.td
1737

When llvm_anyfloat_ty is f32 or f64, we will generate two intrinsics with the same semantics: llvm.ppc.nmsub.f32 + llvm.ppc.fnmsubs, and llvm.ppc.nmsub.f64 + llvm.ppc.fnmsub. At first glance, it seems we cannot delete int_ppc_fnmsub and int_ppc_fnmsubs, because they exist for XL compatibility and XL has separate fnmsub builtins for float and double that we need to map one to one. Better to check whether it is possible to replace int_ppc_fnmsub and int_ppc_fnmsubs with int_ppc_nmsub. If they can be replaced, we can use a meaningful name like int_ppc_fnmsub for the new intrinsic.

qiucf added a comment.Feb 22 2022, 1:45 AM

hiding the semantics from the optimizer is sometimes a good thing and sometimes a bad thing).

Agreed. Imagine a case where the neg and fma (from the fnmsub) can both be CSE'd with another neg and fma, so we can eliminate the fnmsub entirely. But after we convert it to an intrinsic, we may lose the opportunity to CSE the fnmsub.

Here's a pretty simple case: vector float foo(vector float a, vector float b, vector float c, vector float d) { return __builtin_vsx_xvnmsubasp(c, d, a*b); }
It currently produces xvnegsp+xvmulsp+xvnmaddasp; after this patch it produces xvmulsp+xvnmsubasp. In more complicated cases, we see many more unexpected instructions generated.

This is narrowed down from a real-world case. After CSE'ing part of the fnmsub, it is hard to optimize it back to a single hardware fnmsub instruction, since we normally check the use count of a register during the combine and may bail out if it is not 1.

Is it possible to get some performance data for floating-point workloads with this patch? @qiucf

Thanks. I did not see any performance change in common benchmarks.

llvm/include/llvm/IR/IntrinsicsPowerPC.td
1737

We can do that, but it requires more work and seems beyond the scope of this patch. See D105930: we'll need to handle the builtin in Clang, and the builtin explicitly generates type-M VSX instructions (I guess to reduce copies in simple cases).

qiucf updated this revision to Diff 411046.Feb 24 2022, 1:59 AM
qiucf marked an inline comment as done.
qiucf edited the summary of this revision. (Show Details)

Replace existing ppc.fnmsub and ppc.fnmsubs.

shchenz accepted this revision as: shchenz.Mar 2 2022, 6:21 PM

LGTM. Two nits about the comments and tests.

Please wait a few days in case other reviewers have comments.

clang/lib/CodeGen/CGBuiltin.cpp
15196

why add this comment here?

clang/test/CodeGen/PowerPC/builtins-ppc-xlcompat-math.c
98 ↗(On Diff #411046)

If we improve the check lines to CHECK-COUNT, do we still need the original CHECKs?

This revision is now accepted and ready to land.Mar 2 2022, 6:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 6:21 PM
qiucf updated this revision to Diff 413333.Mar 6 2022, 9:05 PM
qiucf marked an inline comment as done.
qiucf added inline comments.
clang/test/CodeGen/PowerPC/builtins-ppc-xlcompat-math.c
98 ↗(On Diff #411046)

Yes, otherwise we can't capture the right operands of llvm.ppc.fnmsub.f64.

This revision was landed with ongoing or failed builds.Mar 6 2022, 9:06 PM
This revision was automatically updated to reflect the committed changes.