This is an archive of the discontinued LLVM Phabricator instance.

DAGCombiner optimization for pow(x,0.75) and pow(x,0.25) in double and single precision even when a MASSV function is requested
Closed, Public

Authored by masoud.ataei on May 28 2020, 10:45 AM.

Details

Summary

Here, I am proposing to add a special case for the MASSV powf4/powd2 functions (the SIMD counterparts of powf/pow in the MASSV library) in the MASSV pass, so that later optimizations, such as the DAGCombiner conversion of pow(x,0.75) and pow(x,0.25) to a sequence of sqrt's, still fire in the vector floating-point case. My reason for doing this is that the sqrt sequence produced for pow(x,0.75) and pow(x,0.25), in both double and single precision, is faster than powf4/powd2 on P8 and P9.

When MASSV functions are requested and the exponent of pow is 0.75 or 0.25, we get the sequence of sqrt's; for any other exponent we get the appropriate MASSV function.
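
A scalar sketch of the algebra behind that DAGCombiner transform (fast-math is assumed; the helper names pow075/pow025 below are made up for this illustration and are not part of the patch):

#include <cmath>

// x^0.75 = x^(1/2) * x^(1/4) and x^0.25 = (x^(1/2))^(1/2), so both exponents
// reduce to one or two square roots instead of a full pow call.
double pow075(double x) { return std::sqrt(x) * std::sqrt(std::sqrt(x)); }
double pow025(double x) { return std::sqrt(std::sqrt(x)); }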

Diff Detail

Event Timeline

masoud.ataei created this revision. May 28 2020, 10:45 AM
Whitney added inline comments. May 28 2020, 6:10 PM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
10943 ↗(On Diff #266943)

Move this line up, so the case and return are on the same line like the others.

llvm/lib/Target/PowerPC/PPCISelLowering.h
1063 ↗(On Diff #266943)

lowerToLibCall -> LowerToLibCall
Follow the same convention as the other functions that start with Lower.

What we are doing is as follows:
llvm.pow(IR) --> FPOW(ISD) --> __powf4_P8/9(ISD/IR)

From what I see, it makes more sense to do this in an IR pass. Then you can query TargetLibraryInfoImpl::isFunctionVectorizable(name) instead of the option. Maybe we can do it in the PPC pass PPCLowerMASSVEntries, which already does something like:

__sind2_massv --> __sind2_P9 for a Power9 subtarget.

Does it make sense?
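
For illustration, a minimal sketch of the TargetLibraryInfo query suggested above, assuming a TargetLibraryInfoImpl already configured for the MASSV vector library; the helper name hasMASSVVariant is made up here and is not part of any pass:

#include "llvm/ADT/StringRef.h"
#include "llvm/Analysis/TargetLibraryInfo.h"
using namespace llvm;

// Returns true when the named scalar function (e.g. "powf") has a vector
// counterpart registered for the selected vector library, so a pass can key
// its decision off TLI instead of a separate command-line option.
static bool hasMASSVVariant(const TargetLibraryInfoImpl &TLII, StringRef Fn) {
  return TLII.isFunctionVectorizable(Fn);
}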

Addressing the review comments

masoud.ataei marked 3 inline comments as done. Jun 1 2020, 9:08 AM

I agree that in general it makes more sense to do this kind of conversion in an IR pass like PPCLowerMASSVEntries, and that is what we currently do in LLVM, but there is a problem. If we convert the llvm intrinsic to a libcall before a later optimization (like the one in the DAGCombiner), that optimization is never triggered. In the case I am proposing to change, if we have pow(x,0.75) in the code, the PPCLowerMASSVEntries pass currently turns it into a __powf4_P* libcall, so later in the DAGCombiner we do not get the optimization pow(x,0.75) --> sqrt(x)*sqrt(sqrt(x)). That is a problem, because sqrt(x)*sqrt(sqrt(x)) is faster.

What I am proposing is to move the conversion for powf4 later in the compiler pipeline. With this change we get the above optimization when the exponent is 0.75 and the MASSV call otherwise.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
10943 ↗(On Diff #266943)

It was clang-format's suggestion to put the return on a new line. I agree it was too ugly.

masoud.ataei marked an inline comment as done. Jun 1 2020, 9:09 AM

If we expect llvm.pow(x, 0.75) to be lowered as two sqrts and the other exponents as libcalls, can we simply not transform the 0.75 case into a libcall in PPCLowerMASSVEntries? Then it will be lowered as two sqrts in DAGCombine.

For pow to become __powf4_P8, there are two steps:

  1. in LoopVectorizePass, pow becomes __powf4_massv, then
  2. in PPCLowerMASSVEntries __powf4_massv becomes __powf4_P8

So by the time we reach PPCLowerMASSVEntries, the pow intrinsic is already a libcall. I thought it would be really ugly to undo the LoopVectorizePass conversion in the PPCLowerMASSVEntries pass just for a special case.
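
For context, a simplified sketch of the kind of name rewriting step 2 performs; this is written for this note under the assumption of a plain suffix swap and is not the actual createMASSVFuncName implementation:

#include "llvm/ADT/StringRef.h"
#include <string>
using namespace llvm;

// "__powf4_massv" -> "__powf4_P9" on a Power9 subtarget, "__powf4_P8" otherwise.
static std::string massvNameForSubtarget(StringRef MASSVName, bool IsPower9) {
  StringRef Base = MASSVName;
  Base.consume_back("_massv"); // drop the generic MASSV suffix if present
  std::string Result = Base.str();
  Result += IsPower9 ? "_P9" : "_P8";
  return Result;
}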

steven.zhang added a comment. Edited Jun 2 2020, 9:16 PM

What I see from your test is that we are trying to lower the pow intrinsic (not the pow libcall) to two sqrts or __powf4_P8/9. That is different from the pow -> __powf4_massv -> __powf4_P8 case, which stays on the libcall path throughout.

If your intention is to have __powf4_massv transformed into two fsqrts when the argument is 0.75, the test for that is missing from this patch and I am not sure the patch would work. You might need to transform __powf4_massv back into the llvm.pow intrinsic somewhere like PartiallyInlineLibCallsLegacyPass when the argument is 0.75.

Maybe I am missing some background here; I just want a clear understanding of why we have to do this in the DAG. Thank you for your patience.

masoud.ataei added a comment. Edited Jun 3 2020, 8:14 AM

Thank you for reviewing this patch.

I was also thinking that I would need to revert the LoopVectorizePass and PPCLowerMASSVEntries conversions back to the llvm.pow intrinsic before the sqrt optimization happens. But it seems that just setting setOperationAction(ISD::FPOW, MVT::v4f32, Custom); is enough to trigger the sqrt optimization when pow(x,0.75) is used in the code.

I tested with a C example too:

$ cat vst.c 
#include <math.h>
void my_vspow_075 (float y[], float x[]) {
  #pragma disjoint (*y, *x)

  float *xp=x, *yp=y;
  int i;

  for (i=0; i<1024; i++) {
     *yp=powf(*xp, 0.75);
     xp++;
     yp++;
  }
}

and compiled it with clang -Ofast -fveclib=MASSV vst.c -mllvm -print-after-all. I can see that after LoopVectorizePass, llvm.pow.f32 is converted to __powf4_massv, and after PPCLowerMASSVEntries, __powf4_massv is converted to __powf4_P8. But later we still get the conversion to sqrt's. This happens if you just set setOperationAction(ISD::FPOW, MVT::v4f32, Custom); with no other changes. The rest of the changes in PPCISelLowering.cpp are needed to handle powf(x,y) when y is not 0.75.

As for whether we apply these changes in LoopVectorizePass, PPCLowerMASSVEntries, or PPCISelLowering, I was not the only person who decided to handle it in PPCISelLowering. I would be happy if the others could speak up and help me answer your question correctly; their names are in the list of reviewers, so I won't tag them here again.

I don't see any difference in the assembly output for your example code with your patch. I see that we turn llvm.pow.f32 into two sqrts without this patch, not into __powf4_P8. I assume you want to turn __powf4_P8 into two sqrt's when the argument is 0.75. Please correct me if I misunderstand the intention.

@steven.zhang
I think I made a terrible mistake. When I tested with my C code, I did not have the if (ClVectorLibrary == TargetLibraryInfoImpl::MASSV) guard on

if (ClVectorLibrary == TargetLibraryInfoImpl::MASSV)
  setOperationAction(ISD::FPOW, MVT::v4f32, Custom);

and for some reason, if I move this check inside the function LowerFPOWMASSV, it works correctly. So I am updating the patch. Thank you for catching it.

This fix is not ideal, though. If we want to use other vector libraries like Accelerate or SVML on PowerPC in the future, this code prevents generating the correct libcalls for them. Any idea how to fix this issue?

steven.zhang added a comment. Edited Jun 4 2020, 11:28 PM

Marking FPOW as Custom changes the cost of vectorizing llvm.pow.f32. The loop vectorizer will then vectorize it as llvm.pow.v4f32 even when MASSV is disabled, because we are telling it that the backend can lower FPOW (llvm.pow.v4f32) cheaply, which is not always true. That will cause a regression on the code path with MASSV disabled when the argument is not 0.75.

I think the best way to do this is to adjust the loop vectorizer cost model for PowerPC: if the argument is 0.75, the cost of vectorizing llvm.pow.f32 is small whether MASSV is enabled or not, so we always get llvm.pow.v4f32. But I don't know how easy that is to do.
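
As a rough illustration of that cost-model idea, such a check would only treat a vector pow as cheap when the exponent is a constant splat of 0.75 (or 0.25, which the patch later handles as well). The helper below is a sketch written for this thread, not code from TTI or from this patch:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// Cheap only when the backend can expand it into sqrt's: a vector llvm.pow
// whose exponent operand is a constant splat of 0.75 or 0.25.
static bool isCheapVectorPow(const CallInst *CI) {
  if (CI->getIntrinsicID() != Intrinsic::pow || !CI->getType()->isVectorTy())
    return false;
  const auto *Exp = dyn_cast<Constant>(CI->getArgOperand(1));
  if (!Exp)
    return false;
  const auto *CFP = dyn_cast_or_null<ConstantFP>(Exp->getSplatValue());
  return CFP && (CFP->isExactlyValue(0.75) || CFP->isExactlyValue(0.25));
}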

An easier way is to turn the MASSV function back into the intrinsic, which is the motivation of this patch as you described and makes sense to me. What do you think?

diff --git a/llvm/lib/Target/PowerPC/PPCLowerMASSVEntries.cpp b/llvm/lib/Target/PowerPC/PPCLowerMASSVEntries.cpp
index 429b8a31fbe9..74bd31b0b044 100644
--- a/llvm/lib/Target/PowerPC/PPCLowerMASSVEntries.cpp
+++ b/llvm/lib/Target/PowerPC/PPCLowerMASSVEntries.cpp
@@ -105,6 +105,21 @@ bool PPCLowerMASSVEntries::lowerMASSVCall(CallInst *CI, Function &Func,
   if (CI->use_empty())
     return false;

+  // FIXME - add necessary fast math flag check here.
+  if (Func.getName() == "__powf4_massv") {
+    if (Constant *Exp = dyn_cast<Constant>(CI->getArgOperand(1))) {
+      if (ConstantFP *CFP = dyn_cast_or_null<ConstantFP>(Exp->getSplatValue())) {
+        // If the argument is 0.75, it is cheaper to turn it into the pow intrinsic
+        // so that it can be optimized as two sqrt's.
+        if (CFP->isExactlyValue(0.75)) {
+          CI->setCalledFunction(Intrinsic::getDeclaration(&M, Intrinsic::pow,
+                                                          CI->getType()));
+          return true;
+        }
+      }
+    }
+  }
+
   std::string MASSVEntryName = createMASSVFuncName(Func, Subtarget);
   FunctionCallee FCache = M.getOrInsertFunction(
       MASSVEntryName, Func.getFunctionType(), Func.getAttributes());

Moving the changes entirely from PPCISelLowering.cpp to PPCLowerMASSVEntries.cpp (the MASSV pass) to address the reviewer comments.

steven.zhang added inline comments. Jun 9 2020, 8:27 PM
llvm/lib/Target/PowerPC/PPCLowerMASSVEntries.cpp
103

The nsz flag does not matter here; it is only needed to handle the 0.25 case. But maybe you can extend this to support both 0.75 and 0.25, as we did in DAGCombine, or else remove the nsz check here.
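
For background on why nsz only matters for the 0.25 case (a standalone IEEE-754 illustration, not code from this patch): sqrt(-0.0) is -0.0, so the pure-sqrt expansion of x^0.25 can flip the sign of zero, while the x^0.75 expansion multiplies two sqrt's and yields +0.0 again.

#include <cmath>
#include <cstdio>

int main() {
  // pow(-0.0, 0.25) is +0.0, but sqrt(sqrt(-0.0)) is -0.0, so nsz is needed.
  std::printf("%g %g\n", std::pow(-0.0, 0.25), std::sqrt(std::sqrt(-0.0)));
  // pow(-0.0, 0.75) is +0.0 and sqrt(-0.0) * sqrt(sqrt(-0.0)) is +0.0 as well.
  std::printf("%g %g\n", std::pow(-0.0, 0.75),
              std::sqrt(-0.0) * std::sqrt(std::sqrt(-0.0)));
  return 0;
}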

llvm/test/CodeGen/PowerPC/pow_massv_0.75exp.ll
48 ↗(On Diff #269511)

Don't add the fast flag, as it is too general. Instead, add ninf and afn, and add another negative test case that does not have these two flags.

I added support for the case where the exponent is 0.25, as well as support for the double-precision cases.

masoud.ataei retitled this revision from DAGCombiner optimization for pow(x,0.75) even when a MASSV function is requested to DAGCombiner optimization for pow(x,0.75) and pow(x,0.25) in double and single precision even when a MASSV function is requested. Jun 10 2020, 12:22 PM
masoud.ataei edited the summary of this revision.

Sorry, I forgot to run clang-format.

steven.zhang accepted this revision. Jun 10 2020, 3:55 PM

LGTM now. Thank you for your patience.

This revision is now accepted and ready to land. Jun 10 2020, 3:55 PM
This revision was automatically updated to reflect the committed changes.