This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC][MachineCombiner] reassociate fma to expose more ILP
ClosedPublic

Authored by shchenz on May 18 2020, 8:42 PM.

Details

Reviewers
hfinkel
jsji
Group Reviewers
Restricted Project
Commits
rGbd7096b977e1: [PowerPC] fma chain break to expose more ILP
Summary

This patch tries to reassociate two patterns related to FMA to expose more ILP on PowerPC.

// Pattern 1:
//   A =  FADD X,  Y          (Leaf)
//   B =  FMA  A,  M21,  M22  (Prev)
//   C =  FMA  B,  M31,  M32  (Root)
// -->
//   A =  FMA  X,  M21,  M22
//   B =  FMA  Y,  M31,  M32
//   C =  FADD A,  B
// Pattern 2:
//   A =  FMA  X,  M11,  M12  (Leaf)
//   B =  FMA  A,  M21,  M22  (Prev)
//   C =  FMA  B,  M31,  M32  (Root)
// -->
//   A =  FMUL M11,  M12
//   B =  FMA  X,  M21,  M22
//   D =  FMA  A,  M31,  M32
//   C =  FADD B,  D

breaking the dependency between A and B, allowing FMA to be executed in parallel (or back-to-back in a pipeline) instead of depending on each other.

Diff Detail

Event Timeline

shchenz created this revision.May 18 2020, 8:42 PM
shchenz edited the summary of this revision. (Show Details)May 18 2020, 9:02 PM
shchenz updated this revision to Diff 266001.May 25 2020, 4:27 AM

format fixing & rebase & some typo fixing

shchenz edited the summary of this revision. (Show Details)May 26 2020, 1:41 AM
spatel added a subscriber: spatel.May 27 2020, 1:40 PM

I did not look at the patch itself except to notice that it is a lot of code...so I have to ask - did you look at implementing at least the 1st pattern in DAGCombiner? That seems like a general improvement for any superscalar micro-arch with no register pressure disadvantage.

llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
313–319

I was confused here because I was expecting the C++ style notation for FMA (X*Y+Z):
https://en.cppreference.com/w/cpp/numeric/math/fma

shchenz marked an inline comment as done.May 27 2020, 5:55 PM

I did not look at the patch itself except to notice that it is a lot of code...so I have to ask - did you look at implementing at least the 1st pattern in DAGCombiner? That seems like a general improvement for any superscalar micro-arch with no register pressure disadvantage.

Thanks for looking into this @spatel

Yes, I tried to implement pattern 1 in DAGCombiner, but I got some LIT failures related to register allocation on platform AArch64 and Thumb2. And this kind of opt will increase register pressure. I think it is better not to add it in DAGCombiner.
Reason I add these two patterns in MachineCombiner is:
1: This pass is targeted for ILP related optimization
2: Adding register pressure estimation model here should be easy than in DAGCombiner. We can do similar estimation like we did in MachineLICM if we want to model it in future?
3: These two patterns have to be put together. After breaking pattern 2: (fma+fma+fma) to (fmul+fma+fma+fadd), the last fadd can be combined with following two fmas as pattern 1, and we can get more paralleled fmas.

I agree that this can be exploited to other platforms that support destructive hardware fma instructions. But I am not familiar with other platform's instruction set, so currently I only implement it on PowerPC.

llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
313–319

This comment is target-specific. On PowerPC, most fma like instructions such as xsmaddadp/xsmaddasp/xvmaddadp/xvmaddasp are defined with the above form in ISA.

I did not look at the patch itself except to notice that it is a lot of code...so I have to ask - did you look at implementing at least the 1st pattern in DAGCombiner? That seems like a general improvement for any superscalar micro-arch with no register pressure disadvantage.

Thanks for looking into this @spatel

Yes, I tried to implement pattern 1 in DAGCombiner, but I got some LIT failures related to register allocation on platform AArch64 and Thumb2. And this kind of opt will increase register pressure. I think it is better not to add it in DAGCombiner.
Reason I add these two patterns in MachineCombiner is:
1: This pass is targeted for ILP related optimization
2: Adding register pressure estimation model here should be easy than in DAGCombiner. We can do similar estimation like we did in MachineLICM if we want to model it in future?
3: These two patterns have to be put together. After breaking pattern 2: (fma+fma+fma) to (fmul+fma+fma+fadd), the last fadd can be combined with following two fmas as pattern 1, and we can get more paralleled fmas.

Thank you for the explanation. I agree that this is a better place to try the transform. DAGCombiner has no real register pressure analysis. One thing to be aware of: the compile-time cost of MachineCombiner was potentially high when analyzing large blocks. My experience with this pass is a few years old, so this might have changed, but there may be some bugzilla reports on that.

If you have a version of the transform as a DAGCombiner patch and can post it somewhere, I would still be interested in trying it out locally.

shchenz added a comment.EditedMay 28 2020, 6:46 AM

@spatel , Thanks for reminding the compiling time issue. I will have a test about compiling time later.

This was the prototype I implemented in DAGCombiner,

Before the final return statement of visitFMA() function in file DAGCombiner.cpp

Sorry for the bad formatting and it is not well tested.

+
+  // expose more ILP:
+  // (fma E, F, (fma C, D, (add A, B))) -> add ((fma C, D, A), (fma E, F, A))
+  TargetSchedModel SchedModel;
+  SchedModel.init(&DAG.getSubtarget());
+  if (UnsafeFPMath && SchedModel.getIssueWidth() > 1 && N2.getOpcode() == ISD::FMA && N2.hasOneUse() && (Options.UnsafeFPMath || isContractable(N2.getNode()))) {
+    SDValue OpADD = N2.getOperand(2);
+    if (OpADD.getOpcode() == ISD::FADD && OpADD.hasOneUse()) {
+      SDValue ADDLHS = DAG.getNode(ISD::FMA, SDLoc(N2), VT, N2.getOperand(0), N2.getOperand(1), OpADD.getOperand(0), Flags);
+      SDValue ADDRHS = DAG.getNode(ISD::FMA, SDLoc(N2), VT, N0, N1, OpADD.getOperand(1), Flags);
+      return DAG.getNode(ISD::FADD, DL, VT, ADDLHS, ADDRHS, OpADD.getNode()->getFlags());
+    }
+  }
+
shchenz updated this revision to Diff 267556.Jun 1 2020, 12:38 AM

set resource length extension to 1 on PowerPC

no compile time deg found in the benchmarks I run.

jsji accepted this revision as: jsji.Jun 6 2020, 1:37 PM

LGTM with some nits.

llvm/lib/Target/PowerPC/PPCInstrInfo.cpp
354

Can we define or use enum for index instead of hardcode 1,2,3,4? So that it will be easier to read and maintain.
eg:

#define InfoArrayIdxFMAInst 0
#define InfoArrayIdxFAddInst 1
#define InfoArrayIdxFMULInst 2
#define InfoArrayIdxAddOpIdx 3 
#define InfoArrayIdxMULOpIdx 4
395

Do we need to reset AddOpIdx befoer calling IsReassociable again? Or else the value is not -1 anymore, we won't be able to catch issues in following assert..

llvm/lib/Target/PowerPC/PPCInstrInfo.h
326

Pattern? -> P? Or replace P to Pattern in following line.

340

Can we make this comment more clearer? eg: Why we need to set it to 1? for what edge case?

This revision is now accepted and ready to land.Jun 6 2020, 1:37 PM
shchenz updated this revision to Diff 269089.Jun 7 2020, 7:24 PM

address review comments:
1: use the macro to represent array index instead of hard code
2: comments update

Can you add a hidden option with init false? You can turn it true later on.
So that people can try with your option off and on? Thanks!

shchenz added a comment.EditedJun 8 2020, 11:41 PM

Can you add a hidden option with init false? You can turn it true later on.
So that people can try with your option off and on? Thanks!

Have you met some issues with this being turned on? This is specific for PowerPC, I have verified that on PowerPC there is no deg for the patch on the benchmarks I run. If you just want to see the impact of this patch, I would suggest you comment out the two newly added lines in function PPCInstrInfo::getMachineCombinerPatterns, it will turn off this opt.

Can you add a hidden option with init false? You can turn it true later on.
So that people can try with your option off and on? Thanks!

Have you met some issues with this being turned on? This is specific for PowerPC, I have verified that on PowerPC there is no deg for the patch on the benchmarks I run. If you just want to see the impact of this patch, I would suggest you comment out the two newly added lines in function PPCInstrInfo::getMachineCombinerPatterns, it will turn off this opt.

From a user perspective, for anyone who investigate a benchmark or debug a problem, it is very frequently to disable or enable many optimizations, and it is impossible for a user to find someway to comment out some lines in some function for some optimizations and rebuild the compiler, and uncomment some lines and rebuild the compiler again and again.

I think this is a simple and reasonable suggestion. By the way, I do not have very strong opinion on this. -:) What do you think? @jsji @nemanjai

Can you add a hidden option with init false? You can turn it true later on.
So that people can try with your option off and on? Thanks!

This is not adding a new optimization but is increasing the scope of an existing optimization. So I don't think it is appropriate to add an option to control only this aspect of this. If there is a use case for allowing the user to have fine grained control over which combiner patterns to use, then we can add an enum option. But we need to have justification for why that is needed.

OTOH, I am not opposed to adding an option to turn off the machine combiner (i.e. change the value returned by PPCInstrInfo::useMachineCombiner()) - but that is orthogonal to this patch and can be done separately.

Can you add a hidden option with init false? You can turn it true later on.
So that people can try with your option off and on? Thanks!

This is not adding a new optimization but is increasing the scope of an existing optimization. So I don't think it is appropriate to add an option to control only this aspect of this. If there is a use case for allowing the user to have fine grained control over which combiner patterns to use, then we can add an enum option. But we need to have justification for why that is needed.

OTOH, I am not opposed to adding an option to turn off the machine combiner (i.e. change the value returned by PPCInstrInfo::useMachineCombiner()) - but that is orthogonal to this patch and can be done separately.

Thanks for the explanation!

@AaronLiu Thanks for posting your concerns.
@nemanjai Thanks for your explanation and suggestion. I am happy to add an option to control MachineCombiner pass on PowerPC later.

This revision was automatically updated to reflect the committed changes.

@AaronLiu Thanks for posting your concerns.
@nemanjai Thanks for your explanation and suggestion. I am happy to add an option to control MachineCombiner pass on PowerPC later.

There is already one hidden option used to turn on/off machine combiner pass on PowerPC.

static cl::opt<bool>
EnableMachineCombinerPass("ppc-machine-combiner",
                          cl::desc("Enable the machine combiner pass"),
                          cl::init(true), cl::Hidden);