This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Improvements for BUILD_VECTOR Vol. 4
ClosedPublic

Authored by nemanjai on Oct 28 2016, 1:09 AM.

Details

Summary

This is the final patch that D25580 is split into. It adds the changes to the PPCMIPeephole.cpp that eliminate the redundancies that the new handling of BUILD_VECTOR nodes introduces. It also includes the test case that integrates the testing of all these changes. Please note that Phabricator has a tendency to collapse large files so you may miss the test case in the review altogether - please ensure to expand the file in your view.

The motivation for this patch:
Emitting reasonable code for BUILD_VECTOR nodes is fairly important due to the prevalence of these nodes. There are currently a number of patterns (some common and some less so) for which the PPC back end produces just terrible code. The idea with this series of patches is to identify as many as possible of those patterns and improve the code gen for them. Overall, this improves performance of the benchmarks in projects/test-suite and when I get the chance, I'll evaluate the effect of these patches on SPEC.
I will add a comment illustrating some patterns where the current code gen is terrible (I don't do it in the description as I don't know how to add code here).

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 76170.Oct 28 2016, 1:09 AM
nemanjai retitled this revision from to [PowerPC] Improvements for BUILD_VECTOR Vol. 4.
nemanjai updated this object.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added a subscriber: llvm-commits.

Here are some examples of terrible code gen currently (rather than providing code, I'm just illustrating these with pseudo-SDAG nodes):

(build_vector (load $A), (load $B), (load $C) ...) -> performs each load even if they're consecutive, plus a load and vperm
(build_vector (fptosi), (fptosi), ...) -> will perform a conversion, move from VSR, move to VSR for each, then it will swap and merge which involves a load and vperm
(build_vector iNN:$A, iNN:$B, iNN:$C...) -> will do scalar_to_vector for each which involves an unnecessary swap on LE along with a load and vperm
(build_vector iNN:$A, iNN:$A, ...) -> will do a scalar_to_vector which involves a swap on LE

and so on. All of these are improved to produce optimal code both on Power8 and Power9. The Power9 versions utilize the new instructions that make some of these more efficient.

amehsan edited edge metadata.Oct 28 2016, 11:20 AM

Here are some examples of terrible code gen currently (rather than providing code, I'm just illustrating these with pseudo-SDAG nodes):

Could you also comment on what is your process for identifying problematic patterns that have been addressed in these patches?

lib/Target/PowerPC/PPCMIPeephole.cpp
151–154 ↗(On Diff #76170)

Can't we just replace uses of MI.getOperand(0) with MI.getOperand(1) instead of creating a new copy instruction?

244–246 ↗(On Diff #76170)

Given the name of variable being defined 'SameOpcode' these three lines that you have added requires some comment explaining what's going on.

304–334 ↗(On Diff #76170)

I think you can factor out the common logic here either in a lambda or a helper function. You are doing very similar things for P1 and P2.

Here are some examples of terrible code gen currently (rather than providing code, I'm just illustrating these with pseudo-SDAG nodes):

Could you also comment on what is your process for identifying problematic patterns that have been addressed in these patches?

Yeah, when I was implementing some of the P9 instructions that naturally fit the BUILD_VECTOR nodes, I realized that we produce poor code for some patterns. I wanted to see how bad this situation is so I wrote the C test case that is added as a comment to build-vector-tests.ll and examined the assembly. Then I picked off the patterns that produce poor code and fixed them one by one.

nemanjai added inline comments.Oct 28 2016, 12:07 PM
lib/Target/PowerPC/PPCMIPeephole.cpp
151–154 ↗(On Diff #76170)

This is safe to do. The COPY will be removed if not needed. I did something similar before and it caused a problem when the operand defined by the removed instruction had another use. Please see https://reviews.llvm.org/D25493.
I suppose I could traverse the use list for MI.getOperand(0) and see if it can be replaced as well, but I don't think that's any cleaner.

244–246 ↗(On Diff #76170)

Good point. I'll rename the variable and add a short comment.

304–334 ↗(On Diff #76170)

That's a great idea. I'll do that.

Yeah, when I was implementing some of the P9 instructions that naturally fit the BUILD_VECTOR nodes, I realized that we produce poor code for some patterns. I wanted to see how bad this situation is so I wrote the C test case that is added as a comment to build-vector-tests.ll and examined the assembly. Then I picked off the patterns that produce poor code and fixed them one by one.

Some thoughts about this.
(1) How do we know that these pieces of code has any significance? Some are clearly contrived: (vector unsigned long long) 4.74. For others I do not think they are going to be common patterns in the user code. It will be more interesting if we know that the compiler is likely to generate a pattern of that kind. How did you make that judgement?

(2) How did you concluded that the fixes that you are making is in the right place. For example, maybe for some of these patterns we need to change something in InstCombine. Is there an explanation for the approach that you have taken?

Yeah, when I was implementing some of the P9 instructions that naturally fit the BUILD_VECTOR nodes, I realized that we produce poor code for some patterns. I wanted to see how bad this situation is so I wrote the C test case that is added as a comment to build-vector-tests.ll and examined the assembly. Then I picked off the patterns that produce poor code and fixed them one by one.

Some thoughts about this.
(1) How do we know that these pieces of code has any significance? Some are clearly contrived: (vector unsigned long long) 4.74. For others I do not think they are going to be common patterns in the user code. It will be more interesting if we know that the compiler is likely to generate a pattern of that kind. How did you make that judgement?

As I mentioned, some patterns are fairly common, some are likely not. The contrived example you mentioned is certainly contrived, but not unheard of. I've seen inline functions that splat scalars into vectors. Passing a double literal to such a function would have that exact form. While I certainly agree that it would be interesting to know what the compiler frequently produces in terms of SD nodes, I don't know that such insight is always worth the time investment to gain. I think that keeping a clearly inferior code generation choice for a pattern when it can easily be improved is fairly hard to justify. An easy improvement to code gen is in my view a better time investment than a much more time-consuming investigation into what the vectorizer or another optimization will produce.

(2) How did you concluded that the fixes that you are making is in the right place. For example, maybe for some of these patterns we need to change something in InstCombine. Is there an explanation for the approach that you have taken?

This is an interesting question. There are two things that happen fairly early (i.e. DAG Combine). That's the conversion of a build vector of fp-to-int conversions into an fp-to-int conversion of a build vector and conversion of a build vector of consecutive loads into a wide load. The first is something that the target-independent DAG combine purposely omits. There is a combine of a build vector of int-to-fp nodes into an int-to-fp of a build vector node but not with the conversions going the other way around. So there's likely a good reason not to do this on some targets. My attempt to implement it there also causes failures in the ARM target which means that ARM somehow depends on this not happening. Again, rather than investigating whether this is a win on all targets (and whether ARM has a bug exposed by this), I felt that doing this for PPC where it is clearly a win is a better use of my time. The latter is not something that can happen all that early as proving that the loads are consecutive can only happen later in the pipeline.
The remaining changes are all quite late in the pipeline - lowering. So the argument that perhaps there are cases where these changes will end up eliminating opportunities for some optimization since we're now not expanding a BUILD_VECTOR is not a very strong one because there simply aren't many optimizations after this. All of these changes are meant to allow the SDAG lowering to match patterns in the .td files - which are demonstrably optimal for the node that is lowered into them.

nemanjai edited edge metadata.Oct 28 2016, 2:29 PM
nemanjai added a subscriber: wschmidt.
lib/Target/PowerPC/PPCMIPeephole.cpp
151–154 ↗(On Diff #76170)

We don't know that the COPY is guaranteed to be removed. AFAIK, We should not rely on downstream optimizations if we can generate better code. Yes, all uses should be replaced. There is a utility method for that: Check: include/llvm/CodeGen/MachineRegisterInfo.h for method:

void replaceRegWith(unsigned FromReg, unsigned ToReg)

Regarding (2): I think we need to make sure InstCombine and target Independent lowering are doing the right thing for these. Fixing as late as possible does not seem correct to me. But I leave to other reviewers to decide.

lib/Target/PowerPC/PPCMIPeephole.cpp
151–154 ↗(On Diff #76170)

Looking at this method again, it has differences with replaceAllUsesWith. First I thought it is the same. Some comment from people with more experience will be helpful.

nemanjai updated this revision to Diff 76839.Nov 3 2016, 4:41 AM

Changed the name of a variable that became misleading.
Eliminated some duplicated code.

hfinkel edited edge metadata.Nov 8 2016, 9:31 PM

I think that keeping a clearly inferior code generation choice for a pattern when it can easily be improved is fairly hard to justify.

I agree. Clearly there are limits (i.e. things that InstCombine should fix), but for things that are not clearly anti-canonical, having a high-quality backend means trying to do a good job on those inputs. There are lots of frontends, people using vector intrinsics, etc.

There is a combine of a build vector of int-to-fp nodes into an int-to-fp of a build vector node but not with the conversions going the other way around. So there's likely a good reason not to do this on some targets. My attempt to implement it there also causes failures in the ARM target which means that ARM somehow depends on this not happening. Again, rather than investigating whether this is a win on all targets (and whether ARM has a bug exposed by this), I felt that doing this for PPC where it is clearly a win is a better use of my time.

Did you try adding the corresponding code as a target-specific DAGCombine?

amehsan added inline comments.Nov 9 2016, 7:22 AM
lib/Target/PowerPC/PPCMIPeephole.cpp
151–154 ↗(On Diff #76170)

@hfinkel your comment here will be helpful as well.

<removed repeated comments>

Did you try adding the corresponding code as a target-specific DAGCombine?

Yes, this transform is implemented as a target-specific DAGCombine but it's in one of the other related patches (https://reviews.llvm.org/D25980). That patch is only this transform (build_vector of fp-to-int into an fp-to-int of a build_vector).

kbarton requested changes to this revision.Nov 15 2016, 3:17 PM
kbarton edited edge metadata.
kbarton added inline comments.
lib/Target/PowerPC/PPCMIPeephole.cpp
323 ↗(On Diff #76839)

If I understand this logic correctly, you only want to remove the FRSP if *both* operands are created using it. If one parameter is defined using an FSPR and the other one isn't, then you can't remove it in one case.
This logic currently won't do that, because the removeFRSPIfPossible lambda looks at the operands independently. So you need to change the lambda into a check if it's safe to remove the FSPR, and if it's safe to do for both P1 and P2, then remove them (or change the logic in the lambda in a different way to ensure equivalent functionality).

325 ↗(On Diff #76839)

Are you sure this will do what you want after P1 has possibly been erasedFromParent?
The documentation for eraseFromParent indicates the node is removed from the instruction list and deleted. I think it's better to do this check before calling removeFRSPIfPossible(P1).

This revision now requires changes to proceed.Nov 15 2016, 3:17 PM
nemanjai updated this revision to Diff 78676.Nov 20 2016, 5:09 PM
nemanjai edited edge metadata.

Updated the code in the PeepHole optimization to ensure that an attempt to remove the same instruction twice cannot occur.
Looking a little deeper into the relationship between FRSP and XVCVDPSP, I realized that it is safe to remove an FRSP that feeds an XVCVDPSP even if the FRSP only produced one half of the input vector. This is because the two operations are equivalent (other than one being scalar and the other one a vector operation) and they do not modify a single precision input.

kbarton accepted this revision.Nov 21 2016, 11:55 AM
kbarton edited edge metadata.

LGTM

This revision is now accepted and ready to land.Nov 21 2016, 11:55 AM
This revision was automatically updated to reflect the committed changes.
llvm/trunk/lib/Target/PowerPC/PPCMIPeephole.cpp