This is an archive of the discontinued LLVM Phabricator instance.

Power9 Instructions for build_vector improvements
ClosedPublic

Authored by nemanjai on Jun 8 2016, 7:04 AM.

Details

Summary

This patch exploits the following instructions:
mtvsrws
lxvwsx
mtvsrdd
mfvsrld

In order to improve some build_vector and extractelement patterns.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 60029.Jun 8 2016, 7:04 AM
nemanjai retitled this revision from to Power9 Instructions for build_vector improvements.
nemanjai updated this object.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added a subscriber: echristo.
amehsan added inline comments.Jun 8 2016, 2:24 PM
lib/Target/PowerPC/PPCInstrVSX.td
2243–2245

I think this should depend on how the extracted element is going to be used. If the subsequent use is somehow in a VSX register we do not want to do this. For example if we extract the integer, then convert it to floating point and do some FP arithmetic on it.

amehsan added inline comments.Jun 8 2016, 2:51 PM
lib/Target/PowerPC/PPCISelLowering.cpp
7491–7501

Is this true even if LOAD has users other than SCALAR_TO_VECTOR?

lib/Target/PowerPC/PPCInstrVSX.td
2243–2245

I am not saying that all cases should be handled in this patch. The example that I provided may need to be handled in DAGCombine and probably by the time we reach here, this is the right decision. That does not need to be in this patch. But I want to make sure that after adding this code, we do not have patterns for which we generate slower code on pwr9 compare to pwr8.

nemanjai added inline comments.Jun 13 2016, 11:02 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7491–7501

Ah, thanks for pointing this out. Yes, there's a missing check for hasOneUse() on the LOAD. It will be in the updated patch (along with a test case to ensure we don't get rid of the splat).

lib/Target/PowerPC/PPCInstrVSX.td
2243–2245

Yes, I think the right thing to do in these cases would be either a DAG combine or a peephole to look for where we move stuff out of VSX registers just to move them back in.
In any case, the pattern for Power8 is a swap followed by a direct move. On Power9, we just avoid the initial swap.

nemanjai updated this revision to Diff 60574.Jun 13 2016, 12:06 PM

Added the missing check for only one use of the load when deciding whether to eliminate the splat when building a vector of i32's on Power9.

amehsan added inline comments.Jun 13 2016, 8:51 PM
lib/Target/PowerPC/PPCInstrVSX.td
2243–2245

That problem already exists on PWR8.

for

define double @test2(<2 x i64> %a) {
entry:
  %0 = extractelement <2 x i64> %a, i32 0
  %1 = sitofp i64 %0 to double
  ret double %1
}

we generate

xxswapd  0, 34
mfvsrd 3, 0
mtvsrd 0, 3
xscvsxddp 1, 0
blr

I will open a bugzilla item for this.

amehsan edited edge metadata.Jun 14 2016, 11:55 AM

As we discussed, before you commit the change, please add -verify-machineinstrs to your regression tests. No need to upload the patch again. Thanks.

nemanjai updated this revision to Diff 62708.Jul 4 2016, 12:50 PM
nemanjai edited edge metadata.

Some of the new instructions were being emitted for unintended code patterns (such as materializing a vector of zeros). The new sequences were inferior so this update ensures that we emit the better code sequence. For example, due to the "AddedComplexity", the initial patch emitted a load-immediate followed by a direct move for materializing ones or zeros into a vector. A vector of zeros can be produced with a single XXLXOR. A vector of ones can be produced by a splat-immediate (especially now that we have a VSX version of it).
Test case was modified accordingly.

This patch was functionally tested on the Power9 simulator.

kbarton requested changes to this revision.Aug 31 2016, 9:43 AM
kbarton edited edge metadata.

This is perhaps minor, but we should rethink the change in PPCInstPrinter.cpp. If this change is needed, then we should change all the print routines in a similar manner.

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
328

I'm not sure about this change.
Why are we printing as unsigned int, instead of unsigned char?
It seems like this method, and the method above (printU7ImmOperand) should be using (unsigned char) instead of (unsigned int). It looks like this was done with the printU10ImmOperand below (and probably others, but I didn't look exhaustively).

This revision now requires changes to proceed.Aug 31 2016, 9:43 AM
nemanjai added inline comments.Sep 12 2016, 8:48 AM
lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
328

This is a great point. I don't know why I didn't think of just casting to unsigned char which will implicitly truncate. I'll try that and re-post.

nemanjai updated this revision to Diff 71406.Sep 14 2016, 11:50 AM
nemanjai edited edge metadata.

Updated the truncation of the 32-bit unsigned value to 8-bits in PPCInstrPrinter.cpp.

kbarton accepted this revision.Sep 21 2016, 10:56 AM
kbarton edited edge metadata.

LGTM

This revision is now accepted and ready to land.Sep 21 2016, 10:56 AM
nemanjai closed this revision.Sep 23 2016, 6:34 AM

Committed revision 282246.