This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] - Legalize illegal vector types by widening rather than integer promotion
ClosedPublic

Authored by nemanjai on May 19 2016, 11:26 AM.

Details

Summary

The legalization for an ISD::LOAD node when the type is v4i8 will convert it into an extended load to a vector type which ends up producing very bad code (see below). This patch simply adds a DAG combine that will convert such a load into an i32 load followed by a bitcast. This ends up collapsing into something more usable.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 57828.May 19 2016, 11:26 AM
nemanjai retitled this revision from to [PowerPC] - Combine loads of v4i8 to loads of i32 followed by bitcast.
nemanjai updated this object.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added a subscriber: llvm-commits.

An example of the type of code we were getting...
With a code pattern such as this:

define <16 x i8> @test(i32* %s, i32* %t) {
entry:
  %0 = bitcast i32* %s to <4 x i8>*
  %1 = load <4 x i8>, <4 x i8>* %0, align 4
  %2 = shufflevector <4 x i8> %1, <4 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
  ret <16 x i8> %2
}

We now get the following:

lwz 3, 0(3)
mtvsrd 34, 3
xxswapd  0, 34
xxspltw 34, 0, 3

and before this change, we were getting:

lbz 5, 0(3)
lbz 6, 1(3)
addis 4, 2, .LCPI0_0@toc@ha
addi 4, 4, .LCPI0_0@toc@l
mtvsrd 34, 5
lbz 5, 2(3)
lbz 3, 3(3)
lxvd2x 0, 0, 4
mtvsrd 35, 6
xxswapd  34, 34
mtvsrd 36, 5
mtvsrd 37, 3
xxswapd  35, 35
xxswapd  36, 36
xxswapd  37, 37
vmrglw 2, 3, 2
xxswapd  50, 0
vmrglw 19, 5, 4
vperm 2, 19, 2, 18
xxsldwi 12, 34, 34, 2
mfvsrwz 3, 34
xxsldwi 1, 34, 34, 1
xxsldwi 2, 34, 34, 3
mtvsrd 34, 3
mfvsrwz 3, 12
mfvsrwz 4, 1
mtvsrd 35, 3
mfvsrwz 3, 2
mtvsrd 36, 4
mtvsrd 37, 3
addis 3, 2, .LCPI0_1@toc@ha
xxswapd  34, 34
addi 3, 3, .LCPI0_1@toc@l
xxswapd  37, 37
lxvd2x 13, 0, 3
xxswapd  35, 35
xxswapd  36, 36
vmrglb 2, 5, 2
xxswapd  51, 13
vmrglb 3, 4, 3
vmrglh 2, 2, 3
xxspltw 34, 34, 3
vperm 2, 2, 2, 19
  • The TOC loads above were for materializing constant mask vectors for the vector shuffles that this degrades into.

This code pattern is a simulation of one that comes out of SROA and this patch provides a big improvement in one of the benchmarks.

hfinkel edited edge metadata.May 19 2016, 11:33 AM

Looks like you forgot the test case in the patch.

Looks like you forgot the test case in the patch.

Argh, yes I did. Too busy putting the details into the comments :D. I'll re-post in a few minutes (and one of these days I'll get used to running "svn add").

nemanjai updated this revision to Diff 57832.May 19 2016, 11:50 AM
nemanjai edited edge metadata.

Added the test case that I forgot in the initial patch.

amehsan added inline comments.May 20 2016, 7:01 AM
lib/Target/PowerPC/PPCISelLowering.cpp
10622–10635

Instead of fixing the issue for this particular pattern, can't we change type legalization, so that it always converts v4i8 to i32? This fixes the problem at hand, but if we have a different code pattern that includes v4i8, the current way of legalizing v4i8 in type legalization will kick in which seems to generate inefficient code, by promoting it to a larger vector and adding permutes and similar instructions.

I think we may need to change how type legalization handles v4i8, instead of fixing this particular pattern.

hfinkel added inline comments.May 20 2016, 10:24 AM
lib/Target/PowerPC/PPCISelLowering.cpp
10622–10635

To this point, we currently have code like this:

// We promote all non-typed operations to v4i32.
setOperationAction(ISD::AND   , VT, Promote);
AddPromotedToType (ISD::AND   , VT, MVT::v4i32);
setOperationAction(ISD::OR    , VT, Promote);
AddPromotedToType (ISD::OR    , VT, MVT::v4i32);
setOperationAction(ISD::XOR   , VT, Promote);
AddPromotedToType (ISD::XOR   , VT, MVT::v4i32);
setOperationAction(ISD::LOAD  , VT, Promote);
AddPromotedToType (ISD::LOAD  , VT, MVT::v4i32);
setOperationAction(ISD::SELECT, VT, Promote);
AddPromotedToType (ISD::SELECT, VT, MVT::v4i32);
setOperationAction(ISD::SELECT_CC, VT, Promote);
AddPromotedToType (ISD::SELECT_CC, VT, MVT::v4i32);
setOperationAction(ISD::STORE, VT, Promote);
AddPromotedToType (ISD::STORE, VT, MVT::v4i32);

maybe something similar would work in this area as well?

nemanjai added inline comments.May 20 2016, 10:44 AM
lib/Target/PowerPC/PPCISelLowering.cpp
10622–10635

Actually, you bring up a very good point. We really should be doing something better with legalization of v4i8 (and I imagine all vectors narrower than vectors our hardware actually handles). However, I don't think we can legalize it as a scalar type. We should actually be widening the vector (rather than promoting the integer element type). I'll re-post this patch to do that instead.

nemanjai updated this revision to Diff 58968.May 30 2016, 12:09 PM

This is a fundamentally different approach from the first attempt. As Ehsan suggested, I've updated how we legalize non-legal vector types. With this patch, if we do not support the vector type, we will widen it rather than performing an integer promotion which would often require scalarizing. However, widening can only be done when the bit-size of the vector element is a multiple of 8 (as we do not support any vectors made up of fractional byte elements).

There are instances where this approach produces worse code. This is exposed in some of the functions in the vsx.ll test case. To address one of those, this patch implements a DAG combine for a conversion of a v2i32 to v2f64 so that it remains a 2-instruction sequence. Similar DAG combines can later be implemented for other affected code patterns.

I've done some lightweight performance testing (the LNT tests on a quiet machine) and here are the results with benchmarks that take less than 5s to execute omitted (BEFORE_TIME is without this patch and AFTER_TIME is with this patch):

BENCHMARK_NAME                                                                            BEFORE_TIME AFTER_TIME  ABS_DIFF  PCT_DIFF
MultiSource/Applications/SPASS/Output/SPASS                                                    6.050     6.049    -0.001     -0.02%
MultiSource/Applications/JM/lencod/Output/lencod                                               5.102     5.138     0.036      0.70%
MultiSource/Applications/lambda-0.1.3/Output/lambda                                            5.891     5.889    -0.002     -0.04%
MultiSource/Applications/hexxagon/Output/hexxagon                                             13.814    13.808    -0.007     -0.05%
MultiSource/Applications/lua/Output/lua                                                       22.924    22.919    -0.004     -0.02%
MultiSource/Benchmarks/SciMark2-C/Output/scimark2                                             72.514    72.508    -0.006     -0.01%
MultiSource/Benchmarks/nbench/Output/nbench                                                   19.489    19.189    -0.300     -1.54%
MultiSource/Benchmarks/NPB-serial/is/Output/is                                                10.415    10.417     0.002      0.02%
MultiSource/Benchmarks/ASC_Sequoia/IRSmk/Output/IRSmk                                         32.249    32.987     0.739      2.29%
MultiSource/Benchmarks/ASC_Sequoia/AMGmk/Output/AMGmk                                         10.481    10.482     0.000      0.00%
MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/Output/CrystalMk                                  7.537     7.537    -0.001     -0.01%
MultiSource/Benchmarks/TSVC/Reductions-flt/Output/Reductions-flt                               5.548     5.548     0.000      0.00%
MultiSource/Benchmarks/TSVC/Searching-flt/Output/Searching-flt                                 6.452     6.452    -0.001     -0.01%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/Output/GlobalDataFlow-dbl                       6.823     6.825     0.002      0.03%
MultiSource/Benchmarks/TSVC/Reductions-dbl/Output/Reductions-dbl                               5.509     5.509     0.001      0.01%
MultiSource/Benchmarks/TSVC/Searching-dbl/Output/Searching-dbl                                 6.253     6.253     0.000      0.00%
MultiSource/Benchmarks/PAQ8p/Output/paq8p                                                    110.282    73.201   -37.080    -33.62%
MultiSource/Benchmarks/Bullet/Output/bullet                                                    5.242     5.209    -0.033     -0.63%
MultiSource/Benchmarks/7zip/Output/7zip-benchmark                                              7.853     7.842    -0.011     -0.14%
MultiSource/Benchmarks/mafft/Output/pairlocalalign                                            26.034    26.056     0.022      0.08%
SingleSource/Benchmarks/CoyoteBench/Output/almabench                                          30.836    30.827    -0.009     -0.03%
SingleSource/Benchmarks/CoyoteBench/Output/huffbench                                          21.345    21.350     0.005      0.02%
SingleSource/Benchmarks/Shootout/Output/sieve                                                  5.268     5.267    -0.001     -0.01%
SingleSource/Benchmarks/Shootout/Output/lists                                                  7.607     7.605    -0.002     -0.03%
SingleSource/Benchmarks/Shootout/Output/methcall                                              10.474    10.474     0.000      0.00%
SingleSource/Benchmarks/Shootout-C++/Output/lists                                             11.707    11.733     0.026      0.22%
SingleSource/Benchmarks/Shootout-C++/Output/methcall                                          11.177    11.175    -0.002     -0.02%
SingleSource/Benchmarks/Misc/Output/flops                                                      6.340     6.343     0.003      0.05%
SingleSource/Benchmarks/Misc/Output/salsa20                                                    6.292     6.291    -0.001     -0.01%
SingleSource/Benchmarks/Misc/Output/ReedSolomon                                                6.903     6.905     0.002      0.03%
SingleSource/Benchmarks/Misc-C++/Large/Output/sphereflake                                      6.154     6.154    -0.000     -0.00%
SingleSource/Benchmarks/Misc-C++/Output/stepanov_v1p2                                          9.809     9.809    -0.000     -0.00%
SingleSource/Benchmarks/Adobe-C++/Output/stepanov_abstraction                                  5.733     5.733    -0.000     -0.01%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/Output/2mm                       18.724    16.745    -1.980    -10.57%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/Output/3mm                       29.825    28.844    -0.981     -3.29%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/Output/gemm                      8.889     8.518    -0.371     -4.18%
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/Output/symm                     13.260    14.198     0.937      7.07%
SingleSource/Benchmarks/Linpack/Output/linpack-pc                                             13.215    13.214    -0.001     -0.01%
SingleSource/Benchmarks/SmallPT/Output/smallpt                                                12.783    12.781    -0.001     -0.01%
SingleSource/UnitTests/Vector/Altivec/Output/alti.expandfft                                   23.281    23.291     0.009      0.04%
SingleSource/UnitTests/Vectorizer/Output/gcc-loops                                             7.511     7.511     0.000      0.00%
amehsan added inline comments.Jun 2 2016, 2:48 PM
lib/Target/PowerPC/PPCISelLowering.cpp
10307

This can be sunk further down, right before where we actually use it.

10308–10309

If I am reading everything correctly you need to add some code here, to prevent an assertion.

First of all the comment in include/llvm/CodeGen/ISDOpcodes.h above definition of EXTRACT_VECTOR_ELT says that index into the vector might be variable. Implementation of getConstantOperandVal uses a cast<ConstantSDNode> which "causing an assertion failure if it is not really an instance of the right type" according to http://llvm.org/docs/ProgrammersManual.html#the-isa-cast-and-dyn-cast-templates

nemanjai added inline comments.Jun 3 2016, 2:29 AM
lib/Target/PowerPC/PPCISelLowering.cpp
10308–10309

Yes, that is a very good point. I forgot to add a check that a node from which I'm extracting a constant is actually a constant node :). Will update the patch now.

nemanjai updated this revision to Diff 59512.Jun 3 2016, 2:32 AM

I had forgotten to check whether the operands of the extractelement nodes are actually constants which can cause asserts. Added the checks now.

nemanjai added inline comments.Jun 7 2016, 7:21 AM
lib/Target/PowerPC/PPCISelLowering.cpp
10302

If there are no further requests, I won't post another review for this, but the final patch will do away with these "isa" calls and replace them with dyn_cast calls and the corresponding getConstantOperandVal() calls will be replaced with getZExtValue() calls on the resulting ConstantSDNode's.

amehsan edited edge metadata.Jun 14 2016, 11:54 AM

As we discussed, before you commit the change, please add -verify-machineinstrs to your regression tests. No need to upload the patch again. Thanks.

nemanjai retitled this revision from [PowerPC] - Combine loads of v4i8 to loads of i32 followed by bitcast to [PowerPC] - Legalize illegal vector types by widening rather than integer promotion.Jun 20 2016, 1:54 PM
nemanjai edited edge metadata.

Just a friendly reminder that it would be nice to get this patch into trunk before we branch off 3.9. Please review and comment accordingly.

hfinkel accepted this revision.Jul 4 2016, 5:48 AM
hfinkel edited edge metadata.

I apologize for the delay. Some minor requests below, but otherwise, LGTM.

lib/Target/PowerPC/PPCISelLowering.h
141

This name does not make it clear that you're converting to FP values. Let's name these SINT_VEC_TO_FP and UINT_VEC_TO_FP. Also note there that this is used for illegal vector types.

443

Can you say something about why it is worse. N years from now someone will wonder whether this is still true ;)

lib/Target/PowerPC/PPCInstrVSX.td
72

The existing naming convention here seems to have lower-case latter parts. Not sure why, but we might as well follow it for now.

def PPCsvec2fp : ...
def PPCuvec2fp : ...
This revision is now accepted and ready to land.Jul 4 2016, 5:48 AM
nemanjai closed this revision.Jul 5 2016, 4:13 AM

Committed revision 274535.

lib/Target/PowerPC/PPCISelLowering.h
443

Thank you. Will do (and I'll fix the typo - they're not "vectory" types :)).