The legalization for an ISD::LOAD node when the type is v4i8 will convert it into an extended load to a vector type which ends up producing very bad code (see below). This patch simply adds a DAG combine that will convert such a load into an i32 load followed by a bitcast. This ends up collapsing into something more usable.
Diff Detail
- Repository
- rL LLVM
Event Timeline
An example of the type of code we were getting...
With a code pattern such as this:
define <16 x i8> @test(i32* %s, i32* %t) { entry: %0 = bitcast i32* %s to <4 x i8>* %1 = load <4 x i8>, <4 x i8>* %0, align 4 %2 = shufflevector <4 x i8> %1, <4 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3> ret <16 x i8> %2 }
We now get the following:
lwz 3, 0(3) mtvsrd 34, 3 xxswapd 0, 34 xxspltw 34, 0, 3
and before this change, we were getting:
lbz 5, 0(3) lbz 6, 1(3) addis 4, 2, .LCPI0_0@toc@ha addi 4, 4, .LCPI0_0@toc@l mtvsrd 34, 5 lbz 5, 2(3) lbz 3, 3(3) lxvd2x 0, 0, 4 mtvsrd 35, 6 xxswapd 34, 34 mtvsrd 36, 5 mtvsrd 37, 3 xxswapd 35, 35 xxswapd 36, 36 xxswapd 37, 37 vmrglw 2, 3, 2 xxswapd 50, 0 vmrglw 19, 5, 4 vperm 2, 19, 2, 18 xxsldwi 12, 34, 34, 2 mfvsrwz 3, 34 xxsldwi 1, 34, 34, 1 xxsldwi 2, 34, 34, 3 mtvsrd 34, 3 mfvsrwz 3, 12 mfvsrwz 4, 1 mtvsrd 35, 3 mfvsrwz 3, 2 mtvsrd 36, 4 mtvsrd 37, 3 addis 3, 2, .LCPI0_1@toc@ha xxswapd 34, 34 addi 3, 3, .LCPI0_1@toc@l xxswapd 37, 37 lxvd2x 13, 0, 3 xxswapd 35, 35 xxswapd 36, 36 vmrglb 2, 5, 2 xxswapd 51, 13 vmrglb 3, 4, 3 vmrglh 2, 2, 3 xxspltw 34, 34, 3 vperm 2, 2, 2, 19
- The TOC loads above were for materializing constant mask vectors for the vector shuffles that this degrades into.
This code pattern is a simulation of one that comes out of SROA and this patch provides a big improvement in one of the benchmarks.
Argh, yes I did. Too busy putting the details into the comments :D. I'll re-post in a few minutes (and one of these days I'll get used to running "svn add").
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10620–10633 | Instead of fixing the issue for this particular pattern, can't we change type legalization, so that it always converts v4i8 to i32? This fixes the problem at hand, but if we have a different code pattern that includes v4i8, the current way of legalizing v4i8 in type legalization will kick in which seems to generate inefficient code, by promoting it to a larger vector and adding permutes and similar instructions. I think we may need to change how type legalization handles v4i8, instead of fixing this particular pattern. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10620–10633 | To this point, we currently have code like this: // We promote all non-typed operations to v4i32. setOperationAction(ISD::AND , VT, Promote); AddPromotedToType (ISD::AND , VT, MVT::v4i32); setOperationAction(ISD::OR , VT, Promote); AddPromotedToType (ISD::OR , VT, MVT::v4i32); setOperationAction(ISD::XOR , VT, Promote); AddPromotedToType (ISD::XOR , VT, MVT::v4i32); setOperationAction(ISD::LOAD , VT, Promote); AddPromotedToType (ISD::LOAD , VT, MVT::v4i32); setOperationAction(ISD::SELECT, VT, Promote); AddPromotedToType (ISD::SELECT, VT, MVT::v4i32); setOperationAction(ISD::SELECT_CC, VT, Promote); AddPromotedToType (ISD::SELECT_CC, VT, MVT::v4i32); setOperationAction(ISD::STORE, VT, Promote); AddPromotedToType (ISD::STORE, VT, MVT::v4i32); maybe something similar would work in this area as well? |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10620–10633 | Actually, you bring up a very good point. We really should be doing something better with legalization of v4i8 (and I imagine all vectors narrower than vectors our hardware actually handles). However, I don't think we can legalize it as a scalar type. We should actually be widening the vector (rather than promoting the integer element type). I'll re-post this patch to do that instead. |
This is a fundamentally different approach from the first attempt. As Ehsan suggested, I've updated how we legalize non-legal vector types. With this patch, if we do not support the vector type, we will widen it rather than performing an integer promotion which would often require scalarizing. However, widening can only be done when the bit-size of the vector element is a multiple of 8 (as we do not support any vectors made up of fractional byte elements).
There are instances where this approach produces worse code. This is exposed in some of the functions in the vsx.ll test case. To address one of those, this patch implements a DAG combine for a conversion of a v2i32 to v2f64 so that it remains a 2-instruction sequence. Similar DAG combines can later be implemented for other affected code patterns.
I've done some lightweight performance testing (the LNT tests on a quiet machine) and here are the results with benchmarks that take less than 5s to execute omitted (BEFORE_TIME is without this patch and AFTER_TIME is with this patch):
BENCHMARK_NAME BEFORE_TIME AFTER_TIME ABS_DIFF PCT_DIFF MultiSource/Applications/SPASS/Output/SPASS 6.050 6.049 -0.001 -0.02% MultiSource/Applications/JM/lencod/Output/lencod 5.102 5.138 0.036 0.70% MultiSource/Applications/lambda-0.1.3/Output/lambda 5.891 5.889 -0.002 -0.04% MultiSource/Applications/hexxagon/Output/hexxagon 13.814 13.808 -0.007 -0.05% MultiSource/Applications/lua/Output/lua 22.924 22.919 -0.004 -0.02% MultiSource/Benchmarks/SciMark2-C/Output/scimark2 72.514 72.508 -0.006 -0.01% MultiSource/Benchmarks/nbench/Output/nbench 19.489 19.189 -0.300 -1.54% MultiSource/Benchmarks/NPB-serial/is/Output/is 10.415 10.417 0.002 0.02% MultiSource/Benchmarks/ASC_Sequoia/IRSmk/Output/IRSmk 32.249 32.987 0.739 2.29% MultiSource/Benchmarks/ASC_Sequoia/AMGmk/Output/AMGmk 10.481 10.482 0.000 0.00% MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/Output/CrystalMk 7.537 7.537 -0.001 -0.01% MultiSource/Benchmarks/TSVC/Reductions-flt/Output/Reductions-flt 5.548 5.548 0.000 0.00% MultiSource/Benchmarks/TSVC/Searching-flt/Output/Searching-flt 6.452 6.452 -0.001 -0.01% MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/Output/GlobalDataFlow-dbl 6.823 6.825 0.002 0.03% MultiSource/Benchmarks/TSVC/Reductions-dbl/Output/Reductions-dbl 5.509 5.509 0.001 0.01% MultiSource/Benchmarks/TSVC/Searching-dbl/Output/Searching-dbl 6.253 6.253 0.000 0.00% MultiSource/Benchmarks/PAQ8p/Output/paq8p 110.282 73.201 -37.080 -33.62% MultiSource/Benchmarks/Bullet/Output/bullet 5.242 5.209 -0.033 -0.63% MultiSource/Benchmarks/7zip/Output/7zip-benchmark 7.853 7.842 -0.011 -0.14% MultiSource/Benchmarks/mafft/Output/pairlocalalign 26.034 26.056 0.022 0.08% SingleSource/Benchmarks/CoyoteBench/Output/almabench 30.836 30.827 -0.009 -0.03% SingleSource/Benchmarks/CoyoteBench/Output/huffbench 21.345 21.350 0.005 0.02% SingleSource/Benchmarks/Shootout/Output/sieve 5.268 5.267 -0.001 -0.01% SingleSource/Benchmarks/Shootout/Output/lists 7.607 7.605 -0.002 -0.03% SingleSource/Benchmarks/Shootout/Output/methcall 10.474 10.474 0.000 0.00% SingleSource/Benchmarks/Shootout-C++/Output/lists 11.707 11.733 0.026 0.22% SingleSource/Benchmarks/Shootout-C++/Output/methcall 11.177 11.175 -0.002 -0.02% SingleSource/Benchmarks/Misc/Output/flops 6.340 6.343 0.003 0.05% SingleSource/Benchmarks/Misc/Output/salsa20 6.292 6.291 -0.001 -0.01% SingleSource/Benchmarks/Misc/Output/ReedSolomon 6.903 6.905 0.002 0.03% SingleSource/Benchmarks/Misc-C++/Large/Output/sphereflake 6.154 6.154 -0.000 -0.00% SingleSource/Benchmarks/Misc-C++/Output/stepanov_v1p2 9.809 9.809 -0.000 -0.00% SingleSource/Benchmarks/Adobe-C++/Output/stepanov_abstraction 5.733 5.733 -0.000 -0.01% SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/Output/2mm 18.724 16.745 -1.980 -10.57% SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/Output/3mm 29.825 28.844 -0.981 -3.29% SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/Output/gemm 8.889 8.518 -0.371 -4.18% SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/Output/symm 13.260 14.198 0.937 7.07% SingleSource/Benchmarks/Linpack/Output/linpack-pc 13.215 13.214 -0.001 -0.01% SingleSource/Benchmarks/SmallPT/Output/smallpt 12.783 12.781 -0.001 -0.01% SingleSource/UnitTests/Vector/Altivec/Output/alti.expandfft 23.281 23.291 0.009 0.04% SingleSource/UnitTests/Vectorizer/Output/gcc-loops 7.511 7.511 0.000 0.00%
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10307 | This can be sunk further down, right before where we actually use it. | |
10308–10309 | If I am reading everything correctly you need to add some code here, to prevent an assertion. First of all the comment in include/llvm/CodeGen/ISDOpcodes.h above definition of EXTRACT_VECTOR_ELT says that index into the vector might be variable. Implementation of getConstantOperandVal uses a cast<ConstantSDNode> which "causing an assertion failure if it is not really an instance of the right type" according to http://llvm.org/docs/ProgrammersManual.html#the-isa-cast-and-dyn-cast-templates |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10308–10309 | Yes, that is a very good point. I forgot to add a check that a node from which I'm extracting a constant is actually a constant node :). Will update the patch now. |
I had forgotten to check whether the operands of the extractelement nodes are actually constants which can cause asserts. Added the checks now.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
10302 | If there are no further requests, I won't post another review for this, but the final patch will do away with these "isa" calls and replace them with dyn_cast calls and the corresponding getConstantOperandVal() calls will be replaced with getZExtValue() calls on the resulting ConstantSDNode's. |
As we discussed, before you commit the change, please add -verify-machineinstrs to your regression tests. No need to upload the patch again. Thanks.
Just a friendly reminder that it would be nice to get this patch into trunk before we branch off 3.9. Please review and comment accordingly.
I apologize for the delay. Some minor requests below, but otherwise, LGTM.
lib/Target/PowerPC/PPCISelLowering.h | ||
---|---|---|
141 | This name does not make it clear that you're converting to FP values. Let's name these SINT_VEC_TO_FP and UINT_VEC_TO_FP. Also note there that this is used for illegal vector types. | |
443 | Can you say something about why it is worse. N years from now someone will wonder whether this is still true ;) | |
lib/Target/PowerPC/PPCInstrVSX.td | ||
72 | The existing naming convention here seems to have lower-case latter parts. Not sure why, but we might as well follow it for now. def PPCsvec2fp : ... def PPCuvec2fp : ... |
Committed revision 274535.
lib/Target/PowerPC/PPCISelLowering.h | ||
---|---|---|
443 | Thank you. Will do (and I'll fix the typo - they're not "vectory" types :)). |
This name does not make it clear that you're converting to FP values. Let's name these SINT_VEC_TO_FP and UINT_VEC_TO_FP. Also note there that this is used for illegal vector types.