- Convert UINT_TO_FP DAG nodes whose result vector has a wider element type than the source by first zero extending the source vector:
The type legalizer will scalarize a 'v4f64 = uint_to_fp v4i16' as part of widening the illegally typed operand. By zero extending the source to the result's element width before type legalization, the node is instead split and kept as a vector operation.
Doing this before type legalization seemed like a simple solution that handles any source vector type, such as v4i16, without having to worry about things like widening: the type legalizer scalarizes the uint_to_fp whenever the input or output vector type is not legal.
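A minimal sketch of what such a combine can look like (my own simplification, assuming the usual SystemZ combine hooks; this is not the verbatim patch):

// Within SystemZISelLowering.cpp (sketch).
SDValue SystemZTargetLowering::combineUINT_TO_FP(
    SDNode *N, DAGCombinerInfo &DCI) const {
  // Only useful before type legalization; afterwards the operation has
  // already been scalarized.
  if (!DCI.isBeforeLegalize())
    return SDValue();
  SelectionDAG &DAG = DCI.DAG;
  SDValue Op = N->getOperand(0);
  EVT OutVT = N->getValueType(0);
  EVT InVT = Op.getValueType();
  if (!OutVT.isVector() || OutVT.getScalarSizeInBits() > 64 ||
      InVT.getScalarSizeInBits() >= OutVT.getScalarSizeInBits())
    return SDValue();
  // E.g. v4f64 = uint_to_fp v4i16
  //  =>  v4f64 = uint_to_fp (v4i64 zero_extend v4i16)
  EVT ExtVT = OutVT.changeVectorElementTypeToInteger();
  SDValue Ext = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), ExtVT, Op);
  return DAG.getNode(ISD::UINT_TO_FP, SDLoc(N), OutVT, Ext);
}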
I thought about adding a check here for a scalarized source operand (e.g. a urem), in which case it might be less desirable to first insert the source elements into a vector in order to do a vector convert. There were however no cases in the benchmarks where any user of the uint_to_fp was also scalarized, so it seemed that this consideration can be ignored. This is the benchmark that first made me try it:
f470.lbm
cdlfbr :     4     0    -4
vuplhh :     0     2    +2
vuplhf :     0     2    +2
vcdlgb :     0     2    +2
vmrhg  :    16    14    -2
vlvgh  :     0     2    +2
vlvgb  :     0     2    +2
vuplhb :     0     1    +1
The difference here was that for a vector convert with a scalarized source, the vector elements are first loaded and then unpacked. In that case it seems possibly better to do scalar conversions instead, since they support uint32 -> double and store directly into the right vector register: 2 x scalar conversions + 1 vmrhg. This however did not matter for performance, and it was just one or two cases...
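For illustration, a reduced source example of the kind of case meant here (my own construction, not taken from f470.lbm): the urem has no vector equivalent on SystemZ, so its results are scalar to begin with, and inserting them into a vector just to use vcdlgb competes with two scalar conversions plus a vmrhg.

void convert(unsigned *Src, unsigned M, double *Dst) {
  // The urem is scalarized; its results feed a uint_to_fp.
  for (int i = 0; i < 2; ++i)
    Dst[i] = Src[i] % M;
}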
This is the check that was needed (and removed):
+bool SystemZTargetLowering::willBeScalarized(SDValue Op) const {
+  if (Op->isUndef())
+    return false;
+  if (Op->getOpcode() == ISD::VECTOR_SHUFFLE)
+    return willBeScalarized(Op->getOperand(0)) ||
+           willBeScalarized(Op->getOperand(1));
+
+  unsigned ScalarBits = Op.getValueType().getScalarSizeInBits();
+  MVT CheckVT = MVT::getVectorVT(MVT::getIntegerVT(ScalarBits),
+                                 SystemZ::VectorBits / ScalarBits);
+  return (getOperationAction(Op->getOpcode(), CheckVT) == Expand);
+}

 SDValue SystemZTargetLowering::combineUINT_TO_FP(
     SDNode *N, DAGCombinerInfo &DCI) const {
+  // Don't do this if Op is known to be scalarized.
+  if (willBeScalarized(Op))
+    return SDValue();
+
At first I used a check to only do this when the vector conversion is supported (v2f64 on z14 and also v4f32 on z15), but then I figured that it is still better to do the vector zero extend rather than extracting and then extending each element:
f510.parest_r
vlgvh  :     0    20   +20
vlgvf  :   540   520   -20
llhr   :    90   110   +20
vuplhh :    17    12    -5
- Use vllez + vle to load, shuffle and zero extend a two-element vector in just two instructions:
I first tried to create the vllez with existing patterns by building a load and insert_vector_elt, but:
  t23: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t2, undef:i64
  t22: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t16, undef:i64
  t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
  t31: v8i16 = insert_vector_elt t29, t23, Constant:i32<3>
  t32: v8i16 = insert_vector_elt t31, t22, Constant:i32<0>

=> canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT() =>

  t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
  t34: v8i16 = insert_vector_elt t29, t22, Constant:i32<0>
  t35: v8i16 = insert_vector_elt t34, t23, Constant:i32<3>
The canonicalization reorders the inserts so that the insert at element 3 no longer operates directly on the all-zeros vector, which means the insert pattern for vllezh does not match.
Therefore a new SystemZISD::VLLEZ node seemed needed. The previous pattern for vllez has been renamed, and the new node SystemZISD::VLLEZ is matched by z_vllezi[8-64].
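To give an idea of the kind of matching involved, here is a hedged sketch of a tryVLLEZ()-style combine over the canonicalized DAG shown above. The checks, the operand given to the new node, and the omission of chain/memory-operand handling are all my assumptions, not the actual patch:

// Within the SystemZ backend (sketch). Match the canonicalized form:
//   inner: insert_vector_elt (zeros), (load B), 0
//   outer: insert_vector_elt inner,   (load A), 3    (for v8i16)
// and emit SystemZISD::VLLEZ for load A, keeping the element-0 insert so
// that it still selects to a vle.
static SDValue tryVLLEZ(SDNode *N, SelectionDAG &DAG) {
  SDValue Outer(N, 0);
  EVT VT = Outer.getValueType();
  if (Outer.getOpcode() != ISD::INSERT_VECTOR_ELT || !VT.isVector())
    return SDValue();

  // The outer insert must put a load into the element vllez writes, i.e.
  // the rightmost element of the left half (element 3 for v8i16).
  unsigned MidElt = VT.getVectorNumElements() / 2 - 1;
  auto *OuterIdx = dyn_cast<ConstantSDNode>(Outer.getOperand(2));
  SDValue MidLd = Outer.getOperand(1);
  if (!OuterIdx || OuterIdx->getZExtValue() != MidElt ||
      MidLd.getOpcode() != ISD::LOAD)
    return SDValue();

  // The inner insert must put a second load into element 0 of all-zeros.
  SDValue Inner = Outer.getOperand(0);
  if (Inner.getOpcode() != ISD::INSERT_VECTOR_ELT || !Inner.hasOneUse() ||
      !ISD::isBuildVectorAllZeros(Inner.getOperand(0).getNode()) ||
      !isNullConstant(Inner.getOperand(2)) ||
      Inner.getOperand(1).getOpcode() != ISD::LOAD)
    return SDValue();

  SDLoc DL(N);
  SDValue Vllez = DAG.getNode(SystemZISD::VLLEZ, DL, VT, MidLd);
  return DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, Vllez,
                     Inner.getOperand(1), Inner.getOperand(2));
}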
I also tried i32 -> i64, but this only changed one file (it did not look like an improvement), and it involved extra code for searching through a second VECTOR_SHUFFLE in tryVLLEZ(), so I skipped it.
Haven't tried v4i8 -> v4i32, since that would involve doing four loads instead of one, which seems a little worrisome.
I saw just a few cases on SPEC 2006 where multiple LAYs were generated instead of one (a 13-bit displacement added twice, before both the vllez and the vle). It would have been better to do just one LAY and then use an immediate displacement in the vle. This seems to be an existing problem which involves register pressure considerations. Would it be worth trying a pre-RA clean-up pass that tries to minimize the number of LAYs? It could check if the base address is killed at the first LAY and in that case reuse that LAY's result in the second one, as sketched below.
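A very rough sketch of the core of such a clean-up. Everything here is hypothetical: the helper name, the assumed LAY operand layout (def, base, disp, index), and the kill-flag based check:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// If two LAYs add displacements to the same base register, and that base
// dies at the first LAY, re-base the second LAY on the first one's result.
// The remaining small delta could then be folded into the vle's immediate
// displacement by later peepholes.
static bool rebaseSecondLAY(MachineInstr &First, MachineInstr &Second,
                            unsigned LAYOpcode) {
  if (First.getOpcode() != LAYOpcode || Second.getOpcode() != LAYOpcode)
    return false;
  // Same base register, no index register on either LAY.
  if (First.getOperand(1).getReg() != Second.getOperand(1).getReg() ||
      First.getOperand(3).getReg() != Register() ||
      Second.getOperand(3).getReg() != Register())
    return false;
  // Only when the shared base dies at the first LAY, so that reusing its
  // result does not extend any live range.
  if (!First.getOperand(1).isKill())
    return false;
  int64_t Delta =
      Second.getOperand(2).getImm() - First.getOperand(2).getImm();
  if (!isInt<20>(Delta)) // LAY has a 20-bit signed displacement.
    return false;
  Second.getOperand(1).setReg(First.getOperand(0).getReg());
  Second.getOperand(2).setImm(Delta);
  return true;
}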
- Results
Among the benchmarks, only imagick was affected performance-wise.
- combineUINT_TO_FP() gives a 15% improvement.
- tryVLLEZ() gives another 5-7% improvement.
Total improvement: 20-22%.
MorphologyApply in imagick/morphology.s, which is hot, now uses 68 x (vllezh; vleh; vcdlgb).
(GCC does not emit any vllez/vcdlgb, and nearly all of its uint-to-fp conversions are scalar, but it is still fast.)