This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Avoid scalarization of S/UINT_TO_FP
ClosedPublic

Authored by jonpa on Mar 11 2020, 2:21 AM.

Details

Summary
  • Convert UINT_TO_FP DAG nodes with a wider result vector element type to first do a zero extend of the source vector:

The type legalizer will scalarize a 'v4f64 = uint_to_fp v4i16' as part of widening the illegally typed operand. By zero extending the source before type legalization, the node is instead split and kept as a vector operation.

Doing this before type legalization seemed like a simple solution that will handle any source vectors, such as v4i16, without having to worry about things like widening. The type legalizer seems to scalarize the uint_to_fp if the input/output vector type is not legal.

I thought about having a check here for a scalarized source operand (e.g. urem), in which case it might be less desired to first insert the source elements in order to do a vector convert. There was however no cases in benchmarks where also any user of the uint_to_fp was scalarized, so it seemed that this consideration can be ignored. This was a benchmark that first made me try this:

f470.lbm
cdlfbr         :                    4                    0       -4
vuplhh         :                    0                    2       +2
vuplhf         :                    0                    2       +2
vcdlgb         :                    0                    2       +2
vmrhg          :                   16                   14       -2
vlvgh          :                    0                    2       +2
vlvgb          :                    0                    2       +2
vuplhb         :                    0                    1       +1

The difference here was that a vector convert with a scalarized source, first vector elements are loaded and then unpacked. In that case it seems possibly better to do scalar conversions, since they support uint32 -> double and directly store into the right vector register: 2 x scalar conversions + 1 vmrhg. This however did not matter for performance and it was just one or two cases...

This is the check that was needed (and removed):

+bool SystemZTargetLowering::willBeScalarized(SDValue Op) const {
+  if (Op->isUndef())
+    return false;
+  if (Op->getOpcode() == ISD::VECTOR_SHUFFLE)
+    return willBeScalarized(Op->getOperand(0)) ||
+           willBeScalarized(Op->getOperand(1));
+
+  unsigned ScalarBits = Op.getValueType().getScalarSizeInBits();
+  MVT CheckVT = MVT::getVectorVT(MVT::getIntegerVT(ScalarBits),
+                                 SystemZ::VectorBits / ScalarBits);
+  return (getOperationAction(Op->getOpcode(), CheckVT) == Expand);
+}

 SDValue SystemZTargetLowering::combineUINT_TO_FP(
     SDNode *N, DAGCombinerInfo &DCI) const {

+  // Don't do this if Op is known to be scalarized.
+  if (willBeScalarized(Op))
+    return SDValue();
+

At first I used a check to only do this when the vector conversion was supported (v2f64 on z14 and also v4f32 on z15), but then I figured that it is still better to do the vector zero extend instead of extracting and then extending each element:

f510.parest_r
vlgvh          :                    0                   20      +20
vlgvf          :                  540                  520      -20
llhr           :                   90                  110      +20
vuplhh         :                   17                   12       -5
  • Use a vllez + vle to load, shuffle and zero extend a vector with two elements in just two instructions:

I first tried to create the vllez with existing patterns by building a load and insert_vector_elt, but:

 t23: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t2, undef:i64
 t22: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t16, undef:i64
     t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
            t31: v8i16 = insert_vector_elt t29, t23, Constant:i32<3>
          t32: v8i16 = insert_vector_elt t31, t22, Constant:i32<0>

=> canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT() =>

      t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
             t34: v8i16 = insert_vector_elt t29, t22, Constant:i32<0>
           t35: v8i16 = insert_vector_elt t34, t23, Constant:i32<3>

, which means that the insert-pattern for vllezh does not work.

Therefore a new SystemZISD::VLLEZ node seemed needed. The previous pattern for vllez has been renamed, and the new node SystemZISD::VLLEZ matches z_vllezi[8 - 64].

Tried also i32 -> i64, but this only changed one file (did not look like an improvement), and it involved extra searching code through a second VECTOR_SHUFFLE in tryVLLEZ(), so skipped.

Haven't tried v4i8 -> v4i32, since that would involve doing four loads instead of one, which seems a little worrisome.

I saw just a few cases on SPEC 06, where multiple LAYs were generated instead of one (13 bit displacement added twice, before both vllez and vle). It would have been better to do just one LAY and then use an immediate displacement in the vle. This seems like an existing problem which involves reg pressure consideration. Would it be worth trying a clean-up pass pre-RA that tried to minimize the number of LAYs? It could check if the base-address was killed at the first LAY and in that case reuse that LAYs result in the second one.

  • Results

Only Imagick affected performance wise among benchmarks.

  • combineUINT_TO_FP() gives 15% improvement
  • tryVLLEZ() gives another 5-7% improvement.

Total improvement: 20-22%

imagick/morphology.s/MorphologyApply which is hot now uses 68 x (vllezh; vleh; vcdlgb).

(GCC does not do any vllez/vcdlgb, and nearly all uint to fp conversions are scalar, but it is still fast).

Diff Detail

Event Timeline

jonpa created this revision.Mar 11 2020, 2:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2020, 2:21 AM

Not yet looking at the implementation details, but a couple of comments on the overall approach:

  • The UINT_TO_FP changes look good to me. The only question I have is whether we should do the same for SINT_TO_FP? (And in either case, this can probably be committed as a separate patch, apart from the VLLEZ changes.)
  • I agree that using VLLEZ to implement the zero-extend is a good idea. I'm wondering about the implementation however; it seems odd to combine two quite different manners of emitting that instruction.

First of all, the "canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT()" problem you mention seems to be a problem in general; the code with the two element-inserts could easily arise just from normal middle-end action (not just that new back-end code you added), and we really ought to be able to handle such code in general. So I think we should try to do that, either by overriding the DAGCombine with our own rule (if that's possible), or possibly handling this as a SystemZDAGToDAGISel::Select rule? In any case, I'm sceptical about the new DAG opcode -- it's always better to avoid those for semantics that *can* be described with standard opcodes, since custom opcodes are opaque to all further analysis by the middle-end ...

The other question is how to handle the zero-extend expansion. Logically, a vector zero-extend is fully equivalent to a shuffle intermixing bytes from the original vector with a zero vector. So I think the back-end ought to handle both cases in the same way. Now, we have logic that will intelligently handle shuffles (the whole GeneralShuffle logic). From what I can see, this never emits an UNPACK -- however, it should in theory be able to emit a MERGE, which would later (if one of the inputs is a zero vector) be transformed into an UNPACK via SystemZTargetLowering::combineMERGE.

Given that, maybe the optimal solution to catch all cases would be:

  • first, expand vector zero-extend into a shuffle (always)
  • second, make sure that the GeneralShuffle logic (and/or combineMERGE) detects the case where an optimal solution is via element inserts into a zero vector (and falls back to UNPACK otherwise)
  • finally (as said above), make sure that element inserts into a zero vector get implemented via VLLEZ in all cases where this is possible
jonpa updated this revision to Diff 250353.Mar 14 2020, 3:48 AM
  • The UINT_TO_FP changes look good to me. The only question I have is whether we should do the same for SINT_TO_FP? (And in either case, this can probably be committed as a separate patch, apart from the VLLEZ changes.)

Yes, that looks also like a good idea. Patch updated to only do INT_TO_FP conversions - VLLEZ part will be posted separately.

jonpa retitled this revision from [SystemZ] Avoid scalarization of UINT_TO_FP + improve shuffled zext:ing loads. to [SystemZ] Avoid scalarization of S/UINT_TO_FP.Mar 14 2020, 3:49 AM
uweigand accepted this revision.Mar 16 2020, 2:27 AM

LGTM, thanks!

This revision is now accepted and ready to land.Mar 16 2020, 2:27 AM
This revision was automatically updated to reflect the committed changes.