This is an archive of the discontinued LLVM Phabricator instance.

[ARM] MVE VCVT lowering for f32->f16 truncs
ClosedPublic

Authored by dmgreen on Jun 4 2020, 3:48 AM.

Details

Summary

This adds code to lower f32 to f16 fp_trunc's using a pair of MVE VCVT instructions. Due to v4f16 not being legal, fp_round are often split up fairly early. So this reconstructs the vcvt's from a buildvector of fp_rounds from two vector inputs. Something like:

BUILDVECTOR(FP_ROUND(EXTRACT_ELT(X, 0),
           FP_ROUND(EXTRACT_ELT(Y, 0),
           FP_ROUND(EXTRACT_ELT(X, 1),
           FP_ROUND(EXTRACT_ELT(Y, 1), ...)

It adds a VCVTN node to handle this, which like VMOVN or VQMOVN lowers into the top/bottom lanes of an MVE instruction.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 4 2020, 3:48 AM

Due to v4f16 not being legal

This is a MVE-specific thing? hmm.

Dealing with mixed types is a recurring problem with target-independent vector handling; we should probably try to extend the approach currently used by SIGN_EXTEND_VECTOR_INREG to other cast opcodes, so we don't have to keep repeating exactly the same hacks for every target with vector registers.

Due to v4f16 not being legal

This is a MVE-specific thing? hmm.

Dealing with mixed types is a recurring problem with target-independent vector handling; we should probably try to extend the approach currently used by SIGN_EXTEND_VECTOR_INREG to other cast opcodes, so we don't have to keep repeating exactly the same hacks for every target with vector registers.

Can you explain what kind of thing you are thinking of? A <4 x half> under MVE is very different to a <4 x i16> under MVE, as we currently do type promotion. And the "bottom half" of a v8i/f16 register is always a bit difficult to deal with.

If you have a better idea, I'm interested to hear it. I was currently planning to add something (probably pre-isel so it can look across BB boundaries) that took blobs of vector instructions surrounded by sext/zext/trunc (and now fpext/fptrunc) and turned them into something more MVE-y, by inserting shuffles like the tests above to allow us to use T/B instructions.

So I was thinking this was an MVE special.

x86 has similar issues with SSE2 floating-point. I think maybe some other targets have related issues.

The key function that's making everything a complete mess is DAGTypeLegalizer::WidenVecRes_Convert. Ideally, it would have some target-independent vector operation it could generate (compare to DAGTypeLegalizer::WidenVecOp_EXTEND). But barring that, it probably makes sense to custom-legalize it. Pattern-matching the build_vector isn't broken, exactly, but it involves a lot of DAG nodes.

Hmm. shuffle_trunc1 comes from something like this:

    t17: v4f32 = vector_shuffle<0,4,1,5> t3, t6
    t18: v4f32 = vector_shuffle<2,6,3,7> t3, t6
  t19: v8f32 = concat_vectors t17, t18
t12: v8f16 = fp_round t19, TargetConstant:i32<0>

Which then gets split into two halves because of the v8f32 and the two halves look like this in WidenVecRes_Convert:

  t17: v4f32 = vector_shuffle<0,4,1,5> t3, t6
t20: v4f16 = fp_round t17, TargetConstant:i32<0>

Which need to be concat back together. The two halves of the BUILD_VECTOR are combined and that is what we end up lowering.

shuffle_trunc3 is that but twice as wide, starting with:

    t32: v8f32 = vector_shuffle<0,8,1,9,2,10,3,11> t13, t14
    t33: v8f32 = vector_shuffle<4,12,5,13,6,14,7,15> t13, t14
  t34: v16f32 = concat_vectors t32, t33
t20: v16f16 = fp_round t34, TargetConstant:i32<0>

We end up with 4 BuildVectors that are combined back together into 2.

shuffle_trunc5 is this before we start legalizing the types:

    t8: v4f16 = fp_round t3, TargetConstant:i32<0>
  t11: v8f16 = concat_vectors t8, undef:v4f16
    t9: v4f16 = fp_round t6, TargetConstant:i32<0>
  t12: v8f16 = concat_vectors t9, undef:v4f16
t13: v8f16 = vector_shuffle<0,8,1,9,2,10,3,11> t11, t12

So it's hard to see the shuffle from the fp_round. Again though, it creates BuildVectors, BuildVectors simplify, we lower from the BuildVectors. Perhaps that's a bit of a stranger case with the v4f16 vectors. But unfortunately they are likely to come up from the vectorizer at the moment.

shuffle_trunc7 is the same thing but double the width:

      t13: v8f32 = concat_vectors t3, t6
    t16: v8f16 = fp_round t13, TargetConstant:i32<0>
  t19: v16f16 = concat_vectors t16, undef:v8f16
      t14: v8f32 = concat_vectors t9, t12
    t17: v8f16 = fp_round t14, TargetConstant:i32<0>
  t20: v16f16 = concat_vectors t17, undef:v8f16
t21: v16f16 = vector_shuffle<0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23> t19, t20

I guess I'm still having trouble seeing what we would reliably latch onto here.

I do have some old code that was using a dagcombine on a fptrunc(shufflevector), but that didn't handle all these cases and doing this from a buildvector seemed much simpler. It is the way that we lower all shuffles in the arm backend (like vext and vmovn) after all. The only difference here is that we have an fptrunc in the mix too.

efriedma accepted this revision.Jun 4 2020, 4:46 PM

LGTM

I wasn't really paying attention to how the interleaving works... I guess this is probably the least terrible solution here. (I still think we should make some target-independent enhancements in this area, but maybe there isn't enough common ground in this particular case.)

This revision is now accepted and ready to land.Jun 4 2020, 4:46 PM

Thanks. I can agree with "least terrible" :)

I have another patch to do the same thing with FP_EXT. I'll put that up soon and we can see whether that one is OK too.

This revision was automatically updated to reflect the committed changes.