This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] Enable *SHRN* for tail predication
ClosedPublic

Authored by samparker on Mar 5 2020, 2:00 AM.

Details

Summary

These instructions don't swap lanes so make them valid.

Diff Detail

Event Timeline

samparker created this revision. Mar 5 2020, 2:00 AM
Herald added a project: Restricted Project. Mar 5 2020, 2:00 AM
SjoerdMeijer accepted this revision. Mar 5 2020, 2:31 AM

Agreed, they don't swap lanes. They can write to only the bottom/top halves, but that's fine. So this looks like a straightforward change to me.

This revision is now accepted and ready to land. Mar 5 2020, 2:31 AM
This revision was automatically updated to reflect the committed changes.

Hey. Can you explain what makes an instruction validForTailPredication? I think I've lost track. And what do you mean by "swap lanes" in this case?

We're allowing instructions which produce a vector and where each output lane depends only on the same lane of the input register(s). So when I say 'swap', I should say 'exchange'.
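To make the lane-wise condition concrete, here is a minimal C sketch of a deliberately simplified model, not LLVM or hardware code. It assumes that a tail-predicated instruction writes only its active lanes and leaves the others holding whatever was there before. The names `add_one`, `swap_pairs`, `lane0_after` and `LANES` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

#define LANES 8 /* eight 16-bit lanes, as in a v8i16 */

/* Lane-wise op: output lane i depends only on input lane i.
 * Lanes at index >= active are not written (simplified predication model). */
static void add_one(int16_t dst[LANES], const int16_t src[LANES], int active) {
    for (int i = 0; i < LANES && i < active; ++i)
        dst[i] = (int16_t)(src[i] + 1);
}

/* Lane-exchanging op: output lane i reads input lane i ^ 1. */
static void swap_pairs(int16_t dst[LANES], const int16_t src[LANES], int active) {
    for (int i = 0; i < LANES && i < active; ++i)
        dst[i] = src[i ^ 1];
}

/* Lane 0 after add_one followed by either a second add_one or swap_pairs.
 * active == LANES models the original loop (unpredicated arithmetic);
 * active == 1 models the tail-predicated loop with one element left.
 * `stale` is the assumed previous content of the registers. */
static int16_t lane0_after(int exchange, int active, int16_t stale) {
    /* One live element; the predicated load zeroes the other lanes. */
    int16_t x[LANES] = {10, 0, 0, 0, 0, 0, 0, 0};
    int16_t tmp[LANES], out[LANES];
    for (int i = 0; i < LANES; ++i) {
        tmp[i] = stale;
        out[i] = stale;
    }
    add_one(tmp, x, active);
    if (exchange)
        swap_pairs(out, tmp, active);
    else
        add_one(out, tmp, active);
    return out[0];
}
```

In this model, `lane0_after(0, LANES, -1)` and `lane0_after(0, 1, -1)` agree, so predicating the lane-wise chain does not change the live lane. `lane0_after(1, LANES, -1)` and `lane0_after(1, 1, -1)` differ, because the exchange pulls a lane that the predicated `add_one` never wrote.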

OK. If I am understanding correctly, I'm not sure that is enough. What if we load a v8i16, extend that into two v4i32s using something like a VMULL, then narrow that back into a single v8i16? I don't think this is something that autovec will produce (yet), but it could come up from intrinsics in a way that people are likely to write. Something like this:

#include <arm_mve.h>
void test(short *x, short *y, short *z, int n) {
  while(n > 0) {
    mve_pred16_t pred = vctp16q(n);
    int16x8_t a = vldrhq_z_s16(x, pred);
    int16x8_t b = vldrhq_z_s16(y, pred);
    int32x4_t top = vmulltq_int(a, b);
    int32x4_t bot = vmullbq_int(a, b);
    int16x8_t rtop = vqshrnbq(vuninitializedq_s16(), bot, 16);
    int16x8_t rbot = vqshrntq(rtop, top, 16);
    vstrhq_p_s16(z, rbot, pred);

    x += 8;
    y += 8;
    z += 8;
    n -= 8;
  }
}

I'm pretty sure that tail predicating this would not be valid, as the top bits of one of the multiplies could be cut off.
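The concern can be sketched with a small scalar model in C. This is an illustration under stated assumptions, not a description of the LLVM pass: it assumes predication is applied per byte of the destination, that a 16-bit loop predicate with `n` elements left enables the first `2 * n` bytes, and that disabled bytes keep a stale value. The helper names (`byte_mask_for_lane32`, `predicated_write32`, `narrowed_lane0`) are hypothetical, and the inputs are kept non-negative so unsigned arithmetic is enough.

```c
#include <stdint.h>

/* Byte-enable mask for 32-bit lane `lane32` when the loop predicate is
 * vctp16-style: the first n 16-bit elements (2 * n bytes) are active. */
static unsigned byte_mask_for_lane32(int n, int lane32) {
    unsigned mask = 0;
    for (int byte = 0; byte < 4; ++byte) {
        int elem16 = (lane32 * 4 + byte) / 2; /* 16-bit element owning this byte */
        if (elem16 < n)
            mask |= 1u << byte;
    }
    return mask;
}

/* Write `result` over `old`, but only in the enabled bytes. */
static uint32_t predicated_write32(uint32_t old, uint32_t result, unsigned mask) {
    uint32_t out = old;
    for (int byte = 0; byte < 4; ++byte) {
        if (mask & (1u << byte)) {
            uint32_t bits = (uint32_t)0xFFu << (8 * byte);
            out = (out & ~bits) | (result & bits);
        }
    }
    return out;
}

/* Lane 0 of the narrowed result: widening multiply into a 32-bit lane,
 * then keep the top 16 bits (the effect of a right shift by 16).
 * `stale` is the assumed previous content of the 32-bit lane. */
static unsigned narrowed_lane0(uint16_t a, uint16_t b, int n, uint32_t stale) {
    uint32_t product = (uint32_t)a * b;
    uint32_t lane = predicated_write32(stale, product, byte_mask_for_lane32(n, 0));
    return lane >> 16;
}
```

With `a = 300` and `b = 400` the product is 0x1D4C0, so the narrowed lane should be 1. In this model `narrowed_lane0(300, 400, 8, 0x7FFF0000u)` gives that value, while `narrowed_lane0(300, 400, 1, 0x7FFF0000u)` returns the stale upper half instead, because only the low two bytes of the 32-bit product were written.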

Yes, the problem there is that the number of lanes isn't the same throughout the loop, not necessarily that we're using a narrowing operation. Do you know if there's a nice way to query operand/result types at the MI level?

Not sure. I think they are just register types at the instruction level. I thought that was why we excluded many instructions (like vmull and vshrn), because they all change the types, and that changing of the types probably means that the tail predication might not be valid.

From IR I think that would be <8 x i16> sext to <8 x i32>, so the number of lanes would be the same, but the types (and the lanes they are computed in) change.

PS. I noticed VFMA.f32 isn't marked as valid. I think that's one that should certainly be OK.