This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Extra MVE VMLAV reduction patterns
Closed, Public

Authored by dmgreen on May 25 2020, 9:15 AM.

Details

Summary

These patterns for i8 and i16 VMLAV reductions were missing. They arise from legalized vector.reduce.add.v8i16 and vector.reduce.add.v16i8, and although the instruction works differently (the mul and add are performed at a higher precision), I believe it is OK because only an i8/i16 is demanded from them, so the results will be the same. At least, they pass any testing I can think of to run on them.
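
As a point of reference, here is a minimal IR sketch (function name hypothetical) of the kind of reduction these patterns are assumed to cover:

; An i8 multiply-accumulate reduction using vector.reduce.add.v16i8; after
; legalization this is the shape the new patterns are intended to match.
; Only the low 8 bits of the accumulation are demanded, so the instruction's
; wider internal precision gives the same result.
define i8 @reduce_mla_v16i8(<16 x i8> %a, <16 x i8> %b) {
  %m = mul <16 x i8> %a, %b
  %r = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> %m)
  ret i8 %r
}
declare i8 @llvm.vector.reduce.add.v16i8(<16 x i8>)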

There are some tests that end up looking worse, but they are quite artificial, passing half-width vector types through a call boundary. I would not expect the vmull to realistically come up like that, and a vmlava is likely the better choice a lot of the time.

Event Timeline

dmgreen created this revision. May 25 2020, 9:15 AM

> There are some tests that end up looking worse, but they are quite artificial, passing half-width vector types through a call boundary. I would not expect the vmull to realistically come up like that, and a vmlava is likely the better choice a lot of the time.

Looking at the tests, I think the key distinction in the cases that get "worse" is that the sign/zero-extend can be folded into the multiply. It's not really related to the calling convention. That said, I'm not sure how likely that is to come up in practice... I guess if it's produced by a load, we can sign/zero-extend for free?

Yes. Specifically, it needs to be sign-extended from something that already places the lanes in the correct positions. MVE doesn't have normal sign-extend instructions like NEON does (from the bottom 8 lanes of a v16i8 to a v8i16, for example). It can only use top/bottom vmovls, which need the lanes to be in the correct place. An <8 x i8> passed through a call boundary is actually (apparently) a 128-bit vector with widened lanes, hence my comment about the calling convention. Otherwise the extend wouldn't match and we wouldn't produce a vmull anyway. A vmovlb is really a:

; take the even (bottom) lanes of the v16i8 and sign-extend them to i16
%s = shufflevector <16 x i8> %src, <16 x i8> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
%ext = sext <8 x i8> %s to <8 x i16>

Until we do lane interleaving (which is in the works, but which we don't do yet), I wouldn't expect this to come up in practice. Like you said, a load/store will likely do the extend for free in most code.
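
For concreteness, here is a sketch (function name hypothetical) of roughly the shape discussed above: <8 x i8> arguments passed through a call boundary, sign-extended and multiplied before the reduction, which is where the extends could previously fold into a vmull:

; The <8 x i8> arguments arrive through the call boundary in a form where the
; sign-extends can match, so this could previously select a vmull; with the
; new patterns it may instead become a vmlav-style reduction.
define i16 @reduce_mla_v8i8_args(<8 x i8> %a, <8 x i8> %b) {
  %ea = sext <8 x i8> %a to <8 x i16>
  %eb = sext <8 x i8> %b to <8 x i16>
  %m = mul <8 x i16> %ea, %eb
  %r = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %m)
  ret i16 %r
}
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>)

If the i8 data instead comes from a load, the extend will likely be done for free by the load, which is why this shape is not expected to be common in practice.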

efriedma accepted this revision. May 26 2020, 11:51 AM

We can always add more specific patterns if it matters, I guess.

LGTM

This revision is now accepted and ready to land. May 26 2020, 11:51 AM
This revision was automatically updated to reflect the committed changes.