This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] Only tail-fold integer add reductions
ClosedPublic

Authored by SjoerdMeijer on Jul 1 2020, 5:18 AM.

Details

Summary

If a vector body has live-out values, it is probably a reduction, which needs a final reduction step after the loop. MVE has a VADDV instruction to reduce integer vectors, but doesn't have an equivalent one for float vectors. A live-out value that is not recognised as reduction later in the optimisation pipeline will result in the tail-predicated loop to be reverted to a non-predicated loop and this is very expensive, i.e. it has a significant performance impact, which is what we hope to avoid with fine tuning the ARM TTI hook preferPredicateOverEpilogue implementation.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Jul 1 2020, 5:18 AM

Hello. I agree with this, but I would suggest we take it further.

There are multiple forms of reductions that the vectorizer can produce. It's at least:
int add, int mul
float add, float mul
and, or, xor
int smin, smax, umin, umax
float min & max.

I think the ARMLowOverheadLoops will only currently handle integer add's. Considering how powerful the vectorizers handling of reductions is, and how easy (and expensive) it would be for the ARMLowOverheadLoops pass to get that wrong, I would suggest just disabling _all_ reductions at the moment.

Thanks Dave, those are good points.
ARMLowOverheadLoops indeed currently handle only integer add's, so I will have a go at further restricting the int operations.

SjoerdMeijer retitled this revision from [ARM][MVE] Don't tail-fold float reductions to [ARM][MVE] Only tail-fold integer add reductions.

Now only integer add reductions are allowed, all other operations (including floats) are rejected.

dmgreen added inline comments.Jul 2 2020, 12:42 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

I took a look at the reduction code in ARMLowOverheadLoops. The vectorizer can create add reductions in a number of places and in a number of ways (it can have multiple reductions, multiple reduction steps, look through phi's and selects and even change the type to a smaller bitwidth). I don't think the backend pass can handle all of them yet.

Considering how expensive reverting can be in comparison, what do you think of just disabling all reductions for the time being and re-enabling them once we know that things are working well enough? Maybe add a FIXME for it? I would expect D75069 to handle integer reduction in most cases eventually from the vectorizer directly and we can probably do something similar for other reduction types.

samparker added inline comments.Jul 2 2020, 1:12 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

In my testing, I couldn't get the vectorizer to produce a tail-folded loops with multiple reductions, so I don't think we have to worry about that at the moment. My basic example also didn't even vectorize a FP reduction loop. I think disabling all reductions would be too conservative now, we should have done it from the beginning but, that now we actually can handle something, I think we should try to have this cost function align with LowOverheadLoops implementation.

SjoerdMeijer marked an inline comment as done.Jul 2 2020, 1:18 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

If we really don't want tail-folding, I think that should be controlled in other way than restricting this hook to do nothing. We have option -prefer-predicate-over-epilog, and that should do that for now. If source-code modifications and more fine grained control is possible, we even have a pragma.

Like Sam said, I would like to start with integer add reduction, so that we are also testing that part in the back-end.

SjoerdMeijer marked an inline comment as done.Jul 2 2020, 1:25 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

I accidentally hit Submit too soon, but wanted to add a bit more nuance to my previous comment: "this hook to do nothing" -> "this hook to do almost nothing".

I also wanted to add that in general matching up the behaviour here with the backend is probably best, and also that I could look into disallowing multiple reductions, but looks like we don't really have worry about that.

And if there are some inefficiencies around integer add reductions, I guess that's fine at this stage (tail-folding can be disabled, see previous comment), and I think it's good to have these inefficiencies explicit and visible.

samparker added inline comments.Jul 2 2020, 2:18 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

If we go for matching the limitations of LowOverheadLoops (of which there are many!) I think we want to check that:

  • there's only one liveout.
  • it's an add.
  • and that add isn't using an add (because this breaks the pitifully weak pattern matching, see one_loop_add_add_v16i8 in the reductions.ll test.

Half related to reductions, I'd also prevent if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated and we can't handle any VPSEL in the loop either.

dmgreen added inline comments.Jul 2 2020, 2:52 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

Hmm. OK. I sent you some examples of the tests I had that were going wrong. Hopefully they will help.

I would go with conservative for the time being to make turning on TailPred by default easier. We can always change it later, Plus D75069 will change the way integer reductions are generated anyway! I would think the backend pass is more useful for other reduction in the long run - but even then, like we talked about, it might be better to do things earlier.

But if you guys think that you can get it to work - then sounds good. I'll leave it to you. I personally wouldn't spend a lot of time on it though, if the goal is in the end to do it another way.

SjoerdMeijer marked an inline comment as done.Jul 2 2020, 3:08 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1427

That would make most sense to me.

  • there's only one liveout.
  • it's an add.

Check. (well, almost, need to restrict it to 1 live-out).

  • and that add isn't using an add (because this breaks the pitifully weak pattern matching, see one_loop_add_add_v16i8 in the reductions.ll test.

Will add this.

Half related to reductions, I'd also prevent if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated and we can't handle any VPSEL in the loop either.

Yep, will add this. This should go in canTailPredicateInstruction(), perhaps I will do this in a little separate patch,

After our last discussion on this, we agreed that tail-folding should be disabled by default (which it is), that the only reductions that we want at this point are integer add reductions (added in this patch), and that we can control enabling/disabling reductions with an option (added in D83133). I thought this is what we all believed in. :)

dmgreen accepted this revision.Jul 13 2020, 11:39 AM

Sounds good to me. Thanks.

This revision is now accepted and ready to land.Jul 13 2020, 11:39 AM
This revision was automatically updated to reflect the committed changes.