If a vector body has live-out values, it is most likely a reduction, which needs a final reduction step after the loop. MVE has the VADDV instruction to reduce integer vectors, but it has no equivalent for float vectors. A live-out value that is not recognised as a reduction later in the optimisation pipeline causes the tail-predicated loop to be reverted to a non-predicated loop, which is very expensive, i.e. it has a significant performance impact. That is what we hope to avoid by fine-tuning the ARM TTI hook preferPredicateOverEpilogue implementation.
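To make this concrete, here is a hypothetical pair of loops (not taken from the patch or its tests; the function names are made up) illustrating the difference: the integer add reduction has a live-out that MVE can finalise with VADDV after a tail-predicated loop, while the float reduction has no such instruction.

```c
/* Hypothetical example, not from the patch: an integer add reduction.
   The running sum 's' is the live-out; after a tail-predicated vector
   loop it can be reduced to a scalar with a single VADDV. */
int sum_i32(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* The same pattern with floats: MVE has no VADDV equivalent for float
   vectors, so if this live-out is not recognised as a reduction later
   in the pipeline, the tail-predicated loop has to be reverted. */
float sum_f32(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```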
Event Timeline
Hello. I agree with this, but I would suggest we take it further.
There are multiple forms of reductions that the vectorizer can produce. It's at least:
int add, int mul
float add, float mul
and, or, xor
int smin, smax, umin, umax
float min & max.
I think ARMLowOverheadLoops will currently only handle integer adds. Considering how powerful the vectorizer's handling of reductions is, and how easy (and expensive) it would be for the ARMLowOverheadLoops pass to get that wrong, I would suggest just disabling _all_ reductions at the moment.
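For illustration, here are two hypothetical loops (not from the patch; names are made up) of the kinds listed above that the vectorizer can turn into reductions but that ARMLowOverheadLoops does not handle yet:

```c
/* Hypothetical examples, not from the patch: reductions other than
   integer add that the vectorizer can recognise. */
int prod_i32(const int *a, int n) {
  int p = 1;
  for (int i = 0; i < n; ++i)
    p *= a[i];            /* integer mul reduction */
  return p;
}

unsigned xor_u32(const unsigned *a, int n) {
  unsigned x = 0;
  for (int i = 0; i < n; ++i)
    x ^= a[i];            /* xor reduction */
  return x;
}
```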
Thanks Dave, those are good points.
ARMLowOverheadLoops indeed currently handles only integer adds, so I will have a go at further restricting the int operations.
Now only integer add reductions are allowed; all other operations (including floats) are rejected.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
I took a look at the reduction code in ARMLowOverheadLoops. The vectorizer can create add reductions in a number of places and in a number of ways (it can have multiple reductions, multiple reduction steps, look through phi's and selects and even change the type to a smaller bitwidth). I don't think the backend pass can handle all of them yet. Considering how expensive reverting can be in comparison, what do you think of just disabling all reductions for the time being and re-enabling them once we know that things are working well enough? Maybe add a FIXME for it? I would expect D75069 to handle integer reduction in most cases eventually from the vectorizer directly and we can probably do something similar for other reduction types.
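As one illustration of the "smaller bitwidth" case mentioned above, a hypothetical loop (not from the patch; the name is made up) summing bytes is a pattern where the vectorizer may shrink the reduction type:

```c
/* Hypothetical example, not from the patch: an add reduction over bytes.
   The vectorizer may narrow the reduction to a smaller bitwidth than the
   declared accumulator type, which is one of the patterns the backend
   pass would have to recognise. */
int sum_u8(const unsigned char *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```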
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
In my testing, I couldn't get the vectorizer to produce a tail-folded loop with multiple reductions, so I don't think we have to worry about that at the moment. My basic example also didn't even vectorize an FP reduction loop. I think disabling all reductions would be too conservative now; we should have done it from the beginning, but now that we actually can handle something, I think we should try to have this cost function align with the LowOverheadLoops implementation.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
If we really don't want tail-folding, I think that should be controlled in another way than by restricting this hook to do nothing. We have the option -prefer-predicate-over-epilog, and that should do that for now. If source-code modifications and more fine-grained control are possible, we even have a pragma. Like Sam said, I would like to start with integer add reductions, so that we are also testing that part in the back-end.
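For reference, this is roughly how that control looks from the user side. The thread does not name the pragma; I am assuming it refers to Clang's vectorize_predicate loop hint, and the function name below is made up:

```c
/* Per-loop control with Clang's loop hint (assuming the pragma referred
   to above is vectorize_predicate): */
void scale(int *a, int k, int n) {
#pragma clang loop vectorize_predicate(enable)
  for (int i = 0; i < n; ++i)
    a[i] *= k;
}

/* Globally, the LLVM option mentioned above can be passed through Clang,
   e.g. -mllvm -prefer-predicate-over-epilog (exact spelling and default
   may differ between LLVM versions). */
```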
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
I accidentally hit Submit too soon, but wanted to add a bit more nuance to my previous comment: "this hook to do nothing" -> "this hook to do almost nothing". I also wanted to add that in general matching up the behaviour here with the backend is probably best, and that I could look into disallowing multiple reductions, but it looks like we don't really have to worry about that. And if there are some inefficiencies around integer add reductions, I guess that's fine at this stage (tail-folding can be disabled, see previous comment), and I think it's good to have these inefficiencies explicit and visible.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
If we go for matching the limitations of LowOverheadLoops (of which there are many!), I think we want to check that:
Half related to reductions, I'd also prevent tail-folding if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated, and we can't handle any VPSEL in the loop either.
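To illustrate the select concern, here is a hypothetical max-reduction kernel (not from the patch; the name is made up) where the comparison becomes a select in the vectorized body:

```c
/* Hypothetical example, not from the patch: a max reduction (assumes
   n >= 1). The 'if' becomes a select in the vectorized loop body, which
   maps to VPSEL and is one reason min/max kernels resist tail-predication. */
int max_i32(const int *a, int n) {
  int m = a[0];
  for (int i = 1; i < n; ++i)
    if (a[i] > m)
      m = a[i];
  return m;
}
```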
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
Hmm. OK. I sent you some examples of the tests I had that were going wrong. Hopefully they will help. I would go with conservative for the time being to make turning on TailPred by default easier. We can always change it later, plus D75069 will change the way integer reductions are generated anyway! I would think the backend pass is more useful for other reductions in the long run - but even then, like we talked about, it might be better to do things earlier. But if you guys think that you can get it to work - then sounds good. I'll leave it to you. I personally wouldn't spend a lot of time on it though, if the goal is in the end to do it another way.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
That would make most sense to me.
Check. (Well, almost; need to restrict it to 1 live-out.)
Will add this.
Yep, will add this. This should go in canTailPredicateInstruction(); perhaps I will do this in a little separate patch.
After our last discussion on this, we agreed that tail-folding should be disabled by default (which it is), that the only reductions we want at this point are integer add reductions (added in this patch), and that we can control enabling/disabling reductions with an option (added in D83133). I thought this was what we all believed in. :)