This is an archive of the discontinued LLVM Phabricator instance.

[Instcombiner]Improve emission of logical or/and reductions.
ClosedPublic

Authored by ABataev on Feb 24 2021, 11:44 AM.

Details

Summary

For logical or/and reductions we emit regular intrinsics @llvm.vector.reduce.or/and.vxi1 calls.
These intrinsics are not effective for the logical or/and reductions,
especially if the optimizer is able to emit short circuit versions of
the scalar or/and instructions and vector code gets less effective than
the scalar version.
Instead, or reduction for i1 can be represented as:

%val = bitcast <ReduxWidth x i1> to iReduxWidth
%res = cmp ne iReduxWidth %val, 0

and reduction for i1 can be represented as:

%val = bitcast <ReduxWidth x i1> to iReduxWidth
%res = cmp eq iReduxWidth %val, 11111

This improves perfromance of the vector code significantly and make it
to outperform short circuit scalar code.

Diff Detail

Event Timeline

ABataev created this revision.Feb 24 2021, 11:44 AM
ABataev requested review of this revision.Feb 24 2021, 11:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2021, 11:44 AM
ABataev updated this revision to Diff 326187.Feb 24 2021, 1:05 PM

Summary improvements

Hi @ABataev, I saw this patch come by and left some comments, hope that's alright!

llvm/lib/Transforms/Utils/LoopUtils.cpp
1031 ↗(On Diff #326187)

Can you add the condition that && isa<FixedVectorType>(Src)? (same request for LoopVectorize.cpp and SLPVectorize.cpp)

We're starting to make the LoopVectorizer vectorize for scalable VFs. This means we're currently fixing up cases like this where assumptions are made that are only valid for fixed-width vectors. For scalable vectors it might be possible to do the <vscale x N x i1> reduction as a compare on <vscale x 1 x iN>, but at least for SVE I know that we never want that.

1034–1036 ↗(On Diff #326187)

nit: cast<FixedVectorType>(Src->getType())->getNumElements()

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6949–6964

Should this be part of BasicTTImpl::getArithmeticReductionCost? That way you don't have to implement it twice.

Why can't we just do the costmodel change, and then canonicalize to the right variant (i.e. away from @llvm.vector.reduce.?.v?i1) in instcombine?

Why can't we just do the costmodel change, and then canonicalize to the right variant (i.e. away from @llvm.vector.reduce.?.v?i1) in instcombine?

+1 I'd prefer to avoid yet more special cases in the vectorizers.

Yeah, I also thought about moving this to instcombiner. This patch originally is more a question where it is better to implement it. I'll move the transformation to the instcombiner.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1031 ↗(On Diff #326187)

Sure, will do.

ABataev updated this revision to Diff 326737.Feb 26 2021, 10:27 AM

Address comments.

Can't all the vectorizer code be removed now? X86TTIImpl::getArithmeticReductionCost already has bool reduction costs - not sure if we need to improve the default costs ?

Can't all the vectorizer code be removed now? X86TTIImpl::getArithmeticReductionCost already has bool reduction costs - not sure if we need to improve the default costs ?

Looks like the bool reduction costs are too high because they still rely on some vector instructions. The cost of bitcast + cmp is much lower than even AVX512 bool reductions, for example. What we can do, is to check ащкall targets that we have or/and bool reduction and fallback to the BasicTTI cost. And teach BasicTTI that the cost of bool or/and reductions is actually just bitcast+cmp.

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

Something along those lines, yes.

Also, could you please split off the instcombine change? That is basically LG.

llvm/test/Transforms/InstCombine/vector-logical-reductions.ll
4–30

These tests are too complex, they should be just:

llvm/lib/Transforms/Vectorize/ changes should not be there, the costmodel itself should be fixed.

So, do you suggest modifying getArithmeticReductionCost to return the correct cost for logical reductions?

Something along those lines, yes.

Also, could you please split off the instcombine change? That is basically LG.

Ok.

llvm/test/Transforms/InstCombine/vector-logical-reductions.ll
4–30

Will fix

ABataev updated this revision to Diff 327477.Mar 2 2021, 8:59 AM

Removed cost update

ABataev retitled this revision from [Vectorizers]Improve emission of logical or/and reductions. to [Instcombiner]Improve emission of logical or/and reductions..Mar 2 2021, 8:59 AM

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

Is this the same as these?
https://bugs.llvm.org/show_bug.cgi?id=41636
https://bugs.llvm.org/show_bug.cgi?id=41635
https://bugs.llvm.org/show_bug.cgi?id=41634

Hm, actually, are we sure this is universally true/better?
For aarch64 this seems to result in larger codegen: https://godbolt.org/z/hczeMo

Is this the same as these?
https://bugs.llvm.org/show_bug.cgi?id=41636
https://bugs.llvm.org/show_bug.cgi?id=41635
https://bugs.llvm.org/show_bug.cgi?id=41634

Yes, looks so.

lebedev.ri accepted this revision.Mar 2 2021, 9:50 AM

Okay, let's call this canonicalization.

This revision is now accepted and ready to land.Mar 2 2021, 9:50 AM
This revision was landed with ongoing or failed builds.Mar 4 2021, 8:01 AM
This revision was automatically updated to reflect the committed changes.