This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Scalarize binary ops following shuffles.
AbandonedPublic

Authored by FarhanaAleen on Apr 6 2018, 3:38 PM.

Details

Summary

This patch performs following transformation.

%vector1 = shufflevector
%vector2 = shufflevector
%vector-res = vector-binaryop %vector1, %vector2
%scalar-res = extractelement <2 x fhalf> %vector-res, i32 0
ret half %scalar-res

>

%scalar1 = extractelement <2 x fhalf> %vector1, i32 0
%scalar2 = extractelement <2 x fhalf> %vector2, i32 0
%scalar-res = scalar binaryop %scalar1, %scalar2
return half %scalar-res

Diff Detail

Event Timeline

FarhanaAleen created this revision.Apr 6 2018, 3:38 PM

The fact that we have to limit this transform to 16-bit is a good indicator that it's not suitable as a target-independent instcombine.
Ie, just because it shows improvement for the target that you're looking at doesn't mean it will do the same for other targets.
We should either:
(1) fix the pattern-matching problems that occur in later passes, so we can do this transform universally or
(2) move this transform to the DAG gated by type and target checks, so it only happens when profitable.

Agreed, this really must take into account target costs to be of any use - its true that given vector code we don't perform much in the way of scalarization (or re-vectorization to another width). Not sure if this is the kind of thing that slp could be extended to perform without making it even more messy.....

Thanks for your feedback and I agree with you guys.

Yes, there two options with their pros and cons.

Initially, I considered 2) which is performing scalarization in the DAGCombiner gated by type and target checks. The code for implementing it inside the DAGCombiner seemed a little messy since I could not find a cleaner way to classify scalar operations such as arithmetic, bit-wise operation as opposed to shuffle vector, build vector etc. The only option I could think of is to hard-code the list of scalar operations.

On the other hand, this transformation can be implemented in a cleaner way with 1) (inside InstCombine) under the same consideration(target checks) with a simple extension. 1) also allows us to reuse the existing implementation. Another benefit of 1) is it will contain all the relevant scalarization code in one place. The only con is it requires additional support for certain targets; X86 at least and may be more.

In general, the transformation seemed a target independent optimization since it is optimizing vector code performing effective operation only on a single element; which should be rather executed as scalar code. And most of the time, performance of scalar code is better than vector code. Therefore, this optimization should benefit all the other targets. The reason it's not benefiting x86 is due to a lack of proper support in the instruction selection which can be considered as trivial and can be improved.

Do you think we can consider 1) and implement it in a few stages?


     Stage-1 Support only for 16bits which works for all targets or, 

     Stage-2  Support it for all types with a hook it to the TTI and enable the transformation for the targets that supports this optimization currently;

     Stage-3- Support the missing instruction selection pattern in X86 and possibly others.

On the other hand, this transformation can be implemented in a cleaner way with 1) (inside InstCombine) under the same consideration(target checks) with a simple extension. 1) also allows us to reuse the existing implementation. Another benefit of 1) is it will contain all the relevant scalarization code in one place. The only con is it requires additional support for certain targets; X86 at least and may be more.

We shouldn't include any kind of target checks in InstCombine though. It's supposed to be used for universal simplifications/transforms. If there's any doubt, then there has to be an easy way for targets to reverse the transform.

In general, the transformation seemed a target independent optimization since it is optimizing vector code performing effective operation only on a single element; which should be rather executed as scalar code. And most of the time, performance of scalar code is better than vector code.

It's the "in general" and "most of the time" qualifiers that raise the red flag for me. I understand the appeal of making the transform in IR, and so I'd second Simon's suggestion about seeing if we can add this onto SLP vectorizer where you can use cost models to decide if it's profitable.

If you'd still like to pursue an instcombine solution, please post this to llvm-dev for more feedback. If there's consensus that we can do this here, I won't oppose it (it's obviously less work to do it here!).

It's the "in general" and "most of the time" qualifiers that raise the red flag for me.

I can see that.

I understand the appeal of making the transform in IR, and so I'd second Simon's suggestion about seeing if we can add this onto SLP vectorizer where you can use cost models to decide if it's profitable.

Ok, I will try this first and see how it goes.

FarhanaAleen abandoned this revision.Apr 10 2018, 3:41 PM