Hi All,
I'm currently looking into adding support to recognize and vectorize non-SIMD kinds of parallelism, e.g. add/sub patterns.
This kind of parallelism may be important in complex/numerical computations where these patterns are common.
These patterns can later be converted to instructions such as ADDSUBPS on X86.
I would like to get a few inputs on the patch and the design used to support this feature.
This patch adds support to recognize one asymmetric pairing (i.e. the add/sub instruction pair). This is the rough design I followed:
- Recognize add/sub patterns in getSameOpcode as shuffle vector instructions and handle shuffle vectors in buildTree_rec.
- Calculate the appropriate cost of vectorization when a shuffle vector is used.
- Calculate the appropriate mask and create a shuffle vector instruction to vectorize these patterns (see the sketch after this list).
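To make the last step concrete, here is a minimal sketch of the kind of mask construction I mean; the helper name emitAltShuffle and its parameters are hypothetical and only illustrate the approach, not the actual patch code:

  // Hypothetical helper (not the actual patch): build the alternating mask
  // <0, NumElts+1, 2, NumElts+3, ...> and emit the shuffle that merges the
  // two vectorized operations lane by lane.
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  static Value *emitAltShuffle(IRBuilder<> &Builder, Value *VecOp0,
                               Value *VecOp1, unsigned NumElts) {
    SmallVector<Constant *, 8> Mask;
    for (unsigned i = 0; i != NumElts; ++i)
      // Even result lanes come from VecOp0, odd lanes from VecOp1,
      // e.g. <0, 5, 2, 7> for four elements.
      Mask.push_back(Builder.getInt32(i % 2 ? NumElts + i : i));
    return Builder.CreateShuffleVector(VecOp0, VecOp1,
                                       ConstantVector::get(Mask));
  }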
The advantage of using a shuffle vector is that the same shuffle vector code in this patch can be used to pair any alternate sequence (addsub, subadd, etc.); we just need to handle them in getSameOpcode and classify them as shuffle vectors.
I tested the patch on a local test case with a large number of add/sub patterns, and it seems to give a nice ~10% improvement.
Awaiting inputs.
Thanks and Regards
Karthik Bhat
Karthik, thanks again for working on this.
Now that we generate shuffles, we need to make sure the costs are approximately right.
BasicTTI is supposed to return a conservative estimate if targets don't override it. We can't always return a cost of two here; the cost of the shuffle could be much higher.
We should return a conservative default, but a cost of one is not right either.
A conservative cost would be something like VecEltNum * (cost(extractelement, Tp) + cost(insertelement, Tp)), that is, the cost of constructing the shuffled vector by extracting the individual elements and then inserting them into the result vector.
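As a rough sketch of such a default, assuming the per-element prices come from the existing getVectorInstrCost hook (the free function and its name are made up for illustration):

  // Illustrative only: a conservative fallback that prices the alternate
  // shuffle as one extract plus one insert per element.
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  static unsigned getConservativeAltShuffleCost(const TargetTransformInfo &TTI,
                                                VectorType *Tp) {
    unsigned Cost = 0;
    for (unsigned i = 0, e = Tp->getNumElements(); i != e; ++i) {
      // Pull each element out of a source vector...
      Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Tp, i);
      // ...and put it back into the result vector.
      Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Tp, i);
    }
    return Cost;
  }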
Targets should then override this to provide more accurate estimates.
Costs should depend on the type. So for example:
X86TTI::getShuffleCost(SK_Alternate, <2 x double>) == 1
X86TTI::getShuffleCost(SK_Alternate, <4 x float>) == 2
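A target override along these lines might look roughly like the following; the dispatch, the exact signature, and the hard-coded per-type costs are only meant to illustrate the shape of such an override, not the actual implementation:

  // Illustrative sketch of a target override for alternate shuffles; real
  // code would likely use the target's cost tables instead of hard-coding.
  unsigned X86TTI::getShuffleCost(ShuffleKind Kind, Type *Tp, int Index,
                                  Type *SubTp) const {
    if (Kind == SK_Alternate && Tp->isVectorTy()) {
      VectorType *VTy = cast<VectorType>(Tp);
      if (VTy->getElementType()->isDoubleTy() && VTy->getNumElements() == 2)
        return 1; // e.g. a single blend for <2 x double>
      if (VTy->getElementType()->isFloatTy() && VTy->getNumElements() == 4)
        return 2; // e.g. two shuffles/blends for <4 x float>
    }
    // Fall back to the conservative default for everything else.
    return TargetTransformInfo::getShuffleCost(Kind, Tp, Index, SubTp);
  }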
We try to make vectorized code no slower than the scalar version, so it is important not to underestimate the cost of vectorized code.
If we are doing integer addsubs, we need to make sure we return sensible costs for integer alternate shuffles as well.
(My earlier sample code was using the wrong indices in the shuffle mask. The mask should have been <0, 5, 2, 7>, of course.)
We will also have to look at what code we generate for, for example, <8 x i16>, <16 x i8>, and so on.