This is an archive of the discontinued LLVM Phabricator instance.

Add support to recognize non SIMD kind of parallelism in SLPVectorizer
ClosedPublic

Authored by karthikthecool on Jun 4 2014, 1:19 AM.

Details

Summary

Hi All,
I'm currently looking into adding support to recognize and vectorize non-SIMD kinds of parallelism, e.g. add/sub patterns.

This kind of parallelism may be important in complex/numerical computations where these patterns are common.
These patterns can later be converted to instructions such as ADDSUBPS on X86.

I would like to get a few inputs on the patch and the design used to support this feature.

This patch adds support to recognize one asymmetric pairing (i.e. an add/sub sequence). This is the rough design I followed:

  1. Recognize add/sub patterns in getSameOpcode as shuffle vector instructions and handle shuffle vector in buildTree_rec.
  2. Calculate appropriate cost of vectorization when shuffle vector is used.
  3. Calculate appropriate mask and create shuffle vector instruction to vectorize these patterns.

The advantage of using shufflevector is that the same shufflevector code in this patch can pair any alternating sequence such as addsub, subadd, etc.; we just need to handle them in getSameOpcode and classify them as shuffle vectors.
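To make step 1 concrete, here is a minimal sketch of what the recognition amounts to (the helper name and exact checks are illustrative, not the literal patch): a bundle qualifies when its scalars strictly alternate between two binary opcodes.

// Illustrative sketch only; assumes the usual LLVM headers and namespace.
static bool isAlternateSequence(ArrayRef<Value *> VL) {
  if (VL.size() < 2)
    return false;
  auto *I0 = dyn_cast<BinaryOperator>(VL[0]);
  auto *I1 = dyn_cast<BinaryOperator>(VL[1]);
  if (!I0 || !I1 || I0->getOpcode() == I1->getOpcode())
    return false;
  unsigned EvenOpc = I0->getOpcode(); // e.g. Instruction::FAdd
  unsigned OddOpc = I1->getOpcode();  // e.g. Instruction::FSub
  for (unsigned i = 0, e = VL.size(); i != e; ++i) {
    auto *I = dyn_cast<BinaryOperator>(VL[i]);
    if (!I || I->getOpcode() != ((i & 1) ? OddOpc : EvenOpc))
      return false;
  }
  return true;
}

A bundle that passes such a check is encoded with the ShuffleVector opcode so that buildTree_rec and vectorizeTree can treat it uniformly.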

I tested the patch on a local test case with a large number of add/sub patterns, and it seems to give a nice ~10% improvement.

Awaiting inputs.

Thanks and Regards
Karthik Bhat

Diff Detail

Event Timeline

karthikthecool retitled this revision to: Add support to recognize non SIMD kind of parallelism in SLPVectorizer.
karthikthecool updated this object.
karthikthecool edited the test plan for this revision. (Show Details)

Update the patch to fix an issue.
When vectorizing a ShuffleVector instruction, make sure that it actually refers to an alternate sequence such as an add/sub sequence and is not referring to a set of ShuffleVector instructions itself.

aschwaighofer edited edge metadata. Jun 5 2014, 8:58 AM

Hi Karthik,

Thanks for working on this! I have a few questions embedded in the source. Also, did you run the test suite for correctness and performance?

lib/Transforms/Vectorize/SLPVectorizer.cpp
152

Could we name this function "isAddSubInst" and add some documentation noting that we encode vector bundles that are a combination of opcodes as a shufflevector?

494

I don't understand why you are relaxing this constraint. Why should a scalar instruction be part of two vectors?

765

I don't understand why you are relaxing this constraint. Within an ADDSUB operation you still want the instructions to be unique?

For the bundle [ADD1, SUB1, ADD2, SUB2] we still want all the members to be unique in the bundle?

1256

I think we should use the getShuffleCost function (we would need a new ShuffleKind and an implementation of the cost model for it).

1896

Why is this needed?

karthikthecool edited edge metadata.

Hi Arnold,
Thanks for the review. Yes, I have run the LLVM regression tests in debug and release modes and there are no failures. I'm yet to generate performance results for this patch; I will try to collect them over the weekend.

Checking Opcode != Instruction::ShuffleVector was experimental code; unfortunately I forgot to remove it while submitting the patch. Updated the patch to remove the wrongly added code. Thanks for highlighting it; I will be more careful next time.

Updated the cost model to use getShuffleCost and added a new ShuffleKind (SK_Alternate).
I have not yet handled SK_Alternate separately; we are returning the default cost of 1, as with SK_Broadcast etc.
I have added a TODO to handle the cost model for shuffle vectors in a better way, as I can see that with the current cost model, sequences generated from code such as

float c[4];
float a;
float b;
void foo () {
  c[0]= a+b;
  c[1] = a-b;
  c[2] = a+b;
  c[3] = a-b;
}

are not getting converted into a shufflevector (it works when we set a lower -slp-threshold). It also works when both a and b are arrays (e.g. the test case attached in this patch, addsub.ll).
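For context, the decision referenced here is the SLP profitability gate in the vectorizer driver; a simplified sketch (R is the BoUpSLP instance, and SLPCostThreshold backs the -slp-threshold flag):

// Simplified sketch of the SLP profitability check.
R.buildTree(Operands);
int Cost = R.getTreeCost();      // negative cost means "profitable"
if (Cost < -SLPCostThreshold)    // lowering -slp-threshold relaxes this gate
  R.vectorizeTree();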

Thanks for your inputs and time.
Regards
Karthik Bhat

Hi Arnold, Nadav,
Please find the LNT results with a sample size of 20. I did not observe any significant performance changes in the LNT test cases, but one of our internal tests gave an improvement of ~5-6%.

Thanks for the time.

Regards
Karthik Bhat

aschwaighofer added inline comments. Jun 17 2014, 8:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1259

Ultimately, I think this could be replaced by a call to TargetTransformInfo::getAddSubCost, and BasicTargetTransformInfo should override this with the implementation you have here.
Targets can then override the cost as they see fit.

For now, your code should be conservatively correct for all targets, assuming they return an accurate (conservative) cost for getShuffleCost(SK_Alternate) - and I don't see addsubs generated yet anyway, so introducing the getAddSubCost abstraction would be premature.

Can you take a look at the code we generate for ARM, ARM64, and x86 to make sure that a cost of one instruction is correct? It seems to me that we generate 2 instructions for x86_64, ARM, and ARM64 with <4 x float>, indicating that those targets should override getShuffleCost(SK_Alternate) and return a cost of 2 for getShuffleCost(SK_Alternate, <4 x float>).

cat > testshufflevector.ll

define void @test1(<4 x float> *%a, <4 x float> *%b, <4 x float> *%c) {
entry:
  %in1 = load <4 x float>* %a
  %in2 = load <4 x float>* %b
  %add = fadd <4 x float> %in1, %in2
  %sub = fsub <4 x float> %in1, %in2
  %Shuff = shufflevector <4 x float> %add,
                         <4 x float> %sub,
                         <4 x i32> <i32 0, i32 5, i32 1, i32 6>
  store <4 x float> %Shuff, <4 x float>* %c
  ret void
}

define void @test2(<2 x double> *%a, <2 x double> *%b, <2 x double> *%c) {
entry:
  %in1 = load <2 x double>* %a
  %in2 = load <2 x double>* %b
  %add = fadd <2 x double> %in1, %in2
  %sub = fsub <2 x double> %in1, %in2
  %Shuff = shufflevector <2 x double> %add,
                         <2 x double> %sub,
                         <2 x i32> <i32 0, i32 3>
  store <2 x double> %Shuff, <2 x double>* %c
  ret void
}


bin/llc -mtriple=arm64-apple-ios7.0 -mcpu=cyclone < testshufflevector.ll

        .section        __TEXT,__text,regular,pure_instructions
        .ios_version_min 7, 0
        .globl  _test1
        .align  2
_test1:                                 ; @test1
        .cfi_startproc
; BB#0:                                 ; %entry
        ldr      q0, [x0]
        ldr      q1, [x1]
        fadd.4s v2, v0, v1
        fsub.4s v0, v0, v1
// TWO INSTRUCTIONS
        ext.16b v0, v0, v2, #4
        zip1.4s v0, v2, v0
///
        str      q0, [x2]
        ret
        .cfi_endproc

bin/llc -mtriple=armv7s-apple-ios7.0 < testshufflevector.ll
bin/llc -mtriple=x86_64-apple-macos < testshufflevector.ll

With the adjustments to the cost model (X86TargetTransformInfo/ARMTargetTransformInfo/AArch64TargetTransformInfo::getShuffleCost) this LGTM.

Thanks

Hi Arnold,
Thanks for the review comments. Updated the patch to use a shuffle cost of 2. I have modified getShuffleCost in BasicTargetTransformInfo and TargetTransformInfo, as X86TargetTransformInfo::getShuffleCost and ARMTargetTransformInfo::getShuffleCost get the cost from this common function.

Yes, the addsub instruction is not yet emitted by the backend. We will need some more modifications to recognize this shuffle as an addsub and generate the corresponding target instruction. I will look into it once these changes are committed, and will also try to update the cost model to use getAddSubCost in follow-up patches.
Does this look good to commit now? Thanks once again for the review.
Regards
Karthik Bhat

Updated patch rebased to trunk. Thanks

hfinkel edited edge metadata. Jun 18 2014, 12:33 AM

Thanks for working on this!

lib/Transforms/Vectorize/SLPVectorizer.cpp
178

Can we add support for matching (fsub, fadd, fsub, fadd, ...)? I think that generating (fneg(addsub(x, y))) would be nice.

Also, is there any particular reason that we're restricting this to floating-point add/sub? Granted, I know of no ISA with integer add/sub instructions, but I think that the lowering as (addsub(x, xor(y, mask))) is likely more efficient than the scalar version.

test/Transforms/SLPVectorizer/X86/addsub.ll
14

Please check the actual shuffle indices

40

We don't need this ident metadata in the test files.

karthikthecool edited edge metadata.

Hi Hal,
Thanks for the review.
Yes, we can handle the fsub, fadd, fsub sequence as well.
Updated the patch as per the review comments to handle the sequence fsub, fadd, fsub, fadd, ... and also the add/sub sequence.
Updated the test case to check for the shuffle mask used.

It would be great if you could review the updated patch.
Thanks
Karthik Bhat

aschwaighofer added inline comments. Jun 18 2014, 9:12 AM
lib/Analysis/TargetTransformInfo.cpp
575 (On Diff #10546)

Please leave a cost of 1 here.

lib/CodeGen/BasicTargetTransformInfo.cpp
335 (On Diff #10546)

Karthik, thanks again for working on this.

Now that we generate shuffles, we need to make sure the costs are approximately right.

BasicTTI is supposed to return a conservative estimate if targets don't override it. We can't always return two here. The cost of the shuffle could be much higher.

We should return a conservative default. One is not right either.

A conservative cost would be something like VecEltNum * (cost(extractelement, Tp) + cost(insertelement, Tp)), that is, the cost of constructing the shuffled vector by extracting the individual elements and then inserting them into the result vector.

Targets should then override this to provide more accurate estimates.

Costs should depend on the type. So for example:

X86TTI::getShuffleCost(SK_Alternate, <2 x double>) == 1
X86TTI::getShuffleCost(SK_Alternate, <4 x float>) == 2

We try to make vectorized code not slower than the scalar version. It is therefore important to not underestimate costs of vectorized code.

If we are doing integer addsubs, we need to make sure we return sensible costs for integer alternate shuffles as well.

(My sample code before was using the wrong indices in the shuffle mask. The mask should have been 0, 5, 2, 7, of course.)

We will have to look at what code we generate for <8 x i16>, <16 x i8>, and so on. For example:

define void @test3(<8 x i16> *%a, <8 x i16> *%b, <8 x i16> *%c) {
entry:
  %in1 = load <8 x i16>* %a
  %in2 = load <8 x i16>* %b
  %add = add <8 x i16> %in1, %in2
  %sub = sub <8 x i16> %in1, %in2
  %Shuff = shufflevector <8 x i16> %add,
                         <8 x i16> %sub,
                         <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
  store <8 x i16> %Shuff, <8 x i16>* %c
  ret void
}

Hi Arnold,
Thanks for the pointers. Updated the cost model for SK_Alternate shuffles.
The cost model is as follows:

  1. In BasicTTI, created a function getAltShuffleOverhead.

As you mentioned, the conservative cost of the shuffle is the cost of extracting the element at each index from its source vector plus the cost of inserting it into the result vector, summed over all elements of the result vector.
Since we will be alternately picking elements from the two vectors, which are of the same type (i.e. element 0 of the 1st vector, element 1 of the 2nd vector, element 2 of the 1st vector, and so on), the function just runs a loop that, for each index, adds the cost of extracting the element at that index to the cost of inserting the element at that index.
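A sketch of that helper, modeled on the existing insert/extract cost queries in BasicTTI (the exact committed code may differ slightly):

// Conservative cost of an alternate shuffle: build the result by
// extracting every element from its source and inserting it into the
// destination vector.
unsigned BasicTTI::getAltShuffleOverhead(Type *Ty) const {
  assert(Ty->isVectorTy() && "Can only shuffle vectors");
  unsigned Cost = 0;
  for (unsigned i = 0, e = Ty->getVectorNumElements(); i != e; ++i) {
    Cost += TopTTI->getVectorInstrCost(Instruction::ExtractElement, Ty, i);
    Cost += TopTTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);
  }
  return Cost;
}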

We have overridden getShuffleCost for X86 and ARM, and created two tables, NEONAltShuffleTable and X86AltShuffleTable, with more accurate shuffle costs. The cost here represents the number of instructions required to generate the shuffled vector.
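For illustration, the target override has roughly this shape (the two table entries shown are only the costs discussed earlier in this review, <2 x double> == 1 and <4 x float> == 2; the committed tables cover more types):

// Rough sketch of a target override of getShuffleCost for SK_Alternate.
unsigned X86TTI::getShuffleCost(ShuffleKind Kind, Type *Tp, int Index,
                                Type *SubTp) const {
  if (Kind == SK_Alternate) {
    static const CostTblEntry<MVT::SimpleValueType> X86AltShuffleTbl[] = {
      {ISD::VECTOR_SHUFFLE, MVT::v2f64, 1},
      {ISD::VECTOR_SHUFFLE, MVT::v4f32, 2},
    };
    std::pair<unsigned, MVT> LT = TLI->getTypeLegalizationCost(Tp);
    int Idx = CostTableLookup(X86AltShuffleTbl, ISD::VECTOR_SHUFFLE,
                              LT.second);
    if (Idx != -1)
      return LT.first * X86AltShuffleTbl[Idx].Cost;
  }
  return TargetTransformInfo::getShuffleCost(Kind, Tp, Index, SubTp);
}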

Please let me know your inputs on this.

For <8 x i16> as well, we should be generating the correct mask, as the logic to create the mask is generic: we take the vector length (8 in this case) and run a loop from 0 to the length, alternately selecting the loop index i and lengthOfVector + i. So in this case we generate the sequence (0, 8+1, 2, 8+3, 4, 8+5, 6, 8+7), i.e. <0, 9, 2, 11, 4, 13, 6, 15> (see the sketch below).
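The mask construction itself is only a few lines; a sketch (VecAdd/VecSub stand for the two vectorized operands, and Builder is the IRBuilder at the insertion point):

// Build the alternating mask <0, VF+1, 2, VF+3, ...>; for VF == 8 this
// yields <0, 9, 2, 11, 4, 13, 6, 15>.
SmallVector<Constant *, 8> Mask(VF);
for (unsigned i = 0; i < VF; ++i)
  Mask[i] = Builder.getInt32((i & 1) ? VF + i : i);
Value *V = Builder.CreateShuffleVector(VecAdd, VecSub,
                                       ConstantVector::get(Mask));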

I tried to test it on a sample test case, but it exits before entering buildTree in vectorizeStoreChain as VF > ChainLen. I will try to come up with a working test to check <8 x i16> and <16 x i8>, but I believe we should be generating the correct mask in both these cases as per our logic.

Thanks once again for your time and inputs.
Regards
Karthik Bhat

karthikthecool accepted this revision. Jun 19 2014, 9:50 PM
karthikthecool added a reviewer: karthikthecool.

Thanks, Arnold, for the review. Committed as r211339. Now that the SLPVectorizer is vectorizing these patterns, I will try to map these vector shuffles to instructions such as addsubpd.
Thanks!

This revision is now accepted and ready to land. Jun 19 2014, 9:50 PM
karthikthecool closed this revision. Jul 1 2014, 9:23 PM