This is an archive of the discontinued LLVM Phabricator instance.

Improve Cost model for SLPVectorizer when we have a vector division by power of 2
ClosedPublic

Authored by karthikthecool on Aug 19 2014, 6:33 AM.

Details

Summary

Hi All,
This patch improves the cost model in SLPVectorizer for division by power of 2. Currently the below code is not vectorized by clang in O3. Gcc though is able to vectorizes this.

void f(int* restrict a,int *restrict b,int *restrict c ) {
  a[0] = (b[0]+c[0])/2;
  a[1] = (b[1]+c[1])/2;
  a[2] = (b[2]+c[2])/2;
  a[3] = (b[3]+c[3])/2;
}

The problem is SLPVectorizer estimates the cost of vector divide as too high to be profitable and gives up on vectorization.
But in cases such as the above were we are dividing by power of 2 the cost infact is much less as backend converts them into instruction such as psrad/psraw on X86 targets.
The current patch updates the cost model when we divide by power of 2 to enable vectorization in such cases.

Please let me know your i/p's on the same.

Thanks and Regards
Karthik Bhat

Diff Detail

Event Timeline

karthikthecool retitled this revision from to Improve Cost model for SLPVectorizer when we have a vector division by power of 2.
karthikthecool updated this object.
karthikthecool edited the test plan for this revision. (Show Details)
karthikthecool added a subscriber: Unknown Object (MLST).
spatel added a subscriber: spatel.Aug 19 2014, 10:13 AM
spatel added inline comments.Aug 19 2014, 10:34 AM
lib/Target/X86/X86TargetTransformInfo.cpp
231

udiv should become a logical shift left: "vpsrld" or "vpsrlw" with AVX2. With SSE, it's "psrld" or "psrlw" (just remove the leading 'v').

sdiv is handled with a sequence of logical shift left, add, algebraic shift left. The cost should be the sum of those ops?

hfinkel edited edge metadata.Aug 19 2014, 10:38 AM
  • Original Message -----

From: "Sanjay Patel" <spatel@rotateright.com>
To: "kv bhat" <kv.bhat@samsung.com>, nrotem@apple.com, aschwaighofer@apple.com, hfinkel@anl.gov
Cc: spatel@rotateright.com, llvm-commits@cs.uiuc.edu
Sent: Tuesday, August 19, 2014 12:34:33 PM
Subject: Re: [PATCH] Improve Cost model for SLPVectorizer when we have a vector division by power of 2

Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:208
@@ +207,3 @@
+ {ISD::SDIV, MVT::v16i16, 1}, psraw instruction
+ {ISD::UDIV, MVT::v16i16, 1},
psraw instruction

+ {ISD::SDIV, MVT::v8i32, 1}, // psrad instruction

udiv should become a logical shift left: "vpsrld" or "vpsrlw" with
AVX2. With SSE, it's "psrld" or "psrlw" (just remove the leading
'v').

sdiv is handled with a sequence of logical shift left, add, algebraic
shift left. The cost should be the sum of those ops?

The costs are really about throughput. I'd imagine that it will only really be the sum if there is a single dependency chain, no uop fusion, etc. Likely best to measure it.

-Hal

http://reviews.llvm.org/D4971

Should there also be a corresponding change in the loop vectorizer?

include/llvm/Analysis/TargetTransformInfo.h
338

I wonder if this is the right design:

  1. This is really a subcase of the uniform constant case, and requires all implementations to handle the defaulting.
  2. We could have non-uniform values that are all powers of two.

Having some optional feature bits seems better. I imagine will want more special properties like this in the future.

Hi Hal,Sanjay,
Thanks for the inputs. The reason i added support for only uniform constant power of 2 is because currently x86 backend successfully emits PSRAW/PSRAD (but if i'm not wrong this requires all the shifts to be same. i.e. division by same power of 2).

Non uniform values that are all power of 2 though are currently not that profitable as backend doesn't emit/or have instructions that can do this profitably.

Hal i had a doubt, by feature bit do we mean having something like a member variable say isPowerof2 and setting this before calling getArithmeticInstrCost?
Please correct me if my understanding is wrong.

Thanks for all the help and taking your time to review the patch.
I will update the patch shortly based on your i/p's and raise updated patch.

Thanks
Karthik Bhat

  • Original Message -----

From: "Karthik Bhat" <kv.bhat@samsung.com>
To: "kv bhat" <kv.bhat@samsung.com>, nrotem@apple.com, aschwaighofer@apple.com, hfinkel@anl.gov
Cc: spatel@rotateright.com, llvm-commits@cs.uiuc.edu
Sent: Wednesday, August 20, 2014 12:18:05 PM
Subject: Re: [PATCH] Improve Cost model for SLPVectorizer when we have a vector division by power of 2

Hi Hal,Sanjay,
Thanks for the inputs. The reason i added support for only uniform
constant power of 2 is because currently x86 backend successfully
emits PSRAW/PSRAD (but if i'm not wrong this requires all the shifts
to be same. i.e. division by same power of 2).

Non uniform values that are all power of 2 though are currently not
that profitable as backend doesn't emit/or have instructions that
can do this profitably.

Hal i had a doubt, by feature bit do we mean having something like a
member variable say isPowerof2 and setting this before calling
getArithmeticInstrCost?
Please correct me if my understanding is wrong.

No; maybe we want to add something like:

enum OperandValueProperties {

OP_None = 0,
OP_PowerOf2 = 1

}
(this is really a bit field).

and then make getArithmeticInstrCost take (OperandValueKind Opd1Info, unsigned Prop1, OperandValueKind Opd2Info, OperandValueProperties Prop2)
instead of just Opd1Info and Opd2Info.

Then in the routines, you check for (Prop1 & OP_PowerOf2), etc.

-Hal

Thanks for all the help and taking your time to review the patch.
I will update the patch shortly based on your i/p's and raise updated
patch.

Thanks
Karthik Bhat

http://reviews.llvm.org/D4971

Very interesting. Not just X86, either. Filed http://llvm.org/bugs/show_bug.cgi?id=20714 for the AArch64 equivalent.

karthikthecool edited edge metadata.

Hi Hal,Sanjay,
Updated the patch as per review comments. The modifcations done are -

  1. Add feature bit as per review comments to check if we have a uniform const power of 2. Currently only handling uniform constant power of 2. Will check more on non uniform const with all power of 2 and come up with another patch.
  2. Updated Loop vectorizers cost model to handle similar situation.
  3. Updated cost table comments and value in X86TargetTransformInfo.

Please let me know if this looks good to commit. Thanks for your time and valuable inputs.

Regards
Karthik Bhat

This LGTM (but please also get an okay from someone familiar with the X86 vector instructions).

lib/Transforms/Vectorize/LoopVectorize.cpp
5855

This can just be:

if (SplatValue) {

Thanks Hal for the review. Updated the patch.
Nadav,Arnold any inputs on this one? Shall i go ahead with commit?
Thanks
Karthik Bhat

Hi Karthik,

sorry for the long post.
I have a few comments related to your patch.

I think your cost tables are incomplete. For example (correct me if I am wrong), the cost of a UDIV/SDIV by a constant splat of powers-of-2 is exactly the same on AVX2 and AVX (and on SSE as well) if we exclude the legalization cost which maybe different for SSE.

Also, the cost of an SDIV would be 4 and not 3:
on AVX for example, you would get a vpsrad + vpsrld + vpaddd + vpsrad.
You would get an extra arithmetic shift at the beginning because you would need to propagate the sign bit. Under the assumption that we are dividing by a non-negative power-of-2 then we are done (i.e. the cost is 4); otherwise we would pay an extra ISD::SUB to negate the result. In your case you don't need to account for the extra ISD::SUB because method APInt::isPowerOf2() would only return true for values that are > 0.

That said, if possible I suggest to reuse the existing cost tables as much as possible rather than defining new tables (see comments below).

I hope this makes sense to you.
-Andrea

lib/Target/X86/X86TargetTransformInfo.cpp
189–212

Here you can add the following check:

// Unsigned divisions by powers-of-2 are always reduced to SRL.
if (ISD == ISD::UDIV && TargetTransformInfo::OK_UniformConstantValue &&
    Op2PropInfo == TargetTransformInfo::OP_PowerOf2)
  ISD = ISD::SRL;

I had a quick look at the code and I am pretty sure that this is what we do for every target.

This will allow you to get rid of all your new entries for UDIV in the new cost tables you added for powers-of-2.

In the case of SDIV, each targets may provide its own custom implementation of sdiv x, pow2. On X86 we don't override method 'BuildSDIVPow2' and therefore we fall back to the standard expansion of signed divisions by pow2; as a result, we introduce the sequence SRA + SRL + ADD + SRA.

if (ISD == ISD::SDIV && TargetTransformInfo::OK_UniformConstantValue &&
    Op2PropInfo == TargetTransformInfo::OP_PowerOf2) {
  // On X86, vector signed division by constants power-of-two are
  // normally expanded to the sequence SRA + SRL + ADD + SRA.
  unsigned Cost = 2 * getArithmeticInstrCost(ISD::SRA, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  Cost += getArithmeticInstrCost(ISD::SRL, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  Cost += getArithmeticInstrCost(ISD::ADD, Ty, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
  return Cost;
}

This will allow you to reuse the existing cost tables without having to add extra cost tables to handle the special case of divisions by powers-of-2.

As a side note:
in the future, we can think of removing the check for TargetTransformInfo::OK_UniformConstantValue to also address non-uniform constants as suggested by Hal.
For example, on AVX2, UDIV by non-uniform constants power-of-2 could be emitted as a single vprslv. That is because we can take advantage of vector shifts with variable count (which we don't have on SSE/AVX).
Unfortunately, this is not happening at the moment and we end up scalarizing the logical shifts instead. This is why we still need that check for now..

Example:

%div = udiv <4 x i32> %A, <i32 4, i32 8, i32 16, i32 4>

Could be strenght reduced to:

%div = lshr <4 x i32> %A, <i32 2, i32 3, i32 4, i32 2>

On AVX2, that logical shift right would be emitted as a single vpsrlvd.
Unfortunately this is not happening at the moment and we get a long sequence of
vpextrd + shrl + vpinsrd..

Similarly, On AVX2 we could strongly optimize the SDIV expansion by non-uniform constants power-of-2 (this - again - is not happening at the moment).

Hi Andrea,
Thanks for the i/p's. Yes for UDIV the cost can be reduced to that of SRL.
For SDIV though I checked the below code with gcc and clang we seem to get 3 extra instructions (vpsrld,vpaddd,vpsrad) for SDIV. Please correct me if i'm wrong.

void f(int* restrict a,int* restrict b,int* restrict c) {
  int i;
  for(i=0;i<4;i=i+4) {
    a[i] = (b[i]+c[i])/2;
    a[i+1] = (b[i+1]+c[i+1])/2;
    a[i+2] = (b[i+2]+c[i+2])/2;
    a[i+3] = (b[i+3]+c[i+3])/2;
  }
}

compiled with clang -O3 -mavx2 test.c -S -o test.s

#BB#0:                                 # %entry
vmovdqu	(%rdx), %xmm0
vpaddd	(%rsi), %xmm0, %xmm0
vpsrld	$31, %xmm0, %xmm1
vpaddd	%xmm1, %xmm0, %xmm0
vpsrad	$1, %xmm0, %xmm0
vmovdqu	%xmm0, (%rdi)
retq

I agree that reusing the existing table is a good option. I will update the patch accordingly.

Yes we can vectorize non constant power of 2 in avx2 but i suppose it will be using vpsrlvd,vpsravd?
Thanks all for helping me out here.

Regards
Karthik Bhat

Hi Andrea,
Thanks for clarifying my doubts. Updated the patch as per review comments and ran clang format.
I will try to handle vectorization of division by non constant pow of 2 seperatly in another patch.
Please let me know your opinion on the same.

Thanks and Regards
Karthik Bhat

andreadb accepted this revision.Aug 21 2014, 8:21 AM
andreadb added a reviewer: andreadb.

I have two minor comments, otherwise the patch LGTM!
(see comments below).

Thanks,
Andrea

lib/Target/X86/X86TargetTransformInfo.cpp
222

I think we also have to explicitly set Opd2PropInfo to 'TargetTransformInfo::OP_None.
The shift count of the resulting shift is unlikely to be a power-of-2.

230–235

Now that I think about it.. it is wrong to pass the old 'Opd1PropInfo' and 'Opd2PropInfo' to those new calls to 'getArithmeticInstrCost' (sorry, that code came from my original suggestion...).

The expanded shifts will have different shift counts which may or may not be powers-of-2. The Add will have different operands, so we cannot reuse the operand properties of the original UDIV/SDIV.

To be conservative here, we have to use the default values for Opd1PropInfo and Opd2PropInfo (OP_None) in each call to 'getArithmeticInstrCost'.

This revision is now accepted and ready to land.Aug 21 2014, 8:21 AM
spatel added inline comments.Aug 21 2014, 8:29 AM
lib/Target/X86/X86TargetTransformInfo.cpp
245

Nit: the changes to the bracket spacing are not consistent with the rest of the file. Can you adjust all of those in a later checkin so this patch is minimal?

Hi all -
If we recognize that power of 2 integer division is always converted to one or more simple ops (shifts, adds, subs) for all architectures via DAGCombiner, then can we hoist the changes out of X86TargetTransformInfo and into the superclass TargetTransformInfo so we don't have to repeat this logic for every target?

In D4971#28, @spatel wrote:

Hi all -
If we recognize that power of 2 integer division is always converted to one or more simple ops (shifts, adds, subs) for all architectures via DAGCombiner, then can we hoist the changes out of X86TargetTransformInfo and into the superclass TargetTransformInfo so we don't have to repeat this logic for every target?

We can safely do this only for UDIV. UDIV by pow-2 is always converted to a SRL. This is true for all targets.

However, we cannot guarantee that SDIV is treated the same way by all targets.
How SDIV gets expanded in the backend really depends is always target specific.
By default, SDIV is expanded into a sequence of shifts+adds (this is the behavior on X86). Other targets may not implement that same default behavior.
For example, Aarch64 custom expands SDIV bu Pow2 in a different way (see AArch64TargetLowering::BuildSDIVPow2).

Also, some targets may want to define TLI.isPow2DivCheap... so, as you can see, the problem is complicated.

In D4971#29, @andreadb wrote:

We can safely do this only for UDIV. UDIV by pow-2 is always converted to a SRL. This is true for all targets.

Does it make sense to split this patch into 2 pieces then? One that handles UDIV universally, and then a follow-on for SDIV. I was going to suggest additional test cases for each op anyway. :)

However, we cannot guarantee that SDIV is treated the same way by all targets.
How SDIV gets expanded in the backend really depends is always target specific.
By default, SDIV is expanded into a sequence of shifts+adds (this is the behavior on X86). Other targets may not implement that same default behavior.
For example, Aarch64 custom expands SDIV bu Pow2 in a different way (see AArch64TargetLowering::BuildSDIVPow2).

Right - Aarch64 has extra goodness via the rounding constant in "usra"; this is shown in the bug ( http://llvm.org/bugs/show_bug.cgi?id=20714 ) that Jim filed.

But we could still have a conservative upper cost bound for all targets? As you noted, division by exactly "2" costs one inst less, and there's one more instruction if we change this code to handle negative divisors, so we're not getting an exact cost value in any case.

Also, some targets may want to define TLI.isPow2DivCheap... so, as you can see, the problem is complicated.

I think PPC is the only arch that sets it (which seems like a bug to me, but I'm probably missing the reason; the PPC backend turns scalar signed int pow2div into sra/addze anyway).

I assume the intent of that flag is to say that the HW itself recognizes pow2div (signed or unsigned?) and can do it just as fast as a shift. But I'm not aware of any vector ISA that even includes an integer division instruction.

In D4971#30, @spatel wrote:

I assume the intent of that flag is to say that the HW itself recognizes pow2div (signed or unsigned?) and can do it just as fast as a shift.
But I'm not aware of any vector ISA that even includes an integer division instruction.

Let me try to make my point here clearer: if there's no vector integer division instruction to use, then I think the value of isPow2DivCheap() is irrelevant; we have to implement division using shift(s) anyway.

I don't know if checking for the existence of a vector division instruction is available at this level of LLVM, so the point may be moot...or that could be added as another target feature flag?

On Thu, Aug 21, 2014 at 11:16 AM, Andrea Di Biagio <andrea.dibiagio@gmail.com> wrote:

I think it makes perfect sense :-).
I have a (maybe stupid) question: do we really have to worry about the
case of UDIV by powers-of-2 in the vectorizer?
I am asking this because the optimizer would always convert an UDIV by
powers-of-2 into a SRL. So, by the time we run the vectorizer, all the
foldable UDIV by power-of-2 have been already optimized into SRL..>

That's an excellent question...
I converted both of the current test cases in the patch to 'udiv', and we already get vectorized 'lshr' with opt -O2. :)

Whether we need extra code down in TargetTransformInfo for UDIV as a safeguard, I don't know.

karthikthecool edited edge metadata.

Hi Andrea,Sanjay,
It seems like i missed out on some intresting discussion yesterday.
Updating the patch addressing Andrea and Sanjay's comments.

Intrestingly yes UDIV by power of 2 is actually coverted into SRL before it reaches vectorization code so i think checking for UDIV with power of 2 is as good as dead code here. Removed the same.

Opd1PropInfo,Opd2PropInfo actually doesn't effect cost of Instruction::AShr,Instruction::LShr,Instruction::Add as it is used only by SDIV. Having said that i agree it is incorrect to pass old property while getting the cost although they are not used as of now. So updated it to use OP_None.

Updated formatting comments given by Sanjay.
Since now we are only checking for signed division by power of 2 the existing test cases should suffice?

Thanks again for your interest in the patch. Does this now look good to commit?

Thanks and Regards
Karthik Bhat

Patch LGTM. Thanks!

Hi Andrea,Sanjay,
It seems like i missed out on some intresting discussion yesterday.
Updating the patch addressing Andrea and Sanjay's comments.

Intrestingly yes UDIV by power of 2 is actually coverted into SRL before it reaches vectorization code so i think checking for UDIV with power of 2 is as good as dead code here. Removed the same.

Opd1PropInfo,Opd2PropInfo actually doesn't effect cost of Instruction::AShr,Instruction::LShr,Instruction::Add as it is used only by SDIV. Having said that i agree it is incorrect to pass old property while getting the cost although they are not used as of now. So updated it to use OP_None.

Updated formatting comments given by Sanjay.
Since now we are only checking for signed division by power of 2 the existing test cases should suffice?

Thanks again for your interest in the patch. Does this now look good to commit?

Thanks and Regards
Karthik Bhat

spatel added inline comments.Aug 22 2014, 7:53 AM
test/Transforms/SLPVectorizer/X86/powof2div.ll
9

Does this test case require 3 operands and adds, or can it be simplified to just do the sdivs?

Please also use CHECK-LABEL for both test cases to make it easier for future additions in each file. If the output is entirely predictable, use CHECK-NEXT's and specify each instruction including the 'ret'.

Hi Karthik,
We haven't answered the biggest question that I have about this patch:
Can we supply a default cost calculation for SDIV that uses the sra/srl/add/sra sequence that is generated by default by DAGCombiner? If the answer is yes, then all targets will benefit. There should be a way to tie the cost calculation to whatever is implemented in "BuildSDIVPow2()" - so if a target is overriding that, they can also override the default cost calculation if they'd like.

Since you already have LGTM from the other reviewers, I won't hold up this patch, but please add a FIXME comment somewhere if you commit this only for x86 or file a bug so we don't lose track of the issue.

spatel added inline comments.Aug 24 2014, 3:57 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5848

Hi All,
Updated patch as per Sanjay's comment and submitted as r216371.

I think the reason we have different cost tables for different targets is because it may not be possible to come up with a common upper bound which can be profitable in case of all targets.
For e.g. sra/srl/add/sra sequence may be the upper bound for X86 but this may/may not be true for other targets.
One more advantage which i feel of having target specific cost table is that it is easily extendable in case a new intruction is added to support some feature.

I'm not that familiar with DAGCombiner as of now will have a look at it and get back on this.
Thanks and Regards
Karthik Bhat

Hi Karthik -
Thanks for adding the FIXME comment.
Your patch did not address the inline comments I made regarding the test case for the SLP vectorizer and using dyn_cast<>. Please update.