This is an archive of the discontinued LLVM Phabricator instance.

[CodeGenPrepare] Move extractelement close to store if they can be combined.
ClosedPublic

Authored by qcolombet on Oct 22 2014, 3:55 PM.

Details

Summary

Hi,

This patch adds an optimization in CodeGenPrepare to move an extractelement right before a store when the target can combine them.
The optimization may promote scalar operations to vector operations along the way to make that possible.

Thanks for your feedback.

  • Context

Some targets use different register files for both vector and scalar operations. This means that transitioning from one domain to another may incur copy from one register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR move is 20 cycles.

  • Motivating Example

Let us consider an example:

define void @foo(<2 x i32>* %addr1, i32* %dest) {
  %in1 = load <2 x i32>* %addr1, align 8
  %extract = extractelement <2 x i32> %in1, i32 1
  %out = or i32 %extract, 1
  store i32 %out, i32* %dest, align 4
  ret void
}

As it is, this IR generates the following assembly on armv7:
vldr d16, [r0] @ vector load
vmov.32 r0, d16[1] @ cross-register-file copy: 20 cycles
orr r0, r0, #1 @ scalar bitwise or
str r0, [r1] @ scalar store
bx lr

Whereas we could generate much faster code:
vldr d16, [r0] @ vector load
vorr.i32 d16, #0x1 @ vector bitwise or
vst1.32 {d16[1]}, [r1:32] @ vector extract + store
bx lr

Half of the computation done in the vector domain is useless, but this allows us to get rid of the expensive cross-register-file copy.
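For reference, the promoted IR the pass aims for might look like the following (a hand-written sketch, not the pass's verbatim output; the filler lane of the constant uses a splat here, though later revisions of the patch use undef in unused lanes):

```llvm
define void @foo(<2 x i32>* %addr1, i32* %dest) {
  %in1 = load <2 x i32>* %addr1, align 8
  ; The scalar 'or' is promoted to a vector 'or'; lane 0 is a don't-care.
  %out = or <2 x i32> %in1, <i32 1, i32 1>
  ; The extractelement now feeds the store directly, so the target can
  ; combine the pair into a single lane store (vst1.32 on ARM).
  %extract = extractelement <2 x i32> %out, i32 1
  store i32 %extract, i32* %dest, align 4
  ret void
}
```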

  • Proposed Solution

To avoid this cross-register-file copy penalty, we promote the scalar operations to vector operations. The penalty is removed if we manage to promote the whole chain of computation into the vector domain.
Currently, we do that only when the chain of computation ends with a store and the target is able to combine an extract with a store.

Stores are the most likely candidates, because other instructions produce values that would need to be promoted and thus extracted at some point [1]. Moreover, it is common for targets to feature stores that perform a vector extract (see AArch64 and X86 for instance).

The proposed implementation relies on the TargetTransformInfo to decide whether or not it is beneficial to promote a chain of computation in the vector domain. Unfortunately, this interface is rather inaccurate for this level of details and although this optimization may be beneficial for X86 and AArch64, the inaccuracy will lead to the optimization being too aggressive.
Basically, in TargetTransformInfo everything that is legal has a cost of 1, whereas even if a vector type is legal, a vector operation is usually slightly more expensive than its scalar counterpart. That will lead to too many promotions that may not be counterbalanced by the saving of the cross-register-file copy. For instance, on AArch64 this penalty is just 4 cycles.

For now, the optimization is enabled only for ARM prior to v8, since those processors have a larger penalty on cross-register-file copies, and the scope is limited to basic blocks. These two factors limit the effects of the inaccuracy. Indeed, I did not want to build a fancy cost model with block frequency and everything on top of that.

[1] We can imagine targets that can combine an extractelement with instructions other than stores. If we want to go in that direction, the current interfaces must be augmented and, moreover, I think this becomes a global isel problem.

Thanks,
-Quentin

Diff Detail

Event Timeline

qcolombet updated this revision to Diff 15279.Oct 22 2014, 3:55 PM
qcolombet retitled this revision from to [CodeGenPrepare] Move extractelement close to store if they can be combined..
qcolombet updated this object.
qcolombet edited the test plan for this revision. (Show Details)
qcolombet added a reviewer: hfinkel.
qcolombet set the repository for this revision to rL LLVM.
hfinkel added inline comments.Oct 23 2014, 2:11 PM
include/llvm/Target/TargetLowering.h
268

I would really like this to be a callback that passes the type and the extraction index. For PowerPC, for example, I can make this free for some types, but only for index 0. For other indices, I can do it with a shuffle (which obviously has some extra cost, although not as much as the load/store-based domain crossing).

IMHO, a great interface for this could be:
int getStoreExtractCombineCost(Type *VectorTy, Value *Idx);

lib/CodeGen/CodeGenPrepare.cpp
3268

You should add: because the target has asserted that it can combine the extract with the store (for free).

3318

What is the complication to handling this case?

If you really can't test this on ARM, please explicitly note this as FIXME.

3328

You also need to exclude division (because introducing a division by undef is a bad idea), or make sure that if you do this for division, then the other fields are set to some non-zero quantity.
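To make the division concern concrete: if the value being promoted ends up on the right-hand side of a division, an undef filler lane may be read as zero, making the whole vector division undefined (hypothetical sketch; lane 1 carries the real computation, as in the motivating example):

```llvm
; Unsafe: the filler lane of the divisor is undef, which may be read as 0.
%bad = udiv <2 x i32> %num, <i32 undef, i32 7>
; Safe: fill the unused lane with a non-zero constant such as 1.
%ok = udiv <2 x i32> %num, <i32 1, i32 7>
```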

hfinkel added inline comments.Oct 23 2014, 2:13 PM
lib/CodeGen/CodeGenPrepare.cpp
3391

Not withstanding my previous comment on division, we could really put more undefs in this vector instead of always using a splat.

Hi Hal,

Thanks for the feedback, in particular for catching the corner case with division.

I’ll update the patch shortly; in the meantime, a couple of inline comments.

Cheers,
-Quentin

include/llvm/Target/TargetLowering.h
268

Although I like the direction of this, how do we get the cost of the original sequence?
In other words, what value do you propose we use to decide whether or not it is worth kicking this optimization given those two parameters?

Indeed, I expect that we will replace the condition that guards the optimization with something like:
TLI->getStoreExtractCombineCost(ExtractInst->getType(), ExtractInst->getOperand(1)) < CostOfIndividualStoreAndExtract

I assume you were thinking "== 0" since you talked about free. Is that what you had in mind?

lib/CodeGen/CodeGenPrepare.cpp
3268

Agreed.

3318

None, I just have not seen any motivating example.
I’ll see if I can come up with something, otherwise I will put a FIXME.

3328

Good point, the problem may arise if the non-constant vector is used on the right hand side.

3391

Although it would give more freedom to the lowering to use undef, I was afraid the backends would be able to simplify the sequence to a scalar operation. A quick test proved me wrong.
So I will update with more undefs (unless we are on the right-hand side of a division :)).
Are you seeing other cases where we shouldn’t use undef?

hfinkel added inline comments.Oct 23 2014, 10:38 PM
include/llvm/Target/TargetLowering.h
268

Thinking about this, the interface I proposed will be difficult to use (because it seems like you'd need to cost-model the entire transformation in order to figure anything out). How about:

bool canCombineStoreAndExtract(Type *VectorTy, Value *Idx, unsigned &Cost)

where, if true is returned and we move forward with modeling the transformation, we add Cost to VectorCost in isProfitableToPromote. This seems better.

qcolombet added inline comments.Oct 24 2014, 10:34 AM
include/llvm/Target/TargetLowering.h
268

Agreed.
Thanks!

qcolombet updated this revision to Diff 15446.Oct 24 2014, 3:41 PM
  • Do not introduce undefined behavior when the variable to be promoted is on the rhs of a division.
  • Support ConstantFP value for promotion.
  • Change the target hook to something more precise that takes the vector type and the index of the extract.
  • Use as many undefs as possible when materializing the vector constant.
  • Add a stress mode to cover all the added features even when they are not relevant for ARM.
  • Update the test cases: ConstantFP, division (def and undef), splat constant, and as many undefs as possible.

Hi Hal,

I think I have incorporated all your feedback with this update.
However, I would like your opinion on the target hook. In particular, I realized that we are missing a correlation with alignment.
Indeed, we can imagine targets that can perform the store(extract) combine only for certain alignments (which is not the case for ARM; it supports unaligned memory accesses for vectors).
I thought about adding another out parameter for the alignment, but some targets may have different costs for different alignments. Therefore, that wouldn't be sufficient.

I think we could either:

  1. Have an out parameter (or another getter) that is a map of alignment => cost (sounds expensive though).
  2. Have a second target hook that gives the actual cost when we know the alignment.

Assuming we go for #2, the general approach would be:

  • Check that the target supports the store(extract) combine for the given vector type and index (no cost involved).
  • Find an opportunity.
  • Check for the cost of this combine.
  • Promote or not according to the profitability.

What do you think?

Thanks,
-Quentin

hfinkel edited edge metadata.Oct 26 2014, 1:21 AM
In D5921#13, @qcolombet wrote:

Hi Hal,

I think I have incorporated all your feedback with this update.
However, I would like your opinion on the target hook. In particular, I realized that we are missing a correlation with alignment.
Indeed, we can imagine targets that can perform the store(extract) combine only for certain alignments (which is not the case for ARM; it supports unaligned memory accesses for vectors).
I thought about adding another out parameter for the alignment, but some targets may have different costs for different alignments. Therefore, that wouldn't be sufficient.

I think we could either:

  1. Have an out parameter (or another getter) that is a map of alignment => cost (sounds expensive though).
  2. Have a second target hook that gives the actual cost when we know the alignment.

Assuming we go for #2, the general approach would be:

  • Check that the target supports the store(extract) combine for the given vector type and index (no cost involved).
  • Find an opportunity.
  • Check for the cost of this combine.
  • Promote or not according to the profitability.

What do you think?

I think that, for the time being, you should call TLI's existing allowsMisalignedMemoryAccesses hook, and only perform the transformation if the function returns true and also Fast == true. Otherwise, abort. If someone needs more control than this, then we'll have a concrete use case and can design around it.


hfinkel added inline comments.Oct 26 2014, 1:32 AM
lib/CodeGen/CodeGenPrepare.cpp
3362

Also urem, srem, frem

For fdiv and frem, allow the transformation if the instruction has the nnan fast-math flag.

3446

Don't repeat this logic; extract it into a function.

qcolombet updated this revision to Diff 15496.Oct 27 2014, 11:37 AM
qcolombet edited edge metadata.
  • Refactor the logic to test whether or not we may introduce undefined behavior.
  • Add support for frem, urem, and srem.
  • Add support for fast-math.
  • Update tests.
qcolombet updated this revision to Diff 15497.Oct 27 2014, 11:46 AM
  • Check whether or not the store to be combined is properly aligned.

Hi Hal,

Thanks for the new inputs.
I’ve addressed the <u|s|f>rem point and added support for fast-math.
However, I haven’t used the Fast information together with the store alignment. Indeed, I am not sure it gets us anything. It tells us whether or not a store is slow, but since we are not changing the type of the store, it does not tell us whether we are slowing down the code or whether the combine is still supported.
We could assume "slow store" == "no combine", but I would prefer, as you said, to wait for actual use cases before doing heuristics in this domain.

Thoughts?

Thanks again,
-Quentin

However, I haven’t used the Fast information together with the store alignment. Indeed, I am not sure it gets us anything. It tells us whether or not a store is slow, but since we are not changing the type of the store, it does not tell us whether we are slowing down the code or whether the combine is still supported.
We could assume "slow store" == "no combine", but I would prefer, as you said, to wait for actual use cases before doing heuristics in this domain.

Thoughts?

I agree. Thinking about it, checking Fast is actually not really right (Fast == false likely indicates that the unaligned case is handled by some OS trap handler, for example, which really has nothing to do with instruction selection, and hurts the case where the data is actually aligned).

lib/CodeGen/CodeGenPrepare.cpp
3280

Why are you also checking against the ABI alignment?

3367

This should be !Use->hasNoNaNs(); (I'd like to keep us from over-generalizing these kinds of checks and just using 'unsafe algebra' for everything -- we can be more specific here).

qcolombet updated this revision to Diff 15597.Oct 30 2014, 6:00 PM
  • Remove a copy-paste-o.
  • Use a more specific fast-math flag.

Hi Hal,

Here is the updated version.

Thanks again for the careful review!

Ok to commit?

Cheers,
-Quentin

lib/CodeGen/CodeGenPrepare.cpp
3280

Copy-paste-o.

3367

Makes sense, thanks.

hfinkel accepted this revision.Oct 30 2014, 10:02 PM
hfinkel edited edge metadata.

Yes, LGTM. Thanks!

This revision is now accepted and ready to land.Oct 30 2014, 10:02 PM
qcolombet closed this revision.Oct 31 2014, 11:54 AM

Thanks Hal.

Committed r220978.