Page MenuHomePhabricator

Split the store of an int value merged from a pair of smaller values into multiple stores.
ClosedPublic

Authored by wmi on Jul 26 2016, 6:22 PM.

Details

Summary

The patch is to improve the code efficiency for the case described in https://llvm.org/bugs/show_bug.cgi?id=28726

For the instruction sequence of int64 store below, %int_tmp and %float_tmp are bundled together as an int64 data before stored into memory. If the int64 data is not used outside of the store, it is more efficient to generate separate stores for %int_tmp and %float_tmp.

Instruction sequence of int64 Store:

%ref.tmp = alloca i64, align 8
%1 = bitcast float %float_tmp to i32
%sroa.1.ext = zext i32 %1 to i64
%sroa.1.shift = shl nuw i64 %sroa.1.ext, 32
%sroa.0.ext = zext i32 %int_tmp to i64
%sroa.0.insert = or i64 %sroa.1.shift, %sroa.0.ext
store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8

Instruction sequence of separate stores:

%ref.tmp = alloca i64, align 8
%1 = bitcast i64* %ref.tmp to i32*
store i32 %int_tmp, i32* %1, align 4
%2 = getelementptr i32, i32* %1, i64 1
%3 = bitcast i32* %2 to float*
store float %float_tmp, float* %3, align 4

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 65636.Jul 26 2016, 6:22 PM
wmi retitled this revision from to [InstCombine] Split int64 store into separate int32 stores .
wmi updated this object.
wmi added reviewers: majnemer, chandlerc.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: llvm-commits, davidxl.
arsenm added a subscriber: arsenm.Jul 26 2016, 6:35 PM

This looks target dependent, and also too specific for the 64-bit case, so probably shouldn't go in instcombine. This transform would be generally undesirable on my target

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1278 ↗(On Diff #65636)

The alloca is not necessary for the example (and was misleading to me)

majnemer added inline comments.Jul 26 2016, 6:35 PM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1287–1292 ↗(On Diff #65636)

Is this more canonical than insertelement into a vector and performing a single store?
I'm not sure we have a canonicalization for this sort of thing, I'm not sure what the best thing to do is here...

Chandler, what are your thoughts?

1289 ↗(On Diff #65636)

This should be align 8 no?

chandlerc added inline comments.Jul 26 2016, 9:30 PM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1287–1292 ↗(On Diff #65636)

I agree with Matt generally -- the original bug should be handled by *splitting* loads and stores late in the pipeline with target specific knowledge. SDAG makes a lot of sense here as a target-independent in representation but target-specific in optimality transform.

Regarding the higher level question you pose David, I think that an integer is the most canonical thing here. I'll try to explain some of why.

The first question is "is one store more canonical than two stores". IMO, the answer should almost always be "yes". The reason is that splitting loads and stores is so much easier than merging. Later on in the pipeline we may have a very hard time effectively merging memory accesses due to memory model constraints, while earlier we may have more information that proves these things are safe. We have consistently moved in this direction both in LLVM and even in Clang itself.

The second question is "what type is the most canonical type?" I think there are no really good answers to that question. When merging adjacent values (as happens at the ABI level for pairs and structs, and in SROA, and elsewhere) there are a few options: a first-class-aggregate of the adjacent types, a vector of integers where the adjacent types are the same width, or an integer the size of the pair.

We've been moving LLVM away from first-class aggregate loads and stores for a while. In retrospect, I'm no longer convinced this was the correct decision, but reversing it would be a huge undertaking and not very easy to do effectively without crippling optimizations.

We also try to avoid forming novel vector types historically because we were worried about how they would lower. These days, we actually do a pretty fantastic job of this, and so I kind of re-evaluated this. However, there are still ways in which a vector of integer types falls down: non-uniform merges (an i32 next to 4 i8s for example), the fact that it still requires float -> int, and worst of all when the non-uniformity extends to overlapping things like unions.

Because of all of the last set of complications, a wide integer type seems a pretty reasonable way to express "a bag of bits". And expressing the extract in terms of bit math is nice because when the components are integers, we often end up effectively combining across the extract operations to further simplify things.

So I'm pretty happy with large integer types in the middle end for canonicalization. If we have lots of missed middle-end optimizations because of that representation that would be fixed by a different canonical form, we should definitely revisit this, but currently the test cases have largely involved missing lowering logic late in the backend.

wmi added a comment.Jul 26 2016, 10:40 PM

Matt, David and Chandler, thanks for the review.

I only looked at the code for x86-64, powerpc and aarch64 targets. For those targets, separate stores seemed better. I am not familiar with target like AMDGPU. I guess wider store is generally preferred than multiple narrower stores on AMDGPU? Since it may be undesirable on some targets, I agree it is more appropriate to implement it in SDAG pass. I will update the patch.

wmi updated this revision to Diff 66359.Aug 1 2016, 12:55 PM
wmi retitled this revision from [InstCombine] Split int64 store into separate int32 stores to Split the store of an int value merged from a pair of smaller values into multiple stores..

Addressed comments from Matt, David, and Chandler.
Two major changes:

  1. reimplement the split in DAGCombiner.
  2. remove the restriction of int64.
wmi updated this revision to Diff 66976.Aug 5 2016, 11:58 AM

Some more changes:

  1. add x86_64-unknown-linux-gnu triple for test.
  2. I found it was too limited for the optimization to say the value of store can only had one use so I removed it. It is possible that it has more than one use but the other uses are in cold blocks. After store splitting and machineSinking, those bitwise instructions will be moved to colder places. However, existing ISel cannot look beyond BB boundary so it is impossible to check other uses. Can we blindly do this (splitting an int64 store to two on x86 may not be very bad for performance)? I really hope we have globalISel here, or is there existing better solution for it? I have done some internal testing for it. With the extension of more than one uses of store value, I see 3% improvement for one internal benchmark (with D23210) and no regressions.
chandlerc edited edge metadata.Aug 25 2016, 8:44 PM

Sorry, I had lost track of this, but I can help finish up the review here.

In D22840#507217, @wmi wrote:

Some more changes:

  1. add x86_64-unknown-linux-gnu triple for test.
  2. I found it was too limited for the optimization to say the value of store can only had one use so I removed it. It is possible that it has more than one use but the other uses are in cold blocks. After store splitting and machineSinking, those bitwise instructions will be moved to colder places. However, existing ISel cannot look beyond BB boundary so it is impossible to check other uses. Can we blindly do this (splitting an int64 store to two on x86 may not be very bad for performance)? I really hope we have globalISel here, or is there existing better solution for it? I have done some internal testing for it. With the extension of more than one uses of store value, I see 3% improvement for one internal benchmark (with D23210) and no regressions.

I agree that the one-use test is too restrictive.

I think for x86, doing this blindly when it is a float and an int is a clear and unambiguous win.

I think it is much less clear for pairs of integers... Does restricting this to only cases with a mixture of floating point and integers still capture all of the improvements? If so, I'd start there.

To implement this, I'd actually detect the two elements merged into a larger integer, and pass those two values to the predicate routine so that targets can inspect them as part of determining whether split stores would be better. Then you can have the x86 target check for a mixture of FP and int types to trigger split stores.

Even if we end up wanting more cases to be split for x86, I think this is probably a good way to structure the predicate so that targets can make more detailed decisions about this kind of thing.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12159–12162 ↗(On Diff #66976)

No need for braces around the outer if here, although it probably won't matter based on the refactoring I suggested...

12173–12184 ↗(On Diff #66976)

You can write DAG examples using a more brief form that is pretty common:

/// (store (or (zext (bitcast F to i32) to i64),
///            (shl (zext I to i64), 32)), addr)
12217–12221 ↗(On Diff #66976)

A common slightly shorter idiom than this that we use is to test if Op1 is *not* SHL, and then swap, test again and bail, and then you know Op1 is the SHL and you can directly extract the Hi value.

12247 ↗(On Diff #66976)

FWIW, here is where I would do the target query, passing in the inputs to the zero extends.

Then in the x86 implementation of the target query, I would look for a bitcast from a float, or whatever we end up with for the heuristic.

wmi added a comment.Aug 26 2016, 8:38 AM

Chandler, thanks for the review!

For the case I saw performance change, it only contains int-float pair. I agree int-float pair is a more clear win: for int-float pair, splitting wide store will save two bitwise instructions and one float-to-int conversion instruction but increase a store instruction, so it is two instructions saving. For int-int pair, it is only one instruction saving so the saving is more blurred. Good suggestion, I will implement the target query.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12159–12162 ↗(On Diff #66976)

Ok.

12173–12184 ↗(On Diff #66976)

Ok. Will change it.

12217–12221 ↗(On Diff #66976)

That is definitely better. Will change it to swap.

12247 ↗(On Diff #66976)

Thanks. I feel the target will end up here as you suggested.

wmi updated this revision to Diff 69450.Aug 26 2016, 4:38 PM
wmi edited edge metadata.

Change the input of the target query to be elements of the value pair before they are merged. The target query only returns true on x86 when the input is a mixture of int and fp values.
Testcase update since now wide store of int value pair will not be splitted for now.

The approach here looks awesome.

Mostly really minor nits about comments, and some suggestions to minimize and clean up the test case.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12208 ↗(On Diff #69450)

I would just say "Sometimes" here to avoid people defaulting a new target into splitting the stores without measuring.

lib/Target/X86/X86ISelLowering.h
769–771 ↗(On Diff #69450)

It's probably also worth noting that beyond the instruction count difference, there is potentially a more significant benefit because it avoids the float->int domain switch for input value. I have a suspicion that is a significant part of the savings here.

778–780 ↗(On Diff #69450)

And here it is probably worth mentioning the other upside of only doing a single memory operation in addition to having minimal instruction count overhead.

test/Transforms/InstCombine/split-store.ll
16–31 ↗(On Diff #69450)

It would be good to minimize this test case rather than keeping it so close to what happens to come out of Clang for a particular C++ input.

I'd just write direct tests in LLVM IR. Something like:

define void @test1(i32 %i, float %f, i64* %ptr) {
entry:
  ... = zext ...
  ... = bitcast ...
  ... = zext ...
  ... = shl ...
  ... = or ...
  store ..., i64* %ptr
  ret void
}
wmi added inline comments.Aug 26 2016, 10:36 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12208 ↗(On Diff #69450)

Fixed.

lib/Target/X86/X86ISelLowering.h
769–771 ↗(On Diff #69450)

Comment added

778–780 ↗(On Diff #69450)

I am not sure here. The added memory operation is a store. Store usually will not stall the pipeline because there is store buffer, so I feel the cost of the extra store will be low?

test/Transforms/InstCombine/split-store.ll
16–31 ↗(On Diff #69450)

That makes the test much shorter!

wmi updated this revision to Diff 69473.Aug 26 2016, 10:39 PM

Address Chandler's comments: adjust comments and simplify testcase.

Hmmm. I'm not seeing the updated test cases? Not sure if phabricator is just not updating for me or what though...

lib/Target/X86/X86ISelLowering.h
778–780 ↗(On Diff #69473)

I agree the cost will be low, I just think it makes sense to mention the different aspects (for example, saving an entry in the store buffer).

wmi added a comment.Aug 27 2016, 9:55 AM

It is weird. However, I just realized in your example, alloca was replaced with a func param. It can make the test more shorter, so I updated the test again.

lib/Target/X86/X86ISelLowering.h
778–780 ↗(On Diff #69473)

Got it. Added the comment.

wmi updated this revision to Diff 69494.Aug 27 2016, 9:57 AM

Update comment and testcase.

chandlerc accepted this revision.Sep 1 2016, 2:49 PM
chandlerc edited edge metadata.

LGTM, feel free to submit. Just a suggestion on further simplification of the tests below, but all the important stuff is already there.

Also, this is a really nice change, thanks for working on it and getting the layering figured out so nicely here.

test/Transforms/InstCombine/split-store.ll
9 ↗(On Diff #69494)

You can also probable remove all the function attributes here.

This revision is now accepted and ready to land.Sep 1 2016, 2:49 PM
This revision was automatically updated to reflect the committed changes.