This is an archive of the discontinued LLVM Phabricator instance.

Add a pass to lift aggregate into allocas so SROA can get rid of them.
AbandonedPublic

Authored by deadalnix on Aug 23 2015, 12:47 AM.

Details

Summary

As per title. After various discussion, it seems that this may be a better approach than transforming them into scalar. The pass itself does not very much to optimize, but transform the IR is a way that subsequents passes, mainly SROA, can consume and optimize.

Diff Detail

Event Timeline

deadalnix updated this revision to Diff 32921.Aug 23 2015, 12:47 AM
deadalnix retitled this revision from to Add a pass to lift aggregate into allocas so SROA can get rid of them..
deadalnix updated this object.
deadalnix added a subscriber: llvm-commits.
mehdi_amini edited edge metadata.Aug 23 2015, 9:49 PM

Hi Amaury,

Overall that seems pretty good, I'm glad you find a way to leverage SROA.

I feel the implementation could benefit from more comments/description. You'll find some comments inline.

Best,

Mehdi

lib/Transforms/IPO/PassManagerBuilder.cpp
211

I think this should probably go just before SROA, either at the three places it is inserted in this file, or only before the one in the FPM. I tend to think the latter is more appropriate, unless there is room for the inliner to expose more cases than can be handled.

What is the impact of this transformation on the code generated by clang? Should it be enabled by a flag only?

lib/Transforms/Scalar/AggregateLifter.cpp
11

s/ans/and/

12

s/rewrting/rewriting/

78

It seems this is needed because FixupWorklist is a map. But it is not clear to me why are you needing the map behavior?

92

for-range?

112

getTypeStoreSize() is fine for the memcpy I think.
But since you get it just before creating the alloca, the name Size for the variable can be ambiguous. What about either moving it closer to its use, or renaming it something like StoreSize.

133

What about refactoring with an AggregateLifterPass that creates a transient object AggregateLifter which is not a pass, takes a reference to a Function and keep it as a member, and initialize a DataLayout reference member. (and make liftStore a private member)
This pattern avoid having to dereference multiple pointer all over the place to get back to the parent Function and the DataLayout. Also having less state in the Pass itself seems cleaner to me in general. I see the pass as a wrapper for the transformation only (single responsibility). I also believe it makes it easier to reuse, for instance in the new PassManager interface.

137

Smaller is not a correctness issue right? I think larger might be more problematic.

(Also at this point you might be storing somewhere else than at the beginning of the alloca, processing a GEP)

deadalnix added inline comments.Aug 23 2015, 10:37 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
211

Yeah i inserted it pretty randomly. clang should not be affected by this as it doesn't generate aggregate loads/stores.

lib/Transforms/Scalar/AggregateLifter.cpp
78

In this diff, I don't really need. But you onlyhave the "top-down" part of the problem here : you get a load and lift from there.

In the case you get a store of a value that not come from a load to begin with, you need to go bottom up ie you lift starting from the store. While going bottom-up, you need to check if some plan exists for other aggregate you encounter, and this is where the map semantic is needed.

92

Is there a simple way to go backward using range ? construct à la range(rbegin rend) are not very handy.

133

Sounds good.

137

Yeah the comment is wrong.<

Also, I'm not sure what if, that is exactly why this comment is here. Thinking more about it, I think there may be a correctness concern here, I have to think more about this.

majnemer added inline comments.Aug 23 2015, 10:51 PM
lib/Transforms/Scalar/AggregateLifter.cpp
133

I imagine we'd want to run the utility from within CGP so that frontends can rely on CodeGen succeeding in -O0 situations.

deadalnix added inline comments.Aug 25 2015, 1:01 AM
lib/Transforms/Scalar/AggregateLifter.cpp
133

I'm not sure. This pass most likely make thing worse by itself. Without getting other passes to optimize the result, I'm not sure this is worth doing.

Maybe it could be integrated to SROA itself.

deadalnix updated this revision to Diff 33182.Aug 25 2015, 11:37 PM
deadalnix removed a reviewer: chandlerc.

Various nits.
Use FunctionLifter as a per function state.

Is the approach agreed upon ? If so, what is the best place to put this in ?
@chandlerc You seems to me the SROA guy, do you think this could be integrated with in SROA ? If yes, what is the best way to proceed ? If no, what is a good place to put this in ?

I'm really not sure this is the right approach, despite liking a number of
aspects and being optimistic early-on. But I've not yet had time to put my
thoughts here into a nice cohesive email. I'll do that as soon as I can to
help resolve this.

deadalnix updated this revision to Diff 34234.Sep 8 2015, 11:29 AM
deadalnix added a subscriber: chandlerc.

Rebase. @chandlerc , could you elaborate on your concerns ? Without any feedback from you, this is stuck in limbos.

deadalnix updated this revision to Diff 34740.Sep 14 2015, 3:03 PM

Rebase, ping ping ping

hfinkel edited edge metadata.Sep 18 2015, 8:03 AM

Chandler, please do comment on the approach. We should get this settled.

lib/Transforms/Scalar/AggregateLifter.cpp
14

You should provide some small examples here of what this actually means.

118

Don't need { } if there is only one conditional statement. Same comment applies to several places below.

134

Looks like we're dropping 'volatile' and any associated metadata here when we convert these loads.

deadalnix updated this revision to Diff 35305.Sep 21 2015, 1:59 PM
deadalnix edited edge metadata.

ping

deadalnix added inline comments.Sep 26 2015, 12:51 AM
lib/Transforms/Scalar/AggregateLifter.cpp
134

Aren't volatile not considered simple ? In which case, the transoformation shouldn't happen on them.

I indicated previously that I wasn't sure this is the correct approach. I wanted to clarify that.

The problem I see with the approach of lowering things as loads and stores of (perhaps excessively large) integers is that I think it creates a bad canonicalization problem. Let's consider what happens when we lower as i128 loads and stores of an aggregate. Later on, we may inline and end up access the low and high 64 bits as f64 types. But in order to do that, we would have to use shift and trunc instructions on the integer values. As a consequence of these shift and truncs, we would have integer operations on the bits and avoid properly canonicalizing loads and stores or the basic domain used. =/ I'm moderately worried about this.

The crux of the problem is that we would introduce *strongly typed* operations like 'shl', 'lshr' to extract the narrow regions from the former aggregate. The extractvalue (and insertvalue) instructions quite usefully can attach the type of the extracted or inserted value rather than any particular integer type. As a consequence, I think that we should model this by extracting each value from the FCA, and storing it separately.

For function arguments, this logic doesn't really seem to apply. There, I feel like frontends should be able to fully decompose things or fuse things into proper primitive types.

The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.

Now, some may wonder what about my vehement arguments about bitfields using large integer types for loads and stores and the memory model implications. Fundamentally, I think these are different concerns. There are two core differences. First, bitfields are *defined* as a single memory location by the frontend language, and so it is reasonable for us to go to some lengths to operate on them as such. I don't think that aggregates are actually so defined in any source languages I'm aware of (C, C++, ObjC, Go, Swift, CUDA, to name a few). But as an example, if a frontend wants to *specifically define* a load or store of an aggregate as an atomic load or store of N bits of memory, I'd rather the frontend choose to express that in an N-bit integer operation with the IR rather than a transform pass doing it.

The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.

Does this make sense to folks?

(Again, sorry for the delay writing this up. A decent chunk of it was taking quite a few hours to carefully think through all the consequences and convince myself that they were OK.)

-Chandler

I indicated previously that I wasn't sure this is the correct approach. I wanted to clarify that.

The problem I see with the approach of lowering things as loads and stores of (perhaps excessively large) integers is that I think it creates a bad canonicalization problem. Let's consider what happens when we lower as i128 loads and stores of an aggregate. Later on, we may inline and end up access the low and high 64 bits as f64 types. But in order to do that, we would have to use shift and trunc instructions on the integer values. As a consequence of these shift and truncs, we would have integer operations on the bits and avoid properly canonicalizing loads and stores or the basic domain used. =/ I'm moderately worried about this.

You seems to be operating under the previous proposal, aka promoting aggregate as large integer stores/loads. This is a new proposal, that transform the load/store into memcpy to allocs, and let SROA do its job from there. This way of doing things would avoid the problem you mention.

The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.

I do not think this is the only reason. Aggregate have memory layout that depend on various target characteristics and require some logic to compute offset and alignment of various fields in them. Duplicating that logic in every frontends seems like a net lose to me.

I'm not sure what you mean by TokenType, but in the current state of affair, it seems like we should support aggregate operations, at least.

The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.

I do think this proposal, that load from allocas, do not suffer from this problem.

I indicated previously that I wasn't sure this is the correct approach. I wanted to clarify that.

The problem I see with the approach of lowering things as loads and stores of (perhaps excessively large) integers is that I think it creates a bad canonicalization problem. Let's consider what happens when we lower as i128 loads and stores of an aggregate. Later on, we may inline and end up access the low and high 64 bits as f64 types. But in order to do that, we would have to use shift and trunc instructions on the integer values. As a consequence of these shift and truncs, we would have integer operations on the bits and avoid properly canonicalizing loads and stores or the basic domain used. =/ I'm moderately worried about this.

You seems to be operating under the previous proposal, aka promoting aggregate as large integer stores/loads. This is a new proposal, that transform the load/store into memcpy to allocs, and let SROA do its job from there. This way of doing things would avoid the problem you mention.

Sorry that I replied to the prior proposal. Anyways, this is... interesting.

So this *only* handles loads and stores of globals? Are those common? They aren't the common case of aggregate loads and stores that I have seen...

The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.

I do not think this is the only reason. Aggregate have memory layout that depend on various target characteristics and require some logic to compute offset and alignment of various fields in them. Duplicating that logic in every frontends seems like a net lose to me.

I guess we disagree then, but we don't need to debate that on this thread.

I'm not sure what you mean by TokenType, but in the current state of affair, it seems like we should support aggregate operations, at least.

The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.

I do think this proposal, that load from allocas, do not suffer from this problem.

Sure, this doesn't make anything worse. Again, sorry for missing the change in design, the threads have become fairly fragmented.

But I don't really understand what this is trying to improve so it is hard for me to evaluate it. It seems really silly to memcpy into allocas just so that SROA can split them and promote them again. That is a lot of compile time and effort for something that we already know how to handle.

If *all* you want to handle are loads and stores of aggregates from globals, you could take the FCA splitter that is already used in SROA and promote it to a separate pass and run it over non-alloca instructions.... This would work today AFAICT? (You would need to make it a utility as SROA would also want to run it internally.)

deadalnix added a comment.EditedOct 1 2015, 12:13 AM

What makes you think it is limited to globals ? It is for any load and/or store from memory in general. SROA doesn't touch theses. Personally, the case I'm trying to solve are aggregate access to memory that is freshly allocated. Other people have voiced interest in having these kind of load/store optimized, I can't speak for their reasons, but overall, this seems to be of interest. i used globals in the tests cases as it was easy, but this should not be limited to global, or that would be fairly useless.

Right now, this kind of operation is plain ignored by the optimizer. What I'm trying to do is transform these into something the rest of the pipeline understands and will process nicely. Lifting theses values into allocas create something that is not identical, but similar enough to what clang generate that the rest of the pipeline pick it up and optimize it nicely.

This basically transforms :

%1 = load { i8*, i64 }, { i8*, i64 }* %ptr
%2 = extractvalue { i8*, i64 } %1, 0
%3 = extractvalue { i8*, i64 } %1, 1

Into something like

%1 = alloca { i8*, i64 }
call void @llvm.memcpy(%1, %ptr, sizeof({ i8*, i64 })) ; pseudo code, you get the idea.
%2.lifted = gep { i8*, i64 }, { i8*, i64 }* %1, 0
%2 = load i8*, i8** %2.lifted
%3.lifted = gep { i8*, i64 }, { i8*, i64 }* %1, 1
%3 = load i64, i64* %3.lifted

As result, instead of having aggregate loads and stores, you get a memcpy and a set of non aggregate loads and stores. The rest of the pipeline pick up on this nicely and is able to optimize this away.

This has been made in its own pass as to have a POC up and see if it works well (it does, that is the model I got the best results with so far), but I indeed wondering if making SROA do it is not a better idea.

OK, so it is trying to optimize everything that isn't a load or store from an alloca.

I replied to the general concept at the bottom, and I think that still applies -- I don't think it's the right design to create allocas and memcpy instructions just for SROA to clean up. We should directly optimize the aggregate loads and stores.

@chandler , do you think adding this to SROA, so we don't lift thing to then let SROA optimize but rather optimize right away in SROA make sense ? If so, do you have some pointer on how to implement this in SROA ?

Doing it in SROA in D13379 . Seems like a better place than in its own pass.

deadalnix abandoned this revision.Oct 12 2015, 4:07 PM