As per title. After various discussion, it seems that this may be a better approach than transforming them into scalar. The pass itself does not very much to optimize, but transform the IR is a way that subsequents passes, mainly SROA, can consume and optimize.
Details
Diff Detail
Event Timeline
Hi Amaury,
Overall that seems pretty good, I'm glad you find a way to leverage SROA.
I feel the implementation could benefit from more comments/description. You'll find some comments inline.
Best,
Mehdi
| lib/Transforms/IPO/PassManagerBuilder.cpp | ||
|---|---|---|
| 211 | I think this should probably go just before SROA, either at the three places it is inserted in this file, or only before the one in the FPM. I tend to think the latter is more appropriate, unless there is room for the inliner to expose more cases than can be handled. What is the impact of this transformation on the code generated by clang? Should it be enabled by a flag only? | |
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
| 11 | s/ans/and/ | |
| 12 | s/rewrting/rewriting/ | |
| 78 | It seems this is needed because FixupWorklist is a map. But it is not clear to me why are you needing the map behavior? | |
| 92 | for-range? | |
| 112 | getTypeStoreSize() is fine for the memcpy I think. | |
| 133 | What about refactoring with an AggregateLifterPass that creates a transient object AggregateLifter which is not a pass, takes a reference to a Function and keep it as a member, and initialize a DataLayout reference member. (and make liftStore a private member) | |
| 137 | Smaller is not a correctness issue right? I think larger might be more problematic. (Also at this point you might be storing somewhere else than at the beginning of the alloca, processing a GEP) | |
| lib/Transforms/IPO/PassManagerBuilder.cpp | ||
|---|---|---|
| 211 | Yeah i inserted it pretty randomly. clang should not be affected by this as it doesn't generate aggregate loads/stores. | |
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
| 78 | In this diff, I don't really need. But you onlyhave the "top-down" part of the problem here : you get a load and lift from there. In the case you get a store of a value that not come from a load to begin with, you need to go bottom up ie you lift starting from the store. While going bottom-up, you need to check if some plan exists for other aggregate you encounter, and this is where the map semantic is needed. | |
| 92 | Is there a simple way to go backward using range ? construct à la range(rbegin rend) are not very handy. | |
| 133 | Sounds good. | |
| 137 | Yeah the comment is wrong.< Also, I'm not sure what if, that is exactly why this comment is here. Thinking more about it, I think there may be a correctness concern here, I have to think more about this. | |
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
|---|---|---|
| 133 | I imagine we'd want to run the utility from within CGP so that frontends can rely on CodeGen succeeding in -O0 situations. | |
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
|---|---|---|
| 133 | I'm not sure. This pass most likely make thing worse by itself. Without getting other passes to optimize the result, I'm not sure this is worth doing. Maybe it could be integrated to SROA itself. | |
Various nits.
Use FunctionLifter as a per function state.
Is the approach agreed upon ? If so, what is the best place to put this in ?
@chandlerc You seems to me the SROA guy, do you think this could be integrated with in SROA ? If yes, what is the best way to proceed ? If no, what is a good place to put this in ?
I'm really not sure this is the right approach, despite liking a number of
aspects and being optimistic early-on. But I've not yet had time to put my
thoughts here into a nice cohesive email. I'll do that as soon as I can to
help resolve this.
Rebase. @chandlerc , could you elaborate on your concerns ? Without any feedback from you, this is stuck in limbos.
Chandler, please do comment on the approach. We should get this settled.
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
|---|---|---|
| 14 | You should provide some small examples here of what this actually means. | |
| 118 | Don't need { } if there is only one conditional statement. Same comment applies to several places below. | |
| 134 | Looks like we're dropping 'volatile' and any associated metadata here when we convert these loads. | |
| lib/Transforms/Scalar/AggregateLifter.cpp | ||
|---|---|---|
| 134 | Aren't volatile not considered simple ? In which case, the transoformation shouldn't happen on them. | |
I indicated previously that I wasn't sure this is the correct approach. I wanted to clarify that.
The problem I see with the approach of lowering things as loads and stores of (perhaps excessively large) integers is that I think it creates a bad canonicalization problem. Let's consider what happens when we lower as i128 loads and stores of an aggregate. Later on, we may inline and end up access the low and high 64 bits as f64 types. But in order to do that, we would have to use shift and trunc instructions on the integer values. As a consequence of these shift and truncs, we would have integer operations on the bits and avoid properly canonicalizing loads and stores or the basic domain used. =/ I'm moderately worried about this.
The crux of the problem is that we would introduce *strongly typed* operations like 'shl', 'lshr' to extract the narrow regions from the former aggregate. The extractvalue (and insertvalue) instructions quite usefully can attach the type of the extracted or inserted value rather than any particular integer type. As a consequence, I think that we should model this by extracting each value from the FCA, and storing it separately.
For function arguments, this logic doesn't really seem to apply. There, I feel like frontends should be able to fully decompose things or fuse things into proper primitive types.
The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.
Now, some may wonder what about my vehement arguments about bitfields using large integer types for loads and stores and the memory model implications. Fundamentally, I think these are different concerns. There are two core differences. First, bitfields are *defined* as a single memory location by the frontend language, and so it is reasonable for us to go to some lengths to operate on them as such. I don't think that aggregates are actually so defined in any source languages I'm aware of (C, C++, ObjC, Go, Swift, CUDA, to name a few). But as an example, if a frontend wants to *specifically define* a load or store of an aggregate as an atomic load or store of N bits of memory, I'd rather the frontend choose to express that in an N-bit integer operation with the IR rather than a transform pass doing it.
The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.
Does this make sense to folks?
(Again, sorry for the delay writing this up. A decent chunk of it was taking quite a few hours to carefully think through all the consequences and convince myself that they were OK.)
-Chandler
You seems to be operating under the previous proposal, aka promoting aggregate as large integer stores/loads. This is a new proposal, that transform the load/store into memcpy to allocs, and let SROA do its job from there. This way of doing things would avoid the problem you mention.
The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.
I do not think this is the only reason. Aggregate have memory layout that depend on various target characteristics and require some logic to compute offset and alignment of various fields in them. Duplicating that logic in every frontends seems like a net lose to me.
I'm not sure what you mean by TokenType, but in the current state of affair, it seems like we should support aggregate operations, at least.
The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.
I do think this proposal, that load from allocas, do not suffer from this problem.
Sorry that I replied to the prior proposal. Anyways, this is... interesting.
So this *only* handles loads and stores of globals? Are those common? They aren't the common case of aggregate loads and stores that I have seen...
The only remaining reason why aggregates really exist are to support return types. I actually think that's OK, they're not a terrible syntax and mechanism for modeling multiple return types. I would particularly like it if the only thing you could reasonably do is extractvalue the independent values. This would make aggregates just the LLVM mechanism for doing multiple return types with proper SSA form. I think that's fine. Maybe some day we would even want to replace it with TokenType, for now aggregates provide a nice layer of type checking and sanity checking for returns.
I do not think this is the only reason. Aggregate have memory layout that depend on various target characteristics and require some logic to compute offset and alignment of various fields in them. Duplicating that logic in every frontends seems like a net lose to me.
I guess we disagree then, but we don't need to debate that on this thread.
I'm not sure what you mean by TokenType, but in the current state of affair, it seems like we should support aggregate operations, at least.
The second, and perhaps overriding reason why I think all of the arguments line up a different way for bitfields is that bitfields quite fundamentally *are* integers. We lose nothing by modeling them as such and create no canonicalization problems.
I do think this proposal, that load from allocas, do not suffer from this problem.
Sure, this doesn't make anything worse. Again, sorry for missing the change in design, the threads have become fairly fragmented.
But I don't really understand what this is trying to improve so it is hard for me to evaluate it. It seems really silly to memcpy into allocas just so that SROA can split them and promote them again. That is a lot of compile time and effort for something that we already know how to handle.
If *all* you want to handle are loads and stores of aggregates from globals, you could take the FCA splitter that is already used in SROA and promote it to a separate pass and run it over non-alloca instructions.... This would work today AFAICT? (You would need to make it a utility as SROA would also want to run it internally.)
What makes you think it is limited to globals ? It is for any load and/or store from memory in general. SROA doesn't touch theses. Personally, the case I'm trying to solve are aggregate access to memory that is freshly allocated. Other people have voiced interest in having these kind of load/store optimized, I can't speak for their reasons, but overall, this seems to be of interest. i used globals in the tests cases as it was easy, but this should not be limited to global, or that would be fairly useless.
Right now, this kind of operation is plain ignored by the optimizer. What I'm trying to do is transform these into something the rest of the pipeline understands and will process nicely. Lifting theses values into allocas create something that is not identical, but similar enough to what clang generate that the rest of the pipeline pick it up and optimize it nicely.
This basically transforms :
%1 = load { i8*, i64 }, { i8*, i64 }* %ptr
%2 = extractvalue { i8*, i64 } %1, 0
%3 = extractvalue { i8*, i64 } %1, 1Into something like
%1 = alloca { i8*, i64 }
call void @llvm.memcpy(%1, %ptr, sizeof({ i8*, i64 })) ; pseudo code, you get the idea.
%2.lifted = gep { i8*, i64 }, { i8*, i64 }* %1, 0
%2 = load i8*, i8** %2.lifted
%3.lifted = gep { i8*, i64 }, { i8*, i64 }* %1, 1
%3 = load i64, i64* %3.liftedAs result, instead of having aggregate loads and stores, you get a memcpy and a set of non aggregate loads and stores. The rest of the pipeline pick up on this nicely and is able to optimize this away.
This has been made in its own pass as to have a POC up and see if it works well (it does, that is the model I got the best results with so far), but I indeed wondering if making SROA do it is not a better idea.
OK, so it is trying to optimize everything that isn't a load or store from an alloca.
I replied to the general concept at the bottom, and I think that still applies -- I don't think it's the right design to create allocas and memcpy instructions just for SROA to clean up. We should directly optimize the aggregate loads and stores.
@chandler , do you think adding this to SROA, so we don't lift thing to then let SROA optimize but rather optimize right away in SROA make sense ? If so, do you have some pointer on how to implement this in SROA ?
I think this should probably go just before SROA, either at the three places it is inserted in this file, or only before the one in the FPM. I tend to think the latter is more appropriate, unless there is room for the inliner to expose more cases than can be handled.
What is the impact of this transformation on the code generated by clang? Should it be enabled by a flag only?