This is an archive of the discontinued LLVM Phabricator instance.

Add to the split module utility an SCC based method which allows not to globalize any local variables.
ClosedPublic

Authored by slarin on Jan 12 2016, 12:37 PM.

Details

Reviewers
pcc
Summary

Currently llvm::SplitModule as the first step globalizes all local objects, which might not be desirable in some scenarios.
This change adds a new utility llvm::SplitModuleSelective that uses SCC approach to search for a balanced partition without the need to externalize symbols.
Such partition might not be possible or fully balanced for a given number of partitions, and is a function of the module properties (global/local dependencies within the module).

Joint development Tobias Edler von Koch (tobias@codeaurora.org) and Sergei Larin (slarin@codeaurora.org)

Diff Detail

Event Timeline

slarin updated this revision to Diff 44662.Jan 12 2016, 12:37 PM
slarin retitled this revision from to Add to the split module utility an SCC based method which allows not to globalize any local variables..
slarin updated this object.
slarin added a subscriber: llvm-commits.
slarin added a reviewer: pcc.Jan 12 2016, 12:39 PM
slarin edited subscribers, added: tobiasvk; removed: mehdi_amini.
slarin added a subscriber: joker-eph-DISABLED.
pcc edited edge metadata.Jan 12 2016, 2:24 PM

As far as I can tell, it seems that in a typical LTO scenario with internalization, this partitioning would be strongly biased towards a single partition containing most of the code. Do you have a plan to address this?

I also have some lower level comments which will become more relevant if you can address the above.

include/llvm/CodeGen/ParallelCG.h
42–45

I don't think we need this extra flexibility. If this new way is better I think we should replace the existing SplitModule implementation with it.

lib/Transforms/Utils/SplitModule.cpp
46

I believe that what this function does is already implemented by llvm::EquivalenceClasses.

112–122

I think you can do all this stuff in one go with a loop that iterates over the whole map after visiting all the globals.

160

You can simplify this code by removing this line.

171–174

Same here

279

Should we do something better with module inline asm? Maybe add referenced globals to an equivalence class? (Probably doesn't need to happen in the initial version though.)

What is the issue with the current globalization? You mentioned that it "might not be desirable in some scenarios", can you elaborate?

include/llvm/CodeGen/ParallelCG.h
46

The added argument would probably benefit some documentation.

In D16124#325170, @pcc wrote:

As far as I can tell, it seems that in a typical LTO scenario with internalization, this partitioning would be strongly biased towards a single partition containing most of the code. Do you have a plan to address this?

That's correct. Locally, we remember which symbols were originally non-local and restore their linkage type just prior to splitting / CodeGen. Constant merging and similar optimizations also reduce the ability to partition which is a bigger issue (that I haven't got a great solution for yet). But as it is, this is just as much about correctness as it is about getting the 'best' split.

In D16124#325188, @joker.eph wrote:

What is the issue with the current globalization? You mentioned that it "might not be desirable in some scenarios", can you elaborate?

Well, an obvious example would applying LTO to a library or to only part of your program. If locals become global, you may end up with name collisions or symbols you don't want to export.

Peter,

You are right in general - we have seen this bias. To address the issue, we re-globalize things that were internalized for LTO leaving only _real_ dependencies. If that step is interesting for anyone - we can definitely upstream it... but it will not go all the way. Constant merging sometimes introduces additional dependencies that might be counterproductive for splitting, and we are currently looking at our options to address that.

Nevertheless, without this patch the whole (split) feature is kind of useless for us, and I will try to explain why when I answer to Mehdi's question...

include/llvm/CodeGen/ParallelCG.h
42–45

I can see it both ways. In my tree I have one more change that is not (yet) reflected in this patch- partially I did not want to change several things at once... for that case having custom sorting function as argument makes a lot of sense.

On the other hand, I wanted to preserve as much from the original implementation as possible since i do not have a good visibility into all possible use cases.

I guess, if we can simply add "split mode" flag to splitCodeGen it might do the trick in fewer lines of code. Your are the owner of this code - in some way it is your call...

46

Do you mean in comments? Sure - based on the final form-factor we would want to use.

lib/Transforms/Utils/SplitModule.cpp
46

Maybe... I need to look if it can be used for this - one question would be the use of std::shared_ptr in this case - memory and time is an issue for us. This code is highly stressed.

If you are OK with the general concept - I can always improve it with additional patches - I am planning to stay here for a while :)

112–122

Yes, I looked into that, and it seemed to me that doing it like this was marginally faster. Is your concern with code clarity or implementation efficiency?

160

I am not sure about that - I will let Tobias to comment on this...

279

Generally agree... we can do it in following patches...

It looks like we should set visibility hidden when turning a local into a global.

slarin added a subscriber: slarin.Jan 12 2016, 3:45 PM

Mehdi,

I think Tobias beat me to the answer - at the least we can have linker errors if we try to blindly globalize locals with identical names. Renaming them will not always work either, since our customers might rely on presence of specific symbols in the final image... There are also other reasons for that, but correctness of at least some corner cases seems like a hard justification on its own :-)

Welcome to the embedded world :) :) :)

Sergei


Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

In the end it depends on what your linker does. If your linker tracks global symbols, turning locals into globals will not play well even with hidden attribute.

Besides, I want to "add" new functionality, not to change existing one (patch reflects that). It seems like a generic enough feature that people can find useful outside my use case...


Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

pcc added a comment.Jan 12 2016, 4:07 PM

Constant merging sometimes introduces additional dependencies that might be counterproductive for splitting, and we are currently looking at our options to address that.

How bad is the problem? I think that as long as your re-globalization is conservatively correct (i.e. does not pollute the global namespace) it would still be a useful contribution.

Nevertheless, without this patch the whole (split) feature is kind of useless for us

I can understand that, but to be frank, your patch isn't very useful for upstream without the re-globalization step.

include/llvm/CodeGen/ParallelCG.h
42–45

I can see it both ways. In my tree I have one more change that is not (yet) reflected in this patch- partially I did not want to change several things at once... for that case having custom sorting function as argument makes a lot of sense.

I'd like to understand why.

On the other hand, I wanted to preserve as much from the original implementation as possible since i do not have a good visibility into all possible use cases.

The reason for externalizing everything was mostly "implementation simplicity at the cost of some correctness". If we can fix the correctness problem then I'd rather not have two implementations.

lib/Transforms/Utils/SplitModule.cpp
46

It certainly seems that EquivalenceClasses would be more efficient than your custom implementation here, I'd also strongly prefer not to have a second implementation of equivalence classes in the tree if at all possible.

112–122

Mostly code clarity. If it's faster, that's all the better.

Peter,

Constant merging sometimes introduces additional dependencies that

might be counterproductive for splitting, and we are currently looking at our
options to address that.

How bad is the problem? I think that as long as your re-globalization is
conservatively correct (i.e. does not pollute the global namespace) it would
still be a useful contribution.

It is entirely property of your code. If you do have a lot of reusable constants in the LTO module, const merging acts as a spaghetti explosion connecting otherwise unrelated SCC. We are considering two ways for dealing with it - do not merge heuristic (likely to be slow), or re-split some constants smartly - this could be done here, as we build SCCs. For our case it works even as is, with acceptable misbalance, but can be improved.

I can understand that, but to be frank, your patch isn't very useful for
upstream without the re-globalization step.

I am glad you think so. We will try to upstream it ASAP. The only question would be - same patch or an independent feature? Under common flag?

If we can fix the correctness problem then I'd rather not have two implementations.

Ok. I can probably make it work with an enum flag for llvm::SplitModule or something.

Let me refactor the patch with EquivalenceClasses and single SplitModule function. Meanwhile Tobias will work on re-globalization patch. Stay tuned :)

Thanks.

Sergei


Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Peter,

Here is one issue with EquivalenceClasses. It uses

std::set<ECValue> TheMapping;

for members. We purposefully used SetVector since we are dealing with pointers... As a result I have non deterministic mapping - which is not cool.

I will see if I can do some sorting prior to partition assignment, but if it fails I do not think this will work.

If you have better suggestions - please speak up.

Sergei

pcc added a comment.Jan 13 2016, 11:26 AM

The only question would be - same patch or an independent feature? Under common flag?

Up to you.

I will see if I can do some sorting prior to partition assignment, but if it fails I do not think this will work.

Yes, you will need to sort the equivalence classes. (It may also be beneficial to sort by partition size as well -- this is the LPT multiprocessor scheduling algorithm -- see https://en.wikipedia.org/wiki/Multiprocessor_scheduling). One thing that could help you is that the identity of the leader should be deterministic, so you can sort on its name.

slarin updated this revision to Diff 44891.Jan 14 2016, 8:52 AM
slarin edited edge metadata.

Add to the split module utility an SCC based method which allows not to globalize any local variables.

Currently llvm::SplitModule as the first step globalizes all local objects, which might not be desirable in some scenarios.
This change adds a new utility llvm::SplitModuleSelective that uses SCC approach to search for a balanced partition without the need to externalize symbols.
Such partition might not be possible or fully balanced for a given number of partitions, and is a function of the module properties (global/local dependencies within the module).

Joint development Tobias Edler von Koch (tobias@codeaurora.org) and Sergei Larin (slarin@codeaurora.org)

This would be even better if EquivalenceClasses would guarantee deterministic order... but anyhow - this should reflect Peter's comments.

The re-globalization patch will likely appear as an independent change later today.

pcc added inline comments.Jan 15 2016, 11:45 AM
lib/Transforms/Utils/SplitModule.cpp
62–77

I realized that you can reimplement this function in a simpler way that also avoids non-determinism in your loop (unordered_multimap can re-order values non-deterministically).

llvm::DenseMap<Comdat *, GlobalVariable *> ComdatMembers;
[...]
auto &Member = ComdatMembers[C];
if (Member)
  GVToClusterMap.unionSets(Member, GV);
else
  Member = GV;
160

Did you look into this?

185

If you store a ClusterMapType::member_iterator in Sets instead of a GlobalValue * you won't have to look it up again.

Excellent suggestions. New patch will be up shortly.

lib/Transforms/Utils/SplitModule.cpp
62–77

Yes, I like this!

160

Yes.. It is needed. Tobias is describing why now...

slarin updated this revision to Diff 45040.Jan 15 2016, 2:51 PM

Implemented Peter's suggestions.

tobiasvk added inline comments.Jan 15 2016, 2:57 PM
lib/Transforms/Utils/SplitModule.cpp
160

This is to handle the case where a function uses e.g. a bitcast of a GlobalValue... that bitcast would be ConstantExpr, which is shared between functions, but is not a GlobalValue itself. Constants can be nested, hence the worklist.

Oh I see, perhaps you mean that we could just always add U to the worklist, and for GlobalValues we simply wouldn't 'recurse' any deeper. If you prefer that it's fine, it just adds extra work when we already know there's only ever going to be one item on the worklist.

pcc accepted this revision.Jan 15 2016, 3:46 PM
pcc edited edge metadata.

LGTM with this one change. Thanks!

lib/Transforms/Utils/SplitModule.cpp
160

Sorry, I meant what you wrote in your second paragraph. It isn't all that much work and makes the code slightly simpler.

This revision is now accepted and ready to land.Jan 15 2016, 3:46 PM
In D16124#328371, @pcc wrote:

LGTM with this one change. Thanks!

Ok, no problem. Sergei will make the change.

Eugene.Zelenko added a subscriber: Eugene.Zelenko.

Committed in rL258083.