This is an archive of the discontinued LLVM Phabricator instance.

Randomly outline code for cold regions
Needs ReviewPublic

Authored by hiraditya on Jul 28 2019, 6:33 AM.

Details

Summary

In absence of profile information, it is possible to take advantage of hot-cold splitting
if the code base is mostly code e.g. hot/cold<0.2/0.8 (80:20 rule). It is a non-deterministic
outlining but can be beneficial in several instances where profile data is not available and determinism is not important.

Putting the diff here in case there is interest in having aggressive hot-cold-splitting.

Diff Detail

Event Timeline

hiraditya created this revision.Jul 28 2019, 6:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 28 2019, 6:33 AM

I'm not sure it is a good idea to have non-deterministic optimizations.

pcc added a subscriber: pcc.Jul 29 2019, 12:34 AM

It's worth noting that you could make this deterministic by seeding an RNG with something like the function name.

I'm not sure it is a good idea to have non-deterministic optimizations.

I agree with @lebedev.ri, I don't think non-deterministic optimizations are good as it would make triaging and debugging really difficult. I see @pcc has a follow on suggestion to make this deterministic. At the least that would be needed. But is it useful to do this optimization randomly in practice?

We can certainly make this deterministic with a sequence of lists. Thanks for the suggestion, I'll update this soon.

I'm not sure it is a good idea to have non-deterministic optimizations.

I agree with @lebedev.ri, I don't think non-deterministic optimizations are good as it would make triaging and debugging really difficult. I see @pcc has a follow on suggestion to make this deterministic. At the least that would be needed. But is it useful to do this optimization randomly in practice?

I guess It depends on the workload, for user-facing apps for example, where most of the code is cold. Aggressively outlining reduces program load time.

Updated to have a fixed sequence of 'random' numbers.

vsk added a comment.Jul 30 2019, 11:20 PM

I think that this is probably too risky. I've seen that splitting out the wrong block can introduce major performance regressions (up to and including stack overflow).

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

In D65376#1607799, @vsk wrote:

I think that this is probably too risky. I've seen that splitting out the wrong block can introduce major performance regressions (up to and including stack overflow).

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

Agreed, this is mostly intended for specific workloads where wrong out-ling doesn't incur major overhead. e.g., user facing applications, applications where initial load time is a major issue etc. We can keep this disabled by default so that only interested parties can use it. For some cases it can provide major startup improvements.

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

Agree with the reasoning here, probably that's why all previous hot-cold splitting were done at the RTL level. The problem with having machine-level passes like these are related to porting the optimization to every backend and maintenance. In my experience, even when they don't give the optimal results IR level passes scale well.

Do you actually need truly random, non-deterministic transform?
Can it instead be solved by non-non-deterministic outlining, based on some heuristic?

Do you actually need truly random, non-deterministic transform?

The idea is to aggressively outline more, because for some applications most of the code is cold, getting profile information is difficult and goal is to improve startup time. non-determinism is not required, it will be great to have a deterministic heuristic. I'm open to ideas.

Can it instead be solved by non-non-deterministic outlining, based on some heuristic?

There is some heuristic in the static-profile analysis but it does not outline as much.