Page MenuHomePhabricator

[Feedback requested] Implement cold spliting
Needs ReviewPublic

Authored by deadalnix on Feb 23 2016, 4:37 PM.

Details

Summary

This is an attempt at splitting cold code from regular code in a function. All landing pads are considered cold (in fact, they almost always are, especially if performance is a concern, and assuming this allow for simplification in the LSDA).

2 LSDA header and call sites are emited, the cold one nested within the regular one. They both share the same action table and type table.

A symbol is amited for the cold function, using the name of the function suffixed by $cold . Emiting a symbol is required when using .subsections_via_symbols .

Debug informations aren't supported at this stage.

Diff Detail

Event Timeline

deadalnix updated this revision to Diff 48857.Feb 23 2016, 4:37 PM
deadalnix retitled this revision from to [Feedback requested] Implement cold spliting.
deadalnix updated this object.
deadalnix added a subscriber: llvm-commits.
sanjoy added a subscriber: sanjoy.Feb 23 2016, 5:24 PM
sanjoy added a subscriber: reames.
davidxl edited edge metadata.Feb 23 2016, 9:26 PM

There are quite a few refactoring changes that can be combined and split out. After that this patch can be minimized and becomes easier to review. Can you do the split first?

David

lib/CodeGen/AsmPrinter/DwarfCFIException.cpp
46

This can be split (with other refactoring changes) into a NFC patch.

158

Is it necessary to move this function above here? seems like an irrelevant change.

179–180

Another candidate of NFC refactoring.

lib/MC/MCObjectFileInfo.cpp
78

Should this section be created on demand when getColdTextSection() is called?

448

use .text.unlikely to be consistent with the name used in function reordering.

lib/MC/MCStreamer.cpp
261

This refactor change can go in its own patch. Also this change is not NFC -- the original code report fatal error regardless of whether NDEBUG is defined or not.

test/CodeGen/X86/coldsplit.ll
2

need mtriple (either linux or darwin) -- this does not work on COFF yet.

mehdi_amini edited edge metadata.Feb 23 2016, 9:30 PM

I guess the motivation is performance, right? Do you have benchmarks results to motivate this?

MatzeB edited edge metadata.Feb 23 2016, 9:38 PM

Very interesting, thanks for working on this!

I will do a detailed review soon. At a first glance it looks like this patch could be split up into a part affecting MC and one affecting the AsmPrinters? That would ease review and improve chances that some parts get accepted sooner.

In D17555#360358, @joker.eph wrote:

I guess the motivation is performance, right? Do you have benchmarks results to motivate this?

There is always the chance for less instruction caches misses if the cold code is out of the way. On top of this it should improve application startup time as there is a chance that the cold code will never be loaded from disk.

escha added a subscriber: escha.Feb 23 2016, 9:50 PM

This is a very clever idea. Getting code out of the way (WAY out of the way) can save not just icache, but iTLB too. I've seen large gains in JITs with this sort of strategy (e.g. putting extremely rarely taken code, like exception exits and things the guard intrinsic would be used for); just sticking the code in a separate, far-away allocation got significant improvements. If this can be even a fraction as effective on non-JITted code, it should be pretty nice.

@joker.eph , this is beneficial for performance for application that are icache and iTLB bound. This works is based on the various patches that were made to use LLVM as a backend for HHVM and was presented here : https://www.youtube.com/watch?v=VZ7A7t5LcR8 .

The optimization is disabled in the general case as it can have negative impact when you aren't icache/itlb bound (for instance bzip).

It may be worthwhile for instrumented build like ASAN and for one shot apps, but I haven't tested this, so don't quote me on this.

lib/CodeGen/AsmPrinter/DwarfCFIException.cpp
179–180

No, this need to be extracted as this is now needed twice: one for the regular fragment and once for the cold fragment.

lib/MC/MCObjectFileInfo.cpp
448

cold is the term used all over the place so far. It looks like GCC's crowd want to kill .text.unlikely on their side, so I'd advocate to keep it consistent and go for .cold , unless there is a good reason to stick with .unlikely ?

davidxl added inline comments.Feb 23 2016, 11:21 PM
lib/CodeGen/AsmPrinter/DwarfCFIException.cpp
179–180

that is what I am suggesting -- this part can be extracted into a helper function in another patch without changing functionality. This patch can then use it for cold fragment as well.

Regarding .text.cold vs .text.unlikely, I am fine either way as long
as they are kept consistent.

Regarding performance impact, Teresa has done extensive tuning in the
past. Our experience is that with PGO, the compiler does a pretty good
job laying out hot BBs in long chains, so the function splitting's
impact on icache is not that significant. With huge page text, the
impact on TLB misses is also moderate. We do see some improvements
though.

thanks,

David

pete added a subscriber: pete.Feb 24 2016, 10:19 AM

FYI, this is something I'm interested in for the JIT use case. My code has lots of essentially never executed slow paths and getting them pulled far away from the normal code is interesting. Note that my case may be different that others in that the blocks I'm interested in pulling away are stone cold/never executed. This gives a clear profitability heuristic which is one of the complex parts of doing this for the general case.

deadalnix added inline comments.Feb 24 2016, 12:50 PM
test/CodeGen/X86/coldsplit.ll
2

Isn't the target trip in the module doing this already ?

davidxl added inline comments.Feb 24 2016, 12:56 PM
test/CodeGen/X86/coldsplit.ll
2

right -- but by extracting into command line, you can add RUN line for both ELF and MachO

Extracted some changes in http://reviews.llvm.org/D17579

test/CodeGen/X86/coldsplit.ll
2

Got you. Thanks.

silvas added a subscriber: silvas.Feb 24 2016, 1:04 PM

I can see why this would help iTLB/paging, but I'm not grokking why it would help icache very much compared to per-function machine block placement ensuring that the cold stuff ends up at the end on a separate cacheline (does MBP already do that?). In fact (playing devil's advocate) the MBP approach could be more beneficial because it could allow branches to be relaxed to smaller encodings.

The scenarios I can see this being a substantial win for icache over MBP is when you e.g. have two functions with 1.5 cachelines of hot text (and say 1 cacheline of cold text). With MBP, each function would end up using ceiling(1.5) = 2 cachelines for the hot and one cacheline for the cold, but with the splitting the linker would see 2x 1.5 cacheline hot + 2x 1 cachline cold and so you could put the two 1.5's together and only use 3 cachelines for the hot part. How often does that occur (and does the linker actually manage to exploit this?).
Since the benefit is based on the "rounding", we save at most just under ("just under" is determined by the text alignment) one cacheline every time we can pack these densely. The benefit is at most #hotFunctions * (sizeof(Cacheline) - alignof(Function)) text size for the hot working set.

That being said, this kind of low-level function splitting is a really powerful tool and I fully support adding it, but I agree with Mehdi that I'd like to see some supporting benchmark results.

Gerolf added a subscriber: Gerolf.Feb 24 2016, 2:34 PM

Can you also elaborate a bit more on the iCache benefits.Is this specific to an architecture or the JIT environment? Or part of a more elaborate implementation of function splitting than shared in the patch? Perhaps I'm missing something but I thought the splinter must be able to duplicate code to support code layout to reap Icache benefits. And a good block ordering algorithm is might already eat the benefits.

Also I'm curious about your design evaluation. I think the information you are looking at at the low level (at least currently) is also be available at the IR level. When the compiler could split there I can also see compile-time benefits eg. a function marked as cold would not have to be optimized. And there would be less code to optimize/analyze in the hot routines.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp
887

That deserves at least more comment and rational. Is this a good heuristic on all machines?

895

It looks weird that within the loop cold section is set, but never reset for a hot block. It seems that it would be cleaner to group all blocks into hot and cold.

977

I think when all block are grouped into hot and code the end_of_function directive could be issued at the same place for hot and cold section. It seems hard to mantain to have to think about hot and cold in various context.

@Gerolf The patch was originally made for LLVM's HHVM backend. The kind of code generated is very branchy, with a lot of cold branches, large and with a flat profile (ie most function are being used, but various path within these function are almost never taken). This optimization has proven to be very valuable for this kind of code. HHVM already uses large pages for JITed code, and even with this this is a valuable optimization. The effectiveness of the technique is heavily dependent on the type of code at hand, and, while it proves to be useful for HHVM, isn't in the general case.

I'd say if you have a large application, with a relatively flat profile, you may want to try this. If not then this is useless to you and may even hurt.

deadalnix updated this revision to Diff 48994.Feb 24 2016, 5:06 PM
deadalnix edited edge metadata.

Rebased on top of some NFC patches, add triple in the llc call for tests rather than the module and test both linux and OSX.

deadalnix added inline comments.Feb 24 2016, 5:17 PM
lib/CodeGen/AsmPrinter/AsmPrinter.cpp
895

In practice it works, as this ends up being more or less what you get out of the MachineBlockPlacement pass. Ideally, that's be indeed preferable that MachineBlockPlacement flag a BB after which all BB are cold or something.

Having one split is preferable, as well as having all exception unwinding related code is the same fragment. You don't want to jump back and forth between cold and hot code, and generating LSDA would become way too hairy =.

977

I'm not sure what you mean here. Cold code is physically separated from hot, so you need symbols to express ranges in both.

lhames added a subscriber: lhames.Feb 24 2016, 7:14 PM

Neat - this could improve compile-times for lazily compiled functions in the ORC JIT. Thanks for working on this. :)

rafael added inline comments.
lib/CodeGen/AsmPrinter/AsmPrinter.cpp
901

At least on ELF you should not need both symbols. I would suggest just passing the function name to createTempSymbol.

Even on MachO you should be able to use a single linkerPrivate symbol, no?

davidxl added inline comments.Feb 27 2016, 4:13 PM
lib/CodeGen/AsmPrinter/AsmPrinter.cpp
875

Should the creation of the ColdTextSection be guarded with the option such that if it is not enabled, null pointer will be returned and checked here?

888

This is not correct -- it will trigger debug assert when MBB's frequency is larger than the Entry's -- skip that case first. Also you are using BP as a ratio here not really as branch probability.

BBFreq = ...;

IsCold = (BBFreq < EntryFreq && BranchProbablity::getBranchProbablity(BBFreq, EntryFreq) <= BranchProbability(1, 1<<...);

895

This does not look like a good assumption to make about MBP. MBP does outline cold blocks from loops and layout them last, but not all such blocks are suitable to be split out.

deadalnix added inline comments.Feb 29 2016, 10:11 AM
lib/CodeGen/AsmPrinter/AsmPrinter.cpp
895

I'm working toward MBP being able to flag a block. Note that MBP already outline blocks in loop if they run < 20% of the time.