Page MenuHomePhabricator

[HotColdSplitting] Add command line options for supplying cold function names via user input.
Needs ReviewPublic

Authored by rjf on Aug 10 2020, 12:22 AM.

Details

Reviewers
hiraditya
rcorcs
Summary

In some cases, the user might want to specify which functions are
explicitly cold. This commit adds two options, cold-functions-list and
cold-functions-file that enables the user to supply lists of cold
function names to hot/cold splitting. The optimization pass will then
mark any function encountered with the same name as cold.

Diff Detail

Event Timeline

rjf created this revision.Aug 10 2020, 12:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2020, 12:22 AM
rjf requested review of this revision.Aug 10 2020, 12:22 AM
hiraditya added inline comments.Aug 10 2020, 7:17 AM
llvm/lib/Transforms/IPO/HotColdSplitting.cpp
107

Let's move this inside the HotColdSplitting class

712

Let's not read from stdin, only from a regular file.

rjf added a comment.Aug 11 2020, 9:39 AM

Based on discussion with @hiraditya and @rcorcs this morning, we're thinking doing this in in the ProfileSummaryInfo pass is better (have the PSI pass take as input user-specified cold funcs and mark them with cold attribute) as its benefit can be more wide-ranging than only for hotcoldsplit. Will prepare a separate patch for this.

vsk added a subscriber: vsk.Aug 11 2020, 6:05 PM

Any reason not to mark up the relevant functions with __attribute__((cold)) directly?

llvm/lib/Transforms/IPO/HotColdSplitting.cpp
700

Might be better to use SpecialCaseList.h rather than hand-rolling something new.

hiraditya added a comment.EditedAug 12 2020, 5:57 AM
In D85628#2212047, @vsk wrote:

Any reason not to mark up the relevant functions with __attribute__((cold)) directly?

In many instances it may not be not possible to annotate the functions e.g., standard library functions. A recent example we found was cxa_guard_acquire which could be cold in many instances.

rjf updated this revision to Diff 285087.Aug 12 2020, 7:38 AM

Address review comments from @vsk, @hiraditya

rjf marked 3 inline comments as done.Aug 12 2020, 7:39 AM
This comment was removed by rjf.
llvm/lib/Transforms/IPO/HotColdSplitting.cpp
700

It seems like the public API of SpecialCaseList.h only supports looking up whether a particular string is contained in the list, rather than retrieving all strings in the list. If we do this then we'll either have to a) traverse the entire module to match which function names in the current module are in the list, and mark them as cold, or b)

rjf marked an inline comment as done.Aug 12 2020, 7:41 AM
rjf added inline comments.
llvm/lib/Transforms/IPO/HotColdSplitting.cpp
700

Sorry, the above comment was written (and published) in mistake.

rjf updated this revision to Diff 285089.Aug 12 2020, 8:18 AM

Add debug messages.

rjf updated this revision to Diff 285097.Aug 12 2020, 9:07 AM

Support marking declarations, e.g. in @hiraditya's example of __cxa_guard_acquire/abort, as cold.

rjf updated this revision to Diff 285103.Aug 12 2020, 9:15 AM

Remove unused includes.

rjf updated this revision to Diff 285136.Aug 12 2020, 10:30 AM

Add a test case for command line-supplied cold functions list.

rjf updated this revision to Diff 285140.Aug 12 2020, 10:56 AM

Add the test command for read from file.

hiraditya added inline comments.Aug 12 2020, 12:14 PM
llvm/test/Transforms/HotColdSplit/custom-cold-cmd.ll
17

Remove trailing .*

hiraditya added inline comments.Aug 12 2020, 12:26 PM
llvm/lib/Transforms/IPO/HotColdSplitting.cpp
731

Marking cold functions should be in a separate loop before this loop.

hiraditya added inline comments.Aug 12 2020, 12:28 PM
llvm/lib/Transforms/IPO/HotColdSplitting.cpp
731

ignore me.

rjf updated this revision to Diff 285163.Aug 12 2020, 12:39 PM

Mark cold functions in advance before outlining them.

Originally, as illustrated in the test case, the go() function
should be cold but not marked as such, since it predates two functions that are cold but not yet marked as cold.

Thanks to @hiraditya for spotting this.

rjf marked 2 inline comments as done.Aug 12 2020, 12:40 PM
vsk added a comment.Aug 12 2020, 1:16 PM

I’m not convinced this is a good idea. In what use case is it not possible to mark up relevant functions? It doesn’t make sense to me to make alternations to standard library functions within the compiler. It seems better to simply patch the standard library. In some cases llvm does infer function attributes for library functions, but these are generally lower level attributes that can’t be specified at the source level, and the attribute is made available to other passes in the pipeline.

rjf added a comment.Aug 12 2020, 1:30 PM
In D85628#2213919, @vsk wrote:

I’m not convinced this is a good idea. In what use case is it not possible to mark up relevant functions? It doesn’t make sense to me to make alternations to standard library functions within the compiler. It seems better to simply patch the standard library. In some cases llvm does infer function attributes for library functions, but these are generally lower level attributes that can’t be specified at the source level, and the attribute is made available to other passes in the pipeline.

Do you mean this patch isn't a good idea in general, or the recent revision isn't a good idea? For the latter, I'm not sure if you meant we should not outline declarations or we should not split the original loop into two (e.g. marking as cold before outlining). IMO splitting the loop into two simply addresses what the original intent of what we're doing, which is to mark certain functions as cold before outlining. Whereas, if we don't outline declarations via user-provided input, it renders @hiraditya 's proposed testcase useless. Alternatively, we don't have to make the testcase involving standard library functions if that's what you want :).

rjf updated this revision to Diff 285186.Aug 12 2020, 3:00 PM

Update test case to reflect newest changes

rjf marked an inline comment as done.Aug 12 2020, 3:01 PM
jfb added a comment.Aug 12 2020, 3:08 PM
In D85628#2213940, @rjf wrote:
In D85628#2213919, @vsk wrote:

I’m not convinced this is a good idea. In what use case is it not possible to mark up relevant functions? It doesn’t make sense to me to make alternations to standard library functions within the compiler. It seems better to simply patch the standard library. In some cases llvm does infer function attributes for library functions, but these are generally lower level attributes that can’t be specified at the source level, and the attribute is made available to other passes in the pipeline.

Do you mean this patch isn't a good idea in general, or the recent revision isn't a good idea? For the latter, I'm not sure if you meant we should not outline declarations or we should not split the original loop into two (e.g. marking as cold before outlining). IMO splitting the loop into two simply addresses what the original intent of what we're doing, which is to mark certain functions as cold before outlining. Whereas, if we don't outline declarations via user-provided input, it renders @hiraditya 's proposed testcase useless. Alternatively, we don't have to make the testcase involving standard library functions if that's what you want :).

My understanding is that today code can be considered "cold" based on the following:

  1. Attribute on the function
  2. Likely / unlikely annotations
  3. Profile information
  4. Other compiler heuristics

This adds another way to do it, but it's kind of a side-injection and it doesn't seem particularly principled. Presumably the list you're feeding through the command-line comes from a profile? Why isn't it provided as profile information?

It doesn’t make sense to me to make alternations to standard library functions within the compiler. It seems better to simply patch the standard library.

The standard library function is an example. There could be cases where the same function is hot in one instance but cold in another. Annotating function declarations hot/cold could introduce regressions in that case.

This adds another way to do it, but it's kind of a side-injection and it doesn't seem particularly principled. Presumably the list you're feeding through the command-line comes from a profile? Why isn't it provided as profile information?

For example: https://godbolt.org/z/ThaGEW the constructor of static object is always cold, how can we outline this if say we don't have a profile information. A workload can have a set of cold functions which programmer would know, but they necessarily don't have profile information. If there's a better way to make compiler aware of this, I'll be more than happy to work on that.

For example: https://godbolt.org/z/ThaGEW the constructor of static object is always cold, how can we outline this if say we don't have a profile information. A workload can have a set of cold functions which programmer would know, but they necessarily don't have profile information. If there's a better way to make compiler aware of this, I'll be more than happy to work on that.

I don't think we need to mark the object constructor as cold. We want to mark inlined slow paths of local static variables (the call to __cxa_guard_acquire, the call to the object constructor, etc.,) as cold.

As one possibility the consequences of which I have certainly not thought through, but which maybe already exists to some degree, we can depend on a bit of inference - if __cxa_guard_acquire is marked as cold, then all code which is accessible only by passing the call to __cxa_guard_acquire is also cold. Intuition: blocks which *contain* unconditional cold code are themselves cold by transitivity. Intuition: don't call cold functions where you're not cold.

[[gnu::cold]] void cold_func();
void has_cold_branch(bool b) {
  // nothing here is cold
  if (b) {
    // everything here is cold
    cold_func();
    // everything here is cold
  } else {
    // nothing here is cold
  }
  // nothing here is cold
}

In particular, this would hopefully mean that the object constructor call which appears past the __cxa_guard_acquire would not get inlined into the call-site; rather, a call to the constructor would be emitted. (Unless inlining the constructor into the call-site is shorter than calling the constructor.) In this case, we would not compile the object constructor optimized for coldness - we would compile the object constructor as normal. In fact, since the same constructor is used in hot paths, we must not compile it as cold. But we would compile the call-site as though it were cold, since __cxa_guard_acquire is marked cold and this call site unconditionally passes through __cxa_guard_acquire.

I don't think we need to mark the object constructor as cold. We want to mark inlined slow paths of local static variables (the call to __cxa_guard_acquire, the call to the object constructor, etc.,) as cold.

If there was a way to provide handwritten profile/coverage file, maybe that would work in absence of profile information?

As one possibility the consequences of which I have certainly not thought through, but which maybe already exists to some degree, we can depend on a bit of inference

  • if cxa_guard_acquire is marked as cold, then all code which is accessible only by passing the call to cxa_guard_acquire is also cold.

This is taken care by dominance relation. HCS uses that to infer coldness/hotness.

If there was a way to provide handwritten profile/coverage file, maybe that would work in absence of profile information?

I am not sure I see a need for profiles here.

I don't think we need profile information. We just need __cxa_guard_acquire to be marked cold, and for the compiler to infer coldness of code in the same block as a call to something marked as cold. (Apparently HCS does this?)

The compiler can also infer coldness of hidden functions which are only called by cold code.

But let's take std::string as an example. Let's say we have a function with a local static variable of type std::string. The goal is to have the inlined slow path outlined to the cold section, and for the slow path to be minimal in size. So we just emit a call to the std::string ctor, which is compiled normally since it is inline and not hidden, instead of inlining the ctor into the slow path. We can make the assumption that the std::string ctor is ODR-used *somewhere* in the resulting binary, so we can make the assumption that forcing a reference to this function will not increase overall code size.

Whether the compiler emits a perf-optimized or a size-optimized definition for the std::string ctor may be influenced by profiles. But that seems to me like a separate question that doesn't pertain to the specific topic of handling local static variables.

But for another type, say hidden_foo which is defined in an anonymous namespace and which is only constructed once only at a site inferred to be cold, we can take a different approach and possibly inline the ctor. Here, if there is only one site, minimal code size would (likely) imply inlining the ctor; but if there are two sites, minimal code size would (likely) imply emitting a ctor definition and then calling it twice.

rjf added a comment.EditedAug 12 2020, 5:59 PM

If there was a way to provide handwritten profile/coverage file, maybe that would work in absence of profile information?

I am not sure I see a need for profiles here.

I don't think we need profile information. We just need __cxa_guard_acquire to be marked cold, and for the compiler to infer coldness of code in the same block as a call to something marked as cold. (Apparently HCS does this?)

I think you missed Aditya's point here: he's saying that certain things like outlining __cxa* functions might be unwise to mark as cold in general, hence leaving a command line option to supply user-defined cold func names might be a good idea for ad-hoc workloads; and alternatively to JF's point, he's saying if the command line option sounds too unprincipled we can also take in a more principled file format like code coverage or something like that, should that be available.

And also, yes, HCS propagates coldness information through the CFG. You can look at the implementation in OutliningRegion for more details.

Apparently there is already support for extended binary profile format (https://reviews.llvm.org/D66766) and we maybe able to use that. Thanks to @wenlei for sharing this information.

In D85628#2214536, @rjf wrote:

I think you missed Aditya's point here: he's saying that certain things like outlining __cxa* functions might be unwise to mark as cold in general

Maybe I missed that, or maybe I am missing some context. I didn't think there was a top-line goal of injecting cold attributes to arbitrary functions.

My understanding is that a key motivating case is minimizing native codegen for local static variables by understanding that the slow path is definitionally cold. If coldness propagates, then as long as __cxa_guard_acquire is attributed as cold, coldness propagation should extend to everywhere of interest. And since __cxa_guard_acquire is used for only one purpose, namely guarding local static variable initialization, and since that purpose is definitionally cold, it should be attributed as cold. I didn't think there was a need to inject cold attributes to other functions.

When it comes to library functions which are cold, presumably we can just patch libraries with cold-attributed function definitions but unattributed declarations, and attribute the declarations. It seems like adding compiler support for lists of cold functions is a workaround for not patching libraries?

I have two questions for this patch:

  • Why do we need another way to explicitly tell compiler that some functions are cold, rather than using existing mechanisms?
    • If this is for a random cold function from user code, then using profile or in-source annotation should be the way to go, which is more scalable and more sustainable.
    • If this is for special (generated) functions, e.g. __cxa_guard_acquire, then we shouldn't even need to tell the compiler. It's like we don't have to tell compiler EH pad, noreturn functions are cold, instead compiler should figure this out by itself. _cxa_guard_acquire is a generated function with very specific semantic, so I think it's similar to noreturn, EH pads in that regard.
  • If we have to tell compiler extra hotness/coldness info, why do we do it just HCS? hotness is very general, and could benefit many opts. Introducing a channel to tell specific optimization about hotness is not good design in general (imagine 10 passes each has its own way of getting hotness through a bunch of switches). If we really have to do this, I'd say we should do it in the framework, e.g. the static analysis part of BFI.

For my 2nd question, this patch is what I think how this should be done: https://reviews.llvm.org/D79485

In D85628#2214536, @rjf wrote:

I think you missed Aditya's point here: he's saying that certain things like outlining __cxa* functions might be unwise to mark as cold in general

Maybe I missed that, or maybe I am missing some context. I didn't think there was a top-line goal of injecting cold attributes to arbitrary functions.

My understanding is that a key motivating case is minimizing native codegen for local static variables by understanding that the slow path is definitionally cold. If coldness propagates, then as long as __cxa_guard_acquire is attributed as cold, coldness propagation should extend to everywhere of interest. And since __cxa_guard_acquire is used for only one purpose, namely guarding local static variable initialization, and since that purpose is definitionally cold, it should be attributed as cold. I didn't think there was a need to inject cold attributes to other functions.

When it comes to library functions which are cold, presumably we can just patch libraries with cold-attributed function definitions but unattributed declarations, and attribute the declarations.

Added a patch to libcxxabi: https://reviews.llvm.org/D85873

It seems like adding compiler support for lists of cold functions is a workaround for not patching libraries?

I think, in general, it would be useful to have a way to tell compiler the functions which are cold. This will avoid annotations, as it is possible to have a function hot in one context and cold in another e.g., inlining of std::string::string() which Laxman and I added in https://reviews.llvm.org/D22782 Although the patch improved some of the workloads quite significantly, it is also a constant source of noise to many other workloads.

For my 2nd question, this patch is what I think how this should be done: https://reviews.llvm.org/D79485

Thanks for sharing, this looks very promising.

Added a patch to libcxxabi: https://reviews.llvm.org/D85873

The patch looks legit. Wonder if such a patch should be sent to libstdc++ too.

But will the compiler see these declarations though when compiling a source file containing a local static variable, in a way that will let it propagate coldness from the __attribute__((cold))? Or does the compiler have built-in knowledge of these functions - presumably it does since it emits calls to them - and will the compiler's built-in knowledge need to be extended in addition to this patch?

The patch looks legit. Wonder if such a patch should be sent to libstdc++ too.

Sending it shortly.

But will the compiler see these declarations though when compiling a source file containing a local static variable, in a way that will let it propagate coldness from the attribute((cold))? Or does the compiler have built-in knowledge of these functions - presumably it does since it emits calls to them - and will the compiler's built-in knowledge need to be extended in addition to this patch?

The compiler will see the declaration while analyzing a callsite.

hiraditya added a comment.EditedAug 13 2020, 8:00 AM
In D85628#2214119, @jfb wrote:
In D85628#2213940, @rjf wrote:
In D85628#2213919, @vsk wrote:

I’m not convinced this is a good idea. In what use case is it not possible to mark up relevant functions? It doesn’t make sense to me to make alternations to standard library functions within the compiler. It seems better to simply patch the standard library. In some cases llvm does infer function attributes for library functions, but these are generally lower level attributes that can’t be specified at the source level, and the attribute is made available to other passes in the pipeline.

Do you mean this patch isn't a good idea in general, or the recent revision isn't a good idea? For the latter, I'm not sure if you meant we should not outline declarations or we should not split the original loop into two (e.g. marking as cold before outlining). IMO splitting the loop into two simply addresses what the original intent of what we're doing, which is to mark certain functions as cold before outlining. Whereas, if we don't outline declarations via user-provided input, it renders @hiraditya 's proposed testcase useless. Alternatively, we don't have to make the testcase involving standard library functions if that's what you want :).

My understanding is that today code can be considered "cold" based on the following:

  1. Attribute on the function
  2. Likely / unlikely annotations
  3. Profile information
  4. Other compiler heuristics

This adds another way to do it, but it's kind of a side-injection and it doesn't seem particularly principled. Presumably the list you're feeding through the command-line comes from a profile? Why isn't it provided as profile information?

Let me try to formulate the problem statement to motivate this work. I'm happy to work on a better approach.
Let's consider a repository which builds multiple applications. For App1 we have set of cold callsites (CS1), and cold function declarations (FD1); similarly for App2 we have CS2 and FD2. These sets have the following properties:

  • CS1 and CS2 may have some intersection but one may not be necessarily a subset of another. a non-intersecting example would be: calling std::lower_bound in a loop vs. calling in an isolated instance. std::lower_bound could be cold in the latter case.
  • FD1 and FD2 may have some intersection but one may not be necessarily a subset of another. a non-intersecting example would be: constructing std::unordered_map<string, string> vs. constructing a std::string. std::string::string() could be cold in the latter case.

It may not be possible to get profile information with sample-profile or instrumented-profile (e.g., mobile phone apps), however, product developers would know hotness/coldness of many call-sites based on domain knowledge.
In order to optimize these call-sites, how do we tell compiler the about these FDs and CSs? Adding annotations like __attribute__((cold)) to FD1 could regress App2 and vice-versa.
Supplying a human readable/editable file would be ideal.

vsk added a comment.Aug 14 2020, 1:28 PM

I have two questions for this patch:

  • Why do we need another way to explicitly tell compiler that some functions are cold, rather than using existing mechanisms?
    • If this is for a random cold function from user code, then using profile or in-source annotation should be the way to go, which is more scalable and more sustainable.
    • If this is for special (generated) functions, e.g. __cxa_guard_acquire, then we shouldn't even need to tell the compiler. It's like we don't have to tell compiler EH pad, noreturn functions are cold, instead compiler should figure this out by itself. _cxa_guard_acquire is a generated function with very specific semantic, so I think it's similar to noreturn, EH pads in that regard.
  • If we have to tell compiler extra hotness/coldness info, why do we do it just HCS? hotness is very general, and could benefit many opts. Introducing a channel to tell specific optimization about hotness is not good design in general (imagine 10 passes each has its own way of getting hotness through a bunch of switches). If we really have to do this, I'd say we should do it in the framework, e.g. the static analysis part of BFI.

+ 1. Thanks for writing this up, this reflects my thinking about the patch.

In order to optimize these call-sites, how do we tell compiler the about these FDs and CSs? Adding annotations like attribute((cold)) to FD1 could regress App2 and vice-versa. Supplying a human readable/editable file would be ideal.

I don't think supplying a list of cold functions to a compiler invocation is a scalable way to approach this problem. This looks like an attempt to hand-annotate individual call sites via a side channel, and it's likely too big (& imprecise) of a hammer to apply. E.g., after adding a cold function list to a program's cflags, performance may regress if the program picks up a hot use of a function in the list. It seems like an actual profile would be a better fit for your use case (if instrumentation/sampling is not possible, then it may be possible to construct a synthetic/fake profile as a pre-processing step?, but that has the same issues with source drift).

then it may be possible to construct a synthetic/fake profile as a pre-processing step

Seems like the most viable step so far. Thanks for the suggestion.