This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Add handling for ColdCC calling convention and a pass to mark candidates with coldcc attribute
ClosedPublic

Authored by syzaara on Sep 29 2017, 10:17 AM.

Details

Summary

This patch adds support for the coldcc calling convention for Power. This changes the set of non-volatile registers. It includes a pass to stress test the implementation by marking all static directly called functions with the coldcc attribute through the option -enable-coldcc-stress-test. It also includes an option, -ppc-enable-coldcc, to add the coldcc attribute to functions which are cold at all call sites based on BlockFrequencyInfo.

Diff Detail

Event Timeline

syzaara created this revision.Sep 29 2017, 10:17 AM
hfinkel added inline comments.Oct 3 2017, 7:08 PM
lib/Target/PowerPC/PPCColdCC.cpp
1 ↗(On Diff #117174)

Can you please merge this logic into GlobalOpt? That's where we would currently set the FastCC calling convention. We could make that logic set the FastCC, or ColdCC, calling convention as appropriate. AFAIKT, there's nothing really target-dependent here.

If need be, we can add a TTI hook to enable it for particular targets. It's not clear to me that should be necessary.

90 ↗(On Diff #117174)

Please make this value a cl::opt so that we can easily experiment with tuning it as necessary.

120 ↗(On Diff #117174)

Is this check actually needed? Doesn't the backend handle this like it handles FastCC (i.e., if the target does nothing special, it just falls back to using the default calling convention). If something in this patch breaks that fallback, we need to clean that up. It should always be legal to add coldcc to an internal call regardless of whether the target does anything special with it.

124 ↗(On Diff #117174)

Reverse the order:

if (EnableColdCCStressTest || hasColdCallSite(F)) {

the cheaper-to-evaluate expression should be first.

nemanjai requested changes to this revision.Oct 9 2017, 7:42 AM

An important aspect of the testing here is missing. Namely checking for the coldcc calling convention on the IR. You should be able to accomplish this either through an MIR test case or just a test case running opt.

lib/Target/PowerPC/PPCColdCC.cpp
1 ↗(On Diff #117174)

I am all for making this target independent, but I would definitely opt for a TTI hook. The reasons being the following:

  • A target that doesn't want to use this shouldn't be forced to incur the cost of this analysis (however small).
  • Much less importantly, it is possible that some targets have incomplete/incorrect handling of this currently seldom-used calling convention. I believe a TTI hook would give people an opportunity to test this out and fix any issues they see on their targets.
9 ↗(On Diff #117174)

Really? Not even something minimal like:

/// Sets the calling convention on cold internal functions
/// and their callsites to coldcc.

:)

27 ↗(On Diff #117174)

I don't think it's a particularly good idea to have the DEBUG_TYPE be exactly the name of the calling convention.

50 ↗(On Diff #117174)

Do we not want to append
ModulePass::getAnalysisUsage(AU);
here?

61 ↗(On Diff #117174)

I'm not a fan of data members such as this. This isn't really a property of the module but the function. Furthermore, this is a data member that is only used in one function - and it's reset for every function encountered. Why is it not local to the function then?

68 ↗(On Diff #117174)

What happens if no users are found? I'm not sure if that's possible or if it would already be discarded. But if it might still happen, seems like you don't want to bother marking it coldcc.

81 ↗(On Diff #117174)

I don't really understand this approach. You loop over all the call sites, count how many you've seen. For each, you check if it's cold and increment the number of cold call sites. Finally, you return whether or not the two numbers are equal. Aside from the fact that the return statement is really just return NumCallSites == NumColdCallSites;, I don't understand the need to loop over all the call sites. Why is this function not just hasNonColdSite() which will return whether any sites are not cold (i.e. early exit with true if a non-cold site is found).
That seems more natural to me, but even if you want to keep this a positive test, it should:

  • Be renamed to something like allCallSitesCold()
  • Exit early if a non-cold site is found
87 ↗(On Diff #117174)

This applies to all the member functions here - they're all in need of comments explaining what they do. For this one, it appears that it checks whether the block that contains the call has a BlockFrequency less than 10% of it's parent's entry block. I think you should make this explicit in the comment (of course, you'll not use 10% when you convert this to a cl::opt value, but the approach is still the same).

90 ↗(On Diff #117174)

I agree. The option can take a percentage and please make that clear in the description of the option.

lib/Target/PowerPC/PPCFastISel.cpp
209

Why is this needed? I would imagine that with FastISel, we don't want to handle coldcc specially. I don't imagine handling it would add much (anything) to the compile time, but it just seems fundamentally like something we don't want to do - apply these ABI breaking optimizations when emitting code for -O0.

lib/Target/PowerPC/PPCFrameLowering.cpp
1953–1960

The code no longer matches the comment. You should update the comment to explain why a callee-saved register might not be live-in to the function.

1954

Variables start with uppercase.

1990–1993

This really needs a comment. It should be clear why we keep a register live after spilling if it wasn't live-in to the function.

lib/Target/PowerPC/PPCISelLowering.cpp
6392–6395

I think a ternary operator would make it more clear that you're calling the exact same function with a different argument. Same below of course.

lib/Target/PowerPC/PPCRegisterInfo.cpp
126

Why did this need to move up? It does not appear like a change that is needed since the first use of it is still below where it was before.

This revision now requires changes to proceed.Oct 9 2017, 7:42 AM
syzaara updated this revision to Diff 119598.Oct 19 2017, 10:08 AM
syzaara edited edge metadata.

The coldcc handling is now moved to GlobalOpt. A target can enable coldcc for internal functions which are cold at all call sites by overriding enableColdCC, or enable stress testing to mark all internal functions with coldcc by setting option -enable-coldcc-stress-test. The coldness threshold can be controlled with the option -coldcc-rel-freq.

syzaara added inline comments.Oct 19 2017, 10:10 AM
lib/Target/PowerPC/PPCFastISel.cpp
209

This is just for suppressing compiler warning messages that arise from including "PPCGenCallingConv.inc" and not having a use for the function RetCC_PPC_Cold defined in PPCGenCallingConv.inc. It's mentioned in the comment above as well.

hfinkel added inline comments.Oct 26 2017, 9:09 PM
include/llvm/Analysis/TargetTransformInfo.h
546

How about useCodeCCForColdCalls()? (I don't like enableColdCC() because coldcc is always "enabled", the question here is whether we actively form such calls.)

lib/Transforms/IPO/GlobalOpt.cpp
2216

The fastcc conversion is currently guarded by isProfitableToMakeFastCC, which really just does this:

/// Return true if this is a calling convention that we'd like to change.  The
/// idea here is that we don't want to mess with the convention if the user
/// explicitly requested something with performance implications like coldcc,
/// GHC, or anyregcc.
static bool isProfitableToMakeFastCC(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  // FIXME: Is it worth transforming x86_stdcallcc and x86_fastcallcc?
  return CC == CallingConv::C || CC == CallingConv::X86_ThisCall;
}

and we should have the same guard (which will also make sure we don't explicitly change something marked fastcc to coldcc). I recommend renaming this function to something like:

hasChangeableCC(Function *F)

and then using it in both places.

nemanjai added inline comments.Nov 5 2017, 9:21 PM
include/llvm/Analysis/TargetTransformInfo.h
546

I think that is meant to be useColdCCForColdCalls() and if so, I agree with that name.

But more importantly than the name, I don't feel that this should be an "always-on/always-off" type of query.
I feel that the target should be the arbiter of whether a specific cold callsite would provide any benefit if it were marked coldcc. Deciding that a callsite is cold is fine in GlobalOpt, but simply having a cold call in a function does not mean that marking that call with coldcc is useful.
Consider a trivial example where the caller has calls on the cold path as well as the hot path. Furthermore, all the values live across one are also live across the other. In such a case, marking the cold call as coldcc provides absolutely no gain and if the cold function is called frequently enough (i.e. either the block frequency is incorrect or the cold threshold is too high), this is a net performance loss.

So what I'm getting at is that I don't think the call frequency is enough to make the decision and I am suggesting that since the real meaning of coldcc is kind of target-specific, maybe the decision to mark functions with it should be as well. Perhaps the right thing to do would be to have the default semantics implemented in the TTI base class, allowing the target to override it.

lib/Target/PowerPC/PPCCallingConv.td
48

Period. Complete sentences in comments please.

302

The comment doesn't explain the TOC and stack pointer registers that are clearly missing from the list here. Please add a comment that we never consider the stack pointer a CSR and that we have versions of these lists with and without the TOC pointer (used depending on whether the function uses the TOC and is a non-leaf).

lib/Target/PowerPC/PPCFrameLowering.cpp
1956

Much like the comment below, this states the what but not the why. Namely, why should the registers that are live-in to the function not be marked as live-in to the block where the prologue is being inserted?

Furthermore in the below case, I can make sense of why we don't want to set the kill flag on the spill since the use will almost always come before the re-definition. But in this case, I don't really understand why we're only setting callee-saved-registers as live-in to the prologue block if they're not live-in to the function.

1992

The comment is still not adequate. All it says is what's obvious from the code - I can tell that we don't want to set the kill flag if the register is live in because the kill flag is set to !IsLiveIn. Don't state what the code does when it's so obvious, but why it does it.

lib/Target/PowerPC/PPCRegisterInfo.cpp
127

So this is now just two new empty lines?

lib/Transforms/IPO/GlobalOpt.cpp
104

It seems to me that 10% is way too high of a default. I would have expected that a call would be considered cold if it has a frequency in the range of what __builtin_expect(..., false) would produce (which I think is 1:2000). Maybe that is excessively low, but I'm fairly certain that 10% is too high.
A function being marked coldcc will end up spilling every register it uses, even if it's a leaf function. If one in 10 passes through the caller result in a call to such a "cold" function, it is likely that any improvements made to the caller as a result of the calee being coldcc are more than offset by the extra frame setup in the callee.

In any case whatever the default, I think that relying solely on the frequency won't yield good results (see my comment on the TTI hook).

2152

No braces on blocks with a single statement.

2159

changeCallSitesToColdCC()?

syzaara updated this revision to Diff 122457.Nov 10 2017, 9:18 AM
syzaara edited the summary of this revision. (Show Details)

Aside from just using the option -ppc-enable-coldcc to use coldcc on functions which are cold at all call sites, PowerPC will also check if the coldcc function is the only function called in it's callers.

nemanjai accepted this revision.Nov 22 2017, 5:46 AM

Other than the tiny issue with an unnecessary cast, LGTM.

I am going to approve this since I'm holding up approval. I think @hfinkel should also have another look at this before you commit so that he can verify that you adequately addressed his comment.

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
245

Hold on, the type of I is Instruction& and we're casting it to Instruction. Why?

This revision is now accepted and ready to land.Nov 22 2017, 5:46 AM

@hfinkel Can you please take another look at this before I commit?

hfinkel added inline comments.Dec 5 2017, 11:19 AM
include/llvm/Analysis/TargetTransformInfo.h
544

callsites -> call sites

545

conv. -> convention.

lib/Target/PowerPC/PPCCallingConv.td
297

Period after non-volatile.

lib/Target/PowerPC/PPCFrameLowering.cpp
1955

Can you explain here how this situation comes up? Is this because we have callee-saved registers that are callee-saved iff they're not used for parameter passing?

lib/Target/PowerPC/PPCTargetTransformInfo.cpp
231

I don't think that this implementation makes sense. A function can call several potential cold calls (which seems prohibited here), and in addition, doing the check this way seems likely to make the whole process O(N^2). Just return true here (unless there's some actual target-specific logic we'd like to have).

lib/Transforms/IPO/GlobalOpt.cpp
2124

callsite -> call site

Either you're referring to CallSite, which is a capitalized class name, or write call site (because the term is actually two words).

2136

callsites -> call sites

syzaara updated this revision to Diff 126818.Dec 13 2017, 1:04 PM
syzaara added inline comments.Dec 13 2017, 1:08 PM
lib/Target/PowerPC/PPCFrameLowering.cpp
1955

When the reg is used for parameter passing, it will still part of the callee-saved list. However, we won't mark it as killed when we are storing to the stack.

hfinkel added inline comments.Dec 14 2017, 10:43 AM
include/llvm/Analysis/TargetTransformInfo.h
546

Given that:

bool useColdCCForColdCalls(Function &F) const;

makes a per-function decision, call in the name shouldn't be plural. Name it useColdCCForColdCall instead.

test/Transforms/GlobalOpt/coldcc_coldsites.ll
2

This test depends on PPC TTI, and so it has to be enabled only when the backend is compiled in. You'll need to create a PowerPC subdirectory of test/Transforms/GlobalOpt and put the test in that subdirectory. The directory also needs a lit.local.cfg. Copy the ones from test/Transforms/LoopVectorize/PowerPC/lit.local.cfg

syzaara updated this revision to Diff 127393.Dec 18 2017, 11:16 AM
hfinkel added inline comments.Dec 19 2017, 9:23 PM
lib/Transforms/IPO/GlobalOpt.cpp
2138

returning false if the caller function contains other non cold calls.

I still don't understand the purpose of this check. The rationale needs to be clearly explained (or just get rid of it). If a function is cold at all call sites, why not convert it to use coldcc? Why does it matter if any of its callers also have non-cold calls?

syzaara added inline comments.Dec 20 2017, 6:08 AM
lib/Transforms/IPO/GlobalOpt.cpp
2138

The motivation to use coldcc is to reduce the caller saved register spills in the caller, however if there is another non coldcc call site, the caller will have to spill the caller saved registers anyway. This defeats the purpose of using coldcc and having the overhead to spill all registers in the cold callee and then again spilling the caller saved registers in the caller around the non coldcc call site. This is why we want to only use coldcc for a call site in a function if all the calls sites in the function can use coldcc.

hfinkel added inline comments.Dec 21 2017, 8:09 AM
lib/Transforms/IPO/GlobalOpt.cpp
2138

If I'm misunderstanding what's going on, please correct me, but as I understand this: If I have a bunch of functions, A, B, C, and D, and these all have a cold path calling F (and no other function calls), then F is converted to be coldcc and all functions benefit from this optimization. Then, one day, an unsuspecting programmer adds a function call into some non-cold path in functions A. Now, suddenly, the performance of B, C, and D is also worse (because we're now no longer converting F to be coldcc, even though it is still only called on cold paths in all functions). This seems Bad. Obviously, sometimes these kinds of non-local effects are unavoidable, but when they're avoidable, we should avoid them. They're horrible for usability. The only justification for doing this would be if, by continuing to convert F to be coldcc, the performance of A would now be worse, but it doesn't seem like that should be the case (and it still might be better, depending on how we do shrink wrapping and where the other call is).

This revision was automatically updated to reflect the committed changes.