This patch adds support for the coldcc calling convention for Power. This changes the set of non-volatile registers. It includes a pass to stress test the implementation by marking all static directly called functions with the coldcc attribute through the option -enable-coldcc-stress-test. It also includes an option, -ppc-enable-coldcc, to add the coldcc attribute to functions which are cold at all call sites based on BlockFrequencyInfo.
Details
- Reviewers
kbarton nemanjai lei sfertile jtony stefanp gyiu inouehrs hfinkel echristo - Commits
- rG1f59ae311bc2: Re-commit : [PowerPC] Add handling for ColdCC calling convention and a pass to…
rG8e951fd2f626: [PowerPC] Add handling for ColdCC calling convention and a pass to mark…
rL323778: Re-commit : [PowerPC] Add handling for ColdCC calling convention and a pass to…
rL322721: [PowerPC] Add handling for ColdCC calling convention and a pass to mark
Diff Detail
Event Timeline
lib/Target/PowerPC/PPCColdCC.cpp | ||
---|---|---|
1 ↗ | (On Diff #117174) | Can you please merge this logic into GlobalOpt? That's where we would currently set the FastCC calling convention. We could make that logic set the FastCC, or ColdCC, calling convention as appropriate. AFAIKT, there's nothing really target-dependent here. If need be, we can add a TTI hook to enable it for particular targets. It's not clear to me that should be necessary. |
90 ↗ | (On Diff #117174) | Please make this value a cl::opt so that we can easily experiment with tuning it as necessary. |
120 ↗ | (On Diff #117174) | Is this check actually needed? Doesn't the backend handle this like it handles FastCC (i.e., if the target does nothing special, it just falls back to using the default calling convention). If something in this patch breaks that fallback, we need to clean that up. It should always be legal to add coldcc to an internal call regardless of whether the target does anything special with it. |
124 ↗ | (On Diff #117174) | Reverse the order: if (EnableColdCCStressTest || hasColdCallSite(F)) { the cheaper-to-evaluate expression should be first. |
An important aspect of the testing here is missing. Namely checking for the coldcc calling convention on the IR. You should be able to accomplish this either through an MIR test case or just a test case running opt.
lib/Target/PowerPC/PPCColdCC.cpp | ||
---|---|---|
1 ↗ | (On Diff #117174) | I am all for making this target independent, but I would definitely opt for a TTI hook. The reasons being the following:
|
9 ↗ | (On Diff #117174) | Really? Not even something minimal like: /// Sets the calling convention on cold internal functions /// and their callsites to coldcc. :) |
27 ↗ | (On Diff #117174) | I don't think it's a particularly good idea to have the DEBUG_TYPE be exactly the name of the calling convention. |
50 ↗ | (On Diff #117174) | Do we not want to append |
61 ↗ | (On Diff #117174) | I'm not a fan of data members such as this. This isn't really a property of the module but the function. Furthermore, this is a data member that is only used in one function - and it's reset for every function encountered. Why is it not local to the function then? |
68 ↗ | (On Diff #117174) | What happens if no users are found? I'm not sure if that's possible or if it would already be discarded. But if it might still happen, seems like you don't want to bother marking it coldcc. |
81 ↗ | (On Diff #117174) | I don't really understand this approach. You loop over all the call sites, count how many you've seen. For each, you check if it's cold and increment the number of cold call sites. Finally, you return whether or not the two numbers are equal. Aside from the fact that the return statement is really just return NumCallSites == NumColdCallSites;, I don't understand the need to loop over all the call sites. Why is this function not just hasNonColdSite() which will return whether any sites are not cold (i.e. early exit with true if a non-cold site is found).
|
87 ↗ | (On Diff #117174) | This applies to all the member functions here - they're all in need of comments explaining what they do. For this one, it appears that it checks whether the block that contains the call has a BlockFrequency less than 10% of it's parent's entry block. I think you should make this explicit in the comment (of course, you'll not use 10% when you convert this to a cl::opt value, but the approach is still the same). |
90 ↗ | (On Diff #117174) | I agree. The option can take a percentage and please make that clear in the description of the option. |
lib/Target/PowerPC/PPCFastISel.cpp | ||
209 | Why is this needed? I would imagine that with FastISel, we don't want to handle coldcc specially. I don't imagine handling it would add much (anything) to the compile time, but it just seems fundamentally like something we don't want to do - apply these ABI breaking optimizations when emitting code for -O0. | |
lib/Target/PowerPC/PPCFrameLowering.cpp | ||
1953–1960 | The code no longer matches the comment. You should update the comment to explain why a callee-saved register might not be live-in to the function. | |
1954 | Variables start with uppercase. | |
1990–1993 | This really needs a comment. It should be clear why we keep a register live after spilling if it wasn't live-in to the function. | |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
6392–6395 | I think a ternary operator would make it more clear that you're calling the exact same function with a different argument. Same below of course. | |
lib/Target/PowerPC/PPCRegisterInfo.cpp | ||
126 | Why did this need to move up? It does not appear like a change that is needed since the first use of it is still below where it was before. |
The coldcc handling is now moved to GlobalOpt. A target can enable coldcc for internal functions which are cold at all call sites by overriding enableColdCC, or enable stress testing to mark all internal functions with coldcc by setting option -enable-coldcc-stress-test. The coldness threshold can be controlled with the option -coldcc-rel-freq.
lib/Target/PowerPC/PPCFastISel.cpp | ||
---|---|---|
209 | This is just for suppressing compiler warning messages that arise from including "PPCGenCallingConv.inc" and not having a use for the function RetCC_PPC_Cold defined in PPCGenCallingConv.inc. It's mentioned in the comment above as well. |
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
546 | How about useCodeCCForColdCalls()? (I don't like enableColdCC() because coldcc is always "enabled", the question here is whether we actively form such calls.) | |
lib/Transforms/IPO/GlobalOpt.cpp | ||
2216 | The fastcc conversion is currently guarded by isProfitableToMakeFastCC, which really just does this: /// Return true if this is a calling convention that we'd like to change. The /// idea here is that we don't want to mess with the convention if the user /// explicitly requested something with performance implications like coldcc, /// GHC, or anyregcc. static bool isProfitableToMakeFastCC(Function *F) { CallingConv::ID CC = F->getCallingConv(); // FIXME: Is it worth transforming x86_stdcallcc and x86_fastcallcc? return CC == CallingConv::C || CC == CallingConv::X86_ThisCall; } and we should have the same guard (which will also make sure we don't explicitly change something marked fastcc to coldcc). I recommend renaming this function to something like: hasChangeableCC(Function *F) and then using it in both places. |
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
546 | I think that is meant to be useColdCCForColdCalls() and if so, I agree with that name. But more importantly than the name, I don't feel that this should be an "always-on/always-off" type of query. So what I'm getting at is that I don't think the call frequency is enough to make the decision and I am suggesting that since the real meaning of coldcc is kind of target-specific, maybe the decision to mark functions with it should be as well. Perhaps the right thing to do would be to have the default semantics implemented in the TTI base class, allowing the target to override it. | |
lib/Target/PowerPC/PPCCallingConv.td | ||
48 | Period. Complete sentences in comments please. | |
302 | The comment doesn't explain the TOC and stack pointer registers that are clearly missing from the list here. Please add a comment that we never consider the stack pointer a CSR and that we have versions of these lists with and without the TOC pointer (used depending on whether the function uses the TOC and is a non-leaf). | |
lib/Target/PowerPC/PPCFrameLowering.cpp | ||
1956 | Much like the comment below, this states the what but not the why. Namely, why should the registers that are live-in to the function not be marked as live-in to the block where the prologue is being inserted? Furthermore in the below case, I can make sense of why we don't want to set the kill flag on the spill since the use will almost always come before the re-definition. But in this case, I don't really understand why we're only setting callee-saved-registers as live-in to the prologue block if they're not live-in to the function. | |
1992 | The comment is still not adequate. All it says is what's obvious from the code - I can tell that we don't want to set the kill flag if the register is live in because the kill flag is set to !IsLiveIn. Don't state what the code does when it's so obvious, but why it does it. | |
lib/Target/PowerPC/PPCRegisterInfo.cpp | ||
127 | So this is now just two new empty lines? | |
lib/Transforms/IPO/GlobalOpt.cpp | ||
104 | It seems to me that 10% is way too high of a default. I would have expected that a call would be considered cold if it has a frequency in the range of what __builtin_expect(..., false) would produce (which I think is 1:2000). Maybe that is excessively low, but I'm fairly certain that 10% is too high. In any case whatever the default, I think that relying solely on the frequency won't yield good results (see my comment on the TTI hook). | |
2152 | No braces on blocks with a single statement. | |
2159 | changeCallSitesToColdCC()? |
Aside from just using the option -ppc-enable-coldcc to use coldcc on functions which are cold at all call sites, PowerPC will also check if the coldcc function is the only function called in it's callers.
Other than the tiny issue with an unnecessary cast, LGTM.
I am going to approve this since I'm holding up approval. I think @hfinkel should also have another look at this before you commit so that he can verify that you adequately addressed his comment.
lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
---|---|---|
245 | Hold on, the type of I is Instruction& and we're casting it to Instruction. Why? |
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
544 | callsites -> call sites | |
545 | conv. -> convention. | |
lib/Target/PowerPC/PPCCallingConv.td | ||
297 | Period after non-volatile. | |
lib/Target/PowerPC/PPCFrameLowering.cpp | ||
1955 | Can you explain here how this situation comes up? Is this because we have callee-saved registers that are callee-saved iff they're not used for parameter passing? | |
lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
231 | I don't think that this implementation makes sense. A function can call several potential cold calls (which seems prohibited here), and in addition, doing the check this way seems likely to make the whole process O(N^2). Just return true here (unless there's some actual target-specific logic we'd like to have). | |
lib/Transforms/IPO/GlobalOpt.cpp | ||
2124 | callsite -> call site Either you're referring to CallSite, which is a capitalized class name, or write call site (because the term is actually two words). | |
2136 | callsites -> call sites |
lib/Target/PowerPC/PPCFrameLowering.cpp | ||
---|---|---|
1955 | When the reg is used for parameter passing, it will still part of the callee-saved list. However, we won't mark it as killed when we are storing to the stack. |
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
546 | Given that: bool useColdCCForColdCalls(Function &F) const; makes a per-function decision, call in the name shouldn't be plural. Name it useColdCCForColdCall instead. | |
test/Transforms/GlobalOpt/coldcc_coldsites.ll | ||
2 | This test depends on PPC TTI, and so it has to be enabled only when the backend is compiled in. You'll need to create a PowerPC subdirectory of test/Transforms/GlobalOpt and put the test in that subdirectory. The directory also needs a lit.local.cfg. Copy the ones from test/Transforms/LoopVectorize/PowerPC/lit.local.cfg |
lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2138 |
I still don't understand the purpose of this check. The rationale needs to be clearly explained (or just get rid of it). If a function is cold at all call sites, why not convert it to use coldcc? Why does it matter if any of its callers also have non-cold calls? |
lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2138 | The motivation to use coldcc is to reduce the caller saved register spills in the caller, however if there is another non coldcc call site, the caller will have to spill the caller saved registers anyway. This defeats the purpose of using coldcc and having the overhead to spill all registers in the cold callee and then again spilling the caller saved registers in the caller around the non coldcc call site. This is why we want to only use coldcc for a call site in a function if all the calls sites in the function can use coldcc. |
lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2138 | If I'm misunderstanding what's going on, please correct me, but as I understand this: If I have a bunch of functions, A, B, C, and D, and these all have a cold path calling F (and no other function calls), then F is converted to be coldcc and all functions benefit from this optimization. Then, one day, an unsuspecting programmer adds a function call into some non-cold path in functions A. Now, suddenly, the performance of B, C, and D is also worse (because we're now no longer converting F to be coldcc, even though it is still only called on cold paths in all functions). This seems Bad. Obviously, sometimes these kinds of non-local effects are unavoidable, but when they're avoidable, we should avoid them. They're horrible for usability. The only justification for doing this would be if, by continuing to convert F to be coldcc, the performance of A would now be worse, but it doesn't seem like that should be the case (and it still might be better, depending on how we do shrink wrapping and where the other call is). |
callsites -> call sites