Current PRE hoists common computations into
CMBB = DT->findNearestCommonDominator(MBB, MBB1).
However, if CMBB is in a hot loop body, we might get performance
degradation.
Details
- Reviewers
hfinkel jsji steven.zhang Jiangning anton-afanasyev ab rtereshin greened mzolotukhin nemanjai anil9 courbet SjoerdMeijer dmgreen - Commits
- rG19e5da4edc96: Merging r366570: --------------------------------------------------------------…
rL366729: Merging r366570:
rGdec624682e06: [MachineCSE][MachinePRE] Avoid hoisting code from code regions into hot BBs.
rL366570: [MachineCSE][MachinePRE] Avoid hoisting code from code regions into hot BBs.
Diff Detail
Event Timeline
However, if CMBB is in a loop body, we might get performance degradation.
But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop. This latter case seems more likely to me, but do you have performance results from the test suite, or something else, showing otherwise?
Is the problem hoisting out of a cold inner region into a hot loop? Would profiling data help? Is this really a rematerialization problem?
@hfinkel , thanks for review.
do you have performance results from the test suite, or something else, showing otherwise?
Yes. We have observed ~7% degs in one of benchmark due to this.
However, if CMBB is in a loop body, we might get performance degradation.
This conclusion comes from the observation of the test suite code, I'm to paste a reduced case to tests soon, similar to current test, except for its branches are switchs.
But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop.
I do miss this point you have mentioned. @nemanjai has already suggested me to take a look at MachineBlockFrequency.
Is the problem hoisting out of a cold inner region into a hot loop?
I agree with it. This patch should consider more about hotness of loops.
Updated the patch, using MachineBlockFrequency as metric to check if CMBB is appropriate to hoist into.
llvm/test/CodeGen/AArch64/O3-pipeline.ll | ||
---|---|---|
36 ↗ | (On Diff #208891) | irrelevant |
llvm/test/CodeGen/X86/O3-pipeline.ll | ||
33 ↗ | (On Diff #208891) | Please avoid irrelevant changes, commit them in another NFC patch if you would like to change them. |
70 ↗ | (On Diff #208891) | irrelevant |
97 ↗ | (On Diff #208891) | irrelevant changes. |
179 ↗ | (On Diff #208891) | extra line? irrelevant |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
873 | Should this also say something like "if OptForMinSize then return true"? Under the assumption that pre will reduce the codesize. |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
873 | Hello. Sorry. I meant more that - to my understanding - CSE is expected to decrease codesize. PRE can help perform more CSE so is expected to decrease codesize more. At Minsize (Oz) we don't really care which block is hot and which isn't, we just want to decrease codesize as much as possible. Hence this function, when optimising for minsize should just return true. Feel free to correct me if any of that sounds wrong. It probably doesn't make a large difference either way, but we might as well do it whilst we are here. |
Thanks. Looks like a nice change to me, other than one minor modification
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
870 | I think you can use hasMinSize, which is the truly size-paranoid option. Os (hasOptSize) is probably fine with your new block frequency check, if it's expected to speed up some code (and the codesize changes are fairly minimal). |
llvm/lib/CodeGen/MachineCSE.cpp | ||
---|---|---|
875 | I would suggest more conservative < instead of <= here. This essentially makes sense for the cases when all BlockFreqs are unknown (so they are equal to 0). |
Btw, this change breaks multiple (more than two) hoisting to common dominator. I've tested this patch for the original test case taken from here: https://bugs.llvm.org/show_bug.cgi?id=38917. There are several comparisons giving 96 > 40 + 10, 96 > 29 + 10, 96 > 18 + 10 (so no hoisting at all), meanwhile 96 < 97 = 40 + 29 + 18 + 10.
However I do not see easy solution for this issue.
Good point! I think it would be an opportunity for our benchmark. I think I can enhance it in following patches. Maybe we also have to take register pressure into consideration.
I think you can use hasMinSize, which is the truly size-paranoid option.
Os (hasOptSize) is probably fine with your new block frequency check, if it's expected to speed up some code (and the codesize changes are fairly minimal).