This is an archive of the discontinued LLVM Phabricator instance.

[MachineCSE][MachinePRE] Do not hoist common computations into hot BBs
ClosedPublic

Authored by lkail on Jul 9 2019, 2:25 AM.

Diff Detail

Event Timeline

lkail created this revision.Jul 9 2019, 2:25 AM

However, if CMBB is in a loop body, we might get performance degradation.

But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop. This latter case seems more likely to me, but do you have performance results from the test suite, or something else, showing otherwise?

Is the problem hoisting out of a cold inner region into a hot loop? Would profiling data help? Is this really a rematerialization problem?

lkail added a comment.EditedJul 9 2019, 6:46 PM

@hfinkel , thanks for review.

do you have performance results from the test suite, or something else, showing otherwise?

Yes. We have observed ~7% degs in one of benchmark due to this.

However, if CMBB is in a loop body, we might get performance degradation.

This conclusion comes from the observation of the test suite code, I'm to paste a reduced case to tests soon, similar to current test, except for its branches are switchs.

But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop.

I do miss this point you have mentioned. @nemanjai has already suggested me to take a look at MachineBlockFrequency.

Is the problem hoisting out of a cold inner region into a hot loop?

I agree with it. This patch should consider more about hotness of loops.

lkail added a reviewer: anil9.Jul 9 2019, 6:56 PM
lkail updated this revision to Diff 208882.Jul 9 2019, 11:47 PM
lkail edited the summary of this revision. (Show Details)
lkail added reviewers: courbet, SjoerdMeijer.

Updated the patch, using MachineBlockFrequency as metric to check if CMBB is appropriate to hoist into.

lkail retitled this revision from [MachineCSE][MachinePRE] Do not hoist common computations into loop bodies to [MachineCSE][MachinePRE] Do not hoist common computations into hot BBs.Jul 9 2019, 11:58 PM
lkail updated this revision to Diff 208891.Jul 10 2019, 1:25 AM
jsji added inline comments.Jul 10 2019, 11:29 AM
llvm/test/CodeGen/AArch64/O3-pipeline.ll
36 ↗(On Diff #208891)

irrelevant

llvm/test/CodeGen/X86/O3-pipeline.ll
33 ↗(On Diff #208891)

Please avoid irrelevant changes, commit them in another NFC patch if you would like to change them.

70 ↗(On Diff #208891)

irrelevant

97 ↗(On Diff #208891)

irrelevant changes.

179 ↗(On Diff #208891)

extra line? irrelevant

lkail updated this revision to Diff 209440.Jul 12 2019, 2:42 AM

Address @jsji 's comments and added new test.

dmgreen added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
873

Should this also say something like "if OptForMinSize then return true"? Under the assumption that pre will reduce the codesize.

lkail marked an inline comment as done.Jul 15 2019, 6:31 PM
lkail added inline comments.
llvm/lib/CodeGen/MachineCSE.cpp
873

Hi @dmgreen , your concern makes sense, since CSE won't eliminate all common computations considering what isProfitableToCSE does. As a result, it might increase size after PRE. I think we can enhance it in following patches.

dmgreen added inline comments.Jul 17 2019, 10:56 AM
llvm/lib/CodeGen/MachineCSE.cpp
873

Hello. Sorry. I meant more that - to my understanding - CSE is expected to decrease codesize. PRE can help perform more CSE so is expected to decrease codesize more. At Minsize (Oz) we don't really care which block is hot and which isn't, we just want to decrease codesize as much as possible. Hence this function, when optimising for minsize should just return true.

Feel free to correct me if any of that sounds wrong. It probably doesn't make a large difference either way, but we might as well do it whilst we are here.

lkail updated this revision to Diff 210478.Jul 17 2019, 7:52 PM

Updated patch following @dmgreen 's suggestion.

dmgreen accepted this revision.Jul 18 2019, 3:40 AM

Thanks. Looks like a nice change to me, other than one minor modification

llvm/lib/CodeGen/MachineCSE.cpp
870

I think you can use hasMinSize, which is the truly size-paranoid option.

Os (hasOptSize) is probably fine with your new block frequency check, if it's expected to speed up some code (and the codesize changes are fairly minimal).

This revision is now accepted and ready to land.Jul 18 2019, 3:40 AM
llvm/lib/CodeGen/MachineCSE.cpp
875

I would suggest more conservative < instead of <= here. This essentially makes sense for the cases when all BlockFreqs are unknown (so they are equal to 0).

Btw, this change breaks multiple (more than two) hoisting to common dominator. I've tested this patch for the original test case taken from here: https://bugs.llvm.org/show_bug.cgi?id=38917. There are several comparisons giving 96 > 40 + 10, 96 > 29 + 10, 96 > 18 + 10 (so no hoisting at all), meanwhile 96 < 97 = 40 + 29 + 18 + 10.
However I do not see easy solution for this issue.

lkail added a comment.Jul 18 2019, 6:29 PM

Btw, this change breaks multiple (more than two) hoisting to common dominator. I've tested this patch for the original test case taken from here: https://bugs.llvm.org/show_bug.cgi?id=38917. There are several comparisons giving 96 > 40 + 10, 96 > 29 + 10, 96 > 18 + 10 (so no hoisting at all), meanwhile 96 < 97 = 40 + 29 + 18 + 10.
However I do not see easy solution for this issue.

Good point! I think it would be an opportunity for our benchmark. I think I can enhance it in following patches. Maybe we also have to take register pressure into consideration.

lkail updated this revision to Diff 210724.Jul 18 2019, 6:38 PM

Use hasMinSize to check if optimized for size.

This revision was automatically updated to reflect the committed changes.