This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Fold empty dedicated exit blocks created by loopsimplify.
AbandonedPublic

Authored by bmakam on Jul 18 2017, 3:11 PM.

Details

Summary

If the destination block is almost empty exit block then this block
was created by loopsimplify during LSR when it canonicalized the loop
and can be merged.

Diff Detail

Event Timeline

bmakam planned changes to this revision.Jul 20 2017, 10:50 AM

The underlying issue is that after loopsimplify creates empty exit blocks, CGP cannot clean up the empty blocks if it happens to be coming from a switch case. I am working on an alternative approach to address this issue.

bmakam updated this revision to Diff 108703.Jul 28 2017, 1:48 PM
bmakam retitled this revision from [CGP] Folding into empty latch block is always profitable. to [CGP] Fold empty dedicated exit blocks created by loopsimplify..
bmakam edited the summary of this revision. (Show Details)
bmakam added a reviewer: junbuml.

Updated patch to address the underlying issue.

efriedma added inline comments.Jul 28 2017, 2:55 PM
test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
67

Probably not relevant, but it looks like there's a missed optimization here: we should rotate this loop.

92

If I'm following correctly, the problem is this block: you want it to go away, but cgp isn't folding it.

It looks like isMergingEmptyBlockProfitable is specifically trying to detect cases like this: folding away this BB involves inserting an extra COPY into the while.cond5, and while.cond5 is hotter than while.body.backedge.loopexit, so in theory you could lose performance.

In this particular situation, though, you want to fold it anyway? What distinguishes this testcase from the testcase in r289988?

bmakam added inline comments.Jul 31 2017, 8:36 AM
test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
92

Yes, your understanding is correct. This block was created by loopsimplify during LSR pass to canonicalize the loop such that it has dedicated exit blocks with the understanding that simplifyCFG will clean up blocks which are split out but end up being unnecessary. Since we do not run simplifyCFG after LSR, we rely on CGP to fold this block so that generated code is not pessimized.

If I understand correctly, isMergingEmptyBlockProfitable is trying to workaround the underlying problem i.e. new critical edges cannot be split properly in PhiElimination which results in COPY instructions be inserted into blocks with higher frequency. I'm not sure if GlobalISel can handle this issue, but the temporary solution in r289988 cannot be applied in general as I observed that after r308422, if we do not fold away empty exit blocks it pessimizes the generated code and resulted in a 3% regression in the same benchmark that was initially targeted with r289988.

My first solution was to fold the empty block if it were only a latch block and this avoided the regression caused by r308422 and also kept the gains due to r289988. However, I feel this was papering over the real issue, so I am now folding all the exit blocks because they were likely added by loopsimplify and need to be cleaned up. This recovered 0.7% of the lost regression. I am looking for feedback on what could be a reasonable approach and open for suggestions.

A gentle ping.

efriedma edited edge metadata.Aug 2 2017, 12:16 PM

It's hard to predict what exactly PHIElimination will do at this point in the pipeline... I don't have any better suggestions for a heuristic here.

(GlobalISel is irrelevant, I think; we keep around PHI nodes until register allocation.)

lib/CodeGen/CodeGenPrepare.cpp
657

Whitespace.

test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
50

Needs a comment explaining why you're looking for this pattern.

bmakam updated this revision to Diff 109399.Aug 2 2017, 1:02 PM

Update to address Eli's comments.

bmakam marked 2 inline comments as done.Aug 2 2017, 1:02 PM

To clarify, the original version of this patch recovered the full 3% regression, but the new version only recovers 0.7%?

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

To clarify, the original version of this patch recovered the full 3% regression, but the new version only recovers 0.7%?

Correct.

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

Thanks Eli, this looks interesting. Another observation is that if we increase cgp-freq-ratio-to-skip-merge to 1000 it will recover the full 3% regression. I am trying to reduce the exact basic block which when skipped merging in CGP removes placing copies in hot path and improves performance due to reduced dynamic instruction count and the basic block(s) which when merged eliminate the unnecessary branches and improve the performance due to better branching/i-cache utilization. This might provide an answer to your question about the relevance of PHI operands.

I have looked into the phi operands closely and I am not convinced if phi operands involving constants have any influence on the profitability of merging empty blocks. I identified empty exit blocks if when merged with their successors improved the performance a bit, yet if another similar empty exit block was merged with the destination block, it sinks the performance. The only difference between them is that the successor block is also empty when the performance regressed.

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

I took a stab at this and created D37343. Please take a look.

bmakam abandoned this revision.Sep 9 2017, 10:07 PM

Subsumed by D37343