Page MenuHomePhabricator

Allow PRE to duplicate loads in LICM like loop case
AbandonedPublic

Authored by reames on Jan 19 2015, 4:37 PM.

Details

Summary

The idea of this patch is to allow load PRE to 'work' in a fairly common case involving loops. If we haven't pealed a loop, have a load in the header, one path through the loop which clobbers that location, and one or more path through the loop that does not, we can move the load into the preheader *and* the clobbering path. (see example below)

The first point for discussion: are we okay with the duplication? It's safe in that we haven't introduced a load along a dynamic path which previously didn't have one. In the worst case, we add one *extra* load along a dynamic path for each time the loop runs. (Not one per iteration, *one* in the loop preheader.)

Unfortunately, the IR presented to GVN/PRE does not make this easy. The canonical loop form created by LoopSimplify merges all of the loop latches into a single latch, and thus single predecessor for the header block. In order to treat certain paths differently, I had to add a mechanism to 'look through' some predecessors to get back to the original latches. This is the 'merge block' complexity in the patch.

I'm reasonable sure at this point that the 'merge block' trick works. It also has a side benefit in that it reduces the number of LoadPRE iterations required in some cases (e.g. you have a tree of inputs with all but one available), and thus might help compile time in some cases.

However, it does add a good amount of complexity to the LoadPRE code. Do we think this is worthwhile?

It's worth mentioning that the 'merge block' concept is essentially a specialized form of jump threading. In fact, running '-jump-threading' does help in the loop cases I've looked at. However, running jump threading also inhibits some optimizations currently performed by LoadPRE. If we have a merge point where both inputs are unavailable, PRE will currently insert a single load at the merge. Running jump threading breaks this. Keeping that case working as it does today is why we only look through merge blocks with a single unavailable predeccesor in the patch.

I'm currently of the belief that the complexity is worthwhile. I'm open to being convinced that either a) it's not, or b) there's a better approach.

Example (before, after loop simplify):
declare void @hold(i32) readonly
declare void @clobber()
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:

br label %header

header:

%v1 = load i32* %p
call void @hold(i32 %v1)
br i1 %cnd, label %bb1, label %bb2

bb1:

br label %merge

bb2:

call void @clobber()
br label %merge

merge:

br label %header

}

Example (after):
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:

%v1.pre = load i32* %p
br label %header

header: ; preds = %merge, %entry

%v1 = phi i32 [ %v12, %merge ], [ %v1.pre, %entry ]
call void @hold(i32 %v1)
br i1 %cnd, label %bb1, label %bb2

bb1: ; preds = %header

br label %merge

bb2: ; preds = %header

call void @clobber()
%v1.pre1 = load i32* %p
br label %merge

merge: ; preds = %bb2, %bb1

%v12 = phi i32 [ %v1.pre1, %bb2 ], [ %v1, %bb1 ]
br label %header

}

Diff Detail

Event Timeline

reames updated this revision to Diff 18405.Jan 19 2015, 4:37 PM
reames retitled this revision from to Allow PRE to duplicate loads in LICM like loop case.
reames updated this object.
reames edited the test plan for this revision. (Show Details)
reames added a subscriber: Unknown Object (MLST).
hfinkel edited edge metadata.Jan 25 2015, 7:29 AM

Generally speaking, I'm in favor of this. We only loose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

It's worth mentioning that the 'merge block' concept is essentially a specialized form of jump threading. In fact, running '-jump-threading' does help in the loop cases I've looked at. However, running jump threading also inhibits some optimizations currently performed by LoadPRE. If we have a merge point where both inputs are unavailable, PRE will currently insert a single load at the merge. Running jump threading breaks this. Keeping that case working as it does today is why we only look through merge blocks with a single unavailable predeccesor in the patch.

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

lib/Transforms/Scalar/GVN.cpp
1524

Thoughts on whether or not it is worthwhile to use getAnalysisIfAvailable<LoopInfo> for this kind of thing instead?

dberlin edited edge metadata.Jan 25 2015, 10:31 AM

I'm also generally fine with this, if we are fine with destroying later if
we rewrite this all :)

reames updated this revision to Diff 19056.Jan 30 2015, 1:57 PM
reames updated this object.
reames edited edge metadata.

This is the "real patch" given the overall direction was found reasonable. Please review.

reames updated this revision to Diff 19058.Jan 30 2015, 1:59 PM

Remove debug output code which shouldn't have been in previous version.

Generally speaking, I'm in favor of this. We only loose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

I agree we can, but should we? You might still be able to simplify the other path through the loop. It's not clearly profitable, but it's also not clearly not profitable either.

I'm open to implementing the BFI based filter if you want to see it.

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

I'm not sure if we have the O3 test, but we do have tests for GVN/PRE which would break if the jump threading behavior were integrated into GVN. I think that's actually what we want. I really suspect the current need for redundant merge points is just papering over deficiencies in the PRE code. In fact, there's a FIXME that says exactly that. :)

Also, writing an O3 test for this would be very hard. You'd need to construct it in a way where no pass can destroy the redundant control flow, but it's exposed to jump threading. This seems way too fragile to be worthwhile.

Generally speaking, I'm in favor of this. We only loose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

I agree we can, but should we? You might still be able to simplify the other path through the loop. It's not clearly profitable, but it's also not clearly not profitable either.

I'm open to implementing the BFI based filter if you want to see it.

I would really like to see it. -- It may yet turn out not to be practical, but I'd like the experiment to be performed.

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

I'm not sure if we have the O3 test, but we do have tests for GVN/PRE which would break if the jump threading behavior were integrated into GVN. I think that's actually what we want. I really suspect the current need for redundant merge points is just papering over deficiencies in the PRE code. In fact, there's a FIXME that says exactly that. :)

Also, writing an O3 test for this would be very hard. You'd need to construct it in a way where no pass can destroy the redundant control flow, but it's exposed to jump threading. This seems way too fragile to be worthwhile.

Okay, fair enough.

hfinkel added inline comments.Jan 30 2015, 3:11 PM
lib/Transforms/Scalar/GVN.cpp
1537

How do you know that you'll never have more than one properly-dominating blocks to this potential loop-header block?

1585

I'd not remove this blank line.

1612

Don't need { } here.

1700

Don't need { } here either (might as well fix that while you're here).

1843

Remove this.

chandlerc resigned from this revision.Mar 29 2015, 2:27 PM
chandlerc removed a reviewer: chandlerc.

Leaving the review of this to Hal and Danny

reames abandoned this revision.Aug 24 2018, 2:49 PM