The idea of this patch is to allow load PRE to 'work' in a fairly common case involving loops. If we haven't peeled a loop, have a load in the header, one path through the loop which clobbers that location, and one or more paths through the loop which do not, we can move the load into the preheader *and* the clobbering path. (See the example below.)
The first point for discussion: are we okay with the duplication? It's safe in that we haven't introduced a load along a dynamic path which previously didn't have one. In the worst case, we add one *extra* load along a dynamic path for each time the loop runs. (Not one per iteration, *one* in the loop preheader.)
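To make the counting concrete (illustrative numbers, not measurements from the patch): if the loop runs N iterations and the clobbering path is taken on K of them, the original code executes N loads, while the transformed code executes 1 + K loads (one in the preheader plus one per clobbering iteration), and 1 + K <= N + 1.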
Unfortunately, the IR presented to GVN/PRE does not make this easy. The canonical loop form created by LoopSimplify merges all of the loop latches into a single latch, and thus a single backedge predecessor for the header block. To treat certain paths differently, I had to add a mechanism to 'look through' some predecessors to get back to the original latches. This is the 'merge block' complexity in the patch.
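For context, here's a rough sketch of what 'looking through' a predecessor means; this is my illustration, not the patch's actual code. The only blocks it would be safe to look through are pure control-flow merges, i.e. blocks containing nothing but PHI nodes and an unconditional branch:

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // A predecessor can be "looked through" if it is a pure control-flow merge:
  // nothing but PHI nodes followed by an unconditional branch.  Any other
  // instruction could touch memory, so availability must be decided there.
  static bool isTransparentMergeBlock(BasicBlock *BB) {
    BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
    if (!BI || !BI->isUnconditional())
      return false;
    // The first non-PHI instruction must be the terminator itself.
    return BB->getFirstNonPHI() == BI;
  }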
I'm reasonably sure at this point that the 'merge block' trick works. It also has a side benefit: it reduces the number of LoadPRE iterations required in some cases (e.g. a tree of inputs with all but one available), and thus might help compile time.
However, it does add a good amount of complexity to the LoadPRE code. Do we think this is worthwhile?
It's worth mentioning that the 'merge block' concept is essentially a specialized form of jump threading. In fact, running '-jump-threading' does help in the loop cases I've looked at. However, running jump threading also inhibits some optimizations currently performed by LoadPRE: if we have a merge point where both inputs are unavailable, PRE will currently insert a single load at the merge, and jump threading breaks this. Keeping that case working as it does today is why the patch only looks through merge blocks with a single unavailable predecessor.
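To illustrate that restriction (again a sketch with names of my own invention, not the patch's code): given the set of blocks where the load is unavailable, we only look through a merge block when at most one of its own predecessors is unavailable; with two or more unavailable inputs we leave the merge intact so today's single-load insertion still applies:

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/CFG.h"
  using namespace llvm;

  // Only look through a merge block when at most one of its predecessors is
  // unavailable; otherwise LoadPRE's existing "insert one load at the merge"
  // behavior is preserved.
  static bool shouldLookThroughMerge(BasicBlock *MergeBB,
                                     const SmallPtrSetImpl<BasicBlock *> &UnavailableBlocks) {
    unsigned NumUnavailable = 0;
    for (pred_iterator PI = pred_begin(MergeBB), E = pred_end(MergeBB);
         PI != E; ++PI)
      if (UnavailableBlocks.count(*PI))
        ++NumUnavailable;
    return NumUnavailable <= 1;
  }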
I'm currently of the belief that the complexity is worthwhile. I'm open to being convinced that either a) it's not, or b) there's a better approach.
Example (before the transform, after loop-simplify has run):
declare void @hold(i32) readonly
declare void @clobber()
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:
  br label %header
header:
  %v1 = load i32* %p
  call void @hold(i32 %v1)
  br i1 %cnd, label %bb1, label %bb2
bb1:
  br label %merge
bb2:
  call void @clobber()
  br label %merge
merge:
  br label %header
}
Example (after the transform):
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:
  %v1.pre = load i32* %p
  br label %header
header:                                 ; preds = %merge, %entry
  %v1 = phi i32 [ %v12, %merge ], [ %v1.pre, %entry ]
  call void @hold(i32 %v1)
  br i1 %cnd, label %bb1, label %bb2
bb1:                                    ; preds = %header
  br label %merge
bb2:                                    ; preds = %header
  call void @clobber()
  %v1.pre1 = load i32* %p
  br label %merge
merge:                                  ; preds = %bb2, %bb1
  %v12 = phi i32 [ %v1.pre1, %bb2 ], [ %v1, %bb1 ]
  br label %header
}
A related question: thoughts on whether it would be worthwhile to use getAnalysisIfAvailable<LoopInfo> for this kind of thing instead?