This is an archive of the discontinued LLVM Phabricator instance.

Allow PRE to duplicate loads in LICM like loop case
Abandoned (Public)

Authored by reames on Jan 19 2015, 4:37 PM.

Details

Reviewers

bob.wilson
nadav
dberlin
nicholas
resistor
hfinkel

Summary

The idea of this patch is to allow load PRE to 'work' in a fairly common case involving loops. If we haven't peeled a loop, and we have a load in the header, one path through the loop which clobbers that location, and one or more paths through the loop that do not, we can move the load into the preheader *and* the clobbering path. (See example below.)

The first point for discussion: are we okay with the duplication? It's safe in that we haven't introduced a load along a dynamic path which previously didn't have one. In the worst case, we add one *extra* load along a dynamic path for each time the loop runs. (Not one per iteration, *one* in the loop preheader.)
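The load-count claim can be checked on a toy model. The sketch below is not LLVM code: `LoadCounts` and `countLoads` are hypothetical names, and it assumes each iteration runs the header and then exactly one of the two paths, with the loop exiting after the last path. Under that assumption the transformed loop executes one preheader load plus one reload per clobbering iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dynamic load counts for one run of the loop (toy model, not LLVM code).
struct LoadCounts {
  std::size_t before; // loads executed by the original loop
  std::size_t after;  // loads executed after the PRE transform
};

// clobbers[i] is true when iteration i takes the clobbering path.
// Assumption: every iteration executes the header once, then one path.
LoadCounts countLoads(const std::vector<bool> &clobbers) {
  std::size_t reloads = 0;
  for (bool tookClobberPath : clobbers)
    if (tookClobberPath)
      ++reloads;
  // Before: one header load per iteration.
  // After: one preheader load, plus one reload per clobbering iteration.
  return {clobbers.size(), 1 + reloads};
}
```

When no iteration clobbers, the count drops from one load per iteration to a single load. When every iteration clobbers, the transformed loop executes exactly one load more than the original, which is the worst case described above.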

Unfortunately, the IR presented to GVN/PRE does not make this easy. The canonical loop form created by LoopSimplify merges all of the loop latches into a single latch, and thus single predecessor for the header block. In order to treat certain paths differently, I had to add a mechanism to 'look through' some predecessors to get back to the original latches. This is the 'merge block' complexity in the patch.

I'm reasonably sure at this point that the 'merge block' trick works. It also has a side benefit in that it reduces the number of LoadPRE iterations required in some cases (e.g. you have a tree of inputs with all but one available), and thus might help compile time in some cases.

However, it does add a good amount of complexity to the LoadPRE code. Do we think this is worthwhile?

It's worth mentioning that the 'merge block' concept is essentially a specialized form of jump threading. In fact, running '-jump-threading' does help in the loop cases I've looked at. However, running jump threading also inhibits some optimizations currently performed by LoadPRE. If we have a merge point where both inputs are unavailable, PRE will currently insert a single load at the merge. Running jump threading breaks this. Keeping that case working as it does today is why we only look through merge blocks with a single unavailable predecessor in the patch.
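The look-through rule can be illustrated on a toy CFG. This is a simplified sketch, not the GVN implementation: `ToyCFG`, `predsOf`, `worthLookingThrough`, and `insertionCandidates` are hypothetical names, blocks are plain strings, and availability is given as a precomputed set rather than derived from memory dependence analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Block = std::string;

// Toy CFG (hypothetical, for illustration only).
struct ToyCFG {
  std::map<Block, std::vector<Block>> preds; // block -> its predecessors
  std::set<Block> mergeOnly; // blocks holding only PHIs and an unconditional branch
  std::set<Block> available; // blocks where the loaded value is fully available
};

const std::vector<Block> &predsOf(const ToyCFG &g, const Block &b) {
  static const std::vector<Block> none;
  auto it = g.preds.find(b);
  return it == g.preds.end() ? none : it->second;
}

// Look through a merge block only if exactly one of its predecessors is
// unavailable, so a single load is still enough to cover it.
bool worthLookingThrough(const ToyCFG &g, const Block &merge) {
  int unavailable = 0;
  for (const Block &p : predsOf(g, merge))
    if (!g.available.count(p))
      ++unavailable;
  return unavailable == 1;
}

// Candidate insertion blocks for a load in loadBB: its predecessors, with
// qualifying merge blocks replaced by their own predecessors.
std::vector<Block> insertionCandidates(const ToyCFG &g, const Block &loadBB) {
  std::vector<Block> work = predsOf(g, loadBB);
  std::set<Block> seen(work.begin(), work.end());
  std::vector<Block> result;
  for (std::size_t i = 0; i < work.size(); ++i) {
    Block b = work[i]; // copy: work may grow below
    if (!g.available.count(b) && g.mergeOnly.count(b) &&
        worthLookingThrough(g, b)) {
      for (const Block &p : predsOf(g, b))
        if (seen.insert(p).second)
          work.push_back(p);
      continue; // the merge block itself is dropped from the candidates
    }
    result.push_back(b);
  }
  return result;
}
```

For the example CFG in this summary, with the value available only in bb1, the candidates come out as entry, bb1 and bb2. If both bb1 and bb2 were unavailable, the merge block would be kept as a single candidate, which preserves the existing behavior of inserting one load at the merge.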

I'm currently of the belief that the complexity is worthwhile. I'm open to being convinced that either a) it's not, or b) there's a better approach.

Example (before, after loop simplify):
declare void @hold(i32) readonly
declare void @clobber()
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:
  br label %header

header:
  %v1 = load i32* %p
  call void @hold(i32 %v1)
  br i1 %cnd, label %bb1, label %bb2

bb1:
  br label %merge

bb2:
  call void @clobber()
  br label %merge

merge:
  br label %header
}

Example (after):
define i32 @test1(i1 %cnd, i32* dereferenceable(400) %p) {
entry:
  %v1.pre = load i32* %p
  br label %header

header: ; preds = %merge, %entry
  %v1 = phi i32 [ %v12, %merge ], [ %v1.pre, %entry ]
  call void @hold(i32 %v1)
  br i1 %cnd, label %bb1, label %bb2

bb1: ; preds = %header
  br label %merge

bb2: ; preds = %header
  call void @clobber()
  %v1.pre1 = load i32* %p
  br label %merge

merge: ; preds = %bb2, %bb1
  %v12 = phi i32 [ %v1.pre1, %bb2 ], [ %v1, %bb1 ]
  br label %header
}

Event Timeline

reames updated this revision to Diff 18405. Jan 19 2015, 4:37 PM

reames retitled this revision from to Allow PRE to duplicate loads in LICM like loop case.

reames updated this object.

reames edited the test plan for this revision.

reames added reviewers: dberlin, resistor, nadav, bob.wilson, hfinkel, nicholas, chandlerc.

reames added a subscriber: Unknown Object (MLST).

Generally speaking, I'm in favor of this. We only lose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

It's worth mentioning that the 'merge block' concept is essentially a specialized form of jump threading. In fact, running '-jump-threading' does help in the loop cases I've looked at. However, running jump threading also inhibits some optimizations currently performed by LoadPRE. If we have a merge point where both inputs are unavailable, PRE will currently insert a single load at the merge. Running jump threading breaks this. Keeping that case working as it does today is why we only look through merge blocks with a single unavailable predecessor in the patch.

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

lib/Transforms/Scalar/GVN.cpp
1524	Thoughts on whether or not it is worthwhile to use getAnalysisIfAvailable<LoopInfo> for this kind of thing instead?

I'm also generally fine with this, as long as we are fine with destroying it later if we rewrite all of this :)

This is the "real patch" given the overall direction was found reasonable. Please review.

Remove debug output code which shouldn't have been in previous version.

In D7061#113287, @hfinkel wrote:

Generally speaking, I'm in favor of this. We only lose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

I agree we can, but should we? You might still be able to simplify the other path through the loop. It's not clearly profitable, but it's not clearly unprofitable either.

I'm open to implementing the BFI-based filter if you want to see it.

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

I'm not sure if we have the O3 test, but we do have tests for GVN/PRE which would break if the jump threading behavior were integrated into GVN. I think that's actually what we want. I really suspect the current need for redundant merge points is just papering over deficiencies in the PRE code. In fact, there's a FIXME that says exactly that. :)

Also, writing an O3 test for this would be very hard. You'd need to construct it in a way where no pass can destroy the redundant control flow, but it's exposed to jump threading. This seems way too fragile to be worthwhile.

In D7061#116088, @reames wrote:

In D7061#113287, @hfinkel wrote:

Generally speaking, I'm in favor of this. We only lose in the case where the loop trip count is (very) small and the additional load is on a relatively-hot path through the loop. However, I think we can use BPI to filter out such cases.

I agree we can, but should we? You might still be able to simplify the other path through the loop. It's not clearly profitable, but it's not clearly unprofitable either.

I'm open to implementing the BFI-based filter if you want to see it.

I would really like to see it. -- It may yet turn out not to be practical, but I'd like the experiment to be performed.
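One way such a filter might be framed is as a comparison of expected load counts. The sketch below is purely illustrative and does not use the LLVM BranchProbabilityInfo or BlockFrequencyInfo APIs: `preLooksProfitable`, `tripCount`, and `clobberProb` are hypothetical names, and the cost model (one load per iteration before, one preheader load plus one reload per clobbering iteration after) is an assumption, not something taken from the patch.

```cpp
#include <cassert>

// Hypothetical profitability filter (not part of the patch).
// tripCount:   estimated number of loop iterations per loop entry.
// clobberProb: estimated probability that an iteration takes the
//              clobbering path.
bool preLooksProfitable(double tripCount, double clobberProb) {
  // Original loop: one load in the header per iteration.
  double loadsBefore = tripCount;
  // Transformed loop: one preheader load, plus a reload on each
  // iteration that takes the clobbering path.
  double loadsAfter = 1.0 + clobberProb * tripCount;
  return loadsAfter < loadsBefore;
}
```

Under this model a loop with 100 iterations and a 10% clobber probability is a clear win, while a loop with 2 iterations and a 90% clobber probability is rejected, matching the "small trip count, hot clobbering path" case raised in the review.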

As a side comment, do we have a regression test covering this behavior? I believe this is a case where it is appropriate to have a regression test that runs opt -O3 (to get the default optimization pipeline), to make sure we don't disturb this current behavior. If we don't, can you please add one?

I'm not sure if we have the O3 test, but we do have tests for GVN/PRE which would break if the jump threading behavior were integrated into GVN. I think that's actually what we want. I really suspect the current need for redundant merge points is just papering over deficiencies in the PRE code. In fact, there's a FIXME that says exactly that. :)

Also, writing an O3 test for this would be very hard. You'd need to construct it in a way where no pass can destroy the redundant control flow, but it's exposed to jump threading. This seems way too fragile to be worthwhile.

Okay, fair enough.

hfinkel added inline comments. Jan 30 2015, 3:11 PM

lib/Transforms/Scalar/GVN.cpp
1537	How do you know that you'll never have more than one properly dominating block for this potential loop-header block?
1585	I'd not remove this blank line.
1612	Don't need { } here.
1700	Don't need { } here either (might as well fix that while you're here).
1843	Remove this.
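The predecessor classification that the comment on line 1537 asks about can be explored on a toy CFG. The sketch below is not LLVM code and does not use `DominatorTree`: `Graph`, `computeDominators`, `PredKind`, and `classifyPred` are hypothetical names, dominators are computed with a simple iterative dataflow, and every block is assumed reachable from block 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocks are numbered 0..N-1 and block 0 is the entry; succs[b] lists the
// successors of b. (Toy representation, assumed for illustration.)
using Graph = std::vector<std::vector<int>>;

// Returns dom where dom[b][d] is true when block d dominates block b.
std::vector<std::vector<bool>> computeDominators(const Graph &succs) {
  const std::size_t n = succs.size();
  std::vector<std::vector<int>> preds(n);
  for (std::size_t b = 0; b < n; ++b)
    for (int s : succs[b])
      preds[s].push_back(static_cast<int>(b));

  // Start from "everything dominates everything" and shrink to a fixpoint.
  std::vector<std::vector<bool>> dom(n, std::vector<bool>(n, true));
  dom[0].assign(n, false);
  dom[0][0] = true;
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t b = 1; b < n; ++b) {
      std::vector<bool> next(n, true);
      for (int p : preds[b])
        for (std::size_t d = 0; d < n; ++d)
          next[d] = next[d] && dom[p][d];
      next[b] = true; // every block dominates itself
      if (next != dom[b]) {
        dom[b] = next;
        changed = true;
      }
    }
  }
  return dom;
}

enum class PredKind { Preheader, InLoop, Neither };

// Mirrors the three-way split in isLoopHeader: a predecessor that properly
// dominates the header, one dominated by the header, or neither.
PredKind classifyPred(const std::vector<std::vector<bool>> &dom, int header,
                      int pred) {
  if (pred != header && dom[header][pred])
    return PredKind::Preheader;
  if (dom[pred][header])
    return PredKind::InLoop;
  return PredKind::Neither;
}
```

On the example CFG from the summary (entry, header, bb1, bb2, merge numbered 0 to 4), entry is classified as the preheader and merge as an in-loop predecessor. For a cycle with two entry edges, such as 0->1, 0->2, 1->2, 2->1, the predecessor 2 of block 1 falls into the "neither" case, which is where `isLoopHeader` gives up.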

Leaving the review of this to Hal and Danny

reames abandoned this revision. Aug 24 2018, 2:49 PM

Herald added subscribers: jfb, bollu, mcrosier. Aug 24 2018, 2:49 PM

reames mentioned this in D99926: [GVN] Introduce loop load PRE. Apr 8 2021, 12:42 PM

Revision Contents

Path                                       Size
lib/Transforms/Scalar/GVN.cpp              184 lines
test/Transforms/GVN/pre-loopsimplify.ll    113 lines

Diff 19058

lib/Transforms/Scalar/GVN.cpp

[... 292 lines not shown ...]
       ValuesPerBlock.push_back(AvailableValueInBlock::getLoad(DepBB, LD));
       continue;
     }

     UnavailableBlocks.push_back(DepBB);
   }
 }

+/// Return true if 'Header' is actually the header of a valid loop. PredBlocks
+/// are either direct predecessors of the 'Header' BB or indirect predecessors
+/// through 'merge only' blocks.
+static bool isLoopHeader(BasicBlock* Header,
+                         SmallVector<BasicBlock *, 4>& PredBlocks,
+                         DominatorTree &DT) {
+  // TODO: If LoopInfo is available, just use that.
(inline comment) hfinkel: Thoughts on whether or not it is worthwhile to use getAnalysisIfAvailable<LoopInfo> for this kind of thing instead?
+  if (PredBlocks.size() <= 1) {
+    // Without at least two predecessors, this can't be a loop
+    return false;
+  }
+  // We can't assume we're given a loop header. We need to ensure that all of
+  // the non-merging predecessors are either inside the loop (i.e. dominated by
+  // the header) or a single preheader. If we find any block which doesn't meet
+  // this requirement, we do not have a natural loop.
+  // Note: The term 'preheader' is being slightly abused here. We allow
+  BasicBlock *LoopPred = nullptr;
+  for (BasicBlock *Pred : PredBlocks) {
+    if (DT.properlyDominates(Pred, Header)) {
+      assert(!LoopPred && "Can only have one preheader!");
(inline comment) hfinkel: How do you know that you'll never have more than one properly dominating block for this potential loop-header block?
+      LoopPred = Pred;
+    } else if (DT.dominates(Header, Pred)) {
+      // Iff this is a loop, this is either a latch, or a block which leads to
+      // a latch (i.e. via a merge only block)
+    } else {
+      // Not a natural loop at all
+      return false;
+    }
+  }
+  // At this point, we've seen all the predecessors, did we find one from
+  // outside the loop? If not, this is dead code and should just be deleted.
+  return LoopPred;
+}


+/// Return true if this is a natural loop header and at least one 'real latch'
+/// (i.e. what was a latch before LoopSimplify introduced a merge point) is
+/// fully available. The idea here is that we want to perform a transform
+/// analogous to LICM, but also allow pushing an extra copy of the load into
+/// one of several paths which lead to the backedge. (i.e. most paths through
+/// the loop should not involve a load, but the path which clobbers it will
+/// reload)
+static bool isLICMLike(BasicBlock* Header,
+                       SmallVector<BasicBlock *, 4>& PredBlocks,
+                       unsigned NumUnavailable,
+                       DominatorTree &DT) {
+  // We can only judge profitability if this is a loop
+  if (!isLoopHeader(Header, PredBlocks, DT))
+    return false;
+
+  // Do we end up with at least one (possibly indirect) predecessor
+  // (i.e. hopefully a latch) where the load is fully available? The
+  // unavailable block could also be the loop predecessor in which case this
+  // devolves to classic LICM.
+  return (PredBlocks.size() > NumUnavailable);
+}

+/// Return true if this is a merge block.
+/// A 'merge block' is one that ends with an unconditional jump to a single
+/// basic block and contains no useful computation of its own. Such blocks
+/// arise frequently from loop simplify. JumpThreading removes such blocks,
+/// but may not have run immediately before GVN.
+static bool isMergeOnlyBlock(BasicBlock *BB) {
+  if (&BB->getParent()->getEntryBlock() == BB)
+    return false;
+  if (BB->getTerminator()->getNumSuccessors() != 1) {
+    return false;
+  }
+  for (auto I = BB->begin(), E = BB->end(); I != E; ++I) {
+    switch (I->getOpcode()) {
+    default:
+      // unsupported instruction
+      return false;
+    case Instruction::PHI:
+    case Instruction::Br:
+      continue;
+    };
+  }
+  return true;
+}

+/// Given a 'merge block' return true if we think it is profitable to look
+/// through the block and treat its predecessors as possible insertion
+/// points. This is conservative in that it defaults to not splitting unless
+/// it clearly does no harm.
+static bool isProfitableToSplit(BasicBlock *Merge,
+                                DenseMap<BasicBlock*, char>& FullyAvailableBlocks) {
+  // If by looking at the merge block's predecessors we find only a single
+  // unavailable predecessor, we can do no harm by looking through it since
+  // this block would be in our list of unavailable blocks anyways
+  BasicBlock *UnavailablePred = nullptr;
+  for (pred_iterator PI = pred_begin(Merge), E = pred_end(Merge);
+       PI != E; ++PI) {
+    BasicBlock *Pred = *PI;
+    if (IsValueFullyAvailableInBlock(Pred, FullyAvailableBlocks, 0)) {
(inline comment) hfinkel: Don't need { } here.
+      continue;
+    }
+    if (UnavailablePred && UnavailablePred != Pred)
+      return false;
+    UnavailablePred = Pred;
+
+    if (Pred->getTerminator()->getNumSuccessors() != 1) {
+      // We'd need to split this critical edge and don't currently keep
+      // around the info to do so if we look through merge blocks.
+      return false;
+    }
+  }
+  return nullptr != UnavailablePred;
+}



 bool GVN::PerformLoadPRE(LoadInst *LI, AvailValInBlkVect &ValuesPerBlock,
                          UnavailBlkVect &UnavailableBlocks) {
   // Okay, we have some definitions of the value. This means that the value
   // is available in some of our (transitive) predecessors. Lets think about
   // doing PRE of this load. This will involve inserting a new load into the
   // predecessor when it's not available. We could do this in general, but
   // prefer to not increase code size. As such, we only do this when we know
   // that we only have to insert one load (which means we're basically moving
[... 31 lines not shown ...]
   // available.
   MapVector<BasicBlock *, Value *> PredLoads;
   DenseMap<BasicBlock*, char> FullyAvailableBlocks;
   for (unsigned i = 0, e = ValuesPerBlock.size(); i != e; ++i)
     FullyAvailableBlocks[ValuesPerBlock[i].BB] = true;
   for (unsigned i = 0, e = UnavailableBlocks.size(); i != e; ++i)
     FullyAvailableBlocks[UnavailableBlocks[i]] = false;

-  SmallVector<BasicBlock *, 4> CriticalEdgePred;
+  // When we get done, we want PredBlocks to contain all of the indirect
+  // predecessors we considered for placement including those which are fully
+  // available. We only exclude a block from this list if we've looked through
+  // it to consider its predecessors. This only happens for partially
+  // available 'merge' blocks.
+  SmallVector<BasicBlock *, 4> PredBlocks;
+  SmallSet<BasicBlock*, 4> Seen;
+  // Step 1 - Identify any unique direct predecessors of LoadBB
   for (pred_iterator PI = pred_begin(LoadBB), E = pred_end(LoadBB);
        PI != E; ++PI) {
     BasicBlock *Pred = *PI;
+    if (!Seen.count(Pred)) {
+      PredBlocks.push_back(Pred);
+      Seen.insert(Pred);
+    }
+  }
+
+  // Step 2 - Identify actual insertion sites (may involve looking through some
+  // merge blocks already in the list)
+  SmallVector<BasicBlock *, 4> CriticalEdgePred;
+  for(auto PI = PredBlocks.begin(), E = PredBlocks.end(); PI != E; ++PI) {
+    BasicBlock *Pred = *PI;
     if (IsValueFullyAvailableInBlock(Pred, FullyAvailableBlocks, 0)) {
(inline comment) hfinkel: Don't need { } here either (might as well fix that while you're here).
       continue;
     }

+    // If this is a merge block - one with a single successor and no computation
+    // of its own - which is partially available, we may want to look at its
+    // predecessors as possible load insertion points. Note that a merge
+    // block cannot affect whether a load is 'anticipated' along some path.
+    // This might be the single block we want to insert a load in. Look
+    // through this block only if it can do no harm. This is enough to get most
+    // cases produced by loop simplify which will trigger the LICM-like case
+    // below.
+    if (isMergeOnlyBlock(Pred) &&
+        isProfitableToSplit(Pred, FullyAvailableBlocks)) {
+      // First, add each predecessor block which hasn't already been
+      // considered as a placement candidate. We happen to know that only
+      // one of those is unavailable, but we need to add each so we can judge
+      // profitability when we get done.
+      for (pred_iterator PI = pred_begin(Pred), E = pred_end(Pred);
+           PI != E; ++PI) {
+        BasicBlock *Pred = *PI;
+        if (!Seen.count(Pred)) {
+          Seen.insert(Pred);
+          PredBlocks.push_back(Pred);
+        }
+      }
+      // Delete the original merge block from the list and adjust iterators
+      auto Last = PI; --Last;
+      PredBlocks.erase(PI);
+      PI = Last;
+      E = PredBlocks.end();
+      continue;
+    }

     if (Pred->getTerminator()->getNumSuccessors() != 1) {
       if (isa<IndirectBrInst>(Pred->getTerminator())) {
         DEBUG(dbgs() << "COULD NOT PRE LOAD BECAUSE OF INDBR CRITICAL EDGE '"
               << Pred->getName() << "': " << *LI << '\n');
         return false;
       }

       if (LoadBB->isLandingPad()) {
         DEBUG(dbgs()
               << "COULD NOT PRE LOAD BECAUSE OF LANDING PAD CRITICAL EDGE '"
               << Pred->getName() << "': " << *LI << '\n');
         return false;
       }

(inline comment) hfinkel: I'd not remove this blank line.
       CriticalEdgePred.push_back(Pred);
     } else {
       // Only add the predecessors that will not be split for now.
       PredLoads[Pred] = nullptr;
     }
   }

+  // FIXME: If we could restructure the CFG, we could make a common pred with
+  // all the preds that don't have an available LI and insert a new load into
+  // that one block. With care, this works for 'indirect' predecessors due to
+  // merge blocks as well.
+
   // Decide whether PRE is profitable for this load.
   unsigned NumUnavailablePreds = PredLoads.size() + CriticalEdgePred.size();
   assert(NumUnavailablePreds != 0 &&
          "Fully available value should already be eliminated!");

-  // If this load is unavailable in multiple predecessors, reject it.
-  // FIXME: If we could restructure the CFG, we could make a common pred with
-  // all the preds that don't have an available LI and insert a new load into
-  // that one block.
-  if (NumUnavailablePreds != 1)
+  // We judge a load placement as profitable if:
+  // - We're moving a single load into some (potentially indirect) predecessor
+  // - We can pull a load into a loop header and reload it only along some
+  //   paths through the loop. This is analogous to LICM, but additionally
+  //   allows a reload along one path in loops with interior control flow or
+  //   multiple latches.
+  if (NumUnavailablePreds != 1 &&
+      !isLICMLike(LoadBB, PredBlocks, NumUnavailablePreds, *DT))
     return false;

   // Split critical edges, and update the unavailable predecessors accordingly.
   for (BasicBlock *OrigPred : CriticalEdgePred) {
     BasicBlock *NewPred = splitCriticalEdges(OrigPred, LoadBB);
     assert(!PredLoads.count(OrigPred) && "Split edges shouldn't be in map!");
     PredLoads[NewPred] = nullptr;
     DEBUG(dbgs() << "Split critical edge " << OrigPred->getName() << "->"
[... 55 lines not shown ...]
     // ordering issues. If a block hasn't been processed yet, we would be
     // marking a value as AVAIL-IN, which isn't what we intend.
     VN.lookup_or_add(NewInsts[i]);
   }

   for (const auto &PredLoad : PredLoads) {
     BasicBlock *UnavailablePred = PredLoad.first;
     Value *LoadPtr = PredLoad.second;
+    LoadPtr->dump();
(inline comment) hfinkel: Remove this.

     Instruction *NewLoad = new LoadInst(LoadPtr, LI->getName()+".pre", false,
                                         LI->getAlignment(),
                                         UnavailablePred->getTerminator());

     // Transfer the old load's AA tags to the new load.
     AAMDNodes Tags;
     LI->getAAMetadata(Tags);
[... 292 lines not shown ...]

test/Transforms/GVN/pre-loopsimplify.ll

This file was added.

				; RUN: opt -basicaa -gvn -S < %s | FileCheck %s

				target datalayout = "e"

				declare void @hold(i32) readonly
				declare void @clobber()

				; This is a classic LICM case
				define i32 @test1(i1 %cnd, i32* %p) {
				; CHECK-LABEL: @test1
				entry:
				; CHECK-LABEL: entry
				; CHECK-NEXT: %v1.pre = load i32* %p
				br label %header

				header:
				; CHECK-LABEL: header
				; CHECK-NOT: load i32* %p
				%v1 = load i32* %p
				call void @hold(i32 %v1)
				br label %header
				}

				; This is classic LICM, but we need to look through a merge block
				; to see it.
				define i32 @test2(i1 %cnd, i32* %p) {
				; CHECK-LABEL: @test2
				entry:
				; CHECK-LABEL: entry
				; CHECK-NEXT: %v1.pre = load i32* %p
				br label %header

				header:
				; CHECK-LABEL: header
				; CHECK-NOT: load i32* %p
				%v1 = load i32* %p
				call void @hold(i32 %v1)
				br i1 %cnd, label %bb1, label %bb2

				bb1:
				br label %merge

				bb2:
				br label %merge

				merge:
				br label %header
				}

				; In this case we have a load which is 'almost' loop invariant
				; it's modified along one path, but not another. We want to
				; pull the load out of the loop and reload in the clobbering
				; block.
				define i32 @test3(i1 %cnd, i32* %p) {
				; CHECK-LABEL: @test3
				entry:
				; CHECK-LABEL: entry
				; CHECK-NEXT: %v1.pre1 = load i32* %p
				br label %header

				header:
				; CHECK-LABEL: header
				; CHECK-NOT: load i32* %p
				%v1 = load i32* %p
				call void @hold(i32 %v1)
				br i1 %cnd, label %bb1, label %bb2

				bb1:
				br label %header

				bb2:
				; CHECK-LABEL: bb2
				; CHECK: call void @clobber()
				; CHECK-NEXT: %v1.pre = load i32* %p
				; CHECK-NEXT: br label %header

				call void @clobber()
				br label %header
				}

				; Same as above, but we need to look through a merge block
				; to see it. This is the form that loop-simplify produces
				; and is thus the most important test case in this file.
				define i32 @test4(i1 %cnd, i32* %p) {
				; CHECK-LABEL: @test4
				entry:
				; CHECK-LABEL: entry
				; CHECK-NEXT: %v1.pre = load i32* %p
				br label %header

				header:
				; CHECK-LABEL: header
				; CHECK-NOT: load i32* %p
				%v1 = load i32* %p
				call void @hold(i32 %v1)
				br i1 %cnd, label %bb1, label %bb2

				bb1:
				br label %merge

				bb2:
				; CHECK-LABEL: bb2
				; CHECK: call void @clobber()
				; CHECK-NEXT: %v1.pre1 = load i32* %p
				; CHECK-NEXT: br label %merge

				call void @clobber()
				br label %merge

				merge:
				br label %header
				}
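
The four tests can be read against the profitability rule the patch adds to PerformLoadPRE. The function below is a simplified restatement, not the patch itself: `shouldPerformLoadPRE` and its parameters are hypothetical names, and the loop-header check is reduced to a boolean input. The predecessor counts quoted afterwards are a reading of the test CFGs, not output from running the patch.

```cpp
#include <cassert>

// Simplified restatement of the profitability rule (hypothetical helper).
// numCandidates:       direct and looked-through predecessors of the load's
//                      block, including the fully available ones.
// numUnavailable:      candidates that would each need a new load.
// isNaturalLoopHeader: whether the load's block passed the loop-header check.
bool shouldPerformLoadPRE(unsigned numCandidates, unsigned numUnavailable,
                          bool isNaturalLoopHeader) {
  if (numUnavailable == 0)
    return false; // fully redundant loads are handled before PRE runs
  if (numUnavailable == 1)
    return true; // classic PRE: a single load is inserted
  // LICM-like case: several loads are inserted, but at least one path
  // around the loop keeps the value available.
  return isNaturalLoopHeader && numCandidates > numUnavailable;
}
```

In test1 the header has two predecessors (entry and itself) and only entry is unavailable, so the classic single-load rule applies. In test3 and test4 the candidates are entry, bb1 and bb2, with entry and bb2 unavailable, so the transform relies on the LICM-like rule.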