If the use of an invariant load is located in a colder basic block, it makes sense to sink the load to its use.
- Reviewers
wmi arsenm reames craig.topper sanjoy
Event Timeline
Hi @arsenm, could you please take a look at this patch? If I turn the option on, I see some AMDGPU test failures. I'm not an expert in AMDGPU asm/arch, but it seems that four invariant loads are added at the beginning of the pipeline in the entry block, and I sink one of them to a colder block. As a result, as I understand it, instead of loading two values at once, this "double-load" is split into two separate loads. I'm not sure that is beneficial for this platform, so I wonder whether there is an easy way to prohibit the optimization in such cases for this platform?
Even if there is no easy way to do that, I'd still like to land it with the option off by default.
Thank you in advance,
This is possibly beneficial in some specific circumstances, but probably not in general (especially if it inhibits vectorization, which I thought happened after CGP already?).
lib/CodeGen/CodeGenPrepare.cpp

| Lines | Comment |
|---|---|
| 5567–5568 | Maybe make this cheap check before `isDereferenceablePointer`? |
| 5567–5579 | Maybe make these cheaper checks before `isDereferenceablePointer`? |
The Sink pass is written in such a way that it sinks instructions only to the nearest successors, while for an invariant load I want to be able to move it anywhere in the CFG...
I'll update the patch according to your comments.
Ok, I was wrong: the Sink pass iterates. If I add support for invariant.load it can sink such a load, but it definitely cannot sink it through a diamond or a loop. In parallel, I'm going to take a deeper look at whether I can handle my case in the Sink pass...
I took a look at the Sinking pass and I worry about its impact on the register allocator. So it seems that this solution, restricted to invariant loads, is safer.
Based on https://bugs.llvm.org/show_bug.cgi?id=23603, I think this patch is incorrect.
It seems that if I want to move an invariant load, I must ensure that the pointer is dereferenceable at the point of use.