This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: In determining load clobbering in AnnotateUniform, don't scan if there are too many blocks.
Abandoned · Public

Authored by cfang on Jul 29 2020, 10:18 AM.

Details

Reviewers

arsenm
rampitec

Summary

The algorithm that finds load clobbering in a function is quadratic (O(n^2)).
Compilation becomes very slow when there are too many blocks (~3000).
To limit compile time, we introduce a threshold (default 2500) on the
number of basic blocks.
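
The cut-off described in the summary can be illustrated with a minimal, LLVM-free sketch. `Block`, `isClobberedWithLimit`, and the `ScanBlock` callback are hypothetical names introduced here for illustration only; `ScanBlock` stands in for the per-block dependency query. The one behaviour taken from the patch is that an over-limit checklist yields the conservative answer "clobbered" without scanning anything.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a basic block: just a list of instruction ids.
struct Block {
  std::vector<int> Insts;
};

// Returns true if the load must be treated as clobbered.
// ScanBlock is a hypothetical callback that returns true when it finds a
// clobbering instruction in the given block.
bool isClobberedWithLimit(const std::vector<Block> &Checklist,
                          std::size_t BlockLimit,
                          const std::function<bool(const Block &)> &ScanBlock) {
  // Too many blocks: skip the scan and give the conservative answer.
  if (Checklist.size() > BlockLimit)
    return true;

  for (const Block &BB : Checklist)
    if (ScanBlock(BB))
      return true;
  return false;
}
```

The trade-off is that loads in very large functions are never proven unclobbered, so compile time is bounded at the cost of a missed optimization.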

Event Timeline

cfang created this revision. Jul 29 2020, 10:18 AM

Herald added a project: Restricted Project. Jul 29 2020, 10:18 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others.

cfang requested review of this revision. Jul 29 2020, 10:18 AM

Herald added a subscriber: wdng. Jul 29 2020, 10:18 AM

Does MemorySSA have the same problem? Could we just switch this to use MemorySSA?

cfang mentioned this in D81433: AMDGPU: Restrict the number of instructions to scan for getPointerDependencyFrom. Jul 29 2020, 10:19 AM

arsenm added inline comments. Jul 29 2020, 12:01 PM

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
145–146	The logic here should be fixed first. This is checking if the load was clobbered before the trivial check for isGlobalLoad. The expensive check should be reordered last.
145–146	Actually, it can go even deeper, under the isa<Argument> || GlobalValue check.
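
The reordering suggested in these inline comments relies on short-circuit evaluation of `&&`. The following is a minimal sketch with hypothetical types (`LoadInfo`, `ClobberQuery`, `canMarkNoClobber`); it models only the `!PtrI` branch of `visitLoadInst` and is not the code from D84890.

```cpp
// Hypothetical summary of the cheap facts known about a load.
struct LoadInfo {
  bool PtrIsInstruction; // models PtrI != nullptr
  bool IsGlobal;         // models isGlobalLoad(I)
};

// Hypothetical stand-in for the expensive clobber scan; counts its calls.
struct ClobberQuery {
  int Calls = 0;
  bool notClobbered() {
    ++Calls;
    return true; // assume "not clobbered" so only the ordering matters
  }
};

// Cheap predicates come first; && short-circuits, so the expensive query
// runs only when every cheap check has already passed.
bool canMarkNoClobber(const LoadInfo &L, bool IsEntryFunc, ClobberQuery &Q) {
  return !L.PtrIsInstruction && L.IsGlobal && IsEntryFunc && Q.notClobbered();
}
```

For a non-global load, `Q.Calls` stays at zero because evaluation stops at `L.IsGlobal`.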

In D84873#2182544, @arsenm wrote:

Does MemorySSA have the same problem? Could we just switch this to use MemorySSA?

Disclaimer: I know nothing about this pass or the purpose of this patch, just trying to answer this question.
MemorySSA has its own internal threshold limiting the number of memory instructions that are traversed upwards. It does not care at how many blocks those memory instructions are spread over.
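
To illustrate the distinction drawn here, the sketch below models an upward walk that is bounded by a budget of memory accesses rather than by a block count. It is a simplified model and not MemorySSA's actual walker; `MemAccess`, `WalkResult`, and `walkUpwards` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical memory access record: which block it lives in and whether
// it clobbers the queried location.
struct MemAccess {
  int BlockId;
  bool Clobbers;
};

enum class WalkResult { NoClobber, Clobber, GaveUp };

// Accesses are ordered from nearest to farthest above the load.
// The walk stops after visiting Budget accesses, no matter how many
// distinct blocks those accesses are spread over.
WalkResult walkUpwards(const std::vector<MemAccess> &Accesses,
                       std::size_t Budget) {
  std::size_t Visited = 0;
  for (const MemAccess &A : Accesses) {
    if (Visited == Budget)
      return WalkResult::GaveUp; // conservative: caller treats as clobbered
    ++Visited;
    if (A.Clobbers)
      return WalkResult::Clobber;
  }
  return WalkResult::NoClobber;
}
```

Note that `BlockId` is never consulted by the walk: three accesses in three different blocks cost the same as three accesses in one block.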

cfang mentioned this in D84890: AMDGPU: Put inexpensive ops first in AMDGPUAnnotateUniformValues::visitLoadInst. Jul 29 2020, 2:37 PM

cfang marked 2 inline comments as done. Jul 29 2020, 2:41 PM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
145–146	Done in https://reviews.llvm.org/D84890. Actually, we have to make the expensive function call for the case PtrI != NULL anyway, so that change won't resolve the issue we encountered (this patch is still needed).

In D84873#2182944, @asbirlea wrote:

Disclaimer: I know nothing about this pass or the purpose of this patch, just trying to answer this question.
MemorySSA has its own internal threshold limiting the number of memory instructions that are traversed upwards. It does not care at how many blocks those memory instructions are spread over.

Thanks for the comments! I see that MemorySSA scans up to 100 memory instructions upwards to find whether a load is clobbered.
In our case, we essentially check every basic block, also with a maximum of 100 instructions per block, to find the pointer dependence:
MDR->getPointerDependencyFrom(MemoryLocation(Ptr), true, StartIt, BB, Load);
At this moment, it is not clear to me how we can use the existing functionality in MemorySSA for our purpose.
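
As a rough worked estimate of the cost described above, assume every queried load visits every block in its checklist and each per-block query hits the 100-instruction limit. The function name and the load count used in the usage note are hypothetical; only the ~3000 blocks and the limit of 100 come from this review.

```cpp
#include <cstdint>

// Worst-case number of instructions examined under the stated assumptions:
// every load scans every block, and every block scan reaches the limit.
constexpr std::uint64_t worstCaseInstructionsScanned(std::uint64_t NumLoads,
                                                     std::uint64_t NumBlocks,
                                                     std::uint64_t PerBlockLimit) {
  return NumLoads * NumBlocks * PerBlockLimit;
}

// One load, 3000 blocks, 100 instructions per block.
static_assert(worstCaseInstructionsScanned(1, 3000, 100) == 300000,
              "a single load can examine up to 300,000 instructions");
```

With a hypothetical 1000 uniform loads in such a function, the same bound is 300,000,000 instruction visits, which is why the number of blocks dominates even though each per-block scan is capped.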

Rebase after https://reviews.llvm.org/D84890

How much improvement does D84890 give vs. this?

In D84873#2186018, @arsenm wrote:

How much improvement does D84890 give vs. this?

For the test case I have, D84890 did not show a measurable improvement,
while this can reduce the time from 2m30s to 13 seconds.

arsenm added inline comments. Jul 30 2020, 3:41 PM

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
118–126	This seems like a pretty stupid way of using this analysis. This is going to be re-scanning the same instructions many times. My quick look at MemoryDependenceAnalysis suggests the way you should use it is to use a combination of getDependency and getNonLocalPointerDependency, which has a cache and internally calls getPointerDependencyFrom. You would then have to walk up the chain of dependencies until you find no clobbers?
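
The re-scanning concern can be illustrated with a small memoization sketch. This is not the MemoryDependenceAnalysis cache; `CachedBlockScanner` and `anyClobber` are hypothetical, and the sketch assumes all queries are for the same memory location and always scan whole blocks, so a per-block result can be reused safely.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical cache of per-block scan results for one memory location.
class CachedBlockScanner {
  std::unordered_map<std::size_t, bool> Cache; // block id -> has a clobber
  int Scans = 0;                               // number of real scans done

public:
  // Blocks[Id] is true if block Id contains a clobbering instruction.
  bool blockClobbers(const std::vector<bool> &Blocks, std::size_t Id) {
    auto It = Cache.find(Id);
    if (It != Cache.end())
      return It->second; // reuse the earlier result, no re-scan
    ++Scans;
    bool Result = Blocks[Id];
    Cache.emplace(Id, Result);
    return Result;
  }

  int scans() const { return Scans; }
};

// Checks every block on a load's checklist, sharing work across loads.
bool anyClobber(CachedBlockScanner &S, const std::vector<bool> &Blocks,
                const std::vector<std::size_t> &Checklist) {
  for (std::size_t Id : Checklist)
    if (S.blockClobbers(Blocks, Id))
      return true;
  return false;
}
```

Two loads whose checklists overlap then pay for each shared block once. A real cache would also have to key on the queried location and handle scans that start in the middle of the load's own block.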

cfang added inline comments. Aug 4 2020, 4:02 PM

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
118–126	You are right. We need to come up with a better memory dependence analysis algorithm to avoid redundant searching. Until then, we have to live with the current approach, which is correct. As a result, we have to restrict the search to control compile time.

Ping!

Should we commit this patch to fix the compile time for now? Then we can look at the possibility of replacing
MemoryDependenceAnalysis in the AnnotateUniform pass.

In D84873#2357528, @cfang wrote:

Ping!

Should we commit this patch to fix the compile time for now? Then we can look at the possibility of replacing
MemoryDependenceAnalysis in the AnnotateUniform pass.

This doesn't sound like a commitment to me

In D84873#2358999, @arsenm wrote:

This doesn't sound like a commitment to me

Do you mean we need to open a bug (new task) to redesign load clobbering in the AnnotateUniform pass?
Given the current implementation, I think this proposal is an effective cut-off for an expensive search that has no caching.

In D84873#2427066, @cfang wrote:

Do you mean we need to open a bug (new task) to redesign load clobbering in the AnnotateUniform pass?
Given the current implementation, I think this proposal is an effective cut-off for an expensive search that has no caching.

Yes, I don't like that this is just putting off a real fix

The issue has been worked around by https://reviews.llvm.org/D94107,
so I am abandoning this revision.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp	10 lines

Diff 281649

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp

... (15 lines not shown) ...
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/Analysis/LegacyDivergenceAnalysis.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/MemoryDependenceAnalysis.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/InstVisitor.h"
 #include "llvm/InitializePasses.h"
+#include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/raw_ostream.h"

 #define DEBUG_TYPE "amdgpu-annotate-uniform"

 using namespace llvm;

+static cl::opt<size_t> BasicBlockScanLimit("amdgpu-annotate-uniform-bb-limit",
+  cl::Hidden, cl::init(2500),
+  cl::desc("Max num BBs to scan in uniform annotation"));

 namespace {

 class AMDGPUAnnotateUniformValues : public FunctionPass,
                        public InstVisitor<AMDGPUAnnotateUniformValues> {
   LegacyDivergenceAnalysis *DA;
   MemoryDependenceResults *MDR;
   LoopInfo *LI;
   DenseMap<Value*, GetElementPtrInst*> noClobberClones;
... (44 lines not shown) ...
       L = P;
       P = P->getParentLoop();
     } while (P);
     Checklist.insert(L->block_begin(), L->block_end());
     Start = L->getHeader();
   }

   DFS(Start, Checklist);

+  // To improve compile time, don't scan if there are too many BBs.
+  if (Checklist.size() > BasicBlockScanLimit)
+    return true;

   for (auto &BB : Checklist) {
     BasicBlock::iterator StartIt = (!L && (BB == Load->getParent())) ?
       BasicBlock::iterator(Load) : BB->end();
     auto Q = MDR->getPointerDependencyFrom(MemoryLocation(Ptr), true,
                                            StartIt, BB, Load);
     if (Q.isClobber() || Q.isUnknown())
       return true;
   }
   return false;
Inline comment by arsenm (not done): This seems like a pretty stupid way of using this analysis. This is going to be re-scanning the same instructions many times. My quick look at MemoryDependenceAnalysis suggests the way you should use it is to use a combination of getDependency and getNonLocalPointerDependency, which has a cache and internally calls getPointerDependencyFrom. You would then have to walk up the chain of dependencies until you find no clobbers?
Inline comment by cfang (done): You are right. We need to come up with a better memory dependence analysis algorithm to avoid redundant searching. Until then, we have to live with the current approach, which is correct. As a result, we have to restrict the search to control compile time.
 }

 void AMDGPUAnnotateUniformValues::visitBranchInst(BranchInst &I) {
   if (DA->isUniform(&I))
     setUniformMetadata(I.getParent()->getTerminator());
 }

 void AMDGPUAnnotateUniformValues::visitLoadInst(LoadInst &I) {
   Value *Ptr = I.getPointerOperand();
   if (!DA->isUniform(Ptr))
     return;
   auto isGlobalLoad = [&](LoadInst &Load)->bool {
     return Load.getPointerAddressSpace() == AMDGPUAS::GLOBAL_ADDRESS;
   };
   // We're tracking up to the Function boundaries, and cannot go beyond because
   // of FunctionPass restrictions. We can ensure that memory is not clobbered
   // for memory operations that are live in to entry points only.
   bool NotClobbered = isEntryFunc && !isClobberedInFunction(&I);
   Instruction *PtrI = dyn_cast<Instruction>(Ptr);
   if (!PtrI && NotClobbered && isGlobalLoad(I)) {
Inline comment by arsenm (done): The logic here should be fixed first. This is checking if the load was clobbered before the trivial check for isGlobalLoad. The expensive check should be reordered last.
Inline comment by arsenm (done): Actually, it can go even deeper, under the isa<Argument> || GlobalValue check.
Inline comment by cfang (done): Done in https://reviews.llvm.org/D84890. Actually, we have to make the expensive function call for the case PtrI != NULL anyway, so that change won't resolve the issue we encountered (this patch is still needed).
     if (isa<Argument>(Ptr) || isa<GlobalValue>(Ptr)) {
... (remaining context not available) ...