This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Limit Complex Addressing mode by number of BasicBlocks to traverse
Abandoned (Public)

Authored by skatkov on Nov 14 2018, 1:53 AM.

Details

Reviewers

john.brawn
reames
bjope
chandlerc
haicheng

Summary

Finding the combined complex addressing mode becomes expensive in some
cases. This CL limits the work by capping the number of basic blocks to traverse.

This is a fix for PR39625.

Diff Detail

Event Timeline

skatkov created this revision. Nov 14 2018, 1:53 AM

wmi added a subscriber: wmi. Nov 15 2018, 7:47 PM

ping.

Basically I think the code looks ok. However, I'm not so familiar with this algorithm so it is hard to comment about the actual solution.

My understanding is that you introduce a threshold, and if the size of the TraverseOrder vector grows past the threshold we bail out from findCommon. So what is the impact of this?
I assume it means that we limit the amount of "Complex Addressing mode" optimizations somehow. Is this limit only hit for "large" programs? When compiling a program that hits the threshold, do we lose some optimizations or will the amount of Complex Addressing mode optimizations in such a program be reduced significantly?

How did you choose the current threshold? (That is probably something people will want to know when trying to fine-tune this 4 years from now, so it would be nice to say something about it in the commit msg, even if it is just an estimate.)

In reply to @bjope (D54517#1304803):

Hi Bjorn, thank you for looking into this patch.

The optimization itself does the following:
Say we have a load from pointer p. We traverse all paths from p back to the original pointers, skipping phi nodes and selects. Suppose we find that p can actually be p.1, p.2, ..., p.N, where each p.i = gv.i + base.i + index.i * scale.i + offset.i.
If all parts are the same for all i, we apply the optimization (move the pointer computation close to the load to create a complex addressing load).
If for all i the parts differ in only one field (gv, base or index), we go to the complex case, which is what this patch targets.

Say the difference is in the base. Generally we want to build a number of phi nodes so that the right base is available at the load.
The size of TraverseOrder is approximately the number of basic blocks between the load's BB and the BBs of all p.i, traversing by predecessors from the load's BB.
You get the most benefit if all phi nodes for the base already exist, so you do not create new phi nodes but probably remove redundant ones (those used to get p).
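
As a rough illustration of the "differ in only one field" check described above, here is a minimal C++ sketch. The `AddrMode` struct, the `Diff` enum, and the `compare`/`classify` functions are hypothetical stand-ins that represent values by name only; they are assumptions made for illustration and are not the CodeGenPrepare implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one candidate address:
//   p.i = gv.i + base.i + index.i * scale.i + offset.i
// Values are identified by name only (an assumption of this sketch).
struct AddrMode {
  std::string gv;
  std::string base;
  std::string index;
  std::int64_t scale;
  std::int64_t offset;
};

// Which single field differs, if any.
enum class Diff { None, GV, Base, Index, Unsupported };

// Compare two candidates field by field.
Diff compare(const AddrMode &a, const AddrMode &b) {
  // The thread only lists gv, base and index as fields that may differ,
  // so a scale or offset mismatch is treated as unsupported here.
  if (a.scale != b.scale || a.offset != b.offset)
    return Diff::Unsupported;
  int count = 0;
  Diff which = Diff::None;
  if (a.gv != b.gv) { ++count; which = Diff::GV; }
  if (a.base != b.base) { ++count; which = Diff::Base; }
  if (a.index != b.index) { ++count; which = Diff::Index; }
  return count > 1 ? Diff::Unsupported : which;
}

// Classify the whole set p.1 .. p.N against the first candidate.
// None: the simple case. GV/Base/Index: the "complex" case from the thread.
Diff classify(const std::vector<AddrMode> &modes) {
  Diff result = Diff::None;
  for (std::size_t i = 1; i < modes.size(); ++i) {
    Diff d = compare(modes[0], modes[i]);
    if (d == Diff::None)
      continue;
    if (d == Diff::Unsupported || (result != Diff::None && result != d))
      return Diff::Unsupported;
    result = d;
  }
  return result;
}
```

In this sketch a result of `Diff::Base` corresponds to the case discussed above, where phi nodes would be needed to merge the differing bases at the load.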

The number 100 has no specific meaning. It is simply a threshold that is big enough to accept a fairly large CFG.
PR39625 contains a test that seems to be a corner case for this optimization. The distance between the load and the original p.i reaches about 16000 there. I did several runs with different values of this threshold and chose 100 because it gives a reasonable compile time even in a debug build.

I hope it helps.
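
To make the bail-out concrete, the following is a minimal sketch of a predecessor walk that stops once the recorded traversal order exceeds a threshold. The `PredList` type, the integer block ids, and `collectTraverseOrder` are hypothetical names invented for this example; the real patch operates on LLVM values and basic blocks, not on this toy CFG.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical CFG: Preds[b] lists the predecessor block ids of block b.
using PredList = std::vector<std::vector<int>>;

// Walk predecessors starting at Start and record the visit order.
// Blocks in Anchors end the walk along that path (they play the role of
// the known ending values). Returns false as soon as more than Threshold
// blocks have been recorded, mirroring the bail-out discussed above.
bool collectTraverseOrder(const PredList &Preds, int Start,
                          const std::unordered_set<int> &Anchors,
                          std::size_t Threshold,
                          std::vector<int> &TraverseOrder) {
  std::unordered_set<int> Visited;
  std::vector<int> Worklist{Start};
  while (!Worklist.empty()) {
    int Current = Worklist.back();
    Worklist.pop_back();
    // Skip ending blocks and blocks that were already visited.
    if (Anchors.count(Current) || !Visited.insert(Current).second)
      continue;
    TraverseOrder.push_back(Current);
    if (TraverseOrder.size() > Threshold)
      return false;
    for (int P : Preds[Current])
      Worklist.push_back(P);
  }
  return true;
}
```

With a small threshold the walk gives up early on a long chain of blocks, which is the compile-time trade-off being discussed: the optimization is skipped for that load rather than paying for a very long traversal.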

I don't have any objections to the patch myself, but I don't really have the knowledge to understand whether this could be bad for some important use cases either. A user can of course override the threshold if needed, so I guess that makes this patch a little less "dangerous".

It looks like @chandlerc and @haicheng have contributed to this pass in the past. Could one of you perhaps help out and take a quick look at this to see if introducing this threshold is a reasonable fix?

Thank you, Bjorn.

I guess @john.brawn would be the best person to review this patch because he reviewed the original implementation.

FYI: in parallel, there is an idea for how to simplify the algorithm significantly. I'll try to implement it...

Looks OK as a fix for the reported bug, with one minor nitpick. Taking a look at PR39625 it looks like there's an underlying problem where we insert a load of useless placeholders, e.g. for

declare void @otherfn()

define i32 @fn(i32* %arg1, i32* %arg2) {
entry:
  %gep1 = getelementptr i32, i32* %arg1, i32 4
  %gep2 = getelementptr i32, i32* %arg1, i32 8
  br i1 undef, label %a1, label %b1

a1:
  call void @otherfn()
  br label %middle

b1:
  call void @otherfn()
  br label %middle

middle:
  br i1 undef, label %a2, label %b2

a2:
  call void @otherfn()
  br label %end

b2:
  call void @otherfn()
  br label %end

end:
  %phi = phi i32* [ %gep1, %a2 ], [ %gep2, %b2 ]
  %val = load i32, i32* %phi, align 4
  ret i32 %val
}

we insert a total of 9 placeholders, but actually we need only one and the rest get deleted later. So I think there's something that could be done to improve that, but that's no reason not to do this fix.

lib/CodeGen/CodeGenPrepare.cpp, line 3095: This should have a << "\n" on the end.

This revision is now accepted and ready to land. Nov 26 2018, 7:40 AM

In reply to @john.brawn (D54517#1308063):

This is exactly the improvement I mentioned in the comment above...

Hi @john.brawn, could you please take a look at https://reviews.llvm.org/D54932? It improves compile time, and I guess we will not need this patch for a while.

Abandoned in favor of D54932.

Revision Contents

Path: lib/CodeGen/CodeGenPrepare.cpp
Size: 19 lines

Diff 174001

lib/CodeGen/CodeGenPrepare.cpp

[186 lines hidden] EnableTypePromotionMerge("cgp-type-promotion-merge", cl::Hidden,
     cl::desc("Enable merging of redundant sexts when one is dominating"
              " the other."), cl::init(true));

 static cl::opt<bool> DisableComplexAddrModes(
     "disable-complex-addr-modes", cl::Hidden, cl::init(false),
     cl::desc("Disables combining addressing modes with different parts "
              "in optimizeMemoryInst."));

+static cl::opt<int> ComplexAddrModesThreshold(
+    "complex-addr-modes-threshold", cl::Hidden, cl::init(100),
+    cl::desc("Stop processing complex addressing mode optimization"
+             " if number of basic blocks to process exceeds threshold"));
+
 static cl::opt<bool>
 AddrSinkNewPhis("addr-sink-new-phis", cl::Hidden, cl::init(false),
                 cl::desc("Allow creation of Phis in Address sinking."));

 static cl::opt<bool>
 AddrSinkNewSelects("addr-sink-new-select", cl::Hidden, cl::init(true),
                    cl::desc("Allow creation of selects in Address sinking."));

[2,876 lines hidden] Value *findCommon(FoldAddrToValueMapping &Map) {
     // Simplification of Phi nodes is recursive, so some Phi node may
     // be simplified after we added it to AddrToBase.
     // Using this mapping we can find the current value in AddrToBase.
     SimplificationTracker ST(SQ);

     // First step, DFS to create PHI nodes for all intermediate blocks.
     // Also fill traverse order for the second step.
     SmallVector<ValueInBB, 32> TraverseOrder;
-    InsertPlaceholders(Map, TraverseOrder, ST);
+    if (!InsertPlaceholders(Map, TraverseOrder, ST)) {
+      LLVM_DEBUG(
+          dbgs() << "Stop complex addressing mode due to exceed of threshold: "
+                 << TraverseOrder.size());
(inline comment, john.brawn: This should have a << "\n" on the end.)
+      ST.destroyNewNodes(CommonType);
+      return nullptr;
+    }

     // Second Step, fill new nodes by merged values and simplify if possible.
     FillPlaceholders(Map, TraverseOrder, ST);

     if (!AddrSinkNewSelects && ST.countNewSelectNodes() > 0) {
       ST.destroyNewNodes(CommonType);
       return nullptr;
     }
[157 lines hidden] void FillPlaceholders(FoldAddrToValueMapping &Map,
     }
   }

   /// Starting from value recursively iterates over predecessors up to known
   /// ending values represented in a map. For each traversed block inserts
   /// a placeholder Phi or Select.
   /// Reports all new created Phi/Select nodes by adding them to set.
   /// Also reports and order in what basic blocks have been traversed.
-  void InsertPlaceholders(FoldAddrToValueMapping &Map,
+  /// Return false if size of TraverseOrder exceeds threshold.
+  bool InsertPlaceholders(FoldAddrToValueMapping &Map,
                           SmallVectorImpl<ValueInBB> &TraverseOrder,
                           SimplificationTracker &ST) {
     SmallVector<ValueInBB, 32> Worklist;
     assert((isa<PHINode>(Original.first) || isa<SelectInst>(Original.first)) &&
            "Address must be a Phi or Select node");
     auto *Dummy = UndefValue::get(CommonType);
     Worklist.push_back(Original);
     while (!Worklist.empty()) {
       auto Current = Worklist.pop_back_val();
       // If value is not an instruction it is something global, constant,
       // parameter and we can say that this value is observable in any block.
       // Set block to null to denote it.
       // Also please take into account that it is how we build anchors.
       if (!isa<Instruction>(Current.first))
         Current.second = nullptr;
       // if it is already visited or it is an ending value then skip it.
       if (Map.find(Current) != Map.end())
         continue;
       TraverseOrder.push_back(Current);
+      if ((int)TraverseOrder.size() > ComplexAddrModesThreshold)
+        return false;

       Value *CurrentValue = Current.first;
       BasicBlock *CurrentBlock = Current.second;
       // CurrentValue must be a Phi node or select. All others must be covered
       // by anchors.
       Instruction *CurrentI = cast<Instruction>(CurrentValue);
       bool IsDefinedInThisBB = CurrentI->getParent() == CurrentBlock;

[33 lines hidden] while (!Worklist.empty()) {
         Map[Current] = PHI;
         ST.insertNewPhi(PHI);

         // Add all predecessors in work list.
         for (auto B : predecessors(CurrentBlock))
           Worklist.push_back({ CurrentPhi->getIncomingValueForBlock(B), B });
       }
     }
+    return true;
   }

   bool addrModeCombiningAllowed() {
     if (DisableComplexAddrModes)
       return false;
     switch (DifferentField) {
     default:
       return false;
[3,774 lines hidden]