
Details

Reviewers

hfinkel
jsji
steven.zhang
Jiangning
anton-afanasyev
ab
rtereshin
greened
mzolotukhin
nemanjai
anil9
courbet
SjoerdMeijer
dmgreen

Commits

rG19e5da4edc96: Merging r366570: …
rL366729: Merging r366570:
rGdec624682e06: [MachineCSE][MachinePRE] Avoid hoisting code from code regions into hot BBs.
rL366570: [MachineCSE][MachinePRE] Avoid hoisting code from code regions into hot BBs.

Summary

Current PRE hoists common computations into
CMBB = DT->findNearestCommonDominator(MBB, MBB1).
However, if CMBB is in a hot loop body, we might get performance
degradation.

Diff Detail

Event Timeline

lkail created this revision. Jul 9 2019, 2:25 AM

Herald added a project: Restricted Project. Jul 9 2019, 2:25 AM

Herald added subscribers: llvm-commits, MaskRay, hiraditya, nemanjai.

lkail added reviewers: rtereshin, greened, mzolotukhin. Jul 9 2019, 2:40 AM

Herald added a subscriber: wuzish. Jul 9 2019, 2:40 AM

lkail added a reviewer: nemanjai. Jul 9 2019, 2:42 AM

However, if CMBB is in a loop body, we might get performance degradation.

But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop. This latter case seems more likely to me, but do you have performance results from the test suite, or something else, showing otherwise?

Is the problem hoisting out of a cold inner region into a hot loop? Would profiling data help? Is this really a rematerialization problem?

@hfinkel, thanks for the review.

do you have performance results from the test suite, or something else, showing otherwise?

Yes. We have observed a ~7% degradation in one benchmark due to this.

However, if CMBB is in a loop body, we might get performance degradation.

This conclusion comes from observing the test suite code. I will add a reduced case to the tests soon; it is similar to the current test, except that its branches are switches.

But we might also get a performance improvement, because the performance of the inner loop is more significant than that of the outer loop.

I did miss the point you mentioned. @nemanjai has already suggested that I take a look at MachineBlockFrequency.

Is the problem hoisting out of a cold inner region into a hot loop?

I agree. This patch should take the hotness of loops into account.

lkail added a reviewer: anil9. Jul 9 2019, 6:56 PM

Updated the patch to use MachineBlockFrequency as the metric for checking whether CMBB is appropriate to hoist into.
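As a rough illustration of a frequency-based check, the sketch below models the decision on plain integers. It is not the code from the patch: `isProfitableToHoistInto` and its parameters are hypothetical names, the integers stand in for whatever the block-frequency analysis reports, and the shape of the comparison (the common dominator's frequency against the sum of the two source-block frequencies) is inferred from the comparisons quoted in the later review comments.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Simplified model, not the LLVM implementation: frequencies are plain
// integers standing in for block-frequency analysis results.
// Hoisting is treated as profitable when the common dominator (CMBB) is
// no hotter than the two blocks the computation would be hoisted out of.
bool isProfitableToHoistInto(std::uint64_t FreqCMBB, std::uint64_t FreqMBB,
                             std::uint64_t FreqMBB1) {
  // Guard against wraparound in this integer model.
  assert(FreqMBB <= std::numeric_limits<std::uint64_t>::max() - FreqMBB1 &&
         "frequency sum overflows in this simplified model");
  return FreqCMBB <= FreqMBB + FreqMBB1;
}
```

Under this model, hoisting out of two cold blocks into a hot dominator is rejected, while hoisting into a dominator that executes no more often than the two blocks combined is accepted.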

Herald added a subscriber: javed.absar. Jul 9 2019, 11:47 PM

lkail retitled this revision from "[MachineCSE][MachinePRE] Do not hoist common computations into loop bodies" to "[MachineCSE][MachinePRE] Do not hoist common computations into hot BBs". Jul 9 2019, 11:58 PM

lkail updated this revision to Diff 208891. Jul 10 2019, 1:25 AM

jsji added inline comments. Jul 10 2019, 11:29 AM

llvm/test/CodeGen/AArch64/O3-pipeline.ll
36 ↗	(On Diff #208891)	irrelevant
llvm/test/CodeGen/X86/O3-pipeline.ll
33 ↗	(On Diff #208891)	Please avoid irrelevant changes, commit them in another NFC patch if you would like to change them.
70 ↗	(On Diff #208891)	irrelevant
97 ↗	(On Diff #208891)	irrelevant changes.
179 ↗	(On Diff #208891)	extra line? irrelevant

Addressed @jsji's comments and added a new test.

dmgreen added a subscriber: dmgreen. Jul 14 2019, 5:55 AM

dmgreen added inline comments.

llvm/lib/CodeGen/MachineCSE.cpp
873	Should this also say something like "if OptForMinSize then return true"? Under the assumption that pre will reduce the codesize.

lkail marked an inline comment as done. Jul 15 2019, 6:31 PM

lkail added inline comments.

llvm/lib/CodeGen/MachineCSE.cpp
873	Hi @dmgreen, your concern makes sense, since CSE won't eliminate all common computations, considering what `isProfitableToCSE` does. As a result, code size might increase after PRE. I think we can improve this in follow-up patches.

Ping

dmgreen added inline comments. Jul 17 2019, 10:56 AM

llvm/lib/CodeGen/MachineCSE.cpp
873	Hello. Sorry. I meant more that - to my understanding - CSE is expected to decrease codesize. PRE can help perform more CSE so is expected to decrease codesize more. At Minsize (Oz) we don't really care which block is hot and which isn't, we just want to decrease codesize as much as possible. Hence this function, when optimising for minsize should just return true. Feel free to correct me if any of that sounds wrong. It probably doesn't make a large difference either way, but we might as well do it whilst we are here.

lkail added a reviewer: dmgreen. Jul 17 2019, 6:32 PM

Updated the patch following @dmgreen's suggestion.

Thanks. Looks like a nice change to me, other than one minor modification.

llvm/lib/CodeGen/MachineCSE.cpp
870	I think you can use hasMinSize, which is the truly size-paranoid option. Os (hasOptSize) is probably fine with your new block frequency check, if it's expected to speed up some code (and the codesize changes are fairly minimal).
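The distinction between the two size options can be sketched as follows. This is a standalone model under stated assumptions, not the patch itself: the `SizeOpt` enum and `shouldHoist` are hypothetical names, and the enum stands in for the function-level size attributes that `hasMinSize` and `hasOptSize` query. Only the minimum-size case bypasses the frequency comparison; the `-Os`-style case still goes through it.

```cpp
#include <cstdint>

// Hypothetical stand-in for the size attributes attached to a function.
enum class SizeOpt { None, OptSize /* like -Os */, MinSize /* like -Oz */ };

// Simplified model: at minimum size we always hoist, on the assumption that
// PRE enables more CSE and so shrinks code; otherwise block hotness decides.
bool shouldHoist(SizeOpt Opt, std::uint64_t FreqCMBB, std::uint64_t FreqMBB,
                 std::uint64_t FreqMBB1) {
  if (Opt == SizeOpt::MinSize)
    return true;
  return FreqCMBB <= FreqMBB + FreqMBB1;
}
```

With this shape, a hot common dominator blocks hoisting for ordinary and `-Os`-style functions but not for `-Oz`-style ones.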

This revision is now accepted and ready to land. Jul 18 2019, 3:40 AM

anton-afanasyev added inline comments. Jul 18 2019, 3:20 PM

llvm/lib/CodeGen/MachineCSE.cpp
875	I would suggest the more conservative `<` instead of `<=` here. This matters mainly for the case when all `BlockFreq`s are unknown (so they are equal to `0`).
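To make the difference concrete, here is a small sketch comparing the two operators. It assumes, as the comment above does, that unknown frequencies show up as `0`; the function names are hypothetical and the integers are a stand-in for real block frequencies.

```cpp
#include <cstdint>

// Non-strict form: hoists when the dominator is no hotter than the sources.
bool hoistNonStrict(std::uint64_t FreqCMBB, std::uint64_t FreqMBB,
                    std::uint64_t FreqMBB1) {
  return FreqCMBB <= FreqMBB + FreqMBB1;
}

// Strict form: additionally refuses to hoist when both sides are equal,
// which includes the case where every frequency is 0.
bool hoistStrict(std::uint64_t FreqCMBB, std::uint64_t FreqMBB,
                 std::uint64_t FreqMBB1) {
  return FreqCMBB < FreqMBB + FreqMBB1;
}
```

When all three frequencies are `0`, the non-strict form still hoists (`0 <= 0`), whereas the strict form does not (`0 < 0` is false).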

Btw, this change breaks hoisting from multiple (more than two) blocks into a common dominator. I've tested this patch on the original test case taken from here: https://bugs.llvm.org/show_bug.cgi?id=38917. There are several comparisons giving 96 > 40 + 10, 96 > 29 + 10, 96 > 18 + 10 (so no hoisting at all), while 96 < 97 = 40 + 29 + 18 + 10.
However, I do not see an easy solution to this issue.
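The arithmetic in this comment can be reproduced with a small model. The sketch below is illustrative only: `pairwiseOk` and `aggregateOk` are hypothetical helpers, and the aggregate check is not something the patch implements. It only shows why a pair-at-a-time comparison rejects every hoist even though the combined frequency of all four blocks exceeds that of the common dominator.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Pair-at-a-time check, as modelled from the comparisons quoted above.
bool pairwiseOk(std::uint64_t FreqCMBB, std::uint64_t FreqMBB,
                std::uint64_t FreqMBB1) {
  return FreqCMBB <= FreqMBB + FreqMBB1;
}

// Hypothetical aggregate check: compare the dominator against the sum of
// all blocks that contain the common computation.
bool aggregateOk(std::uint64_t FreqCMBB,
                 const std::vector<std::uint64_t> &BlockFreqs) {
  std::uint64_t Sum =
      std::accumulate(BlockFreqs.begin(), BlockFreqs.end(), std::uint64_t{0});
  return FreqCMBB <= Sum;
}
```

With the quoted numbers, each pair (40 + 10 = 50, 29 + 10 = 39, 18 + 10 = 28) falls below 96, so no pairwise hoist is accepted, while the total 40 + 29 + 18 + 10 = 97 is above 96.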

In D64394#1592621, @anton-afanasyev wrote:

Btw, this change breaks hoisting from multiple (more than two) blocks into a common dominator. I've tested this patch on the original test case taken from here: https://bugs.llvm.org/show_bug.cgi?id=38917. There are several comparisons giving 96 > 40 + 10, 96 > 29 + 10, 96 > 18 + 10 (so no hoisting at all), while 96 < 97 = 40 + 29 + 18 + 10.
However, I do not see an easy solution to this issue.

Good point! I think this would be an opportunity for our benchmark, and I can improve on it in follow-up patches. We may also need to take register pressure into account.

Use hasMinSize to check if optimized for size.

Closed by commit rL366570: [MachineCSE][MachinePRE] Avoid hoisting code from code regions into hot BBs. (authored by lkail). Jul 19 2019, 5:58 AM

This revision was automatically updated to reflect the committed changes.

Diff 208618

llvm/lib/CodeGen/MachineCSE.cpp

Show All 19 Lines
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/CFG.h"		#include "llvm/Analysis/CFG.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineDominators.h"		#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"		#include "llvm/CodeGen/MachineInstr.h"
		#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineOperand.h"		#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/Passes.h"		#include "llvm/CodeGen/Passes.h"
#include "llvm/CodeGen/TargetInstrInfo.h"		#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetOpcodes.h"		#include "llvm/CodeGen/TargetOpcodes.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"		#include "llvm/CodeGen/TargetRegisterInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"		#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/MC/MCInstrDesc.h"		#include "llvm/MC/MCInstrDesc.h"
Show All 25 Lines
namespace {		namespace {

class MachineCSE : public MachineFunctionPass {		class MachineCSE : public MachineFunctionPass {
const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
AliasAnalysis *AA;		AliasAnalysis *AA;
MachineDominatorTree *DT;		MachineDominatorTree *DT;
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
		MachineLoopInfo *LI;

public:		public:
static char ID; // Pass identification		static char ID; // Pass identification

MachineCSE() : MachineFunctionPass(ID) {		MachineCSE() : MachineFunctionPass(ID) {
initializeMachineCSEPass(*PassRegistry::getPassRegistry());		initializeMachineCSEPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
AU.addRequired<AAResultsWrapperPass>();		AU.addRequired<AAResultsWrapperPass>();
AU.addPreservedID(MachineLoopInfoID);		AU.addPreservedID(MachineLoopInfoID);
AU.addRequired<MachineDominatorTree>();		AU.addRequired<MachineDominatorTree>();
AU.addPreserved<MachineDominatorTree>();		AU.addPreserved<MachineDominatorTree>();
		AU.addRequired<MachineLoopInfo>();
		AU.addPreserved<MachineLoopInfo>();
}		}

void releaseMemory() override {		void releaseMemory() override {
ScopeMap.clear();		ScopeMap.clear();
PREMap.clear();		PREMap.clear();
Exps.clear();		Exps.clear();
}		}

▲ Show 20 Lines • Show All 703 Lines • ▼ Show 20 Lines	for (MachineBasicBlock::iterator I = MBB->begin(), E = MBB->end(); I != E;) {
auto MBB1 = PREMap[MI];		auto MBB1 = PREMap[MI];
assert(		assert(
!DT->properlyDominates(MBB, MBB1) &&		!DT->properlyDominates(MBB, MBB1) &&
"MBB cannot properly dominate MBB1 while DFS through dominators tree!");		"MBB cannot properly dominate MBB1 while DFS through dominators tree!");
auto CMBB = DT->findNearestCommonDominator(MBB, MBB1);		auto CMBB = DT->findNearestCommonDominator(MBB, MBB1);
if (!CMBB->isLegalToHoistInto())		if (!CMBB->isLegalToHoistInto())
continue;		continue;

		// Don't hoist the instruction into a loop.
		// FIXME: Can we go on looking for a non-loop ancestor in the dominator
		// tree?
		if (LI->getLoopFor(CMBB) != nullptr)
		continue;

// Two instrs are partial redundant if their basic blocks are reachable		// Two instrs are partial redundant if their basic blocks are reachable
// from one to another but one doesn't dominate another.		// from one to another but one doesn't dominate another.
if (CMBB != MBB1) {		if (CMBB != MBB1) {
auto BB = MBB->getBasicBlock(), BB1 = MBB1->getBasicBlock();		auto BB = MBB->getBasicBlock(), BB1 = MBB1->getBasicBlock();
if (BB != nullptr && BB1 != nullptr &&		if (BB != nullptr && BB1 != nullptr &&
(isPotentiallyReachable(BB1, BB) ||		(isPotentiallyReachable(BB1, BB) ||
isPotentiallyReachable(BB, BB1))) {		isPotentiallyReachable(BB, BB1))) {

Show All 39 Lines	bool MachineCSE::PerformSimplePRE(MachineDominatorTree *DT) {
} while (!BBs.empty());		} while (!BBs.empty());

return Changed;		return Changed;
}		}

bool MachineCSE::runOnMachineFunction(MachineFunction &MF) {		bool MachineCSE::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction()))		if (skipFunction(MF.getFunction()))
return false;		return false;

TII = MF.getSubtarget().getInstrInfo();		TII = MF.getSubtarget().getInstrInfo();
TRI = MF.getSubtarget().getRegisterInfo();		TRI = MF.getSubtarget().getRegisterInfo();
MRI = &MF.getRegInfo();		MRI = &MF.getRegInfo();
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
DT = &getAnalysis<MachineDominatorTree>();		DT = &getAnalysis<MachineDominatorTree>();
		LI = &getAnalysis<MachineLoopInfo>();
LookAheadLimit = TII->getMachineCSELookAheadLimit();		LookAheadLimit = TII->getMachineCSELookAheadLimit();
bool ChangedPRE, ChangedCSE;		bool ChangedPRE, ChangedCSE;
ChangedPRE = PerformSimplePRE(DT);		ChangedPRE = PerformSimplePRE(DT);
ChangedCSE = PerformCSE(DT->getRootNode());		ChangedCSE = PerformCSE(DT->getRootNode());
return ChangedPRE || ChangedCSE;		return ChangedPRE || ChangedCSE;
}		}

llvm/test/CodeGen/PowerPC/machine-pre.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mcpu=pwr9 -mtriple=powerpc64le-unknown-unknown \			; RUN: llc -mcpu=pwr9 -mtriple=powerpc64le-unknown-unknown \
	; RUN: -ppc-asm-full-reg-names -verify-machineinstrs -O2 < %s | FileCheck %s \			; RUN: -ppc-asm-full-reg-names -verify-machineinstrs -O2 < %s | FileCheck %s \
	; RUN: --check-prefix=CHECK-P9			; RUN: --check-prefix=CHECK-P9

	define i32 @t(i32 %n, i32 %delta, i32 %a) {			define i32 @t(i32 %n, i32 %delta, i32 %a) {
	; CHECK-P9-LABEL: t:			; CHECK-P9-LABEL: t:
	; CHECK-P9: # %bb.0: # %entry			; CHECK-P9: # %bb.0: # %entry
	; CHECK-P9-NEXT: lis r7, 0			; CHECK-P9-NEXT: lis r7, 0
	; CHECK-P9-NEXT: li r6, 0			; CHECK-P9-NEXT: li r6, 0
				; CHECK-P9-NEXT: li r8, 0
	; CHECK-P9-NEXT: li r9, 0			; CHECK-P9-NEXT: li r9, 0
	; CHECK-P9-NEXT: li r10, 0
	; CHECK-P9-NEXT: ori r7, r7, 65535			; CHECK-P9-NEXT: ori r7, r7, 65535
	; CHECK-P9-NEXT: .p2align 5			; CHECK-P9-NEXT: .p2align 5
	; CHECK-P9-NEXT: .LBB0_1: # %header			; CHECK-P9-NEXT: .LBB0_1: # %header
	; CHECK-P9-NEXT: #			; CHECK-P9-NEXT: #
	; CHECK-P9-NEXT: addi r10, r10, 1			; CHECK-P9-NEXT: addi r9, r9, 1
	; CHECK-P9-NEXT: cmpw r10, r3			; CHECK-P9-NEXT: cmpw r9, r3
	; CHECK-P9-NEXT: addi r8, r5, 1024
	; CHECK-P9-NEXT: blt cr0, .LBB0_4			; CHECK-P9-NEXT: blt cr0, .LBB0_4
	; CHECK-P9-NEXT: # %bb.2: # %cont			; CHECK-P9-NEXT: # %bb.2: # %cont
	; CHECK-P9-NEXT: #			; CHECK-P9-NEXT: #
	; CHECK-P9-NEXT: add r9, r9, r4			; CHECK-P9-NEXT: add r8, r8, r4
	; CHECK-P9-NEXT: cmpw r9, r7			; CHECK-P9-NEXT: cmpw r8, r7
	; CHECK-P9-NEXT: bgt cr0, .LBB0_1			; CHECK-P9-NEXT: bgt cr0, .LBB0_1
	; CHECK-P9-NEXT: # %bb.3: # %cont.1			; CHECK-P9-NEXT: # %bb.3: # %cont.1
	; CHECK-P9-NEXT: mr r6, r8			; CHECK-P9-NEXT: addi r6, r5, 1024
	; CHECK-P9-NEXT: .LBB0_4: # %return			; CHECK-P9-NEXT: .LBB0_4: # %return
	; CHECK-P9-NEXT: mullw r3, r6, r8			; CHECK-P9-NEXT: addi r3, r5, 1024
				; CHECK-P9-NEXT: mullw r3, r6, r3
	; CHECK-P9-NEXT: blr			; CHECK-P9-NEXT: blr
	entry:			entry:
	br label %header			br label %header

	header:			header:
	%sum = phi i32 [ 0, %entry ], [ %sum.1, %cont ]			%sum = phi i32 [ 0, %entry ], [ %sum.1, %cont ]
	%i = phi i32 [ 0, %entry ], [ %i.1, %cont ]			%i = phi i32 [ 0, %entry ], [ %i.1, %cont ]
	%i.1 = add nsw i32 %i, 1			%i.1 = add nsw i32 %i, 1
	Show All 18 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[MachineCSE][MachinePRE] Do not hoist common computations into hot BBs
ClosedPublic
