This is an archive of the discontinued LLVM Phabricator instance.

With PIE on x86_64, keep hot local arrays on the stack
Needs ReviewPublic

Authored by tmsriram on Mar 8 2017, 2:39 PM.

Details

Reviewers
davidxl
Summary

With PIE on x86_64, accessing local arrays is cheaper than accessing global arrays.

This patch seeks to improve the performance of PIE code on x86_64 when accessing local const arrays.

A few of our key benchmarks slow down by as much as 3% because of this, and we would like a way to avoid the regression.

Example here:

int foo(int i) {
  int arr[] = { 0, 2, 7, 11 };
  return arr[i];
}

$ clang -O2 foo.cc -S -fPIE
foo.s:

_Z3fooi: # @_Z3fooi

...
movslq  %edi, %rax                 # Access i
leaq    _ZZ3fooiE3arr(%rip), %rcx  # Access arr promoted to global
movl    (%rcx,%rax,4), %eax        # Access arr[i]
retq

...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb

The array is made a global, and the access needs one extra instruction, the lea, to compute the address of the array _ZZ3fooiE3arr. This causes up to a 3% performance regression, larger on newer chips.

If this array were kept on the stack, the access needs only two instructions. It does require pushing the elements onto the stack, which needs to be done once at function entry. This is how the code would look:

_Z3fooi: # @_Z3fooi

...

movaps .L_ZZ3fooiE3arr(%rip), %xmm0 # Vectorized, array push onto the stack
movaps %xmm0, -24(%rsp) # Push the array on the stack
movslq %edi, %rax # Access i
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
...
.section .rodata,"a",@progbits
...
.L_ZZ3fooiE3arr:
.long 0 # 0x0
.long 2 # 0x2
.long 7 # 0x7
.long 11 # 0xb

The actual array access is one instruction in this case without accounting for the extra instructions needed to push it onto the stack:
movl -24(%rsp,%rax,4), %eax # Access arr[i] in one instruction
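A source-level sketch of what the transformation amounts to (the real patch works on LLVM IR; the names fooStack and init are illustrative, not from the patch): copy the constants into a stack array once at entry, then index it directly off the stack pointer, avoiding the PIC lea on every access.

```cpp
#include <cassert>
#include <cstring>

// Sketch only: the static init table plays the role of the .rodata template,
// and the memcpy corresponds to the movaps copy at function entry.
int fooStack(int i) {
  static const int init[] = {0, 2, 7, 11}; // .rodata template
  int arr[4];
  std::memcpy(arr, init, sizeof(arr)); // one-time copy onto the stack
  return arr[i];                       // the one-instruction access
}
```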

If the array element accesses are hot relative to the one-time cost of pushing the elements onto the stack, it is better for performance to keep the array on the stack. This patch tries to do that based on profiles: it compares the count of the basic block where the element is accessed to the count of the basic block where the stack allocation is made. It also takes into account the size of the array, since more instructions are needed to push a larger array.
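A minimal sketch of the profitability check described above. The function name, the exact cost model, and the kStoreBytesPerInst constant are all illustrative, not the patch's actual code; the idea is just that the profile count of the accessing block, which measures leas saved, must outweigh the entry-block count scaled by the number of stores needed to materialize the array.

```cpp
#include <cassert>
#include <cstdint>

// One 16-byte movaps store per chunk of the array (illustrative assumption).
constexpr uint64_t kStoreBytesPerInst = 16;

bool shouldKeepOnStack(uint64_t accessBlockCount, uint64_t allocBlockCount,
                       uint64_t arraySizeInBytes) {
  // Instructions needed to copy the array onto the stack, paid once per entry.
  uint64_t copyInsts =
      (arraySizeInBytes + kStoreBytesPerInst - 1) / kStoreBytesPerInst;
  uint64_t copyCost = allocBlockCount * copyInsts;
  // One lea is avoided on every access.
  uint64_t savedCost = accessBlockCount;
  return savedCost > copyCost;
}
```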

A couple of notes about this:

  • Clang turns on -fmerge-all-constants by default and moves all const local arrays to .rodata even before any optimizations are applied. So, for this patch to be effective, -fno-merge-all-constants must be used.
  • The instruction combine pass analyzes allocas for const arrays and tries to delete them when possible. This patch checks the profile counts at that point.
  • This is applicable only to x86_64, and only for PIE builds.
  • GCC always keeps const stack arrays on the stack unless -fmerge-all-constants is specified. What GCC does here is overkill to avoid this problem:

https://gcc.gnu.org/ml/gcc/2016-04/msg00178.html
The standard requires that each activation of the function have its own array object, so if the function is entered recursively while an outer activation is live, the two arrays' addresses must be distinct. GCC honors this but does not distinguish the cases where the pointer never escapes or the function is never called recursively.
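A minimal sketch of the uniqueness requirement discussed above (the names are illustrative). Here arr is deliberately non-const, so no compiler merges it and each recursive activation gets a fresh stack object; if arr were a const array, promoting it to a single .rodata object under -fmerge-all-constants could make the two addresses compare equal, which is the conformance issue in question.

```cpp
// Address of the outer activation's array, captured before recursing.
static int *outerArr = nullptr;

bool distinctPerActivation(int depth) {
  int arr[] = {0, 2, 7, 11}; // non-const: always a fresh per-call object
  if (depth == 0) {
    outerArr = arr;
    bool distinct = distinctPerActivation(1);
    return distinct;
  }
  // Inner activation: while the outer arr is still live, this arr must be
  // a distinct object, so the addresses must differ.
  return arr != outerArr;
}
```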

  • Clang does not honor that part of the standard; I am not sure of the full story here. Since -fmerge-all-constants is the default, clang always moves the stack array to .rodata, and pointer uniqueness is violated if the function is entered recursively. This is a separate discussion, but I am pitching my solution here since it is within the scope of this work:

  1. Add a clang option -fmerge-constants and stop making -fmerge-all-constants the default, just like GCC.
  2. Then instruction combine can keep function-local arrays on the stack if the function is called recursively.
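For reference, the stack-allocated behavior can already be observed by disabling constant merging when compiling the foo.cc example above (output file names are illustrative):

```shell
# Default: -fmerge-all-constants is on, so arr is promoted to .rodata
# and accessed through a PIC lea.
clang -O2 -fPIE -S foo.cc -o foo-merged.s

# With merging disabled, the const array stays as an alloca and is
# materialized on the stack at function entry.
clang -O2 -fPIE -fno-merge-all-constants -S foo.cc -o foo-stack.s
```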

Please let me know what you think.

Diff Detail

Event Timeline

tmsriram created this revision. Mar 8 2017, 2:39 PM

In theory, the C/C++ standards require behavior equivalent to -fno-merge-all-constants. In practice, code doesn't actually depend on that, so we made the decision many years ago to turn on -fmerge-all-constants by default.

Anyway, you don't really want the behavior of the optimizer to depend on whether the constant array is marked "static"; it probably doesn't reflect the user's intent in any useful way.


I'm sort of surprised you didn't mention register pressure at all in your explanation.

In theory, the C/C++ standards require behavior equivalent to -fno-merge-all-constants. In practice, code doesn't actually depend on that, so we made the decision many years ago to turn on -fmerge-all-constants by default.

Understood. Does it seem reasonable/useful to fix this along the
lines of GCC, -fmerge-constants and -fmerge-all-constants where the
latter applies to const arrays and a warning that this is happening
when the latter option is used?

-fmerge-all-constants has exactly the same meaning in clang and gcc. And it's generally beneficial for both codesize and performance, so turning it off to pursue performance is a bad idea.

I would suggest finding some other approach to solve your issue later in the optimization pipeline, preferably in a manner which is sensitive to register pressure. Maybe put the code in ConstantHoisting? You don't lose any useful information by promoting the alloca to a global constant; you can easily recreate it later.

Digging a little more, here is what I found.

  • As Eli pointed out, this optimization is needed only when there is register pressure; otherwise the address computation can be hoisted out.
  • To recap, a global array access in PIE mode needs two instructions on x86_64: an address computation using lea and the actual element access. If the array access is inside a hot loop, we have seen performance drop by a few percent, due to the increased dynamic instruction count from the address computation compared to non-PIE code.
  • Machine LICM does hoist the address computation of the array outside the loop, but register allocation will sink it back near the use via rematerialization if register pressure is high.

I can think of two different ways to solve this problem:

a) Implement this in a late optimization pass, in LLVM IR, using a heuristic to estimate register pressure.

  • When compiling for PIE, the optimization would move a global array to the stack if the register pressure estimate of the function where it is used is high.
  • Use a heuristic to estimate register usage. This is already done, for instance, in loop vectorization: see LoopVectorizationCostModel::calculateRegisterUsage.
  • The heuristic would use the number of overlapping live ranges as the estimate of register usage.
  • If implemented in a late pass, just before code generation, the estimate would tend to be close to the actual usage.
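The live-range heuristic in (a) can be sketched as follows, in the spirit of LoopVectorizationCostModel::calculateRegisterUsage: treat each value as a [def, last-use] interval over a linearized instruction order and take the maximum number of simultaneously live intervals as the pressure estimate. The interval representation and function name are illustrative, not LLVM's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A value is live from its defining instruction to its last use,
// both given as positions in a linearized instruction order.
struct LiveRange { unsigned Def, LastUse; };

unsigned estimatePressure(const std::vector<LiveRange> &Ranges) {
  // Sweep over +1/-1 events: a range becomes live at Def and dies
  // just after LastUse. Sorting puts a death before a birth at the
  // same position, so a dying value frees its register first.
  std::vector<std::pair<unsigned, int>> Events;
  for (const LiveRange &R : Ranges) {
    Events.push_back({R.Def, +1});
    Events.push_back({R.LastUse + 1, -1});
  }
  std::sort(Events.begin(), Events.end());
  int Live = 0;
  unsigned Max = 0;
  for (const auto &E : Events) {
    Live += E.second;
    Max = std::max(Max, static_cast<unsigned>(Live));
  }
  return Max; // peak number of overlapping live ranges
}
```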

b) Teach rematerialization, during greedy register allocation, to move global arrays to the stack instead of recomputing the address every time it is used.

  • This would be done in machine IR, during register allocation, when it is known that the address is about to be rematerialized.
  • However, this optimization seems heavy-weight to do at the machine IR level; I am not sure about the complexity/feasibility of doing it there.
  • The only reason to do it here is the absence of a register usage estimation function in LLVM IR.

It looks like a) would be the way to go here. The need for a generic register usage estimation function has already been discussed by Wei to drive other optimizations. What do you think?

There's also another possibility: you could make the register allocator prefer to spill some other register which isn't on the critical path. Not sure if that's practical for the loops you care about.

That said, your possibility (a) seems reasonable.