This is an archive of the discontinued LLVM Phabricator instance.

Differential D3223

'CSE' of ADRP instructions used for load/store addressing
Needs Review, Public

Authored by Jiangning on Mar 30 2014, 8:18 PM.


Details

Reviewers

t.p.northover

Summary

Hi,

The attached patch performs 'CSE' of ADRP instructions used for load/store addressing.

GCC always uses the common attribute to expose global symbols to other modules, but this does not seem to be mandatory in most compilation scenarios (let me know if I'm wrong). With -fno-common enabled, we get the chance to merge the global definitions within the current compilation module into a single monolithic structure. LLVM can then treat every global access as an access to a field within the newly created structure.
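To make the "one big structure" idea concrete, here is a minimal standalone sketch (not LLVM's DataLayout/StructLayout code) of how field offsets are assigned when globals are laid out back to back as struct fields, assuming each field is aligned to its own power-of-two alignment. Field, alignTo and layoutFields are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one global being merged: its name, size and
// required alignment in bytes (the alignment must be a power of two).
struct Field {
  std::string Name;
  uint64_t Size;
  uint64_t Align;
};

// Round Value up to the next multiple of Align (Align is a power of two).
static uint64_t alignTo(uint64_t Value, uint64_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Give each field the first suitably aligned offset after the previous
// field, the way a C-like struct layout would.
std::vector<uint64_t> layoutFields(const std::vector<Field> &Fields) {
  std::vector<uint64_t> Offsets;
  uint64_t Cursor = 0;
  for (const Field &F : Fields) {
    Cursor = alignTo(Cursor, F.Align);
    Offsets.push_back(Cursor);
    Cursor += F.Size;
  }
  return Offsets;
}
```

For three i32 globals x, y and z this yields offsets 0, 4 and 8, so every access becomes "address of the merged struct + constant", and all three accesses can share one page base.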

For global symbol access, we usually have the following instruction sequence:

adrp ; load page address (4K boundary)
add ; add the address within the page
load/store ; use the address from the last add

If the merged structure always fits within a single page, fewer adrp instructions are needed, because accesses to different fields can share the same page address.
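As a rough illustration of why merging helps: ADRP only produces the 4 KiB page of an address, and the low 12 bits come from the following add or load/store. The standalone sketch below splits an address that way and counts the distinct pages a set of addresses touches, under the simplifying assumption that one adrp can serve every access to the same page. pageOf, pageOffsetOf and distinctPages are hypothetical helpers, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// The part of an address that ADRP can produce: clear the low 12 bits.
uint64_t pageOf(uint64_t Addr) { return Addr & ~uint64_t(0xFFF); }

// The part that must come from the following add or load/store offset.
uint64_t pageOffsetOf(uint64_t Addr) { return Addr & 0xFFF; }

// Number of distinct 4 KiB pages touched by the given addresses. Assuming
// one ADRP can be reused for all accesses to the same page, this is the
// number of ADRPs required.
std::size_t distinctPages(const std::vector<uint64_t> &Addrs) {
  std::set<uint64_t> Pages;
  for (uint64_t A : Addrs)
    Pages.insert(pageOf(A));
  return Pages.size();
}
```

Globals packed contiguously inside one page need a single page base, while globals that happen to sit on either side of a page boundary (for example 0x10ffc and 0x11000) need two.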

LLVM already has a GlobalMerge pass, which originally only merges static (internal) variables; this patch extends it to support external variables for AArch64.

There are some other considerations:

- In theory, ADRP reduction is not always beneficial, because:
  - The live range holding the page address can grow considerably, which can increase register pressure.
  - The data section may contain holes, which can change data cache behavior.
- The maximum offset supported by the scaled immediate form of the AArch64 ldr/str instructions is actually 4095*sizeof(elem_type), which is larger than a 4K page for elements wider than a byte. In theory, we could therefore create an even larger merged struct for global variables. However, the static experiment shows little additional adrp reduction when increasing the merged size from 4K to 8K.
- The scenario for applying this optimization differs from the ARM target because of the different addressing modes.
- We can actually eliminate the "add" instruction as well by propagating the in-page address, which is usually a relocation, into the load/store instruction; the original load/store offset can then be folded into the fixup (addend) in the relocation section. This is a separate optimization that could be done as a follow-up.
- I know the ARM64 back-end is now in trunk, and I can also port this patch to ARM64 if needed.
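The 4095 returned by getMaximalGlobalOffset and the "sizeof(elem_type)" remark come from the unsigned-offset form of LDR/STR, whose 12-bit immediate is scaled by the access size. A minimal sketch of that encodability check, assuming only that form (the unscaled LDUR/STUR variants are ignored); fitsScaledUImm12 is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if Offset fits the scaled, unsigned 12-bit immediate of an
// AArch64 LDR/STR (unsigned offset) that accesses AccessSize bytes.
// AccessSize is expected to be 1, 2, 4, 8 or 16.
bool fitsScaledUImm12(uint64_t Offset, uint64_t AccessSize) {
  if (Offset % AccessSize != 0)
    return false;                     // must be a multiple of the access size
  return Offset / AccessSize <= 4095; // imm12 field encodes 0..4095
}
```

For byte accesses the limit is 4095, which is why a 4095-byte merged block is the safe choice for any element type; 8-byte accesses could reach offsets up to 32760.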

Finally, I don't have a real AArch64 machine to test the patch, so only static statistics for SPEC2000 are given here. I would be extremely grateful if anybody could help validate the performance.

(The before/after merge columns give the number of ADRP instructions.)

Benchmark    Before merge  After merge  Decrease
256.bzip2             655          367    43.97%
186.crafty           3898         3786     2.87%
175.vpr              1636         1620     0.98%
255.vortex           7333         6950     5.22%
252.eon              1660         1658     0.12%
181.mcf                58           58     0.00%
164.gzip              718          622    13.37%
253.perlbmk          9460         9443     0.18%
197.parser           1437         1309     8.91%
300.twolf            2993         2713     9.36%
254.gap              6145         5596     8.93%
176.gcc             14650        12908    11.89%
183.equake            264          181    31.44%
188.ammp              937          862     8.00%
179.art               344          162    52.91%
177.mesa             4230         4230     0.00%
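The percentage column is the relative reduction (before - after) / before * 100, rounded to two decimals. A small sketch that reproduces it from the two counts; adrpDecreasePercent and roundTo2 are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in ADRP count, in percent.
double adrpDecreasePercent(unsigned Before, unsigned After) {
  if (Before == 0)
    return 0.0; // nothing to reduce
  return (static_cast<double>(Before) - After) * 100.0 / Before;
}

// Round to two decimal places, as the table reports values.
double roundTo2(double V) { return std::round(V * 100.0) / 100.0; }
```

For example, 256.bzip2 goes from 655 to 367 ADRPs: 288 / 655 * 100 is about 43.97.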

Thanks,
-Jiangning


Event Timeline

Can't this be tested with "opt -global-merge"?

Hi Rafael,

No. I didn't enable global-merge-on-external by default, so you have to use -global-merge-on-external instead.

Thanks,
-Jiangning

Hi,

We applied this patch and are evaluating its performance. However, it only
works for AArch64. The test case added by this patch failed on ARM64.
Jiangning, can you take a look?

Thanks,
Zhaoshi

Rebased the patch on TOT, because the previous version caused a build failure on TOT.

Hi Jiangning,

It looks to me like you have two change sets in the same patch:

- One that adds support for external linkage to the global merge pass.
- One that enables the global merge pass for AArch64.

Could you split the patch to match that?

Thanks,
-Quentin

Hi Quentin,

Yes, you are correct. I have now split it into two separate patches.

One is at http://reviews.llvm.org/D3431 for enabling the global merge pass, the
other is at http://reviews.llvm.org/D3432 for implementing ADRP CSE for
global symbols.

I am not reusing the original code review at http://reviews.llvm.org/D3223, to
avoid confusion.

Thanks,
-Jiangning

2014-04-19 0:48 GMT+08:00 Quentin Colombet <qcolombet@apple.com>:


http://reviews.llvm.org/D3223

Revision Contents

Path                                           Size
include/llvm/IR/GlobalAlias.h                  4 lines
lib/CodeGen/AsmPrinter/AsmPrinter.cpp          7 lines
lib/IR/Globals.cpp                             25 lines
lib/Target/AArch64/AArch64ISelLowering.h       4 lines
lib/Target/AArch64/AArch64ISelLowering.cpp     7 lines
lib/Target/AArch64/AArch64TargetMachine.cpp    9 lines
lib/Transforms/Scalar/GlobalMerge.cpp          54 lines
test/CodeGen/AArch64/global_merge.ll           48 lines

Diff 8624

include/llvm/IR/GlobalAlias.h

Context not available.
   static inline bool classof(const Value *V) {
     return V->getValueID() == Value::GlobalAliasVal;
   }
+
+  // return the constant offset of an expression, with which this global var
+  // has alias.
+  uint64_t calculateOffset(const DataLayout &DL) const;
 };

 template <>
Context not available.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp

Context not available.
     EmitVisibility(Name, Alias.getVisibility());

     // Emit the directives as assignments aka .set:
-    OutStreamer.EmitAssignment(Name,
-                               MCSymbolRefExpr::Create(Target, OutContext));
+    const MCExpr *Expr = MCSymbolRefExpr::Create(Target, OutContext);
+    if (uint64_t Offset = Alias.calculateOffset(*TM.getDataLayout()))
+      Expr = MCBinaryExpr::CreateAdd(Expr,
+                 MCConstantExpr::Create(Offset, OutContext), OutContext);
+    OutStreamer.EmitAssignment(Name, Expr);
   }
 }
Context not available.
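With this change, an alias whose aliasee is a constant GEP into the merged global is emitted as an assignment of the form "name = base+offset" rather than "name = base", which is what the "x = _MergedGlobals_x" and "y = _MergedGlobals_x+4" lines in the new test expect. A standalone sketch of just that textual form, not using the MC layer; formatAliasAssignment is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Mirror of the AsmPrinter change: print "Name = Target", appending
// "+Offset" only when the alias offset is non-zero.
std::string formatAliasAssignment(const std::string &Name,
                                  const std::string &Target,
                                  uint64_t Offset) {
  std::string Expr = Target;
  if (Offset != 0)
    Expr += "+" + std::to_string(Offset);
  return Name + " = " + Expr;
}
```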

lib/IR/Globals.cpp

Context not available.
 #include "llvm/IR/GlobalValue.h"
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/IR/Constants.h"
+#include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DerivedTypes.h"
 #include "llvm/IR/GlobalAlias.h"
 #include "llvm/IR/GlobalVariable.h"
Context not available.
     return GV;
   }
 }
+
+uint64_t GlobalAlias::calculateOffset(const DataLayout &DL) const {
+  uint64_t Offset = 0;
+  const Constant *C = this;
+  while (C) {
+    if (const GlobalAlias *GA = dyn_cast<GlobalAlias>(C)) {
+      C = GA->getAliasee();
+    } else if (const ConstantExpr *CE = dyn_cast<ConstantExpr>(C)) {
+      if (CE->getOpcode() == Instruction::GetElementPtr) {
+        std::vector<Value*> Args;
+        for (unsigned I = 1; I < CE->getNumOperands(); ++I)
+          Args.push_back(CE->getOperand(I));
+        Offset += DL.getIndexedOffset(CE->getOperand(0)->getType(), Args);
+      }
+      C = CE->getOperand(0);
+    } else if (isa<GlobalValue>(C)) {
+      return Offset;
+    } else {
+      assert(0 && "Unexpected type in alias chain!");
+      return 0;
+    }
+  }
+  return Offset;
+}
Context not available.
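calculateOffset follows the aliasee through any chain of aliases and constant expressions, adding the byte offset of every GEP, until it reaches a GlobalValue. The sketch below models that walk with hypothetical, simplified node types instead of the real IR classes; each GEP node is assumed to already carry the byte offset that DataLayout::getIndexedOffset would compute from its indices.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the constants on an alias chain.
struct Node {
  enum Kind { Alias, GEP, Cast, Global };
  Kind K;
  const Node *Next;    // aliasee (Alias) or operand 0 (GEP, Cast)
  uint64_t ByteOffset; // only meaningful for GEP
};

// Sum the GEP offsets on the way from N to the underlying global.
uint64_t calculateOffset(const Node *N) {
  uint64_t Offset = 0;
  while (N) {
    switch (N->K) {
    case Node::Alias: // alias: continue with the aliasee
    case Node::Cast:  // non-GEP constant expression: look through operand 0
      N = N->Next;
      break;
    case Node::GEP: // GEP: add its byte offset, continue with the base
      Offset += N->ByteOffset;
      N = N->Next;
      break;
    case Node::Global: // reached the underlying global
      return Offset;
    }
  }
  return Offset;
}
```

In the new test, the alias for @y points at a GEP selecting field 1 of the merged { i32, i32, i32 }, so the walk yields 4.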

lib/Target/AArch64/AArch64ISelLowering.h

Context not available.
   virtual bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
                                   unsigned Intrinsic) const override;

+  /// getMaximalGlobalOffset - Returns the maximal possible offset which can
+  /// be used for loads / stores from the global.
+  unsigned getMaximalGlobalOffset() const override;
+
 protected:
   std::pair<const TargetRegisterClass*, uint8_t>
   findRepresentativeClass(MVT VT) const;
Context not available.

lib/Target/AArch64/AArch64ISelLowering.cpp

Context not available.
   return AM.Scale != 0 && AM.Scale != 1;
   return -1;
 }
+
+/// getMaximalGlobalOffset - Returns the maximal possible offset which can
+/// be used for loads / stores from the global.
+unsigned AArch64TargetLowering::getMaximalGlobalOffset() const {
+  return 4095;
+}
+
Context not available.

lib/Target/AArch64/AArch64TargetMachine.cpp

Context not available.
 #include "llvm/CodeGen/Passes.h"
 #include "llvm/PassManager.h"
 #include "llvm/Support/TargetRegistry.h"
+#include "llvm/Transforms/Scalar.h"

 using namespace llvm;

Context not available.
     return *getAArch64TargetMachine().getSubtargetImpl();
   }

+  bool addPreISel() override;
   virtual bool addInstSelector();
   virtual bool addPreEmitPass();
 };
 } // namespace

+bool AArch64PassConfig::addPreISel() {
+  if (TM->getOptLevel() != CodeGenOpt::None)
+    addPass(createGlobalMergePass(TM));
+
+  return false;
+}
+
 TargetPassConfig *AArch64TargetMachine::createPassConfig(PassManagerBase &PM) {
   return new AArch64PassConfig(this, PM);
 }
Context not available.

lib/Transforms/Scalar/GlobalMerge.cpp

Context not available.
                     cl::desc("Enable global merge pass on constants"),
                     cl::init(false));

+static cl::opt<bool>
+EnableGlobalMergeOnExternal("global-merge-on-external", cl::Hidden,
+     cl::desc("Enable global merge pass on external linkage"),
+     cl::init(false));
+
 STATISTIC(NumMerged , "Number of globals merged");
 namespace {
   class GlobalMerge : public FunctionPass {
Context not available.
     uint64_t MergedSize = 0;
     std::vector<Type*> Tys;
     std::vector<Constant*> Inits;
+    bool InternalOnly = true;
     for (j = i; j != e; ++j) {
       Type *Ty = Globals[j]->getType()->getElementType();
       MergedSize += DL->getTypeAllocSize(Ty);
       if (MergedSize > MaxOffset) {
+        MergedSize -= DL->getTypeAllocSize(Ty);
         break;
       }
       Tys.push_back(Ty);
       Inits.push_back(Globals[j]->getInitializer());
+
+      if (Globals[i]->hasExternalLinkage()) {
+        InternalOnly = false;
+      }
     }

     StructType *MergedTy = StructType::get(M.getContext(), Tys);
     Constant *MergedInit = ConstantStruct::get(MergedTy, Inits);
+
+    // If merged variables doesn't have external linkage, we needn't to expose
+    // the symbol after merging.
+    GlobalValue::LinkageTypes Linkage = InternalOnly ?
+                                        GlobalValue::InternalLinkage :
+                                        GlobalValue::ExternalLinkage ;
+
+    // If merged variables have external linkage, we use symbol name of the
+    // first variable merged as the suffix of global symbol name.
+    Twine MergedGVName = InternalOnly ?
+                         "_MergedGlobals" :
+                         "_MergedGlobals_" + Globals[i]->getName() ;
     GlobalVariable *MergedGV = new GlobalVariable(M, MergedTy, isConst,
-                                                  GlobalValue::InternalLinkage,
-                                                  MergedInit, "_MergedGlobals",
-                                                  0, GlobalVariable::NotThreadLocal,
-                                                  AddrSpace);
+                                                  Linkage, MergedInit, MergedGVName,
+                                                  0, GlobalVariable::NotThreadLocal,
+                                                  AddrSpace);
+    if (EnableGlobalMergeOnExternal) {
+      // If the alignment is not a power of 2, round up to the next power of 2.
+      uint64_t Align = MergedSize;
+      if (Align & (Align-1))
+        Align = llvm::NextPowerOf2(Align);
+      MergedGV->setAlignment(Align);
+    }

     for (size_t k = i; k < j; ++k) {
+      GlobalValue::LinkageTypes Linkage = Globals[k]->getLinkage();
+      std::string Name = Globals[k]->getName();
+
       Constant *Idx[2] = {
         ConstantInt::get(Int32Ty, 0),
         ConstantInt::get(Int32Ty, k-i)
Context not available.
       Constant *GEP = ConstantExpr::getInBoundsGetElementPtr(MergedGV, Idx);
       Globals[k]->replaceAllUsesWith(GEP);
       Globals[k]->eraseFromParent();
+
+      if (Linkage != GlobalValue::InternalLinkage) {
+        // Generate a new alias...
+        new GlobalAlias(GEP->getType(), Linkage, Name, GEP, &M);
+      }
+
       NumMerged++;
     }
     i = j;
Context not available.
   // Grab all non-const globals.
   for (Module::global_iterator I = M.global_begin(),
          E = M.global_end(); I != E; ++I) {
-    // Merge is safe for "normal" internal globals only
-    if (!I->hasLocalLinkage() || I->isThreadLocal() || I->hasSection())
+    // Merge is safe for "normal" internal or external globals only
+    if (!((EnableGlobalMergeOnExternal && I->hasExternalLinkage())
+          || I->hasInternalLinkage())
+        || I->isDeclaration() || I->isThreadLocal() || I->hasSection())
       continue;

     PointerType *PT = dyn_cast<PointerType>(I->getType());
Context not available.
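The grouping loop above packs consecutive globals until the running size would exceed MaxOffset (4095 on AArch64), then, with -global-merge-on-external, aligns the merged global to its size rounded up to a power of two. Since the size is at most 4095, that alignment is at most 4096, which presumably keeps the merged block from straddling a 4K page. The sketch below models both steps with hypothetical types (Candidate, Group, groupGlobals, roundUpToPowerOf2) rather than the real pass. Note that the loop in the diff tests Globals[i], the first global of the group, on every iteration; the sketch checks each member's linkage, which appears to be the intent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one global that passed the filters.
struct Candidate {
  uint64_t AllocSize; // bytes, as DataLayout::getTypeAllocSize would report
  bool External;      // true if the global has external linkage
};

// One merged group: candidates [Begin, End) and their total size.
struct Group {
  std::size_t Begin;
  std::size_t End;
  uint64_t Size;
  bool InternalOnly;
};

// Smallest power of two >= V (V >= 1); matches the patch's
// "if not a power of 2, use llvm::NextPowerOf2" step.
uint64_t roundUpToPowerOf2(uint64_t V) {
  uint64_t P = 1;
  while (P < V)
    P <<= 1;
  return P;
}

// Greedily pack consecutive candidates while the total stays <= MaxOffset.
std::vector<Group> groupGlobals(const std::vector<Candidate> &Gs,
                                uint64_t MaxOffset) {
  std::vector<Group> Groups;
  std::size_t I = 0;
  while (I < Gs.size()) {
    Group G{I, I, 0, true};
    while (G.End < Gs.size() && G.Size + Gs[G.End].AllocSize <= MaxOffset) {
      G.Size += Gs[G.End].AllocSize;
      if (Gs[G.End].External) // check every member, not just the first
        G.InternalOnly = false;
      ++G.End;
    }
    if (G.End == G.Begin) { // single global larger than MaxOffset: skip it
      ++I;
      continue;
    }
    Groups.push_back(G);
    I = G.End;
  }
  return Groups;
}
```

For the three external i32 globals in the new test this gives one 12-byte group aligned to 16, which matches the ".align 4" and ".size _MergedGlobals_x, 12" CHECK lines.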

test/CodeGen/AArch64/global_merge.ll

This file was added.

; RUN: llc < %s -mtriple=aarch64-none-linux-gnu -global-merge-on-external=true | FileCheck %s

@x = global i32 0, align 4
@y = global i32 0, align 4
@z = global i32 0, align 4
@m = internal global i32 0, align 4
@n = internal global i32 0, align 4

define void @f1(i32 %a1, i32 %a2) {
; CHECK-LABEL: f1:
; CHECK: adrp x{{[0-9]+}}, _MergedGlobals_
  store i32 %a1, i32* @x, align 4
  store i32 %a2, i32* @y, align 4

; CHECK: adrp x{{[0-9]+}}, _MergedGlobals
; CHECK-NOT: adrp
  store i32 %a1, i32* @m, align 4
  store i32 %a2, i32* @n, align 4
  ret void
}

define void @g1(i32 %a1, i32 %a2) {
; CHECK-LABEL: g1:
; CHECK: adrp

; We should have only one adrp generated for this function.
; CHECK-NOT: adrp
  store i32 %a1, i32* @y, align 4
  store i32 %a2, i32* @z, align 4
  ret void
}

; CHECK: .bss
; CHECK: .globl _MergedGlobals_x
; CHECK: .align 4
; CHECK: _MergedGlobals_
; CHECK: .size _MergedGlobals_x, 12

; CHECK: .local _MergedGlobals
; CHECK: .comm _MergedGlobals,8,8

; CHECK: .globl x
; CHECK: x = _MergedGlobals_x
; CHECK: .globl y
; CHECK: y = _MergedGlobals_x+4
; CHECK: .globl z
; CHECK: z = _MergedGlobals_x+8
