This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Only generate memcpy if the slice is large 'enough' (WIP).
Needs Review · Public

Authored by fhahn on Oct 6 2020, 6:11 AM.

Details

Reviewers

efriedma
Carrot
arsenm
lebedev.ri

Summary

Currently SROA liberally creates llvm.memcpy calls when dealing with
small slices of allocas. Unfortunately there are multiple places in LLVM
that do not work too well with llvm.memcpy.

This can lead to surprising code gen, e.g. see PR47705 and PR47709 (LICM
does not hoist invariant llvm.memcpy calls).

We can side-step this issue in some cases, by letting SROA emit
loads/stores instead of memcpy if the slice is small and we can
reasonably expect vector versions of those loads and stores can be used.

The chosen threshold of 2 x widest vector register is somewhat
arbitrary, but should ensure that we can be reasonably confident that
those loads & stores will be lowered relatively efficiently.
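
As a rough, standalone sketch of the threshold described above (not the patch itself): the helper name `preferLoadStore` and its parameters are hypothetical, and the widest vector register width is passed in as a plain number rather than queried from TargetTransformInfo. The sketch also converts the register width from bits to bytes before comparing. That conversion is an assumption about the intent; the posted diff compares `SliceSize` against `2 * getRegisterBitWidth(true)` directly.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the patch: decide whether a slice that
// would otherwise be rewritten as llvm.memcpy should use loads/stores instead.
// Sizes are in bytes; WidestVectorRegBits is in bits.
constexpr bool preferLoadStore(std::uint64_t SliceSizeBytes,
                               std::uint64_t AllocaStoreSizeBytes,
                               std::uint64_t WidestVectorRegBits) {
  // Only consider slices that cover the whole (new) alloca.
  if (SliceSizeBytes != AllocaStoreSizeBytes)
    return false;
  // Threshold: twice the widest vector register, converted to bytes
  // (assumption: the comparison is meant to be done in bytes).
  const std::uint64_t ThresholdBytes = 2 * (WidestVectorRegBits / 8);
  return SliceSizeBytes <= ThresholdBytes;
}

// With 128-bit vector registers the threshold is 32 bytes.
static_assert(preferLoadStore(32, 32, 128), "32-byte slice is small enough");
static_assert(!preferLoadStore(48, 48, 128), "48-byte slice keeps memcpy");
static_assert(!preferLoadStore(16, 32, 128), "partial slice keeps memcpy");
```

Under these assumptions, a target with 128-bit vector registers would use loads/stores for whole-alloca slices of up to 32 bytes and keep llvm.memcpy for anything larger.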

The patch as is is not ideal, because it potentially results in a large
number of insert/extractvalue instructions to move the loaded/stored
values to and from the slice. We could (and maybe should) try to
directly emit the correct vector loads/stores.

At this stage I am mainly interested to see if there's a reason for not
doing so already. It might not be desirable to bake in too much
target-specific knowledge into something as general as SROA. I'll update
the tests if we settle on the final approach.

This potentially provides some nice performance benefits, e.g. on ARM64
with -O3 -flto, 450.soplex runs roughly 2.2% faster and the expected
assembly is generated for PR47705.

We should also work on improving the handling of llvm.memcpy in
different passes, but that might be tricky in some cases. For example,
it might be desirable to decompose llvm.memcpy into separate load/store
parts if this would lead to the load part being loop-invariant.
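
A source-level illustration of that idea, as a minimal sketch: the struct `Pair` and both function names are hypothetical, and this is not the IR from PR47705 or PR47709. The first function copies a loop-invariant source through `memcpy` on every iteration. The second splits the copy into its load part and its use, with the loads hoisted out of the loop by hand, which is the result one would hope an optimizer could reach once the copy is expressed as plain loads and stores.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical 16-byte aggregate used only for this illustration.
struct Pair {
  double A;
  double B;
};

// The copy is a single opaque memcpy that is repeated on every iteration,
// even though the source never changes inside the loop.
inline double sumWithMemcpy(const Pair *Src, std::size_t N) {
  double Sum = 0.0;
  for (std::size_t I = 0; I != N; ++I) {
    Pair Tmp;
    std::memcpy(&Tmp, Src, sizeof(Pair));
    Sum += Tmp.A + Tmp.B;
  }
  return Sum;
}

// Same computation with the copy decomposed: the "load part" no longer
// depends on the loop and is performed once, before the loop starts.
inline double sumWithHoistedLoads(const Pair *Src, std::size_t N) {
  const double A = Src->A;
  const double B = Src->B;
  double Sum = 0.0;
  for (std::size_t I = 0; I != N; ++I)
    Sum += A + B;
  return Sum;
}
```

Both functions compute the same value; only the number of times the source is read differs.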

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
130 ms	linux > Clang.CodeGen::thinlto-distributed-newpm.ll
480 ms	linux > Clang.CodeGenOpenCL::amdgpu-nullptr.cl
70 ms	linux > LLVM.CodeGen/AMDGPU::memcpy-fixed-align.ll
70 ms	linux > LLVM.DebugInfo/X86::sroasplit-1.ll
180 ms	linux > LLVM.DebugInfo/X86::sroasplit-4.ll

(9 tests failed in total.)

Event Timeline

fhahn created this revision. · Oct 6 2020, 6:11 AM

Herald added a project: Restricted Project. · Oct 6 2020, 6:11 AM

Herald added subscribers: dexonsmith, hiraditya, kristof.beyls.

fhahn requested review of this revision. · Oct 6 2020, 6:11 AM

Herald added a subscriber: wdng. · Oct 6 2020, 6:11 AM

bjope added a subscriber: bjope. · Oct 6 2020, 6:25 AM

Harbormaster completed remote builds in B74125: Diff 296431. · Oct 6 2020, 6:28 AM

This isn't really SROA specific is it?
Instcombine also expands small (16 bytes, hardcoded) memcpy's
I would think this should be done by some standalone pass, as a costmodel-driven optimization.

jdoerfert added a subscriber: jdoerfert. · Oct 6 2020, 7:17 AM

nikic added a subscriber: nikic. · Oct 6 2020, 7:40 AM

In D88893#2314402, @lebedev.ri wrote:

This isn't really SROA specific is it?
Instcombine also expands small (16 bytes, hardcoded) memcpy's
I would think this should be done by some standalone pass, as a costmodel-driven optimization.

Yeah, we might as well do this independently, although we might be able to catch the most important cases at the 'source' as well. Deciding which memcpys to expand needs to be target-specific, I think, by way of cost modeling, which this patch does, although very naively. Unfortunately it is going to be quite hard to estimate whether expanding memcpy will enable further optimizations.

Cost modeling is target-specific, yes, and it also depends on the alignment of the copy.

If I'm understanding correctly, there are two significant benefits here:

Splitting apart the "load" and "store" parts of the memcpy, so, for example, the load can be hoisted out of a loop.
We have generally better optimizations for load/store instructions.

Unfortunately it is going to be quite hard to estimate whether expanding memcpy will enable further optimizations.

There are cases we could detect directly relatively easily: for example, whether we can hoist the load out of a loop, or whether the load is loading a constant value. More generally it could be tricky, yes.

Honestly, it's not so much that I'm not okay with doing this in SROA;
it's that I'm uneasy about introducing a new TTI dependency into a pass.
That implicitly changes it from a target-independent pass
into a cost-model-driven transform, with the rest of the transforms
in the pass being "forgotten" when it comes to cost modelling.

Instcombine also expands small (16 bytes, hardcoded) memcpy's

Will it help to schedule another instcombine after SROA, or put SROA before an existing instcombine pass schedule?

In D88893#2351846, @hiraditya wrote:

Instcombine also expands small (16 bytes, hardcoded) memcpy's

Will it help to schedule another instcombine after SROA, or put SROA before an existing instcombine pass schedule?

I'm not sure i understand the question.
There's for sure already instcombine invocations after SROA, e.g.: https://github.com/llvm/llvm-project/blob/88241ffb5636ebc0579d3ab8eeec78446a769c54/llvm/test/Other/opt-O3-pipeline.ll#L142-L164

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. · Jan 12 2023, 5:19 PM

Herald added a subscriber: StephenFan.

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Scalar/SROA.h	4 lines
llvm/lib/Transforms/Scalar/SROA.cpp	19 lines

Diff 296431

llvm/include/llvm/Transforms/Scalar/SROA.h

@@ 24 lines not shown @@
 class AllocaInst;
 class AssumptionCache;
 class DominatorTree;
 class Function;
 class Instruction;
 class LLVMContext;
 class PHINode;
 class SelectInst;
+class TargetTransformInfo;
 class Use;

 /// A private "module" namespace for types and utilities used by SROA. These
 /// are implementation details and should not be used by clients.
 namespace sroa LLVM_LIBRARY_VISIBILITY {

 class AllocaSliceRewriter;
 class AllocaSlices;
@@ 19 lines not shown @@
 /// 3) Finally, this will try to detect a pattern of accesses which map cleanly
 /// onto insert and extract operations on a vector value, and convert them to
 /// this form. By doing so, it will enable promotion of vector aggregates to
 /// SSA vector values.
 class SROA : public PassInfoMixin<SROA> {
   LLVMContext *C = nullptr;
   DominatorTree *DT = nullptr;
   AssumptionCache *AC = nullptr;
+  TargetTransformInfo *TTI = nullptr;

   /// Worklist of alloca instructions to simplify.
   ///
   /// Each alloca in the function is added to this. Each new alloca formed gets
   /// added to it as well to recursively simplify unless that alloca can be
   /// directly promoted. Finally, each time we rewrite a use of an alloca other
   /// the one being actively rewritten, we add it back onto the list if not
   /// already present to ensure it is re-visited.
@@ 39 lines not shown @@ public:
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

 private:
   friend class sroa::AllocaSliceRewriter;
   friend class sroa::SROALegacyPass;

   /// Helper used by both the public run method and by the legacy pass.
   PreservedAnalyses runImpl(Function &F, DominatorTree &RunDT,
-                            AssumptionCache &RunAC);
+                            AssumptionCache &RunAC, TargetTransformInfo &TTI);

   bool presplitLoadsAndStores(AllocaInst &AI, sroa::AllocaSlices &AS);
   AllocaInst *rewritePartition(AllocaInst &AI, sroa::AllocaSlices &AS,
                                sroa::Partition &P);
   bool splitAlloca(AllocaInst &AI, sroa::AllocaSlices &AS);
   bool runOnAlloca(AllocaInst &AI);
   void clobberUse(Use &U);
   bool deleteDeadInstructions(SmallPtrSetImpl<AllocaInst *> &DeletedAllocas);
   bool promoteAllocas(Function &F);
 };

 } // end namespace llvm

 #endif // LLVM_TRANSFORMS_SCALAR_SROA_H

llvm/lib/Transforms/Scalar/SROA.cpp

@@ 35 lines not shown @@
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/Twine.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/GlobalsModRef.h"
 #include "llvm/Analysis/Loads.h"
 #include "llvm/Analysis/PtrUseVisitor.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Config/llvm-config.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/Constant.h"
 #include "llvm/IR/ConstantFolder.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DIBuilder.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DebugInfoMetadata.h"
@@ 2,894 lines not shown @@ bool visitMemTransferInst(MemTransferInst &II) {
     // a single value type, just emit a memcpy.
     bool EmitMemCpy =
         !VecTy && !IntTy &&
         (BeginOffset > NewAllocaBeginOffset || EndOffset < NewAllocaEndOffset ||
          SliceSize !=
              DL.getTypeStoreSize(NewAI.getAllocatedType()).getFixedSize() ||
          !NewAI.getAllocatedType()->isSingleValueType());

+    if (EmitMemCpy &&
+        SliceSize ==
+            DL.getTypeStoreSize(NewAI.getAllocatedType()).getFixedSize() &&
+        (SliceSize <= 2 * Pass.TTI->getRegisterBitWidth(true)))
+      EmitMemCpy = false;
+
     // If we're just going to emit a memcpy, the alloca hasn't changed, and the
     // size hasn't been shrunk based on analysis of the viable range, this is
     // a no-op.
     if (EmitMemCpy && &OldAI == &NewAI) {
       // Ensure the start lines up.
       assert(NewBeginOffset == BeginOffset);

       // Rewrite the size as needed.
@@ 1,749 lines not shown @@ bool SROA::promoteAllocas(Function &F) {

   LLVM_DEBUG(dbgs() << "Promoting allocas with mem2reg...\n");
   PromoteMemToReg(PromotableAllocas, *DT, AC);
   PromotableAllocas.clear();
   return true;
 }

 PreservedAnalyses SROA::runImpl(Function &F, DominatorTree &RunDT,
-                                AssumptionCache &RunAC) {
+                                AssumptionCache &RunAC,
+                                TargetTransformInfo &RunTTI) {
   LLVM_DEBUG(dbgs() << "SROA function: " << F.getName() << "\n");
   C = &F.getContext();
   DT = &RunDT;
   AC = &RunAC;
+  TTI = &RunTTI;

   BasicBlock &EntryBB = F.getEntryBlock();
   for (BasicBlock::iterator I = EntryBB.begin(), E = std::prev(EntryBB.end());
        I != E; ++I) {
     if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) {
       if (isa<ScalableVectorType>(AI->getAllocatedType())) {
         if (isAllocaPromotable(AI))
           PromotableAllocas.push_back(AI);
@@ 37 lines not shown @@ PreservedAnalyses SROA::runImpl(Function &F, DominatorTree &RunDT,
   PreservedAnalyses PA;
   PA.preserveSet<CFGAnalyses>();
   PA.preserve<GlobalsAA>();
   return PA;
 }

 PreservedAnalyses SROA::run(Function &F, FunctionAnalysisManager &AM) {
   return runImpl(F, AM.getResult<DominatorTreeAnalysis>(F),
-                 AM.getResult<AssumptionAnalysis>(F));
+                 AM.getResult<AssumptionAnalysis>(F),
+                 AM.getResult<TargetIRAnalysis>(F));
 }

 /// A legacy pass for the legacy pass manager that wraps the \c SROA pass.
 ///
 /// This is in the llvm namespace purely to allow it to be a friend of the \c
 /// SROA pass.
 class llvm::sroa::SROALegacyPass : public FunctionPass {
   /// The SROA implementation.
   SROA Impl;

 public:
   static char ID;

   SROALegacyPass() : FunctionPass(ID) {
     initializeSROALegacyPassPass(*PassRegistry::getPassRegistry());
   }

   bool runOnFunction(Function &F) override {
     if (skipFunction(F))
       return false;

     auto PA = Impl.runImpl(
         F, getAnalysis<DominatorTreeWrapperPass>().getDomTree(),
-        getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F));
+        getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F),
+        getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F));
     return !PA.areAllPreserved();
   }

   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.addRequired<AssumptionCacheTracker>();
     AU.addRequired<DominatorTreeWrapperPass>();
+    AU.addRequired<TargetTransformInfoWrapperPass>();
     AU.addPreserved<GlobalsAAWrapperPass>();
     AU.setPreservesCFG();
   }

   StringRef getPassName() const override { return "SROA"; }
 };

 char SROALegacyPass::ID = 0;

 FunctionPass *llvm::createSROAPass() { return new SROALegacyPass(); }

 INITIALIZE_PASS_BEGIN(SROALegacyPass, "sroa",
                       "Scalar Replacement Of Aggregates", false, false)
 INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
 INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
 INITIALIZE_PASS_END(SROALegacyPass, "sroa", "Scalar Replacement Of Aggregates",
                     false, false)