
Add Load Combine Pass
ClosedPublic

Authored by Bigcheese on Apr 30 2014, 7:52 PM.

Details

Summary

This combines adjacent loads if no writes to memory or atomic ops occur between them.

This was motivated by us not generating a bswap for:

static inline uint64_t LoadU64_x8( const uint8_t* pData )
{
    return (((uint64_t)pData[0]) << 56) | (((uint64_t)pData[1]) << 48) |
           (((uint64_t)pData[2]) << 40) | (((uint64_t)pData[3]) << 32) |
           (((uint64_t)pData[4]) << 24) | (((uint64_t)pData[5]) << 16) |
           (((uint64_t)pData[6]) << 8)  |  ((uint64_t)pData[7]);
}
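For illustration (this is not part of the patch), the form the combined load should take on a little-endian target is a single wide load plus one byte swap. A minimal C sketch, assuming a little-endian host and a GCC/Clang-style `__builtin_bswap64`:

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte big-endian load from the motivating example. */
static inline uint64_t LoadU64_x8(const uint8_t *pData) {
    return (((uint64_t)pData[0]) << 56) | (((uint64_t)pData[1]) << 48) |
           (((uint64_t)pData[2]) << 40) | (((uint64_t)pData[3]) << 32) |
           (((uint64_t)pData[4]) << 24) | (((uint64_t)pData[5]) << 16) |
           (((uint64_t)pData[6]) << 8)  |  ((uint64_t)pData[7]);
}

/* The combined form: one (possibly unaligned) 8-byte load plus a single
 * byte swap. memcpy expresses the wide load without alignment UB.
 * Correct on little-endian hosts only. */
static inline uint64_t LoadU64_x8_combined(const uint8_t *pData) {
    uint64_t v;
    memcpy(&v, pData, sizeof v);
    return __builtin_bswap64(v);
}
```

Both functions return the same value; the point of the pass is to let the backend emit the second form (a wide load and a bswap instruction) for the first.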

Diff Detail

Repository
rL LLVM

Event Timeline

Bigcheese updated this revision to Diff 9001.Apr 30 2014, 7:52 PM
Bigcheese retitled this revision from to [InstCombine] Combine adjacent i8 loads..
Bigcheese updated this object.
Bigcheese edited the test plan for this revision. (Show Details)
Bigcheese added reviewers: chandlerc, atrick.
Bigcheese added a subscriber: Unknown Object (MLST).May 1 2014, 3:43 PM

Adding llvm-commits to CC.

chandlerc edited edge metadata.May 1 2014, 3:52 PM

I think I would prefer to implement this as a standalone pass until we have
some clear phase ordering need for it in a utility like instcombine.

My suggestion: a pass that runs right after GVN.

I would also use the domtree rather than the basic block structure.

I think it really needs to handle arbitrary sizes, and both loads and
stores. You don't want to scan multiple times for this.

But after thinking all of this, it strikes me that the work done here is
*very* similar to the work done by the SLP vectorizer for loads and stores.
I think there is at least the possibility that this should be done by the
SLP vectorizer, or that some of the logic should be shared. I'd appreciate
Arnold's, Raul's, Nadav's, and Hal's thoughts on this idea.

I agree this combining can in theory be done as part of SLP vectorization,
targeting wider scalar registers as a limited form of vector hardware,
supporting wider loads/stores and some bitwise operations. That would
likely catch more cases than what can be done with a small standalone pass.

Raúl E Silvera | SWE | rsilvera@google.com

atrick edited edge metadata.May 1 2014, 10:08 PM

I agree with Chandler that this should only be done once, late in the pipeline (post GVN). I am also concerned that if this runs before SLP vectorizer it will interfere with it. I'd like to get Arnold's comments on this.

Load combining should probably be done in a single pass over the block. First collect all the offsets, then sort, then look for pairs. See the LoadClustering stuff in MachineScheduler. Your RangeLimit skirts around this problem, but I don't think the arbitrary threshold is necessary.
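Andy's single-pass scheme (collect offsets, sort, look for pairs) can be sketched outside LLVM. The `LoadEntry` record and function names below are illustrative, not from the actual patch:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical record for one candidate load: its byte offset from a
 * common base pointer and its size in bytes. */
typedef struct {
    int64_t Offset;
    uint64_t Size;
} LoadEntry;

static int cmpByOffset(const void *a, const void *b) {
    const LoadEntry *x = a, *y = b;
    return (x->Offset > y->Offset) - (x->Offset < y->Offset);
}

/* One pass over a block's candidate loads: sort by offset, then walk
 * once looking for entries that abut (offset + size == next offset).
 * Returns the number of adjacent pairs found. */
size_t countAdjacentPairs(LoadEntry *Loads, size_t N) {
    qsort(Loads, N, sizeof *Loads, cmpByOffset);
    size_t Pairs = 0;
    for (size_t i = 0; i + 1 < N; ++i)
        if (Loads[i].Offset + (int64_t)Loads[i].Size == Loads[i + 1].Offset)
            ++Pairs;
    return Pairs;
}
```

Sorting first makes adjacency a local property of consecutive entries, which is why no arbitrary scan-range threshold is needed.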

Doing this per basic block is ok. Although there's no reason you can't do it almost as easily on an extended basic block (side exits ok, no merges). Chandler said do it on the domtree, but handling CFG merges would be too complicated and expensive.

Did you forget to check for Invokes?

Conceptually this is doing SLP vectorization, but it doesn't fit with our SLP algorithm, which first finds a vectorizable use-def tree. Sticking it in GVN is another option, but again I'm concerned about it running before SLP. Maybe it can run in the SLP pass after the main algorithm.

In my first comments, I didn't touch on the motivation for this. Can an argument be made that this load combining is generally profitable, or are you just trying to match BSWAP?

atrick:

Load combining should probably be done in a single pass over the block. First collect all the offsets, then sort, then look for pairs. See the LoadClustering stuff in MachineScheduler. Your RangeLimit skirts around this problem, but I don't think the arbitrary threshold is necessary.

Doing this per basic block is ok. Although there's no reason you can't do it almost as easily on an extended basic block (side exits ok, no merges). Chandler said do it on the domtree, but handling CFG merges would be too complicated and expensive.

I agree this should be a basic block pass. The problem with multiple basic blocks is that you cannot introduce a load on a path where it would not otherwise be executed; that could lead to race conditions. So if all paths have the load, sure. I'm not sure how expensive this is to calculate, though. I also believe we would have already hoisted the load into the parent BB.

Did you forget to check for Invokes?

Invokes are terminating instructions. There won't be any loads in the same basic block after it.

Conceptually this is doing SLP vectorization, but it doesn't fit with our SLP algorithm, which first finds a vectorizable use-def tree. Sticking it in GVN is another option, but again I'm concerned about it running before SLP. Maybe it can run in the SLP pass after the main algorithm.

In this case wouldn't it be just as easy to make it a separate pass that runs after SLP?

Nadav:

Michael described the motivation for his patch by showing the LoadU64_x8 function and how it loads multiple scalars from memory just to combine them into a single word. If this function represents the kind of function that he wants to handle, then I think we should consider implementing this in SelectionDAG. We already merge consecutive loads and stores in SelectionDAG. Problems like this should be local to one basic block, so the fact that we are working on a single basic block should not be a problem. At the SelectionDAG level we should have enough information to make informed decisions about the profitability of merging loads/stores. I can’t imagine propagating all of the relevant information into IR land.

The SLP vectorizer is designed to create vectors, and teaching it to handle non-vector types is not a good idea. Also, generating vectors for code that should obviously stay scalar is not ideal.

Michael, was I correct about the kind of problems that you are trying to solve? Have you considered SelectionDAG? I think MergeConsecutiveStores is a good place to look at.

My specific reason for looking into this was the stated problem, but I believe there are lots of other cases that can benefit from this. The main reason I didn't want to do it in SDAG is that we don't do bswap recognition in SDAG, and I'd rather not have multiple implementations of it. Another issue is the inliner: bswap is generally a great thing to inline, but if we only do this in SDAG we may choose not to.

There are four parts to the problem of widening chains when viewed from the SLP vectorizers perspective (staying in vector types):

  • Recognizing adjacent memory operations: this is obviously similar
  • Widening operations: we would widen only the load in this example
  • Building chains: there is no real chain of widened operations here: only the load is widened. One could imagine examples where we perform operations on the loaded i8 type before we build the i64.
  • Finding starting points of chains: this would be to recognize reduction into the wider type in this example.

If we were to model the example given in the patch in the SLP vectorizer (I changed the example to i32 to have less to type), I think we would recognize that

(I64 OR (I64 << (SEXT (I32 ...) to I64), 0),
        (I64 << (SEXT (I32 ...) to I64), 32))

is a reduction into an I64 value which we can model as <2 x i32>:

(I64 CAST (<2 x i32> SHUFFLE (...)))

And then start a chain from the <2 x i32> (...) root which in this case would only be a <2 x i32> load (or in the real example a <8 x i8> load).

Do we expect that there would be longer chains that would benefit from widening and that would start from such a pattern? I.e., do we expect to be able to do some isomorphic operations in the smaller type (i8 or i32 in my example) before the reduction? If the answer is yes, I think it makes sense to think about doing this in the SLP vectorizer.

I am not sure that CodeGen deals well with such contortions, though: (I64 CAST (<8 x i8> SHUFFLE (<8 x i8> LOAD))) => (I64 (BSWAP (I64 LOAD)))? That could be fixed. What does our cost model say about such operations?

Teaching the SLP vectorizer to widen scalar types is a whole different complexity beast (I am not sure we want to model lanes of smaller types in a large type without using vectors).

I don't think the above transformation (building a value of a bigger type from a smaller type) is going to interfere with regular SLP vectorization because we start bottom-up (sink to source) and we don't have patterns that would start at an "or reduction" (bigger than 2 operations). If we implement a second transformation that starts from loads and widens operations top-down, then, I agree with Andy, we would have to be careful about phase ordering.

If, however, all we want to catch is bswap, then this feels like a DAG combine to me (with the gotcha of losing analysis information during lowering, mentioned below). But it seems to me there is potential to catch longer chains leading to the loads.

Doing this at the IR level has the benefit that our memory analysis (BasicAA) is better in the current framework. Inlining can cause us to lose information about aliasing (lost noalias parameters, we should really fix this :), Hal had a patch but I digress ...).

hfinkel added a subscriber: hfinkel.May 2 2014, 9:47 AM

I'll point out that on PPC we have byte-swapped loads, and we currently handle this in CodeGen using a target-specific DAG combine. We recognize a LOAD+BSWAP and BSWAP+STORE pair and produce a target-specific node for the desired instruction. This does not, however, handle the case where the loads are combinable. That said, we already have an optimization in DAGCombine that is supposed to combine consecutive loads, DAGCombiner::CombineConsecutiveLoads (maybe it requires AA to be active in DAGCombine to work optimally, but I normally turn that on for PPC, and I think it should be safe in general). Perhaps it just needs to be called in the right place to work for this input?

Doing this at the IR level has the benefit that our memory analysis (BasicAA) is better in the current framework. Inlining can cause us to lose information about aliasing (lost noalias parameters, we should really fix this :), Hal had a patch but I digress ...).

Yes, I'll be getting back to this quite soon.

Bigcheese updated this revision to Diff 9145.May 6 2014, 8:53 PM
Bigcheese retitled this revision from [InstCombine] Combine adjacent i8 loads. to Add Load Combine Pass.
Bigcheese updated this object.
Bigcheese edited edge metadata.

Moved out to separate pass run after SLP vectorizer. Still need to do a test-suite run.

This combines some loads that probably shouldn't be combined. I believe the proper fix here is to split (trunc (shr (load ...), <multiple of 8>), n) in SDAG for targets where the hardware will do load combining. This still exposes the optimization opportunity without regressing anything.

Bigcheese updated this object.May 7 2014, 1:41 AM

This pass will create unaligned loads. You'll need to use TTI (enhanced if necessary) to check whether the target supports unaligned loads of the required type (by calling TLI->allowsUnalignedMemoryAccesses through TTI, and making sure such access is considered "fast"). Otherwise, the code generator will end up breaking the load apart again and will likely generate worse code overall.
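Hal's suggested gate can be sketched as a simple profitability check. The function below mimics the shape of the `allowsUnalignedMemoryAccesses` query he names, but the policy hard-coded here is purely illustrative; the real pass would ask TTI/TLI:

```c
#include <stdbool.h>

/* Stand-in for the target query: does the target support an unaligned
 * load of this width, and is it fast? The answers below are made up
 * for illustration only. */
static bool allowsUnalignedMemoryAccesses(unsigned Bits, bool *Fast) {
    if (Bits > 64)
        return false;          /* pretend: no unaligned loads this wide */
    *Fast = (Bits <= 32);      /* pretend: only narrow ones are fast */
    return true;
}

/* Only combine when the resulting (possibly unaligned) load is both
 * legal and fast; otherwise the backend will just split it again. */
bool shouldCombine(unsigned CombinedBits) {
    bool Fast = false;
    return allowsUnalignedMemoryAccesses(CombinedBits, &Fast) && Fast;
}
```

The key point is the conservative default: combining is skipped unless the target explicitly reports the wide unaligned access as fast.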

lib/Transforms/IPO/PassManagerBuilder.cpp
223 ↗(On Diff #9145)

Please move this to after BBVectorize as well.

lib/Transforms/Scalar/LoadCombine.cpp
138 ↗(On Diff #9145)

I suspect that this should really be DL->getTypeStoreSize(L.Load->getType()).

(and the same for all of the places that you call getPrimitiveSizeInBits -- which, for one thing, won't work for pointer types).

143 ↗(On Diff #9145)

This seems too strict. You actually only care that it is not greater, right?

165 ↗(On Diff #9145)

If no one else has a better suggestion, I'd move this into IRBuilder.

test/Transforms/LoadCombine/load-combine.ll
46 ↗(On Diff #9145)

Please add:
CHECK-LABEL: @LoadU64_x64_0

(and similarly to the other tests)

47 ↗(On Diff #9145)

Please also check the alignment on the load.

atrick added a comment.May 7 2014, 9:07 AM

It looks like you first sort the loads, then use load[0] as the insertion point. How does this work if the loads are striding backward through memory with uses in between?

Since you are effectively hoisting loads, I think you should check mayThrow(), not just mayWriteMemory(). In the future, we will want to support read-only calls that mayThrow. (This would allow redundant load elimination across the calls--important for runtime safety checks.)

Test cases tend to be more effective with CHECK-LABEL on the name of each subtest.

Otherwise LGTM after addressing Hal's comments. I'm not sure what Hal meant by checking alignment. The new load inherits the alignment of the first aggregated load. We'll end up with an unaligned load as far as the compiler can tell, which is often not optimal. I'm not sure whether combining, then splitting in SelectionDAG will produce worse code or not without trying it. If the mechanism exists to split unaligned loads, can we force that on for x86 and see if we generate worse code in our load combining test cases?

----- Original Message -----

From: "Andrew Trick" <atrick@apple.com>
To: bigcheesegs@gmail.com, atrick@apple.com, chandlerc@gmail.com
Cc: hfinkel@anl.gov, nrotem@apple.com, aschwaighofer@apple.com, rsilvera@google.com, llvm-commits@cs.uiuc.edu
Sent: Wednesday, May 7, 2014 11:07:24 AM
Subject: Re: [PATCH] Add Load Combine Pass

It looks like you first sort the loads, then use load[0] as the
insertion point. How does this work if the loads are striding
backward through memory with uses in between?

Since you are effectively hoisting loads, I think you should check
mayThrow(), not just mayWriteMemory(). In the future, we will want
to support read-only calls that mayThrow. (This would allow
redundant load elimination across the calls--important for runtime
safety checks.)

Test cases tend to be more effective with CHECK-LABEL on the name of
each subtest.

Otherwise LGTM after addressing Hal's comments. I'm not sure what Hal
meant by checking alignment.

Two things:

  1. In the regression test, the CHECK line should check the alignment of the generated load.
  2. I had proposed that we only perform the transformation at all when TLI->allowsUnalignedMemoryAccesses returns true and sets *fast to true. When this function returns false, then SDAG will split the load (so Andy's experiment is possible). Also, I agree with Andy that if benchmark data shows the code quality from combining+SDAG-splitting to be better than that from doing nothing, then by all means do the former. Without benchmark data (from multiple platforms, preferably), showing that to be the case, I'd prefer the conservative approach I detailed.

    -Hal

The new load inherits the alignment of
the first aggregated load. We'll end up with an unaligned load as
far as the compiler can tell, which is often not optimal. I'm not
sure whether combining, then splitting in SelectionDAG will produce
worse code or not without trying it. If the mechanism exists to
split unaligned loads, can we force that on for x86 and see if we
generate worse code in our load combining test cases?

http://reviews.llvm.org/D3580

atrick:

It looks like you first sort the loads, then use load[0] as the insertion point. How does this work if the loads are striding backward through memory with uses in between?

Oops. That is a problem. I did a backward test, but it didn't have a use of the value in-between the loads. I suppose I could also store the original insertion order...

Since you are effectively hoisting loads, I think you should check mayThrow(), not just mayWriteMemory(). In the future, we will want to support read-only calls that mayThrow. (This would allow redundant load elimination across the calls--important for runtime safety checks.)

I thought only terminating instructions can throw?

As for targets that have to split unaligned loads: I tested the original case on ARM, and LLVM actually generates slightly better code when SDAG splits the combined load than if we just leave it alone. I'll check other targets, but I believe that we do a pretty good job splicing. The only bad cases I ran into were where the combined load isn't actually used combined; it's just directly resplit via shr/trunc. Here I believe we will need to teach SDAG to split these loads back up, but that should be easy (and should be done before this lands).
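The "directly resplit via shr/trunc" pattern Michael describes looks like this in C: after combining, each original byte load becomes a shift-and-truncate of one wide load, so the wide load bought nothing. A hedged sketch, assuming a little-endian host:

```c
#include <stdint.h>
#include <string.h>

/* After combining, each original i8 load is rematerialized as
 * shr + trunc of a single wide load. If these are the only uses,
 * SDAG should split the wide load back into narrow loads. */
uint8_t byteAt(const uint8_t *p, unsigned i) {
    uint64_t wide;
    memcpy(&wide, p, sizeof wide);      /* the combined 8-byte load */
    return (uint8_t)(wide >> (8 * i));  /* shr + trunc (little-endian) */
}
```

Each call returns the same byte a direct `p[i]` load would, which is exactly why the combined form is a pessimization when no wider use exists.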

----- Original Message -----

From: "Michael Spencer" <bigcheesegs@gmail.com>
To: bigcheesegs@gmail.com, atrick@apple.com, chandlerc@gmail.com
Cc: hfinkel@anl.gov, nrotem@apple.com, aschwaighofer@apple.com, rsilvera@google.com, llvm-commits@cs.uiuc.edu
Sent: Wednesday, May 7, 2014 2:19:00 PM
Subject: Re: [PATCH] Add Load Combine Pass

atrick:

It looks like you first sort the loads, then use load[0] as the
insertion point. How does this work if the loads are striding
backward through memory with uses in between?

Oops. That is a problem. I did a backward test, but it didn't have a
use of the value in-between the loads. I suppose I could also store
the original insertion order...

Since you are effectively hoisting loads, I think you should check
mayThrow(), not just mayWriteMemory(). In the future, we will want
to support read-only calls that mayThrow. (This would allow
redundant load elimination across the calls--important for runtime
safety checks.)

I thought only terminating instructions can throw?

*sigh* -- no.

As for targets that have to split unaligned loads. I tested the
original case on ARM and llvm actually generates slightly better
code when SDAG splits the combined load than if we just leave it
alone. I'll check other targets, but I believe that we do a pretty
good job splicing. The only bad cases I ran into were where the
combined load isn't actually used combined, it's just directly
resplit via shr/trunc. Here I believe we will need to teach SDAG to
split these loads back up, but that should be easy (but should be
done before this lands).

Okay.

-Hal

http://reviews.llvm.org/D3580

Bigcheese updated this revision to Diff 9724 (Edited). May 22 2014, 5:13 PM
Bigcheese updated this object.

I ran test-suite with this pass and got some interesting results. Most tests have no changes, but 4 tests showed significant change.

+ is faster, - is slower.

nts.MultiSource/Benchmarks/MiBench/security-sha/security-sha: 44%
nts.MultiSource/Benchmarks/Prolangs-C/agrep/agrep: 12%
nts.MultiSource/Benchmarks/SciMark2-C/scimark2: -21%
nts.MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4: -39% (this drops to -20% if we fix the load splicing cost model in DAGCombine)

Full data (collected across 10 runs each): https://docs.google.com/spreadsheets/d/13_4ZUBQQYXhexrMzVJ5VICNuLzUhovU8lOl0Rr6wG10/edit?usp=sharing

I'd like to commit this disabled by default while the regressions are fixed in tree.

atrick accepted this revision.May 23 2014, 6:37 PM
atrick edited edge metadata.

LGTM

This revision is now accepted and ready to land.May 23 2014, 6:37 PM
Bigcheese closed this revision.May 28 2014, 7:02 PM
Bigcheese updated this revision to Diff 9901.

Closed by commit rL209791 (authored by mspencer).