This is an archive of the discontinued LLVM Phabricator instance.

[memops] Add a new pass to inject fast-path code for specific library function calls.
Needs Review · Public

Authored by chandlerc on Jul 30 2017, 6:15 AM.

Details

Summary

The initial motivation is providing fast, inline paths for memset and
memcpy with a dynamic size when that size happens to be small. Because
LLVM is *very* good at forming memset and memcpy out of raw loops and
many other constructs, it is especially important that these remain fast
even when used in circumstances where the library function call overhead
is unacceptably large.

The first attempt at addressing this was D35750, but that proved only to
exacerbate the issue rather than fix it.

It turns out, at least for x86, we can emit a very minimal loop behind
a dynamic test on the size and dramatically improve the performance of
sizes that happen to be small.
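
Concretely, the injected structure looks roughly like the sketch below. This is
illustrative only: the size threshold (16 here) and the byte-at-a-time stores are
placeholders, with the real values chosen from TTI, the discovered size scaling,
and the alignment.

    define void @memset_fast_path_sketch(i8* %ptr, i64 %size) {
    entry:
      ; Zero-length operations bypass everything (unless an existing check
      ; already proves the size non-zero).
      %is_zero = icmp eq i64 %size, 0
      br i1 %is_zero, label %exit, label %size_check

    size_check:
      ; Dynamic test: small sizes take the inline loop, large sizes call the library.
      %is_small = icmp ult i64 %size, 16
      br i1 %is_small, label %fast_loop, label %libcall

    fast_loop:
      %i = phi i64 [ 0, %size_check ], [ %i.next, %fast_loop ]
      %addr = getelementptr inbounds i8, i8* %ptr, i64 %i
      store i8 0, i8* %addr
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %size
      br i1 %done, label %exit, label %fast_loop

    libcall:
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 0, i64 %size, i32 1, i1 false)
      br label %exit

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)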

To make all of this work *well* requires a lot of careful logic:

  • We need to analyze and discover any scaling of the size fed to memset and memcpy (see the scaled-copy sketch after this list).
  • We can't widen past the known alignment.
  • We need to emit any loop with *exactly* the right IR to get efficient lowering from the backend.
  • It needs to run quite late to not be perturbed by other passes that try to "optimize" the loop.
  • We need to avoid this in optsize and minsize functions.
  • We need to generate checks for zero-length operations before the loop. This ends up being an even faster path.
  • But we need to avoid generating *redundant* checks, which means adding a mini predicate analysis just to find existing zero checks. These turn out to be incredibly common, because so many of these routines are formed from loops out of which we have already extracted exactly such a predicate.
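
For example, on a memcpy whose size is a known multiple of 8 and whose operands
are 8-byte aligned, the discovered scaling lets the fast path copy i64 elements
rather than single bytes, while the alignment caps how far we widen. Again, a
rough illustrative sketch rather than the exact IR the pass emits:

    define void @copy_scaled(i8* %dst, i8* %src, i64 %count) {
    entry:
      ; Byte size is %count * 8; both pointers are 8-byte aligned, so the fast
      ; path can copy i64 elements, but must not widen beyond that alignment.
      %byte_size = shl nuw i64 %count, 3
      %is_zero = icmp eq i64 %count, 0
      br i1 %is_zero, label %exit, label %size_check

    size_check:
      ; Small element counts take the inline loop; everything else calls the library.
      %is_small = icmp ult i64 %count, 4
      br i1 %is_small, label %loop_preheader, label %libcall

    loop_preheader:
      %dst64 = bitcast i8* %dst to i64*
      %src64 = bitcast i8* %src to i64*
      br label %loop

    loop:
      %i = phi i64 [ 0, %loop_preheader ], [ %i.next, %loop ]
      %s.addr = getelementptr inbounds i64, i64* %src64, i64 %i
      %v = load i64, i64* %s.addr, align 8
      %d.addr = getelementptr inbounds i64, i64* %dst64, i64 %i
      store i64 %v, i64* %d.addr, align 8
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %count
      br i1 %done, label %exit, label %loop

    libcall:
      call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 %byte_size, i32 8, i1 false)
      br label %exit

    exit:
      ret void
    }

    declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1)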

There is still more we should do here such as:

  1. Don't emit these for cold libcalls.
  2. Use value profile data (if available) to bias at least the branch weights and potentially the actual sizes.

However, for at least a few benchmarks here that end up hitting this very hard,
I'm seeing between 20% and 50% improvements already. Naturally, I'll be
gathering more data both on performance impact and code size impact, but
I wanted to go ahead and get this out for review.

Event Timeline

chandlerc created this revision. Jul 30 2017, 6:15 AM

What is the advantage of making this transformation a separate pass instead of doing it during selection?

Some targets do not have a support library and always expand mem intrinsics; the code that performs this expansion is in LowerMemIntrinsics.cpp. Does it make sense to combine these implementations?

lib/Transforms/Scalar/FastPathLibCalls.cpp
443

Probably this function should be called only when VerifyDomInfo is true to reduce compile time?

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

chandlerc marked an inline comment as done. Jul 30 2017, 12:33 PM

What is the advantage of making this transformation a separate pass instead of doing it during selection?

We need to form a fairly complex control flow structure, including loops and indexing into arrays within the loop. We also need to do substantial cross-basic-block analysis. All of these are between hard and impossible within the DAG. We could do it at MI but that seems to provide few if any benefits and a lot of added complexity. A bunch of our "expand a loop here" or "inject a loop here" logic has been moved to IR to make it easier to cope with (atomics, etc).

Some targets do not have a support library and always expand mem intrinsics; the code that performs this expansion is in LowerMemIntrinsics.cpp. Does it make sense to combine these implementations?

Possibly, but I'm not sure that it does. That lowering tries to produce a genuinely high-performance version, whereas this is trying to provide a *small* version with good performance for short lengths. I would expect this logic to only make sense when we're going to call a dedicated routine to handle most of the cases for code-size reasons.

Anyways, if we discover some useful components or pieces to share, we of course should, but I feel like the intent of the two passes is fairly distinct and useful to keep separate even if they share infrastructure.

lib/Transforms/Scalar/FastPathLibCalls.cpp
443

Sorry, this was just a debugging line. I added a simple pass-based verification to the tests, I'll remove this entirely.

chandlerc marked an inline comment as done. Jul 30 2017, 12:39 PM

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

As for where in the pipeline to put it, I'm open to suggestions, but putting it here has some advantages.

This code is forming a loop with array accesses within it. There is a lot of code (from LSR to CGP) that tries to help massage these patterns into the optimal form for the target. I didn't really want to have target-specific IR generation, and so having this pass run before LSR and CGP seems useful.

Similarly, LoopSink may also want to sink computations into this code if we start putting branch weights from profiling into it and some of these regions end up marked cold. So putting this before LoopSink seemed to make sense.

It is actively harmful for this code to be before any of the vectorization or unrolling passes though: we'll try to vectorize and unroll this loop when the whole point was to keep it small! ;] We could use loop metadata to prevent this, but scheduling it afterward seems easier (and honestly, those passes should use the trip count upper bound predicate and avoid the transformations, but that is an issue for another day).
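
For reference, the metadata alternative would look roughly like the sketch below
(not what this patch does): standard unroll/vectorize hints attached to the
fast-path loop's backedge. Scheduling the pass after those transforms avoids
needing any of this.

    define void @guarded_set(i8* %ptr, i64 %size) {
    entry:
      ; Only sizes in [1, 16) take the inline loop in this sketch.
      %nonzero = icmp ne i64 %size, 0
      %small = icmp ult i64 %size, 16
      %take_loop = and i1 %nonzero, %small
      br i1 %take_loop, label %loop, label %exit

    loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
      %addr = getelementptr inbounds i8, i8* %ptr, i64 %i
      store i8 0, i8* %addr
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %size
      ; The loop ID metadata carries the unroll/vectorize hints.
      br i1 %done, label %exit, label %loop, !llvm.loop !0

    exit:
      ret void
    }

    !0 = distinct !{!0, !1, !2}
    !1 = !{!"llvm.loop.unroll.disable"}
    !2 = !{!"llvm.loop.vectorize.enable", i1 false}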

Last but not least, this pass is relatively sensitive to alignment, and so putting it after we re-compute alignment from assumption information seemed like a good idea.

This narrows the position down to exactly one spot: between the alignment synthesis and LoopSink.

Still, while the above hopefully explains my thought process, it doesn't mean this is the *right* place. I'm very open to suggestions for other positions in the pipeline, what issues they would address, or just why they would be more natural. Certainly, the closer to the target the better, as this is clearly a very target-specific transformation.

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

This is my opinion as well (as I think I expressed on IRC). CGP has accumulated a lot of different pieces of functionality because it makes sense for them to iterate (similar to why InstCombine has gotten that way). I see no reason for this to be part of that iterative scheme, and so it can be a separate pass (and, thus, it should be).

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

This is my opinion as well (as I think I expressed on IRC). CGP has accumulated a lot of different pieces of functionality because it makes sense for them to iterate (similar to why InstCombine has gotten that way). I see no reason for this to be part of that iterative scheme, and so it can be a separate pass (and, thus, it should be).

Sure, I agree with having a separate pass. My point was rather about the fact that it is added to the optimization pipeline instead of left up to the target "IR-lowering" passes.

@chandlerc's explanations make sense to me, but seeing it as part of buildModuleOptimizationPipeline is still a bit strange. I wonder if this shouldn't be split further to clearly identify the point where we start to do some "lowering" (if we agree to have such a conceptual "stage" in the pipeline). A bit like how I extracted the "function simplification" part of the pipeline (what runs on a CGSCC, alternating with the inliner).

Just want to point out that we're getting a bit far afield of this patch. Would love any comments on the actual technique... So far, I don't have any performance regressions, and I'm seeing large wins on benchmarks that happen to be sensitive to short memcpys and memsets formed out of loops. Still working on size...

Sure, I agree with having a separate pass. My point was rather about the fact that it is added to the optimization pipeline instead of left up to the target "IR-lowering" passes.

@chandlerc's explanations make sense to me, but seeing it as part of buildModuleOptimizationPipeline is still a bit strange. I wonder if this shouldn't be split further to clearly identify the point where we start to do some "lowering" (if we agree to have such a conceptual "stage" in the pipeline). A bit like how I extracted the "function simplification" part of the pipeline (what runs on a CGSCC, alternating with the inliner).

I don't think this is any more or less "lowering" than the vectorizers or partial unrolling. Both are completely dependent on the target for "what kind of code should I produce?".

I actually view the entire "optimization" phase as somewhat lowering -- we're destroying information and specializing for execution performance on a *particular* target. Many steps here lose significant analysis information in exchange for this.

Still, all of this is somewhat of a larger more meta discussion...

Just want to point out that we're getting a bit far afield of this patch. Would love any comments on the actual technique...

That's because we like it :)

I actually view the entire "optimization" phase as somewhat lowering -- we're destroying information and specializing for execution performance on a *particular* target. Many steps here lose significant analysis information in exchange for this.

I actually agree with that, and looking at the code for the new pass manager I see that your buildModuleOptimizationPipeline doesn't match the legacy populateModulePassManager but instead implements only the end of the pipeline separately (I hadn't noticed before; I know the legacy one better...).

Still, all of this is somewhat of a larger more meta discussion...

Yeah, but it seems these discussions only happen when they're triggered by a patch like this.

davidxl added inline comments. Jul 31 2017, 11:14 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Is it better to rely on instcombine and CFG simplification, which can be more general, to get rid of the redundant zero guard?

147

Framework can mean something quite different. How about just calling it 'Info'?

chandlerc added inline comments. Jul 31 2017, 11:17 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Sadly, they're *less* powerful than this. You'd need PRE or something similar to get it, powered by PredicateInfo and GVN. Maybe JumpThreading and LVI could get it?

All of these seem really heavyweight to run this late in the pipeline. =/

147

It's more than the information. It represents the actual fastpath CFG framework (for lack of a better word) that has been injected into the function and needs to be populated with the particular memop's logic...

That said, I could totally add a comment. =] Would that help clarify enough?

davidxl added inline comments. Jul 31 2017, 11:46 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Cases like the following can be handled by -instcombine + -simplifycfg. Wrapping the second redundant test into another check also works fine.

What are the interesting cases that cannot be handled?

    define void @set1_nonzero1(i8* %ptr, i64 %size) {
    ; CHECK-LABEL: define void @set1_nonzero1(
    entry:
      %zero_cond = icmp eq i64 %size, 0
      br i1 %zero_cond, label %exit, label %test

    test:
      %nonzero_cond = icmp ne i64 %size, 0
      br i1 %nonzero_cond, label %call, label %exit

    call:
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 15, i64 %size, i32 1, i1 false)
      ret void

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)

147

Sure, some comments will do.

chandlerc added inline comments. Aug 1 2017, 12:11 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

I'm honestly surprised instcombine+simplify-cfg would get this. It seems outside of their purview to do this kind of predicate analysis... I couldn't find where it does this in a cursory glance through the code.

Anyways, we don't run instcombine after this pass, but only inst simplify. Not sure we want to add that expensive of a pass when this code can just handle the specific cases it wants.

davidxl added inline comments. Aug 1 2017, 12:27 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

instcombine just converts the second predicate into true/false.

However, it seems simplifycfg alone can clean it up. Could this pass potentially be moved earlier to share some of the common cleanups?

craig.topper edited edge metadata. Aug 1 2017, 12:29 AM

I suspect SimplifyCFG got it in SimplifyEqualityComparisonWithOnlyPredecessor.

I suspect SimplifyCFG got it in SimplifyEqualityComparisonWithOnlyPredecessor.

Yeah, this doesn't surprise me as much, but from the comment:

This does a very limited form of jump threading.

Among the cases it doesn't handle is one where, for example, the predicate has been hoisted above an outer loop. I thought I had a test case that showed this, but it was a bit fragile so I removed it. Consider the memset being inside a loop and the test for zero getting unswitched out of that outer loop: I don't think SimplifyCFG will be able to get this, and I'll be very surprised if InstCombine does. I'll try to add a test case for this.
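
Roughly the shape I have in mind is sketched below (hand-written here, not an
actual test from the patch). The zero check dominates the memset, but the
memset's block is reached from a preheader and its own backedge rather than
directly from the block doing the comparison, so the only-predecessor form of
threading shouldn't be able to see it.

    define void @set_in_outer_loop(i8* %ptr, i64 %size, i64 %n) {
    entry:
      ; The zero test has been unswitched out of the outer loop.
      %zero_cond = icmp eq i64 %size, 0
      br i1 %zero_cond, label %exit, label %preheader

    preheader:
      br label %outer

    outer:
      ; %size is known non-zero here, but only via a check outside the loop.
      %i = phi i64 [ 0, %preheader ], [ %i.next, %outer ]
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 15, i64 %size, i32 1, i1 false)
      %i.next = add nuw i64 %i, 1
      %outer_done = icmp uge i64 %i.next, %n
      br i1 %outer_done, label %exit, label %outer

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)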

Still not sure how / if instcombine really handles this.

chandlerc added inline comments. Aug 1 2017, 12:50 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

I'm not sure how it converts the second predicate into a constant in the hard cases though.

Anyways, see my response to Craig regarding simplify-cfg -- it's doing a *much* more limited form of this. I can add a test case that shows this at least.

As for moving this pass to share cleanups, see my comments to Mehdi about the challenges of changing the position of the pass. I don't see a great way to get it in front of instcombine, but maybe there is one.

Instead of generating loop IR for the fast path, how about creating a versioned memcpy/memset with the constrained parameters guarded under the condition test? That way, in the back-end the exact preferred optimal code can be generated, allowing for unrolled loop bodies specific to individual targets.

chandlerc marked an inline comment as done. Aug 2 2017, 3:52 AM

I have size data now.

Across the test suite + SPEC, the total size increase with this patch is under 1%. Looking at the benchmarks which exhibit the largest size growth, most grow by a few hundred bytes or less; they just happen to be *tiny* benchmarks.

The most interesting cases of growth I see are:

  • 473.astar - 5% growth, but this is still under 2k growth; in absolute terms this benchmark is quite small.
  • mafft/pairlocalalign - 3.3% (+14k)
  • 447.dealII - 2.1% (+50k)

Everything else is small either in percent, absolute size, or both.

Across our internal benchmarks, I see no regressions with this patch, but I see some benchmarks with 30% and 40% improvements (no, those numbers aren't mistakes). The pattern I am seeing is that when this matters, it *MATTERS*. But most of the time, the libcall is fast enough. This still seems very worthwhile to me, as the code patterns that end up impacted by this seem eminently reasonable.

So generally, I think this is a pretty clear net win, it is fairly isolated, and the code size cost seems very low. Any concerns with moving forward here?

Instead of generating loop IR for the fast path, how about creating a versioned memcpy/memset with the constrained parameters guarded under the condition test? That way, in the back-end the exact preferred optimal code can be generated, allowing for unrolled loop bodies specific to individual targets.

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop. If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

At least for x86 on Linux, I have no cases where something more complex than this trivial loop is a win compared to calling the library function.

davidxl edited edge metadata. Aug 3 2017, 9:03 PM

Does it penalize cases on the borderline? For instance, a 7-byte memcpy requires a byte-copy loop of 7 iterations; if that is above the threshold, the runtime check will purely add runtime overhead. This is unlike more general partial inlining of stringop operations, where we may have internal APIs to bypass similar checks inside the library function.

include/llvm/Analysis/TargetTransformInfo.h
799

nit: I find OpByteSize not intuitive. Perhaps DataByteWidth?

lib/Target/X86/X86TargetTransformInfo.cpp
2235

Reference? The overhead here is vague. Does it include the PLT or not? Or does it mean the setup cost of rep mov/sto?

lib/Transforms/Scalar/FastPathLibCalls.cpp
102

Why the magic 10? In reality, I would think 2 or 3 is enough. Also, add an internal option for this?

111

Why limit this to comparisons with zero? More generally, why not collect more general predicate info so that the second size check can also be eliminated? Or skip the fast path if the size is known to be large?

123

< 0 means a huge unsigned length. In this case, should you skip the fast path completely?

179

Is this pattern common?

191

How about on targets where unaligned access is OK?

207

The function is pretty large. Perhaps split the analysis and transformation?

314

Can you make this a callback invoked by buildFastPathMemOpFramework?

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop.

The difference is that at the target codegen level we can't as easily do the predicate analysis as we can at the IR level.

If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

But in the patch description you say that one of the challenges is constructing *just* the right IR to get efficient codegen from the backend. I understand this is for x86 right now, but if you don't have plans to allow other targets to work well with it, why not put it into the Target/X86 directory and make it a backend-specific IR pass to avoid confusion?

To clear up one last thing for me, are you saying that there are no performance impacts across SPEC at all? Even astar and dealII? Any impacts across the test-suite benchmarks?

lib/Transforms/Scalar/FastPathLibCalls.cpp
101

AFAICT this depth doesn't seem to be modified anywhere.

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop.

The difference is that at the target codegen level we can't as easily do the predicate analysis as we can at the IR level.

If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

But in the patch description you say that one of the challenges is constructing *just* the right IR to get efficient codegen from the backend. I understand this is for x86 right now, but if you don't have plans to allow other targets to work well with it, why not put it into the Target/X86 directory and make it a backend-specific IR pass to avoid confusion?

What this pass is doing is very generic, as is the IR produced, and the IR seems likely to me to make sense on many targets. No target is obligated to use this, but I'd like this to remain target independent. Target independent passes don't belong in the targets, even if they've only been tuned on one target so far.

To clear up one last thing for me, are you saying that there are no performance impacts across SPEC at all? Even astar and dealII? Any impacts across the test-suite benchmarks?

lib/Transforms/Scalar/FastPathLibCalls.cpp
92

Depth = 0 means no limit, right? Is there a reason this can't go quadratic? If so, we should comment here. Otherwise, maybe we need a depth limit.