This is an archive of the discontinued LLVM Phabricator instance.

[memops] Add a new pass to inject fast-path code for specific library function calls.
Needs Review · Public

Authored by chandlerc on Jul 30 2017, 6:15 AM.

Details

Summary

The initial motivation is providing fast, inline paths for memset and
memcpy with a dynamic size when that size happens to be small. Because
LLVM is *very* good at forming memset and memcpy out of raw loops and
many other constructs, it is especially important that these remain fast
even when used in circumstances where the library function call overhead
is unacceptably large.

The first attempt at addressing this was D35750, but that proved only to
exacerbate the issue rather than fix it.

It turns out, at least for x86, we can emit a very minimal loop behind
a dynamic test on the size and dramatically improve the performance of
sizes that happen to be small.
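
Concretely, the injected structure looks roughly like the sketch below. This is
illustrative only: the size threshold (16 here) and the byte-at-a-time stores are
placeholders, with the real values chosen from TTI, the discovered size scaling,
and the alignment.

    define void @memset_fast_path_sketch(i8* %ptr, i64 %size) {
    entry:
      ; Zero-length operations bypass everything (unless an existing check
      ; already proves the size non-zero).
      %is_zero = icmp eq i64 %size, 0
      br i1 %is_zero, label %exit, label %size_check

    size_check:
      ; Dynamic test: small sizes take the inline loop, large sizes call the library.
      %is_small = icmp ult i64 %size, 16
      br i1 %is_small, label %fast_loop, label %libcall

    fast_loop:
      %i = phi i64 [ 0, %size_check ], [ %i.next, %fast_loop ]
      %addr = getelementptr inbounds i8, i8* %ptr, i64 %i
      store i8 0, i8* %addr
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %size
      br i1 %done, label %exit, label %fast_loop

    libcall:
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 0, i64 %size, i32 1, i1 false)
      br label %exit

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)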

To make all of this work *well* requires a lot of careful logic:

  • We need to analyze and discover any scaling of the size fed to memset and memcpy (see the scaled-copy sketch after this list).
  • We can't widen past the known alignment.
  • We need to emit any loop with *exactly* the right IR to get efficient lowering from the backend.
  • It needs to run quite late to not be perturbed by other passes that try to "optimize" the loop.
  • We need to avoid this in optsize and minsize functions.
  • We need to generate checks for zero-length operations before the loop. This ends up being an even faster path.
  • But we need to avoid generating *redundant* checks, which means adding a mini predicate analysis just to find existing zero checks. These turn out to be incredibly common, because so many of these routines are formed from loops out of which we have already extracted exactly such a predicate.
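
For example, on a memcpy whose size is a known multiple of 8 and whose operands
are 8-byte aligned, the discovered scaling lets the fast path copy i64 elements
rather than single bytes, while the alignment caps how far we widen. Again, a
rough illustrative sketch rather than the exact IR the pass emits:

    define void @copy_scaled(i8* %dst, i8* %src, i64 %count) {
    entry:
      ; Byte size is %count * 8; both pointers are 8-byte aligned, so the fast
      ; path can copy i64 elements, but must not widen beyond that alignment.
      %byte_size = shl nuw i64 %count, 3
      %is_zero = icmp eq i64 %count, 0
      br i1 %is_zero, label %exit, label %size_check

    size_check:
      ; Small element counts take the inline loop; everything else calls the library.
      %is_small = icmp ult i64 %count, 4
      br i1 %is_small, label %loop_preheader, label %libcall

    loop_preheader:
      %dst64 = bitcast i8* %dst to i64*
      %src64 = bitcast i8* %src to i64*
      br label %loop

    loop:
      %i = phi i64 [ 0, %loop_preheader ], [ %i.next, %loop ]
      %s.addr = getelementptr inbounds i64, i64* %src64, i64 %i
      %v = load i64, i64* %s.addr, align 8
      %d.addr = getelementptr inbounds i64, i64* %dst64, i64 %i
      store i64 %v, i64* %d.addr, align 8
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %count
      br i1 %done, label %exit, label %loop

    libcall:
      call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 %byte_size, i32 8, i1 false)
      br label %exit

    exit:
      ret void
    }

    declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1)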

There is still more we should do here such as:

  1. Don't emit these for cold libcalls.
  2. Use value profile data (if available) to bias at least the branch weights and potentially the actual sizes.

However, for at least a few benchmarks here that end up hitting this very hard,
I'm seeing between 20% and 50% improvements already. Naturally, I'll be
gathering more data both on performance impact and code size impact, but
I wanted to go ahead and get this out for review.

Event Timeline

chandlerc created this revision. Jul 30 2017, 6:15 AM

What is the advantage of making this transformation a separate pass instead of doing it during selection?

Some targets do not have a support library and always expand mem intrinsics; the code that performs this expansion is in LowerMemIntrinsics.cpp. Does it make sense to combine these implementations?

lib/Transforms/Scalar/FastPathLibCalls.cpp
443

Probably this function should be called only when VerifyDomInfo is true to reduce compile time?

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

chandlerc marked an inline comment as done. Jul 30 2017, 12:33 PM

What is the advantage of making this transformation a separate pass instead of doing it during selection?

We need to form a fairly complex control flow structure, including loops and indexing into arrays within the loop. We also need to do substantial cross-basic-block analysis. All of these are between hard and impossible within the DAG. We could do it at MI but that seems to provide few if any benefits and a lot of added complexity. A bunch of our "expand a loop here" or "inject a loop here" logic has been moved to IR to make it easier to cope with (atomics, etc).

Some targets do not have a support library and always expand mem intrinsics; the code that performs this expansion is in LowerMemIntrinsics.cpp. Does it make sense to combine these implementations?

Possibly, but I'm not sure that it does. That lowering tries to produce a genuinely high-performance version, whereas this is trying to provide a *small* version with good performance for short lengths. I would expect this logic to only make sense when we're going to call a dedicated routine to handle most of the cases for code-size reasons.

Anyways, if we discover some useful components or pieces to share, we of course should, but I feel like the intent of the two passes is fairly distinct and useful to keep separate even if they share infrastructure.

lib/Transforms/Scalar/FastPathLibCalls.cpp
443

Sorry, this was just a debugging line. I added a simple pass-based verification to the tests, I'll remove this entirely.

chandlerc marked an inline comment as done. Jul 30 2017, 12:39 PM

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

As for where in the pipeline to put it, I'm open to suggestions, but putting it here has some advantages.

This code is forming a loop with array accesses within it. There is a lot of code (from LSR to CGP) that tries to help massage these patterns into the optimal form for the target. I didn't really want to have target-specific IR generation, and so having this pass run before LSR and CGP seems useful.

Similarly, LoopSink may also want to sink computations into this code if we start putting branch weights from profiling into it and some of these regions end up marked cold. So putting this before LoopSink seemed to make sense.

It is actively harmful for this code to be before any of the vectorization or unrolling passes though: we'll try to vectorize and unroll this loop when the whole point was to keep it small! ;] We could use loop metadata to prevent this, but scheduling it afterward seems easier (and honestly, those passes should use the trip count upper bound predicate and avoid the transformations, but that is an issue for another day).
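
For reference, the metadata alternative would look roughly like the sketch below
(not what this patch does): standard unroll/vectorize hints attached to the
fast-path loop's backedge. Scheduling the pass after those transforms avoids
needing any of this.

    define void @guarded_set(i8* %ptr, i64 %size) {
    entry:
      ; Only sizes in [1, 16) take the inline loop in this sketch.
      %nonzero = icmp ne i64 %size, 0
      %small = icmp ult i64 %size, 16
      %take_loop = and i1 %nonzero, %small
      br i1 %take_loop, label %loop, label %exit

    loop:
      %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
      %addr = getelementptr inbounds i8, i8* %ptr, i64 %i
      store i8 0, i8* %addr
      %i.next = add nuw i64 %i, 1
      %done = icmp eq i64 %i.next, %size
      ; The loop ID metadata carries the unroll/vectorize hints.
      br i1 %done, label %exit, label %loop, !llvm.loop !0

    exit:
      ret void
    }

    !0 = distinct !{!0, !1, !2}
    !1 = !{!"llvm.loop.unroll.disable"}
    !2 = !{!"llvm.loop.vectorize.enable", i1 false}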

Last but not least, this pass is relatively sensitive to alignment, and so putting it after we re-compute alignment from assumption information seemed like a good idea.

This narrows the position down to exactly one spot: between the alignment synthesis and LoopSink.

Still, while the above hopefully explains my thought process, it doesn't mean this is the *right* place. I'm very open to suggestions for other positions in the pipeline, what issues they would address, or just why they would be more natural. Certainly, the closer to the target the better, as this is clearly a very target-specific transformation.

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

This is my opinion as well (as I think I expressed on IRC). CGP has accumulated a lot of different pieces of functionality because it makes sense for them to iterate (similar to why InstCombine has gotten that way). I see no reason for this to be part of that iterative scheme, and so it can be a separate pass (and, thus, it should be).

It seems to me to conceptually belong to the backend. Why isn't this part of CodeGenPrepare (or injected by the target as part of its pre-ISel IR passes)?

It definitely is similar to CGP.

The reason I didn't put it there, after talking to folks (mostly Hal, I think), was that I generally operate under the principle of "if it doesn't need to be in CGP, it should be separate" for maintenance, testing, etc. The usual case that necessitates a transform being in CGP is needing to participate in its iterative process, but that isn't true here. A common practical reason is that the logic is too small or isolated to really make sense as its own pass, but that doesn't seem to be true here either.

This is my opinion as well (as I think I expressed on IRC). CGP has accumulated a lot of different pieces of functionality because it makes sense for them to iterate (similar to why InstCombine has gotten that way). I see no reason for this to be part of that iterative scheme, and so it can be a separate pass (and, thus, it should be).

Sure, I agree with having a separate pass. My point was rather about the fact that it is added to the optimization pipeline instead of left up to the target "IR-lowering" passes.

@chandlerc's explanations make sense to me, but seeing it as part of buildModuleOptimizationPipeline is still a bit strange. I wonder if this shouldn't be split further to clearly identify the point where we start to do some "lowering" (if we agree to have such a conceptual "stage" in the pipeline). A bit like how I extracted the "function simplification" part of the pipeline (what runs on a CGSCC, alternating with the inliner).

Just want to point out that we're getting a bit far afield of this patch. Would love any comments on the actual technique... So far, I don't have any performance regressions, and I'm seeing large wins on benchmarks that happen to be sensitive to short memcpys and memsets formed out of loops. Still working on size...

Sure, I agree with having a separate pass. My point was rather about the fact that it is added to the optimization pipeline instead of left up to the target "IR-lowering" passes.

@chandlerc's explanations make sense to me, but seeing it as part of buildModuleOptimizationPipeline is still a bit strange. I wonder if this shouldn't be split further to clearly identify the point where we start to do some "lowering" (if we agree to have such a conceptual "stage" in the pipeline). A bit like how I extracted the "function simplification" part of the pipeline (what runs on a CGSCC, alternating with the inliner).

I don't think this is any more or less "lowering" than the vectorizers or partial unrolling. Both are completely dependent on the target for "what kind of code should I produce?".

I actually view the entire "optimization" phase as somewhat lowering -- we're destroying information and specializing for execution performance on a *particular* target. Many steps here lose significant analysis information in exchange for this.

Still, all of this is somewhat of a larger more meta discussion...

Just want to point out that we're getting a bit far afield of this patch. Would love any comments on the actual technique...

That's because we like it :)

I actually view the entire "optimization" phase as somewhat lowering -- we're destroying information and specializing for execution performance on a *particular* target. Many steps here lose significant analysis information in exchange for this.

I actually agree with that, and looking at the code for the new pass manager I see that your buildModuleOptimizationPipeline doesn't match the legacy populateModulePassManager but instead implements only the end of the pipeline separately (I hadn't noticed before; I know the legacy one better...).

Still, all of this is somewhat of a larger more meta discussion...

Yeah, but it seems these discussions only happen when they're triggered by a patch like this.

davidxl added inline comments. Jul 31 2017, 11:14 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Is it better to rely on instcombine and CFG simplification, which can be more general, to get rid of the redundant zero guard?

147

Framework can mean something quite different. How about just calling it 'Info'?

chandlerc added inline comments. Jul 31 2017, 11:17 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Sadly, they're *less* powerful than this. You'd need PRE or something similar to get it, powered by PredicateInfo and GVN. Maybe JumpThreading and LVI could get it?

All of these seem really heavyweight to run this late in the pipeline. =/

147

It's more than the information. It represents the actual fastpath CFG framework (for lack of a better word) that has been injected into the function and needs to be populated with the particular memop's logic...

That said, I could totally add a comment. =] Would that help clarify enough?

davidxl added inline comments. Jul 31 2017, 11:46 PM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

Cases like the following can be handled by -instcombine + -simplifycfg. Wrapping the second redundant test into another check also works fine.

What are the interesting cases that cannot be handled?

    define void @set1_nonzero1(i8* %ptr, i64 %size) {
    ; CHECK-LABEL: define void @set1_nonzero1(
    entry:
      %zero_cond = icmp eq i64 %size, 0
      br i1 %zero_cond, label %exit, label %test

    test:
      %nonzero_cond = icmp ne i64 %size, 0
      br i1 %nonzero_cond, label %call, label %exit

    call:
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 15, i64 %size, i32 1, i1 false)
      ret void

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)

147

Sure, some comments will do.

chandlerc added inline comments. Aug 1 2017, 12:11 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

I'm honestly surprised instcombine+simplify-cfg would get this. It seems outside of their purview to do this kind of predicate analysis... I couldn't find where it does this in a cursory glance through the code.

Anyways, we don't run instcombine after this pass, but only inst simplify. Not sure we want to add that expensive of a pass when this code can just handle the specific cases it wants.

davidxl added inline comments. Aug 1 2017, 12:27 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

instcombine just converts the second predicate into true/false.

However, it seems simplifycfg alone can clean it up. Could this pass potentially be moved earlier to share some of the common cleanups?

craig.topper edited edge metadata. Aug 1 2017, 12:29 AM

I suspect SimplifyCFG got it in SimplifyEqualityComparisonWithOnlyPredecessor.

I suspect SimplifyCFG got it in SimplifyEqualityComparisonWithOnlyPredecessor.

Yeah, this doesn't surprise me as much, but from the comment:

This does a very limited form of jump threading.

Among the cases it doesn't handle is one where, for example, the predicate has been hoisted above an outer loop. I thought I had a test case that showed this, but it was a bit fragile so I removed it. Consider the memset being inside a loop and the test for zero getting unswitched out of that outer loop: I don't think SimplifyCFG will be able to get this, and I'll be very surprised if InstCombine does. I'll try to add a test case for this.
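
Roughly the shape I have in mind is sketched below (hand-written here, not an
actual test from the patch). The zero check dominates the memset, but the
memset's block is reached from a preheader and its own backedge rather than
directly from the block doing the comparison, so the only-predecessor form of
threading shouldn't be able to see it.

    define void @set_in_outer_loop(i8* %ptr, i64 %size, i64 %n) {
    entry:
      ; The zero test has been unswitched out of the outer loop.
      %zero_cond = icmp eq i64 %size, 0
      br i1 %zero_cond, label %exit, label %preheader

    preheader:
      br label %outer

    outer:
      ; %size is known non-zero here, but only via a check outside the loop.
      %i = phi i64 [ 0, %preheader ], [ %i.next, %outer ]
      call void @llvm.memset.p0i8.i64(i8* %ptr, i8 15, i64 %size, i32 1, i1 false)
      %i.next = add nuw i64 %i, 1
      %outer_done = icmp uge i64 %i.next, %n
      br i1 %outer_done, label %exit, label %outer

    exit:
      ret void
    }

    declare void @llvm.memset.p0i8.i64(i8* writeonly, i8, i64, i32, i1)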

Still not sure how / if instcombine really handles this.

chandlerc added inline comments. Aug 1 2017, 12:50 AM
lib/Transforms/Scalar/FastPathLibCalls.cpp
90

I'm not sure how it converts the second predicate into a constant in the hard cases though.

Anyways, see my response to Craig regarding simplify-cfg -- it's doing a *much* more limited form of this. I can add a test case that shows this at least.

As for moving this pass to share cleanups, see my comments to Mehdi about the challenges of changing the position of the pass. I don't see a great way to get it in front of instcombine, but maybe there is one.

Instead of generating loop IR for the fast path, how about creating a versioned memcpy/memset with the constrained parameters guarded under the condition test? That way, in the back-end the exact preferred optimal code can be generated, allowing for unrolled loop bodies specific to individual targets.

chandlerc marked an inline comment as done. Aug 2 2017, 3:52 AM

I have size data now.

Across the test suite + SPEC, the total size increase with this patch is under 1%. Looking at the benchmarks which exhibit the largest size growth, most grow by a few hundred bytes or less; they just happen to be *tiny* benchmarks.

The most interesting cases of growth I see are:

  • 473.astar - 5% growth, but this is still under 2k growth; in absolute terms this benchmark is quite small.
  • mafft/pairlocalalign - 3.3% (+14k)
  • 447.dealII - 2.1% (+50k)

Everything else is small either in percent, absolute size, or both.

Across our internal benchmarks, I see no regressions with this patch, but I see some benchmarks with 30% and 40% improvements (no, those numbers aren't mistakes). The pattern I am seeing is that when this matters, it *MATTERS*. But most of the time, the libcall is fast enough. This still seems very worthwhile to me, as the code patterns that end up impacted by this seem eminently reasonable.

So generally, I think this is a pretty clear net win, it is fairly isolated, and the code size cost seems very low. Any concerns with moving forward here?

Instead of generating loop IR for the fast path, how about creating a versioned memcpy/memset with the constrained parameters guarded under the condition test? That way, in the back-end the exact preferred optimal code can be generated, allowing for unrolled loop bodies specific to individual targets.

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop. If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

At least for x86 on Linux, I have no cases where something more complex than this trivial loop is a win compared to calling the library function.

davidxl edited edge metadata. Aug 3 2017, 9:03 PM

Does it penalize cases on the borderline? For instance, a 7-byte memcpy requires a byte-copy loop of 7 iterations; if that is above the threshold, the runtime check will purely add runtime overhead. This is unlike more general partial inlining of stringop operations, where we may have internal APIs to bypass similar checks inside the library function.

include/llvm/Analysis/TargetTransformInfo.h
799

nit: I find OpByteSize not intuitive. Perhaps DataByteWidth?

lib/Target/X86/X86TargetTransformInfo.cpp
2235

Reference? The overhead here is vague. Does it include the PLT or not? Or does it mean the setup cost of rep mov/sto?

lib/Transforms/Scalar/FastPathLibCalls.cpp
102

Why the magic 10? In reality, I would think 2 or 3 is enough. Also, add an internal option for this?

111

Why limit this to comparisons with zero? More generally, why not collect more general predicate info so that the second size check can also be eliminated? Or skip the fast path if the size is known to be large?

123

< 0 means a huge unsigned length. In this case, should you skip the fast path completely?

179

Is this pattern common?

191

How about on targets where unaligned access is OK?

207

The function is pretty large. Perhaps split the analysis and transformation?

314

Can you make this a callback invoked by buildFastPathMemOpFramework?

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop.

The difference is that at the target codegen level we can't as easily do the predicate analysis as we can at the IR level.

If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

But in the patch description you say that one of the challenges is constructing *just* the right IR to get efficient codegen from the backend. I understand this is for x86 right now, but if you don't have plans to allow other targets to work well with it, why not put it into the Target/X86 directory and make it a backend-specific IR pass to avoid confusion?

To clear up one last thing for me, are you saying that there are no performance impacts across SPEC at all? Even astar and dealII? Any impacts across the test-suite benchmarks?

lib/Transforms/Scalar/FastPathLibCalls.cpp
101

AFAICT this depth doesn't seem to be modified anywhere.

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop.

The difference is that at the target codegen level we can't as easily do the predicate analysis as we can at the IR level.

If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

But in the patch description you say that one of the challenges is constructing *just* the right IR to get efficient codegen from the backend. I understand this is for x86 right now, but if you don't have plans to allow other targets to work well with it, why not put it into the Target/X86 directory and make it a backend-specific IR pass to avoid confusion?

What this pass is doing is very generic, as is the IR produced, and the IR seems likely to me to make sense on many targets. No target is obligated to use this, but I'd like this to remain target independent. Target independent passes don't belong in the targets, even if they've only been tuned on one target so far.

To clear up one last thing for me, are you saying that there are no performance impacts across SPEC at all? Even astar and dealII? Any impacts across the test-suite benchmarks?

lib/Transforms/Scalar/FastPathLibCalls.cpp
92

Depth = 0 means no limit, right? Is there a reason this can't go quadratic? If so, we should comment here. Otherwise, maybe we need a depth limit.