This is an archive of the discontinued LLVM Phabricator instance.

[LIR] re-enable generation of memmove with runtime checks
Needs ReviewPublic

Authored by sebpop on Feb 21 2017, 2:15 PM.

Details

Summary

This patch fixes https://bugs.llvm.org//show_bug.cgi?id=31391

Last time memmove was enabled, LIR was using the memory data dependence static analysis to disambiguate the use of memmove().
Those patches were backed out because the dependence analysis is unsafe, and it still contains the same errors as of today.
This patch avoids using the static analysis, and instead versions the loop conditional to whether memmove is legal.

A next patch will add support to generate the runtime checks for memset as described in https://bugs.llvm.org//show_bug.cgi?id=31391#c4

The code of this patch mostly comes from Hexagon-LIR, and the revert of the previous revert of the memmove in LIR.
Krzysztof, we will probably need to refactor this code: Hexagon requires some special handling for volatile memcpy, and the rest is usable for target independent code.

Diff Detail

Event Timeline

sebpop created this revision.Feb 21 2017, 2:15 PM

Why are we generating a runtime check for memmove, as opposed to memcpy? I mean, they're the same function on many platforms, but we inline calls to memcpy much more aggressively for small sizes, so there could be a performance benefit to generating llvm.memcpy instead of llvm.memmove.

Does it make sense to perform this transform as part of the loop-idiom pass? loop-idiom runs interleaved with inlining; inserting runtime checks before we inline a function seems like a bad idea.

Why are we generating a runtime check for memmove, as opposed to memcpy?

LIR is very conservative: it will generate a memcpy only when it knows that source and destination do not alias.
When the check for alias fails, this patch adds the needed checks to ensure that the dependence is in the right direction: WAR deps are ok to memmove, RAW deps are unsafe: see comments in the code https://bugs.llvm.org//show_bug.cgi?id=31391#c4
Of course we could add one more check to generate more memcpys than memmoves (when the number of bytes copied is less than the difference between source and destination).

I mean, they're the same function on many platforms, but we inline calls to memcpy much more aggressively for small sizes, so there could be a performance benefit to generating llvm.memcpy instead of llvm.memmove.

Right. If you tell me to add that check, I'll add it.

Does it make sense to perform this transform as part of the loop-idiom pass? loop-idiom runs interleaved with inlining; inserting runtime checks before we inline a function seems like a bad idea.

What is your suggestion?

I guess we can just generate memmove for the first iteration of this; we'll see if any testcases come up where it would help to generate memcpy instead.

In terms of where to run this... not sure there's really any existing pass that's appropriate. You want to run after inlining, but before the loop vectorizer. Maybe a new pass just before the vectorizer; call it loop-versioning-loop-idiom or something like that.

sebpop updated this revision to Diff 89792.Feb 25 2017, 9:09 AM

Changes to address comments from Eli and Chad.

efriedma added inline comments.Feb 27 2017, 11:48 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1018

getUniqueExitBlock?

1039

Why do we need to call SimplifyInstruction here?

1120

Can we simplify this code by requiring LoopSimplify? (hasDedicatedExits() makes updating the domtree much more straightforward.)

efriedma set the repository for this revision to rL LLVM.
kparzysz added inline comments.Feb 27 2017, 12:28 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1039

I guess back when this was written, it made some difference. I don't remember what it was though. Do you think it's unnecessary now?

kparzysz edited edge metadata.Feb 27 2017, 12:41 PM

Why are we generating a runtime check for memmove, as opposed to memcpy?

We can check for safety of memcpy: if the source and destination don't overlap, then it's safe to generate memcpy. On the other hand, when they do overlap, it is not always safe to generate memmove.

The volatile memcpy for Hexagon is probably not quite as important now.

It looks like there aren't any calls to createLoopVersioningIdiomPass()?

Oh, also, performance numbers would be nice at some point.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1039

I can't see why it would make a difference... SimplifyInstruction probably won't simplify a non-trivial SCEV expression to a ConstantInt, and the rest will get cleaned up by instcombine later.

sebpop marked 3 inline comments as done.Feb 28 2017, 1:47 PM

The updated patch addresses the comments from Eli and Krzysztof.

Krzysztof, I have removed the code for memmove and memcpy detection
from the Hexagon LIR: this removes the generation of volatile memcpy.
Please make sure that it is ok to remove and to rely on the generic LIR for
the detection of memmove and memcpy.

Eli, I am running the spec2k, 2k6, and the test-suite with and without the patch.
I will either post relative speedup numbers by tomorrow, or I will amend
the patch if testing exposes problems.

sebpop updated this revision to Diff 90080.Feb 28 2017, 1:48 PM

Updated patch.

Krzysztof, I have removed the code for memmove and memcpy detection
from the Hexagon LIR: this removes the generation of volatile memcpy.
Please make sure that it is ok to remove and to rely on the generic LIR for
the detection of memmove and memcpy.

Yes, that is ok.

sebpop added a comment.Mar 1 2017, 8:41 AM

With the patch and compiling with "-mllvm -stats" the spec 2006:
Number of memcpy's formed from loop load+stores: 42
Number of memset's formed from loop stores: 1398
Number of memmove's formed from loop load+stores: 129

Before the patch on spec 2006:
Number of memcpy's formed from loop load+stores: 98
Number of memset's formed from loop stores: 1395
Number of memmove's formed from loop load+stores: 0

With the patch on the test-suite:
Number of memcpy's formed from loop load+stores: 121
Number of memset's formed from loop stores: 3243
Number of memmove's formed from loop load+stores: 1891

without the patch on the test-suite:
Number of memcpy's formed from loop load+stores: 140
Number of memset's formed from loop stores: 3213
Number of memmove's formed from loop load+stores: 0

With the patch and compiling with "-mllvm -stats" the spec 2006:
Number of memmove's formed from loop load+stores: 129

Before the patch on spec 2006:
Number of memmove's formed from loop load+stores: 0

This is nice.

Do you know what the most common strides are?

With the patch and compiling with "-mllvm -stats" the spec 2006:
Number of memcpy's formed from loop load+stores: 42

Before the patch on spec 2006:
Number of memcpy's formed from loop load+stores: 98

This looks bad.

sebpop added a comment.Mar 2 2017, 8:30 AM

With the patch and compiling with "-mllvm -stats" the spec 2006:
Number of memcpy's formed from loop load+stores: 42

Before the patch on spec 2006:
Number of memcpy's formed from loop load+stores: 98

This looks bad.

I've been trying to understand why the number of memcpys are changing:
the patch should only add new memmoves with runtime checks.
I then decided to rerun the spec2006, and it looks like the specmake was the problem
it truncated some of the stderr output.

With the patch I see the following on cpu2006:
Number of memcpy's formed from loop load+stores: 98
Number of memset's formed from loop stores: 1398
Number of memmove's formed from loop load+stores: 155

sebpop updated this revision to Diff 90343.Mar 2 2017, 11:39 AM

Added check for HasMemmove.

efriedma added inline comments.Mar 16 2017, 5:33 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
403

This would not work right for if the target has memmove, but not memcpy. Maybe not an issue in practice, but still kind of confusing.

881

This whole worklist thing looks very suspicious; for the purpose of figuring out whether the memmove covers the loop, why do you need to special-case the pointer operand of the load instruction?

1210

This is breaking LoopSimplify form for the loop.

kparzysz added inline comments.Mar 17 2017, 8:17 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
881

Could you explain the "special-casing" comment? I'm not seeing what you are referring to.

efriedma added inline comments.Mar 17 2017, 10:35 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
881

Suppose, for example, you have something like this:

; Assume this is a loop, where %p and %q are induction variables.
%p2 = call i8* @foo(i8* returned %p)
%v = load i8* %p2
store i8 %v, i8* %q

You start off with the load and store on the worklist, add the call to the worklist, then conclude the loop is covered even though the call could have other side-effects.

Granted, that's probably unlikely to happen in practice, but I'm not sure what this loop is trying to accomplish.

kparzysz added inline comments.Mar 17 2017, 10:43 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
881

The other side-effects should be checked elsewhere. This wouldn't be a strided load, so it shouldn't even get this far.

efriedma added inline comments.Mar 17 2017, 11:10 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
881

The other side-effects should be checked elsewhere.

Elsewhere, where? Isn't the point of coverLoop to check for side-effects?

This wouldn't be a strided load, so it shouldn't even get this far.

It's a strided load because of the "returned" attribute on the parameter call. (IIRC, SCEV doesn't actually look through calls like this at the moment, but it could.)

kparzysz added inline comments.Mar 17 2017, 11:16 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
881

The point of this loop is to see if the loop has other stuff in it besides the load and the store, and any instructions that are only used by the load and the store. In other words, if we removed the load and the store, and all instructions that would become recursively dead, would this loop still have anything left in it. If not, it means that the code used (only) by the load/store covers the loop entirely.

The side-effects of the code related to the load/store should be checked somewhere else, but not here. If it's not checked, the checks should be added.

evandro added a subscriber: evandro.Mar 7 2018, 9:03 AM