
[SimplifyLibCalls] Add noalias from known callsites

Authored by xbolva00 on Aug 13 2019, 10:18 AM.

Description

[SimplifyLibCalls] Add noalias from known callsites

Summary:
Should be fine for memcpy, strcpy, strncpy.

Reviewers: jdoerfert, efriedma

Reviewed By: jdoerfert

Subscribers: uenoku, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66135

llvm-svn: 368724

Details

Committed
xbolva00, Aug 13 2019, 10:18 AM
Reviewer
jdoerfert
Differential Revision
D66135: [SimplifyLibCalls] Add noalias from known callsites
Parents
rG2a4f26b4c28a: [ValueTracking] Improve reverse assumption inference
Branches
Unknown
Tags
Unknown

Event Timeline

I thought about this a bit more (after @lebedev.ri pinged me on IRC and I discussed it with @hfinkel):

  1. Why don't we simply annotate the function declarations (and the intrinsic definition in the td file)?
  2. Clang [0], and probably other software, emit memcpy where memset would be needed. While this will not break right away (we do not utilize the noalias information yet, see below), it has to be kept in mind.

How to make use of this information:
This patch *should* not break anything as we should not act on the information we add. To make use of the information we would need to teach AA about the following pattern:

  • If two pointers P and Q are noalias in a scope, and in that scope they are known to access A_p and A_q bytes respectively, then the ranges [P, &P[A_p]) and [Q, &Q[A_q]) do not overlap.

[0] https://github.com/llvm/llvm-project/blob/fc76d8551f54e0a8b99306741d6d818b90334d6a/clang/lib/CodeGen/CGExprAgg.cpp#L1935
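
For reference (not part of the patch), these are the C standard prototypes of the three routines; the restrict qualifiers are the source-level counterpart of the noalias attribute being added at the call sites:

  #include <stddef.h>

  /* C99/C11 prototypes: restrict is a promise from the caller that the
     source and destination ranges do not overlap, which is what noalias
     on the call site expresses at the IR level. */
  void *memcpy(void *restrict dst, const void *restrict src, size_t n);
  char *strcpy(char *restrict dst, const char *restrict src);
  char *strncpy(char *restrict dst, const char *restrict src, size_t n);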

Why don't we simply annotate the function declarations (and the intrinsic definition in the td file)?

The original patch wanted to annotate things only if Site > 0, so I didn't consider that. Yes, we could do it.

2. If Clang generates “almost valid” code, it should really be fixed. Otherwise, it is a ticking bomb. We should not add Clang-specific workarounds; we also have Flang, rustc, ...

Why don't we simply annotate the function declarations (and the intrinsic definition in the td file)?

The original patch wanted to annotate things only if Site > 0, so I didn't consider that. Yes, we could do it.

Would you mind adding it now? I don't think you need to remove the call site annotation (as we sometimes only look at one place and not the other).

2. If Clang generates “almost valid” code, it should really be fixed. Otherwise, it is a ticking bomb. We should not add Clang-specific workarounds; we also have Flang, rustc, ...

IMHO, we should do that, yes. @rsmith, @rjmccall, what do you think? Is the reason behind this commit from 10 years ago still a problem today [1]? Could we emit memset or an if (src != dst) memcpy(...)?

[1] https://github.com/llvm/llvm-project/commit/3ef668c27ae05a20329f406e58bfe74af507ce4c

Clang [0], and probably other software, emit memcpy where memset would be needed.

To be clear, you mean memmove (not memset).

2. If Clang generates “almost valid” code, it should really be fixed. Otherwise, it is a ticking bomb. We should not add Clang-specific workarounds; we also have Flang, rustc, ...

IIRC, this turns out to be an important performance optimization on some systems. I'm well aware that incorrect code is often faster than correct code, but... my recollection is that the cost of adding the if (a != b) predicate on the call to memcpy produces measurable slowdowns in important cases. The cost of using memmove in all cases just because the pointers might be equal is even worse. LLVM and Clang do have requirements beyond the standard for correct operation. We require a compiler that doesn't break our dyn_cast scheme for improper subclasses, doesn't assume that getenv() won't return -1, and so on. One of these requirements is that memcpy handles the case where the two pointers are equal in a sensible manner. We certainly should document these somewhere.

In the end, this is not really so much a "Clang workaround" as that the preconditions defined in the standard don't reflect the reality of implemented systems. This happened with nonnull on memcpy, etc. as well. While we can, and do, exploit UB, there are practical considerations to how strictly we interpret those conditions. That all having been said, if we can add the if (a != b) without penalty, I'm all for doing so in Clang. That might not change, however, how we encode the properties of memcpy in the optimizer.

@rsmith , @chandlerc , thoughts on this?

xbolva00 added a comment. Edited Aug 13 2019, 12:36 PM

I am not super familiar with memcpy or memmove implementations, but I have heard two things:

  1. memmove is very close to memcpy in performance (the copy-backwards trick, etc.)
  2. some memcpy implementations even handle overlapping buffers correctly

But since this patch does not optimize memcpy based on the noalias info, this problem is more of a note for future optimizations.

[1] https://github.com/llvm/llvm-project/commit/3ef668c27ae05a20329f406e58bfe74af507ce4c

A new intrinsic llvm.memcpy.always_inline would probably fix it (?).

struct x { int a[100]; };

void foo(struct x *P, struct x *Q) {
  *P = *Q;
}

Only Clang emits a jmp to memcpy..
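
For reference, a sketch (not Clang's actual output) of what that lowering amounts to: a fixed-size library call whose two arguments may be exactly equal, which C allows for *P = *Q but which memcpy's contract does not cover:

  #include <string.h>

  struct x { int a[100]; };

  void foo(struct x *P, struct x *Q) {
    /* *P = *Q is valid C even when P == Q, but memcpy is undefined
       whenever the ranges overlap, including the exactly-equal case. */
    memcpy(P, Q, sizeof *P);   /* 400 bytes */
  }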

Maybe introduce a threshold T, so:

if (size < T) emit llvm.memcpy.inline
else emit memmove?

I don't think it's generally the frontend's responsibility to decide things like whether to inline memcpy.

As a frontend author, I would like to be able to conveniently copy memory for a struct assignment in a single line of IR. It doesn't seem unreasonable for LLVM to ask me to use a slightly different intrinsic (or an extra parameter, whatever) when it's possible that the memory perfectly overlaps, though. If the backend then needs to add an equality-check to make this safe for some target, so be it.

As a frontend author, I would like to be able to conveniently copy memory for a struct assignment in a single line of IR.

memmove? :) this needs new benchmarks

xbolva00 added a comment. Edited Aug 14 2019, 1:54 AM

And that commit is dated 2009.
I believe we have better memmove implementations and optimizers now (which can turn memmove into memcpy if *valid*).

Maybe add "nobuiltin" to the struct-copy memcpy?

xbolva00 added a comment. Edited Aug 14 2019, 2:11 AM

Basically, the current problem could arise more often with -O1 than with -O2:

https://godbolt.org/z/s4uE13

GCC emits memcpy code for int a[100000]..

ICC at -O1:

cc(s*):
        mov     rsi, rdi                  #11.5
        jmp     foo(x*, s*)               #11.5

foo(x*, s*):
        mov     ecx, 5000                 #7.8
        rep     movsq                     #7.8
        ret

ICC at -O3:

foo(x*, s*):
        push    rsi                       #6.37
        mov     edx, 40000                #7.8
        call    _intel_fast_memcpy        #7.8
        pop     rcx                       #8.1
        ret                               #8.1

cc(s*):
        push    rsi                       #10.22
        mov     rsi, rdi                  #7.8
        mov     edx, 40000                #7.8
        call    _intel_fast_memcpy        #7.8
        pop     rcx                       #12.1
        ret

xbolva00 added a comment. Edited Aug 14 2019, 3:58 AM

memcpy is not defined if the source and destination pointers are exactly
equal, but other compilers do this optimization, and almost every memcpy
implementation handles this case safely

So another middle-end opportunity: can we turn memmove into memcpy in the middle end, @jdoerfert? It would have to run very late, so AA can't abuse the new memcpy(?, ?, nonconst) created from the memmove. CGP? The backend?
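
A rough sketch of the condition such a transform would need (an illustration only, not the logic of any existing pass; the pointer checks here stand in for facts alias analysis would have to prove): memmove can become memcpy only when the ranges are provably disjoint, or exactly equal if we rely on memcpy tolerating dst == src as discussed above.

  #include <string.h>

  /* Illustration: when can memmove(dst, src, n) be replaced by memcpy? */
  void copy(char *dst, const char *src, size_t n) {
    if (dst == src)
      return;                  /* exact overlap: nothing to copy          */
    if (src + n <= dst || dst + n <= src)
      memcpy(dst, src, n);     /* provably disjoint: memcpy is safe       */
    else
      memmove(dst, src, n);    /* possible partial overlap: keep memmove  */
  }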

Is noalias info propagated to the backend (@craig.topper)?

Oh, for glibc it works: https://godbolt.org/z/2tGIfv
BUT the new memcpy cannot be inlined! So I believe Clang's codegen is dangerous.. @hfinkel @rjmccall

@craig.topper @RKSimon @spatel: are only constant-sized memcpys inlined?

uenoku added a subscriber: uenoku. Aug 14 2019, 4:39 AM

As a frontend author, I would like to be able to conveniently copy memory for a struct assignment in a single line of IR.

memmove? :) this needs new benchmarks

Abstractly, independent of the performance of the operation on any particular architecture, it seems reasonable for the frontend to distinguish between the "no overlap is possible", "only exact overlap is possible", and "arbitrary overlap is possible" cases and let the backend figure out the best way to optimize and compile that. The frontend usually has very accurate information on those fronts from language guarantees that the backend can't necessarily recover.

Of course, frontends also generally have better information about the runtime environment than backends do, and so sometimes it's easiest for the frontend to just make the decision itself. But in my experience this leads to trouble in the long run if the optimizer/backend ever want to take advantage of the high-level information; it's better for the frontend to generate high-level intrinsics that carry all the abstract information required and then independently communicate whatever it knows about the runtime.
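
A purely hypothetical sketch of that three-way distinction; the names are invented for illustration and do not correspond to any existing LLVM interface:

  /* Hypothetical classification a frontend could attach to a copy. */
  enum CopyOverlap {
    COPY_NO_OVERLAP,          /* language guarantees disjoint ranges, e.g. restrict       */
    COPY_EXACT_OVERLAP_ONLY,  /* e.g. C struct assignment: ranges are equal or disjoint   */
    COPY_ANY_OVERLAP          /* e.g. a user-written memmove                              */
  };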