
[clang/llvm] Allow efficient implementation of libc's memory functions in C/C++
Needs ReviewPublic

Authored by gchatelet on May 7 2019, 2:41 AM.

Details

Summary

POC for the RFC http://lists.llvm.org/pipermail/llvm-dev/2019-April/131973.html

It's still a work in progress but gives the general idea.

Event Timeline

gchatelet created this revision.May 7 2019, 2:41 AM
Herald added projects: Restricted Project, Restricted Project. May 7 2019, 2:41 AM
gchatelet edited the summary of this revision. (Show Details)May 7 2019, 2:46 AM
gchatelet added reviewers: theraven, t.p.northover.
gchatelet updated this revision to Diff 198436.May 7 2019, 4:41 AM
  • Add documentation.
  • Fix permissive HasNoRuntimeAttribute

I wonder if a list of comma-separated names is the right approach. Would it make more sense to add a new attribute for each of the helpers, such as #no-runtime-for-memcpy? That should make querying simpler (one lookup in the table, rather than a lookup and a string scan) and also make it easier to add and remove attributes (merging is now just a matter of trying to add all of them, with the existing logic to deduplicate repeated attributes working).

We probably need to also carefully audit all of the transform passes to find ones that insert calls. Last time I looked, there were four places in LLVM where memcpy could be expanded, I wonder if there are a similar number where it can be synthesised...

I wonder if a list of comma-separated names is the right approach. Would it make more sense to add a new attribute for each of the helpers, such as #no-runtime-for-memcpy? That should make querying simpler (one lookup in the table, rather than a lookup and a string scan) and also make it easier to add and remove attributes (merging is now just a matter of trying to add all of them, with the existing logic to deduplicate repeated attributes working).

So I decided to go that route for two reasons:

  • The no_runtime_for attribute will be used almost exclusively by runtime implementers; on average the lookup will return false, and the parsing cost should be marginal (famous last words?).
  • We need to support every function in TargetLibraryInfo (I count 434 of them), and I'm not sure one attribute per function will scale or can stay in sync.
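The cost trade-off theraven raises (one table lookup versus a lookup plus a string scan) can be sketched as follows. This is a hypothetical illustration of the comma-separated-list lookup only; the function name and signature are not the actual LLVM attribute API:

```cpp
#include <string_view>

// Hypothetical sketch: the single "no_runtime_for" attribute has already been
// fetched (one table lookup, simulated here by `attrValue`); what remains is
// the linear scan over the comma-separated names that theraven mentions.
bool hasNoRuntimeFor(std::string_view attrValue, std::string_view fn) {
  size_t pos = 0;
  while (pos <= attrValue.size()) {
    size_t comma = attrValue.find(',', pos);
    if (comma == std::string_view::npos)
      comma = attrValue.size();
    if (attrValue.substr(pos, comma - pos) == fn)
      return true;  // found the function name in the list
    pos = comma + 1;
  }
  return false;
}
```

With one attribute per function, the string scan disappears and only the table lookup remains, which is the simplification theraven is arguing for.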

Now, I haven't thought about merging indeed; we may be able to reuse or clone the approach used for target-features.
For instance, some backends disable specific functions, and we may want to warn the user when that happens, e.g. setLibcallName(RTLIB::SHL_I128, nullptr); in X86ISelLowering.cpp.

I'm not saying one attribute per helper is not feasible, but I'd like to put it into perspective with the other constraints.

We probably need to also carefully audit all of the transform passes to find ones that insert calls. Last time I looked, there were four places in LLVM where memcpy could be expanded, I wonder if there are a similar number where it can be synthesised...

Yes indeed, it's going to be long and painful. Here are the functions calling CallLoweringInfo::setLibCallee:

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

  • ExpandLibCall(LC, Node, isSigned)
  • ExpandChainLibCall(LC, Node, isSigned)
  • ExpandDivRemLibCall(Node, Results)
  • ExpandSinCosLibCall(Node, Results)
  • ConvertNodeToLibcall(Node)
  • ConvertNodeToLibcall(Node)

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

  • ExpandIntRes_XMULO(N, Lo, Hi)

llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.cpp

  • ExpandChainLibCall(LC, Node, isSigned)

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

  • getMemcpy(Chain, dl, Dst, Src, Size, Align, isVol, AlwaysInline, isTailCall, DstPtrInfo, SrcPtrInfo)
  • getAtomicMemcpy(Chain, dl, Dst, DstAlign, Src, SrcAlign, Size, SizeTy, ElemSz, isTailCall, DstPtrInfo, SrcPtrInfo)
  • getMemmove(Chain, dl, Dst, Src, Size, Align, isVol, isTailCall, DstPtrInfo, SrcPtrInfo)
  • getAtomicMemmove(Chain, dl, Dst, DstAlign, Src, SrcAlign, Size, SizeTy, ElemSz, isTailCall, DstPtrInfo, SrcPtrInfo)
  • getMemset(Chain, dl, Dst, Src, Size, Align, isVol, isTailCall, DstPtrInfo)
  • getAtomicMemset(Chain, dl, Dst, DstAlign, Value, Size, SizeTy, ElemSz, isTailCall, DstPtrInfo)

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

  • visitIntrinsicCall(I, Intrinsic)

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

  • makeLibCall(DAG, LC, RetVT, Ops, isSigned, dl, doesNotReturn, isReturnValueUsed, isPostTypeLegalization)
  • LowerToTLSEmulatedModel(GA, DAG)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

  • LowerFSINCOS(Op, DAG)

llvm/lib/Target/AArch64/AArch64SelectionDAGInfo.cpp

  • EmitTargetCodeForMemset(DAG, dl, Chain, Dst, Src, Size, Align, isVolatile, DstPtrInfo)

llvm/lib/Target/ARM/ARMISelLowering.cpp

  • LowerToTLSGeneralDynamicModel(GA, DAG)

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp

  • EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size, Align, LC)

llvm/lib/Target/Hexagon/HexagonSelectionDAGInfo.cpp

  • EmitTargetCodeForMemcpy(DAG, dl, Chain, Dst, Src, Size, Align, isVolatile, AlwaysInline, DstPtrInfo, SrcPtrInfo)

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

  • LowerINIT_TRAMPOLINE(Op, DAG)

llvm/lib/Target/X86/X86ISelLowering.cpp

  • LowerWin64_i128OP(Op, DAG)
  • LowerFSINCOS(Op, Subtarget, DAG)

llvm/lib/Target/X86/X86SelectionDAGInfo.cpp

  • EmitTargetCodeForMemset(DAG, dl, Chain, Dst, Val, Size, Align, isVolatile, DstPtrInfo)

I would be careful about trying to over-generalize here. There are a few different related bits of functionality which seem to be interesting, given the discussion in the llvm-dev thread, here, and in related patches:

  1. The ability to specify -fno-builtin* on a per-function level, using function attributes.
  2. Improved optimization when -fno-builtin-memcpy is specified.
  3. The ability to avoid calls to memcpy for certain C constructs which would naturally be lowered to a memcpy call, like struct assignment of large structs, or explicit calls to __builtin_memcpy(). Maybe also some generalization of this involving other libc/libm/compiler-rt calls.
  4. The ability to force the compiler to generate "rep; movs" on x86 without inline asm.
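Point 3 can be illustrated with a minimal example; the struct size is arbitrary:

```cpp
#include <cstring>

// A plain struct assignment of a large struct: there is no memcpy call in
// the source, yet frontends typically lower this to @llvm.memcpy, which the
// backend may in turn emit as a call to the library memcpy().
struct Big {
  char bytes[4096];
};

void assign(Big *dst, const Big *src) {
  *dst = *src;  // usually becomes @llvm.memcpy
}
```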

It's not clear to me that all of this should be tied together. In particular, I'm not sure -fno-builtin-memcpy should imply the compiler never generates a call to memcpy(). On recent x86 chips, you might be able to get away with unconditionally using "rep movs", but generally an efficient memcpy for more than a few bytes is a lot longer than one instruction, and is not something reasonable for the compiler to synthesize inline.

If we're adding new IR attributes here, we should also consider the interaction with LTO.

I'm not convinced by the approach.

Can it still recognize the loop idiom as a memcpy, but emit a call to memmove instead (as only memcpy has been blacklisted)?
Can it do the same for memmove (for the case without overlap), and end up with infinite recursion?

I have a feeling we should pick a stance:

  • either not allow the compiler to lower a builtin to a call to the library; (my preferred choice, but I'm biased)
  • or not allow the library to use compiler builtins (but LTO flow with the runtime library *already* linked smells like trouble if we go this way).

The reason for my bias is that I have a multi-memcpy codegen in the compiler to generate those two calls:

memcpy(left,  in, n);
memcpy(right, in, n);

with a single loop.
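As a hedged sketch (the actual codegen is not shown in this thread), the fused form such a multi-memcpy lowering could produce might look like the following; the function name is illustrative:

```cpp
#include <cstddef>

// Illustrative only: the two memcpy calls above, fused into a single loop
// so that each source byte is loaded once and stored to both destinations.
void copyToBoth(char *left, char *right, const char *in, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    char c = in[i];  // one load serves two stores
    left[i] = c;
    right[i] = c;
  }
}
```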

I would be careful about trying to over-generalize here. There are a few different related bits of functionality which seem to be interesting, given the discussion in the llvm-dev thread, here, and in related patches:

Thanks for the feedback @efriedma. I don't fully understand what you're suggesting here, so I will try to reply inline.

  1. The ability to specify -fno-builtin* on a per-function level, using function attributes.

-fno-builtin* is about preventing clang/llvm from recognizing that a piece of code has the same semantics as a particular IR intrinsic; it has nothing to do with preventing the compiler from generating runtime calls.

  • -fno-builtin is about the transformation from code to IR (frontend)
  • The RFC is about the transformation from IR to runtime calls (backend)

  2. Improved optimization when -fno-builtin-memcpy is specified.

I don't see this happening, because if -fno-builtin-memcpy is used, clang (frontend) might already have unrolled and vectorized the loop. It is then very hard - by simply looking at the IR - to recognize that it's a memcpy and generate good code (e.g. https://godbolt.org/z/JZ-mR0).
Here we really want the compiler to understand that we are copying memory (i.e. this really has @llvm.memcpy semantics), but we want to prevent it from calling the runtime.

  3. The ability to avoid calls to memcpy for certain C constructs which would naturally be lowered to a memcpy call, like struct assignment of large structs, or explicit calls to __builtin_memcpy(). Maybe also some generalization of this involving other libc/libm/compiler-rt calls.

I believe very few people will use the attribute described in the RFC, it will most probably be library maintainers that already know a good deal of how the compiler is allowed to transform the code.

  4. The ability to force the compiler to generate "rep; movs" on x86 without inline asm.

This is not strictly required - at least it is not very useful for the purpose of building memcpy functions (more on this a few lines below).

It's not clear to me that all of this should be tied together. In particular, I'm not sure -fno-builtin-memcpy should imply the compiler never generates a call to memcpy().

As a matter of fact, those are not tied together. There are different use cases with different solutions; the one I'm focusing on here is about preventing the compiler from synthesizing runtime calls, because we want to be able to implement them directly in C / C++.
It is orthogonal to having the compiler recognize a piece of code as an IR intrinsic.

On recent x86 chips, you might be able to get away with unconditionally using "rep movs", but generally an efficient memcpy for more than a few bytes is a lot longer than one instruction, and is not something reasonable for the compiler to synthesize inline.

Well, it depends. On Haswell and particularly on Skylake, it's hard to beat rep;movsb for anything bigger than 1k, be it aligned or not.
On other architectures, and especially on the ones without ERMSB, you have different strategies. This is actually the very goal of this RFC: if you can inline or use PGO, you can do a much better job for small sizes than calling libc's memcpy or inserting rep;movsb.
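For context, this is roughly what using rep;movsb looks like today, i.e. through inline asm - the thing item 4 of efriedma's list would make unnecessary. The constraints are the usual x86-64 string-instruction ones (RDI = destination, RSI = source, RCX = count); the non-x86 byte-loop fallback is purely illustrative and keeps the sketch portable:

```cpp
#include <cstddef>

// Hedged sketch: a rep;movsb copy via inline asm on x86-64, with a plain
// byte loop as the fallback on other targets.
void repMovsbCopy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__)
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)  // RDI, RSI, RCX are updated
               :
               : "memory");
#else
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];
#endif
}
```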

If we're adding new IR attributes here, we should also consider the interaction with LTO.

Yes, this is a very different story; that's why I'm not exploring this route. It's quite possible that it would come with a high maintenance cost as well.

I'm not convinced by the approach.

Can it still recognize the loop idiom as a memcpy, but emit a call to memmove instead (as only memcpy has been blacklisted)?

Yes it can, and it's fine: passing the src and dst arguments as __restrict ensures that memcpy is the right function to pick in this case.
This attribute is not meant to be used widely; it is directed towards library maintainers who already know what the compiler is allowed to do.

Can it do the same for memmove (for the case without overlap), and end up with infinite recursion?

I fail to see how this would happen. Could you craft some pseudo code that would exhibit the infinite recursion?
My take on it is that if the compiler can't generate the code, it's totally OK to fail with an error message.
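For what it's worth, the scenario usually cited for this kind of recursion is a libc-style byte-copy loop: if the loop-idiom pass recognized the loop inside the memcpy implementation itself and replaced it with a call to memcpy, the function would call itself forever. A hedged sketch - the function name is illustrative; a real implementation would define the actual `memcpy` symbol:

```cpp
#include <cstddef>

// A libc-style copy loop. If this function were the real memcpy and the
// loop below were rewritten into a call to memcpy by loop-idiom
// recognition, the result would be unbounded self-recursion.
void *naive_memcpy(void *dst, const void *src, size_t n) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];  // a textbook memcpy idiom, recognizable by the compiler
  return dst;
}
```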

The purpose of this RFC is to get some help from the compiler to generate the best code for some building blocks (say, copy 4 bytes, copy 8 bytes) and to assemble them in a way that maximizes some metric (code size, runtime for certain parameter distributions).
It would be useful only to libc / libm / compiler-rt implementers, to build these functions on top of smaller functions that we know the compiler can produce at compile time.

I don't think we want to use this attribute to fully generate memcpy. I added the expansion into rep;movsb for variable-sized memcpy only because it's feasible on x86. It's preferable in this case since it has the correct semantics and prevents a compilation error, but I would be fine with a compilation error here.

I have a feeling we should pick a stance:

  • either not allow the compiler to lower a builtin to a call to the library; (my preferred choice, but I'm biased)

From the point of view of LLVM, the @llvm.memcpy intrinsic has the semantics of memcpy; it does not know whether it comes from a lowering in the frontend (e.g. passing structures by value) or from a builtin.
I believe this is feasible, although I fail to see where my proposal falls short. Can you show a code example where the current proposal is problematic?

  • or not allow the library to use compiler builtins (but LTO flow with the runtime library *already* linked smells like trouble if we go this way).

This currently generates very poor code, because -fno-builtin prevents LLVM from understanding the copy semantics.

The reason for my bias is that I have a multi-memcpy codegen in the compiler to generate those two calls:

memcpy(left,  in, n);
memcpy(right, in, n);

with a single loop.

I don't quite understand how this is linked to the issue at hand. Can you provide more context? Pointers to code?

-fno-builtin* is about preventing clang/llvm from recognizing that a piece of code has the same semantics as a particular IR intrinsic; it has nothing to do with preventing the compiler from generating runtime calls.

It has a dual purpose for C library functions. One, it prevents the compiler from assuming an explicitly written call to that function has some particular semantics. Two, it prevents the compiler from assuming the underlying library function exists for the purpose of optimization. (These are sort of intertwined, but they both matter in this context.)

I believe very few people will use the attribute described in the RFC, it will most probably be library maintainers that already know a good deal of how the compiler is allowed to transform the code.

Sure, I'm happy to assume that memcpy/memset implementations are written using some appropriate subset of C. (We should probably document that subset at some point.)


I still think there are really two things you're trying to accomplish here, which should be handled separately.

  1. Add a function attribute that works like -fno-builtin-memcpy currently does, to prevent the compiler from synthesizing calls to memcpy.
  2. Provide a convenient way to write a small, fixed-length "memcpy", in C and in LLVM IR, that is always expanded to optimal straight-line code (without any calls or loops).

These correspond to proposals (1) and (2) from your RFC; I think we need both to arrive at a reasonable final state.

(The "rep; movs" is a third thing, which I think should also be handled separately, but it sounds like you don't think that's very important.)

I still think there are really two things you're trying to accomplish here, which should be handled separately.

  1. Add a function attribute that works like -fno-builtin-memcpy currently does, to prevent the compiler from synthesizing calls to memcpy.
  2. Provide a convenient way to write a small, fixed-length "memcpy", in C and in LLVM IR, that is always expanded to optimal straight-line code (without any calls or loops).

    These correspond to proposals (1) and (2) from your RFC; I think we need both to arrive at a reasonable final state.

    (The "rep; movs" is a third thing, which I think should also be handled separately, but it sounds like you don't think that's very important.)

Thank you for taking the time to comment, your feedback is highly appreciated.

I understand that your main concern is about conflating two orthogonal requirements (namely 1. and 2.) in a single attribute. Is that correct?
From my point of view, the RFC (and this patch) really is about 1., because 2. can already be achieved with the builtin __builtin_memcpy(dst, src, <compile time size>).

My secondary goals in decreasing priority order are:

  • do not change the semantics of current builtins,
  • create a relatively straightforward LLVM patch that does not touch too much code and is easy to review,
  • do not add more confusion around the meaning of -fno-builtin-*,
  • do not add more builtins or IR intrinsics.

What is the main blocker on your end? What would you suggest so we can move forward?

My main blocker is that I want to make sure we're moving in the right direction: towards LLVM IR with clear semantics, towards straightforward rules for writing freestanding C code, and towards solutions which behave appropriately for all targets. There's clearly a problem here that's worth solving, but I want to make sure we're adding small features that interact with existing features in an obvious way. The patch as written is adding a new attribute that changes the semantics of C and LLVM IR in a subtle way that interacts with existing optimizations and lowering in ways that are complex and hard to understand.

I don't want to mix general restrictions on calling C library functions, with restrictions that apply specifically to memcpy: clang generally works on the assumption that a "memcpy" symbol is available in freestanding environments, and we don't really want to change that.

With -fno-builtin, specifically, we currently apply the restriction that optimizations will not introduce memcpy calls that would not exist with optimization disabled. This is sort of artificial, and maybe a bit confusing, but it seems to work well enough in practice. gcc does something similar.

I don't really want to add an attribute that has a different meaning from -fno-builtin. An attribute that has the same meaning is a lot less problematic: it reduces potential confusion, and solves a related problem that -fno-builtin currently doesn't really work with LTO, because we don't encode it into the IR.

Your current patch is using the "AlwaysInline" flag for SelectionDAG::getMemcpy, which forces a memcpy to be lowered to an inline implementation. In general, this flag is only supported for constant-size memcpy calls; otherwise, on targets where EmitTargetCodeForMemcpy does not have some special lowering, like the x86 "rep;movs", you'll hit an assertion failure. And we don't really want to add an implementation of variable-length memcpy for a lot of targets; it's very inefficient on targets which don't have unaligned memory access. You could try to narrowly fix this to only apply "AlwaysInline" if the size is a constant integer, but there's a related problem: even if a memcpy is constant size in C, optimizations don't promise to preserve that, in general. We could theoretically add such a promise, I guess, but that's awkward: it would conditionally apply to llvm.memcpy where the parameter is already const, so it's not something we can easily verify.

If we added a new builtin llvm.memcpy.inline, or something like that, the end result ends up being simpler in many ways. It has obvious rules for both generating it and lowering it, which don't depend on attributes in the parent function. We could mark the size as "immarg", so we don't have to add special optimization rules for it. And if we have it, we have a "complete" solution for writing memcpy implementations in C: we make __builtin_memcpy, or a new __builtin_inline_memcpy, produce it, and it can be combined with -fno-builtin to ensure we don't produce any calls to the "real" memcpy. The downside of a new builtin, of course, is that we now have two functions with similar semantics, and potentially have to fix a bunch of places to check for both of them.
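As a rough illustration of the C-side story: today a constant-size __builtin_memcpy is usually (but not guaranteed to be) expanded to straight-line loads and stores, and the proposed llvm.memcpy.inline / __builtin_inline_memcpy would turn that heuristic into a guarantee:

```cpp
// A constant-size copy. Current compilers typically expand this inline
// rather than emitting a libcall, but that is an optimization, not a
// promise; the proposed intrinsic would make the inline expansion mandatory.
void copy16(void *dst, const void *src) {
  __builtin_memcpy(dst, src, 16);  // constant size: usually no call emitted
}
```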

MemCpyOpt has been mentioned (the pass which transforms memcpy-like loops into llvm.memcpy). If we want it to perform some transform in circumstances where the call "memcpy" isn't available, we have to make sure to restrict it based on the target: in the worst case, on some targets without unaligned memory access, it could just act as a low-quality loop unroller. This applies no matter how the result is actually represented in IR.

My main blocker is that I want to make sure we're moving in the right direction: towards LLVM IR with clear semantics, towards straightforward rules for writing freestanding C code, and towards solutions which behave appropriately for all targets. There's clearly a problem here that's worth solving, but I want to make sure we're adding small features that interact with existing features in an obvious way. The patch as written is adding a new attribute that changes the semantics of C and LLVM IR in a subtle way that interacts with existing optimizations and lowering in ways that are complex and hard to understand.

This makes a lot of sense, I'm totally on board to reduce entropy and confusion along the way.

I don't want to mix general restrictions on calling C library functions, with restrictions that apply specifically to memcpy: clang generally works on the assumption that a "memcpy" symbol is available in freestanding environments, and we don't really want to change that.

With -fno-builtin, specifically, we currently apply the restriction that optimizations will not introduce memcpy calls that would not exist with optimization disabled. This is sort of artificial, and maybe a bit confusing, but it seems to work well enough in practice. gcc does something similar.

I don't really want to add an attribute that has a different meaning from -fno-builtin. An attribute that has the same meaning is a lot less problematic: it reduces potential confusion, and solves a related problem that -fno-builtin currently doesn't really work with LTO, because we don't encode it into the IR.

Adding @tejohnson to the conversation.

IIUC you're offering to introduce something like __attribute__((no-builtin-memcpy)) or __attribute__((no-builtin("memcpy"))).
As attributes they would compose nicely with (Thin)LTO.

I believe we still want to turn the -fno-builtin flags into attributes, so there's no impedance mismatch between the flag and the attribute, right?

Your current patch is using the "AlwaysInline" flag for SelectionDAG::getMemcpy, which forces a memcpy to be lowered to an inline implementation. In general, this flag is only supported for constant-size memcpy calls; otherwise, on targets where EmitTargetCodeForMemcpy does not have some special lowering, like the x86 "rep;movs", you'll hit an assertion failure. And we don't really want to add an implementation of variable-length memcpy for a lot of targets; it's very inefficient on targets which don't have unaligned memory access. You could try to narrowly fix this to only apply "AlwaysInline" if the size is a constant integer, but there's a related problem: even if a memcpy is constant size in C, optimizations don't promise to preserve that, in general. We could theoretically add such a promise, I guess, but that's awkward: it would conditionally apply to llvm.memcpy where the parameter is already const, so it's not something we can easily verify.

Fair enough. This patch was really meant to get the conversation started: I was myself not especially convinced by the approach. Hijacking the AlwaysInline parameter was a shortcut that would not work for the other mem functions anyway.

If we added a new builtin llvm.memcpy.inline, or something like that, the end result ends up being simpler in many ways. It has obvious rules for both generating it and lowering it, which don't depend on attributes in the parent function. We could mark the size as "immarg", so we don't have to add special optimization rules for it. And if we have it, we have a "complete" solution for writing memcpy implementations in C: we make __builtin_memcpy, or a new __builtin_inline_memcpy, produce it, and it can be combined with -fno-builtin to ensure we don't produce any calls to the "real" memcpy. The downside of a new builtin, of course, is that we now have two functions with similar semantics, and potentially have to fix a bunch of places to check for both of them.

This was one of the approaches I envisioned; it's much cleaner, and it's also a lot more work :)
I'm willing to go that route, knowing that it would also work for @theraven's use case.

MemCpyOpt has been mentioned (the pass which transforms memcpy-like loops into llvm.memcpy). If we want it to perform some transform in circumstances where the call "memcpy" isn't available, we have to make sure to restrict it based on the target: in the worst case, on some targets without unaligned memory access, it could just act as a low-quality loop unroller. This applies no matter how the result is actually represented in IR.

Yes, if we have to generate loops, it needs to happen before SelectionDAG.


In this framework -ffreestanding stays as it is - namely, it implies -fno-builtin. I still think the semantics are somewhat surprising and unclear to newcomers, but I guess we can't do anything about it at this point - apart from adding more documentation.

Lastly, if we are to introduce new IR intrinsics, how about adding some for memcmp and bcmp? It does not have to be part of this patch, but I think it's worth considering for consistency.

My main blocker is that I want to make sure we're moving in the right direction: towards LLVM IR with clear semantics, towards straightforward rules for writing freestanding C code, and towards solutions which behave appropriately for all targets. There's clearly a problem here that's worth solving, but I want to make sure we're adding small features that interact with existing features in an obvious way. The patch as written is adding a new attribute that changes the semantics of C and LLVM IR in a subtle way that interacts with existing optimizations and lowering in ways that are complex and hard to understand.

This makes a lot of sense, I'm totally on board to reduce entropy and confusion along the way.

I don't want to mix general restrictions on calling C library functions, with restrictions that apply specifically to memcpy: clang generally works on the assumption that a "memcpy" symbol is available in freestanding environments, and we don't really want to change that.

With -fno-builtin, specifically, we currently apply the restriction that optimizations will not introduce memcpy calls that would not exist with optimization disabled. This is sort of artificial, and maybe a bit confusing, but it seems to work well enough in practice. gcc does something similar.

I don't really want to add an attribute that has a different meaning from -fno-builtin. An attribute that has the same meaning is a lot less problematic: it reduces potential confusion, and solves a related problem that -fno-builtin currently doesn't really work with LTO, because we don't encode it into the IR.

Adding @tejohnson to the conversation.

IIUC you're offering to introduce something like __attribute__((no-builtin-memcpy)) or __attribute__((no-builtin("memcpy"))).
As attributes they would compose nicely with (Thin)LTO.

I believe we still want to turn the -fno-builtin flags into attributes, so there's no impedance mismatch between the flag and the attribute, right?

I have a related patch that turns -fno-builtin* options into module flags, and uses them in the LTO backends. This addresses current issues with these flags not working well with LTO.
See https://reviews.llvm.org/D60162

I have a related patch that turns -fno-builtin* options into module flags

Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

I have a related patch that turns -fno-builtin* options into module flags

Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

I think that a function attribute would be better. We generally use these flags only in the context of certain translation units, and when we use LTO, it would be sad if we had to take the most-conservative settings across the entire application. When we insert a new call to a standard library function, we always do it in the context of some function. We'd probably need to block inlining in some cases, but that's better than a global conservative setting.

I have a related patch that turns -fno-builtin* options into module flags

Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

I think that a function attribute would be better. We generally use these flags only in the context of certain translation units, and when we use LTO, it would be sad if we had to take the most-conservative settings across the entire application. When we insert a new call to a standard library function, we always do it in the context of some function. We'd probably need to block inlining in some cases, but that's better than a global conservative setting.

Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:

  • need to prevent inlining with different attributes
  • currently the TargetLibraryInfo is constructed on a per-module basis. Presumably it would instead need to be created per Function - this one in particular seems like it would require fairly extensive changes.
gchatelet added a comment (edited). May 15 2019, 2:56 AM

Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:

  • need to prevent inlining with different attributes

IIUC this is needed regardless of the proposed change. Correct?

  • currently the TargetLibraryInfo is constructed on a per-module basis. Presumably it would instead need to be created per Function - this one in particular seems like it would require fairly extensive changes.

Yes, this one is a bit worrying.
I think we can discard right away any solution that would mutate or create a TLI on a per-function basis.
Another design would be something like the following:

auto FunctionTLI = ModuleTLI.createCustomizedTLI(F);

where FunctionTLI is itself a TargetLibraryInfo, and calls to FunctionTLI would either return the function's customizations or delegate to the module's TLI. WDYT?
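A hedged sketch of that delegation design, with purely illustrative names (not the actual TargetLibraryInfo API): the per-function view answers from its attribute-derived overrides first and otherwise forwards to the module-level table.

```cpp
#include <set>
#include <string>

// Stand-in for the module-level TargetLibraryInfo: which library
// functions the compiler may assume exist and call.
struct ModuleTLI {
  std::set<std::string> Available;
  bool has(const std::string &Fn) const { return Available.count(Fn) != 0; }
};

// Stand-in for the proposed createCustomizedTLI(F) result: a thin
// per-function view that overrides, then delegates.
struct FunctionTLI {
  const ModuleTLI &Base;
  std::set<std::string> DisabledByAttr;  // e.g. from no-builtin-* attributes

  bool has(const std::string &Fn) const {
    if (DisabledByAttr.count(Fn))
      return false;       // function-level customization wins
    return Base.has(Fn);  // otherwise delegate to the module's TLI
  }
};
```

The appeal of the wrapper is that nothing is mutated and nothing large is copied per function; the open question tejohnson raises below is how much plumbing it takes to hand this view to every TLI user.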

I'm unsure whether we want to support function-level attributes right away, or whether it's OK to be in an intermediate state with only module-level attributes.

Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:

  • need to prevent inlining with different attributes

IIUC this is needed regardless of the proposed change. Correct?

With the module flags approach, no - because there won't be fine grained enough info to do so. Any merged IR will need to use the conservatively set merged flag for the whole module. Or did you mean in comparison with the approach in this patch? I haven't looked at this one in any detail yet.

  • currently the TargetLibraryInfo is constructed on a per-module basis. Presumably it would instead need to be created per Function - this one in particular seems like it would require fairly extensive changes.

Yes this one is a bit worrying.
I think we can discard right away any solution that would mutate or create a TLI on a per function basis.
Another design would be something like the following:

auto FunctionTLI = ModuleTLI.createCustomizedTLI(F);

where FunctionTLI is itself a TargetLibraryInfo and calls to FunctionTLI would either return the function customizations or delegate to the module's TLI. WDYT?

I don't think this makes it much easier - all TLI users still need to be taught to get and use this new Function level TLI. I guess for Function (or lower) passes it would be fairly straightforward, but for things like Module level or SCC passes it will require more wiring to ensure that the FunctionTLI is accessed in the appropriate places. Doable just a lot of manual changes.

I'm unsure if we want to support function level attributes right away or if it's OK to be in an intermediate state with only module level attributes.

I'm interested in thoughts from other developers here. The module flag change is straightforward, but unnecessary churn if we want to go the function attribute route. Which despite the work seems like the best long term approach...

I have a related patch that turns -fno-builtin* options into module flags

Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

I think that a function attribute would be better. We generally use these flags only in the context of certain translation units, and when we use LTO, it would be sad if we had to take the most-conservative settings across the entire application. When we insert a new function call to a standard library, we always do it in the context of some function. We'd probably need to block inlining in some cases, but that's better than a global conservative setting.

Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:

  • need to prevent inlining with different attributes

I think that this should be relatively straightforward. You just need to update AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining by adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize.

  • currently the TargetLibraryInfo is constructed on a per-module basis. Presumably it would instead need to be created per Function - this one in particular seems like it would require fairly extensive changes.

Interesting point. The largest issue I see is that we need TLI available from loop passes, etc., and we can't automatically recompute a function-level analysis there. We need to make sure that it's always available and not invalidated. TLI is one of those analysis passes, being derived only from things which don't change (i.e., the target triple), or things that change very rarely (e.g., function attributes), that we probably don't want to require all passes to explicitly say that they preserve it (not that the mechanical change to all existing passes is hard, but it's easy to forget), so I think we'd like something like the CFG-only concept in the current passes, but stronger and perhaps turned on by default, for this kind of "attributes-only" pass. (@chandlerc , thoughts on this?).
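The inlining merge rule suggested earlier in this comment (in the spirit of adjustCallerStackProbeSize) could be sketched as below. The names and the std::set representation are illustrative stand-ins, not the real AttributeFuncs / Attributes.td machinery:

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for a function's collection of "no-builtin-*" string attributes.
using NoBuiltinAttrs = std::set<std::string>;

// Conservative merge when inlining Callee into Caller: the caller keeps the
// union of both no-builtin-* sets, so the inlined body can never
// re-introduce a libcall that the callee's author forbade.
NoBuiltinAttrs mergeAttributesForInlining(const NoBuiltinAttrs &Caller,
                                          const NoBuiltinAttrs &Callee) {
  NoBuiltinAttrs Merged = Caller;
  Merged.insert(Callee.begin(), Callee.end());
  return Merged;
}
```

Taking the union is the safe direction: it can only forbid more lowering in the caller, never less, which matches the existing deduplicating merge logic for repeated attributes.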

Sorry I've been a bit slow to respond here...

I have a related patch that turns -fno-builtin* options into module flags

Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

I think that a function attribute would be better. We generally use these flags only in the context of certain translation units, and when we use LTO, it would be sad if we had to take the most-conservative settings across the entire application. When we insert a new function call to a standard library, we always do it in the context of some function. We'd probably need to block inlining in some cases, but that's better than a global conservative setting.

FWIW, I definitely agree here. This really is the end state we're going to find ourselves in and we should probably go directly there.

Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:

  • need to prevent inlining with different attributes

I think that this should be relatively straightforward. You just need to update AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining by adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize.

+1

  • currently the TargetLibraryInfo is constructed on a per-module basis. Presumably it would instead need to be created per Function - this one in particular seems like it would require fairly extensive changes.

Interesting point. The largest issue I see is that we need TLI available from loop passes, etc., and we can't automatically recompute a function-level analysis there. We need to make sure that it's always available and not invalidated. TLI is one of those analysis passes, being derived only from things which don't change (i.e., the target triple), or things that change very rarely (e.g., function attributes), that we probably don't want to require all passes to explicitly say that they preserve it (not that the mechanical change to all existing passes is hard, but it's easy to forget), so I think we'd like something like the CFG-only concept in the current passes, but stronger and perhaps turned on by default, for this kind of "attributes-only" pass. (@chandlerc , thoughts on this?).

Yep, this makes sense.

The new PM makes this quite easy. The analysis itself gets to implement the invalidation hook, and say "nope, I'm not invalidated". In fact, in the new PM, TargetLibraryInfo already works this way. We build an instance per function and say it is never invalidated.

However, they are just trivial wrappers around shared implementations, so it will still require some non-trivial changes. Will need to remove the module-based access and move clients over to provide a function when they query it, etc.

IIRC, TargetTransformInfo already basically works exactly this way in both old and new PMs and we should be able to look at exactly the techniques it uses in both pass managers to build an effective way to manage these per-function.
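A minimal standalone illustration of the new-PM idiom described above: an analysis result opts out of invalidation by returning false from its invalidate() hook. The class name is hypothetical, and the real LLVM hook also receives the IR unit, a PreservedAnalyses set, and an invalidator; those are elided here.

```cpp
#include <cassert>

// Sketch: a per-function TLI result that declares itself never invalidated.
struct PerFunctionTLIResult {
  // Derived only from the target triple and function attributes, which the
  // pass pipeline does not change, so the cached result is always valid.
  bool invalidate(/* Function &, const PreservedAnalyses &, ... */) {
    return false; // "nope, I'm not invalidated"
  }
};
```

Because the hook unconditionally returns false, the pass manager keeps the cached result across passes without every pass having to declare that it preserves the analysis.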

AFAIU here is a coarse plan of what needs to happen

  1. Add a no-builtin clang function attribute that has the same semantics as the -fno-builtin command-line argument
  2. Propagate it to the IR.
    • In the light of recent discussions and as @theraven suggested it seems more appropriate to encode them as individual IR attributes (e.g. "no-builtin-memcpy", "no-builtin-sqrt", etc..)
  3. Propagate/merge the no-builtin IR attribute for LTO by "updating AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining and adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize."
  4. Get inspiration from TargetTransformInfo to get TargetLibraryInfo on a per function basis for all passes and respect the IR attribute.

I'm not familiar with 3 and 4 but I can definitely have a look.
I'll update this patch to do 1 and 2 in the meantime. @tejohnson let me know how you want to proceed for your related patch. I'm happy to help if I can.
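For steps 1 and 2 of the plan above, usage could look like the sketch below. The attribute spelling follows the patch under review (D61634) and was not yet final at this point in the thread; the macro guard keeps the snippet compiling on compilers that do not know the attribute.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__clang__)
#define NO_BUILTIN_MEMCPY __attribute__((no_builtin("memcpy")))
#else
#define NO_BUILTIN_MEMCPY
#endif

// A libc-style memcpy: without the attribute, idiom recognition may turn this
// loop back into a call to memcpy, i.e. infinite recursion when this function
// *is* the libc memcpy.
NO_BUILTIN_MEMCPY
void *my_memcpy(void *dst, const void *src, size_t count) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  for (size_t i = 0; i < count; ++i)
    d[i] = s[i];
  return dst;
}
```

Clang would then emit an IR-level "no-builtin-memcpy" string attribute on the function, which the per-function TLI can later consult.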

gchatelet updated this revision to Diff 200998.May 23 2019, 9:20 AM
  • Use no-builtin instead of no-runtime-for.
  • Use one attribute per runtime function to make merging easier.

The patch is still WIP and needs more work.

AFAIU here is a coarse plan of what needs to happen

  1. Add a no-builtin clang function attribute that has the same semantics as the -fno-builtin command-line argument
  2. Propagate it to the IR.
    • In the light of recent discussions and as @theraven suggested it seems more appropriate to encode them as individual IR attributes (e.g. "no-builtin-memcpy", "no-builtin-sqrt", etc..)
  3. Propagate/merge the no-builtin IR attribute for LTO by "updating AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining and adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize."

This one isn't about LTO, but rather the inliner. You could have different functions in the same module even without LTO that have incompatible no-builtin attributes. There isn't any propagation required for LTO.

  1. Get inspiration from TargetTransformInfo to get TargetLibraryInfo on a per function basis for all passes and respect the IR attribute.

    I'm not familiar with 3 and 4 but I can definitely have a look. I'll update this patch to do 1 and 2 in the meantime. @tejohnson let me know how you want to proceed for your related patch. I'm happy to help if I can.

I will mark that one obsolete. I can work on 4, it may just take some time to get it all plumbed.

I have a question about qsort: if we provide our own implementation of qsort and replace calls to libc's qsort with calls to ours, we could then fully inline the cmp function. Ideas?

kristina added a subscriber: kristina.EditedMay 25 2019, 2:47 AM

I have a question about qsort: if we provide our own implementation of qsort and replace calls to libc's qsort with calls to ours, we could then fully inline the cmp function. Ideas?

qsort would seem out of scope here since this is mostly about string.h style functions, more specifically memcpy/memmove. Things like memset would also fall under this category if covered in subsequent diffs.

More specifically, it's functions that can benefit a lot from backend-specific features (NEON, SSE, etc.), especially for larger operations. Such functions are also tricky since their performance is subject to alignment in a lot of cases (usually for SIMD operations, and SIMD may be undesirable in small cases). At the same time they are extremely important, and even with ifunc or the like, most standard libraries would not match a highly optimized version for the target that doesn't require linking (and thus branches). Decisions about which path to take (SIMD or no SIMD; SSE, AVX, or REP MOVS) can be determined at compilation time, eliminating the need for a lot of runtime alignment checks (more branches), which the standard library uses to determine whether a SIMD extension may be used and, if so, whether stack alignment is appropriate and, if not, correct it. Eliminating that logic where possible would be desirable.

(I hope my answer wasn't totally off-base)
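The compile-time dispatch described above might be sketched as follows. The threshold value and the "wide" path are illustrative stand-ins for the target-specific choices (SIMD width, REP MOVS, ...) that a compiler-expanded memcpy could bake in at compilation time:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kWideCopyThreshold = 16; // illustrative cutoff, not tuned

static void copyBytes(char *D, const char *S, size_t N) {
  for (size_t i = 0; i < N; ++i) // tiny copies: no setup cost, minimal branches
    D[i] = S[i];
}

static void copyWide(char *D, const char *S, size_t N) {
  // Stand-in for a SIMD / rep-movs path: copy 8 bytes at a time through a
  // uint64_t (the memcpy-into-local idiom handles unaligned access safely),
  // then finish the tail with a byte loop.
  size_t i = 0;
  for (; i + sizeof(uint64_t) <= N; i += sizeof(uint64_t)) {
    uint64_t W;
    std::memcpy(&W, S + i, sizeof(W));
    std::memcpy(D + i, &W, sizeof(W));
  }
  copyBytes(D + i, S + i, N - i);
}

void *dispatchMemcpy(void *Dst, const void *Src, size_t N) {
  char *D = static_cast<char *>(Dst);
  const char *S = static_cast<const char *>(Src);
  if (N < kWideCopyThreshold) // decided per call site; a compiler expanding
    copyBytes(D, S, N);       // a constant-size copy can fold this branch away
  else
    copyWide(D, S, N);
  return Dst;
}
```

When the size is a compile-time constant, the compiler can resolve the dispatch statically and emit only the selected path, which is exactly the runtime-check elimination the comment above describes.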

gchatelet updated this revision to Diff 203386.Jun 6 2019, 9:21 AM
  • Add documentation.
  • Fix permissive HasNoRuntimeAttribute
  • Mark interleave as disabled in the documentation.
  • Use no-builtin instead of no-runtime-for.
  • Adding an llvm.memcpy.inline intrinsic.
  • Adding __builtin_memcpy_inline clang builtin.

AFAIU here is a coarse plan of what needs to happen

  1. Add a no-builtin clang function attribute that has the same semantics as the -fno-builtin command-line argument
  2. Propagate it to the IR.
    • In the light of recent discussions and as @theraven suggested it seems more appropriate to encode them as individual IR attributes (e.g. "no-builtin-memcpy", "no-builtin-sqrt", etc..)
  3. Propagate/merge the no-builtin IR attribute for LTO by "updating AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining and adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize."

This one isn't about LTO, but rather the inliner. You could have different functions in the same module even without LTO that have incompatible no-builtin attributes. There isn't any propagation required for LTO.

  1. Get inspiration from TargetTransformInfo to get TargetLibraryInfo on a per function basis for all passes and respect the IR attribute.

    I'm not familiar with 3 and 4 but I can definitely have a look. I'll update this patch to do 1 and 2 in the meantime. @tejohnson let me know how you want to proceed for your related patch. I'm happy to help if I can.

I will mark that one obsolete. I can work on 4, it may just take some time to get it all plumbed.

Checking in to see where we are on this issue. I haven't had any time to work on 4 but hopefully can start on that soon. But it needs the first part done to be effective.

Checking in to see where we are on this issue. I haven't had any time to work on 4 but hopefully can start on that soon. But it needs the first part done to be effective.

Thx for the heads up @tejohnson.
The patch is missing a bit of documentation but it shouldn't be too far from complete:

  • it adds a clang function attribute, e.g. __attribute__((no_builtin("memcpy")))
  • instructs clang to compose individual IR attributes from the clang attribute above, e.g. no-builtin-memcpy, no-builtin-memset, no-builtin-sqrt...
  • adds a specific builtin to clang for the memcpy case __builtin_memcpy_inline
  • adds an LLVM IR intrinsic int_memcpy_inline
  • adds an LLVM Instruction MemCpyInlineInst
  • instructs LLVM to forward the no-builtin-memcpy IR attribute from the function declaration to the actual memcpy calls inside the function's body (same for memset and memmove)
  • adds code to turn the MemCpyInlineInst into code, using DAG.getMemcpy with always_inline set.

Left to do:

  • finish implementing memset / memmove
  • check attributes and reject the ones that are not implemented
  • add documentation

There are too many things going on in this patch, and it may be worth splitting it.
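The __builtin_memcpy_inline piece listed above could be used like the sketch below, e.g. inside a libc memcpy implementation. As proposed, the builtin guarantees inline expansion (it is never lowered to a libc call, so no self-recursion) and requires a compile-time-constant size; the fallback branch is for compilers without the builtin.

```cpp
#include <cassert>
#include <cstddef>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Copy a fixed-size block; N is a compile-time constant, as the proposed
// builtin requires.
template <size_t N>
void copyBlock(char *Dst, const char *Src) {
#if __has_builtin(__builtin_memcpy_inline)
  __builtin_memcpy_inline(Dst, Src, N); // always expanded inline, never a call
#else
  for (size_t i = 0; i < N; ++i) // portable fallback with the same behavior
    Dst[i] = Src[i];
#endif
}
```

A libc memcpy can then be written as dispatch over size classes, with each class delegating to copyBlock<N> and no risk of the compiler re-synthesizing a memcpy call.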

I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

D66428 went in a few weeks ago at r371284, and I just mailed the follow-on patch D67923, which adds support to the TLI analysis to use the Function to override the available builtins (with some of the code stubbed out since we don't yet have those per-Function attributes finalized).

@gchatelet where are you at on finalizing this patch? Also, I mentioned this offline but to follow up here: I think we will want an attribute to represent -fno-builtins (so that it doesn't need to be expanded out into the full list of individual no-builtin-{func} attributes, which would be both more verbose and less efficient, as well as being less backward compatible when new builtin funcs are added).

I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

D66428 went in a few weeks ago at r371284, and I just mailed the follow-on patch D67923, which adds support to the TLI analysis to use the Function to override the available builtins (with some of the code stubbed out since we don't yet have those per-Function attributes finalized).

@gchatelet where are you at on finalizing this patch? Also, I mentioned this offline but to follow up here: I think we will want an attribute to represent -fno-builtins (so that it doesn't need to be expanded out into the full list of individual no-builtin-{func} attributes, which would be both more verbose and less efficient, as well as being less backward compatible when new builtin funcs are added).

I'll break this patch in several pieces. The first one is to add the no_builtin attribute, see https://reviews.llvm.org/D61634.

I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

D66428 went in a few weeks ago at r371284, and I just mailed the follow-on patch D67923, which adds support to the TLI analysis to use the Function to override the available builtins (with some of the code stubbed out since we don't yet have those per-Function attributes finalized).

@gchatelet where are you at on finalizing this patch? Also, I mentioned this offline but to follow up here: I think we will want an attribute to represent -fno-builtins (so that it doesn't need to be expanded out into the full list of individual no-builtin-{func} attributes, which would be both more verbose and less efficient, as well as being less backward compatible when new builtin funcs are added).

I'll break this patch in several pieces. The first one is to add the no_builtin attribute, see https://reviews.llvm.org/D61634.

Are you planning to add a follow-on patch that translates the various -fno-builtin* options to these attributes? Once that is done, I can refine D67923 to actually set the TLI availability from the attributes and remove the clang functionality that does this from the options.

I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

D66428 went in a few weeks ago at r371284, and I just mailed the follow-on patch D67923, which adds support to the TLI analysis to use the Function to override the available builtins (with some of the code stubbed out since we don't yet have those per-Function attributes finalized).

@gchatelet where are you at on finalizing this patch? Also, I mentioned this offline but to follow up here: I think we will want an attribute to represent -fno-builtins (so that it doesn't need to be expanded out into the full list of individual no-builtin-{func} attributes, which would be both more verbose and less efficient, as well as being less backward compatible when new builtin funcs are added).

I'll break this patch in several pieces. The first one is to add the no_builtin attribute, see https://reviews.llvm.org/D61634.

Are you planning to add a follow-on patch that translates the various -fno-builtin* options to these attributes? Once that is done, I can refine D67923 to actually set the TLI availability from the attributes and remove the clang functionality that does this from the options.

The no-builtin attribute is in as of rG98f3151a7dded8838fafcb5f46e6c8358def96b8. I've started working on getting the -fno-builtin flag propagated to no-builtin function attributes. I'll ping back when it's ready.