POC for the rfc http://lists.llvm.org/pipermail/llvm-dev/2019-April/131973.html
It's still work in progress but gives the general idea.
Differential D61634
[clang/llvm] Allow efficient implementation of libc's memory functions in C/C++
Authored by gchatelet on May 7 2019, 2:41 AM.
Comment Actions
I wonder if a list of comma-separated names is the right approach. Would it make more sense to add a new attribute for each of the helpers, such as #no-runtime-for-memcpy? That should make querying simpler (one lookup in the table, rather than a lookup and a string scan) and also make it easier to add and remove attributes (merging is now just a matter of trying to add all of them, with the existing logic to deduplicate repeated attributes working). We probably need to also carefully audit all of the transform passes to find ones that insert calls. Last time I looked, there were four places in LLVM where memcpy could be expanded; I wonder if there are a similar number where it can be synthesised...

Comment Actions
So I decided to go that route for two reasons:
Now, I haven't thought about merging indeed; we may be able to reuse or clone the approach used for target-features. I'm not saying one attribute per helper is not feasible, but I'd like to put it into perspective with other constraints.
Yes indeed, it's going to be long and painful. Here are the files with functions calling CallLoweringInfo::setLibCallee:
llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.cpp
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
llvm/lib/Target/AArch64/AArch64SelectionDAGInfo.cpp
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
llvm/lib/Target/Hexagon/HexagonSelectionDAGInfo.cpp
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
Comment Actions
I would be careful about trying to over-generalize here. There are a few different related bits of functionality which seem to be interesting, given the discussion in the llvm-dev thread, here, and in related patches:
It's not clear to me that all of this should be tied together. In particular, I'm not sure -fno-builtin-memcpy should imply the compiler never generates a call to memcpy(). On recent x86 chips, you might be able to get away with unconditionally using "rep movs", but generally an efficient memcpy for more than a few bytes is a lot longer than one instruction, and is not something reasonable for the compiler to synthesize inline. If we're adding new IR attributes here, we should also consider the interaction with LTO.

Comment Actions
I'm not convinced by the approach. Can it still recognize the loop idiom as a memcpy implementation but use memmove (as only memcpy has been blacklisted)? I have a feeling we should pick a stance:
The reason for my bias is that I have a multi-memcpy codegen in the compiler to generate these two calls with a single loop: memcpy(left, in, n); memcpy(right, in, n);

Comment Actions
Thx for the feedback @efriedma, I don't fully understand what you're suggesting here, so I will try to reply inline.
-fno-builtin* is about preventing clang/llvm from recognizing that a piece of code has the same semantics as a particular IR intrinsic; it has nothing to do with preventing the compiler from generating runtime calls.
I don't see this happening, because if -fno-builtin-memcpy is used, clang (frontend) might already have unrolled and vectorized the loop. It is then very hard - by simply looking at the IR - to recognize that it's a memcpy and generate good code (e.g. https://godbolt.org/z/JZ-mR0)
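To make the loop-idiom point concrete, here is a minimal sketch (the function name is mine). With optimizations enabled, LLVM's loop-idiom recognition can turn this loop into a memcpy call or @llvm.memcpy intrinsic; under -fno-builtin-memcpy that recognition is suppressed, and the loop is instead unrolled/vectorized as ordinary code, which is the hard-to-recover IR described above.

```c
#include <stddef.h>

/* A naive byte-copy loop. At -O2 this is a candidate for loop-idiom
 * recognition into memcpy; -fno-builtin-memcpy disables that, leaving
 * an unrolled/vectorized loop in the IR instead. */
void *naive_copy(void *dst, const void *src, size_t n) {
  char *d = dst;
  const char *s = src;
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}
```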
I believe very few people will use the attribute described in the RFC; it will most probably be library maintainers who already know a good deal about how the compiler is allowed to transform the code.
This is not strictly required - at least it is not too useful for the purpose of building memcpy functions (more on this a few lines below).
As a matter of fact, those are not tied together. There are different use cases with different solutions, the one I'm focusing on here is about preventing the compiler from synthesizing runtime calls because we want to be able to implement them directly from C / C++.
Well it depends. On Haswell and particularly Skylake it's hard to beat rep;movsb for anything bigger than 1k, be it aligned or not.
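For reference, such a copy can be sketched as follows (a hedged sketch; the function name is mine, the inline-asm path is x86-64 only, and non-x86 builds fall back to a plain byte loop):

```c
#include <stddef.h>

/* Sketch of a "rep; movsb" based copy as discussed above. On recent
 * Intel chips with enhanced rep movsb, this single instruction is
 * competitive with vectorized copies for large sizes. */
void *rep_movsb_copy(void *dst, const void *src, size_t n) {
#if defined(__x86_64__)
  void *d = dst;
  __asm__ volatile("rep movsb"
                   : "+D"(d), "+S"(src), "+c"(n) /* RDI, RSI, RCX */
                   :
                   : "memory");
#else
  char *d = dst;
  const char *s = src;
  while (n--)
    *d++ = *s++;
#endif
  return dst;
}
```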
Yes, this is a very different story; that's why I'm not exploring this route. It would likely come with a high maintenance cost as well.

Comment Actions
Yes it can, and it's fine: passing the src and dst arguments as __restrict will ensure that memcpy is the function to choose in this case.
I fail to see how this would happen. Could you craft some pseudo code that would exhibit the infinite recursion? The purpose of this RFC is to get some help from the compiler to generate the best code for some building blocks (say, copy 4 bytes, copy 8 bytes) and assemble them in a way that maximizes something (code size, runtime for certain parameter distributions). I don't think we want to use this attribute to fully generate the memcpy. I added the expansion into rep;movsb for variable sized memcpy only because it's feasible on x86. It's preferable in this case since it has the correct semantics and prevents a compilation error, but I would be fine with a compilation error here.
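The "building blocks" idea can be sketched like this (a hedged illustration; the function names and the 4-to-8-byte range are my own choices, not part of the patch). Each __builtin_memcpy call has a constant size, so the compiler lowers it to plain loads and stores rather than a library call:

```c
#include <stddef.h>

/* Illustrative building blocks: constant-size copies that lower to
 * single (possibly unaligned) loads and stores. */
static void copy4(char *dst, const char *src) { __builtin_memcpy(dst, src, 4); }
static void copy8(char *dst, const char *src) { __builtin_memcpy(dst, src, 8); }

/* Copy 4 <= n <= 8 bytes using two possibly-overlapping fixed copies,
 * the assembly-by-hand pattern described above. */
void copy_4_to_8(char *dst, const char *src, size_t n) {
  if (n == 8) {
    copy8(dst, src);
  } else {
    copy4(dst, src);
    copy4(dst + n - 4, src + n - 4);
  }
}
```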
From the point of view of LLVM, the @llvm.memcpy intrinsic has the semantics of memcpy; it does not know whether it comes from a lowering in the frontend (e.g. passing structures by value) or from a builtin.
This currently generates very poor code because -fno-builtin prevents LLVM from understanding the copy semantics.
I don't quite understand how this is linked to the issue at hand. Can you provide more context? Pointers to code?

Comment Actions
It has a dual purpose for C library functions. One, it prevents the compiler from assuming an explicitly written call to that function has some particular semantics. Two, it prevents the compiler from assuming the underlying library function exists for the purpose of optimization. (These are sort of intertwined, but they both matter in this context.)
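A small illustration of purpose one (the wrapper name is mine): with builtins enabled, clang may fold this call to the constant 3 at compile time, because it may assume the strlen symbol has the C library semantics; under -fno-builtin-strlen it must emit a real call instead. The runtime result is the same either way.

```c
#include <stddef.h>
#include <string.h>

/* Folds to "return 3;" when the compiler may assume strlen's library
 * semantics; becomes a genuine call under -fno-builtin-strlen. */
size_t abc_len(void) { return strlen("abc"); }
```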
Sure, I'm happy to assume that memcpy/memset implementations are written using some appropriate subset of C. (We should probably document that subset at some point.) I still think there are really two things you're trying to accomplish here, which should be handled separately.
These correspond to proposals (1) and (2) from your RFC; I think we need both to arrive at a reasonable final state. (The "rep; movs" is a third thing, which I think should also be handled separately, but it sounds like you don't think that's very important.)

Comment Actions
Thank you for taking the time to comment, your feedback is highly appreciated. I understand that your main concern is about conflating two orthogonal requirements (namely 1. and 2.) in a single attribute. Is that correct? My secondary goals in decreasing priority order are:
What is the main blocker on your end? What would you suggest so we can move forward?

Comment Actions
My main blocker is that I want to make sure we're moving in the right direction: towards LLVM IR with clear semantics, towards straightforward rules for writing freestanding C code, and towards solutions which behave appropriately for all targets. There's clearly a problem here that's worth solving, but I want to make sure we're adding small features that interact with existing features in an obvious way. The patch as written is adding a new attribute that changes the semantics of C and LLVM IR in a subtle way that interacts with existing optimizations and lowering in ways that are complex and hard to understand.

I don't want to mix general restrictions on calling C library functions with restrictions that apply specifically to memcpy: clang generally works on the assumption that a "memcpy" symbol is available in freestanding environments, and we don't really want to change that.

With -fno-builtin, specifically, we currently apply the restriction that optimizations will not introduce memcpy calls that would not exist with optimization disabled. This is sort of artificial, and maybe a bit confusing, but it seems to work well enough in practice; gcc does something similar. I don't really want to add an attribute that has a different meaning from -fno-builtin. An attribute that has the same meaning is a lot less problematic: it reduces potential confusion, and solves a related problem that -fno-builtin currently doesn't really work with LTO, because we don't encode it into the IR.

Your current patch is using the "AlwaysInline" flag for SelectionDAG::getMemcpy, which forces a memcpy to be lowered to an inline implementation. In general, this flag is only supported for constant-size memcpy calls; otherwise, on targets where EmitTargetCodeForMemcpy does not have some special lowering, like the x86 "rep;movs", you'll hit an assertion failure.
And we don't really want to add an implementation of variable-length memcpy for a lot of targets; it's very inefficient on targets which don't have unaligned memory access. You could try to narrowly fix this to only apply "AlwaysInline" if the size is a constant integer, but there's a related problem: even if a memcpy is constant size in C, optimizations don't promise to preserve that, in general. We could theoretically add such a promise, I guess, but that's awkward: it would conditionally apply to llvm.memcpy where the parameter is already const, so it's not something we can easily verify.

If we added a new builtin llvm.memcpy.inline, or something like that, the end result ends up being simpler in many ways. It has obvious rules for both generating it and lowering it, which don't depend on attributes in the parent function. We could mark the size as "immarg", so we don't have to add special optimization rules for it. And if we have it, we have a "complete" solution for writing memcpy implementations in C: we make __builtin_memcpy, or a new __builtin_inline_memcpy, produce it, and it can be combined with -fno-builtin to ensure we don't produce any calls to the "real" memcpy. The downside of a new builtin, of course, is that we now have two functions with similar semantics, and potentially have to fix a bunch of places to check for both of them.

MemCpyOpt has been mentioned (the pass which transforms memcpy-like loops into llvm.memcpy). If we want it to perform some transform in circumstances where the call "memcpy" isn't available, we have to make sure to restrict it based on the target: in the worst case, on some targets without unaligned memory access, it could just act as a low-quality loop unroller. This applies no matter how the result is actually represented in IR.

Comment Actions
This makes a lot of sense, I'm totally on board to reduce entropy and confusion along the way.
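As an aside, a C memcpy written against these constraints might look like the following hedged sketch (the function name is mine; it assumes compilation with -fno-builtin-memcpy so that neither loop is turned back into a call to the "real" memcpy). Each __builtin_memcpy call has a constant size, so it is lowered inline; the proposed llvm.memcpy.inline / __builtin_inline_memcpy would make that guarantee explicit instead of relying on the flag.

```c
#include <stddef.h>

/* Freestanding memcpy sketch built from constant-size building blocks:
 * an 8-byte main loop plus a byte-at-a-time tail. */
void *my_memcpy(void *dst, const void *src, size_t n) {
  char *d = dst;
  const char *s = src;
  while (n >= 8) {
    __builtin_memcpy(d, s, 8); /* constant size: lowered to a load/store pair */
    d += 8;
    s += 8;
    n -= 8;
  }
  while (n--)
    *d++ = *s++;
  return dst;
}
```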
Adding @tejohnson to the conversation. IIUC you're offering to introduce something like __attribute__((no-builtin-memcpy)) or __attribute__((no-builtin("memcpy"))). I believe we still want to turn -fno-builtin flags into attributes so there's no impedance mismatch between the flag and the attribute, right?
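For illustration, the function-level spelling under discussion could be used like this (a sketch; the macro and function names are mine, and the attribute is guarded since it is clang-specific and was still a proposal at this point in the thread):

```c
#include <stddef.h>

/* Disable the memcpy builtin for this one function only, rather than
 * for the whole translation unit as -fno-builtin-memcpy would. */
#if defined(__clang__)
#define NO_BUILTIN_MEMCPY __attribute__((no_builtin("memcpy")))
#else
#define NO_BUILTIN_MEMCPY /* other compilers: no-op */
#endif

NO_BUILTIN_MEMCPY
void *attr_copy(void *dst, const void *src, size_t n) {
  char *d = dst;
  const char *s = src;
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}
```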
Fair enough. This patch was really to get the conversation started: I was myself not especially convinced by the approach. Hijacking the AlwaysInline parameter was a shortcut that would not work for other mem functions anyway.
This was one of the approaches I envisioned; it's much cleaner and it's also a lot more work : )
Yes, if we have to generate loops it needs to happen before SelectionDAG. In this framework -ffreestanding stays as it is - namely, it implies -fno-builtin. I still think that the semantics are somewhat surprising and unclear to the newcomer, but I guess we can't do anything about it at this point - apart from adding more documentation. Lastly, if we are to introduce new IR intrinsics, how about adding some for memcmp and bcmp? It doesn't have to be part of this patch but I think it's worth considering for consistency.

Comment Actions
I have a related patch that turns -fno-builtin* options into module flags, and uses them in the LTO backends. This addresses current issues with these flags not working well with LTO.

Comment Actions
Do you have any opinion on representing -fno-builtin using a module flag vs. a function attribute in IR? It seems generally more flexible and easier to reason about a function attribute from my perspective. But I might be missing something about the semantics of -fno-builtin that would make that representation awkward. Or I guess it might just be more work to implement, given we have some IPO passes that use TargetLibraryInfo?

Comment Actions
I think that a function attribute would be better. We generally use these flags only in the context of certain translation units, and when we use LTO, it would be sad if we had to take the most-conservative settings across the entire application. When we insert a new function call to a standard library, we always do it in the context of some function. We'd probably need to block inlining in some cases, but that's better than a global conservative setting.

Comment Actions
Using function level attributes instead of module flags does provide finer grained control and avoids the conservativeness when merging IR for LTO. The downsides I see, mostly just in terms of the engineering effort to get this to work, are:
Comment Actions
IIUC this is needed regardless of the proposed change. Correct?
Yes, this one is a bit worrying. What about something like auto FunctionTLI = ModuleTLI.createCustomizedTLI(F);, where FunctionTLI is itself a TargetLibraryInfo and calls to FunctionTLI would either return the function customizations or delegate to the module's TLI? WDYT? I'm unsure if we want to support function level attributes right away or if it's OK to be in an intermediate state with only module level attributes.
I don't think this makes it much easier - all TLI users still need to be taught to get and use this new Function level TLI. I guess for Function (or lower) passes it would be fairly straightforward, but for things like Module level or SCC passes it will require more wiring to ensure that the FunctionTLI is accessed in the appropriate places. Doable just a lot of manual changes.
I'm interested in thoughts from other developers here. The module flag change is straightforward, but unnecessary churn if we want to go the function attribute route. Which, despite the work, seems like the best long term approach...

Comment Actions
I think that this should be relatively straightforward. You just need to update AttributeFuncs::areInlineCompatible and/or AttributeFuncs::mergeAttributesForInlining by adding a new MergeRule in include/llvm/IR/Attributes.td and writing a function like adjustCallerStackProbeSize.
Interesting point. The largest issue I see is that we need TLI available from loop passes, etc., and we can't automatically recompute a function-level analysis there. We need to make sure that it's always available and not invalidated. TLI is one of those analysis passes, being derived only from things which don't change (i.e., the target triple) or things that change very rarely (e.g., function attributes), that we probably don't want to require all passes to explicitly say they preserve (not that the mechanical change to all existing passes is hard, but it's easy to forget). So I think we'd like something like the CFG-only concept in the current passes, but stronger and perhaps turned on by default, for this kind of "attributes-only" pass. (@chandlerc, thoughts on this?)

Comment Actions
Sorry I've been a bit slow to respond here... FWIW, I definitely agree here. This really is the end state we're going to find ourselves in and we should probably go directly there.
+1
Yep, this makes sense. The new PM makes this quite easy. The analysis itself gets to implement the invalidation hook, and say "nope, I'm not invalidated". In fact, in the new PM, TargetLibraryInfo already works this way. We build an instance per function and say it is never invalidated. However, they are just trivial wrappers around shared implementations, so it will still require some non-trivial changes. Will need to remove the module-based access and move clients over to provide a function when they query it, etc. IIRC, TargetTransformInfo already basically works exactly this way in both old and new PMs and we should be able to look at exactly the techniques it uses in both pass managers to build an effective way to manage these per-function.

Comment Actions
AFAIU here is a coarse plan of what needs to happen
I'm not familiar with 3 and 4 but I can definitely have a look.

Comment Actions
The patch is still WIP and needs more work.

Comment Actions
This one isn't about LTO, but rather the inliner. You could have different functions in the same module even without LTO that have incompatible no-builtin attributes. There isn't any propagation required for LTO.
I will mark that one obsolete. I can work on 4, it may just take some time to get it all plumbed.

Comment Actions
I have a question about qsort: if we provide our own implementation of qsort and replace calls to libc's qsort with our qsort, we could then fully inline the cmp function. Ideas?

Comment Actions
qsort would seem out of scope here since this is mostly about string.h style functions, more specifically memcpy/memmove. Things like memset would also fall under this category if covered in subsequent diffs. More specifically, it's functions that can benefit a lot from backend specific features (NEON, SSE, etc.), especially for larger operations. Such functions are also tricky since their performance is subject to alignment in a lot of cases (usually for SIMD operations, and SIMD operations themselves may be undesirable in small cases). At the same time they are extremely important, and even with ifunc or the like, most standard libraries would not match a highly optimized version for the target that doesn't require linking (and thus branches), where decisions about which path to take (SIMD or no SIMD; SSE or AVX or MOV REP) can be determined at compilation time. This eliminates the need for a lot of runtime alignment checks (more branches), which the standard library uses to determine whether a SIMD extension may be used and, if so, whether stack alignment is appropriate and, if not, to correct it. Eliminating that logic where possible would be desirable. (I hope my answer wasn't totally off-base)
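The comparator-inlining idea in the qsort question can be illustrated without touching libc at all: if the sort's definition is visible to the compiler (here a tiny insertion sort; all names are mine), the indirect cmp call can be inlined at each instantiation site, which a call into libc's precompiled qsort cannot be.

```c
#include <stddef.h>
#include <string.h>

/* A header-visible sort: since the definition is available, the
 * optimizer can specialize it per call site and inline cmp. */
static inline void tiny_sort(void *base, size_t n, size_t sz,
                             int (*cmp)(const void *, const void *)) {
  char *a = base, tmp[64]; /* sketch assumes element size <= 64 */
  for (size_t i = 1; i < n; ++i) {
    memcpy(tmp, a + i * sz, sz);
    size_t j = i;
    while (j > 0 && cmp(a + (j - 1) * sz, tmp) > 0) {
      memcpy(a + j * sz, a + (j - 1) * sz, sz);
      --j;
    }
    memcpy(a + j * sz, tmp, sz);
  }
}

static int int_cmp(const void *x, const void *y) {
  int a = *(const int *)x, b = *(const int *)y;
  return (a > b) - (a < b);
}
```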
Comment Actions
Checking in to see where we are on this issue. I haven't had any time to work on 4 but hopefully can start on that soon. But it needs the first part done to be effective.

Comment Actions
Thx for the heads up @tejohnson.
Left to do:
There are too many things going on in this patch and it may be worth splitting it.

Comment Actions
I had some time to work on this finally late last week. I decided the most straightforward thing was to implement the necessary interface changes to the TLI analysis to make it require a Function (without any changes yet to how that analysis operates). See D66428 that I just mailed for review. That takes care of the most widespread changes needed for this migration, and afterwards we can change the analysis to look at the function attributes and make a truly per-function TLI.

Comment Actions
D66428 went in a few weeks ago at r371284, and I just mailed the follow-on patch D67923, which adds the support into the TLI analysis to use the Function to override the available builtins (with some of the code stubbed out since we don't yet have those per-Function attributes finalized). @gchatelet where are you at on finalizing this patch? Also, I mentioned this offline but to follow up here: I think we will want an attribute to represent -fno-builtin (so that it doesn't need to be expanded out into the full list of individual no-builtin-{func} attributes, which would be both more verbose and less efficient, as well as being less backward compatible when new builtin funcs are added).

Comment Actions
I'll break this patch into several pieces. The first one is to add the no_builtin attribute, see https://reviews.llvm.org/D61634.

Comment Actions
Are you planning to add a follow-on patch that translates the various -fno-builtin* options to these attributes? Once that is done I can refine D67923 to actually set the TLI availability from the attributes and remove the clang functionality that does this from the options.

Comment Actions
The no-builtin attribute is in as of rG98f3151a7dded8838fafcb5f46e6c8358def96b8. I've started working on getting the -fno-builtin flag propagated to no-builtin function attributes. I'll ping back when it's ready.
Comment Actions
I've listed below what I believe is the status:
Done (D68028 committed at bd8791610948).
Done (also in D68028 committed at bd8791610948).
Not done yet - I can work on this.
Done (D67923 was the last patch in the series to enable this, committed at 878ab6df033d). I'm not quite sure where D71710 ([intrinsics] Add @llvm.memcpy.inline intrinsics) fits into the above list. Anything else missing?

Comment Actions
Thx for the summary @tejohnson. That would be great!
I believe it does.
Yes, when the intrinsic is in we need a way to access it from C++, so a Clang builtin is necessary. I'll take care of it.

Comment Actions
The last patch from my side just went in (D74162: [Inliner] Inlining should honor nobuiltin attributes). So I think this is now complete.