
[libc] Add few docs and implementation of strcpy and strcat.
ClosedPublic

Authored by sivachandra on Sep 20 2019, 3:33 PM.

Details

Summary

This patch illustrates some of the features, like modularity, that we want
in the new libc. A few other ideas, like different kinds of testing and
redirectors, are not yet present.

Diff Detail

Repository
rL LLVM

Event Timeline

MaskRay added inline comments. Sep 26 2019, 6:53 PM
libc/cmake/modules/LLVMLibCRules.cmake
128 ↗(On Diff #222037)

There is an existing CMake variable to specify this.
LLVM has moved to C++14 now.

157 ↗(On Diff #222037)

Prefer CMAKE_OBJCOPY. Does this target need llvm-objcopy to build?

libc/include/string.h
56 ↗(On Diff #222037)

place mem* functions together

libc/src/string/strcat/strcat.h
13 ↗(On Diff #222037)

blank line above namespace

15 ↗(On Diff #222037)

delete dest and src

libc/src/string/strcpy/strcpy_test.cpp
18 ↗(On Diff #222037)

clang-format

23 ↗(On Diff #222037)

clang-format.

This should be delete[] dest;

llvm/CMakeLists.txt
62 ↗(On Diff #222037)

Making it build by default concerns me: there are some CMake constructs which I am not sure work on non-Linux platforms. If nobody has tested that, probably hold off on this for a bit.

MaskRay added a comment (edited). Sep 26 2019, 7:37 PM

I get a bunch of Ninja targets. When the libc is ready, we will get thousands of targets like these:

build strcat: phony projects/libc/src/string/strcat/strcat
build strcat_objects: phony projects/libc/src/string/strcat/strcat_objects
build strcat_test: phony bin/strcat_test
build strcpy: phony projects/libc/src/string/strcpy/strcpy
build strcpy_objects: phony projects/libc/src/string/strcpy/strcpy_objects
build strcpy_test: phony bin/strcpy_test
build string_h: phony projects/libc/include/string_h

Is there some way not to create the targets in the "global namespace"? The target name bin/strcpy_test is also less than ideal. With thousands of tests, those executables can "overflow" people's bin/ directories.

The build system concerns me the most. I think you should probably find some capable CMake reviewers (I am not well versed in it).

libc/CMakeLists.txt
19 ↗(On Diff #222037)

Delete this. Just use CMAKE_LINKER.

On Windows, the variable defaults to set(CMAKE_LINKER "${LLVM_NATIVE_TOOLCHAIN}/bin/lld-link" CACHE FILEPATH "").

compiler-rt/lib/fuzzer uses ld -r, maybe you can read its CMakeLists.txt for some insights.

libc/cmake/modules/LLVMLibCRules.cmake
27 ↗(On Diff #222037)

Use file(COPY ...)

libc/docs/build_system.md
19 ↗(On Diff #222037)

I am not sure the names libm.a and libc.a are required by POSIX or other standards. I think this is more of a convention. Citation needed if you state this.

libc/docs/source_layout.md
70 ↗(On Diff #222037)

This is usually named utils/ in LLVM projects.

test/ and unittests/ contain lit tests and unit tests, respectively.

libc/src/string/strcpy/strcpy.h
13 ↗(On Diff #222037)

Blank line

15 ↗(On Diff #222037)

Delete dest and src.

sivachandra marked 14 inline comments as done.

Another round of updates.

I get a bunch of Ninja targets. When the libc is ready, we will get thousands of targets like these:

build strcat: phony projects/libc/src/string/strcat/strcat
build strcat_objects: phony projects/libc/src/string/strcat/strcat_objects
build strcat_test: phony bin/strcat_test
build strcpy: phony projects/libc/src/string/strcpy/strcpy
build strcpy_objects: phony projects/libc/src/string/strcpy/strcpy_objects
build strcpy_test: phony bin/strcpy_test
build string_h: phony projects/libc/include/string_h

Is there some way not to create the targets in the "global namespace"? The target name bin/strcpy_test is also less than ideal. With thousands of tests, those executables can "overflow" people's bin/ directories.

I am not aware of a way to avoid creating targets in the global namespace. But, the target names are restricted C names anyway. So, I am not sure it will be a problem. If it does turn out to be a problem, it should be straightforward to add some prefixes and formulate naming rules as required.

I have now moved the test executables out of the bin directory. But note that the test executables are excluded from the "all" target anyway.

The build system concerns me the most. I think you should probably find some capable CMake reviewers (I am not well versed in it).

This is the first patch for the libc, and I am sure I did not get everything perfect, so I share your concerns. At the same time, I am not worried: nothing is set in stone here - we can change things as required as we learn along the way.

libc/cmake/modules/LLVMLibCRules.cmake
27 ↗(On Diff #222037)

file and configure_file do not create concrete build targets. We want concrete build targets so that we have the ability to pick and choose targets.

157 ↗(On Diff #222037)

We don't need llvm-objcopy just yet, but I have a follow-up change which will require it, so I have been carrying it here as well. It is removed now; I will put it back in the next change.

libc/docs/build_system.md
19 ↗(On Diff #222037)

I couldn't pull out an official reference, but I have this for now: http://www.musl-libc.org/faq.html - See the question "Why is libm.a empty?"

libc/docs/source_layout.md
70 ↗(On Diff #222037)

I am for reducing the number of toplevel directories. So, I have modified it to this structure:

+ utils
      build_scripts
      testing
libc/src/string/strcat/strcat.h
15 ↗(On Diff #222037)

This is an internal header file, so I do not see any problem with keeping the names.

libc/src/string/strcpy/strcpy.h
15 ↗(On Diff #222037)

This is an internal header, so I do not see any problem with keeping the names.

llvm/CMakeLists.txt
62 ↗(On Diff #222037)

I am not sure what you mean by "default". libc will be built only if one specifies libc among the list of projects with -DLLVM_ENABLE_PROJECTS, or if one says "all".

We don't have anything documenting the usage of reserved identifiers. The C standard reserves identifiers that begin with a double underscore, or with an underscore followed by a capital letter, for the 'implementation', but assumes that the compiler and library are a single blob. GCC has a documented policy that double-underscore identifiers are reserved for the compiler and underscore-capital identifiers are reserved for the library (but glibc doesn't follow it, and so ended up with __block being used as an identifier in unistd.h, which broke many things for a long time). Do we want to have a stricter policy, for example that the only identifiers we use in public headers that are not standard symbols are __LLVM_LIBC_{FOO} for macros and __llvm_libc_foo for other identifiers?

libc/docs/header_generation.md
9 ↗(On Diff #222087)

I'm not sure I understand how this will work. For existing code, it's common to have to do things like #define or #undef macros like _GNU_SOURCE for a single compilation unit before including libc headers. There are some annoying differences in philosophy between the header exports: most libc implementations export all interfaces but provide a mechanism for hiding non-standard (or later-standard) ones from portable code, glibc exports only standard functions (though I don't recall which set of standards) and requires feature macros to expose others. Code that works with glibc and other libc implementations ends up jumping through hoops to support both models. Are we adding a third mechanism? Does this work for projects that want to use only standard interfaces in most compilation units but some non-standard extensions in their platform abstraction layer?

21 ↗(On Diff #222087)

This sounds like you will end up with only one set of headers per configuration, so you lose the ability to have different projects using the same generated headers but enforcing different sets of standards compliance in their use of the interfaces.

libc/docs/implementation_standard.md
79 ↗(On Diff #222087)

There are a few things that are unclear to me in this description:

  1. How do we express the standards to which an entrypoint conforms? For example, a function defined in C11 or POSIX2008?
  2. How do we differentiate between things that we want to be preemptible versus things that we don't? If we want to call the preemptible version of a symbol in other libc code, will we have the ::foo symbol visible at library build time?
  3. How are we exposing information for building subsets of the implementation that avoid dependencies on certain platform features? For example, a CloudABI-compatible mode that does not provide (or consume) any functions that touch the global namespace.
libc/include/ctype.h
14 ↗(On Diff #222087)

BSD libcs include a cdefs.h that provides macros such as __BEGIN_DECLS and __END_DECLS wrapping this kind of pattern, and __restrict for language keywords that should not be in headers in certain versions of the standard. It looks as if __support/common.h is the equivalent - we should probably have an explicit rule that this header is included first in all libc headers and provide sensible helpers there.

The libc/docs/*.md files should be .rst - there are no other .md docs in LLVM; they were all converted back to .rst.

theraven added inline comments. Sep 27 2019, 3:10 AM
libc/include/math.h
20 ↗(On Diff #222087)

The long long functions should be exposed only for C99 or later and a version of C++ that supports the long long type.

libc/include/string.h
12 ↗(On Diff #222087)

Namespace pollution. The standard expects size_t to be exposed by this header, but not the other types in stddef.h. Software that relies on this pollution is non-portable (and will break on existing libc implementations that follow the standard).

18 ↗(On Diff #222087)

Missing restrict qualifiers on this and many other standard functions in this file. The standard defines memcpy as:

void *memcpy(void * restrict s1,
const void * restrict s2,
size_t n);

I'm surprised that clang doesn't warn about the declaration of memcpy with incorrect types - it usually notices missing qualifiers.

libc/src/string/strcpy/strcpy_test.cpp
16 ↗(On Diff #222087)

Having the libc test suite depend on a working C++ runtime and standard library is likely to make unit testing the library difficult. The implementation of std::string almost certainly depends on several libc functions and new and delete probably do as well. This means that we can test the namespaced libc functions in an environment that already has a libc, but we can't easily extend these tests to run in a process (or other environment) that has only LLVM libc.

I did consider such a layout. However, the same kind of distinction can be achieved by two things:

  1. Next to the implementation and/or target listing for a function, call out the standard/extension that prescribes it.
  2. When composing the target for libc.a for a platform, group the functions per standards/extensions they come from.

    Another point which made me not pick this layout: I agree that as a catalog, the structure you are suggesting could be more meaningful. But as a developer, my mental model is much simpler if all the functions from a header file are grouped in one place, irrespective of the standard they come from.

Which standard the function is from could be useful to think about while writing/reading the code -- it might be good to, in general, only use functions from the same or "lower" standard versions when implementing functions in a "higher" standard. Although...if the namespacing mechanism described here is kept, and calls are all made only to other llvm_libc-namespace functions, it's less problematic to use nonstandard functions. Doing so then won't pollute the linker namespace, unlike if you're calling C toplevel functions.

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing. I took this:

cat /tmp/f.c 
int __attribute__((section(".llvm.libc.entrypoint.foo"))) foo ()  {
  return 0;
}

clang -target nvptx64 -O3 -S -o - /tmp/f.c
//
// Generated by LLVM NVPTX Back-End
//
 
.version 3.2
.target sm_20
.address_size 64
 
  // .globl foo
 
.visible .func  (.param .b32 func_retval0) foo()
{
  .reg .b32   %r<2>;
 
  mov.u32   %r1, 0;
  st.param.b32  [func_retval0+0], %r1;
  ret;
 
}

so maybe that part will just work?

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme seems over-complex, IMO, but I don't have a concrete counter-proposal in mind.

The libc/docs/*.md files should be .rst - there are no other .md docs in LLVM; they were all converted back to .rst.

So, does it mean, this has been reversed: https://reviews.llvm.org/D44910

lebedev.ri added a comment (edited). Sep 27 2019, 10:25 AM

The libc/docs/*.md files should be .rst - there are no other .md docs in LLVM; they were all converted back to .rst.

So, does it mean, this has been reversed: https://reviews.llvm.org/D44910

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, i suppose that patch needs to be reverted, to avoid misdirecting new docs.

The libc/docs/*.md files should be .rst - there are no other .md docs in LLVM; they were all converted back to .rst.

So, does it mean, this has been reversed: https://reviews.llvm.org/D44910

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, i suppose that patch needs to be reverted, to avoid misdirecting new docs.

I think you should put *.md in quotes like this to see results:

$> find -iname "*.md" | grep "docs"

sivachandra marked 9 inline comments as done.

Address theraven's comments.

libc/docs/header_generation.md
9 ↗(On Diff #222087)

This is kind of related to the other question you have asked below. I have tried to address the two questions together below.

21 ↗(On Diff #222087)

Yes, that is the general direction in which this is going. We are making the headers for a configuration much simpler to navigate at the cost of having multiple sets of headers. In this day and age, I do not think forcing multiple sets of header files is a bad thing. Note also that users' build systems already have the knowledge and capability to handle multiple configurations. Hence, we are not making the build systems any more complicated.

This is not what traditional libcs have done. So, yes we are introducing a "third mechanism". At the same time, one can also argue that we are doing away with such mechanisms as we require that each configuration have its own set of header files.

libc/docs/implementation_standard.md
79 ↗(On Diff #222087)

Answers as per numbers in the comment.

  1. We can do it in the public header file, or in the implementation .cpp files. Or, at both the places. Did I understand this question correctly? Is this question related to or similar to #3 below? Like, are you asking as to how we will add a new function without breaking the old standard? If yes, then the %%include mechanism is present to accommodate such scenarios: we start with a baseline standard and %%include new standards until they become baseline.
  2. I frankly do not have a good answer and would prefer someone who cares about this use case to contribute. Maybe a real-world example can help me think about this more clearly.
  3. For the header files, we have the %%include_file mechanism. For the library files, we pick and choose to compose a suitable library target. For example, like the one in lib/CMakeLists.txt of this patch.

At some level, my answers are only guessing about how things would evolve. So, I wouldn't be surprised if my answers here aren't valid or relevant even in say 3 months from now.

libc/include/ctype.h
14 ↗(On Diff #222087)

All of this is great, so I have incorporated it now.

libc/include/math.h
20 ↗(On Diff #222087)

We are C17 and higher already. Should we still have such conditions?

libc/include/string.h
12 ↗(On Diff #222087)

Fixed.

18 ↗(On Diff #222087)

I used a script which scanned the standard document to produce these headers and missed restrict. I guess clang did not find it because we do a C++ compilation of the implementations.

libc/src/string/strcpy/strcpy_test.cpp
16 ↗(On Diff #222087)

Using gtest already brings in the C++ runtime. Should that also be avoided?

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing. I took this:

cat /tmp/f.c 
int __attribute__((section(".llvm.libc.entrypoint.foo"))) foo ()  {
  return 0;
}

clang -target nvptx64 -O3 -S -o - /tmp/f.c
//
// Generated by LLVM NVPTX Back-End
//
  
.version 3.2
.target sm_20
.address_size 64
  
  // .globl foo
  
.visible .func  (.param .b32 func_retval0) foo()
{
  .reg .b32   %r<2>;
  
  mov.u32   %r1, 0;
  st.param.b32  [func_retval0+0], %r1;
  ret;
  
}

so maybe that part will just work?

The way it is set up, it is very ELF-ish (or should I be saying elvish? :)

Very likely, non-ELF systems like Windows and nvptx will require some other mechanism. I am working on the Windows story in parallel, but I am totally new to nvptx. If you have any suggestions on how to make it work there, we can surely incorporate them.

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme seems over-complex, IMO, but I don't have a concrete counter-proposal in mind.

Another problem with the .llvm.libc.entrypoint.strcpy + objcopy --add-symbol scheme is that there is no way to change the st_size field. This can impact symbolization precision when debugging information is stripped. This makes me wonder what advantages the namespace __llvm_libc:: brings. @theraven noted the disadvantage:

This means that we can test the namespaced libc functions in an environment that already has a libc, but we can't easily extend these tests to run in a process (or other environment) that has only LLVM libc.

libc/cmake/modules/LLVMLibCRules.cmake
164 ↗(On Diff #222243)

,function -> ,function,global, otherwise the added symbol is STB_LOCAL.

libc/include/string.h
18 ↗(On Diff #222087)

So this should be

void *memcpy(void *__restrict, const void *__restrict, size_t);

to be usable in C++

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme seems over-complex, IMO, but I don't have a concrete counter-proposal in mind.

For what it's worth, FreeBSD libc does a similar namespacing trick in C. The internal symbols are underscore prefixed and they're exported as aliases (typically as weak aliases, to allow them to be preempted by other implementations, and to explicitly give names to callers for the preemptible and non-preemptible versions). Making symbols preemptible isn't really possible with PE/COFF, because the linkage model has a stronger concept of where definitions come from than ELF (at least, in the absence of symbol versions in ELF). On ELF platforms, we should support symbol versions as early as possible, because adding them later is an ABI break, even if we change no code.

theraven added inline comments. Sep 29 2019, 12:39 AM
libc/docs/header_generation.md
21 ↗(On Diff #222087)

I think that's fine when you consider building libc and shipping a single configuration, but a lot of projects that I've seen have different feature macros defined for different components. Are they now expected to rebuild the libc headers multiple times for each module? Do they need to drive that from their own build system (which is often not CMake)? It's even more complex when a project contains C89, C11, and C++11 files - these all have subtly different sets of requirements for the functions exposed in libc headers: do we require that they build a set for each? Or do you imagine that anyone shipping C11 will ship a powerset of headers?

The reason that we don't do the separate header thing in libcs today is that we end up with a huge explosion of the set of things that are supported. For example, in FreeBSD we support 3 versions of the C standard, 3 or 4 versions of POSIX, GNU and BSD extensions. Almost any combination of these is allowed, so we're looking at 20-30 possible sets of header files, before we start considering restricted subsets for sandboxed applications, custom configurations for sanitisers, and so on.

libc/docs/implementation_standard.md
79 ↗(On Diff #222087)

Compliance with overlapping standards is one of the core reasons that we make symbols preemptible in existing libc implementations. C reserves a set of identifiers for the C standard and a C89 program is free to use any other identifier. If POSIX2008 or C11 use those identifiers then libc should not call their implementations from internal code, but should allow theirs to be called. Pthreads is subtly different, where users are allowed to bring their own pthreads implementation and libc should correctly consume it.

The third question is more in terms of layering. For example, I recently tried building libc++ to work in kernel space. This is incredibly hard, because a lot of things depend on locale, for example, and pull in iostream dependencies, so you end up needing a load of things that have no real meaning in kernel space. The same is true for sandboxed environments, where things like open may not exist though openat may (important when something like locale support in libc needs to open files: for a sandboxed deployment we should support either baking those files into the binary or not exposing the symbols that depend on them working correctly).

If we don't start out with some declarative definitions of things like C89 / C99 / POSIX2008 compliance, then any kind of automatic tooling to generate a pure C11 library (no POSIX) or to ensure that the correct symbols are preemptible will be very hard.

libc/include/math.h
20 ↗(On Diff #222087)

We are building as C++17. There is no C17. I would hope that we're still aiming for the headers to be consumed by C89 programs, because there are a huge number of those in the world.

libc/include/string.h
16 ↗(On Diff #222243)

Does this work correctly with the inclusion guards? I don't see the stddef.h implementation here, so I don't know what those macros do (they don't do anything in FreeBSD libc's stddef.h; they appear to do something in libc++'s cstddef, though I'm not entirely sure what).

The FreeBSD solution to this is to define types like __size_t, use these in headers that are supposed to use size_t in function prototypes but not provide a definition of size_t (yes, there are several such headers in the C spec; it's annoying, but that's what the standard says), and then add a guarded typedef to turn that into size_t in this header, stddef.h, and a couple of other places.

jyknight added a comment (edited). Sep 30 2019, 9:11 AM

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme seems over-complex, IMO, but I don't have a concrete counter-proposal in mind.

For what it's worth, FreeBSD libc does a similar namespacing trick in C. The internal symbols are underscore prefixed and they're exported as aliases (typically as weak aliases, to allow them to be preempted by other implementations, and to explicitly give names to callers for the preemptible and non-preemptible versions). Making symbols preemptible isn't really possible with PE/COFF, because the linkage model has a stronger concept of where definitions come from than ELF (at least, in the absence of symbol versions in ELF). On ELF platforms, we should support symbol versions as early as possible, because adding them later is an ABI break, even if we change no code.

Actually, now that I think I understand the existing proposal better, I believe it's broken, as well as confusing. It's getting the same effect as using __attribute__((alias)), except harder to understand. But it's not ok to have a single object file expose both a strong public alias and an internal alias, for any function that's not in the baseline ISO C standard. It would be OK if the aliases were weak, or if they were strong but exposed by a separate .o file. (In any case, I'd like to suggest not using an external objcopy invocation to achieve this.)

For an example of why this is wrong - consider if libc has an 'open.o' object file, which defines __llvm_libc::open, and has also had the alias open added to it with objcopy. Internally, if libc needs to call open, it calls __llvm_libc::open, which pulls in that open.o file, which then also defines the global 'open' function. There would then be a duplicate symbol error for any standard C (e.g. non-POSIX) program which defines its own open function.

libc/docs/header_generation.md
21 ↗(On Diff #222087)

IMO, it makes sense not to bother making C99/C11-only functions conditionally available. The libc headers still ought to be compatible in C89 mode, but I don't see that there's really much point to excluding declarations for new functions like 'strtof', 'aligned_alloc', etc, when building in older standards modes.

The same most likely can apply to old POSIX versions.

However, I do think it is quite likely to be necessary to preserve the ability to conditionally disable the various standards "layers". That is -- for all the headers specified in ISO C, you should be able to disable the declarations added by POSIX (and extensions) with a define. And for all the headers specified in POSIX, you should be able to disable the declarations added by the GNU/BSD/etc extensions with a define.

libc/include/string.h
16 ↗(On Diff #222243)

Both Clang and GCC ship a stddef.h which supports these defines -- and it's expected to be first in the include path, before libc's headers.

For some reason, FreeBSD and some other platforms remove these compiler-shipped files and replace them with their own for their libc.

Exactly what the contract should be between the compiler headers and the libc headers could be a larger discussion, but for now, I'm strongly in favor of assuming that we're using the existing stddef.h from clang/gcc -- in which case this code will work correctly.

sivachandra marked 3 inline comments as done. Sep 30 2019, 12:06 PM

Actually, now that I think I understand the existing proposal better, I believe it's broken, as well as confusing. It's getting the same effect as using __attribute__((alias)), except harder to understand. But it's not ok to have a single object file expose both a strong public alias and an internal alias, for any function that's not in the baseline ISO C standard. It would be OK if the aliases were weak, or if they were strong but exposed by a separate .o file. (In any case, I'd like to suggest not using an external objcopy invocation to achieve this.)

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

For example of why this is wrong -- consider if libc has an 'open.o' object file, which defines __llvm_libc::open, and has also had the alias open added to it with objcopy. Internally, if libc needs to call open, it calls __llvm_libc::open, which pulls in that open.o file, which then also defines the global 'open' function. Then there thus be a duplicate symbol error for any Standard C (e.g. non-posix) program which defines its own open function.

I am open to making all public symbols weak. Should we start with that, or should we make them weak on an as-needed basis?

libc/docs/header_generation.md
21 ↗(On Diff #222087)

With respect to excluding extension standards, I will go back to my earlier comment here: The %%include mechanism gives us a way to do it, but calls for a header set per configuration. Note that the libc source tree still only has a single set of files.

True that there will be an "explosion of configurations", but I expect a large number of these to be downstream configurations. Off-the-shelf, we should probably only provide what a normal Linux or a Windows development environment needs.

libc/include/math.h
20 ↗(On Diff #222087)

We have said in the proposal that we will only support C17 and higher: http://llvm.org/docs/Proposals/LLVMLibC.html

libc/include/string.h
16 ↗(On Diff #222243)

I was convinced by jyknight's reasoning last time. If not anything else, I like that it keeps our implementation surface smaller.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these to be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" char *__llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these to be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" char *__llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

Yes, this solution was considered. But, it does not really solve the problem you brought up; we will still need to make the alias weak. It of course eliminates the objcopy step, but we end up with some kind of source code mismatch/inconsistency (for lack of a better word). That is, we will have the main function in the global namespace, while the support/helper functions will live inside of a namespace. I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

The libc/docs/*.md files should be .rst - there are no other .md docs in llvm; they were all converted back to .rst.

So, does it mean, this has been reversed: https://reviews.llvm.org/D44910

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, i suppose that patch needs to be reverted, to avoid misdirecting new docs.

I think you should put *.md in quotes like this to see results:

$> find -iname "*.md" | grep "docs"

Aha. But still, all but a few docs were migrated back to RST.
I'm not sure it's great to add new ones in the format being migrated from.

Aha. But still, all but a few docs were migrated back to RST.
I'm not sure it's great to add new ones in the format being migrated from.

Can you kindly point me to some reference which says rst is preferred over markdown? Not that I do not trust you, but I was told by others to prefer markdown, and so I used markdown. I want to have a reference to point to if I am questioned in future about this.

Aha. But still, all but a few docs were migrated back to RST.
I'm not sure it's great to add new ones in the format being migrated from.

Can you kindly point me to some reference which says rst is preferred over markdown? Not that I do not trust you, but I was told by others to prefer markdown, and so I used markdown. I want to have a reference to point to if I am questioned in future about this.

See e.g.
https://lists.llvm.org/pipermail/llvm-dev/2019-June/133038.html
https://reviews.llvm.org/D66305
I think there was some other disscussion but i can't find it.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these to be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" char *__llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

Yes, this solution was considered. But, it does not really solve the problem you brought up; we will still need to make the alias weak.

Correct, my suggestion above was only to reduce the complexity.

However, back to the weak-alias question -- I have a different suggestion for solving that.

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.

It of course eliminates the objcopy step, but we end up with some kind of source code mismatch/inconsistency (for lack of a better word). That is, we will have the main function in the global namespace, while the support/helper functions will live inside of a namespace.

Sort of... At the source level, you can keep using namespaces as usual -- you don't need to move the function to the global namespace. E.g., given

namespace __llvm_libc {
extern "C" char *__llvm_libc_strcat(...) {
  ...
}
}

the __llvm_libc_strcat function is within the namespace __llvm_libc, as far as C++ name-resolution semantics are concerned. It's only at a lower level, in the object-file linkage semantics, that it is "as if" it was in the global namespace -- the function has been given the symbol name __llvm_libc_strcat instead of a C++ namespace-mangled name.

I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

I do think it's both complicated and confusing to have build rules invoking objcopy to post-process .o files after compilation. If that was necessary, it'd be one thing, but since it's not, I'd say the additional complexity is not justified.

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.

This approach is fairly general and can be adopted with the existing setup as well. In fact, I can imagine that irrespective of the approach we take, we will end up with patterns like this.

Sort of... At the source level, you can keep using namespaces as usual -- you don't need to move the function to the global namespace. E.g., given

namespace __llvm_libc {
extern "C" char *__llvm_libc_strcat(...) {
  ...
}
}

the __llvm_libc_strcat function is within the namespace __llvm_libc, as far as C++ name-resolution semantics are concerned. It's only at a lower level, in the object-file linkage semantics, that it is "as if" it was in the global namespace -- the function has been given the symbol name __llvm_libc_strcat instead of a C++ namespace-mangled name.

This approach was also considered. (I missed the __ prefixes in my considerations of course.)

I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

I do think it's both complicated and confusing to have build rules invoking objcopy to post-process .o files after compilation. If that was necessary, it'd be one thing, but since it's not, I'd say the additional complexity is not justified.

This is in subjective territory...

I prefer not to have the repetitiveness of llvm_libc::llvm_libc_strcpy or anything similar. One can suggest using a macro to avoid the repetitiveness of course, but I'd rather avoid using a macro at every call site. To me, that seems neither pretty nor necessary. I prefer to keep the implementation layer as much as possible like a normal C++ library. The post-processing is done to make this C++ library usable as a C library as well.

I agree that, "post-processing using objcopy" does sound complex as it kind of gives a feel of a complex binary surgery. But, what the current proposal is doing is to just add an alias symbol using objcopy. That is neither invasive, nor complex as "post-processing using objcopy" sounds. Moreover, as I have said earlier, developers do not need to deal with objcopy on a regular basis as it is hidden behind a build rule.

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.

This approach is fairly general and can be adopted with the existing setup as well. In fact, I can imagine that irrespective of the approach we take, we will end up with patterns like this.

Yes. This is totally separate from whether to use objcopy or not. I'm sorry the two concerns were merged into one comment thread.

I agree that, "post-processing using objcopy" does sound complex as it kind of gives a feel of a complex binary surgery. But, what the current proposal is doing is to just add an alias symbol using objcopy. That is neither invasive, nor complex as "post-processing using objcopy" sounds. Moreover, as I have said earlier, developers do not need to deal with objcopy on a regular basis as it is hidden behind a build rule.

I continue to disagree that that's at all a good trade-off, but since we're only going around in circles now, I'll drop it in the name of progress.

Move markdown docs to reStructuredText; Will add conf.py and build targets in a later pass.

MaskRay added a comment.EditedOct 3 2019, 12:08 AM

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

https://reviews.llvm.org/D67867#1686834 mentioned that the objcopy scheme will break the st_size fields of symbols.

Can we do things the other way round? No namespace, no __llvm_libc_ prefix. Add the -ffreestanding compiler flag and just define strstr, open, etc. in the global namespace. In unit tests, invoke llvm-objcopy --redefine-syms= to rename strstr to __llvm_libc_strstr, and call __llvm_libc_strstr in the tests. For functions that affect global state, gtest will not be suitable. It is good to think about how the tests will be organized at this early stage.

Since weak aliases have been mentioned. I'd like to say that a weak_alias macro will be needed, probably not now, but it should be taken into consideration so that we will not change things back and forth.

  1. Avoid namespace pollution. ISO C interfaces should not pull in POSIX functions, e.g. fopen can call strstr, strchr (ISO C) but not open (POSIX). To call open, (a) define STV_HIDDEN __open (b) make open a weak alias of __open (c) call __open in fopen.
  2. Weak definition for dummy implementation. Some features do not necessarily pull in functions from other components. Create a static dummy function and create a weak definition. When used as a variable, this can be used to check whether some features are unavailable.
  3. Pure aliases. Required either by standard or ABI compatibility purposes. glibc uses this a lot.

As an example when this will be used: in the implementation of strcpy, call STV_HIDDEN __memcpy (I'm not saying implementing strcpy on top of memcpy is efficient. This is merely an example), not STV_DEFAULT memcpy, to make it explicit symbol interposition is not desired. This matters if you ever support dynamic linking and support libc.so.

In the description, a brief mention of what will be built and how to test will be helpful to reviewers and subscribers. It is not easy to figure out where the libc build artifacts are located.

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

Documentation is a good idea. At this point, we want the implementation to be as much as possible like a normal C++ library and allow mixing this libc with other libcs in various scenarios like unit testing, differential testing etc. This is the first patch; I am sure a lot will change in the coming months. I want to write up things like this after we at least get a clear idea of the Windows strategy (I am working on it).

https://reviews.llvm.org/D67867#1686834 mentioned that the objcopy scheme will break the st_size fields of symbols.

Yes, st_size is broken in the sense that it doesn't show the size of the aliasee. But, it almost never matters practically.

Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2019, 10:10 PM
This revision was automatically updated to reflect the committed changes.
MaskRay added a comment.EditedOct 7 2019, 1:11 AM

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

  • Proper cmake review
  • Detailed summary. The commit message should at least reference some previous discussions on the mailing list, especially since this is a brand new project.
  • The llvm-objcopy issue definitely needs more consideration. This may interfere badly with instrumentation tools, support for which is a selling point of the llvm libc.
  • Why __llvm_libc / .llvm.libc.entrypoint. are necessary is not well explained.
  • Some necessary options -ffreestanding -nostdinc are absent.
  • C++ should not get #define __restrict restrict
  • ...

I understand that you want agile development, but I just did not want 1000 lines of code to be committed and then have 700 of them replaced very quickly in the next month. If you read the source code of musl, you'll find that a large chunk of code still remains untouched since the initial import ("initial check-in, version 0.5.0").

This is a post-commit review anyway so many points are probably moot.

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

Can we do things the other way round? No namespace, no __llvm_libc_ prefix. Add the -ffreestanding compiler flag and just define strstr, open, etc. in the global namespace. In unit tests, invoke llvm-objcopy --redefine-syms= to rename strstr to __llvm_libc_strstr, and call __llvm_libc_strstr in the tests. For functions that affect global state, gtest will not be suitable. It is good to think about how the tests will be organized at this early stage.

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

I don't know if it matters anymore because this was committed but I agree with @MaskRay. His suggestion of using llvm-objcopy to rename the symbols for tests makes much more sense to me. I haven't seen a libc that does testing in an ergonomic way and this suggestion seems the best to me, frankly.

There is a lot going on here, it's hard to follow it all in one patch, and I think some comments got lost because of this. I feel like a lot of big design decisions were made here, did I miss something on the libc-dev mailing list?

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

I don't know if it matters anymore because this was committed but I agree with @MaskRay. His suggestion of using llvm-objcopy to rename the symbols for tests makes much more sense to me. I haven't seen a libc that does testing in an ergonomic way and this suggestion seems the best to me, frankly.

There is a lot going on here, it's hard to follow it all in one patch, and I think some comments got lost because of this. I feel like a lot of big design decisions were made here, did I miss something on the libc-dev mailing list?

FWIW -- llvm project policy is that both pre and post-commit reviews must be addressed.

That this is now committed does not change anything w.r.t. needing to respond to outstanding comments.

That this is now committed does not change anything w.r.t. needing to respond to outstanding comments.

Absolutely!

I have heard everyone, and will try my best to address everything. FWIW, I am not happy myself that I am unable to address jyknight's concerns on post-processing in an acceptable fashion.

lygstate added inline comments.
libc/trunk/include/ctype.h
18

There are no DLL export annotations here for MSVC, so is this only being built as a static C library? Is building it as a shared library not being considered?

Just as an FYI, this patch breaks LLVM_INCLUDE_TESTS=OFF for me:

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3)  -DLLVM_ENABLE_PROJECTS=all ../llvm
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

$ cd .. && rm -rf build && mkdir -p build && cd build 

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3) -DLLVM_ENABLE_PROJECTS=all -DLLVM_INCLUDE_TESTS=OFF ../llvm
...
-- Configuring done
CMake Error at /home/nathan/src/llvm-project/libc/cmake/modules/LLVMLibCRules.cmake:264 (add_dependencies):
  The dependency target "gtest" of target "strcpy_test" does not exist.
Call Stack (most recent call first):
  /home/nathan/src/llvm-project/libc/src/string/strcpy/CMakeLists.txt:11 (add_libc_unittest)


CMake Error at /home/nathan/src/llvm-project/libc/cmake/modules/LLVMLibCRules.cmake:264 (add_dependencies):
  The dependency target "gtest" of target "strcat_test" does not exist.
Call Stack (most recent call first):
  /home/nathan/src/llvm-project/libc/src/string/strcat/CMakeLists.txt:12 (add_libc_unittest)


-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

$ git revert -n 4380647e79bd80af1ebf6191c2d6629855ccf556

$ cd .. && rm -rf build && mkdir -p build && cd build 

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3) -DLLVM_ENABLE_PROJECTS=all -DLLVM_INCLUDE_TESTS=OFF ../llvm
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

This is as of r374191.