This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add few docs and implementation of strcpy and strcat.
ClosedPublic

Authored by sivachandra on Sep 20 2019, 3:33 PM.

Details

Summary

This patch illustrates some of the features, like modularity, that we want
in the new libc. A few other ideas, like different kinds of testing, redirectors,
etc., are not yet present.

Event Timeline

sivachandra created this revision.Sep 20 2019, 3:33 PM

There are a bunch of design docs and CMake rules in this change. They are not perfect yet and have a very Linux/ELF feel to them. Also, there are no tests in this patch. My intent is to get a signoff on the general direction. When I can get it into a shape we agree on, I will add unit tests.

Some of the other ideas like redirectors, different kinds of testing etc. are dependent on this initial change. So, once it lands, I can prepare patches showcasing those ideas as well. I also plan to start working on the Windows story to eliminate the Linux/ELF-isms that are currently baked into this patch.

dlj added a comment.Sep 20 2019, 4:41 PM

After a quick once-over, I do have a few nit-picky comments. I plan to do another pass, but these are some things that were a little bit surprising up front.

libc/docs/header_generation.md
8

Unrequested opinion: my personal complaint in this area is that reading the files (as a human) requires understanding the entirety of the configuration macros, plus usually double-checking clang -E -dM, just to figure out what the preprocessor result will be. Generating files up-front at least carves out the internal configuration part.

I'm not sure how to word that eloquently, but it might be possible to say something a bit more concrete than the current wording.

17

It might be good to describe .h.in files, too.

(Obligatory bikeshed: and maybe give them a different extension to disambiguate from CMake-configured files?)

39

Wording nit: the distinction between "parameter" and "argument" caught me a bit by surprise: the "arguments" here are what I would call "parameters" -- things that parameterize the target; the "parameters" here are free-floating values, but they could be passed as what I would call "arguments." (This is mostly following the way these terms are used in C++: functions are parameterized, call expressions have arguments.)

I really don't want to get into a bikeshed argument here, though. If you don't agree with my reasoning, I can live with this as-is, too. :-)

libc/include/null.h.in
21 ↗(On Diff #221129)

Do you think it would be reasonable to check for availability of __null, and use it if possible?

My reasoning: this would be different in Clang's AST, and that could lead to subtle differences (for example, when using the nullPointerConstant AST matcher).

libc/scripts/gen_hdr.py
23

It might be helpful to add a "comment" command. For example, looking at the .h.in files, it's not immediately obvious that they are used for something other than CMake configure_file, so comments might help.

38

(Side note: this looks like it could be a function with @contextlib.contextmanager instead, which might be simpler: https://docs.python.org/3/library/contextlib.html#contextlib.contextmanager)

72

Nit: maybe _fatal_error instead? It seems reasonable to assume _report_error would only report.

jyknight added inline comments.
libc/include/ctype.h
17

All the argument names should either be removed or put in the implementation namespace, e.g. named __c, etc. Since they're not specified by the standard, users can #define them to something else.

Removed is probably best.
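
For illustration (these exact declarations are not from the patch), either form avoids the problem, since a plain parameter name could be #define'd by the including program:

int isalpha(int);     /* parameter name removed */
int isdigit(int __c); /* or a name in the implementation-reserved namespace */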

libc/include/gcc_clang_size_t.h.in
1 ↗(On Diff #221129)

It would be good to be clear as to who is responsible for providing which headers.

Right now, clang expects to provide stddef.h, and thus size_t, so this shouldn't be provided by the libc.

I note that some BSD libcs remove some of clang's provided headers, and other libcs like musl want their headers in the wrong order (libc first, then clang builtins), but I don't know why they do that, or why they want to do that. And I think that's probably not a good thing to do for the llvm libc.

libc/include/math.h
17

Same issue with all these argument names.

libc/include/null.h.in
1 ↗(On Diff #221129)

Same issue here -- duplicative of clang's stddef.h.

libc/include/stdbool.h
1 ↗(On Diff #221129)

Duplicative of clang's stdbool.h.

MaskRay added inline comments.
libc/CMakeLists.txt
20

We can reuse CMAKE_AR.

libc/include/ctype.h
17

Removed is probably best.

+1

libc/include/gcc_clang_size_t.h.in
1 ↗(On Diff #221129)

See https://reviews.llvm.org/D65699#1614203 for why musl has a different search path order from glibc.

I agree we don't need gcc_clang_size_t.h.in for now.

libc/include/math.h
19

I'd prefer deleting empty lines between similar functions.

libc/include/null.h.in
21 ↗(On Diff #221129)

The GNU extension is not necessary:

void foo(int a) { puts("int"); }
void foo(long a) { puts("long"); }
void foo(void* a) { puts("void*"); }

both foo(__null) and foo(0L) resolve to foo(long).

C++03: an integral constant expression rvalue of integer type that evaluates to zero
C++11: an integer literal with value zero, or a prvalue of type std::nullptr_t

0L is clearly a conforming definition for all conforming compilers. This is also well tested by musl.
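
For reference, a minimal sketch of a conforming definition along these lines (illustrative only; not necessarily what null.h.in should contain):

/* 0L is a valid null pointer constant in C++; in C the usual spelling is a
   cast to void*. */
#ifdef __cplusplus
#define NULL 0L
#else
#define NULL ((void *)0)
#endif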

libc/src/string/strcpy/strcpy.cpp
14

superfluous empty line.

17

Such delegation is inefficient. The call to memcpy will go through GOT/PLT. You need a hidden alias to memcpy to eliminate the PLT call.
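
For reference, a minimal sketch of the hidden-alias pattern being suggested (names are illustrative, not from this patch). On ELF targets, the public memcpy stays preemptible while libc-internal callers bind to the hidden alias at static link time and avoid the PLT:

#include <stddef.h>

extern "C" void *memcpy(void *dst, const void *src, size_t n) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  for (size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}

// Hidden alias for internal callers; binds directly to the definition above.
extern "C" __attribute__((visibility("hidden"), alias("memcpy")))
void *__llvm_libc_memcpy_internal(void *dst, const void *src, size_t n);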

MaskRay removed a subscriber: MaskRay.
lebedev.ri retitled this revision from Add few docs and implementation of strcpy and strcat. to [libc] Add few docs and implementation of strcpy and strcat..Sep 24 2019, 2:56 AM
lebedev.ri added a subscriber: lebedev.ri.

This patch is subscribed to llvm-commits, not libc-commits; is that correct? (You may need to bump the phab people to set up correct auto-subscribe rules for this new repo.)
Also, @sivachandra, please, please, can libc patches all start with the [libc] prefix in their patch name, so it is less painful to read through the -commits lists and the commit log?

seiya added a subscriber: seiya.Sep 24 2019, 3:10 AM
sivachandra marked 18 inline comments as done.

Address comments

I addressed most comments. For the rest, I asked for clarification/suggestion.

libc/docs/header_generation.md
8

I made a few attempts to address this comment, but scrubbed them all out as I do not think I was doing a good job. Do you have any suggestions, or some specific points you want to see covered here?

17

Changed the extension to ".h.inc". Also, added a note about the "begin" command and described the ".h.inc" files over there.

39

I agree there was some mix-up. I have made minor changes now. How does it look?

libc/include/null.h.in
21 ↗(On Diff #221129)

Removed this file for now.

libc/scripts/gen_hdr.py
23

I have now changed the extension to ".h.inc". Also added a comment syntax for the ".h.def" files.

In a ".h.inc" file, there can be two kinds of comments:

  1. Comments to describe what the file is for/about - Such comments can be put before the %%begin command. Anything goes before the %%begin command.
  2. Comments which describe the actual contents - These comments should also go into the generated header file, so they should be normal C/C++ style comments.

On the other hand, one can have comments in the ".h.def" file attached to a command. Such comments should not be copied into the generated header file, so I added a comment syntax for that: prefix the comment lines with "<!>", and the rest of the line will be ignored.

libc/src/string/strcpy/strcpy.cpp
17

Since llvm-libc does not yet provide the memcpy implementation, we use ::memcpy from the system libc. When llvm-libc provides the memcpy implementation, this should be changed to a call to llvm_libc::memcpy.
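
For context, a rough sketch of the delegation being discussed (not the exact code in the patch; the namespace and header usage here are simplified):

#include <string.h> // system libc ::memcpy and ::strlen, for now

namespace llvm_libc {

char *strcpy(char *dest, const char *src) {
  // Copy the string including its terminating null byte.
  return static_cast<char *>(::memcpy(dest, src, ::strlen(src) + 1));
}

} // namespace llvm_libc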

MaskRay added inline comments.Sep 25 2019, 8:48 AM
libc/include/math.h
19

I mean

double asin(double); // no blank line below
float asinf(float);
long double asinl(long double);

Some high level comments on your filesystem layout standard:

I think one directory per function is going to be annoying, not helpful. Most libc functions are expressible as a single function of less than 100 LOC. I'd suggest, instead, to have e.g. src/stdlib/strcpy.cc, implementing the strcpy function. It seems like a good rule that every file implements one public function. If additional files are needed for implementation clarity, I'd suggest to put such helper code into something like src/stdlib/internal/strcpy_internal.cpp (obviously not _actually_ for strcpy)

This layout does not have room for splitting across what I think are likely to be more useful boundaries. In particular, I think it'd be good to have at the top level whether the function is ISO C standard, POSIX standard, or a private extension. Another thing that may be good to somehow express is a distinction between generally-portable "library-code-only" nonstandard functions, and libc functions which are explicitly exposing non-portable kernel-specific functionality. In the former would go things like asprintf, and in the latter would go things like kqueue or epoll_create.

That is, maybe something like:

libc/
  src/
    c_std/{string/,math/,...}
    posix_std/{string/,math/,...}
    extensions_portable/{string/,stdlib/,...}
    extensions_macos/{string/,stdlib/,...}
    extensions_linux/{string/,stdlib/,...}

Probably also worth considering splitting the headers in the same way. E.g., move the current libc/include/math.h file to libc/include/c_std/math.h, and have a 'math.h' be generated by %%includeing the pieces from all the standards you wish to implement.

libc/docs/implementation_standard.md
30

Unless all the symbols defined here are going to be suppressed via some external mechanism, this namespace name is also not okay, as it will break valid user programs using the un-reserved namespace "llvm_libc".

I'd suggest renaming it to __llvm_libc.

I think one directory per function is going to be annoying, not helpful. Most libc functions are expressible as a single function of less than 100 LOC. I'd suggest, instead, to have e.g. src/stdlib/strcpy.cc, implementing the strcpy function. It seems like a good rule that every file implements one public function. If additional files are needed for implementation clarity, I'd suggest to put such helper code into something like src/stdlib/internal/strcpy_internal.cpp (obviously not _actually_ for strcpy)

My choice here is driven by my preference to keep tests co-located with the implementation. Since we are in general going to have different kinds of testing, like unit testing, differential testing, and fuzz testing, I have chosen to put implementations of individual functions in their own directories. This way, the tests (which will span multiple files) will all be in a single directory and avoid clutter in the higher-level directory. On the rare occasion that the implementation also spans multiple files, all the implementation files will also live together.

That said, I think we will need a place for common/shared infrastructure. For example, if all hyperbolic functions use a common algorithm, that algorithm should probably have its own target and go in a directory like src/math/__<appropriate_name>.

I agree that, the way I have structured it now, it can be an annoyance. But, I am of the opinion that it gives us a simpler mental model, and that the annoyance can be overcome with some kind of convenience tooling/scripting if required.

This layout does not have room for splitting across what I think are likely to be more useful boundaries. In particular, I think it'd be good to have at the top level whether the function is ISO C standard, POSIX standard, or a private extension. Another thing that may be good to somehow express is a distinction between generally-portable "library-code-only" nonstandard functions, and libc functions which are explicitly exposing non-portable kernel-specific functionality. In the former would go things like asprintf, and in the latter would go things like kqueue or epoll_create.

That is, maybe something like:

libc/
  src/
    c_std/{string/,math/,...}
    posix_std/{string/,math/,...}
    extensions_portable/{string/,stdlib/,...}
    extensions_macos/{string/,stdlib/,...}
    extensions_linux/{string/,stdlib/,...}

I did consider such a layout. However, the same kind of distinction can be achieved by two things:

  1. Next to the implementation and/or target listing for a function, call out the standard/extension that prescribes it.
  2. When composing the target for libc.a for a platform, group the functions per standards/extensions they come from.

Another point which made me not pick this layout: I agree that as a catalog, the structure you are suggesting could be more meaningful. But as a developer, my mental model is much simpler if all the functions from a header file are grouped in one place, irrespective of the standard they come from.

Probably also worth considering splitting the headers in the same way. E.g., move the current libc/include/math.h file to libc/include/c_std/math.h, and have a 'math.h' be generated by %%includeing the pieces from all the standards you wish to implement.

Irrespective of the directory layout, I think we will have such %%include-ed files anyway. For example, linux extensions can only be %%include-ed for linux targets.

libc/docs/implementation_standard.md
30

Thanks again for pointing out another of my oversights. I will change per your suggestion when I update the code next.

Add unittests for strcat and strcpy.

sivachandra edited the summary of this revision. (Show Details)Sep 25 2019, 11:38 PM
sivachandra marked an inline comment as done.Sep 25 2019, 11:41 PM
sivachandra added inline comments.
libc/include/math.h
19

I like your suggestion, but I want to leave it as is for now. As we add implementations, I want to separate out the functions that have implementations from those that do not. So, within the implemented ones, we should follow your suggestion of grouping similar functions. Eventually, the full set of functions will be grouped.

I think one directory per function is going to be annoying, not helpful. Most libc functions are expressible as a single function of less than 100 LOC. I'd suggest, instead, to have e.g. src/stdlib/strcpy.cc, implementing the strcpy function. It seems like a good rule that every file implements one public function. If additional files are needed for implementation clarity, I'd suggest to put such helper code into something like src/stdlib/internal/strcpy_internal.cpp (obviously not _actually_ for strcpy)

My choice here is driven by my preference to keep tests co-located with the implementation. Since we are in general going to have different kinds of testing, like unit testing, differential testing, and fuzz testing, I have chosen to put implementations of individual functions in their own directories. This way, the tests (which will span multiple files) will all be in a single directory and avoid clutter in the higher-level directory. On the rare occasion that the implementation also spans multiple files, all the implementation files will also live together.

Some tests will involve several functions from a header, e.g. flockfile+ftrylockfile+funlockfile. Where will you place the test, under flockfile/, or funlockfile/?

I prefer jyknight's proposed hierarchy.

dlj added a comment.Sep 26 2019, 11:52 AM

I think one directory per function is going to be annoying, not helpful. Most libc functions are expressible as a single function of less than 100 LOC. I'd suggest, instead, to have e.g. src/stdlib/strcpy.cc, implementing the strcpy function. It seems like a good rule that every file implements one public function. If additional files are needed for implementation clarity, I'd suggest to put such helper code into something like src/stdlib/internal/strcpy_internal.cpp (obviously not _actually_ for strcpy)

My choice here is driven by my preference to keep tests co-located with the implementation. Since we are in general going to have different kinds of testing, like unit testing, differential testing, and fuzz testing, I have chosen to put implementations of individual functions in their own directories. This way, the tests (which will span multiple files) will all be in a single directory and avoid clutter in the higher-level directory. On the rare occasion that the implementation also spans multiple files, all the implementation files will also live together.

Some tests will involve several functions from a header, e.g. flockfile+ftrylockfile+funlockfile. Where will you place the test, under flockfile/, or funlockfile/?

I prefer jyknight's proposed hierarchy.

The source tree layout should probably reflect the structure of implementation (which we decide and work within), not just the structure of the interface (which is specified for us).

Another way to think about it: within the implementation, how can we localize changes?

Siva mentioned tests, and I think that's quite instructive: the testing boundary provides a natural cut point for functions that must be tested mutually.

File locking functions are practically inseparable, so it's reasonable to keep them together. However, that doesn't imply they should be grouped with unrelated functions, nor that all functions should be lumped together. A finer-grained directory structure seems quite intuitive for expressing these chunks of inseparable functionality.

libc/docs/header_generation.md
8

Eh, I'll think about it. Don't block on me... I can send a patch if I come up with something.

39

This is definitely more consistent, but I think it's still the reverse of what I had in mind: "${thing}" is a specific value that exists at call sites, so I would call that an argument. The thing defined by the called function (with unspecified value) is a parameter.

int foo(int a); // 1 parameter named "a"
int b;
int c = foo(b); // 1 argument named "b"

The usage below of "Parameters" in include_file matches my expectation, but "Arguments" for begin seems at odds with this.

(The specific wording I'm thinking of is in the C++ standard: [dcl.fct] and [expr.call].)

dxf added a subscriber: dxf.Sep 26 2019, 2:09 PM
sivachandra marked an inline comment as done.

Update docs according to the latest comments.

In D67867#1684562, @dlj wrote:

Some tests will involve several functions from a header, e.g. flockfile+ftrylockfile+funlockfile. Where will you place the test, under flockfile/, or funlockfile/?

I prefer jyknight's proposed hierarchy.

The source tree layout should probably reflect the structure of implementation (which we decide and work within), not just the structure of the interface (which is specified for us).

Another way to think about it: within the implementation, how can we localize changes?

Siva mentioned tests, and I think that's quite instructive: the testing boundary provides a natural cut point for functions that must be tested mutually.

File locking functions are practically inseparable, so it's reasonable to keep them together. However, that doesn't imply they should be grouped with unrelated functions, nor that all functions should be lumped together. A finer-grained directory structure seems quite intuitive for expressing these chunks of inseparable functionality.

I understand maskray's concerns about related functions. I also agree with what dlj says, so I have updated the implementation standard to say that implementations of related groups of functions will be in their own directories (the previous wording said that every function/entrypoint would be in its own directory).

libc/docs/header_generation.md
39

Ah, I think I understand what's going on now. I think I got mixed up here. Fixed it now (I hope).

dlj accepted this revision.Sep 26 2019, 4:20 PM

Looks good for my comments.

This revision is now accepted and ready to land.Sep 26 2019, 4:20 PM
MaskRay added inline comments.Sep 26 2019, 6:53 PM
libc/cmake/modules/LLVMLibCRules.cmake
129

There is an existing CMake variable to specify this.
LLVM has moved to C++14 now.

158

Prefer CMAKE_OBJCOPY. Does this target need llvm-objcopy to build?

libc/include/string.h
57

place mem* functions together

libc/src/string/strcat/strcat.h
14

blank line above namespace

16

delete dest and src

libc/src/string/strcpy/strcpy_test.cpp
19

clang-format

24

clang-format.

This should be delete[] dest;

llvm/CMakeLists.txt
62

Making it build by default makes me concerned: there are some cmake constructs which I am not sure build on a non-Linux platform. If nobody has tested that, probably hold off on this a bit.

MaskRay added a comment.EditedSep 26 2019, 7:37 PM

I get a bunch of Ninja targets. When the libc is ready, we will get thousands of targets like these:

build strcat: phony projects/libc/src/string/strcat/strcat
build strcat_objects: phony projects/libc/src/string/strcat/strcat_objects
build strcat_test: phony bin/strcat_test
build strcpy: phony projects/libc/src/string/strcpy/strcpy
build strcpy_objects: phony projects/libc/src/string/strcpy/strcpy_objects
build strcpy_test: phony bin/strcpy_test
build string_h: phony projects/libc/include/string_h

Is there some way not to create the targets in the "global namespace"? The target name bin/strcpy_test is also less than ideal. With thousands of tests, those executables can "overflow" people's bin/ directories.

The build system concerns me the most. I think you should probably find some capable CMake reviewers (I am not well versed in it).

libc/CMakeLists.txt
20

Delete this. Just use CMAKE_LINKER.

On Windows, the variable defaults to set(CMAKE_LINKER "${LLVM_NATIVE_TOOLCHAIN}/bin/lld-link" CACHE FILEPATH "").

compiler-rt/lib/fuzzer uses ld -r, maybe you can read its CMakeLists.txt for some insights.

libc/cmake/modules/LLVMLibCRules.cmake
28

Use file(COPY ...)

libc/docs/build_system.md
20

I am not sure the names libm.a and libc.a are required by POSIX or other standards. I think this is more of a convention. Citation needed if you state this.

libc/docs/source_layout.md
71

This is usually named utils/ in LLVM projects.

test/ and unittests/ contain lit tests and unit tests, respectively.

libc/src/string/strcpy/strcpy.h
14

Blank line

16

Delete dest and src.

sivachandra marked 14 inline comments as done.

Another round of updates.

I get a bunch of Ninja targets. When the libc is ready, we will get thousands of targets like these:

build strcat: phony projects/libc/src/string/strcat/strcat
build strcat_objects: phony projects/libc/src/string/strcat/strcat_objects
build strcat_test: phony bin/strcat_test
build strcpy: phony projects/libc/src/string/strcpy/strcpy
build strcpy_objects: phony projects/libc/src/string/strcpy/strcpy_objects
build strcpy_test: phony bin/strcpy_test
build string_h: phony projects/libc/include/string_h

Is there some way not to create the targets in the "global namespace"? The target name bin/strcpy_test is also less than ideal. With thousands of tests, those executables can "overflow" people's bin/ directories.

I am not aware of a way to avoid creating targets in the global namespace. But, the target names are restricted C names anyway. So, I am not sure it will be a problem. If it does turn out to be a problem, it should be straightforward to add some prefixes and formulate naming rules as required.

I have now moved the test executables out of the bin directory. But note that the test executables are excluded from the "all" target anyway.

The build system concerns me the most. I think you should probably find some capable CMake reviewers (I am not well versed in it).

This is the first patch for the libc, and I am sure I did not get everything perfect. So, I also share your concerns about not getting everything perfect. At the same time, I am not worried about this as it isn't like something is getting set in stone - we can change as required as we learn along the way.

libc/cmake/modules/LLVMLibCRules.cmake
28

file and configure_file do not create concrete build targets. We want concrete build targets so that we have the ability to pick and choose targets.

158

We don't need llvm-objcopy just yet. But, I have a follow up change which will require llvm-objcopy. So, I have been carrying it here as well. Removed now anyway; will put it back in the next change.

libc/docs/build_system.md
20

I couldn't pull out an official reference, but I have this for now: http://www.musl-libc.org/faq.html - See the question "Why is libm.a empty?"

libc/docs/source_layout.md
71

I am for reducing the number of top-level directories. So, I have modified it to this structure:

utils/
  build_scripts/
  testing/
libc/src/string/strcat/strcat.h
16

This is an internal header file, so I do not see any problem with keeping the names.

libc/src/string/strcpy/strcpy.h
16

This is an internal header, so I do not see any problem with keeping the names.

llvm/CMakeLists.txt
62

I am not sure what you mean by "default". libc will be built only if one specifies libc among the list of projects with -DLLVM_ENABLE_PROJECTS, or if one says "all".

We don't have anything documenting the usage of reserved identifiers. The C standard reserves identifiers that begin with a double underscore, or with an underscore followed by a capital letter, for the "implementation", but assumes that the compiler and library are a single blob. GCC has a documented policy that double-underscore identifiers are reserved for the compiler and underscore-capital identifiers are reserved for the library (but glibc doesn't follow it, so we ended up with __block being used as an identifier in unistd.h, which broke many things for a long time). Do we want to have a stricter policy, for example that the only identifiers that we use in public headers that are not standard symbols are __LLVM_LIBC_{FOO} for macros and __llvm_libc_foo for other identifiers?

libc/docs/header_generation.md
10

I'm not sure I understand how this will work. For existing code, it's common to have to do things like #define or #undef macros like _GNU_SOURCE for a single compilation unit before including libc headers. There are some annoying differences in philosophy between the header exports: most libc implementations export all interfaces but provide a mechanism for hiding non-standard (or later-standard) ones from portable code; glibc exports only standard functions (though I don't recall which set of standards) and requires feature macros to expose others. Code that works with glibc and other libc implementations ends up jumping through hoops to support both models. Are we adding a third mechanism? Does this work for projects that want to use only standard interfaces in most compilation units but some non-standard extensions in their platform abstraction layer?

22

This sounds like you will end up with only one set of headers per configuration, so you lose the ability to have different projects using the same generated headers but enforcing different sets of standards compliance in their use of the interfaces.

libc/docs/implementation_standard.md
80

There are a few things that are unclear to me in this description:

  1. How do we express the standards to which an entrypoint conforms? For example, a function defined in C11 or POSIX2008?
  2. How do we differentiate between things that we want to be preemptible versus things that we don't? If we want to call the preemptible version of a symbol in other libc code, will we have the ::foo symbol visible at library build time?
  3. How are we exposing information for building subsets of the implementation that avoid dependencies on certain platform features? For example, a CloudABI-compatible mode that does not provide (or consume) any functions that touch the global namespace.
libc/include/ctype.h
15

BSD libcs include a cdefs.h that provides macros such as __BEGIN_DECLS and __END_DECLS wrapping this kind of pattern, and __restrict for language keywords that should not be in headers in certain versions of the standard. It looks as if __support/common.h is the equivalent - we should probably have an explicit rule that this header is included first in all libc headers and provide sensible helpers there.
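
For reference, a sketch of the usual cdefs.h-style pattern being referred to (illustrative; not the patch's __support/common.h):

#ifdef __cplusplus
#define __BEGIN_DECLS extern "C" {
#define __END_DECLS }
#else
#define __BEGIN_DECLS
#define __END_DECLS
#endif

// A public header then wraps its prototypes:
// __BEGIN_DECLS
// int isalpha(int);
// __END_DECLS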

The libc/docs/*.md should be .rst - there are no other .md docs in llvm; they were all converted back to .rst

theraven added inline comments.Sep 27 2019, 3:10 AM
libc/include/math.h
21

The long long functions should be exposed only for C99 or later and a version of C++ that supports the long long type.
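
A sketch of the kind of guard being suggested (illustrative; the exact version checks are an assumption, e.g. pre-C++11 compilers that support long long as an extension would need extra handling):

#if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L) || \
    (defined(__cplusplus) && __cplusplus >= 201103L)
long long llround(double);
long long llroundf(float);
#endif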

libc/include/string.h
13

Namespace pollution. The standard expects size_t to be exposed by this header, but not the other types in stddef.h. Software that relies on this pollution is non-portable (and will break on existing libc implementations that follow the standard).

19

Missing restrict qualifiers on this and many other standard functions in this file. The standard defines memcpy as:

void *memcpy(void * restrict s1,
             const void * restrict s2,
             size_t n);

I'm surprised that clang doesn't warn about the declaration of memcpy with incorrect types - it usually notices missing qualifiers.

libc/src/string/strcpy/strcpy_test.cpp
17

Having the libc test suite depend on a working C++ runtime and standard library is likely to make unit testing the library difficult. The implementation of std::string almost certainly depends on several libc functions and new and delete probably do as well. This means that we can test the namespaced libc functions in an environment that already has a libc, but we can't easily extend these tests to run in a process (or other environment) that has only LLVM libc.

I did consider such a layout. However, the same kind of distinction can be achieved by two things:

  1. Next to the implementation and/or target listing for a function, call out the standard/extension that prescribes it.
  2. When composing the target for libc.a for a platform, group the functions per standards/extensions they come from.

Another point which made me not pick this layout: I agree that as a catalog, the structure you are suggesting could be more meaningful. But as a developer, my mental model is much simpler if all the functions from a header file are grouped in one place, irrespective of the standard they come from.

Which standard the function is from could be useful to think about while writing/reading the code -- it might be good to, in general, only use functions from the same or "lower" standard versions when implementing functions in a "higher" standard. Although...if the namespacing mechanism described here is kept, and calls are all made only to other llvm_libc-namespace functions, it's less problematic to use nonstandard functions. Doing so then won't pollute the linker namespace, unlike if you're calling C toplevel functions.

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing. I took this:

cat /tmp/f.c 
int __attribute__((section(".llvm.libc.entrypoint.foo"))) foo ()  {
  return 0;
}

clang -target nvptx64 -O3 -S -o - /tmp/f.c
//
// Generated by LLVM NVPTX Back-End
//
 
.version 3.2
.target sm_20
.address_size 64
 
  // .globl foo
 
.visible .func  (.param .b32 func_retval0) foo()
{
  .reg .b32   %r<2>;
 
  mov.u32   %r1, 0;
  st.param.b32  [func_retval0+0], %r1;
  ret;
 
}

so maybe that part will just work?

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme is probably over-complex, IMO, but I don't have a concrete counter-proposal in mind.

The libc/docs/*.md should be .rst - there are no other .md docs in llvm; they were all converted back to .rst

So, does it mean this has been reversed: https://reviews.llvm.org/D44910?

lebedev.ri added a comment.EditedSep 27 2019, 10:25 AM

The libc/docs/*.md should be .rst - there are no other .md docs in llvm; they were all converted back to .rst

So, does it mean this has been reversed: https://reviews.llvm.org/D44910?

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, I suppose that patch needs to be reverted, to avoid misdirecting new docs.

The libc/docs/*.md should be .rst - there are no other .md docs in llvm; they were all converted back to .rst

So, does it mean this has been reversed: https://reviews.llvm.org/D44910?

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, I suppose that patch needs to be reverted, to avoid misdirecting new docs.

I think you should put *.md in quotes like this to see results:

$> find -iname "*.md" | grep "docs"

sivachandra marked 9 inline comments as done.

Address theraven's comments.

libc/docs/header_generation.md
10

This is kind of related to the other question you have asked below. I have tried to address the two questions together below.

22

Yes, that is the general direction in which this is going. We are making the headers for a configuration much simpler to navigate at the cost of having multiple sets of headers. In this day and age, I do not think forcing multiple sets of header files is a bad thing. Note also that users' build systems already have the knowledge and capability to handle multiple configurations. Hence, we are not making the build systems any more complicated.

This is not what traditional libcs have done. So, yes we are introducing a "third mechanism". At the same time, one can also argue that we are doing away with such mechanisms as we require that each configuration have its own set of header files.

libc/docs/implementation_standard.md
80

Answers as per numbers in the comment.

  1. We can do it in the public header file, or in the implementation .cpp files. Or in both places. Did I understand this question correctly? Is this question related to or similar to #3 below? Like, are you asking how we will add a new function without breaking the old standard? If yes, then the %%include mechanism is present to accommodate such scenarios: we start with a baseline standard and %%include new standards until they become baseline.
  2. I frankly do not have a good answer and would prefer someone who cares about this use case to contribute. Maybe a real-world example can help me think about this more clearly.
  3. For the header files, we have the %%include_file mechanism. For the library files, we pick and choose to compose a suitable library target. For example, like the one in lib/CMakeLists.txt of this patch.

At some level, my answers are only guesses about how things will evolve. So, I wouldn't be surprised if my answers here aren't valid or relevant even, say, 3 months from now.

libc/include/ctype.h
15

All of this is great, so I have incorporated it now.

libc/include/math.h
21

We are C17 and higher already. Should we still have such conditions?

libc/include/string.h
13

Fixed.

19

I used a script which scanned the standard document to produce these headers and missed restrict. I guess clang did not find it because we do a C++ compilation of the implementations.

libc/src/string/strcpy/strcpy_test.cpp
17

Using gtest already brings in the C++ runtime. Should that also be avoided?

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing. I took this:

cat /tmp/f.c 
int __attribute__((section(".llvm.libc.entrypoint.foo"))) foo ()  {
  return 0;
}

clang -target nvptx64 -O3 -S -o - /tmp/f.c
//
// Generated by LLVM NVPTX Back-End
//
 
.version 3.2
.target sm_20
.address_size 64
 
  // .globl foo
 
.visible .func  (.param .b32 func_retval0) foo()
{
  .reg .b32   %r<2>;
 
  mov.u32   %r1, 0;
  st.param.b32  [func_retval0+0], %r1;
  ret;
 
}

so maybe that part will just work?

The way it is set up, it is very ELF-ish (or should I be saying elvish? :)

Very likely, non-ELF systems like Windows and nvptx will require some other mechanism. I am working on the Windows story in parallel, but I am totally new to nvptx. If you have any suggestions on how to make it work there, we can surely incorporate it.

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme is probably over-complex, IMO, but I don't have a concrete counter-proposal in mind.

Another problem with the .llvm.libc.entrypoint.strcpy + objcopy --add-symbol scheme is that there is no way to change the st_size field. This can impact symbolization precision when debugging information is stripped. This makes me wonder what advantages the namespace __llvm_libc:: brings. @theraven questioned the disadvantage:

This means that we can test the namespaced libc functions in an environment that already has a libc, but we can't easily extend these tests to run in a process (or other environment) that has only LLVM libc.

libc/cmake/modules/LLVMLibCRules.cmake
165

,function -> ,function,global; otherwise it is STB_LOCAL.

libc/include/string.h
19

So this should be

void *memcpy(void *__restrict, const void *__restrict, size_t);

to be usable in C++

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme is probably over-complex, IMO, but I don't have a concrete counter-proposal in mind.

For what it's worth, FreeBSD libc does a similar namespacing trick in C. The internal symbols are underscore prefixed and they're exported as aliases (typically as weak aliases, to allow them to be preempted by other implementations, and to explicitly give names to callers for the preemptible and non-preemptible versions). Making symbols preemptible isn't really possible with PE/COFF, because the linkage model has a stronger concept of where definitions come from than ELF (at least, in the absence of symbol versions in ELF). On ELF platforms, we should support symbol versions as early as possible, because adding them later is an ABI break, even if we change no code.
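
A minimal sketch of the FreeBSD-style pattern described here (illustrative names; relies on GCC/Clang attributes on ELF targets):

// Internal, non-preemptible entry point that other libc code calls directly.
extern "C" int _close(int fd) {
  // ... real implementation ...
  return 0;
}

// Public name exported as a weak alias, so another implementation can
// preempt it while internal callers keep binding to _close.
extern "C" __attribute__((weak, alias("_close"))) int close(int fd);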

theraven added inline comments.Sep 29 2019, 12:39 AM
libc/docs/header_generation.md
22

I think that's fine when you consider building libc and shipping a single configuration, but a lot of projects that I've seen have different feature macros defined for different components. Are they now expected to rebuild the libc headers multiple times for each module? Do they need to drive that from their own build system (which is often not CMake)? It's even more complex when a project contains C89, C11, and C++11 files - these all have subtly different sets of requirements for the functions exposed in libc headers: do we require that they build a set for each? Or do you imagine that anyone shipping C11 will ship a powerset of headers?

The reason that we don't do the separate header thing in libcs today is that we end up with a huge explosion of the set of things that are supported. For example, in FreeBSD we support 3 versions of the C standard, 3 or 4 versions of POSIX, GNU and BSD extensions. Almost any combination of these is allowed, so we're looking at 20-30 possible sets of header files, before we start considering restricted subsets for sandboxed applications, custom configurations for sanitisers, and so on.

libc/docs/implementation_standard.md
80

Compliance with overlapping standards is one of the core reasons that we make symbols preemptible in existing libc implementations. C reserves a set of identifiers for the C standard and a C89 program is free to use any other identifier. If POSIX2008 or C11 use those identifiers then libc should not call their implementations from internal code, but should allow theirs to be called. Pthreads is subtly different, where users are allowed to bring their own pthreads implementation and libc should correctly consume it.

The third question is more in terms of layering. For example, I recently tried building libc++ to work in kernel space. This is incredibly hard, because a lot of things depend on locale, for example, and pull in iostream dependencies, so you end up needing a load of things that have no real meaning in kernel space. The same is true for sandboxed environments, where things like open may not exist though openat may (important when something like locale support in libc needs to open files: for a sandboxed deployment we should support either baking those files into the binary or not exposing the symbols that depend on their working correctly).

If we don't start out with some declarative definitions of things like C89 / C99 / POSIX2008 compliance, then any kind of automatic tooling to generate a pure C11 library (no POSIX) or to ensure that the correct symbols are preemptible will be very hard.

libc/include/math.h
21

We are building as C++17. There is no C17. I would hope that we're still aiming for the headers to be consumed by C89 programs, because there are a huge number of those in the world.

libc/include/string.h
17

Does this work correctly with the inclusion guards? I don't see the stddef.h implementation here, so I don't know what those macros do (they don't do anything in FreeBSD libc's stddef.h; they appear to do something in libc++'s cstddef, though I'm not entirely sure what).

The FreeBSD solution to this is to define types like __size_t, use these in headers that are supposed to use size_t in function prototypes but not provide a definition of size_t (yes, there are several in the C spec; it's annoying but that's what the standard says), and then add a guarded typedef to turn that into size_t in this header, stddef.h, and a couple of other places.
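
A rough sketch of that approach (illustrative; the type machinery and guard macro names are assumptions, not FreeBSD's actual spellings):

// In a private types header:
typedef __typeof__(sizeof 0) __libc_size_t;

// In string.h, stddef.h, and the other headers allowed to expose size_t:
#ifndef __LLVM_LIBC_SIZE_T_DECLARED
typedef __libc_size_t size_t;
#define __LLVM_LIBC_SIZE_T_DECLARED
#endif

// A header that needs the type but must not define size_t can use
// __libc_size_t directly in its prototypes.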

jyknight added a comment.EditedSep 30 2019, 9:11 AM

Maybe everything is fine, but given this setup, does anyone see any potential problems with compiling these functions for nvptx? I'd like to eventually see a mode where we compile an appropriate subset of these functions for GPU targets -- either in bitcode form for linking with our device-side OpenMP runtime library or as a native object -- to provide a more feature-complete offloading environment.

The one thing that caught my eye was the use of the section attribute, and I was curious what nvptx does with that. As far as I can tell, perhaps the answer is nothing.

Then I think this scheme won't work, since the point of the sections is to enable the creation of the global symbols post-build.

E.g., I think the idea is that the main implementation defines the function with C++ name __llvm_libc::strcpy(char *, const char *), and places the code in the .llvm.libc.entrypoint.strcpy section. And then another tool comes along and iterates the llvm.libc.entrypoint sections, and adds global symbol aliases for each one.

That scheme is probably over-complex, IMO, but I don't have a concrete counter-proposal in mind.

For what it's worth, FreeBSD libc does a similar namespacing trick in C. The internal symbols are underscore prefixed and they're exported as aliases (typically as weak aliases, to allow them to be preempted by other implementations, and to explicitly give names to callers for the preemptible and non-preemptible versions). Making symbols preemptible isn't really possible with PE/COFF, because the linkage model has a stronger concept of where definitions come from than ELF (at least, in the absence of symbol versions in ELF). On ELF platforms, we should support symbol versions as early as possible, because adding them later is an ABI break, even if we change no code.

Actually, now that I think I understand the existing proposal better, I believe it's broken, as well as confusing. It's getting the same effect as using __attribute__((alias)), except harder to understand. But it's not ok to have a single object file expose both a strong public alias and an internal alias, for any function that's not in the baseline ISO C standard. It would be OK if the aliases were weak, or if they were strong but exposed by a separate .o file. (In any case, I'd like to suggest not using an external objcopy invocation to achieve this.)

For an example of why this is wrong -- consider if libc has an 'open.o' object file, which defines __llvm_libc::open, and has also had the alias open added to it with objcopy. Internally, if libc needs to call open, it calls __llvm_libc::open, which pulls in that open.o file, which then also defines the global 'open' function. There would thus be a duplicate symbol error for any Standard C (e.g. non-POSIX) program which defines its own open function.

libc/docs/header_generation.md
22

IMO, it makes sense not to bother making C99/C11-only functions conditionally available. The libc headers still ought to be compatible in C89 mode, but I don't see that there's really much point to excluding declarations for new functions like 'strtof', 'aligned_alloc', etc, when building in older standards modes.

The same most likely can apply to old POSIX versions.

However, I do think it is quite likely to be necessary to preserve the ability to conditionally disable the various standards "layers". That is -- for all the headers specified in ISO C, you should be able to disable the declarations added by POSIX (and extensions) with a define. And for all the headers specified in POSIX, you should be able to disable the declarations added by the GNU/BSD/etc extensions with a define.

libc/include/string.h
17

Both Clang and GCC ship a stddef.h which supports these defines -- and it's expected to be first in the include path, before libc's headers.

For some reason, FreeBSD and some other platforms remove these compiler-shipped files and replace them with their own for their libc.

Exactly what the contract should be between the compiler headers and the libc headers could be a larger discussion, but for now, I'm strongly in favor of assuming that we're using the existing stddef.h from clang/gcc -- in which case this code will work correctly.
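
For example, a sketch of what that looks like in practice, assuming the __need_size_t convention honored by both clang's and GCC's stddef.h (illustrative, not the patch's string.h):

// Request only size_t from the compiler-provided stddef.h, without also
// pulling in ptrdiff_t, NULL, max_align_t, etc.
#define __need_size_t
#include <stddef.h>
#undef __need_size_t

void *memchr(const void *s, int c, size_t n);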

sivachandra marked 3 inline comments as done.Sep 30 2019, 12:06 PM

Actually, now that I think I understand the existing proposal better, I believe it's broken, as well as confusing. It's getting the same effect as using __attribute__((alias)), except harder to understand. But it's not ok to have a single object file expose both a strong public alias and an internal alias, for any function that's not in the baseline ISO C standard. It would be OK if the aliases were weak, or if they were strong but exposed by a separate .o file. (In any case, I'd like to suggest not using an external objcopy invocation to achieve this.)

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

For an example of why this is wrong -- consider if libc has an 'open.o' object file, which defines __llvm_libc::open, and has also had the alias open added to it with objcopy. Internally, if libc needs to call open, it calls __llvm_libc::open, which pulls in that open.o file, which then also defines the global 'open' function. There would thus be a duplicate symbol error for any Standard C (e.g. non-POSIX) program which defines its own open function.

I am open to making all public symbols weak. Should we start with that, or should we make them weak on an as-needed basis?

libc/docs/header_generation.md
22

With respect to excluding extension standards, I will go back to my earlier comment here: The %%include mechanism gives us a way to do it, but calls for a header set per configuration. Note that the libc source tree still only has a single set of files.

True that there will be an "explosion of configurations", but I expect a large number of these to be downstream configurations. Off-the-shelf, we should probably only provide what a normal Linux or a Windows development environment needs.

libc/include/math.h
21

We have said in the proposal that we will only support C17 and higher: http://llvm.org/docs/Proposals/LLVMLibC.html

libc/include/string.h
17

I was convinced by jyknight's reasoning last time. If not anything else, I like that it keeps our implementation surface smaller.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" __llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" __llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

Yes, this solution was considered. But, it does not really solve the problem you brought up; we will still need to make the alias weak. It of course eliminates the objcopy step, but we end up with some kind of source code mismatch/inconsistency (for lack of a better word). That is, we will have the main function in a global namespace, while the support/helper functions will live inside a namespace. I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

The libc/docs/*.md should be .rst - there are no other .md docs in llvm; they were all converted back to .rst

So, does it mean this has been reversed: https://reviews.llvm.org/D44910?

llvm-project$ find -iname *.md | grep "docs"
llvm-project$

So yes, i suppose that patch needs to be reverted, to avoid misdirecting new docs.

I think you should put *.md in quotes like this to see results:

$> find -iname "*.md" | grep "docs"

Aha. But still, all but a few docs migrated back to rst.
I'm not sure it's great to add new ones in the format being-migrated-from.

Aha. But still, all but a few docs migrated back to rst.
I'm not sure it's great to add new ones in the format being-migrated-from.

Can you kindly point me to some reference which says rst is preferred over markdown? Not that I do not trust you, but I was told by others to prefer markdown, and so I used markdown. I want to have a reference to point to if I am questioned in future about this.

Aha. But still, all but a few docs migrated back to rst.
I'm not sure it's great to add new ones in the format being-migrated-from.

Can you kindly point me to some reference which says rst is preferred over markdown? Not that I do not trust you, but I was told by others to prefer markdown, and so I used markdown. I want to have a reference to point to if I am questioned in future about this.

See e.g.
https://lists.llvm.org/pipermail/llvm-dev/2019-June/133038.html
https://reviews.llvm.org/D66305
I think there was some other disscussion but i can't find it.

The objcopy step is required to avoid putting mangled names with the alias attribute. If there is any other way to achieve the same thing, I am open to it.

Ah, I see! I'd suggest using extern "C" instead. There's no need for these be C++-mangled -- you can simply use a name prefix instead. E.g., if you define it as extern "C" __llvm_libc_strcpy(...) {} then it's trivial to make the strcpy alias without objcopy magic.

Yes, this solution was considered. But, it does not really solve the problem you brought up; we will still need to make the alias weak.

Correct, my suggestion above was only to reduce the complexity.

However, back to the weak-alias question -- I have a different suggestion for solving that.

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.
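
A sketch of that layering with hypothetical file and symbol names (not from the patch):

// open_internal.cpp (internal layer): the only real implementation, exposed
// under a libc-internal name, so it never collides with a user's "open".
extern "C" int __llvm_libc_open(const char *path, int flags);

// fopen.cpp (ISO C layer) calls __llvm_libc_open, so linking in fopen does
// not drag a global "open" definition into the program.

// open.cpp (POSIX layer): the only object file defining the public name.
extern "C" int open(const char *path, int flags) {
  return __llvm_libc_open(path, flags);
}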

It of course eliminates the objcopy step, but we end up with some kind of source code mismatch/inconsistency (for lack of a better word). That is, we will have the main function in a global namespace, while the support/helper functions will live inside a namespace.

Sort of... At the source level, you can keep using namespaces as usual -- you don't need to move the function to the global namespace. E.g., given

namespace __llvm_libc {
extern "C" __llvm_libc_strcat(...) {
  ...
}
}

the __llvm_libc_strcat function is within the namespace __llvm_libc, as far as C++ name-resolution semantics are concerned. It's only at a lower level, in the object-file linkage semantics, that it is "as if" it was in the global namespace -- the function has been given the symbol name __llvm_libc_strcat instead of a C++ namespace-mangled name.

I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

I do think it's both complicated and confusing to have build rules invoking objcopy to post-process .o files after compilation. If that was necessary, it'd be one thing, but since it's not, I'd say the additional complexity is not justified.

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly-layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.

This approach is fairly general and can be adopted with the existing setup as well. In fact, I can imagine that irrespective of the approach we take, we will end up with patterns like this.

Sort of... At the source level, you can keep using namespaces as usual -- you don't need to move the function to the global namespace. E.g., given

namespace __llvm_libc {
extern "C" __llvm_libc_strcat(...) {
  ...
}
}

the __llvm_libc_strcat function is within the namespace __llvm_libc, as far as C++ name-resolution semantics are concerned. It's only at a lower level, in the object-file linkage semantics, that it is "as if" it was in the global namespace -- the function has been given the symbol name __llvm_libc_strcat instead of a C++ namespace-mangled name.

This approach was also considered. (I missed the __ prefixes in my considerations of course.)

I rather prefer everything in a single LLVM-libc specific namespace. The objcopy step is neatly hidden behind a build rule, so I do not see it as being "complicated" or "confusing". Someone coming in to implement a new function just has to use the build rule and put the code in the namespace.

I do think it's both complicated and confusing to have build rules invoking objcopy to post-process .o files after compilation. If that was necessary, it'd be one thing, but since it's not, I'd say the additional complexity is not justified.

This is in subjective territory...

I prefer not to have the repetitiveness of llvm_libc::llvm_libc_strcpy or anything similar. One could suggest using a macro to avoid the repetitiveness, of course, but I'd rather avoid using a macro at every call site. To me, that seems neither pretty nor necessary. I prefer to keep the implementation layer as much as possible like a normal C++ library. The post-processing is done to make this C++ library usable as a C library as well.

I agree that "post-processing using objcopy" does sound complex, as it gives the impression of heavy binary surgery. But what the current proposal does is just add an alias symbol using objcopy. That is neither as invasive nor as complex as "post-processing using objcopy" sounds. Moreover, as I have said earlier, developers do not need to deal with objcopy on a regular basis as it is hidden behind a build rule.
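
For reference, the kind of invocation the build rule wraps is roughly the following (a hedged sketch; the exact section name and flags used by the CMake rule may differ):

$> llvm-objcopy --add-symbol "strcpy=.llvm.libc.entrypoint.strcpy:0,function,global" strcpy.o

It only adds an un-mangled, global strcpy symbol at the start of the section that holds the mangled definition; nothing else in the object file is rewritten (the added symbol carries no st_size, which is the caveat raised elsewhere in this review).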

The issue really only exists when you refer to object files across standards layers -- e.g., using an object file that exposes the POSIX symbol "open" from an object file implementing ISO C. If you make sure to always strictly-layer the libc, so that ISO C-implementing object files don't use any POSIX-exporting object files, and so on, you won't need to mark anything weak. For the example of "open", you'd have an internal implementation of open in its own file, only exposing libc-internal symbols. Then fopen (ISO C) can use that safely, without dragging in a definition of the symbol open. Separately, the implementation of open (POSIX) can be defined in its own file, also based on the internal open.

This approach is fairly general and can be adopted with the existing setup as well. In fact, I can imagine that irrespective of the approach we take, we will end up with patterns like this.

Yes. This is totally separate from whether to use objcopy or not. I'm sorry the two concerns were merged into one comment thread.

I agree that "post-processing using objcopy" does sound complex, as it gives the impression of heavy binary surgery. But what the current proposal does is just add an alias symbol using objcopy. That is neither as invasive nor as complex as "post-processing using objcopy" sounds. Moreover, as I have said earlier, developers do not need to deal with objcopy on a regular basis as it is hidden behind a build rule.

I continue to disagree that that's at all a good trade-off, but since we're only going around in circles now, I'll drop it in the name of progress.

Move markdown docs to reStructuredText; Will add conf.py and build targets in a later pass.

MaskRay added a comment.EditedOct 3 2019, 12:08 AM

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

https://reviews.llvm.org/D67867#1686834 mentioned that the objcopy scheme will break the st_size fields of symbols.

Can we do things the other way round? No namespace, no __llvm_libc_ prefix. Add the -ffreestanding compiler flag and just define strstr, open, etc. in the global namespace. In unit tests, invoke llvm-objcopy --redefine-syms= to rename strstr to __llvm_libc_strstr, and call __llvm_libc_strstr in the tests. For functions that affect global state, gtest will not be suitable. It is good to think about how the tests will be organized at this early stage.
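
As a concrete illustration of that suggestion (the file and symbol names here are just an example), the production object keeps the standard name and only the copy consumed by the unit test is renamed:

$> llvm-objcopy --redefine-sym strcat=__llvm_libc_strcat strcat.o strcat_for_test.o

The gtest file would then declare extern "C" char *__llvm_libc_strcat(char *, const char *); and exercise that, while the host libc's own strcat stays untouched.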

Since weak aliases have been mentioned, I'd like to say that a weak_alias macro will be needed, probably not now, but it should be taken into consideration so that we do not change things back and forth. (A rough sketch follows the example below.)

  1. Avoid namespace pollution. ISO C interfaces should not pull in POSIX functions, e.g. fopen can call strstr, strchr (ISO C) but not open (POSIX). To call open, (a) define STV_HIDDEN __open (b) make open a weak alias of __open (c) call __open in fopen.
  2. Weak definition for dummy implementation. Some features do not necessarily pull in functions from other components. Create a static dummy function and create a weak definition. When used as a variable, this can be used to check whether some features are unavailable.
  3. Pure aliases. Required either by standard or ABI compatibility purposes. glibc uses this a lot.

As an example of when this will be used: in the implementation of strcpy, call the STV_HIDDEN __memcpy (I'm not saying implementing strcpy on top of memcpy is efficient; this is merely an example), not the STV_DEFAULT memcpy, to make it explicit that symbol interposition is not desired. This matters if you ever support dynamic linking and ship libc.so.
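
A minimal sketch of such a weak_alias macro (the name and exact spelling here are hypothetical; glibc and musl each have their own variants):

// Declares `aliasname` as a weak alias of `name`; the target must be
// defined in the same translation unit.
#define weak_alias(name, aliasname) \
  extern "C" __typeof__(name) aliasname __attribute__((__weak__, __alias__(#name)))

// Hidden (STV_HIDDEN), non-interposable definition that other libc code
// (e.g. fopen) calls directly.
extern "C" int __open(const char *path, int flags)
    __attribute__((__visibility__("hidden")));
extern "C" int __open(const char *path, int flags) {
  // ... real implementation elided ...
  return -1;
}

// The public POSIX name is only a weak alias of the hidden definition.
weak_alias(__open, open);

With this, fopen calls __open directly, so ISO C objects never reference the public POSIX symbol, while programs that call open still resolve to the same code.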

In the description, a brief mention of what will be built and how to test will be helpful to reviewers and subscribers. It is not easy to figure out where the libc build artifacts are located.

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

Documentation is a good idea. At this point, we want the implementation to be as much as possible like a normal C++ library and allow mixing this libc with other libcs in various scenarios like unit testing, differential testing etc. This is the first patch; I am sure a lot will change in the coming months. I want to write up things like this after we at least get a clear idea of the Windows strategy (I am working on it).

https://reviews.llvm.org/D67867#1686834 mentioned that the objcopy scheme will break the st_size fields of symbols.

Yes, st_size is broken in the sense that it doesn't show the size of the aliasee. But it almost never matters in practice.

Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2019, 10:10 PM
This revision was automatically updated to reflect the committed changes.
MaskRay added a comment.EditedOct 7 2019, 1:11 AM

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

  • Proper cmake review
  • Detailed summary. The commit message should at least reference some previous discussions on the mailing list, especially since this is a brand new project.
  • The llvm-objcopy issue definitely needs more consideration. This may interfere badly with instrumentation tools, which is a selling point of the llvm libc.
  • Why __llvm_libc / .llvm.libc.entrypoint. are necessary is not well explained.
  • Some necessary options -ffreestanding -nostdinc are absent.
  • C++ should not get #define __restrict restrict
  • ...

I understand that you want agile development, but I just did not want 1000 lines of code to be committed and then 700 of those lines replaced very quickly in the next month. If you read the source code of musl, you'll find that a large chunk of code still remains untouched since the initial import ("initial check-in, version 0.5.0").

This is a post-commit review anyway so many points are probably moot.

My earlier question is about why we need the namespace __llvm_libc at all. From libc/src/string/strcat/strcat_test.cpp I conclude it is for unit testing in an environment that already has a libc (gtest). This should probably be documented.

Can we do things the other way round? No namespace, no __llvm_libc_ prefix. Add the -ffreestanding compiler flag and just define strstr, open, etc. in the global namespace. In unit tests, invoke llvm-objcopy --redefine-syms= to rename strstr to __llvm_libc_strstr, and call __llvm_libc_strstr in the tests. For functions that affect global state, gtest will not be suitable. It is good to think about how the tests will be organized at this early stage.

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

I don't know if it matters anymore since this was already committed, but I agree with @MaskRay. His suggestion of using llvm-objcopy to rename the symbols for tests makes much more sense to me. I haven't seen a libc that does testing in an ergonomic way, and this suggestion seems the best to me, frankly.

There is a lot going on here; it's hard to follow it all in one patch, and I think some comments got lost because of this. I feel like a lot of big design decisions were made here. Did I miss something on the libc-dev mailing list?

The commit was done in a hurry. For the initial commit of a brand new project that sets up the project hierarchy, this seems to have received fewer than enough thumbs up. Many points raised in the review process were just shrugged off.

I don't know if it matters anymore since this was already committed, but I agree with @MaskRay. His suggestion of using llvm-objcopy to rename the symbols for tests makes much more sense to me. I haven't seen a libc that does testing in an ergonomic way, and this suggestion seems the best to me, frankly.

There is a lot going on here; it's hard to follow it all in one patch, and I think some comments got lost because of this. I feel like a lot of big design decisions were made here. Did I miss something on the libc-dev mailing list?

FWIW -- LLVM project policy is that both pre- and post-commit reviews must be addressed.

That this is now committed does not change anything w.r.t. needing to respond to outstanding comments.

That this is now committed does not change anything w.r.t. needing to respond to outstanding comments.

Absolutely!

I have heard everyone, and will try my best to address everything. FWIW, I am not happy myself that I am unable to address jyknight's concerns on post-processing in an acceptable fashion.

lygstate added inline comments.
libc/trunk/include/ctype.h
18 ↗(On Diff #223246)

There are no DLL export annotations here for MSVC, so is this only being built as a static C library? Is building it as a shared library not being considered?

Just as an FYI, this patch breaks LLVM_INCLUDE_TESTS=OFF for me:

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3)  -DLLVM_ENABLE_PROJECTS=all ../llvm
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

$ cd .. && rm -rf build && mkdir -p build && cd build 

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3) -DLLVM_ENABLE_PROJECTS=all -DLLVM_INCLUDE_TESTS=OFF ../llvm
...
-- Configuring done
CMake Error at /home/nathan/src/llvm-project/libc/cmake/modules/LLVMLibCRules.cmake:264 (add_dependencies):
  The dependency target "gtest" of target "strcpy_test" does not exist.
Call Stack (most recent call first):
  /home/nathan/src/llvm-project/libc/src/string/strcpy/CMakeLists.txt:11 (add_libc_unittest)


CMake Error at /home/nathan/src/llvm-project/libc/cmake/modules/LLVMLibCRules.cmake:264 (add_dependencies):
  The dependency target "gtest" of target "strcat_test" does not exist.
Call Stack (most recent call first):
  /home/nathan/src/llvm-project/libc/src/string/strcat/CMakeLists.txt:12 (add_libc_unittest)


-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

$ git revert -n 4380647e79bd80af1ebf6191c2d6629855ccf556

$ cd .. && rm -rf build && mkdir -p build && cd build 

$ cmake -GNinja -DPYTHON_EXECUTABLE=$(command -v python3) -DLLVM_ENABLE_PROJECTS=all -DLLVM_INCLUDE_TESTS=OFF ../llvm
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/nathan/src/llvm-project/build

This is as of r374191.