This is an archive of the discontinued LLVM Phabricator instance.

Document Clang's expectations of the C standard library.
Needs ReviewPublic

Authored by rsmith on Sep 1 2020, 5:10 PM.

Details

Reviewers
fhahn
rjmccall
Summary

Diff Detail

Unit TestsFailed

Event Timeline

rsmith created this revision.Sep 1 2020, 5:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2020, 5:10 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
rsmith requested review of this revision.Sep 1 2020, 5:10 PM
rsmith updated this revision to Diff 289329.Sep 1 2020, 5:12 PM

Add missing word.

Wording looks good. Should we alsso document our assumptions about what functions exist in the standard library — the functions that we'll always use even in freestanding builds?

Wording looks good. Should we alsso document our assumptions about what functions exist in the standard library — the functions that we'll always use even in freestanding builds?

Sounds like a good idea to me. Do we have a list?

@t.p.northover says it's complicated. memcpy, memmove, memset, and bzero are (I think) the only ones that LLVM will potentially synthesize from nothing and therefore need to be present even in freestanding builds; that's probably okay for us to guarantee. That's probably also a good place to document the supported way to write those functions in libraries (just -fno-builtin, IIRC?).

In hosted builds, we should document that we assume the existence of the standard C library for the target platform, potentially including non-standard functions like exp10.

aaron.ballman added inline comments.
clang/docs/Toolchain.rst
289

While I think it's good that we're documenting this, it is really troubling that Clang community's perspective is "we can't work with a conforming C standard library" without filing any DRs to WG14 about why we cannot conform, especially if other major compilers are in a similar boat. Do you have any interest in filing a DR to see if we can change the C standard?

bjope added a subscriber: bjope.Sep 10 2021, 1:36 PM

What is the status of this patch? It has not been committed as far as I can tell.

I was looking into this coming from Rust, where we do worry a bit about LLVM calling memcpy, memmove or memset in ways that are not allowed by the C standard. This here would at least give us something to point to, though it doesn't answer the question "which C standard libraries have officially committed to guaranteeing these properties". Since this affects all compilers using LLVM, not just Clang, should this be put somewhere in the LLVM docs?

Herald added a project: Restricted Project. · View Herald TranscriptJul 4 2023, 10:41 AM
Herald added a subscriber: StephenFan. · View Herald Transcript

What is the status of this patch? It has not been committed as far as I can tell.

Correct, it's not been committed yet. we're still discussing (though that's obviously stalled out a bit).

I was looking into this coming from Rust, where we do worry a bit about LLVM calling memcpy, memmove or memset in ways that are not allowed by the C standard. This here would at least give us something to point to, though it doesn't answer the question "which C standard libraries have officially committed to guaranteeing these properties". Since this affects all compilers using LLVM, not just Clang, should this be put somewhere in the LLVM docs?

I think it makes sense to document this both from the Clang perspective and from the LLVM perspective, as they're somewhat different. Clang strives to be a conforming C and C++ implementation, so our conformance requirements are different than LLVM, which aims to be a backend for more general use than just C and C++. For example, Clang needs to talk about freestanding and that's not really something LLVM needs to talk about as such.

I continue to think it's a mistake for us to document that Clang will not work with a conforming C standard library implementation unless we're filing issues with WG14 to alert them to the reasons why. If there's a DR filed with the committee, that gives us something to point our users to as justification. Otherwise, our users will correctly put these bugs at our feet.

I agree with Aaron that it would be better to change the C standard if we can. I don't know how important the first bullet is; IIRC it enables some useful middle-end transformation. I know the second is useful in the frontend so that we don't have to do explicit pointer equality checks around aggregate assignments, although in many cases we'd be able to avoid that because e.g. we know that one of the operands is a temporary.

The first point is important for LLVM's own memcpy/memmove intrinsics, which are documented as NOPs on size 0 (and e.g. Rust relies on that).

memset should be added to that list.

The first point is important for LLVM's own memcpy/memmove intrinsics, which are documented as NOPs on size 0 (and e.g. Rust relies on that).

Right, I understand that these assumptions come directly from the stronger semantics offered by the LLVM intrinsics. The C committee is not going to find that compelling, though — we don't get to default-win arguments just because we've defined an IR with stronger semantics than necessary. They are going to want to see arguments about why it's valuable for the C library to have these stronger semantics, which for us means talking about code patterns in user programs that take advantage of those stronger semantics and the benefits they see from that.

Yeah, agreed. Though I hear that GCC makes similar assumptions, so this should count as something I think. (I don't have a source for this, so it should probably be verified.)

Anyway I don't think LLVM documenting its assumptions should be blocked on changing the C standard. Maybe LLVM should have blocked *making* those assumptions on changing the standard, but that ship has sailed long ago...
(I have less of an opinion about clang since I am not involved there.)

I think it makes sense to document this both from the Clang perspective and from the LLVM perspective, as they're somewhat different.

What would be the right place in LLVM to document this?

Yeah, agreed. Though I hear that GCC makes similar assumptions, so this should count as something I think. (I don't have a source for this, so it should probably be verified.)

glibc's header files mark the pointer as non-null:
extern void *memcpy (void *restrict dest, const void *restrict src,

size_t n) throw () attribute__ ((nonnull (1, 2)));
c99: 7.21.2.1 "The memcpy function copies n characters from the object pointed to by s2 into the
object pointed to by s1. If copying takes place between objects that overlap, the behavior
is undefined."

null is not a pointer to an object

You can't just consider the declared function signatures; memcpying two non-empty objects with perfect overlap is a violation of restrict, and yet IIRC GCC is known to assume that that works in some places. (I can't duplicate that off-hand, though — GCC seems determined to generate inlined memcpys in my quick efforts to test this.)

Is memmove() expected to perform a lot worse than memcpy()? Why not just use memmove() all the time. In some library implementations, memcpy() and memmove() are the same function.

I was told that "glibc explicitly only marks memcpy arguments as nonnull for external use of headers (i.e. the nonnull assumptions are disabled when compiling the implementation)"... okay now re-reading this, that's a statement about glibc, not GCC.

Anyway this is outside my expertise. I will try to add Nikita here who said "GCC also requires passing equal pointer to memcpy to be legal".

RalfJung added a subscriber: nikic.Jul 5 2023, 11:17 AM
nikic added a comment.Jul 5 2023, 11:40 AM

The first point is important for LLVM's own memcpy/memmove intrinsics, which are documented as NOPs on size 0 (and e.g. Rust relies on that).

Right, I understand that these assumptions come directly from the stronger semantics offered by the LLVM intrinsics. The C committee is not going to find that compelling, though — we don't get to default-win arguments just because we've defined an IR with stronger semantics than necessary. They are going to want to see arguments about why it's valuable for the C library to have these stronger semantics, which for us means talking about code patterns in user programs that take advantage of those stronger semantics and the benefits they see from that.

I think the argument for this one is pretty clear:

  • Lower code size: No need to add an explicit n != 0 check.
  • Better performance: If the n == 0 case is rare, you don't have to do that redundant check.
  • Better security: You are less likely to cause UB by forgetting a n != 0 check in place that needs it.

When I made an old code base ubsan compliant some time ago, adding these checks certainly felt like the one case where fixing the sanitizer diagnostics made the code objectively worse on all possible metrics. Take into account that the requirement for nonnull pointers for n != 0 does not allow any useful optimization of the memcpy implementation, and this seems like a no-brainer.

I was told that "glibc explicitly only marks memcpy arguments as nonnull for external use of headers (i.e. the nonnull assumptions are disabled when compiling the implementation)"... okay now re-reading this, that's a statement about glibc, not GCC.

Right. glibc only adds nonnull annotations to declarations when included by users of glibc. The implementation is compiled without any such assumptions (though of course, usually you get an assembly implementation anyway). Musl doesn't annotate nonnull either way.

Anyway this is outside my expertise. I will try to add Nikita here who said "GCC also requires passing equal pointer to memcpy to be legal".

GCC uses equal pointer memcpy in pretty much exactly the same cases as Clang does: Aggregate assignments.

I continue to think it's a mistake for us to document that Clang will not work with a conforming C standard library implementation unless we're filing issues with WG14 to alert them to the reasons why. If there's a DR filed with the committee, that gives us something to point our users to as justification. Otherwise, our users will correctly put these bugs at our feet.

As you are involved with WG14, can you please take care of that? I couldn't even find out how one is supposed to file a DR.

I continue to think it's a mistake for us to document that Clang will not work with a conforming C standard library implementation unless we're filing issues with WG14 to alert them to the reasons why. If there's a DR filed with the committee, that gives us something to point our users to as justification. Otherwise, our users will correctly put these bugs at our feet.

As you are involved with WG14, can you please take care of that? I couldn't even find out how one is supposed to file a DR.

I'm happy to help with it, but I'll need some outside help as well so I get the technical details and justifications correct and can defend them when presenting to the committee. If someone could put together a super rough draft of what changes we'd like to see and why, I can definitely work with them to polish it into a final document, get it to the committee, and champion it there. Related question though: what do we do if the committee rejects the DR? I think LLVM and Clang have slightly different pressures here, so we might have different answers.

FWIW, you can find more info here about how to communicate with WG14: https://www.open-std.org/jtc1/sc22/wg14/www/contributing

nikic added a comment.Jul 6 2023, 8:10 AM

I continue to think it's a mistake for us to document that Clang will not work with a conforming C standard library implementation unless we're filing issues with WG14 to alert them to the reasons why. If there's a DR filed with the committee, that gives us something to point our users to as justification. Otherwise, our users will correctly put these bugs at our feet.

As you are involved with WG14, can you please take care of that? I couldn't even find out how one is supposed to file a DR.

I'm happy to help with it, but I'll need some outside help as well so I get the technical details and justifications correct and can defend them when presenting to the committee. If someone could put together a super rough draft of what changes we'd like to see and why, I can definitely work with them to polish it into a final document, get it to the committee, and champion it there. Related question though: what do we do if the committee rejects the DR? I think LLVM and Clang have slightly different pressures here, so we might have different answers.

FWIW, you can find more info here about how to communicate with WG14: https://www.open-std.org/jtc1/sc22/wg14/www/contributing

Thank you! I will prepare an initial draft and share it here.

It would probably be worth including all string functions that take a length in such a DR. In Rust we are currently puzzling over whether calling 0-length memcmp on something like (char*)42 is okay or not. If not we'd have to introduce a pretty pointless branch.

It would probably be worth including all string functions that take a length in such a DR. In Rust we are currently puzzling over whether calling 0-length memcmp on something like (char*)42 is okay or not. If not we'd have to introduce a pretty pointless branch.

I think the DR would be to change 7.1.4p1 where it currently says:

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after default argument promotion) not expected by a function with a variable number of arguments, the behavior is undefined.

"invalid value" includes null pointers and we may want to make a surgical repair to say something along the lines of "unless invalid value is a null pointer and the library function has a count parameter blah blah blah". Then we'd cover all the APIs taking a pointer and a count in one go, I believe.

It would probably be worth including all string functions that take a length in such a DR. In Rust we are currently puzzling over whether calling 0-length memcmp on something like (char*)42 is okay or not. If not we'd have to introduce a pretty pointless branch.

I think the DR would be to change 7.1.4p1 where it currently says:

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after default argument promotion) not expected by a function with a variable number of arguments, the behavior is undefined.

"invalid value" includes null pointers and we may want to make a surgical repair to say something along the lines of "unless invalid value is a null pointer and the library function has a count parameter blah blah blah". Then we'd cover all the APIs taking a pointer and a count in one go, I believe.

But not the case mentioned above, (char *)42. More like if size (count) parameter is zero, pointer arguments are not dereferenced?

It would probably be worth including all string functions that take a length in such a DR. In Rust we are currently puzzling over whether calling 0-length memcmp on something like (char*)42 is okay or not. If not we'd have to introduce a pretty pointless branch.

I think the DR would be to change 7.1.4p1 where it currently says:

If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after default argument promotion) not expected by a function with a variable number of arguments, the behavior is undefined.

"invalid value" includes null pointers and we may want to make a surgical repair to say something along the lines of "unless invalid value is a null pointer and the library function has a count parameter blah blah blah". Then we'd cover all the APIs taking a pointer and a count in one go, I believe.

But not the case mentioned above, (char *)42. More like if size (count) parameter is zero, pointer arguments are not dereferenced?

Yeah, that's a good point.. I'm not certain how the committee might react to that. I think it might be an easier case for nullptr/0 than it is for invalid/0, but I do like the specification niceties that come along with "if count is zero, pointer is not dereferenced".

nikic added a comment.Jul 23 2023, 7:55 AM

Sorry for the delay, this took more time than expected. I've created an initial draft here: https://docs.google.com/document/d/1guH_HgibKrX7t9JfKGfWX2UCPyZOTLsnRfR6UleD1F8/edit?usp=sharing

While writing this I remembered another bit that is somewhat relevant here: Clang explicitly ignores the nonnull attributes on declarations in order to avoid miscompilations for uses of these libc functions. There are sanitizer checks for them, but Clang will not optimize based on nonnull assumptions, even if a libc implementation is used that adds the attributes.

nikic added a comment.Aug 14 2023, 6:01 AM

@aaron.ballman I wanted to check back whether the linked document is what you had in mind, or whether some more/else would be needed.

@aaron.ballman I wanted to check back whether the linked document is what you had in mind, or whether some more/else would be needed.

Sorry about the delayed review; this fell off my radar! This is exactly along the lines of what I was thinking of -- I left some comments on the document. In terms of next steps, I think we can do a few things:

  1. You and I can iterate on the document until we're ready to submit it to WG14
  2. Simultaneously, I think we can move forward with this review -- we can add a statement along the lines of "C standard library implementations that do not guarantee these properties are incompatible with Clang and LLVM (and with several other major compilers); please see WG14 NXXXX, which proposes changes to the C standard to make this behavior conforming." (I'm not strongly tied to these words, just so long as there's some link back to the WG14 document justifying the deviation from the C standard.)

With everything moving to github PRs, what are the next steps for this patch?

With everything moving to github PRs, what are the next steps for this patch?

I'm unlikely to have cycles to work on this in the near future. If someone else wants to take over here, you would have my blessing :-)

With everything moving to github PRs, what are the next steps for this patch?

I'm unlikely to have cycles to work on this in the near future. If someone else wants to take over here, you would have my blessing :-)

Since I had strong opinions, I'll commandeer the patch and get it across the finish line before Phab gets put into read-only mode.