Please use GitHub pull requests for new patches. Phabricator shutdown timeline
Changeset View
Standalone View
clang/docs/SafeBuffers.rst
- This file was added.
================ | ||||||||||||
C++ Safe Buffers | ||||||||||||
================ | ||||||||||||
.. contents:: | ||||||||||||
:local: | ||||||||||||
Introduction | ||||||||||||
============ | ||||||||||||
Clang can be used to harden your C++ code against buffer overflows, an otherwise | ||||||||||||
common security issue with C-based languages. | ||||||||||||
The solution described in this document is an integrated programming model: | ||||||||||||
it defines safety guidelines that restrict operations the code is allowed | ||||||||||||
to perform, it provides tools for updating the code to conform to these | ||||||||||||
guidelines, and it provides runtime mitigations for the situations when | ||||||||||||
following guidelines isn't sufficient. | ||||||||||||
aaron.ballman: For consistency with the rest of our docs, you should make this change throughout. | ||||||||||||
Namely, the solution consists of the following parts: | ||||||||||||
- ``-Wunsafe-buffer-usage`` is a family of warnings that warns you when | ||||||||||||
a potentially unsafe operation is performed in your code; | ||||||||||||
- Pragmas ``unsafe_buffer_usage`` and ``only_safe_buffers`` allow you to | ||||||||||||
annotate sections of code as opt-out of or opt-into the programming model, | ||||||||||||
aaron.ballman: | ||||||||||||
which enables incremental adoption and provides "escape hatches" | ||||||||||||
when unsafety is necessary; | ||||||||||||
steakhal: | ||||||||||||
- Automatic fixits provided by the warning act as a modernizer to help you | ||||||||||||
convert large amounts of old code to conform to the warning; | ||||||||||||
- Attribute ``[[unsafe_buffer_usage]]`` lets you annotate custom functions as | ||||||||||||
steakhalUnsubmitted Not Done ReplyInline Actions
steakhal: | ||||||||||||
unsafe, while providing a safe alternative that can often be suggested by | ||||||||||||
Not Done ReplyInline ActionsHmmmm that attribute name looks problematic (more below in that section). aaron.ballman: Hmmmm that attribute name looks problematic (more below in that section). | ||||||||||||
Not Done ReplyInline ActionsWith C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened. xazax.hun: With C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able… | ||||||||||||
Not Done ReplyInline ActionsI think it makes sense on module export (the module author knows whether they are buffer-safe or not), but what do you have in mind for import where the user may not be in a position to make that assertion validly? aaron.ballman: I think it makes sense on module export (the module author knows whether they are buffer-safe… | ||||||||||||
It's usually not that valuable for us to know that outside the module. If the module wasn't hardened, there's nothing the "caller module" can do about it, the recommended approach is to harden the module first, and then use the attribute to indicate which functions were found to be unsafe in that process (especially if safe alternatives were implemented). So it's only valuable when the module *will never be hardened*, and the only thing the caller module can do in such case is isolate calls to that module, maybe cover it with a safe abstraction. In this case, yeah, I think it makes sense. Maybe we should consider this eventually. NoQ: > I wonder if it would be possible/beneficial to be able to annotate module import/export… | ||||||||||||
Not Done ReplyInline ActionsWouldn't only_safe_buffers make sense on the import side to ensure that no unhardened code is being imported from a specific module? xazax.hun: Wouldn't `only_safe_buffers` make sense on the import side to ensure that no unhardened code is… | ||||||||||||
Not Done ReplyInline Actions
But the import side has no way to know that the module being imported has no unhardened code, only the module implementer knows that. That said, I could imagine wanting to mark a module import as "will never be hardened" as a blanket way to say "don't complain about interfaces in this module I have no control over". But now I'm questioning something from the docs:
It now occurs to me there's two ways to read this. unsafe_buffer_usage = will diagnose any unsafe buffer usage, it's already hardened. or unsafe_buffer_usage == this uses unsafe buffers, it's not to be hardened I originally read it the first way from the docs in the patch but the comment from @xazax.hun reads to me like he read it the second way. aaron.ballman: > Wouldn't only_safe_buffers make sense on the import side to ensure that no unhardened code is… | ||||||||||||
Yes our intent was to have #pragma unsafe_buffer_usage begin #pragma unsafe_buffer_usage end annotate a section of code that intentionally uses unsafe buffers (so, warnings are suppressed). It makes sense to have #pragma unsafe... mean something's unsafe. I wonder if it wouldn't have been confusing without my clumsy explanation. @jkorous WDYT about this wording problem, do you see ambiguity here? NoQ: Yes our intent was to have
```
#pragma unsafe_buffer_usage begin
#pragma unsafe_buffer_usage… | ||||||||||||
the compiler automatically; | ||||||||||||
- LLVM's own C++ standard library implementation, libc++, provides a | ||||||||||||
hardened mode where C++ classes such as ``std::vector`` and ``std::span``, | ||||||||||||
together with their respective ``iterator`` classes, are protected | ||||||||||||
at runtime against buffer overflows. This changes buffer overflows from | ||||||||||||
extremely dangerous undefined behavior to a predictable runtime crash; | ||||||||||||
Not Done ReplyInline ActionsOnly buffer overflows, so comparing cases like: auto it1 = v1.begin(); auto it2 = v2.begin(); if (it1 == it2) { ... } are out of scope, right? The main reason I'm asking, because I am not sure how safe iterators are implemented and whether the same implementation would give opportunities for other safety checks without significant additional cost. xazax.hun: Only buffer overflows, so comparing cases like:
```
auto it1 = v1.begin();
auto it2 = v2.begin… | ||||||||||||
I suspect libc++ will have such checks (or even already does, @ldionne pls correct me if I'm wrong). They're still out of scope for our proposal though. But they're definitely useful and may deserve mentioning. NoQ: I suspect libc++ will have such checks (or even already does, @ldionne pls correct me if I'm… | ||||||||||||
Not Done ReplyInline ActionsYes, this is out of scope for our diagnostics and code transformations. @ldionne Do you want to cover the safe iterator design part? jkorous: Yes, this is out of scope for our diagnostics and code transformations. @ldionne Do you want to… | ||||||||||||
- Finally, in order to avoid bugs in newly converted code, the | ||||||||||||
steakhal: | ||||||||||||
Clang static analyzer provides a checker to find misconstructed | ||||||||||||
span/view objects. | ||||||||||||
Isn't this overly specific? What about string_views? xazax.hun: Isn't this overly specific? What about `string_view`s? | ||||||||||||
Yeah I think we can cover string views as well! NoQ: Yeah I think we can cover string views as well! | ||||||||||||
This is based on our current plans for near future. jkorous: This is based on our current plans for near future.
While we are positive we will use `std… | ||||||||||||
Does this mean you'd recommend using a span<char> over string_view in all cases? I think that might surprise some users, so I wonder if it is worth documenting. xazax.hun: Does this mean you'd recommend using a `span<char>` over `string_view` in all cases? I think… | ||||||||||||
Oh, I was sort of assuming we'd be suggesting string_view over span<char> when appropriate. So agreed that we should clearly document this as it may surprise folks. Out of curiosity, why the strong preference for span? aaron.ballman: Oh, I was sort of assuming we'd be suggesting `string_view` over `span<char>` when appropriate. | ||||||||||||
string_view is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view. string_view's methods such as find will not necessarily make sense over original data, or may turn out to have counter-intuitive behavior when users try to modify the code further. Additionally, string_view doesn't offer write access, so at best we'll be able to recommend it for const pointers. So I think we should gather more data before considering it. NoQ: `string_view` is a bit harder to recommend automatically because whether a certain sequence of… | ||||||||||||
Not Done ReplyInline Actions
But why does that matter for string_view? It's not a null-terminated interface, so it explicitly encodes a pointer/length interface, same as span. aaron.ballman: > string_view is a bit harder to recommend automatically because whether a certain sequence of… | ||||||||||||
We can probably offer both variants in different note-fixits as soon as we teach the machine to recognize the different requirements for these strategies, and we can have string_view go first in the list? Like, I don't want to eliminate the question of programmer's intent from this. It's ok for the programmer to prefer span<char> in some cases, just to make the code more expressive, and we'll have to implement it anyway, so we might as well offer it. I definitely agree that string_view is the best choice when dealing with strings, and humans will absolutely prefer that and we should be recommending that to humans when applicable. I'm just not sure we should completely drop span<char> from automatic recommendations, and I admit that it'll take some extra work to implement recommending string_view instead, which we may or may not rush to implement asap. NoQ: We can probably offer both variants in different note-fixits as soon as we teach the machine to… | ||||||||||||
Not Done ReplyInline Actions
I think that's reasonable enough; it doesn't have to do everything up front. So long as there's a plan (or at least an aspiration) to support multiple containers depending on context and preference, I'm happy enough. I had the (perhaps mistaken) impression this was heading towards *only* recommending span. aaron.ballman: > We can probably offer both variants in different note-fixits as soon as we teach the machine… | ||||||||||||
Note that some of these partial solutions are useful on their own. For example, | ||||||||||||
hardened libc++ can be used as a "sanitizer" to find buffer overflow bugs in | ||||||||||||
existing code. The static analyzer checker is also universally useful, and | ||||||||||||
not tied to this programming model. The warning may be desired independently, | ||||||||||||
steakhal: | ||||||||||||
even in plain C code, if you need to isolate, annotate and audit all | ||||||||||||
buffer-related code in your codebase. That said, some of these guidelines | ||||||||||||
aaron.ballman: | ||||||||||||
don't cause your code to mitigate vulnerabilities better unless hardened | ||||||||||||
containers are in use to carry out runtime checks. | ||||||||||||
Why Not Just Harden All Pointer Operations? | ||||||||||||
------------------------------------------- | ||||||||||||
At a glance, it seems reasonable to ask the compiler to make raw pointer | ||||||||||||
operations crash on out-of-bounds accesses at runtime. And in case of plain C, | ||||||||||||
steakhal: | ||||||||||||
this may be the most practical way to achieve safety. The main problem with | ||||||||||||
such a solution lies in dealing with the fact that most pointers in the code | ||||||||||||
aaron.ballman: | ||||||||||||
don't have bounds information associated with them. Adding such information | ||||||||||||
to the code in order to allow the compiler to perform bounds checks at runtime | ||||||||||||
would result either in a system of attributes significantly more advanced than | ||||||||||||
the current solution, or a significant change in the runtime representation | ||||||||||||
of all pointers. | ||||||||||||
steakhal: | ||||||||||||
It would be helpful to have some details here about why a user would pick these proposed features over use of a sanitizer like asan or hwasan, which also finds out of bounds accesses. aaron.ballman: It would be helpful to have some details here about why a user would pick these proposed… | ||||||||||||
A different approach is taken by AddressSanitizer, which is amazing at detecting | ||||||||||||
some buffer overflows in the testing phase as it can be used for checking | ||||||||||||
pointer accesses with respect to a global map of all allocations. However, | ||||||||||||
it cannot be used as a security mitigation in production binaries, so it can | ||||||||||||
only help you with vulnerabilities within your test coverage, which is often | ||||||||||||
insufficient for security purposes. | ||||||||||||
This makes it sound like the proposal doesn't cover C because there is no version of std::span there. aaron.ballman: This makes it sound like the proposal doesn't cover C because there is no version of `std… | ||||||||||||
I read this as a question as: "Could an imaginary codebase benefit from the warnings if there were no fixits?". I can imagine projects where it would be very useful to have 90% of the code guaranteed to be free of "potentially dangerous pointer arithmetic" and have such operations contained to the remaining 10% of the code. I can imagine a principled solution to bounds-safety in C but even just splitting the code into safe and 9x smaller "unsafe" part could be extremely useful when it comes to debugging, audits, automated and manual testing and similar. Having clang guarantee this property reduces the scope of code that needs to be processed and a deliberate use of this model by projects can unlock new possibilities. That is why I feel that the proposal can benefit also C projects. jkorous: I read this as a question as: "Could an imaginary codebase benefit from the warnings if there… | ||||||||||||
Not quite. It's whether an imaginary codebase can benefit from the warnings when there are no available standardized alternatives that are a better replacement and no C standard library hardening done. Basically, it feels like enabling this warning in a C code base is not the same as a C++ code base. In C++, it's "enable the warning, start using these awesome replacements, and get to a good state" and in C, it's "enable the warning, start implementing exemptions everywhere, and hope they're correct or that your test coverage saves you from mistakes." aaron.ballman: > I read this as a question as: "Could an imaginary codebase benefit from the warnings if there… | ||||||||||||
Yes that's definitely true.
Hmm, I didn't intend to consider this possibility *here*. This section is mostly motivational, I wanted to explain why the default recommended setup makes sense. That's why I start this section with a "Whelp for C it'd probably be better to build something completely different, even if this proposal is better than nothing". I added a section "What About Plain C?" to discuss this once in detail, and let the rest of the proposal focus on C++. NoQ: > Basically, it feels like enabling this warning in a C code base is not the same as a C++ code… | ||||||||||||
Fortunately, in case of C++, there already is a standard solution for passing | ||||||||||||
bounds information alongside the pointer, namely the ``std::span``. | ||||||||||||
With the help of safe standard containers such as ``std::span``, you can achieve | ||||||||||||
bounds safety by *writing better modern C++ code*. The proposed C++ modernizer | ||||||||||||
is designed to help you with that. | ||||||||||||
aaron.ballman: | ||||||||||||
But, Really, What About Plain C? | ||||||||||||
-------------------------------- | ||||||||||||
While plain C does not offer a natural way to combine pointers with bounds | ||||||||||||
information, parts of this solution may prove useful. In particular, | ||||||||||||
``#pragma clang unsafe_buffer_usage`` and ``#pragma clang only_safe_buffers`` | ||||||||||||
may be used for isolating code that deals with buffers. Such code can be | ||||||||||||
carefully audited and the rest of the code may be taught to only interact with | ||||||||||||
buffers through functions provided by the audited code. Additionally, attribute | ||||||||||||
``[[clang::unsafe_buffer_usage]]`` can be used for annotating functions | ||||||||||||
steakhal: | ||||||||||||
that have bounds-safe alternatives. | ||||||||||||
We should mention as part of the model description that pointer-chasing structures like linked lists, trees, etc are not covered by the model. (That said, I suspect the *implementation* of such things might be covered. e.g., a tree with Node *Children[]) aaron.ballman: We should mention as part of the model description that pointer-chasing structures like linked… | ||||||||||||
Added a couple paragraphs above the list to cover that. NoQ: Added a couple paragraphs above the list to cover that. | ||||||||||||
However, no automatic code modernizer for plain C is not provided, | ||||||||||||
aaron.ballmanUnsubmitted Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
and the hardened C++ standard library cannot benefit C code, which limits | ||||||||||||
usefulness of the proposed integrated programming model in environments | ||||||||||||
I think in many cases it is more desirable to accept a view as a view can be constructed from multiple different containers. An API that accepts a span can work with std::vectors and std::arrays or even some 3rd party containers. This makes me think maybe mentioning views first and the other options later would result in a better guideline. xazax.hun: I think in many cases it is more desirable to accept a view as a view can be constructed from… | ||||||||||||
where C++ cannot be used. | ||||||||||||
And what should C users do? aaron.ballman: And what should C users do? | ||||||||||||
The Programming Model for C++ | ||||||||||||
============================= | ||||||||||||
Similar to the single quote situation. aaron.ballman: Similar to the single quote situation. | ||||||||||||
Let's start with what your C++ code would need to look like in order to | ||||||||||||
reap benefits of the safe buffers programming model. | ||||||||||||
Because the model focuses on preventing buffer overflows, other pointer-based | ||||||||||||
data structures that don't deal with buffers (such as linked lists or trees) | ||||||||||||
aren't affected by it, don't require any code changes, nor would they benefit | ||||||||||||
from additional security once the rest of the code is transformed | ||||||||||||
to comply to the model. In particular, problems like use-after-free and | ||||||||||||
iterator invalidation are out of scope. | ||||||||||||
steakhal: | ||||||||||||
For dealing with buffers, the programming model encourages you to make your code | ||||||||||||
provide correct bounds information near every buffer access. Such information | ||||||||||||
can be consumed either by the standard library to conduct runtime bounds checks, | ||||||||||||
or by the code itself with assertions or safe idioms such as range-based | ||||||||||||
for-loops. In order to keep bounds information correct, the following | ||||||||||||
I would be surprised if we can find a heuristic that we'd feel confident is correct for most situations. e.g., Foo buffer[10]; might be a flat array... or it might be a ring buffer without benefit of a wrapper class... or it may be a sparse matrix... and so on. aaron.ballman: I would be surprised if we can find a heuristic that we'd feel confident is correct for most… | ||||||||||||
Yeah it's very non-trivial. A much harder call than the legendary hack in the fixit for __attribute__((fallthrough)). I'm thinking of a combination of a clang flag and an attribute, to mark sufficiently suitable classes and ask the user to pick one of them as the new default. It's not a high priority for us, and it'll definitely require some work to formalize the precise class interface / contract we're relying on in our fixits, but I totally see us doing it eventually. NoQ: Yeah it's very non-trivial. A much harder call than the legendary hack in the fixit for… | ||||||||||||
guidelines should be followed: | ||||||||||||
I think this would be a good idea, especially if you can use an example of a more involved (but still common) container. Like, how does someone harden a ring buffer? aaron.ballman: I think this would be a good idea, especially if you can use an example of a more involved (but… | ||||||||||||
I believe that we should stick to hardening low-level primitives - all that a ring buffer implementation has to do is to store data in std::array from hardened libc++ and it's all set. jkorous: I believe that we should stick to hardening low-level primitives - all that a ring buffer… | ||||||||||||
I think it would be great to link to some other resources like how to annotate your container such that address sanitizer can help you catch OOB accesses. xazax.hun: I think it would be great to link to some other resources like how to annotate your container… | ||||||||||||
That doesn't help real world projects like LLVM where we replace the low-level primitives when they're not appropriate for our domain. I think we should assume that users will implement their own containers and support that situation from the start. aaron.ballman: > I believe that we should stick to hardening low-level primitives - all that a ring buffer… | ||||||||||||
Yes, so I think we want to eventually cover low-level primitives like SmallVector, and then ask people to implement their higher-level primitives like ring buffers in terms of the covered low-level primitives. The situation when people really need to have ultra-optimized high-level primitives that are implemented directly with raw buffers, and then optionally harden that, becomes significantly more rare. In such cases maybe it's ok not to have a clear step-by-step guide. But a ring buffer still makes a good example! I'll think more about that. NoQ: Yes, so I think we want to eventually cover low-level primitives like `SmallVector`, and then… | ||||||||||||
1. When you allocate a buffer, allocate a container instead. For example, | ||||||||||||
if you find yourself allocating a fixed-size array, use ``std::array``. | ||||||||||||
If you need to allocate a stack or heap buffer of variable length, | ||||||||||||
``std::vector`` can often act as a drop-in replacement. A vector can do | ||||||||||||
a lot more than that, but you can always resort to a fill-constructor that | ||||||||||||
preallocates the buffer of necessary size, and then never use any resizing | ||||||||||||
operations. This gives you a simple safe buffer at no extra cost other than | ||||||||||||
the cost of safety. Another good solution is ``std::span`` which you can | ||||||||||||
Isn't this too restrictive? How about arrays where both the index and the size of the array is known at compile time? Also, what about subscripts in consteval code where the compiler should diagnose OOB accesses at compile time? I believe this model can be made more ergonomic without losing any of the guarantees. xazax.hun: Isn't this too restrictive? How about arrays where both the index and the size of the array is… | ||||||||||||
Small ping on this point. I think there are many code patterns that are completely safe (i.e., the compiler can diagnose OOB accesses), but the current model would ban. One example is converting an enum value to string using an array of string_views. In those cases, both enum consts' value and the array's size are known at compile time. I think those easy to diagnose special cases should be permitted to make programming more ergonomic. The more ergonomic the experience, the faster the adoption will be. xazax.hun: Small ping on this point. I think there are many code patterns that are completely safe (i.e. | ||||||||||||
Yes, we'll probably add occasional suppressions for known safe cases like this, as long as we're sure we'll diagnose if safety is lost in later code changes. And then document them here. It's probably even reasonable to suppress the warning when the index is a loop counter and the loop bound is known at compile time to be sufficiently small. Though it'll take some work to implement (re-use static analyzer's loop unrolling prerequisites code?) (uh-oh, I thought I sent this reply before the conference) NoQ: Yes, we'll probably add occasional suppressions for known safe cases like this, as long as… | ||||||||||||
Added a clause about known sizes. I'll remove it if we decide it's not feasible. NoQ: Added a clause about known sizes. I'll remove it if we decide it's not feasible. | ||||||||||||
initialize with a heap pointer allocated with ``new[]``; it doesn't assume | ||||||||||||
unique ownership so you can copy such span if you want, just like | ||||||||||||
a raw pointer, but of course, you'll have to manually ``delete[]`` it | ||||||||||||
Sphinx is picky about indentation, and it also complains about the missing new-line prior to the sub-list. steakhal: Sphinx is picky about indentation, and it also complains about the missing new-line prior to… | ||||||||||||
Uh-oh, great catch! NoQ: Uh-oh, great catch! | ||||||||||||
when appropriate. While it may be desirable to use ``std::shared_ptr<T>`` | ||||||||||||
Not Done ReplyInline ActionsOne case I think this is going to trip up is: this + 1 (which is not uncommon in practice: https://sourcegraph.com/search?q=context:global+-file:.*test.*+file:.*%5C.cpp+this+%2B+1&patternType=standard). How do users fix up that pointer arithmetic and do they ever get any safety or security benefits from it? (Note, this doesn't apply to +=, just +.) aaron.ballman: One case I think this is going to trip up is: `this + 1` (which is not uncommon in practice… | ||||||||||||
Not Done ReplyInline ActionsThat's a fair point and if we don't find a good solution we should ignore arithmetic with this and a constant. jkorous: That's a fair point and if we don't find a good solution we should ignore arithmetic with… | ||||||||||||
I'll leave this unaddressed for now. It looks like these constructs typically *do* indicate that the remaining part of the object is a buffer. C++ doesn't seem to provide a good "safe" way to represent such objects. So, yeah, we need more data. NoQ: I'll leave this unaddressed for now. It looks like these constructs typically *do* indicate… | ||||||||||||
Not Done ReplyInline ActionsFair enough -- it's worth documenting explicitly as an open question. FWIW, I agree that the construct is mostly used for accessing trailing objects as a buffer. The closest thing I've seen to a safe way to handle this is how LLVM does it with the TrailingObjects mixin. aaron.ballman: Fair enough -- it's worth documenting explicitly as an open question. FWIW, I agree that the… | ||||||||||||
for shared access to the buffer, note that the standard library cannot | ||||||||||||
provide hardening in such cases. | ||||||||||||
2. When you provide function or class interfaces that accept or return | ||||||||||||
raw buffers, consider using standard "view" classes such as ``std::span`` or | ||||||||||||
NULL as well, I presume? (This matters for C where NULL often expands to (void *)0.) aaron.ballman: `NULL` as well, I presume? (This matters for C where `NULL` often expands to `(void *)0`.) | ||||||||||||
Hmm, I was thinking that ((void *)0) is a compile-time constant 0. Like, I didn't say it's a *literal*. Is there a better term? NoQ: Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. Like, I didn't say it's… | ||||||||||||
Are you okay with constant expressions? e.g., consteval int foo() { return 0; } void func(char *ptr) { char *offset_ptr = ptr + foo(); // Is this fine? } aaron.ballman: Are you okay with constant expressions? e.g.,
```
consteval int foo() { return 0; }
void func… | ||||||||||||
Yes sure, why not? If it ever becomes unsafe, we'll immediately warn. NoQ: Yes sure, why not? If it ever becomes unsafe, we'll immediately warn. | ||||||||||||
Not Done ReplyInline ActionsAh, C terminology is biting me here -- we use the term "constant" the same way C++ uses the term "literal", so when I read "constant 0" my brain interpreted that as literally just 0. aaron.ballman: Ah, C terminology is biting me here -- we use the term "constant" the same way C++ uses the… | ||||||||||||
``std::string_view``. Spans are particularly useful as they can be implicitly | ||||||||||||
Eventually we should extend this to have an exhaustive list. aaron.ballman: Eventually we should extend this to have an exhaustive list. | ||||||||||||
Absolutely! Is it ok if we populate the list as we land the implementation? jkorous: Absolutely! Is it ok if we populate the list as we land the implementation? | ||||||||||||
Populating the list as we land the implementation sounds great to me. aaron.ballman: Populating the list as we land the implementation sounds great to me. | ||||||||||||
constructed from arbitrary containers which keeps the interface flexible. | ||||||||||||
steakhal: | ||||||||||||
Alternatively, you can pass the original container – move it or pass | ||||||||||||
by reference or use a smart pointer, but that locks the function down to | ||||||||||||
a specific container type, so this is only useful if you want to use | ||||||||||||
container-specific methods inside the function. | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
3. If the interface you're providing needs to preserve compatibility with | ||||||||||||
other code already written to use the old raw buffer interface, annotate the | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
raw buffer interface function with the ``[[clang::unsafe_buffer_usage]]`` | ||||||||||||
attribute. This would inform your clients that there exists a safer | ||||||||||||
alternative so that they could gracefully convert their code to comply with | ||||||||||||
the programming model as well. | ||||||||||||
4. If you deal with interfaces you cannot change, such as an interface of | ||||||||||||
a third-party library you're using, and these interfaces require you to pass | ||||||||||||
buffers as raw pointers, make sure you "containerize" these buffers | ||||||||||||
*as soon as possible*. For example, if the library returns a raw buffer | ||||||||||||
What does auto-attached mean? steakhal: What does //auto-attached// mean? | ||||||||||||
Means they show up in fixits as suggested code. Clarified, thanks! NoQ: Means they show up in fixits as suggested code. Clarified, thanks! | ||||||||||||
pointer, put it into ``std::span`` in order to immediately write down | ||||||||||||
the size of that buffer for the purposes of future bounds checks. | ||||||||||||
Then keep such buffers contained this way *for as long as possible*, | ||||||||||||
especially when passing across function boundaries. Say, until you need to | ||||||||||||
pass it back to that library, you can pass ``std::span`` by value between | ||||||||||||
your functions to preserve precise size information naturally. | ||||||||||||
5. Sometimes you will find yourself implementing a custom container. | ||||||||||||
Say, ``std::vector`` or ``std::span`` may turn out to be poorly suitable | ||||||||||||
for your needs, and this is fine. In such cases, the same guidelines apply. | ||||||||||||
You may have to use the opt-out pragma on the implementation of | ||||||||||||
the container – that's exactly their purpose! Additionally, consider | ||||||||||||
implementing runtime checks in your container similar to the ones already | ||||||||||||
present in hardened libc++, because following the guidelines alone is often | ||||||||||||
insufficient without such hardening. | ||||||||||||
(TODO: Will automatic fixits be able to suggest custom containers or views?) | ||||||||||||
(TODO: Explain how to implement such checks in a custom container?) | ||||||||||||
Compiler Tooling And Enforcement | ||||||||||||
================================ | ||||||||||||
Now that we know what the code needs to look like, let's talk about what can | ||||||||||||
the compiler do for you to help you with converting your code, as well as | ||||||||||||
making sure that the code stays safe after later changes. | ||||||||||||
Not Done ReplyInline ActionsHave you considered allowing a short-hand form of the pragmas that are scoped to the current block scope (or file scope if applied at global scope)? e.g., void func() { { #pragma only_safe_buffers int stuff[10] = {}; int x = stuff[1]; // Warning: you're a bad person who should feel bad } int array[10] = {}; int y = array[1]; // this_is_fine.png } aaron.ballman: Have you considered allowing a short-hand form of the pragmas that are scoped to the current… | ||||||||||||
Not Done ReplyInline ActionsWe discussed other options but generally prefer simple pragmas for their flexibility. jkorous: We discussed other options but generally prefer simple pragmas for their flexibility.
Our… | ||||||||||||
Not Done ReplyInline Actions
A single pragma with lexical scoping behaves like RAII, which is often a replacement for more complex acquire/release paired operations and is a pretty well-understood pattern in C++.
True, it is easy to reason about. But not allowing nesting for readability purposes removes some utility, especially for folks in C where macros are much more common. e.g., #define ONLY_SAFE_BUFFERS _Pragma("only_safe_buffers") #define AWESOME_BUFFER_THING(ptr, size) ({ ONLY_SAFE_BUFFERS; (interesting work goes here) }) void func(int *ptr, size_t size) { ONLY_SAFE_BUFFERS; ... interesting code lives here ... AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested. .. more interesting code lives here ... } (You can hit the same situation with paired pragmas.) Because C doesn't have constexpr or consteval support for functions or templates, function-like macros are a pretty common interface folks use in practice and macro expansion makes it somewhat hard to avoid potential surprising conflicts. aaron.ballman: > We discussed other options but generally prefer simple pragmas for their flexibility.
A… | ||||||||||||
Not Done ReplyInline ActionsIt is desirable to minimize the extent of code labeled as unsafe - if a single line of code is unsafe then it'd be suboptimal do label a whole scope as unsafe. Separately from that - in theory the users could just end the scope before the macro: ONLY_SAFE_BUFFERS_BEGIN; ... interesting code lives here ... ONLY_SAFE_BUFFERS_END; AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested. But point taken - there might be options for better ergonomics. jkorous: It is desirable to minimize the extent of code labeled as unsafe - if a single line of code is… | ||||||||||||
Not Done ReplyInline Actions
FWIW, I wasn't suggesting we support *only* the scoped form, but that we *also* supported a scoped form. The benefit really boils down to more complex functions (generally ones in need of refactoring anyway) where you already have an appropriate scope handy and pairing the pragmas would be clutter. e.g., if (ptr) { #pragma clang unsafe_buffer_usage ptr += 12; } vs if (ptr) { #pragma clang unsafe_buffer_usage begin ptr += 12; #pragma clang unsafe_buffer_usage end } I don't see a need to write two pragmas in these kinds of cases where the compound scope is already a clear delimiter. WDYT? aaron.ballman: > In this sense associating the pragmas' semantics with scopes reduces their flexibility. Is… | ||||||||||||
The Warning | ||||||||||||
----------- | ||||||||||||
steakhal: | ||||||||||||
The warning ``-Wunsafe-buffer-usage`` warns on the following operations: | ||||||||||||
aaron.ballman: | ||||||||||||
- Array subscript expression on raw arrays or raw pointers, | ||||||||||||
- unless the index is a compile-time constant ``0``, | ||||||||||||
- or, in case of arrays, if both the index and the array size is known | ||||||||||||
at compile time and the index is within bounds; | ||||||||||||
- Increment and decrement of a raw pointer with operators ``++`` and ``--``; | ||||||||||||
- Addition or subtraction of a number to/from a raw pointer with operators | ||||||||||||
``+``, ``-``, ``+=``, ``-=``, | ||||||||||||
Not Done ReplyInline Actions
Er, I'm not certain how much I agree with that. push/pop semantics on this have been around for a *long* time. I do agree that the user can write code such that another user may struggle to understand what the state of the system is after a pop operation, but that shouldn't be a deciding factor for whether we add a new pragma to do the same thing as another pragma. I think I'd rewrite this as: Similar functionality can be achieved with ``#pragma clang diagnostic``, but such use is discouraged because it reduces code readability due to being an imperative action rather than describing a property of the annotated code. That reduced readability also adds complexity to security code audits. Use of one of the dedicated pragmas is strongly recommended. or something along these lines. WDYT? aaron.ballman: > a custom pragma is preferable because the generic solution’s push/pop mechanism may be… | ||||||||||||
Not Done ReplyInline ActionsWe'd love to arrive at a design that is very difficult to use incorrectly. You are absolutely right that our preference for not using the diagnostic pragmas stems from the "untyped" pop operation. Assuming this code is a starting point: #pragma clang diagnostic push warning "-Wfoo" // code #pragma clang diagnostic pop //end of -Wfoo // more code if the user adds pragmas like this: #pragma clang diagnostic push warning "-Wfoo" #pragma clang diagnostic push warning "-Wsafe-buffers-usage" // code #pragma clang diagnostic pop //end of -Wfoo // more code #pragma clang diagnostic pop //end of -Wsafe-buffers-usage they actually get this behavior: #pragma clang diagnostic push warning "-Wfoo" #pragma clang diagnostic push warning "-Wsafe-buffers-usage" // code #pragma clang diagnostic pop //end of -Wsafe-buffers-usage // more code #pragma clang diagnostic pop //end of -Wfoo We should have the patch for pragmas ready to be put for review here in not too distant future. Maybe it'd be easier to discuss there? jkorous: We'd love to arrive at a design that is very difficult to use incorrectly.
You are absolutely… | ||||||||||||
Not Done ReplyInline Actions
Certainly! I was mostly trying to reword what you had to keep the same intent but not make it sound like the basis for the pragmas is "we think these other pragmas, which do work, were not good enough" because that's not very technically compelling. aaron.ballman: > We should have the patch for pragmas ready to be put for review here in not too distant… | ||||||||||||
Reworded a bit and added the example! NoQ: Reworded a bit and added the example! | ||||||||||||
- unless that number is a compile time constant ``0``; | ||||||||||||
- subtraction between two pointers is also fine; | ||||||||||||
- Passing a pointer through a function parameter annotated with | ||||||||||||
attribute ``[[clang::unsafe_buffer_usage]]``, | ||||||||||||
That's stealing a name from the C and C++ committees because it's not behind a vendor attribute namespace. I think it should be [[clang::unsafe_buffer_usage]] or unsafe_buffer_usage (with no specific syntax) instead. aaron.ballman: That's stealing a name from the C and C++ committees because it's not behind a vendor attribute… | ||||||||||||
Yes, absolutely, I'll update that as we get closer to landing the attribute. NoQ: Yes, absolutely, I'll update that as we get closer to landing the attribute. | ||||||||||||
- unless the pointer is a compile time constant ``0`` or ``nullptr`` | ||||||||||||
(possibly a result of simple operations, such as C-style ``NULL`` | ||||||||||||
that expands to ``((void) 0)`` which can be "folded" to ``0`` | ||||||||||||
at compile time); | ||||||||||||
- a number of C/C++ standard library buffer manipulation functions | ||||||||||||
(such as ``std::memcpy()`` or ``std::next()``) are hardcoded to act | ||||||||||||
as if they had the attribute. | ||||||||||||
The warning doesn't warn on single pointer use in general, such as | ||||||||||||
dereferencing operations like ``*`` or ``->`` or ``[0]``. If such operation | ||||||||||||
causes a buffer overflow, there's probably another unsafe operation nearby | ||||||||||||
that the warning does warn about. Pointer-based data structures | ||||||||||||
such as linked lists or trees are allowed as they don't typically cause | ||||||||||||
buffer overflows. "Temporal" safety issues that arise from using raw pointers, | ||||||||||||
such use-after-free, null pointer dereference, dangling pointers, | ||||||||||||
reference invalidation, are out of scope for this warning. | ||||||||||||
The warning also doesn't warn every time a pointer is passed into a function, | ||||||||||||
but only when the function is annotated with the attribute. Because | ||||||||||||
Not Done ReplyInline ActionsOne thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example). I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give. Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?) aaron.ballman: One thing I think is worth pointing out about these docs is that the first example effectively… | ||||||||||||
Not Done ReplyInline Actions
Our model depends on Standard Library used implementing the bounds checks. Let's assume we're starting with this buggy code: void caller() { int buffer[10]; callee(buffer, 20); } void callee(int* ptr, unsigned size) { ptr[size-1] = 42; ptr[size] = 42; ptr[100] = 42; } and transform it to (this is just a part of what we actually plan but hopefully illustrates the point): void caller() { int buffer[10]; callee(std::span<int>(buffer, 20), 20); } void callee(std::span<int> ptr, unsigned size) { ptr[size-1] = 42; // still unmitigated ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++ ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++ } jkorous: > One thing I think is worth pointing out about these docs is that the first example… | ||||||||||||
Not Done ReplyInline Actions
So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
Understood; my point was that the comments in callee() are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source. aaron.ballman: > Our model depends on Standard Library used implementing the bounds checks.
So will the model… | ||||||||||||
Not Done ReplyInline ActionsThe warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors. jkorous: The warnings about unsafe buffer usage would be valid no matter what Standard Library… | ||||||||||||
Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant. So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error. On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that. Does that make sense? NoQ: > One thing I think is worth pointing out about these docs is that the first example… | ||||||||||||
Not Done ReplyInline Actions
It does, but the situation I'm thinking of is where the user is migrating their unsafe code to this new programming mode and aren't able to use libc++ to get the extra hardening checks. Consider this contrived example: extern "C" void func(int *array, size_t count) { // Do buffer things with array and count } The user can't (easily) modify this code to accept a span directly because that could potentially break ABI. So they wrap the data in a span: extern "C" void func(int *array, size_t count) { std::span<int> sp(array, count); // Do buffer things with sp } In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span -- but more trustworthy based on what? There's no hardened checks in the STL they're using and the potentially-wrong information is the same flavor of potentially-wrong in either case. My fear is that we push users to make changes like that by telling them it's safer when it's actually functionally equivalent instead, but when they migrates their code they accidentally *introduce* a new bug where there wasn't one before. I'm used to thinking about this sort of stuff as a kind of taint analysis. We need to know where that data comes from in order to know whether we can trust it or not. Did the "count" come from user input on the command line... or did it come from a call to .size() on some other container? These have different levels of trust (to me). Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like extern "C" or the function linkage to try to reduce the amount of ABI breaks suggested? aaron.ballman: > Does that make sense?
It does, but the situation I'm thinking of is where the user is… | ||||||||||||
No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need [[unsafe_buffer_usage]]) under extern "C" linkage. But on the C side, there's no expectation of having a safe alternative in the first place. Calls to this function become just another unsafe operation, which we can hope to isolate and reduce, but never to eliminate. So it's fine that a safe alternative isn't provided at all.
So, yes, we want fixes on interface boundaries, but we only want fixes that don't break source or binary compatibility. So our fixes will always preserve the original interface, possibly annotate it as unsafe, and then provide a safe alternative that call sites can, but *don't have to*, transition to. So in this case we're aiming for something like this: [[clang::unsafe_buffer_usage]] extern "C" void func(int *array, size_t count) { func(std::span<int>(array, count)); } void func(std::span<int> sp) { // Do buffer things with sp } where the original function is preserved, the code is not duplicated, and the safe function doesn't have extern "C". In this case the original function is still unsafe (wears the attribute) because the container isn't coming from outside. NoQ: > In both cases, the user is taking it on faith that the data passed into the function is… | ||||||||||||
Not Done ReplyInline Actions
My point is that the assumption seems invalid to me -- it's a matter of garbage-in, garbage-out, but you're saying "this potentially garbage-in situation is assumed to not happen" and then building analysis checks around that assumption. I realize we have to make some heuristic decisions for the analysis, but this one seems dangerous for something attempting to improve buffer safety because nothing validates that assumption. Or do you have plans for the future to add an analysis to cover the situation of shoving garbage into the span? To be clear, this is the situation I'm concerned about: // SomeLibraryTU.cpp void func(int *buffer, size_t count) { // Old interface for compatibility func(std::span<int>(buffer, count)); } void func(std::span<int> sp) { // New, safe interface if (!sp.empty()) { int *data = sp.data(); // Do stuff on data (other than pointer arithmetic, only looking at first element) } } // SomeApplicationTU.cpp int main() { func(0, 1); // Oops, wrong count! } My complaint is that the library code is perfectly safe but the application code is not, and it sounds like this situation is purposefully not considered as part of this analysis... but it seems like this situation is the common case we'd *want* to catch. aaron.ballman: >>In both cases, the user is taking it on faith that the data passed into the function is… | ||||||||||||
We do have plans for a static analyzer checker that warns about shoving garbage into the span! - that's the last bullet point in the intro. We expect it to be disproportionally effective within this programming model because the programming model recommends constructing the span "as early as possible", pretty much immediately after allocation, which makes the problem very local and therefore easy to detect with path-sensitive analysis. We also think it's much harder to shove garbage into span in the first place. The size parameter of the span is extremely unambiguous, there's nothing else you can think of putting there. With an arbitrary function you see for the first time, it's tempting to pass eg. the amount of data you expect to be written into the destination buffer, instead of the amount of room in the destination buffer. Span eliminates this problem because you know that whatever you put into the size parameter is a property of the span itself, it has nothing to do with the function you pass it into. In a lot of cases you don't even *have to* specify the size, you can simply implicitly-convert your container into a span of the right size. That said, this isn't exactly why we make this distinction. What we're trying to do here is reduce *the number of times* a span needs to be manually constructed (and then decomposed back into pointer and size, only to be constructed again). Because our goal is to construct every span manually at most once, right next to the allocation. And when you're literally looking at the allocation, the size becomes *obvious*. And after that, by the virtue of conforming to the model there's no longer a need to use unsafe wrap-unwrap operations over those spans, so everything becomes correct by construction. So despite not addressing the problem of initial span construction directly, this choice decreases the amount of times the programmer needs to solve this problem, and it decreases the complexity of this problem by making the programmer solve it only at those few moments of time when it's the most convenient to solve. And, yes, a static analyzer checker on top of that.
Hmm I'm not sure I understand this example, how does the lack of buffer operations in the new function kicks in here? It's definitely an expected false negative when a buffer of exactly one element is always used in the code and we therefore think that it's not a buffer at all. But why would you want to use a span at all in such case? And regardless of this problem, if the library vendor consciously provides a new interface in order to intentionally make the code compliant with our model, then they'd also add an attribute on the old interface (but not on the new interface) according to the requirements we're discussing. In this case a misuse of the old interface will issue a warning "please use the new interface" and a misuse of the new interface becomes more difficult due to the above considerations, so everything appears to be working as intended(?) NoQ: > Or do you have plans for the future to add an analysis to cover the situation of shoving… | ||||||||||||
Not Done ReplyInline Actions
Ah, perfect!
Hehe, oh to be this optimistic still. :-D The scenario I expect to see garbage in span is from data that can be user-controlled (the exact stuff you're worried about, to be honest!): pulling in size data from the network or a file format, etc. Something where it's not at all accidental that it's garbage going into the span, but is actually malicious input from an attacker.
It doesn't have to lack buffer operations, I was just trying to make the simplest example I could and I probably obscured my point. My point is mostly that a library interface that takes buffer + size and then wraps it into a span to pass to a span interface is every bit as dangerous as not having the span interface at all. If the data going into the span is tainted, it's game over. So I worry about *assuming* the data in the span is valid and basing diagnostic decisions around that. However, because you're going to also have checkers specifically trying to catch the case where invalid data CAN be shoved into the span (like a taint analysis, I hope?), that basically resolves my concerns. aaron.ballman: > We do have plans for a static analyzer checker that warns about shoving garbage into the span! | ||||||||||||
Yes, so it'll definitely cover basic cases where the allocation size is known to be smaller than span size on a given path (as constants or symbolically), eg. int *p = new int[5]; std::span<int> sp(p, 7); // warn int *q = new int[N]; std::span<int> sq(q, N + 1); // warn int *r = new int[N]; if (M > N) std::span<int> sr(r, M); // warn Then, yes, taint analysis could be a straightforward addition to that checker: int *p = /* suppose we don't even know how it's allocated */; size_t S; scanf("%zu", &S); std::span<int> sp(p, S); // warn Though, of course, this will imply moving taint analysis out of alpha and making sure we have a way to annotate user-defined taint sources, but the checker can be taught to interact with that very easily. And such taint checker will be, again, disproporitonally effective within our programming model because spans are constructed immediately next to taint sources, so no need for heavy interprocedural analysis. And maybe we'll consider an opt-in aggressive mode that warns every time the size isn't "related" to the allocation size (in terms of symbolic expressions, so assume M and N are actually unrelated values): int *q = new int[N]; std::span<int> sq(q, M); // warn where the warning can be suppressed with assert(M <= N); (we'll have to double-check that our constraint solver is up to the task). Which still has limited efficiency because the allocation size has to be known (even if symbolic), unlike the taint case where it's sufficient (but not necessary) to know that the span size unconstrained. NoQ: Yes, so it'll definitely cover basic cases where the allocation size is known to be smaller… | ||||||||||||
the attribute can be added to functions by automatic fixits, the warning | ||||||||||||
and the fixes can propagate across function boundaries. The users are also | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
encouraged to annotate their unsafe functions manually. But the warning is | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
not otherwise inter-procedural. | ||||||||||||
The Pragmas | ||||||||||||
----------- | ||||||||||||
In order to aid incremental adoption of the programming model, you are | ||||||||||||
encouraged to enable/disable the warning on a file-by-file basis. Additionally, | ||||||||||||
pragmas are provided to annotate sections of the code as opt-in to the model | ||||||||||||
and activate the warnings: | ||||||||||||
.. code-block:: c++ | ||||||||||||
#pragma clang only_safe_buffers begin | ||||||||||||
... | ||||||||||||
#pragma clang only_safe_buffers end | ||||||||||||
or opt-out of the model and deactivate the warnings on a section of code: | ||||||||||||
.. code-block:: c++ | ||||||||||||
#pragma clang unsafe_buffer_usage end | ||||||||||||
... | ||||||||||||
#pragma clang unsafe_buffer_usage begin | ||||||||||||
Such pragmas not only enable incremental adoption with much smaller granularity, | ||||||||||||
but also provide essential "escape hatches" when the programming model | ||||||||||||
is undesirable for a section of code (such as tight loops in | ||||||||||||
performance-critical code, or implementation of a custom container). In such | ||||||||||||
cases, sections of code with unsafe buffer usage deserve being explicitly marked | ||||||||||||
and easily auditable by security researches. | ||||||||||||
Even though similar functionality can be achieved with the generic pragma | ||||||||||||
``#pragma clang diagnostic``, our specialized pragmas are preferable because | ||||||||||||
they clearly document the property of the code within them. Additionally, | ||||||||||||
it is problematic to "interleave" different instances of | ||||||||||||
``#pragma clang diagnostic push`` and ``#pragma clang diagnostic pop`` | ||||||||||||
between each other, as ``pop`` always pops the last ``push`` and there's no way | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
to pop the one you want: | ||||||||||||
.. code-block:: c++ | ||||||||||||
#pragma clang diagnostic push warning "-Wfoo" | ||||||||||||
Not Done ReplyInline ActionsI don't think I understand what you're trying to say here. aaron.ballman: I don't think I understand what you're trying to say here. | ||||||||||||
I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the function expects *all* users of the function to move away from it. But if the vendor thinks that the function is generally fine, and it's only important to move away from it when the -Wunsafe-buffer-usage analysis is requested by the user, then [[unsafe_buffer_usage]] is more appropriate as it only causes a warning to be emitted under that condition, not every time. I'll wordsmith. NoQ: I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the function… | ||||||||||||
Not Done ReplyInline Actions
Ahhhh that makes a lot more sense to me, thanks! aaron.ballman: > I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the… | ||||||||||||
#pragma clang diagnostic push warning "-Wsafe-buffers-usage" | ||||||||||||
#pragma clang diagnostic pop //end of -Wsafe-buffers-usage | ||||||||||||
#pragma clang diagnostic pop //end of -Wfoo | ||||||||||||
Not Done ReplyInline ActionsThe deprecated attribute does not have a replacement argument, it has an argument for some kind of message to be printed when the deprecated interface is used. So I'm not certain what this is telling me either. aaron.ballman: The deprecated attribute does not have a replacement argument, it has an argument for some kind… | ||||||||||||
Not Done ReplyInline ActionsFor better or worse the deprecated attribute is "overloaded" and this refers to the two-parameter version. https://clang.llvm.org/docs/AttributeReference.html#deprecated "When spelled as attribute((deprecated)), the deprecated attribute can have two optional string arguments. The first one is the message to display when emitting the warning; the second one enables the compiler to provide a Fix-It to replace the deprecated name with a new name." Since it seems confusing even to you - maybe we should take this as a signal that our users would get just as confused and remove the reference to deprecated? jkorous: For better or worse the deprecated attribute is "overloaded" and this refers to the two… | ||||||||||||
Not Done ReplyInline ActionsOhhhh! I see now! What confused me is that the docs spell it as [[deprecated]] which does *not* have an overload that takes two arguments. If we keep this part of the docs, you should probably change it to __attribute__((deprecated)) and link to the attribute docs for it.
Maybe we should, but because I'm still not quite clear on how this relates to the deprecated attribute yet, maybe you could reply with some code examples of deprecated and unsafe_buffer_usage and what diagnostics you get? If that helps clarify, maybe I can help reword the docs to improve clarity. aaron.ballman: Ohhhh! I see now! What confused me is that the docs spell it as `[[deprecated]]` which does… | ||||||||||||
Uh-oh I didn't notice there's a difference. Indeed, will fix. NoQ: Uh-oh I didn't notice there's a difference. Indeed, will fix. | ||||||||||||
versus | ||||||||||||
Not Done ReplyInline ActionsFWIW, I think a significant amount of these attribute docs will move into AttrDocs.td (which should explain all the gory details about the attribute) and we can leave the high-level description here and link to the attribute documentation (rather than duplicate information). aaron.ballman: FWIW, I think a significant amount of these attribute docs will move into AttrDocs.td (which… | ||||||||||||
.. code-block:: c++ | ||||||||||||
#pragma clang diagnostic push warning "-Wfoo" | ||||||||||||
#pragma clang unsafe_buffer_usage begin | ||||||||||||
#pragma clang diagnostic pop //end of -Wfoo | ||||||||||||
#pragma clang unsafe_buffer_usage end //end of -Wunsafe-buffer-usage | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
The Attribute | ||||||||||||
------------- | ||||||||||||
The attribute ``[[clang::unsafe_buffer_usage]]`` should be placed on functions | ||||||||||||
that need to be avoided as they may cause buffer overflows. It is designed to | ||||||||||||
aid automatic fixits which would replace such unsafe functions with safe | ||||||||||||
alternatives, though it can be used even when the fix can't be automated. | ||||||||||||
The attribute is warranted even if the only way a function can overflow | ||||||||||||
the buffer is by violating the function's preconditions. For example, it | ||||||||||||
would make sense to put the attribute on function ``foo()`` below because | ||||||||||||
passing an incorrect size parameter would cause a buffer overflow: | ||||||||||||
.. code-block:: c++ | ||||||||||||
Not Done ReplyInline ActionsI anticipate that doing these fixits for multi-dimensional arrays will be confusing. Is that in scope? xazax.hun: I anticipate that doing these fixits for multi-dimensional arrays will be confusing. Is that in… | ||||||||||||
Yeah it's gonna be fun, it's in scope, we'll see how deep we'll be able to get. Ideally we want to transform them in one step. Of course the procedure is drastically different in int [][] (array of arrays, one contiguous chunk of memory) vs. int ** (span of spans, each span in its own heap allocation). NoQ: Yeah it's gonna be fun, it's in scope, we'll see how deep we'll be able to get. Ideally we want… | ||||||||||||
Not Done ReplyInline ActionsDo you mean that the dimensions get reversed if we replace int arr_3d[3][4][5] with array<array<array<int, 5>, 4>, 3> arr_3d_r? jkorous: Do you mean that the dimensions get reversed if we replace `int arr_3d[3][4][5]` with… | ||||||||||||
Not Done ReplyInline ActionsYeah, I can imagine some people wanting bounds safety but having strong opinions about the reversed dimensions in their code. Moreover, if there are multi-dimensional array typedefs in the nesting, that can make things more complicated. xazax.hun: Yeah, I can imagine some people wanting bounds safety but having strong opinions about the… | ||||||||||||
[[clang::unsafe_buffer_usage]] | ||||||||||||
void foo(int *buf, size_t size) { | ||||||||||||
for (size_t i = 0; i < size; ++i) { | ||||||||||||
buf[i] = i; | ||||||||||||
} | ||||||||||||
} | ||||||||||||
The attribute is NOT warranted when the function has runtime protection against | ||||||||||||
overflows, assuming hardened libc++, assuming that containers constructed | ||||||||||||
outside the function are well-formed. For example, function ``bar()`` below | ||||||||||||
doesn't need an attribute, because assuming buf is well-formed (has size that | ||||||||||||
fits the original buffer it refers to), hardened libc++ protects this function | ||||||||||||
from overflowing the buffer: | ||||||||||||
.. code-block:: c++ | ||||||||||||
void bar(std::span<int> buf) { | ||||||||||||
for (size_t i = 0; i < buf.size(); ++i) { | ||||||||||||
buf[i] = i; | ||||||||||||
} | ||||||||||||
} | ||||||||||||
This corresponds to our safety precaution about keeping buffers "containerized" | ||||||||||||
in spans for as long as possible. Function ``foo()`` may have internal | ||||||||||||
consistency, but by accepting a raw buffer it requires the user to unwrap | ||||||||||||
the span, which is undesirable. | ||||||||||||
The attribute is warranted when a function accepts a raw buffer only to | ||||||||||||
immediately put it into a span: | ||||||||||||
.. code-block:: c++ | ||||||||||||
[[clang::unsafe_buffer_usage]] | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
void baz(int *buf, size_t size) { | ||||||||||||
Not Done ReplyInline Actions
aaron.ballman: | ||||||||||||
std::span<int> sp{ buf, size }; | ||||||||||||
for (size_t i = 0; i < sp.size(); ++i) { | ||||||||||||
sp[i] = i; | ||||||||||||
} | ||||||||||||
Not Done ReplyInline ActionsWhile this might be desirable for some projects, I can imagine other projects want to avoid generating the overload (just generate the fixes in one pass, and apply all of them in a second to avoid a "fixed" header render another TU invalid). But that would require fixits for the call sites. Do you plan to support those batch migration scenarios or is that out of scope? xazax.hun: While this might be desirable for some projects, I can imagine other projects want to avoid… | ||||||||||||
Not Done ReplyInline ActionsWe currently don't have a machinery that would allow us to analyze the whole project at once. Two passes are generally not enough - that number is something like number of unique functions in the deepest call-stack. Part of our fixits is attributing functions as unsafe and application of such fixits can create new warnings in callers. Let's take an example: foo(int* ptr) { ptr[5] = 5; } bar(int* ptr) { foo(ptr); } baz(int* ptr) { bar(ptr); } Initially it is not obvious that bar or baz might get transformed. In the first iteration a fixit for foo is emitted and when applied we get: foo(span<int> ptr) { ptr[5] = 5; } [[clang::unsafe_buffer_usage]] foo(int* ptr); bar(int* ptr) { foo(ptr); } baz(int* ptr) { bar(ptr); } Only now we learn that bar (which can live in a diferent TU) should get transformed as well because it calls to unsafe function foo. But our analysis still won't see that baz will eventually have to be transformed too and it'll take another iteration. jkorous: We currently don't have a machinery that would allow us to analyze the whole project at once. | ||||||||||||
Not Done ReplyInline ActionsI see, this explains it all, thanks! I'd love to have this motivating example in the design document why a 2-pass fix-it-all model does not work. xazax.hun: I see, this explains it all, thanks! I'd love to have this motivating example in the design… | ||||||||||||
} | ||||||||||||
In this case ``baz()`` does not contain any unsafe operations, but the awkward | ||||||||||||
parameter type causes the caller to unwrap the span unnecessarily. | ||||||||||||
In such cases the attribute may never be removed. | ||||||||||||
In particular, the attribute is NOT an "escape hatch". It does not suppress | ||||||||||||
the warnings about unsafe operations in the function. Addressing warnings | ||||||||||||
inside the function is still valuable for internal consistency. | ||||||||||||
Attribute ``[[clang::unsafe_buffer_usage]]`` is similar to attribute | ||||||||||||
[[deprecated]] but it has important differences: | ||||||||||||
* Use of a function annotated by such attribute causes ``-Wunsafe-buffer-usage`` | ||||||||||||
warning to appear, instead of ``-Wdeprecated``, so they can be | ||||||||||||
enabled/disabled independently as all four combinations make sense; | ||||||||||||
* The "replacement" parameter of ``[[deprecated]]``, which allows for automatic | ||||||||||||
fixits when the function has a drop-in replacement, becomes significantly more | ||||||||||||
powerful and flexible in ``[[clang::unsafe_buffer_usage]]`` where it will allow | ||||||||||||
non-trivial automatic fixes. | ||||||||||||
(TODO: Explain parameters of the attribute, how they aid automatic fixits) | ||||||||||||
Code Modernization Workflow With Semi-Automatic Fixits | ||||||||||||
------------------------------------------------------ | ||||||||||||
Not Done ReplyInline ActionsJust a small concern. While users are not allowed to use identifiers like this, there is always a chance someone has a global variable of this name and the code ends up compiling. So, if we want to be absolutely safe, we should either have logic to pick a unique name, or not generate fixits in that case. Although, this is probably rare enough that we could just hand-wave. xazax.hun: Just a small concern. While users are not allowed to use identifiers like this, there is always… | ||||||||||||
Not Done ReplyInline ActionsAgreed; we should not have a fixit that suggests use of a reserved identifier. aaron.ballman: Agreed; we should not have a fixit that suggests use of a reserved identifier. | ||||||||||||
Not Done ReplyInline ActionsOur thinking is to actually use syntactically incorrect placeholder to make sure that users can't accidentally miss it. jkorous: Our thinking is to actually use syntactically incorrect placeholder to make sure that users… | ||||||||||||
Not Done ReplyInline Actions
So long as that happens on a note, that would be fine depending on how you name the identifier. e.g., remember that we have a feature flag to support dollar signs in identifiers, we need to protect against macros, that sort of thing. aaron.ballman: > Our thinking is to actually use syntactically incorrect placeholder to make sure that users… | ||||||||||||
Every time your code preforms an unsafe operation that causes a | ||||||||||||
``-Wunsafe-buffer-usage warning`` to appear, the warning may be accompanied | ||||||||||||
by an automatic fix that changes types of one or more variables associated | ||||||||||||
with the unsafe operation from raw pointer or array type to safe container type. | ||||||||||||
Not Done ReplyInline ActionsI see this part a bit confusing. While it is true that we cannot be sure whether the size is actually the size of the buffer, I wonder if we should at least generate something to call it out the user has further action items. It could be a /*revise me*/ comment after the size, or documentation, both, or something else. xazax.hun: I see this part a bit confusing. While it is true that we cannot be sure whether the size is… | ||||||||||||
Not Done ReplyInline ActionsWe have accepted that our analysis will have some limits. In situations where we have a reasonable suspicion that a certain parameter has the semantics of size we can somehow communicate that to the user. But it is very likely that in some cases the analysis won't be able to tell even that. jkorous: We have accepted that our analysis will have some limits. In situations where we have a… | ||||||||||||
Not Done ReplyInline ActionsNot being able to cover all the cases sounds fine to me, and I'm glad you plan to communicate potential size params to the user when there is reasonable certainty. xazax.hun: Not being able to cover all the cases sounds fine to me, and I'm glad you plan to communicate… | ||||||||||||
For example, the following function contains a local constant-size array. | ||||||||||||
.. code-block:: c++ | ||||||||||||
void use_array(int[] x); | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
void test_array() { | ||||||||||||
int x[5]; | ||||||||||||
x[3] = 3; // Warning: Array indexing is unsafe operation! | ||||||||||||
use_array(x); | ||||||||||||
} | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
The automatic fixit associated with the warning would transform this array | ||||||||||||
Not Done ReplyInline ActionsBecause fixits can be applied automatically to an entire source file, we don't typically allow fix-its that can break code (or fixits where there are multiple choices for the user to pick from) unless they're attached to a *note* instead of a warning (or error). Are you planning to follow that same pattern when placeholders are involved? (For example, one potential use for automatic fix-its is to apply them automatically to source on commit before running precommit CI; similar to how folks sometimes run clang-format to automate "the source is different before/after the clang-format, so you should look into that".) aaron.ballman: Because fixits can be applied automatically to an entire source file, we don't typically allow… | ||||||||||||
Not Done ReplyInline ActionsThat's an extremely interesting question! Let me add our current thinking as a context for the discussion. Our analysis will have its limits and will not be able to provide 100% complete and guaranteed correct code transformation in every case. At the same time it sounds suboptimal both from our and our users perspective to not emit anything for such situations. The next question is - can we not break the code when using placeholders? WDYT? I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path. jkorous: That's an extremely interesting question! Let me add our current thinking as a context for the… | ||||||||||||
Not Done ReplyInline Actions
We document the expectations here: https://clang.llvm.org/docs/InternalsManual.html#fix-it-hints and the interesting bit is: "Fix-it hints on errors and warnings need to obey these rules:
If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on notes are not applied automatically." I think using placeholders on warnings rather than notes will be an issue for both the first and second bullets. aaron.ballman: > I wasn't aware of the precedent related to fixits attached to notes not having to lead to… | ||||||||||||
I like that! This way we can even emit multiple alternative fixes, eg. note: change type of variable 'ptr' to 'std::span' to add bound checks note: change type of variable 'ptr' to 'std::span::iterator' to add bound checks (I'm still not sure we'll ever going to recommend iterators as a good solution, but they're generally a better drop-in replacement for a raw pointer, so despite their increased verbosity, we're considering them at least in some cases) Another thing this can take care of is support for custom containers: note: change type of variable 'vla' to 'std::vector' to add bound checks note: change type of variable 'vla' to 'llvm::SmallVector' to add bound checks Now we can simply present all available containers to the user, and the user can select the one they like the most. NoQ: I like that! This way we can even emit multiple alternative fixes, eg.
```
note: change type of… | ||||||||||||
Not Done ReplyInline ActionsYes, we should look into attaching FixIts that contain placeholders to notes instead. jkorous: Yes, we should look into attaching FixIts that contain placeholders to notes instead. | ||||||||||||
into an ``std::array``: | ||||||||||||
.. code-block:: c++ | ||||||||||||
void use_array(int[] x); | ||||||||||||
void test_array() { | ||||||||||||
std::array<int> x[5]; | ||||||||||||
x[3] = 3; | ||||||||||||
use_array(x.data()); | ||||||||||||
} | ||||||||||||
Note that the use site of variable ``x`` inside the call to ``use_array()`` | ||||||||||||
needed an update. The fixit considers every use site of the fixed variable | ||||||||||||
Not Done ReplyInline Actions
steakhal: | ||||||||||||
and updates them if necessary. | ||||||||||||
In some cases an automatic fix would be problematic, so the warning will simply | ||||||||||||
highlight unsafe operations for you to consider. | ||||||||||||
In some cases a partial fix is emitted, where you'll have to fill in a few | ||||||||||||
placeholders in order to document the bounds information related to the pointer. | ||||||||||||
This is particularly common when suggesting ``std::span``. | ||||||||||||
For example, consider function ``foo()`` we've seen before: | ||||||||||||
.. code-block:: c++ | ||||||||||||
void foo(int *buf, size_t size) { | ||||||||||||
for (size_t i = 0; i < size; ++i) { | ||||||||||||
buf[i] = i; | ||||||||||||
} | ||||||||||||
} | ||||||||||||
In spirit of our guidelines, the automatic fixit would prefer this function | ||||||||||||
to accept a span instead of raw buffer, so that the span didn't need to be | ||||||||||||
unwrapped. Of course, such change alone would break both source compatibility | ||||||||||||
and binary compatibility. In order to avoid that, the fix will provide | ||||||||||||
a compatibility overload to preserve the old functionality. The updated code | ||||||||||||
produced by the fixit will look like this: | ||||||||||||
.. code-block:: c++ | ||||||||||||
[[clang::unsafe_buffer_usage]] | ||||||||||||
void foo(int *buf, size_t size) { | ||||||||||||
foo(std::span<int>{ buf, __placeholder__}, size); | ||||||||||||
} | ||||||||||||
void foo(std::span<int> buf, size_t size) { | ||||||||||||
for (size_t i = 0; i < size; ++i) { | ||||||||||||
buf[i] = i; | ||||||||||||
} | ||||||||||||
} | ||||||||||||
The following changes were performed automatically: | ||||||||||||
- The type of parameter ``buf`` was changed from ``int *`` to | ||||||||||||
``std::span<int>``. Use sites updated if necessary. | ||||||||||||
- A compatibility overload was autogenerated with the old prototype, with | ||||||||||||
attribute ``[[clang::unsafe_buffer_usage]]`` attached to it to encourage | ||||||||||||
the callers to switch to the new function – and possibly update the callers | ||||||||||||
automatically! | ||||||||||||
The following changes need manual intervention: | ||||||||||||
- The compatibility overload contains a ``__placeholder__`` which needs | ||||||||||||
to be populated manually. In this case ``size`` is a good candidate. | ||||||||||||
- Despite accepting a ``std::span`` which carries size information, | ||||||||||||
the fixed function still accepts ``size`` separately. It can be removed | ||||||||||||
manually, or it can be preserved, if ``size`` and ``buf.size()`` | ||||||||||||
actually need to be different in your case. | ||||||||||||
Placeholders fulfill an important purpose as they attract attention to | ||||||||||||
situations where the buffer's size wasn't properly documented for the purposes | ||||||||||||
of bounds checks. Variable ``size`` does not *have* to carry the size of the | ||||||||||||
buffer (or the size of *that* buffer) just becaused it's named "size". | ||||||||||||
The compiler will avoid making guesses about that. | ||||||||||||
The fixits emitted by the warning are correct modulo placeholders. Placeholders | ||||||||||||
are the only reason why fixed code is allowed to fail to compile. | ||||||||||||
Incorrectly resolving the placeholder is the only reason why fixed code | ||||||||||||
will demonstrate incorrect runtime behavior compared to the original code. | ||||||||||||
In an otherwise well-formed program it is always possible (and usually easy) | ||||||||||||
to resolve the placeholder correctly. | ||||||||||||
Note that regardless of how ``__placeholder__`` is resolved, it does not allow | ||||||||||||
you to remove the ``[[clang::unsafe_buffer_usage]]`` annotation. The annotation | ||||||||||||
will stay forever because that function is now equivalent to function ``baz()`` | ||||||||||||
we've seen before: it contains no unsafe operations, but it only offers internal | ||||||||||||
consistency. It is still possible to misuse that function by supplying an | ||||||||||||
invalid ``size`` parameter. It still requires you to unwrap ``std::span`` if you | ||||||||||||
already have it, only to wrap it back immediately. So the callers should still | ||||||||||||
be updated to use the new function, and automatic fixits will now be emitted | ||||||||||||
for the call sites to aid that. | ||||||||||||
Even if you decide to remove the ``size`` parameter, fixits at call sites | ||||||||||||
will remain operational. A warning will be emitted if the replacement function's | ||||||||||||
prototype diverges from the original prototype beyond recognition. In such cases | ||||||||||||
an attribute can be either updated to give more manual hints to the compiler, or | ||||||||||||
changed to a different variant that explicitly opts out of automatic fixits. | ||||||||||||
(TODO: Elaborate on that last point and confirm that we actually want | ||||||||||||
such behavior.) | ||||||||||||
(TODO: Cover more examples.) | ||||||||||||
The Hardened libc++ | ||||||||||||
=================== | ||||||||||||
(TODO: Write this section. Probably just link to its own documentation.) | ||||||||||||
Clang Static Analyzer Checks | ||||||||||||
============================ | ||||||||||||
(TODO: Write this section.) |
For consistency with the rest of our docs, you should make this change throughout.