Page MenuHomePhabricator

[dfsan] Do not specialize vector<bool> for DFSan
Needs ReviewPublic

Authored by stephan.yichao.zhao on Feb 16 2021, 11:06 PM.

Details

Reviewers
morehouse
ldionne
Group Reviewers
Restricted Project
Summary

DataFlow Sanitizer (https://clang.llvm.org/docs/DataFlowSanitizer.html)
tracks dataflow at byte granularity: all bits of a user byte share the
same dataflow label.

The default implementation of vector<bool> causes DFSan uses undefined
labels. The problem is like this:

  1. push_back may reserve new memory when out of capacity. Note that after reserve, __construct_at_end initializes only the first new byte.
  1. The new bit is pushed at this line, https://github.com/llvm/llvm-project/blob/main/libcxx/include/vector#L3067 which eventually calls operator= like this
__bit_reference& operator=(bool __x) _NOEXCEPT
{
    if (__x)
        *__seg_ |= __mask_;
    else
        *__seg_ &= ~__mask_;
    return *this;
}
Here note that
* __x is not assigned to __seg_ directly, but based on if-else. So DFSan
never has a chance to propagate the label of x to __seq_ without
control-flow-tainting.

* recall that reserve does not zero out new bytes. So the read of __seq_
can return undefined. However at bit-level this is fine because
   * ? & 1 = 1 and ? | 0 = 0;
   * although operator[] does not do bound check,
     it is obvious that reading unset bits is undefined.

     However DFSan reads uninitialized shadow values at the case, causing
     over-tainting. Even if DFSan tracks at bit-level, it also needs to
     understand the semantics of | and & to make it work.

This change, when -DDATAFLOW_SANITIZER is used, disables specializing
vector<bool>, forcing using the general implementation. Although this
increases memory cost 8x, this is acceptable to DFSan because DFSan has
memory overhead anyway.

Diff Detail

Event Timeline

stephan.yichao.zhao requested review of this revision.Feb 16 2021, 11:06 PM
stephan.yichao.zhao created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2021, 11:06 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
stephan.yichao.zhao edited the summary of this revision. (Show Details)Feb 16 2021, 11:06 PM
stephan.yichao.zhao added a reviewer: morehouse.
stephan.yichao.zhao added a subscriber: Restricted Project.
ldionne requested changes to this revision.Feb 17 2021, 5:35 AM
ldionne added a subscriber: ldionne.

Sorry, but this doesn't work. We can't change the API when DF Sanitizer is in use. One thing we could do is add some sort of annotation if you have an attribute to disable that specific sanitizer on vector<bool>, but we can't just not have the declaration.

This revision now requires changes to proceed.Feb 17 2021, 5:35 AM

re "We can't change the API". Is this because C++ standard claims that std::vector<bool> must be specialized?

Can we keep all interfaces unchanged, and add some DFSan logic inside a method when the sanitizer is used?
For example, the problem we have is operator= does not propagate DFSan's meta data correctly, and the 'reserve' method does not initialize DFSan's meta data.
When DATAFLOW_SANITIZER is on, the code may add logic to initialize or propagate DFSan's information. This does not change interfaces.
Is this a better approach?

kcc added a subscriber: kcc.Feb 17 2021, 9:04 AM

Why not?
DFSan is a separate ABI, you can't mix DFSan-ified code with non-DFSan-ified code.
There is no annotation that we can imagine to work in this case.
DFSan's metadata is per-byte, if we mix different taints in a single byte (8 packet bits) we get an overtaint, i.e. a false positive.
The only other solution for our users is to not use vector<bool>

miscco added a subscriber: miscco.Feb 17 2021, 11:06 AM

re "We can't change the API". Is this because C++ standard claims that std::vector<bool> must be specialized?

Can we keep all interfaces unchanged, and add some DFSan logic inside a method when the sanitizer is used?
For example, the problem we have is operator= does not propagate DFSan's meta data correctly, and the 'reserve' method does not initialize DFSan's meta data.
When DATAFLOW_SANITIZER is on, the code may add logic to initialize or propagate DFSan's information. This does not change interfaces.
Is this a better approach?

As far as I can tell this should be no blocking issue. Usually the goal is to keep the changes as minimal as possible. E.g. would it suffice if we would zero out after allocating a new block? That would add some "minimal" runtime overhead but if you are running with a sanitizer anyway it should not be that big of a deal

DFSan's memory and CPU overhead is 2-3x because each user byte has 1-2 shadow byte, and user instructions have additional inserted code to propagate shadow data. So the overhead added to std::vector<bool> is acceptable.

The ideal fix is like this change. This way DFSan can accurately track flows.

Zeroing out after reserve can ensure DFSan does not read undefined shadow data, but because vector<bool> updates data at bit-level while dfsan tracks at byte-level, dfsan's tracking would not be accurate.

stephan.yichao.zhao edited the summary of this revision. (Show Details)Feb 17 2021, 12:01 PM

The reason why we can't make this change as-is is that it modifies the API of std::vector when instantiated with bool when the dataflow sanitizer is used. Specifically, the specialization of vector<bool> has a different interface, it's not only an optimization. For example std::vector<bool>::reference has a .flip() method. So if someone is doing something like v[3].flip() (which is legal), now their code won't compile when they turn on the dataflow sanitizer. That's not acceptable, and it would make us non-conforming when the dataflow sanitizer is used.

Frankly, from your description, it sounds like the dataflow sanitizer is just not clever enough (yet?) to understand our code. This isn't specific to std::vector - you'll run into similar issues in a bunch of places in user code. libc++ is hardly the only place in the C++ world to do these kinds of bit manipulations. Now, this is the standard library, so we definitely want to be team players when it comes to enabling useful tooling - that's why we have things like _LIBCPP_NO_CFI to turn off some sanitizers in specific areas of our code where the sanitizers are known to be wrong.

Are there ways of locally turning off the dataflow sanitizer? Is there some sort of annotation that we could use to tell the dataflow sanitizer to track at the bit level for some operations inside vector<bool>? Those are the kinds of solutions that would be acceptable.

Also, independently of the dataflow sanitizer, it looks to me like we do have UB right here:

template <class _Allocator>
void
vector<bool, _Allocator>::push_back(const value_type& __x)
{
    if (this->__size_ == this->capacity())
        reserve(__recommend(this->__size_ + 1));
    ++this->__size_;
    back() = __x; // UB HERE
}

As you mention in the great description for this review (thanks for that by the way), we call back() = __x;, but we might not have not initialized back() yet (if we're about to set the first bit in that byte). @miscco can you confirm that we're on the same page? If so, I'll send a review to fix this. I'm a bit surprised ubsan doesn't catch that, but I guess ubsan isn't a silver bullet.

kcc added a comment.Feb 17 2021, 1:52 PM

The reason why we can't make this change as-is is that it modifies the API of std::vector when instantiated with bool when the dataflow sanitizer is used. Specifically, the specialization of vector<bool> has a different interface, it's not only an optimization. For example std::vector<bool>::reference has a .flip() method. So if someone is doing something like v[3].flip() (which is legal), now their code won't compile when they turn on the dataflow sanitizer. That's not acceptable, and it would make us non-conforming when the dataflow sanitizer is used.

Bummer, I didn't realize that... Ok...

Frankly, from your description, it sounds like the dataflow sanitizer is just not clever enough (yet?) to understand our code.

It's not about being clever, but about trading overhead for accuracy when it comes to bit-packed data structures.
msan maintains a bit-to-bit metadata and thus doesn't have this kind of false positives.

dfsan maintains a byte-to-byte metadata (currently, byte-to-two-bytes, but we are moving to byte-to-byte).
This enables us to maintain 8 taint bits per one application byte, and thus track up to 8 different data flows (taints).
This means a) 8x fewer runs required compared to a single taint tracer and b) ability to detect when two taints merge.
The downside is that in the presence of bit packing we over-taint, i.e. when setting a single bit with a tainted
value we taint the entire byte.
I am not aware of any good solution here, users will have to workaround in their code,
e.g. by not using bit-packed data structures in cases where data flow tracing is critical.

Valgrind/memcheck uses a secondary shadow for bit-to-bit shadow (for uninitialized),
but I don't think we can afford this in dfsan.

This isn't specific to std::vector - you'll run into similar issues in a bunch of places in user code. libc++ is hardly the only place in the C++ world to do these kinds of bit manipulations. Now, this is the standard library, so we definitely want to be team players when it comes to enabling useful tooling - that's why we have things like _LIBCPP_NO_CFI to turn off some sanitizers in specific areas of our code where the sanitizers are known to be wrong.

Are there ways of locally turning off the dataflow sanitizer? Is there some sort of annotation that we could use to tell the dataflow sanitizer to track at the bit level for some operations inside vector<bool>? Those are the kinds of solutions that would be acceptable.

Also, independently of the dataflow sanitizer, it looks to me like we do have UB right here:

template <class _Allocator>
void
vector<bool, _Allocator>::push_back(const value_type& __x)
{
    if (this->__size_ == this->capacity())
        reserve(__recommend(this->__size_ + 1));
    ++this->__size_;
    back() = __x; // UB HERE
}

As you mention in the great description for this review (thanks for that by the way), we call back() = __x;, but we might not have not initialized back() yet (if we're about to set the first bit in that byte). @miscco can you confirm that we're on the same page? If so, I'll send a review to fix this. I'm a bit surprised ubsan doesn't catch that, but I guess ubsan isn't a silver bullet.

Can we keep all interfaces unchanged, and add some DFSan logic inside a method when the sanitizer is used?

This sounds like an acceptable solution.

To double check my understanding, you are proposing vector<bool> owns a small shadow for its contents (2 bytes per bit!) internally, and, for example, implementation of operator[] would copy the right label to its output.

Can we keep all interfaces unchanged, and add some DFSan logic inside a method when the sanitizer is used?

This sounds like an acceptable solution.

To double check my understanding, you are proposing vector<bool> owns a small shadow for its contents (2 bytes per bit!) internally, and, for example, implementation of operator[] would copy the right label to its output.

Yup. This will be adding more code but not be changing interfaces... I need to try, and see how the change would look like.

kcc added a comment.Feb 17 2021, 3:25 PM

(2 bytes per bit!)

1 byte per bit, hopefully. (for the new 8-bit mode only)

Yup. This will be adding more code but not be changing interfaces... I need to try, and see how the change would look like.

I'm open to that solution if it doesn't add much complexity. Please feel free to update this patch with that proposed solution and we'll see what that looks like.

Before doing a complete test, this shows how one potential change would look like.
Please have a look and see if this is on the right way.

What the change does are

  1. extended all operations of __bit_reference/_bit_iterator with dfsan label propagation The two classes take an additional iterator of dfsan label storage
  2. vector<bool> uses <__bit_reference>, and allocates dfsan label storage We use vector<int> as the storage, so we can get all operations like copy/fill for free.
  3. bitset uses <bit_reference>. So it got changed, and this increases dfsan accuracy too. Its change is more than vector<bool> because its logic is not only in bit_reference.
kcc added a comment.Feb 24 2021, 3:35 PM

ugh..
If I were the maintainer of this file, I would run away from this change.
Not because there is something wrong with it functionality-wise, but because of the ifdefs :(
We ourselves in the sanitizer land would reject a change with this many ifdefs w/o looking further.

Some options to explore, even though I am not sure any of those options are nice:

  • Have an ifdef in one place and function calls in all other places, e.g. instead of #ifdef DATAFLOW_SANITIZER do-foo #endif ... #ifdef DATAFLOW_SANITIZER do-bar #endif

have

#ifdef DATAFLOW_SANITIZER
  void doo_foo() { stuff(); } 
  void doo_bar() { other_stuff(); }
#else
  void doo_foo() { }
  void doo_bar() { }   
#endif

 do_foo();
  ...
 do_bar()
  • Have a separate specialization for vector<bool> under dfsan, that would use vector<uint8_t> but implement the rest of the vector<bool> interface however ineffectively.
  • Give up, and fix issues like this on the user side, by not using vector<bool> directly (for cases when we need taint tracking).
  • ???

Yeah, I kinda want to echo what @kcc said. I think there's not enough bang for the buck with this change as it is.

... If so, I'll send a review to fix this. I'm a bit surprised ubsan doesn't catch that, but I guess ubsan isn't a silver bullet.

Did we confirm if this is undefined behavior?