DataFlow Sanitizer (https://clang.llvm.org/docs/DataFlowSanitizer.html)
tracks dataflow at byte granularity: all bits of a user byte share the
same dataflow label.

The default implementation of vector<bool> causes DFSan to use undefined
labels. The problem arises as follows:
- push_back may reserve new memory when the vector is out of capacity. Note that after reserve, __construct_at_end initializes only the first new byte.
- The new bit is then stored at https://github.com/llvm/llvm-project/blob/main/libcxx/include/vector#L3067, which eventually calls operator=:

    __bit_reference& operator=(bool __x) _NOEXCEPT
    {
        if (__x)
            *__seg_ |= __mask_;
        else
            *__seg_ &= ~__mask_;
        return *this;
    }

Note the following:
* __x is not assigned to __seg_ directly; it only selects which branch is
  taken. So DFSan never has a chance to propagate the label of __x to
  __seg_ without control-flow tainting.
* Recall that reserve does not zero out new bytes, so the read of __seg_
  can return an undefined value. At the bit level this is fine because
  * ? | 1 = 1 and ? & 0 = 0, so the written bit is defined either way;
  * although operator[] does not do bounds checking, reading bits that
    were never set is undefined anyway.
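
The two observations above can be made concrete with a small toy model. This is only a sketch: TaintedByte, assign_bit and the single 8-bit label are hypothetical names invented for illustration and do not reflect DFSan's real shadow layout or label-combining rules. It assumes one label per storage byte and no control-flow tainting.

```cpp
#include <cstdint>

// Toy model (hypothetical, not DFSan's real data structures):
// every user byte carries exactly one dataflow label.
struct TaintedByte {
    std::uint8_t value;  // the user byte
    std::uint8_t label;  // the label shared by all 8 bits of that byte
};

// Same if/else shape as __bit_reference::operator=(bool).
// x only selects the branch; it is never an operand of the store, so
// label_of_x has no data-flow path into seg.label.
constexpr TaintedByte assign_bit(TaintedByte seg, std::uint8_t mask, bool x,
                                 std::uint8_t label_of_x) {
    (void)label_of_x;  // deliberately unused: no control-flow tainting
    if (x)
        seg.value |= mask;
    else
        seg.value &= static_cast<std::uint8_t>(~mask);
    // The byte is read, combined with an unlabeled mask and written back,
    // so whatever label it had before (possibly garbage) is kept.
    return seg;
}

// Value level: the target bit is defined for any previous byte content.
static_assert((assign_bit(TaintedByte{0xA5, 7}, 0x02, true, 3).value & 0x02) == 0x02, "? | 1 = 1");
static_assert((assign_bit(TaintedByte{0x5A, 7}, 0x02, false, 3).value & 0x02) == 0x00, "? & 0 = 0");

// Label level: the stale label 7 survives and the label 3 of x is lost.
static_assert(assign_bit(TaintedByte{0xA5, 7}, 0x02, true, 3).label == 7, "stale label kept");
```

In this model the stored bit is always well defined, yet the byte's label is whatever happened to be there before the store, which is the mismatch between bit-level values and byte-level labels described here.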
However, DFSan reads uninitialized shadow values in this case, causing
over-tainting. Even if DFSan tracked labels at bit level, it would also
need to understand the semantics of | and & to make this work.

This change disables the vector<bool> specialization when
-DDATAFLOW_SANITIZER is used, forcing the general implementation. Although
this increases memory cost 8x, that is acceptable for DFSan because DFSan
already has memory overhead.
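
As a rough check of the 8x figure, the sketch below compares the storage needed for n flags when packed eight per byte against one bool object per element. The helper names packed_bytes and unpacked_bytes are hypothetical. The calculation assumes sizeof(bool) == 1, which holds on common platforms but is implementation-defined, and it ignores that the real specialization allocates in storage words rather than single bytes.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only.

// Bytes needed when 8 flags share one byte (the idea behind the
// vector<bool> specialization), rounded up to a whole byte.
constexpr std::size_t packed_bytes(std::size_t n) { return (n + 7) / 8; }

// Bytes needed when every element is a separate bool object, as in the
// general vector<T> implementation.
constexpr std::size_t unpacked_bytes(std::size_t n) { return n * sizeof(bool); }

// Assumption stated above; compilation fails on a platform where it is false.
static_assert(sizeof(bool) == 1, "sketch assumes one-byte bool");

// 1024 flags: 128 bytes packed versus 1024 bytes unpacked, a factor of 8.
static_assert(packed_bytes(1024) == 128, "packed size");
static_assert(unpacked_bytes(1024) == 8 * packed_bytes(1024), "8x ratio");
```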