DataFlowSanitizer (DFSan, https://clang.llvm.org/docs/DataFlowSanitizer.html)
tracks dataflow at byte granularity: all bits of a user byte share the
same dataflow label.
The default implementation of vector<bool> causes DFSan to use undefined
labels. The problem arises as follows:
- push_back may reserve new memory when out of capacity. Note that after reserve, __construct_at_end initializes only the first new byte.
- The new bit is then pushed at https://github.com/llvm/llvm-project/blob/main/libcxx/include/vector#L3067, which eventually calls operator= as follows:
    __bit_reference& operator=(bool __x) _NOEXCEPT {
      if (__x) *__seg_ |= __mask_;
      else     *__seg_ &= ~__mask_;
      return *this;
    }
  Here note that
  * __x is not assigned to __seg_ directly; it only selects a branch of
    the if-else. So DFSan never has a chance to propagate the label of
    __x to __seg_ without control-flow tainting.
  * Recall that reserve does not zero out the new bytes, so the read of
    *__seg_ can return undefined values. At the bit level this is fine
    because
    * ? | 1 = 1 and ? & 0 = 0, so the written bit is well defined
      whatever the old bit was;
    * although operator[] does not do bounds checking, reading bits that
      were never set is undefined anyway.
    However, in this case DFSan reads uninitialized shadow values for the
    byte, causing over-tainting. Even if DFSan tracked labels at the bit
    level, it would also need to understand the semantics of | and & to
    handle this correctly.
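The read-modify-write described above can be modeled with a toy
byte-granularity label tracker. This is only an illustrative sketch: the
names TrackedByte, set_bit, get_bit and label_of_bit are hypothetical and
are not part of libc++ or the DFSan runtime, and the label handling is a
simplified assumption about how a byte-level tracker behaves.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only; NOT the DFSan runtime. One label covers a whole byte,
// and label 0 means "untainted".
using Label = std::uint8_t;

struct TrackedByte {
    std::uint8_t value;  // application data: eight packed bits
    Label label;         // single shadow label shared by all eight bits
};

// Mirrors the branchy store in __bit_reference::operator=(bool): the byte
// is read, one bit is changed, and the byte is written back. x only steers
// the branch and never flows into the stored value as data, so the byte
// keeps whatever label it had before the store.
inline void set_bit(TrackedByte& b, unsigned bit, bool x) {
    assert(bit < 8);
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << bit);
    if (x) b.value |= mask;
    else   b.value &= static_cast<std::uint8_t>(~mask);
    // b.label is intentionally untouched: the stale label survives.
}

// Reads one bit of the packed byte.
inline bool get_bit(const TrackedByte& b, unsigned bit) {
    assert(bit < 8);
    return (b.value >> bit) & 1u;
}

// Label observed when reading any bit of the byte.
inline Label label_of_bit(const TrackedByte& b) { return b.label; }
```

Starting from a byte with arbitrary contents and a stale nonzero label
(standing in for uninitialized shadow memory), set_bit produces the
expected bit value, yet label_of_bit still reports the stale label. That
is the over-tainting described above.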
This change disables the vector<bool> specialization when
-DDATAFLOW_SANITIZER is used, forcing the use of the general vector
implementation. Although this increases the memory cost 8x, that is
acceptable for DFSan because DFSan has memory overhead anyway.
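The 8x figure can be checked with a rough calculation. The sketch below
assumes sizeof(bool) == 1, which is typical but implementation-defined,
and ignores capacity slack and the word-sized storage units a packed
implementation may use. The helper names packed_bytes and unpacked_bytes
are hypothetical.

```cpp
#include <cstddef>

// Bytes needed to hold n flags when eight flags are packed into each byte.
constexpr std::size_t packed_bytes(std::size_t n) { return (n + 7) / 8; }

// Bytes needed when each flag is stored as its own bool object, as in the
// general (unspecialized) vector implementation.
constexpr std::size_t unpacked_bytes(std::size_t n) {
    return n * sizeof(bool);
}

// With 1-byte bools, 1024 flags take 128 bytes packed and 1024 unpacked.
static_assert(packed_bytes(1024) == 128, "eight flags per byte");
static_assert(sizeof(bool) != 1 || unpacked_bytes(1024) == 1024,
              "one byte per flag when sizeof(bool) == 1");
```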