This is the second part of https://reviews.llvm.org/D28445, it extends taint propagation to cases where the tainted region is a sub-region and we can't taint a conjured symbol entirely. This required adding a new map in the GDM that maps tainted parent symbols to tainted sub-regions (in order to avoid a linear scan looking for appropriate symbols in the current TaintMap.) With this change, tainting of structs and unions should work as expected.
Details
Diff Detail
Event Timeline
Thanks again for the awesome stuff! It took years for me to even come up with the idea.
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp | ||
---|---|---|
477 | I'd think about this a bit more and come back. I need to understand how come that constructing a symbol manually is the right thing to do; that doesn't happen very often, but it seems correct here. | |
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
696 | Just see if this pointer is null instead of a separate contains<> check? | |
797 | Just see if this pointer is null instead of a separate contains<> check? | |
802 | This could be made even stronger when there are multiple ways of constructing the same sub-region. For instance, union { int x; char y[4]; } u; u.x = taint(); u.y[0]; // is tainted? To handle such cases, we could try to see if byte offsets are nested, instead of checking isSubRegionOf(). I suggest adding a TODO (and maybe a FIXME-test), because it gets more and more complicated. Especially with symbolic offsets. | |
test/Analysis/taint-generic.c | ||
205 | Are we already supporting the case when we're tainting some elements of an array but not all of them, and this works as expected? Could we add such tests (regardless of whether we already handle them)? |
Respond to Artem's comments.
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp | ||
---|---|---|
477 | Indeed it is odd. The best justification I could come up with: LCVs are meant to be optimizations, their 'purpose' is to expose an SVal that hides SymbolRef values so that we can have a split store. We don't have to copy all of a compound values SymbolRef mappings because LCVs are kept distinct. Hence to set/query/constrain region values you use SVals so that RegionStore can differentiate between LCVs and SymbolRef backed SVals for the two different stores it contains. The taint interface however requires you taint a SymbolRef, not an SVal. If we wanted, instead of doing this logic here, we could change getPointedToSymbol() to return an SVal and update usages of it accordingly since that value is only passed on to ProgramState.isTainted()/ProgramState.addTaint() anyway. Then we could update addTaint/isTainted to perform this logic, hiding it from the checker. This still requires manually constructing a symbol, now it's just performed in the analyzer instead of in a checker. Not sure if that addresses the issue you were considering, but the idea that we need to 'undo' the LCV optimization hiding the SymbolRef to have a value to taint seems somewhat convincing to me. What do you think? | |
test/Analysis/taint-generic.c | ||
205 | It does work in that case. If you taint element X of region Y the current logic will be conservative and only mark element X as tainted, not X-i or X+i. This is also true for element 0, so if a programmer passes &array[0] but reads sizeof(array) bytes it will not correctly mark that. This is also a short coming of the invalidation code so I don't think there's much to do until there's more general support for dealing with region extents. |
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp | ||
---|---|---|
477 | Hmm (!) I suggest adding a new function to the program state, that we'd call addPartialTaint() or something like that, and this function would accept a symbol and a region and would act identically to passing a derived symbol (from this symbol and that region) to addTaint() (but we wouldn't need to actually construct a derived symbol here). Such API would be easier to understand and use than the current approach that forces the user to construct a derived symbol manually in the checker code. Unfortunately, this checker's getLCVSymbol() would become a bit more complicated (having various return types depending on circumstances), but this misfortune seems more random than systematic to me. Since we're having this new kind of partial taint, why don't we expose it in the API. | |
test/Analysis/taint-generic.c | ||
205 | \o/ |
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp | ||
---|---|---|
477 | I'm happy to implement it this way, but figured I'd ask why you prefer this approach first in the interest of keeping the TaintChecker simple! The optimal approach to me seems to be changing getPointedToSymbol() to getPointedToSVal() and having addTaint(SVal) call addPartialTaint() when it's passed an LCV sub-region. That way users of the taint interface like the TaintChecker have a clean way to add & check regardless of whether it's a SymbolRef or an LCV but the partial taint functionality is still exposed and documented for those who might want to use it in new ways. Just curious to understand your rationale. Thanks for the feedback! |
include/clang/StaticAnalyzer/Core/PathSensitive/TaintManager.h | ||
---|---|---|
45 | Nit: =0 is redundant |
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp | ||
---|---|---|
477 | Your idea actually looks good to me! I'd approve going this way. With this change to addTaint(SVal), i suspect it'd need some extra documentation to explain what it does now. | |
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
694 | The SymRegions name is a bit confusing because we often shorten SymbolicRegion to SymRegion (eg. in dumps), which is not what we mean. | |
700 | I wonder if it's worth it to check if a super-region of this region is already tainted, and avoid adding the region in this scenario. I guess in practice it won't happen very often, because this code would most likely be executed just once per taint source. This probably deserves a comment though. |
- Update the logic to move the LCV symbol logic into ProgramState::addTaint(SVal) out of the GenericTaintChecker. This allows us to no longer have to synthesize a new SymbolDerived from a LazyCompoundVal. This also required adding a new addPartialTaint() function.
- Update TaintedSymRegions name to TaintedSubRegions per @NoQ's comment.
- I realized that the new partial taint logic did not respect the idea of TaintTagTypes, so I updated TaintSubRegion to include a TaintTagType and added appropriate logic to add/check them.
This is fantastic, thanks! I really like the shape of how it turned out to work.
Minor stuff and we're landing :)
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
---|---|---|
651 | Whitespace seems a bit off. | |
656–671 | I still feel bad about producing API with very tricky pre-conditions. The LCV may have various forms - some may have empty store with no symbols at all, and others may be full of direct bindings that make the symbol essentially irrelevant. However, because the taint API is designed to be defensive to cases when taint cannot be added, and it sounds like a good thing, i guess we've taken the right approach here :) I suggest commenting this more thoroughly though, something like:
Then inside if (Sym):
| |
698–700 | Speaking of taint tags: right now we didn't add support for multiple taint tags per symbol (because we have only one taint tag to choose from), but addTaint overwrites the tag. I guess we should do the same for now. | |
712 | Can we assert that the returned state is not empty, like in addTaint? |
Minor updates & some clarification on the feedback
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
---|---|---|
656–671 | The pre-conditions for using the API are actually a bit simpler than what's exposed here. I made it explicit to make the logic for tainting LCVs explicit, but the following simpler logic works: if (auto LCV = V.getAs<nonloc::LazyCompoundVal>()) { if (Optional<SVal> binding = getStateManager().StoreMgr->getDefaultBinding(*LCV)) { if (SymbolRef Sym = binding->getAsSymbol()) { return addPartialTaint(Sym, LCV->getRegion(), Kind); } } } This works because addPartialTaint() actually already performs the 'getRegion() == getRegion()->getBaseRegion()` check already and taints the parent symbol if the region is over the base region already. I chose to replicate it here to make the logic more explicit, but now that I've written this down the overhead of duplicating the logic seems unnecessary. Do you agree? | |
698–700 | I believe this is the current behavior. On line 714 I presume ImmutableMap::add overrides a key if it's already present in the map but I couldn't trace down the Tree ADT implementation to confirm this. |
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
---|---|---|
656–671 |
I'm talking about the situation when we add the partial taint to the default-bound symbol but it has no effect because there's a direct binding in the lazy compound value on top of it. The user should ideally understand why it doesn't work that way.
The variant without the check looks easier to read, and it is kind of obvious that full taint is a special case of partial taint, so i'm for removing the check. | |
698–700 |
Yep, it does.
No, you early-return the original state if the full-taint map already contains the info for the whole symbol on line 703. Hmm. In fact, with my suggestion we'd be able to have full taint of one kind and partial taint of another kind. I guess it's all right. | |
807 | I actually have no idea why does this function accumulate things in the Tainted variable, instead of returning :) |
Some stylistic & comment updates.
lib/StaticAnalyzer/Core/ProgramState.cpp | ||
---|---|---|
656–671 |
Ah, so you're talking about the case where the LCV encompasses a value with both direct & default bindings, e.g. int foo[1024]; foo[123] = rand(); taint(foo);? In that case we will miss tainting the rand SymbolConjured. I suppose we could scan the region store for matching entries? In practice I think we're really only interested in tainting default bindings anyway for some unknown network/user input anyway. Anyways, your comment makes more sense now. I've added it. | |
807 | It made a little more stylistic sense before this change, but not so much now. I've updated them to return immediately. |
Nit: =0 is redundant