This is an archive of the discontinued LLVM Phabricator instance.

[Analyzer] Finish taint propagation to derived symbols of tainted regions
ClosedPublic

Authored by vlad.tsyrklevich on Mar 13 2017, 2:12 PM.

Details

Summary

This is the second part of https://reviews.llvm.org/D28445, it extends taint propagation to cases where the tainted region is a sub-region and we can't taint a conjured symbol entirely. This required adding a new map in the GDM that maps tainted parent symbols to tainted sub-regions (in order to avoid a linear scan looking for appropriate symbols in the current TaintMap.) With this change, tainting of structs and unions should work as expected.

Diff Detail

Event Timeline

Fix a stray assert()

Fix a stray assert(), correctly this time..

NoQ edited edge metadata.Mar 24 2017, 7:16 AM

Thanks again for the awesome stuff! It took years for me to even come up with the idea.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
494 ↗(On Diff #91615)

I'd think about this a bit more and come back.

I need to understand how come that constructing a symbol manually is the right thing to do; that doesn't happen very often, but it seems correct here.

lib/StaticAnalyzer/Core/ProgramState.cpp
678

Just see if this pointer is null instead of a separate contains<> check?

746

Just see if this pointer is null instead of a separate contains<> check?

751

This could be made even stronger when there are multiple ways of constructing the same sub-region. For instance,

union {
  int x;
  char y[4];
} u;

u.x = taint();
u.y[0]; // is tainted?

To handle such cases, we could try to see if byte offsets are nested, instead of checking isSubRegionOf().

I suggest adding a TODO (and maybe a FIXME-test), because it gets more and more complicated. Especially with symbolic offsets.

test/Analysis/taint-generic.c
210 ↗(On Diff #91615)

Are we already supporting the case when we're tainting some elements of an array but not all of them, and this works as expected? Could we add such tests (regardless of whether we already handle them)?

vlad.tsyrklevich marked 2 inline comments as done.

Respond to Artem's comments.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
494 ↗(On Diff #91615)

Indeed it is odd. The best justification I could come up with: LCVs are meant to be optimizations, their 'purpose' is to expose an SVal that hides SymbolRef values so that we can have a split store. We don't have to copy all of a compound values SymbolRef mappings because LCVs are kept distinct. Hence to set/query/constrain region values you use SVals so that RegionStore can differentiate between LCVs and SymbolRef backed SVals for the two different stores it contains.

The taint interface however requires you taint a SymbolRef, not an SVal. If we wanted, instead of doing this logic here, we could change getPointedToSymbol() to return an SVal and update usages of it accordingly since that value is only passed on to ProgramState.isTainted()/ProgramState.addTaint() anyway. Then we could update addTaint/isTainted to perform this logic, hiding it from the checker.

This still requires manually constructing a symbol, now it's just performed in the analyzer instead of in a checker. Not sure if that addresses the issue you were considering, but the idea that we need to 'undo' the LCV optimization hiding the SymbolRef to have a value to taint seems somewhat convincing to me. What do you think?

test/Analysis/taint-generic.c
210 ↗(On Diff #91615)

It does work in that case. If you taint element X of region Y the current logic will be conservative and only mark element X as tainted, not X-i or X+i. This is also true for element 0, so if a programmer passes &array[0] but reads sizeof(array) bytes it will not correctly mark that. This is also a short coming of the invalidation code so I don't think there's much to do until there's more general support for dealing with region extents.

NoQ added inline comments.Apr 5 2017, 5:30 AM
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
494 ↗(On Diff #91615)

Hmm (!) I suggest adding a new function to the program state, that we'd call addPartialTaint() or something like that, and this function would accept a symbol and a region and would act identically to passing a derived symbol (from this symbol and that region) to addTaint() (but we wouldn't need to actually construct a derived symbol here).

Such API would be easier to understand and use than the current approach that forces the user to construct a derived symbol manually in the checker code. Unfortunately, this checker's getLCVSymbol() would become a bit more complicated (having various return types depending on circumstances), but this misfortune seems more random than systematic to me.

Since we're having this new kind of partial taint, why don't we expose it in the API.

test/Analysis/taint-generic.c
210 ↗(On Diff #91615)

\o/

vlad.tsyrklevich marked 2 inline comments as done.Apr 8 2017, 3:09 PM
vlad.tsyrklevich added inline comments.
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
494 ↗(On Diff #91615)

I'm happy to implement it this way, but figured I'd ask why you prefer this approach first in the interest of keeping the TaintChecker simple! The optimal approach to me seems to be changing getPointedToSymbol() to getPointedToSVal() and having addTaint(SVal) call addPartialTaint() when it's passed an LCV sub-region. That way users of the taint interface like the TaintChecker have a clean way to add & check regardless of whether it's a SymbolRef or an LCV but the partial taint functionality is still exposed and documented for those who might want to use it in new ways.

Just curious to understand your rationale. Thanks for the feedback!

danielmarjamaki added inline comments.
include/clang/StaticAnalyzer/Core/PathSensitive/TaintManager.h
45 ↗(On Diff #93091)

Nit: =0 is redundant

NoQ added inline comments.Apr 18 2017, 6:55 AM
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
494 ↗(On Diff #91615)

Your idea actually looks good to me! I'd approve going this way.

With this change to addTaint(SVal), i suspect it'd need some extra documentation to explain what it does now.

lib/StaticAnalyzer/Core/ProgramState.cpp
676

The SymRegions name is a bit confusing because we often shorten SymbolicRegion to SymRegion (eg. in dumps), which is not what we mean.

682

I wonder if it's worth it to check if a super-region of this region is already tainted, and avoid adding the region in this scenario.

I guess in practice it won't happen very often, because this code would most likely be executed just once per taint source. This probably deserves a comment though.

vlad.tsyrklevich marked 3 inline comments as done.
  • Update the logic to move the LCV symbol logic into ProgramState::addTaint(SVal) out of the GenericTaintChecker. This allows us to no longer have to synthesize a new SymbolDerived from a LazyCompoundVal. This also required adding a new addPartialTaint() function.
  • Update TaintedSymRegions name to TaintedSubRegions per @NoQ's comment.
  • I realized that the new partial taint logic did not respect the idea of TaintTagTypes, so I updated TaintSubRegion to include a TaintTagType and added appropriate logic to add/check them.
NoQ added a comment.May 10 2017, 2:57 AM

This is fantastic, thanks! I really like the shape of how it turned out to work.

Minor stuff and we're landing :)

lib/StaticAnalyzer/Core/ProgramState.cpp
651

Whitespace seems a bit off.

652–653

I still feel bad about producing API with very tricky pre-conditions. The LCV may have various forms - some may have empty store with no symbols at all, and others may be full of direct bindings that make the symbol essentially irrelevant. However, because the taint API is designed to be defensive to cases when taint cannot be added, and it sounds like a good thing, i guess we've taken the right approach here :)

I suggest commenting this more thoroughly though, something like:

If the SVal represents a structure, try to mass-taint all values within the structure. For now it only works efficiently on lazy compound values that were conjured during a conservative evaluation of a function - either as return values of functions that return structures or arrays by value, or as values of structures or arrays passed into the function by reference, directly or through pointer aliasing. Such lazy compound values are characterized by having exactly one binding in their captured store within their parent region, which is a conjured symbol default-bound to the base region of the parent region.

Then inside if (Sym):

If the parent region is a base region, we add taint to the whole conjured symbol.

Otherwise, when the value represents a record-typed field within the conjured structure, so we add partial taint for that symbol to that field.

691–693

Speaking of taint tags: right now we didn't add support for multiple taint tags per symbol (because we have only one taint tag to choose from), but addTaint overwrites the tag. I guess we should do the same for now.

705

Can we assert that the returned state is not empty, like in addTaint?

vlad.tsyrklevich marked 2 inline comments as done.

Minor updates & some clarification on the feedback

lib/StaticAnalyzer/Core/ProgramState.cpp
652–653

The pre-conditions for using the API are actually a bit simpler than what's exposed here. I made it explicit to make the logic for tainting LCVs explicit, but the following simpler logic works:

if (auto LCV = V.getAs<nonloc::LazyCompoundVal>()) {
  if (Optional<SVal> binding = getStateManager().StoreMgr->getDefaultBinding(*LCV)) {
    if (SymbolRef Sym = binding->getAsSymbol()) {
      return addPartialTaint(Sym, LCV->getRegion(), Kind);
    }
  }
}

This works because addPartialTaint() actually already performs the 'getRegion() == getRegion()->getBaseRegion()` check already and taints the parent symbol if the region is over the base region already. I chose to replicate it here to make the logic more explicit, but now that I've written this down the overhead of duplicating the logic seems unnecessary. Do you agree?

691–693

I believe this is the current behavior. On line 714 I presume ImmutableMap::add overrides a key if it's already present in the map but I couldn't trace down the Tree ADT implementation to confirm this.

NoQ added inline comments.May 12 2017, 2:22 AM
lib/StaticAnalyzer/Core/ProgramState.cpp
652–653

The pre-conditions for using the API are actually a bit simpler than what's exposed here.

I'm talking about the situation when we add the partial taint to the default-bound symbol but it has no effect because there's a direct binding in the lazy compound value on top of it. The user should ideally understand why it doesn't work that way.

I chose to replicate it here to make the logic more explicit, but now that I've written this down the overhead of duplicating the logic seems unnecessary. Do you agree?

The variant without the check looks easier to read, and it is kind of obvious that full taint is a special case of partial taint, so i'm for removing the check.

691–693

I presume ImmutableMap::add overrides a key if it's already present in the map

Yep, it does.

I believe this is the current behavior.

No, you early-return the original state if the full-taint map already contains the info for the whole symbol on line 703.

Hmm. In fact, with my suggestion we'd be able to have full taint of one kind and partial taint of another kind. I guess it's all right.

756

I actually have no idea why does this function accumulate things in the Tainted variable, instead of returning :)

Some stylistic & comment updates.

lib/StaticAnalyzer/Core/ProgramState.cpp
652–653

when we add the partial taint to the default-bound symbol but it has no effect because there's a direct binding in the lazy compound value on top of it

Ah, so you're talking about the case where the LCV encompasses a value with both direct & default bindings, e.g. int foo[1024]; foo[123] = rand(); taint(foo);? In that case we will miss tainting the rand SymbolConjured. I suppose we could scan the region store for matching entries? In practice I think we're really only interested in tainting default bindings anyway for some unknown network/user input anyway.

Anyways, your comment makes more sense now. I've added it.

756

It made a little more stylistic sense before this change, but not so much now. I've updated them to return immediately.

NoQ accepted this revision.May 29 2017, 7:46 AM

I'll land this. Thanks again for working on all that stuff!

This revision is now accepted and ready to land.May 29 2017, 7:46 AM
NoQ closed this revision.May 29 2017, 8:46 AM

Uhm, messed up the phabricator link in rL304162, which should have been pointing here but points to D28445 instead.

NoQ added a comment.May 29 2017, 2:46 PM

Yep, fixed indeed.