This is an archive of the discontinued LLVM Phabricator instance.

[Analyzer] Extend taint propagation and checking
ClosedPublic

Authored by vlad.tsyrklevich on Jan 7 2017, 12:53 AM.

Details

Summary

While extending the GenericTaintChecker for my purposes I'm hitting a case where taint is not propagated where I believe it should be. Currently, the following example will propagate taint correctly:

char buf[16];
taint_source(buf);
taint_sink(buf);

However, the following fails:

char buf[16];
taint_source(&buf);
taint_sink(buf);

In the first example, buf has it's symbol correctly extracted (via GenericTaintChecker::getPointedToSymbol()) as a SymbolDerived{conj_$N{char}, element{buf,0 S64b,char}}, it's marked as tainted and then the taint check correctly finds it using ProgramState::isTainted().

In the second example, the SVal obtained in GenericTaintChecker::getPointedToSymbol() is a LazyCompoundVal so SVal::getAsSymbol() returns a NULL SymbolRef, meaning the symbol is not tainted.

This change extends GenericTaintChecker to obtain a SymbolRegionValue from the LCV MemRegion in getPointedToSymbol(), and then extends ProgramState::isTainted() to correctly match a SymbolRegionValue{buf} against a SymbolDerived{conj_$N{char}, element{buf,0 S64b,char}}.

I'm not familiar enough with analyzer internals to be confident in whether this is the right approach to fix this bug, so feedback is welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

vlad.tsyrklevich retitled this revision from to [Analyzer] Extend taint propagation and checking.
vlad.tsyrklevich updated this object.
zaks.anna added a subscriber: dcoughlin.
zaks.anna edited edge metadata.Jan 9 2017, 12:27 PM

Thanks for working on this!

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
443 ↗(On Diff #83521)

This might create a new symbol. Is this what we want?

lib/StaticAnalyzer/Core/ProgramState.cpp
694 ↗(On Diff #83521)

This might create a new symbol as well. Is this what we want here?

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
443 ↗(On Diff #83521)

I'm not sure how to turn an LCV into a proper symbol, so without creating new symbols the best approach I can see is changing getPointedToSymbol() to getPointedToSval() and also update addTaint() and isTainted() to accept SVals. Then you could have separate TaintMaps that include both symbols and regions and check both for taintedness. Does that sound like the correct approach to you?

NoQ added a subscriber: NoQ.Jan 10 2017, 9:14 AM

Thanks for working on the taint! I really wish the taint analysis in the analyzer to flourish, and the part you've digged into is particularly sensitive, so i'd dump some thoughts here, hopefully helpful.

What works as it should:

In the second example, the SVal obtained in GenericTaintChecker::getPointedToSymbol() is a LazyCompoundVal so SVal::getAsSymbol() returns a NULL SymbolRef, meaning the symbol is not tainted.

Only symbols currently can, and in my opinion ever should, directly carry taint.

We have symbols, regions, concrete values, and compound values.

A symbol $s denotes a particular but unknown value from a certain set S (say, of all integers or all pointers); we are not generally sure if it can denote any value from S, or only from a smaller subset S' within S (for example, the analyzer may be unaware that $s is always positive, because the small part of the code it's looking into doesn't give any hints). Analyzer also gathers constraints C when he picks execution paths - for instance, on a branch condition "if ($s < 5)", if we pick the true branch, C gets cropped into [-INT_MIN, 4], and the possible values of $s stay within S' intersected with C, while the analyzer thinks that possible values of $s are within S intersected with C.

A tainted symbol merely corresponds to the special case when we know that S' = S, which we normally don't - that is, the analyzer knows that for some reason the symbol's value is so much chaotic that it may definitely take any value from S in run-time. The analyzer can know it by seeing that the symbol comes from an "untrusted source" such as from "outside".

We do not really talk about tainted regions - instead, we talk about tainted pointer values, which are symbols. In this sense, in code

char buffer[100];

the region of variable buffer cannot be tainted. No matter what we do, the buffer region itself comes from a perfectly trusted source - it's always in the same well-known segment of memory (stack or in global memory). In practice this means that the attacker cannot affect or forge the positioning of this segment in any way. On the other hand, if $i is a tainted index symbol, then region buffer[$i] of the $i'th element would be "said to be tainted" - in the sense, its pointer is known to possibly take any value, not necessarily within range of the buffer. Similarly, if a pointer-symbol is tainted, the segment of memory (region) it points to (called SymbolicRegion) is "said to be tainted". So whenever we're talking about a "tainted region", it's always a shorthand for talking about a certain tainted symbol.

Concrete values shouldn't be tainted for the same reason - they are already "known", their S would consist of one element, so there's little interest in saying that they can take any value from this set - they're bound to take that only value.

Now, we have (possibly lazy) compound values. The whole point of compound values is that they consist of multiple other values, and therefore there's usually little sense in even considering them as a whole. Some sub-values of them may be null, over-constrained, uninitialized, irrelevant (eg. padding), or tainted, while other sub-values may be completely different. In your example, you obtain a compound value of an array, which consists of all values of all elements of the array. There's no point in asking if the compound value itself is tainted; instead, you might be interested in particular elements. For example, buffer[1] may be tainted, while buffer[2] was overwritten with '\0' and is no longer tainted. If you had a structure, you may ask if a certain field is tainted, or if all fields are tainted.

What works but not as it should:

In the first example, buf has it's symbol correctly extracted (via GenericTaintChecker::getPointedToSymbol()) as a SymbolDerived{conj_$N{char}, element{buf,0 S64b,char}}.

Nope, unfortunately, in fact this is not correctly extracted either. This symbol denotes the value of the first element of the buffer. The taint checker marks the first element as tainted, and then later reads that the first element is tainted, and throws the warning. However, if we pass, say, buffer+1 to execl, it breaks.

The root cause of the problem is how the analyzer denotes casted/typed pointers: the SVal for a pointer of type T* is said to be the element region with index 0 of the region behind the pointer. You can read this as "The pointer points to the first T on this segment of memory". In this case, we try to say that the value behind the pointer is tainted, and we forget that the value behind the pointer is not only the first element of the array, but the whole array, even though the pointer points to the first element only.

How this should work: we should instead put the taint over the conjured symbol, on which the derived symbol you mention is based. The conjured symbol denotes the unknown noise that wipes the whole array after reading tainted data into it, which in other words means that it denotes the tainted data itself. Derived symbols denote values for sections of this noise that correspond to sub-regions. You have already seen from the code that derived symbols inherit taint from parent symbols (on which they are "based", or from which they are "derived"), so if we taint the conjured symbol, then all the derived symbols for all elements will be tainted automagically; however, elements overwritten later will not carry taint.

Then when you try to find out if the value passed to execl is tainted, some imagination is required to conduct a proper check. You definitely shouldn't try to inspect the same conjured symbol - instead, this time you should try to pay attention to the elements. Probably the first element's being tainted is already a good time to panic. Probably you could improve upon this by scanning until a null character and see how much taint is there. So the taint-discovering code currently works more or less correctly, but it requires a different version of getPointedToSymbol than the taint-origination code discussed above.

What doesn't work and you're trying to fix:

Now you observe that by changing the array-to-pointer cast syntax slightly, you make the analyzer core drop the element region (which, as i mentioned, only represents pointer casts) from your lvalue - and receive the rvalue of the whole array (the lazy compound value of all elements), instead of the rvalue of the first element (the derived symbol).

How it should work? Same as above: when adding the initial taint, we should instead put the taint over the conjured symbol. It is the pointed-to symbol you're looking for. Similarly, when detecting tainted values, we should unpack the lazy value and see if its elements or fields in which we are interested are tainted.

Regarding the origin regions:

Origin regions for SymbolRegionValue, and even more so for`SymbolDerived`, and even more so for LazyCompoundVal, are immaterial for most purposes. Usually the region isn't at fault that a certain tainted value was written into it.

Well, in case of SymbolRegionValue, he is actually at fault, because if the origin region is a pointer that comes from "outside", then the values that came with it may also have been forged; after all, SymbolRegionValue denotes "what have been there in this region before we noticed".

However, the value of SymbolDerived is known to be a "sub-section" of the parent symbol, and no matter how hard you try to forge the parent region, you can't change that fact. So unless the parent symbol is tainted, we shouldn't call the derived symbol tainted.

You may say that if the parent region's address can be an arbitrary forged pointer, then deriving the symbol may cause buffer overflow and therefore denote a forged data. Unfortunately, our implemenetation of the Store allows such examples, but ideally it shouldn't construct a SymbolDerived for such cases.

Finally, LazyCompoundVal's origin (parent) region is also not at fault. The value only captures the contents of this region. All we know is that at some prior point of time contents of this region were as represented by the lazy compound value. But because this moment occurs during analysis, the value has little connection with the region; this region may have been overwritten many times, completely or partially, with values of completely different nature, both before and after taking the snapshot we call a lazy compound value.

What to do:

I'd suggest to consider moving taint to the invalidation-related conjured symbols. It wouldn't be easy because API of the Store doesn't allow obtaining this symbol easily, so it needs some work, which would be useful for other purposes as well (this is brought up often in C++ discussions, where we also need to work with complete structures a lot, eg. iterators).

vlad.tsyrklevich edited edge metadata.

Artem, thank you for the very detailed reply! Between this, and hitting the right search terms to find your clang analyzer guide my understanding of the symbol abstractions the analyzer exposes has improved significantly.

You state that it's not easy to derive the conjured symbols from the Store; however, it didn't seem to be too difficult to do by recursively finding the bindings for the constituent FieldRegions (if the LCV is backing a struct/union) or the first ElementRegion (if the LCV is backing an array) until you reach an element/field initialized with the conjured symbol. Does the new patch look correct to you? Your comment about the difficulty has me unsure whether I've fully grasped the scope of the problem.

One wrinkle you'll notice is in the patch to taint-tester.c, one FIXME for missing taint has been fixed, but now one variable that should not be tainted is! This seems to be because of overeager invalidation, not strictly the fault of the patch but it is exposed by it.

vlad.tsyrklevich marked 3 inline comments as done.Jan 26 2017, 12:25 AM

Hello @NoQ, I just wanted to ping you on the updated change since I'm not sure if you saw it.

NoQ added a subscriber: a.sidorin.Feb 1 2017, 1:01 AM

Hello! Sorry, i'm very distracted recently.

I think your new approach should work, however it would be much easier and more straightforward to just ask the store manager to provide the default-bound symbol for a region directly, instead of trying to derive from it manually and then underive it back. In inline comments i also show that sometimes we're not really that much interested in the particular derived symbols.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
442 ↗(On Diff #84517)

We need to be careful in the case when we don't have the definition in the current translation unit. In this case we may still have derived symbols by casting the pointer into some blindly guessed type, which may be primitive or having well-defined primitive fields.

By the way, in D26837 i'm suspecting that there are other errors of this kind in the checker, eg. when a function returns a void pointer, we put taint on symbols of type "void", which is weird.

Adding Alexey who may recall something on this topic.

455 ↗(On Diff #84517)

The "safe" way to do this is to use one of the ASTContext's get***ArrayType() methods.

lib/StaticAnalyzer/Core/ProgramState.cpp
667 ↗(On Diff #84517)

While this is quite vague, i can imagine us wanting to taint only a particular field of a symbolic structure. Considering that taint also automatically propagates in the opposite direction (from parent symbol to derived symbol), i'd rather preserve some freedom by moving this code to the taint-checker (and other places where we may want to generate taint; for now, it's getLCVSymbol()), forcing them to manually lookup the parent symbol and put taint there when they want to.

test/Analysis/taint-tester.c
53 ↗(On Diff #84517)

I don't think you'd be able to pass this test by tweaking the taint mechanisms alone. The original FIXME here appears because the second scanf() destroys the first scanf()'s binding, together with the default-bound symbol. There's no way you preserve taint on y without adding taint on z, unless you model scanf() to update only specific store bindings.

In any case, this makes a good example for my comment in addTaint(). Because false positives are worse than false negatives, we shouldn't add taint to the default-bound symbol here.

vlad.tsyrklevich marked 4 inline comments as done.

As Artem mentioned, I reworked getLCVSymbol() to get the default bindings directly from the StoreManager which required adding a new method. I also reverted the tainting logic to add taint more conservatively. I have another change that will fix the new FIXME test that is added, but it might involved some refactoring and discussion so I've left it out to make this change as simple as possible.

Hi Artem,
Thank you for adding me, I missed this patch. I have few comments below. If you (and Vlad) can wait for two or three days, I will re-check the place I'm worrying about and post the results.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
442 ↗(On Diff #84517)

I will check the correctness of this code sample because I have some doubts about it.
The problem of casts is hard because of our approach to put taints on 0th elements. We lose casts and may get some strange void symbols. However, I guess this patch isn't related to this problem directly.

lib/StaticAnalyzer/Core/RegionStore.cpp
502 ↗(On Diff #87237)

We get the LazyCompoundVal for some region but return the symbol for its base. It means that at least method name is very confusing.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
442 ↗(On Diff #84517)

Not sure which form of correctness you're interested in here but I'll bring up one issue I'm aware of: currently this will create a new SymbolDerived for an LCV sub-region, but it won't be correctly matched against in isTainted() because subsequent references to the same region will have a different SymbolDerived. This is the FIXME I mentioned below in taint-generic.c I have some idea on how to fix this but I think it will probably require more back and forth, hence why I didn't include it in this change. As it stands now, the sub-region tainting could be removed without changing the functionality of the current patch.

lib/StaticAnalyzer/Core/RegionStore.cpp
502 ↗(On Diff #87237)

I believe that default bindings are only on base regions, so if you pass a reference to outer_struct.inner_struct the default binding for that LCV will be over outer_struct. I'm basing this on other references to LCVs in Core/RegionStore.cpp but I could be wrong. Either way, I'd be happy to change the interface to have the caller pass the correct MemRegion here.

Sorry for the delay, I have commented inline.
For me, this patch looks like a strict improvement. I think, after some clean up this can be accepted.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
442 ↗(On Diff #84517)

Checking a default binding symbol here works because we're in PostStmt of the call that creates this default binding. I think this machinery deserves a comment, it was not evident for me initially.
However, I don't like to create a SymbolDerived manually. The first idea is to just use getSVal(MemRegion*) which will return a SymbolDerived if it is. The second is to create... SymbolMetadata that will carry a taint (if the value is not a symbol, for example). Not sure if second idea is sane enough, I need some comments on it. Artem, Anna?
Also, instead of referring to the base region, we may need to walk parents one-by-one.

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
442 ↗(On Diff #84517)

I'd rather just get rid of the tainted SymbolDerived and re-introduce it for discussion in the follow-up change as it's currently not functional anyway and there is some room for debate on the correct way to make tainting of sub-regions of LCVs work correctly.

As far as walking parents one-by-one, my current understanding of the RegionStoreManager led me to believe that that would be unnecessary. Currently, all bindings are made as offsets from a base region, reference the implementation of RegionBindingsRef::addBinding. Is there another reason to walk the parents that I'm missing?

lib/StaticAnalyzer/Core/RegionStore.cpp
502 ↗(On Diff #87237)

One change we could introduce to make getDefaultBinding() more explicit is to have the caller pass LCV.getRegion()->getBaseRegion() instead of just passing the LCV. However, this has the disadvantage of hardcoding an implementation detail of RegionStoreManager into users of the StoreManager interface. I think the correct way forward here might just be to add a comment that mentions that currently RegionStoreManagers bindings are only over base regions. What do you think?

Remove an unnecessary call to getBaseRegion(), remove the logic to create a new SymbolDerived as it's currently unused, and add a comment to reflect that getLCVSymbol() is called in a PostStmt and a default binding should always be present.

lib/StaticAnalyzer/Core/RegionStore.cpp
502 ↗(On Diff #87237)

I spoke too soon, after digging a little deeper I realized that lookups using RegionBindingsRef ::getDefaultBinding converts to a base region check anyway in BindingKey::Make. Therefore it'd be better to just completely hide that implementation detail to the RegionBindings than hard coding it here unnecessarily. The updated revision corrects this.

NoQ added a comment.Feb 20 2017, 6:50 AM

Yeah, i think this is now as easy as i expected it to be :)

Still, the new API is in need of better documentation, because the notion of default binding is already confusing, and the new use-case we now have for this API is even more confusing. I don't instantly see a good way to justify this hack, but some day we'd need to either do that or refactor further. The notion of default binding needs to become more "material".

A more expanded doxygen comment should probably be a great start.

include/clang/StaticAnalyzer/Core/PathSensitive/Store.h
66 ↗(On Diff #88493)

I think there are two different use cases here:

  1. getDefaultBinding(Store, Region) would retrieve default binding in a given store for the region's base region,
  2. getDefaultBinding(LazyCompoundVal) would retrieve default binding for the lazy compound value's base region from the lazy compound value's store - a shorthand for getDefaultBinding(lcv.getStore(), lcv.getRegion()).

Otherwise we're passing two different stores into the function (one directly, another as part of the lazy compound value), and it's confusing which one is actually used.

lib/StaticAnalyzer/Core/RegionStore.cpp
498 ↗(On Diff #88493)

LazyCompoundVal always has a region.

505 ↗(On Diff #88493)

Because UnknownVal is an interesting default binding, quite different from lack of default binding, i'd rather return Optional<SVal> from that function (and None here).

@NoQ I've tried to address all the issues you mentioned in this change. Let me know if the expanded documentation doesn't address what you were looking for.

vlad.tsyrklevich marked 2 inline comments as done.Feb 22 2017, 9:58 PM
vlad.tsyrklevich marked an inline comment as done.Feb 23 2017, 1:32 AM
NoQ accepted this revision.Mar 7 2017, 7:52 AM

I believe this should land. Thank you very much for getting this far to get this fixed.

My take on the documentation:

Return the default value bound to a region in a given store. The default binding is the value of sub-regions that were not initialized separately from their base region. For example, if the structure is zero-initialized upon construction, this method retrieves the concrete zero value, even if some or all fields were later overwritten manually. Default binding may be an unknown, undefined, concrete, or symbolic value.
\param[in] store The store in which to make the lookup.
\param[in] R The region to find the default binding for.
Return the default value bound to a LazyCompoundVal. The default binding is used to represent the value of any fields or elements within the structure represented by the LazyCompoundVal which were not initialized explicitly separately from the whole structure. Default binding may be an unknown, undefined, concrete, or symbolic value.
\param[in] lcv The lazy compound value.
\return The default value bound to the LazyCompoundVal \c lcv, if a default binding exists.
lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
455 ↗(On Diff #89467)

I'm not sure if i mentioned this before, but for this case we could store taint information in the program state as a map T from symbols to sets of regions, so that a SymbolDerived-class symbol with parent symbol S and parent region R is auto-tainted when R is a sub-region of at least one region R' in T(S).

That is, if we need to taint some fields in a structure with default symbol S, we add the relevant field regions to T(S), and later lookup if the derived symbol's parent region is within one of the "tainted-regions-for-that-symbol".

That's a crazy plan, but i believe it's also quite expressive, using the SVal hierarchy to the fullest. So it might be the way to go.

This revision is now accepted and ready to land.Mar 7 2017, 7:52 AM

Updated with the documentation update from @NoQ

lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
455 ↗(On Diff #89467)

That's exactly what I was considering, I can't imagine another clean way to keep track of that information otherwise (short of a linear scan of the taint data.)

@vlad.tsyrklevich,

Do you have commit access or should we commit on your behalf?

@vlad.tsyrklevich,

Do you have commit access or should we commit on your behalf?

You should commit on my behalf.

This revision was automatically updated to reflect the committed changes.
NoQ added a comment.May 29 2017, 8:46 AM

Uhm, messed up the phabricator link in rL304162, which should have been pointing to D30909 but points here instead.