Page MenuHomePhabricator

An analysis to find external function pointers and trace their data flow

Authored by tmroeder on Feb 24 2014, 4:20 PM.



This patch provides a new analysis: ExternalFunctionAnalysis (EFA). EFA is useful for my Control-Flow Integrity pass (I'm waiting to submit the CFI patch until EFA is dealt with), since it can inform CFI about potential false positives due to known external function pointers.

The end result of EFA is an analysis that can answer two questions:

  • is this Instruction maybe an indirect call that is calling a function that is not in the current Module?
  • does this Function maybe contain indirect calls that target functions that are not in the current Module?

These questions are useful when the "current Module" is the Module used during Link-Time Optimization (LTO), since LTO gathers as much of the code as possible into a single Module.

EFA looks for function pointers returned by functions external to the module it is analyzing, and it traces the dataflow of these incoming external function pointers to find places where they are stored or called in indirect function calls.

EFA also supports an attribute annotation: "efa-maybe-external". This can be applied to variables, pointers, or functions (using llvm.var.annotation, llvm.ptr.annotation, and, respectively). EFA traces the dataflow from variables and pointers annotated with efa-maybe-external and finds store instructions and indirect calls. It tries to match these store locations with the stores found from the incoming external function pointer analysis. And it warns if external function pointers are being stored into non-annotated locations.

EFA could be generalized to an Analysis Group, since there's a trivial EFA that just returns false for both questions mentioned above: it simply assumes that no instructions target external function pointers, and no functions contain indirect calls to external function pointers.

Diff Detail

Event Timeline

jfb added a comment.Mar 10 2014, 1:51 PM

First pass comments.

Overall LLVM has had a lot of churn w.r.t. C++11, so you may want to rebase and use the new C++11 features since it'll make some of the code easier to read (and it won't need updating).


It would be nice to have a better explanation of the algorithm and its use here, or on the class. You had some good information in your email.


I'll let others comment here, but it seems like LLVM likes using its own datastructures for some use cases in the Programmer's manual. I don't think you need SetVector<Value*> since you AFAICT you don't traverse the set, but you may want DenseMap.




It seems weird to call this and the next two functions "get*" when they don't return anything. Compute instead?


Put in anonymous namespace since it's local to the file.


You should probably hoist the above strings with efa_annotation since it's use in other parts of the file.






This could be a constant if the declaration were an array, and you used array_lengthof.


You should use include/llvm/Target/TargetLibraryInfo.h. You're also missing the nothrow variant.


Why not malloc too? Is this just a performance thing, or does it impact correctness?

jfb added inline comments.Mar 10 2014, 4:20 PM


tmroeder updated this revision to Unknown Object (????).Mar 12 2014, 12:17 PM

This diff makes changes as suggested by the last review. In summary:

  • Fixed the minor points noted in JF's comments
  • C++11-ified loops and such. Please let me know if there's other C++11-ification I should do here
  • Switched to using TargetLibraryInfo instead of hardcoded strings to check for memory-allocation functions.
  • Switched to using LLVM ADT as much as possible; I still need std::list in a couple places for the iterator guarantees it gives
  • The switch to C++11 range loops forced a bunch of const throughout the code. This probably should have been there in the first place.
  • Added explanatory docs on the class.
tmroeder added inline comments.Mar 12 2014, 12:21 PM

Memory allocation functions are a frequent source of false positives, so the code tries to skip them. I see what you mean though, about malloc, and in the next revision, I've switched to using TargetLibraryInfo and adding other memory allocation functions from it.

jfb added inline comments.Mar 12 2014, 4:58 PM

Doesn't the algorithm become imprecise if you don't follow vaarg arguments? AFAICT the call may just fail? Realistically it's probably just printf, but it would be good to document what happens with:

void ugly(size_t n, ...) {
  typedef void (*F)();
  va_list list;
  va_start(list, n);
  for (size_t i = 0; i != n; ++n) {
    F f = va_arg(list, F);
void callee() {
int main() {
  ugly(1, &callee);

Maybe add a test if it's expected to work?


This could just be a cast<> and get the assert for free.


I'm not sure I understand what the bad implication is here, could you improve the warning to explain? Is it a kind of taint tracking thing, where it hard to know which maybe-external the pointer now refers to?


This should be array_lengthof(efa_annotation) - 1? The array contains the null terminator. The same applies a few places below.


The assert is redundant when using cast<>.


You could build a local StringRef(efa_annotation, annotation_len) and then just use operator== here. The same applies below.








3 redundant asserts above.

tmroeder updated this revision to Unknown Object (????).Mar 13 2014, 1:29 PM

This patch removes several asserts and fixes string handling, as suggested. It also removes some extra unnecessary/unused code from the tests.


Yes, the algorithm becomes (more) imprecise by not following var args. The algorithm does not guarantee that it finds all the dataflow for external function pointers, only that it finds some. In fact, it's relatively easy to write code that the analysis misses even without considering var args: just do pointer arithmetic in C. The analysis doesn't trace dataflow through arithmetic operations, so it will lose track of that external pointer.

In this case, not following the var args means that the analysis will miss places that external functions might be called, and they will become false positives when this analysis is used for Control-flow integrity. The CFI failure function will determine in that case what happens to the call. It needn't necessarily fail.

Note that the case in the code above will not cause a call to fail or be a false positive, since callee is defined in the module. The only time this matters is when code is passing an external function pointer (e.g., from dlsym) to a var args call that then calls this function pointer.

So, I think it's probably not worth the extra complexity to make the analysis more precise on an unusual corner case.


Yes, it's a matter of the precision of the analysis: this means that a maybe-external has been stored into a pointer that we're not tracking. It's just a warning, though, and is only a DEBUG warning, since imprecision of this algorithm is almost guaranteed in any complex program. I've updated the warning to try to make this more clear.


It turns out that the initializer length in the IR also contains the null terminator, so both strings end up with the same length this way.


The assert is for getOperand rather than the cast. But maybe getOperand can't return null. It will certainly assert if its argument is out of range.

jfb added a comment.Mar 13 2014, 2:43 PM

Looks good overall, but it would be great to have someone more seasoned with LLVM reviewing this.


Hmm, weird. It still seems wrong to initialized the StringRef below counting the null terminator, instead of excluding it. Isn't the StringRef str below initialized with just the const char * constructor for StringRef? That would then set Length with strlen on Data, and the length would mismatch that of StringRef efa (because StringRef::equal first compare length, and then compares memory).

Anyhow, I may be confused, but I'm surprised when string comparison counts the null terminator too, it seems like it could break in odd ways if one of the string's length starts not containing the null terminator.


Oh right, I was over-zealous on cast<>.


Unrelated to your change, but I was sad that StringRef can't have the following constructor:

template<size_t N> StringRef(const char (&Data)[N]) : StringRef(Data, N) { }

Templates don't participate in overload resolution if there is a best match nontemplate overload.

I tried to request a reviewer earlier today, but I didn't see any email sent about it. Apologies in advance if I somehow missed it and this is effectively a duplicate email.

Eric: this is a prerequisite for the implementation of Control-Flow Integrity (CFI) code we were discussing on llvmdev; this helps find external functions pointers that can cause false positives in CFI. Can you please take a look and tell me what you think? Or can you point me at the right person to review this if it's not you?

I think as far as the pass infrastructure and analysis passes in general that Chandler is going to be the best person to review this.

echristo resigned from this revision.Apr 22 2014, 9:36 AM
echristo removed a reviewer: echristo.

Just a few comments from looking at it once, this isn't a complete review yet.


llvm::isAllocationFn? It's in include/llvm/Analysis/MemoryBuiltins.h and there are other similar functions in case that one isn't quite right.


If I is not null then I->getParent() is guaranteed* to be a BasicBlock.

  • The only way it isn't guaranteed is if your pass is creating instructions and not inserting them into BasicBlocks (or taking instructions out of BBs and not deleting them). At the start and finish of every pass, this property holds.

I think what you're referring to as instructions that appear as subexpressions are actually ConstantExpr's.


I claim this is always false. The only way an operand can be null is if the object whose operands we're looking at is metadata, and metadata never shows up as a user of another Value.


Why not

if (!F->isDeclaration())

without the explicit cast?

tmroeder updated this revision to Diff 11586.Jul 17 2014, 10:08 AM
tmroeder edited edge metadata.

This fixes comments from the recent review

nlewycky added inline comments.

This means "does this Function contain any Instruction matching case #1" right?


There's three possible states, certainly not calling an external function, certainly calling an external function, and who knows.

The previous paragraph:
/ 1. Is this Instruction an indirect call that is maybe calling a function
/ that is not defined or declared in the current Module?
suggests that you answer "false" only when certainly not calling into an external function, and "true" when calling into an external function or you're not sure. This paragraph suggests that you're going to track uses of external function pointers and try to track them, and that's only going to help you solve cases which are certainly calling an external function.

Which way is it? I can't review the dataflow part of this patch without understanding this.


In C++ "struct SourcePair { ... };" will suffice.


No need for the emacs major mode marker on a .cpp file, just on the .h files.


This always returns an Argument*, any reason not to make that clear in the type? Why not use std::advance? I think this becomes:

static const Argument *getParameterForArgument(unsigned ArgNo, const Function *F) {
  if (ArgNo < F->arg_size()) return nullptr;
  return *std::advance(F->arg_begin(), ArgNo);

but I haven't tried to compile that.


The form:

if (const Value *Arg = getParameterForArgument(ArgNo, CalledFun))
  followValue(Arg, FoundValues, SeenValues);

is very common in LLVM.


It's possible to do that with a vector too, you just need to keep an index number (I assume you only add to the back).


Missing blank line.


Don't create a new one here, use the one that was created for the given target and module being compiled:

const TargetLibraryInfo *TLI = P->getAnalysisIfAvailable<TargetLibraryInfo>();

TargetLibraryInfo supports a whole ton of different functions. Do you just care about allocation functions or any recognized library function (strlen, etc.)?


That shouldn't be necessary, opt works on .ll files as well as .bc files.


If you plan to discard %t2, use -disable-output instead?


Please don't use -debug-only=efa here, it makes your test only work on debug builds of llvm and not optimized builds. Instead, you could implement Pass::print() in your pass and then use that via "opt -analyze". This is the same thing that the "opt -analyze -scalar-evolution" tests do with the ScalarEvolution analysis.



amanone added a subscriber: amanone.Nov 6 2014, 6:35 AM
tmroeder abandoned this revision.Nov 16 2018, 4:00 PM

This is an old revision that is no longer needed (the implementation of CFI moved on long ago).