[analyzer] Support for naive cross translational unit analysis
Needs ReviewPublic

Authored by xazax.hun on Mar 7 2017, 4:47 AM.

Details

Summary

This patch adds support for naive cross translational unit analysis.

The aim of this patch is to be minimal to enable incremental development of the feature on the top of the tree. This patch should be an NFC in case CTUDir is not provided by the user.

When CTUDir is provided:

  • In case a function definition is not available it will be looked up from a textual index, whether it was available in another TU.
  • The AST dump of the other TU will be loaded and the function definition will be inlined.

One of the main limitations is that the coverage pattern of the analysis will change when this feature is enabled. For this reason, in the future, it might be better to include some heuristics to prefer examining shorter execution paths to the longer ones. Until then this feature is not recommended to be turned on by users unless they already fixed the important issues with this feature turned off.

We will cover more detailed analysis of this patch soon in our EuroLLVM talk: http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
We will talk about how this works and the memory usage, analysis time, coverage pattern change, limitations of the ASTImporter, how the length of the bug paths changed and a lot more.

Feel free to skip the review after the talk, but we wanted to make the code available in an easy to review format before the conference.

Note that the initial prototype was done by A. Sidorin et al.: http://lists.llvm.org/pipermail/cfe-dev/2015-October/045730.html

Contributions to the measurements and the new version of the code: Peter Szecsi, Zoltan Gera, Daniel Krupp, Kareem Khazem.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
NoQ added inline comments.Apr 10 2017, 9:26 AM
test/Analysis/Inputs/externalFnMap.txt
2

These tests use a pre-made external function map.

Are you willing to add tests for generating function maps?

That'd be useful, in my opinion, because it'd actually tell people how to run the whole thing.

NoQ added a comment.Apr 10 2017, 9:32 AM

One more obvious observation regarding scan-build: because you are already reading a compilation database, the whole tool is essentially usable in combination with the current scan-build-py (which can create compilation databases). So it's already quite usable, but you're forced to do regular analysis before cross-translation-unit analysis. So all we need is an extra mode in scan-build-py that does the interception and leaves the rest of the work to us, either by piping commands to us directly, or by providing us with a compilation database (if --intercept-first is passed). Having some kind of --intercept-only would solve half of the problems. Of course, ideally the logic that adds the -analyzer* options should also be re-used, but for usability it isn't immediately necessary.

xazax.hun updated this revision to Diff 94709.Apr 10 2017, 11:45 AM
xazax.hun edited the summary of this revision. (Show Details)
  • Fixed some memory leaks.
  • Removed some unrelated style changes.
  • Simplifications to the python scripts.
  • Removed debug prints.
xazax.hun marked 6 inline comments as done.Apr 10 2017, 11:50 AM
xazax.hun added inline comments.
test/Analysis/Inputs/externalFnMap.txt
2

Good idea! We will add a test for that.

tools/clang-cmdline-arch-extractor/ClangCmdlineArchExtractor.cpp
1 ↗(On Diff #93658)

We had the same idea recently. Next version of this patch will do this by parsing the clang cc1 command line (gathered by -###).

72 ↗(On Diff #90833)

Other clang tools return 0.

tools/clang-func-mapping/ClangFnMapGen.cpp
192 ↗(On Diff #93658)

That part is removed from this version of the patch. It wasn't used anywhere.

257 ↗(On Diff #90833)

That is not a problem for C++ programs, but added it.

xazax.hun updated this revision to Diff 94814.Apr 11 2017, 7:08 AM
xazax.hun edited the summary of this revision. (Show Details)
  • Removed a clang tool, replaced with python tool functionality.

I agree that scan-build or scan-build-py integration is an important issue to resolve here. What I envision is that users will just call scan-build and pass -whole-project as an option to it. Everything else will happen automagically:)

Another concern is dumping the ASTs to the disk. There are really 2 concerns here. First one is the high disk usage, which is a blocker for having higher adoption. Second is that I am not sure how complete and reliable AST serialization and deserialization are. Are those components used for something else that is running in production or are we just utilizing -ast-dump, which is used for debugging? I do not quite understand why AST serialization is needed at all. Can we instead recompile the translation units on demand into a separate ASTContext and then ASTImport?

dcoughlin added inline comments.Apr 20 2017, 9:26 AM
include/clang/AST/Decl.h
53 ↗(On Diff #94814)

Is this needed? It seems like a layering violation.

include/clang/AST/Mangle.h
59 ↗(On Diff #94814)

I'm pretty worried about using C++ mangling for C functions. It doesn't ever seem appropriate to do so and it sounds like it is papering over problems with the design.

Some questions:

  • How do you handle when two translation units have different C functions with the same type signatures? Is there a collision? This can arise because of two-level namespacing or when building multiple targets with the same CTU directory.
  • How do you handle when a C function has the same signature as a C++ function. Is there a collision when you mangle the C function?
xazax.hun marked 6 inline comments as done and an inline comment as not done.Apr 21 2017, 8:24 AM

I agree that scan-build or scan-build-py integration is an important issue to resolve here. What I envision is that users will just call scan-build and pass -whole-project as an option to it. Everything else will happen automagically:)

We had a skype meeting with Laszlo. He had no objections adding this to scan-build-py.

Another concern is dumping the ASTs to the disk. There are really 2 concerns here. First one is the high disk usage, which is a blocker for having higher adoption. Second is that I am not sure how complete and reliable AST serialization and deserialization are. Are those components used for something else that is running in production or are we just utilizing -ast-dump, which is used for debugging? I do not quite understand why AST serialization is needed at all. Can we instead recompile the translation units on demand into a separate ASTContext and then ASTImport?

According to our measurements there are some optimization possibilities in the serialized AST format. Even though it might not get us an order of magnitude improvement, it can be still very significant. The main reason, why the AST dumps are so large: duplicated AST parts from the headers. Once modules are supported and utilized, the size could be reduced once again significantly.
The serialization/deserialization of AST nodes should be very reliable. The same mechanisms are used for precompiled headers and modules.

AST serialization is there to avoid the time overhead of reparsing translation units. This can be a big win for translation units that have metaprograms. Technically, I can not see any obstackles in reparsing a translation unit on demand, but we did not measure how much slower would that be.
Having a serialized format could also make it possible to only load fragmets of the AST into the memory instead of the whole translation unit. (This feature is currently not utilized by this patch.)

include/clang/AST/Decl.h
53 ↗(On Diff #94814)

Good catch, this is redundant. I will remove this in the next patch.

include/clang/AST/Mangle.h
59 ↗(On Diff #94814)

I agree that using C++ mangling for C+ is not the nicest solution, and I had to circumvent a problem in the name mangler for C prototypes.

In case a mangled name is found in multiple source files, it will not be imported. This is the way how collisions handled regardless of being C or C++ functions.
The target arch is appended to the mangled name to support the cross compilation scenario. Currently we do not add the full triple, but this could be done.

An alternative solution would be to use USRs instead of mangled names but we did not explore this option in depth yet.

lib/AST/ASTContext.cpp
1481 ↗(On Diff #93658)

What do you mean here?

xazax.hun updated this revision to Diff 96377.Apr 24 2017, 4:39 AM
xazax.hun marked an inline comment as done.
  • Removed more unused code.
  • Remove the usages of posix API.
  • Changes according to some review comments.
  • Add a test to the clang-func-mapping tool.
xazax.hun marked 9 inline comments as done.Apr 24 2017, 4:41 AM
lib/AST/ASTContext.cpp
1495 ↗(On Diff #96377)

As far as I see you can move this ASTFileName declaration down.

1497 ↗(On Diff #96377)

How about replacing FunctionAstUnitMap.find() with FunctionAstUnitMap.count()?

1511 ↗(On Diff #96377)

As far as I see this can be changed to:

if (It == FunctionFileMap.end())
  return nullptr;
const StringRef ASTFileName = It->second;
tools/clang-func-mapping/ClangFnMapGen.cpp
48 ↗(On Diff #96377)

As far I see CurrentFileName is only used in 1 method and should therefore be private.

86 ↗(On Diff #96377)

I believe realpath is a posix function.

whisperity added inline comments.Apr 25 2017, 6:11 AM
lib/AST/ASTContext.cpp
1497 ↗(On Diff #96377)

FnUnitCacheEntry is used in line 1525.

xazax.hun updated this revision to Diff 96727.Apr 26 2017, 6:40 AM
xazax.hun marked 5 inline comments as done.
  • Updates according to review comments
  • Improvements to the python scripts
a.sidorin added inline comments.Apr 26 2017, 10:00 AM
lib/AST/ASTContext.cpp
1481 ↗(On Diff #93658)

This comment was about if (!FD->getType()->getAs<FunctionProtoType>()) return nullptr. While testing, we have found that many C and C-style functions are FunctionNoProtoTypes, so they cannot be mangled, but they still have a body in another TU. There definitely should be a way to fix this.

phosek added a subscriber: phosek.May 8 2017, 11:43 AM
xazax.hun updated this revision to Diff 101695.Jun 7 2017, 5:02 AM
xazax.hun edited the summary of this revision. (Show Details)
  • Migrate to use USR instead of mangled names. Name mangling related changes are reverted.
  • Better error handling in some cases.
  • File paths containing spaces are now handled correctly.
  • Fixes to support scripts.
xazax.hun marked 2 inline comments as done.Jun 7 2017, 5:10 AM

I agree that scan-build or scan-build-py integration is an important issue to resolve here. What I envision is that users will just call scan-build and pass -whole-project as an option to it. Everything else will happen automagically:)

We contacted Laszlo and we have a pull request into scan-build that is under review. He is very helpful and supports the idea of scan-build-py supporting CTU analysis.

I do not quite understand why AST serialization is needed at all. Can we instead recompile the translation units on demand into a separate ASTContext and then ASTImport?

We did a prototype implementation of on-demand reparsing. On the C projects we tested, the runtime is increased by 10-30% compared to dumping the ASTs. Note that, it is relatively fast to parse C, I would expect a much bigger delta in case of C++ projects. Unfortunately, we weren't able to test that setting due to the ASTImporter limitations.

include/clang/AST/Mangle.h
59 ↗(On Diff #94814)

Note that the newest version of this patch does not use name mangling, it uses USRs instead. This turned out to be a perfectly viable alternative and we did not see any behavioral changes on the project we tested after the transition.

Thanks for the reviews so far.
I think we have addressed all major concerns regarding this patch:

-(Anna) Scan-build-py integration of the functionality is nearly finished (see https://github.com/rizsotto/scan-build/issues/83) (--ctu switch performs both analysis phases at once). This I think could go in a different patch, but until we could keep the ctu-build.py and ctu-analyze.py scripts. Do you agree?

-(Devin,NoQ) In the externalFnMap.txt (which contains which function definitions is located where) Unified Symbol Resolution (USR) is used to identify functions instead of mangled names, which seems to work equally well for C and C++

-(Anna) Dumping the ASTs to the disk. We tried a version where, ASTs are not dumped in the 1st phase, but is recreated each time a function definition is needed from an external TU. It works fine, but the analysis time went up by 20-30% on open source C projects. Is it OK to add this functionality in a next patch? Or should we it as an optional feature right now?

Do you see anything else that would prevent this patch to get in?

a.sidorin accepted this revision.Jun 14 2017, 8:52 AM

Hello Daniel & Gabor,
Thank you very much for your work. This patch looks good to me but I think such a change should also be approved by maintainers.

-(Anna) Scan-build-py integration of the functionality is nearly finished (see https://github.com/rizsotto/scan-build/issues/83) (--ctu switch performs both analysis phases at once). This I think could go in a different patch, but until we could keep the ctu-build.py and ctu-analyze.py scripts. Do you agree?

It's important to bring this patch into the LLVM repo so that it becomes part of the clang/llvm project and is used. The whole point of adding CTU integration to scan-build-py is to make sure that there is a single tool that all/most users could use; adding the patch to a fork does not accomplish that goal. Also, I am not a fan of developing on downstream branches and that is against the LLVM Developer policy due to all the reasons described here: http://www.llvm.org/docs/DeveloperPolicy.html#incremental-development. This development style leads to fragmentation of the community and the project. Unfortunately, we often see cases where large patches developed out of tree never make it in as a result of not following this policy and it would be great to avoid this in the future.

This I think could go in a different patch, but until we could keep the ctu-build.py and ctu-analyze.py scripts. Do you agree?

It would be best to just add the scan-build-py support to the tree, especially, since the new scrips are not tested.

-(Anna) Dumping the ASTs to the disk. We tried a version where, ASTs are not dumped in the 1st phase, but is recreated each time a function definition is needed from an external TU. It works fine, but the analysis time went up by 20-30% on open source C projects.

I am curious which optimization strategies you considered. An idea that @NoQ came up with was to serialize only the most used translation units. Another idea is to choose the TUs that a particular file has most dependencies on and only inline functions from those.

What mode would you prefer? Would you pay the 20%-30% in speed but reduce the huge disk consumption? That might be a good option, especially, if you have not exhausted the ideas on how to optimize.

Is it OK to add this functionality in a next patch? Or should we it as an optional feature right now?

This depends on what the plan for going forward is. Specifically, if we do not need the serialization mode, you could remove that from this patch and add the new mode. If you think the serialization mode is essential going forward, we could have the other mode in a separate patch. (It would be useful to split out the serialization mode work into a separate patch so that we could revert it later on if we see that the other mode is better.)

I see some changes to the compiler proper, such as ASTImporter, ASTContext, SourceManager. We should split those changes into a separate patch and ask for review from people working on those components. You can point back to this patch, which would contain the analyzer related changes, and state that the other patch is blocking this work.

Thanks.

It would be best to just add the scan-build-py support to the tree, especially, since the new scrips are not tested.

OK. We will update this patch with the scan-build-py changes and remove the ctu-build.py and ctu-analyze.py scripts.

I am curious which optimization strategies you considered. An idea that @NoQ came up with was to serialize only the most used translation units. Another idea is to choose the TUs that a particular file has most dependencies on and only inline functions from those.

Both of these strategies could work in practice, we did not try them. We implemented the two extremes: serialize all TUs/don't serialize any of the TUs. Both of them could can be useful in practice as is (depending if the user is cpu/memory/disk space bound). I think we could try with the above suggested optimizations as an incremental improvement (and keep this initial patch as simple as possible).

What mode would you prefer? Would you pay the 20%-30% in speed but reduce the huge disk consumption? That might be a good option, especially, if you have not exhausted the ideas on how to optimize.

In this initial version I think we should keep the serializing mode. We just measured that the heap consumption of the non-serializing mode and it seems to be ~50% larger. Probably because the serializing mode only loads those AST fragments from the disk that is imported. But I can imagine that some user still want to use the non-serializing version which is not using extra disk space. So we will add the non-serializing mode in a next patch as an Analyzer option first (which we can turn into default behaviour later on). OK?

I see some changes to the compiler proper, such as ASTImporter, ASTContext, SourceManager. We should split those changes into a separate patch and ask for review from people working on those components. You can point back to this patch, which would contain the analyzer related changes, and state that the other patch is blocking this work.

Allright, we will do that.

So is it OK to proceed like this?

Some of the CTU related analyzer independent parts are being factored out.
The review is ongoing here: https://reviews.llvm.org/D34512

Another small and independent part is under review here: https://reviews.llvm.org/D34506

NoQ added a comment.Jun 22 2017, 7:30 AM

Regarding serializing vs not serializing and now vs later.

  1. I think we eventually need to provide a reasonable default approach presented to the user. This approach shouldn't be hurting the user dramatically in any sense. Because serializing hurts the user's disk space dramatically, and not-serializing may be slower and a bit more memory-intensive but isn't too bad in all senses, out of these two options not-serializing is definitely preferable as a default approach.
  1. Later we should definitely consider the alternative approaches that serialize only some ASTs, with the hope that one of them would turn out to be a better default approach.
  1. From 2. it follows that for now it's better to keep both approaches around - as we believe that the ideal approach may be a combination of the two. Therefore it doesn't really matter in what order they land.

tl;dr: I propose to land serialization-based approach first, then land non-serialization-based approach later and make it default, then consider taking the best of the two and making it a new default.

klimek added a subscriber: klimek.

Richard (added as reviewer) usually owns decisions around clang itself. Writing an email to cfe-dev with the numbers and wait for whether others have concerns would probably also be good.

xazax.hun updated this revision to Diff 105053.Jul 3 2017, 4:58 AM
  • Patch scan-build instead of using custom scripts
  • Rebase patch based on the proposed LibTooling CTU code
tools/scan-build-py/libscanbuild/analyze.py
165

I believe you can write:

for line in open(filename, 'r'):
172

this 'with' seems redundant. I suggest an assignment and then less indentation will be needed below

229

not a big deal but I would use early exits in this function

tools/scan-build-py/libscanbuild/clang.py
177

I am guessing that you can use cmd.find() instead of the loop

The Python code here still uses mangled name in their wording. Does this mean this patch is yet to be updated with the USR management in the parent patch?

tools/scan-build-py/libscanbuild/analyze.py
165

Do we want to rely on the interpreter implementation on when the file is closed.

If

for line in open(filename, 'r'):
   something()

is used, the file handle will be closed based on garbage collection rules. Having this handle disposed after the iteration is true for the stock CPython implementation, but it is still nontheless an implementation specific approach.

Whereas using with will explicitly close the file handle on the spot, no matter what.

172

I don't seem to understand what do you want to assign to what.

danielmarjamaki accepted this revision.Sep 1 2017, 1:07 AM
danielmarjamaki added inline comments.
tools/scan-build-py/libscanbuild/analyze.py
165

ok I did not know that. feel free to ignore my comment.

172

I did not consider the garbage collection. I assumed that out_file would Always be closed when it Went out of scope and then this would require less indentation:

out_file = open(extern_fns_map_file, 'w')
for mangled_name, ast_file in mangled_ast_pairs:
    out_file.write('%s %s\n' % (mangled_name, ast_file))
229

with "not a big deal" I mean; feel free to ignore my comment if you want to have it this way.

This revision is now accepted and ready to land.Sep 1 2017, 1:07 AM

While testing this I stumbled upon a crash with the following test case:

inc.h

#define BASE ((int*)0)
void foo();

main.c:

#include "inc.h"
void moo()
{
    int a = BASE[0];
    foo();
}

other.c

#include "inc.h"
void foo()
{
    int a = BASE[0];
}

Note that I used a custom checker that did not stop on the path like the DerefChecker would here. I did not know how to reproduce it with official checkers, but the issue should be understandable without reproduction.

With the given test a checker may produce two results for the null dereference in moo() and foo(). When analyzing main.c they will both be found and therefore sorted with PathDiagnostic.cpp "compareCrossTUSourceLocs".

If either of the FullSourceLocs is a MacroID, the call SM.getFileEntryForID(XL.getFileID()) will return a null pointer. The null pointer will crash the program when attempting to call ->getName() on it.

My solution was to add the following lines before the .getFileID() calls:

XL = XL.getExpansionLoc();
YL = YL.getExpansionLoc();
lib/StaticAnalyzer/Core/PathDiagnostic.cpp
391

see comment

In a similar case, static inline functions are an issue.

inc.h

void foo();
static inline void wow()
{
    int a = *(int*)0;
}

main.c

#include "inc.h"
void moo()
{
    foo();
    wow();
}

other.c

#include "inc.h"
void foo()
{
    wow();
}

The inline function is inlined into each calling AST as different AST objects. This causes the PathDiagnostics to be distinct, while pointing to the exact same source location.

When the compareCrossTUSourceLocs function tries to compare the FullSourceLocs, it cannot find a difference and FlushDiagnostics will assert on an erroneous compare function.

For my purposes I replaced the return statement of the compareCrossTUSourceLocs function with:

return XL.getFileID() < YL.getFileID();

A more correct fix would create only one unique diagnostic for both cases.

xazax.hun updated this revision to Diff 116502.Sep 25 2017, 2:35 AM
  • Fixed an issue with source locations
  • Updated to latest trunk

If either of the FullSourceLocs is a MacroID, the call SM.getFileEntryForID(XL.getFileID()) will return a null pointer. The null pointer will crash the program when attempting to call ->getName() on it.

Thank you for the report and the detailed analysis!
I did not want to get the expansion location since the original code did not query that either, so for now I just handle the case when I can not query a file entry. This case can occur even when no macros are in the code, for example when a PCH is referring to a nonexistent file.

For my purposes I replaced the return statement of the compareCrossTUSourceLocs function with:

return XL.getFileID() < YL.getFileID();

A more correct fix would create only one unique diagnostic for both cases.

Thank you for the report, I could reproduce this after modifying the null dereference checker. My fear of using file IDs is that I don't know whether they are stable. So in subsequent runs, different paths might be chosen by the analyzer and this could be confusing to the user. I will think about a workaround that both stable and solves this assertion.

I see two possible ways to do the proper fix. One is to check explicitly for this case when the same function appears in multiple translation units. A better approach would be to have the ASTImporter handle this case. I think the proper fix is better addressed in a separate patch.

dkrupp requested changes to this revision.Oct 15 2017, 7:28 AM

Please fix the incompatibility between analyze-build and lib/CrossTU in the format of externalFnMap.txt mappfing file.

tools/scan-build-py/libscanbuild/analyze.py
540

There is a incompatibility between this scan-build (analyze-build actually) implementation and new lib/CrossTU library.

CrossTranslationUnitContext::loadExternalAST(

StringRef LookupName, StringRef CrossTUDir, StringRef IndexName)

expects the externalFnMap.txt to be in
"functionName astFilename" format.
however currently we generate here
"functionName@arch astFilename" lines.

One possible fix could be to
create one externalFnMap.txt indexes per arch
<collect-dir>/ast/x86_64/externalFnMap.txt
<collect-dir>/ast/ppc64/externalFnMap.txt
etc.
and call clang analyze with the architecture specific map directory:
e.g. ctu-dir=<collect-dir>/ast/x86_64

This would then work if the "to-be-analyzed" source-code is cross-compiled into multiple architectures.

Would be useful to add a test-case too to check if the map file and ctu-dir content generated by analyze-build is compatible.

590

Maybe we could use the full target-triple for distinguishing the AST binaries, not only the architecture part. The sys part for example is probably important too and a "win32" AST may not be compatible with a "linux" AST.

This revision now requires changes to proceed.Oct 15 2017, 7:28 AM
xazax.hun updated this revision to Diff 119458.Oct 18 2017, 4:22 AM
xazax.hun marked an inline comment as done.
  • Update the scan-build part to work correctly with the accepted version of libCrossTU
xazax.hun updated this revision to Diff 121700.Mon, Nov 6, 2:15 AM
  • Rebased on ToT