[analyzer] Support for naive cross translational unit analysis
Needs ReviewPublic

Authored by xazax.hun on Mar 7 2017, 4:47 AM.

Details

Summary

This patch adds support for naive cross translational unit analysis.

The aim of this patch is to be minimal to enable incremental development of the feature on the top of the tree. This patch should be an NFC in case CTUDir is not provided by the user.

When CTUDir is provided:

  • In case a function definition is not available it will be looked up from a textual index, whether it was available in another TU.
  • The AST dump of the other TU will be loaded and the function definition will be inlined.

One of the main limitations is that the coverage pattern of the analysis will change when this feature is enabled. For this reason, in the future, it might be better to include some heuristics to prefer examining shorter execution paths to the longer ones. Until then this feature is not recommended to be turned on by users unless they already fixed the important issues with this feature turned off.

We will cover more detailed analysis of this patch soon in our EuroLLVM talk: http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
We will talk about how this works and the memory usage, analysis time, coverage pattern change, limitations of the ASTImporter, how the length of the bug paths changed and a lot more.

Feel free to skip the review after the talk, but we wanted to make the code available in an easy to review format before the conference.

Note that the initial prototype was done by A. Sidorin et al.: http://lists.llvm.org/pipermail/cfe-dev/2015-October/045730.html

Contributions to the measurements and the new version of the code: Peter Szecsi, Zoltan Gera, Daniel Krupp, Kareem Khazem.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
tools/scan-build-py/libscanbuild/analyze.py
129

Could you also specify what func_map_lines is (file? list? of strings?) in the docstring? There's specific Sphinx syntax for that. Otherwise it is hard to follow.

139

Could be improved with defaultdict:

mangled_to_asts = defaultdict(set)
...
   mangled_to_asts[mangled_name].add(ast_file)
145

Firstly, no need to modify the set in order to get the first element, just use next(iter(ast_files)).
Secondly, when exactly do conflicting names happen? Is it a bug? Shouldn't the tool log such cases?

146

Overall, instead of creating a dictionary with multiple elements, and then converting to a list, it's much simper to only add an element to mangled_to_asts when it is not already mapped to something (probably logging a message otherwise), and then just return mangled_to_asts.items()

161

Firstly, is glob.glob actually random, as the docstring is saying?
If yes, can we make it not to be random (e.g. with sorted), as dealing with randomized input makes tracking down bugs so much harder?

169

If write_global_map is a closure, please use the fact that it would capture ctudir from the outer scope, and remove it from the argument list.
That would make it more obvious that it does not change between invocations.

190

Having an analysis tool remove files is scary, what if (maybe not in this revision, but in a future iteration) a bug is introduced, and the tool removes user code instead?
Why not just create a temporary directory with tempfile.mkdtemp, put all temporary files there, and then simply iterate through them?
Then you would be able to get rid of the constant CPU_TEMP_FNMAP_FOLDER entirely, and OS would be responsible for cleanup.

244

Similarly to the comment above, I would prefer if analysis tool would not remove files (and I assume those are not huge ones?)
Can we just use temporary directories?

255

Same as the comment above about removing folders. Also it seems like there should be a way to remove redundancy in if collect / remove tree block repeated twice.

285

Using JSON-serialization-over-environment-variables is very unorthodox, and would give very bizarre error messages if the user would try to customize settings (I assume that is the intent, right?)
I think that separating those options into separate ENV variables (if they _have_ to be customizable) would be much better.

319

Again, is it possible to avoid JSON-over-environment-variables?

557

mangled_name, path = fn_src_txt.split(" ", 1) ?

584

try/except/pass is almost always bad.
When can the error occur? Why are we ignoring it?

590

The above can be written more succinctly as:
ast_command = [opts['clang'], ...] + args + ['-w', ...]

601

Similarly here, funcmap_command can be generated in one line using +

611

Again, why is this error ignored?

632

In which case is this branch hit? Isn't improperly formed input argument indicative of an internal error at this stage?

718

This blank line should not be in this PR.

tools/scan-build-py/libscanbuild/clang.py
168

I might be missing something here, but why is the ability to call --version indicative of CTU support?
At worst, this can lead to obscuring real bugs: imagine if the user has args.clang pointed to broken/non-existent binary, then is_ctu_capable would simply return False (hiding the original error!), which would show a completely misleading error message.

Just checking func_map_cmd seems better, but even in this case we should probably log any errors occurring on -version call (such messages would really aid debugging)

177

Seconded, would prefer this rewritten using separator = cmd.find('-triple')

tools/scan-build-py/libscanbuild/report.py
270

I understand the intent here, but it seems it should be handled at a different level: would it be hard to change Clang to only write the report file at the very end, when no crash should be encountered? Or make parsers not choke on empty fields?

tools/scan-build-py/tests/unit/test_analyze.py
338

Probably more tests are required for almost 400 lines of functional Python code in this PR.
Would it be hard to have a full LIT-style integration test? E.g. have a dummy script emulating Clang with a dummy directory structure, which would show how all pieces are meant to fit together?

This revision now requires changes to proceed.Nov 27 2017, 2:34 PM

Thanks George for the review. I will start working on the code right away. I've tried to answer the simpler cases.

tools/scan-build-py/libscanbuild/analyze.py
45

Yes and no. The 1st, collect part of CTU creates this file by aggregating all data from the build system, while the 2nd part which does the analysis itself only reads it. Multiple analysis can use this file simultaneously without problem. However, multiple collect phases launched on a system does not make sense. In this case, the later one would write over the previous results with the same data.

65

We have an other run_analyzer method but still, it is a good idead, I will make something better up.

104

Yes. The only reason was for the move to make it testable. However, we need more tests as you wrote down below.

121

It definitely needs more comments, so thanks, I will put them here.
There are two separate possibilities here for this code part:

  1. The user does not want CTU at all. In this case, no CTU phases are switched on (collect and analyze), so no CTU code will run. This is why dir has no importance in this case.
  2. CTU capabilities were not even detected, so the help and available arguments about CTU are also missing. In this case we create a dummy config telling that everything is switched off.

The reason for using hasattr was that not having CTU capabilities is not an exceptional condition, rather a matter of configuration. I felt that handling this with an exception would be misleading.

129

Thanks, I'll do it.

145

Nice catch, thanks.
For your second question: Unfortunately, it is not a bug, rather a misfeature which we can't handle yet. There can be tricky build-systems where a single file with different configurations is built multiple times, so the same function signature will be present in multiple link units. The other option is when two separate compile units have an exact same function signature, but they are never linked together, so it is not a problem in the build itself. In this case, the cross compilation unit functionality cannot tell exactly which compilation unit to turn to for the function, because there are multiple valid choices. In this case, we deliberately leave such functions out to avoid potential problems. It deserves a comment I think.

161

"random" here means we don't care. So you are right. glob.glob doc says it is arbitrary but I couldn't find that it is deterministic or not. Probably sorting it will not do harm.

190

Yes, you are right. We are essentially using a temp dir. Because of the size we first had to put it next to the project (not on tmp drive for instance) and for debugging purposes we gave a name to it. Still it can be done with mkdtemp as well.

244

Unlike above, here we do remove non-temporary data intentionally. The user asks here to do the recollection of CTU data for a fresh start. Because there is no "clean" functionality in the analyzer interface itself, this seemed to be the easiest-on-user solution to save him/her an extra effort or a new command.

285

To be honest, I could never make a decision over this. The previous version of scan-build-py used extensively environment variables for everything which ended up in a huge mess and env contamination. There was an effort to reduce this to only a few, well-designated ones. Still there is the need of adding new data to the environment.

584

I think this code is redundant with the if above.

632

An other part of scan-build-py, analyze_cc uses namedtuple to json format to communicate. However, the names are not coming back from json, so this code helps in this. This is the case when someone uses the whole toolset with compiler wrapping. All the environment variable hassle is also happening because of this. So these env vars are not for user modification (as you've suggested earlier).

tools/scan-build-py/libscanbuild/clang.py
168

The original idea was that clang can give information about CTU support itself. However, it never happened because the analyzer is so deep down in the system. So I am open to remove the clang binary check here. However, clang binary is needed anyway, so the whole toolset will still throw an error later on not having a clang binary.

gerazo added a comment.Dec 6 2017, 7:17 AM

The code modifications are coming soon (after doing some extensive testing) for the scan-build part.

tools/scan-build-py/libscanbuild/analyze.py
146

The reason for the previous is that we need to count the occurence number ofdifferent mappings only let those pass through which don't have multiple variations.

190

Finally, I came to the conclusion that mkdtemp would not be better than the current solution. In order to find our created dir by other threads, we need a designated name. Suffixing it by generated name would further complicate things as we need not to allow multiple concurrent runs here. The current solution is more robust from this point of view.

243

I've checked it through. The only place for an early exit now would be before the else. The 1st and 2nd ifs are in fact non-orthogonal.

255

Th previous call for data removal happens because the user asked for a collect run, so we clean data to do a recollection. This second one happens because the user asked for a full recollection and anaylsis run all in one. So here we destroy the temp data for user's convenience. This happens after, not before like previously. The default behavior is to do this when the user uses the tool the easy way (collect and analyze all in one) and we intentionally keep collection data if the user only asks for a collect or an analyze run. So with this, the user can use a collect run's results for multiple analyze runs. This is written in the command line help. I will definitely put comments here to explain.

319

There is an other thing against changing this. Currently the interface here using env variables is used by intercept-build, analyze-build and scan-build tool as well. In order to drop json, we need to change those tools too. It would be a separate patch definitely.

584

Here the folders are created on demand. Because these are created parallel by multiple processes, there is small chance that an other process already created the folder between the isdir check and the makedirs call. This is why the the pass is needed to make it always run correctly. I will add a comment.

590

After several iterations of the code, I find it easier to version control such multiline constructs. If someone changes a data source, it is clear which one (which line) was modified. The succint notation does not allow clean VCS annotations.

tools/scan-build-py/tests/unit/test_analyze.py
338

You are right. The testing infra in scan-build-py is not right anyway (uses nosetests). However, this should be a new patch as you've mentioned earlier.

tools/scan-build-py/libscanbuild/analyze.py
45

However, multiple collect phases launched on a system does not make sense

Why not? What about a system with multiple users, where code is in the shared directory? Or even simpler: I've launched a background process, forgot about it, and then launched it again?

In this case, the later one would write over the previous results with the same data.

That is probably fine, I am more worried about racing, where process B would be reading a partially overriden file (not completely sure whether it is possible)

104

Sure, but that separate PR can also include tests.

121

Right, but instead of doing (1) and (2) can we have a separate (maybe hidden from user) param called e.g. ctu_enabled which would explicitly communicate that?

145

In this case, we deliberately leave such functions out to avoid potential problems

We probably want to log it, don't we?

146

ah, OK!

190

OK

244

OK!

319

OK I didn't know that the JSON interface was used by other tools. In that case, ignore my comment.

590

OK. Though you could still use split addition across multiple lines with \

632

OK so opts['ctu'] is a tuple or a named tuple depending on how this function is entered? BTW could you point me to the analyze_cc entry point?

For the purpose of having more uniform code with less cases to care about, do you think we could just use ordinary tuples instead of constructing a named one, since we have to deconstruct an ordinary tuple in any case?

tools/scan-build-py/libscanbuild/arguments.py
376

BTW can we also explicitly add dest='ctu_dir' here, as otherwise I was initially very confused as to where the variable is set.

tools/scan-build-py/libscanbuild/clang.py
168

so the whole toolset will still throw an error later on not having a clang binary.

Of course, but I think that would be easier to debug, and the error would mean that Clang is not available, not that CTU is not working.

xazax.hun updated this revision to Diff 125986.Dec 7 2017, 9:54 AM
  • @gerazo addressed some review comments in scan-build-py.

Python part looks good to me. I don't know whether @dcoughlin or @NoQ would want to insert additional comments on C++ parts.

gerazo added inline comments.Dec 11 2017, 5:30 AM
tools/scan-build-py/libscanbuild/analyze.py
45

I see your point. In order to create the multiple user scenario you've mentioned, those users need to give the exact same output folders for their jobs. Our original bet was that our users are not doing this as the scan-build tool itself is also cannot be used this way. Still the process left there is something to consider. I will plan some kind of locking mechanism to avoid situations like this.

632

Using a NamedTuple improves readability of the code a lot with less comments. It is unfortunate that serializing it is not solved by Python. I think moving this code to the entry point would make the whole thing much nicer. The entry point is at analyze_compiler_wrapper

tools/scan-build-py/libscanbuild/arguments.py
376

Yes, of course.

xazax.hun updated this revision to Diff 126543.Dec 12 2017, 6:31 AM
  • Further improvements to python script part.
tools/scan-build-py/libscanbuild/report.py
257

Minor nitpicking: type comments are semi-standardized with Sphinx-style auto-generated documentation, and should be a part of the docstring.

george.karpenkov requested changes to this revision.Dec 13 2017, 5:26 PM

I've tried using the patch, and I got blocked at the following: CTU options are only exposed when one goes through analyze-build frontend, which requires compile_commands.json to be present. I've used libear to generate compile_commands.json, but the generated JSON does not contain the command field, which causes @require before run to die (also, due to the passing style this error was unnecessarily difficult to debug).
So could you write a short documentation somewhere how all pieces fit together? What entry point should be used, what should people do who don't have a build system-generated compile_commands.jsonetc. etc.

This revision now requires changes to proceed.Dec 13 2017, 5:26 PM

Some comments on the C++ inline.

include/clang/AST/ASTContext.h
82 ↗(On Diff #126543)

Is this forward declaration needed?

include/clang/StaticAnalyzer/Core/PathSensitive/ExprEngine.h
63

I don't think CrossTranslationUnitContext belongs in ExprEngine (it doesn't really have much to do with transfer functions and graph construction.

Can you move it to AnalysisDeclContextManager instead?

Also, when you move it to AnalysisManager can you make it a pointer that is NULL when naive cross-translation support is not enabled? This will make it more clear that the cross-translation unit support will not always be available.

lib/StaticAnalyzer/Core/AnalyzerOptions.cpp
400

Can you also add an analyzer option that is something like 'enable-naive-cross-translation-unit-analysis' and defaults to false?

I'd like to avoid using the presence of 'ctu-dir' as an indication that the analyzer should use the naive CTU analysis. This way when if add a less naive CTU analysis we'll be able to the CTUDir for analysis artifacts (such as summaries) for the less naive CTU analysis as well.

lib/StaticAnalyzer/Core/CallEvent.cpp
379

This downcast is an indication that the CTUContext is living in the wrong class.

386

Can this logic be moved to AnalysisDeclContext->getBody()?

CallEvent::getRuntimeDefinition() is really all about modeling function dispatch at run time. It seems odd to have the cross-translation-unit loading (which is about compiler book-keeping rather than semantics) here.

396

I don't think it makes sense to diagnose index errors here.

Doing it when during analysis means that, for example, the parse error could be emitted or not emitted depending on whether the analyzer thinks a particular call site is reached.

It would be better to validate/parse the index before starting analysis rather than during analysis itself.

lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
199

There is no need for AnalysisConsumer to take both a CompilerInstance and a Preprocessor. You can just call getPreprocessor() to get the preprocessor from the CompilerInstance

I've tried using the patch, and I got blocked at the following: CTU options are only exposed when one goes through analyze-build frontend, which requires compile_commands.json to be present. I've used libear to generate compile_commands.json, but the generated JSON does not contain the command field, which causes @require before run to die (also, due to the passing style this error was unnecessarily difficult to debug).
So could you write a short documentation somewhere how all pieces fit together? What entry point should be used, what should people do who don't have a build system-generated compile_commands.jsonetc. etc.

Basically this patch does not change the way scan-build-py is working. So you can either create a compile_commands.json using intercept-build and than use analyze-build to use the result. Or you can do the two together using scan-build. We only offer the CTU functionality in analyze-build, because CTU needs multiple runs anyway, so on the spot build action capture and analyze is not possible with CTU. The general usage of scan-build-py is written in clang/tools/scan-build-py/README.md

xazax.hun marked 6 inline comments as done.Tue, Dec 19, 5:29 AM
xazax.hun added inline comments.
lib/StaticAnalyzer/Core/CallEvent.cpp
386

I just tried that and unfortunately, that introduces cyclic dependencies.
CrossTU depends on Frontend. Frontend depends on Sema. Sema depends on Analysis. Making Analysis depending on CrossTU introduces the cycle.

396

While I agree, right now this validation is not the part of the analyzer but the responsibility of the "driver" script for example CodeChecker. It is useful to have this diagnostics to catch bugs in the driver.

xazax.hun updated this revision to Diff 127525.Tue, Dec 19, 7:10 AM
xazax.hun marked an inline comment as not done.
  • Address some review comments
  • Rebased on ToT
xazax.hun added inline comments.Tue, Dec 19, 7:11 AM
lib/StaticAnalyzer/Core/AnalyzerOptions.cpp
407

This option might not be the most readable but this way experimental is a prefix. Do you prefer this way or something like enable-experimental-naive-ctu-analysis?

NoQ added a comment.Wed, Jan 10, 11:17 AM

Had a fresh look on the C++ part, it is super clean now, i'm very impressed :)

lib/StaticAnalyzer/Core/CallEvent.cpp
373–394

*Humbly suggests to refactor whatever we need here into SubEngine's virtual method(s).

getAnalysisManager() is already there, so i guess we only need to expose getCrossTranslationUnitContext().

385

I think CallEvent is the last place where i'd prefer to hardcode this filename. Maybe hardcode it in CrossTranslationUnitContext or get from AnalyzerOptions? (the former, i guess, because i doubt anybody would ever want to change it).

lib/StaticAnalyzer/Core/PathDiagnostic.cpp
418–424

It seems to me that XDL and YDL are exactly the same as XL and YL we've seen at the beginning of the function.

...we still have only one SourceManager, right?

xazax.hun added inline comments.Thu, Jan 11, 1:12 PM
lib/StaticAnalyzer/Core/PathDiagnostic.cpp
418–424

Is this true?
One is the location associated with the PathDiagnostic the other is the location of the Decl associated with the issue. I do not have deep understanding of this part of the code but not sure if these are guaranteed to be the same.

NoQ added inline comments.Thu, Jan 11, 1:25 PM
lib/StaticAnalyzer/Core/PathDiagnostic.cpp
418–424

Whoops, you're totally right, never mind.

Comments might have probably helped me understand that faster.

xazax.hun updated this revision to Diff 129509.Thu, Jan 11, 1:31 PM
xazax.hun marked 5 inline comments as done.
  • Fixed review comments
xazax.hun added inline comments.Thu, Jan 11, 1:32 PM
lib/StaticAnalyzer/Core/PathDiagnostic.cpp
418–424

Sure, comments might help, but I want to keep the changes in this patch focused and minimal, so I prefer to do such modifications in a separate patch.