Add preliminary Cross Translation Unit support library
ClosedPublic

Authored by xazax.hun on Jun 22 2017, 7:12 AM.

Details

Summary

This patch introduces a class that can help to build tools that require cross translation unit facilities.
This class allows function definitions to be loaded from external AST files based on an index.
In order to use this functionality an index is required. USRs are used as names to look up the functions.
This class also does caching to avoid redundant loading of AST files.

Right now only function defnitions can be loaded using this API, because this is what the Static Analyzer requires.
In to future this could be extended to classes, types etc.

More tests will come together with the first user once it is accepted: https://reviews.llvm.org/D30691

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
whisperity added inline comments.Aug 9 2017, 9:20 AM
include/clang/CrossTU/CrossTranslationUnit.h
42 ↗(On Diff #109559)

Does the name of this class make sense? If I say

CrossTranslationUnit *ctuptr = new CrossTranslationUnit(...);

did I construct a new CrossTranslationUnit? What even is a CrossTranslationUnit? Other class names, such as ASTImporter, DiagOpts, and FrontendAction, etc. somehow feel better at transmitting reflecting upon their meaning in their name.

lib/CrossTU/CrossTranslationUnit.cpp
46 ↗(On Diff #109559)

If we are using USR for lookup, then why does this look up "based on mangled name"?

Apart from those in the in-line comments, I have a question: how safe is this library to Release builds? I know this is only a submodule dependency for the "real deal" in D30691, but I have seen some asserts that "imported function should already have a body" and such.

Will the static analyzer handle these errors gracefully and fall back to "function is unknown, let's throw my presumptions out of the window" or will it bail away? I'm explicitly thinking of the assert-lacking Release build.

The basic idea is that, if a function already have a body in the current translation unit and you still want to import it from somewhere else, it might be a programmer error, so we do not handle this gracefully.

include/clang/CrossTU/CrossTranslationUnit.h
42 ↗(On Diff #109559)

What would you think about CrossTranslationUnitContext?

The functionality of this class is have all the relevant data required for doing Cross TU lookups and also provide the interface for that.

whisperity added inline comments.Aug 10 2017, 3:01 AM
include/clang/CrossTU/CrossTranslationUnit.h
42 ↗(On Diff #109559)

WFM. I'm not sure whose authority is the class names, but I like it a lot better.

xazax.hun updated this revision to Diff 110559.Aug 10 2017, 4:54 AM
xazax.hun marked 4 inline comments as done.
  • Address review comments

The creation of this library (libCrossTU) is approved for importing function definitions. @zaks.anna, @NoQ , @klimek could you please help us reviewing the code itself?

Then, when this is approved, we could progress with the review of the dependent "Support for naive cross translational unit analysis" (https://reviews.llvm.org/D30691) feature. The two patch is now in sync.
Thanks a lot in advance.

Except for some drive-by nits, this is a high-level review.

I have some high-level questions about the design of the library:

  1. How do you intend to handle the case when there are multiple function definitions with the same USR? Whose responsibility it is to deal with this: the client (for example, the static analyzer) or the index/database (libCrossTU)? This seems to me to be a fundamental part of the design for this feature that is missing. If you expect the client to handle this scenario (for example, to pick a definition) I would expect the library API to return more than a single potential result for a query. If you expect the the library to pick a definition then at a minimum you should document how this will be done. If the library will treat this as an error then you should communicate this to the client so it can attempt to recover from it.
  1. Whose responsibility is it to handle errors that the library runs into? Right now several errors appear to reported to the end user as diagnostics. It seems to me that the decision of how (or whether) to report these errors to the end user should be up to a particular client. Are these errors even actionable on the part of end users? My sense it that with the envisioned workflow users would not know/care about the format of the database file.
  1. Where should the code that understands the database file format live? Right now the logic for the parsing of the index format is in libCrossTU and the emission of the index is in the command-line tool. I think it would be easier to maintain if there were a single place that understand the file format. This way consumers and emitters of the index could be easily updated when the format/representation changes. Additionally, I think it is important to document this format.
include/clang/Basic/DiagnosticCrossTUKinds.td
13 ↗(On Diff #110559)

Is this a user-facing diagnostic? Do we expect users to know what 'CrossTU' means? Or what 'USR' means?

include/clang/CrossTU/CrossTranslationUnit.h
60 ↗(On Diff #110559)

Can this document what happens when the function declaration cannot be found? And what happens when multiple function declarations are found?

lib/CrossTU/CrossTranslationUnit.cpp
48 ↗(On Diff #110559)

Should this assert that DC is not null? What is the reason to accept null here?

81 ↗(On Diff #110559)

What is the rationale behind emitting this as a diagnostic? In general for a library I would expect that errors would be communicated to the client, which then would have responsibility for handling them, reporting them to the user, or aborting.

92 ↗(On Diff #110559)

It looks like this is parsing. Is the file format documented anywhere? Also, it would be good to keep the parsing and generation code in the same place so that it is easy to figure out what to update when the file format changes.

tools/clang-func-mapping/ClangFnMapGen.cpp
86 ↗(On Diff #110559)

It seems like the functionality that writes an entry and the functionality that reads an entry should ideally be in the same place. This way when the format is updated it is obvious what other places need to be updated. Can the code that writes a mapping be moved to libCrossTU and can we have this tool link against it?

Except for some drive-by nits, this is a high-level review.

I have some high-level questions about the design of the library:

  1. How do you intend to handle the case when there are multiple function definitions with the same USR? Whose responsibility it is to deal with this: the client (for example, the static analyzer) or the index/database (libCrossTU)? This seems to me to be a fundamental part of the design for this feature that is missing. If you expect the client to handle this scenario (for example, to pick a definition) I would expect the library API to return more than a single potential result for a query. If you expect the the library to pick a definition then at a minimum you should document how this will be done. If the library will treat this as an error then you should communicate this to the client so it can attempt to recover from it.

Right now we rely on the script that is using the analyzer such as scan-build-py or codechecker to remove the duplicate USRs from the index before starting the analysis. Having an ErrorOr return type so the user can handle the case when multiple definitions or none of them is found could be useful. It is not feasible to return multiple definitions, however. The importing alters the ASTContext, so the only way to choose between the definitions would be via a callback that is triggered before the import is done. What do you think?

  1. Whose responsibility is it to handle errors that the library runs into? Right now several errors appear to reported to the end user as diagnostics. It seems to me that the decision of how (or whether) to report these errors to the end user should be up to a particular client. Are these errors even actionable on the part of end users? My sense it that with the envisioned workflow users would not know/care about the format of the database file.

The intention was that to report to the user that the index format is invalid, since it is a configuration issue that she might be able to handle. But we could also return an error code and leave the reporting to the client. The reason we introduced the diagnostic is that we assumed that the client can not recover from a configuration error and will end up reporting the problem to the user.

  1. Where should the code that understands the database file format live? Right now the logic for the parsing of the index format is in libCrossTU and the emission of the index is in the command-line tool. I think it would be easier to maintain if there were a single place that understand the file format. This way consumers and emitters of the index could be easily updated when the format/representation changes. Additionally, I think it is important to document this format.

This is a very good point! I will update the library.

include/clang/Basic/DiagnosticCrossTUKinds.td
13 ↗(On Diff #110559)

It is. Do you have any alternative suggestion? What about error parsing index file without mentioning any of the terms you mentioned?

include/clang/CrossTU/CrossTranslationUnit.h
60 ↗(On Diff #110559)

Sure. Good point.

lib/CrossTU/CrossTranslationUnit.cpp
48 ↗(On Diff #110559)

Indeed, an assert should be better.

81 ↗(On Diff #110559)

Since this is likely to be a configuration error we were thinking that only the user can handle it. But I could transform this to an error code if you would prefer that solution.

92 ↗(On Diff #110559)

Good point!

tools/clang-func-mapping/ClangFnMapGen.cpp
86 ↗(On Diff #110559)

Sure, good point!

xazax.hun updated this revision to Diff 113088.Aug 29 2017, 7:03 AM
xazax.hun marked 2 inline comments as done.
  • Updates according to review comments.

The importing alters the ASTContext, so the only way to choose between the definitions would be via a callback that is triggered before the import is done. What do you think?

I think that could work. Another possibility would be to have a two step process. The first step would return a representation of possible definitions to import; then the client would chose which one to important and call another API to perform the actual import.

In either case, the important scenario I think we should support is choosing at a call site to a C function the most likely definition of the called function, based on number and type of parameters, from multiple possible definitions in other translation units. If the API is rich enough to support this then I think that is a good indication it will be useful for other scenarios as well.

The reason we introduced the diagnostic is that we assumed that the client can not recover from a configuration error and will end up reporting the problem to the user.

For the static analyzer client, at least, one possible recovery is performing the existing invalidation that we do when calling a function defined in another translation unit. (I'm not really sure this a good idea; I just wanted to point out that the decision probably belongs with the client). I think it is reasonable to report an error as a diagnostic -- but I think this should be up to the client and I don't think it should show up in the index file itself. In an ideal world the user wouldn't even know that file existed. Further, for large projects/incremental rebuilds a text file is unlikely to be a suitable representation for the index. In the long term I doubt the file will be end-user editable/readable.

In either case, the important scenario I think we should support is choosing at a call site to a C function the most likely definition of the called function, based on number and type of parameters, from multiple possible definitions in other translation units. If the API is rich enough to support this then I think that is a good indication it will be useful for other scenarios as well.

Note that the lookup is already based on USR which is similar to mangled names in a sense that it contains information about the function parameters. So the only way to get multiple candidates from the lookup is having multiple function definitions with the same signature. It is only possible to disambiguate with knowledge about the linking graph. I think that information is better to be stored in the index in the future and do the disambiguation within the library instead of making the client do that.

The reason we introduced the diagnostic is that we assumed that the client can not recover from a configuration error and will end up reporting the problem to the user.

For the static analyzer client, at least, one possible recovery is performing the existing invalidation that we do when calling a function defined in another translation unit. (I'm not really sure this a good idea; I just wanted to point out that the decision probably belongs with the client). I think it is reasonable to report an error as a diagnostic -- but I think this should be up to the client and I don't think it should show up in the index file itself. In an ideal world the user wouldn't even know that file existed. Further, for large projects/incremental rebuilds a text file is unlikely to be a suitable representation for the index. In the long term I doubt the file will be end-user editable/readable.

I agree that we do not consider this format to be final for the index. In the future, it would be great to include linking information as well and also have utilities to merge partial indexes. This is a first version to make the machinery work for the most common cases which are already pretty usable for the open source projects we tested.

xazax.hun updated this revision to Diff 113206.Aug 30 2017, 1:47 AM
  • Added unit test to ensure the produced index format can be parsed.
  • Added further diagnostics.

In either case, the important scenario I think we should support is choosing at a call site to a C function the most likely definition of the called function, based on number and type of parameters, from multiple possible definitions in other translation units. If the API is rich enough to support this then I think that is a good indication it will be useful for other scenarios as well.

Note that the lookup is already based on USR which is similar to mangled names in a sense that it contains information about the function parameters. So the only way to get multiple candidates from the lookup is having multiple function definitions with the same signature.

I just want to clarify that C function USRs do not contain type information, although C++ USRs do.

In either case, the important scenario I think we should support is choosing at a call site to a C function the most likely definition of the called function, based on number and type of parameters, from multiple possible definitions in other translation units. If the API is rich enough to support this then I think that is a good indication it will be useful for other scenarios as well.

Note that the lookup is already based on USR which is similar to mangled names in a sense that it contains information about the function parameters. So the only way to get multiple candidates from the lookup is having multiple function definitions with the same signature.

I just want to clarify that C function USRs do not contain type information, although C++ USRs do.

Just double checked and indeed: https://github.com/llvm-mirror/clang/blob/master/lib/Index/USRGeneration.cpp#L236
Thank you for the clarification, it looks like I did not notice this detail. In this case, it might make sense to let the client disambiguate.

small nits

include/clang/CrossTU/CrossTranslationUnit.h
58 ↗(On Diff #113206)

I suggest that "can" is removed.

62 ↗(On Diff #113206)

there is no \return or \returns here.

62 ↗(On Diff #113206)

maybe: each line consists of an USR and a filepath separated by a space

68 ↗(On Diff #113206)

Maybe /can be/is/ unless you envision that tools that require cross translation unit capability might use some other classes.

92 ↗(On Diff #113206)

you should split out some of this information to a \return or \returns

lib/CrossTU/CrossTranslationUnit.cpp
60 ↗(On Diff #113206)

redundant assignment

79 ↗(On Diff #113206)

I suggest that LineNo is 1 on the first line.

81 ↗(On Diff #113206)

Pos can be const

82 ↗(On Diff #113206)

LineRef can be const

84 ↗(On Diff #113206)

FunctionName and FileName can be moved here and it is possible to make these variables const.

85 ↗(On Diff #113206)

I would like to see some FunctionName validation. For instance "123" should not be a valid function name.

89 ↗(On Diff #113206)

Stupid question .. how will this work if the path is longer than 256 bytes?

102 ↗(On Diff #113206)

how about std::ostringstream , imho that is cleaner

121 ↗(On Diff #113206)

/funtion/function/

125 ↗(On Diff #113206)

LookupFnName could be const right?

148 ↗(On Diff #113206)

I believe LookupFnName can be const

189 ↗(On Diff #113206)

It is possible to make ASTFileName const

223 ↗(On Diff #113206)

sorry I am probably missing something here.. you first assert !FD->hasBody() and here you assert FD->hasBody(). Is it changed in this function somewhere?

tools/clang-func-mapping/ClangFnMapGen.cpp
78 ↗(On Diff #113206)

I believe that realpath() is a posix function

unittests/CrossTU/CrossTranslationUnitTest.cpp
40 ↗(On Diff #113206)

should this be assert(FD && FD->getName() == "f")

xazax.hun updated this revision to Diff 113740.Sep 4 2017, 4:55 AM
xazax.hun marked 19 inline comments as done.
  • Make the API capable of using custom lookup strategies for CTU
  • Fix review comments
lib/CrossTU/CrossTranslationUnit.cpp
82 ↗(On Diff #113206)

StringRef is an immutable reference to a string, I think it is not idiomatic in LLVM codebase to make it const.

85 ↗(On Diff #113206)

This is not a real function name but an USR. I updated the name of the variable to reflect that this name is only used for lookup.

89 ↗(On Diff #113206)

If the path is shorter than 256 bytes it will be stored on stack, otherwise SmallString will allocate on the heap.

125 ↗(On Diff #113206)

In LLVM we usually do not mark StringRefs as consts, they represent a const reference to a string.

223 ↗(On Diff #113206)

Yes, after the importer imports the new declaration with a body we should be able to find the body through the original declaration. The importer modifies the AST and the supporting data structures.

xazax.hun marked 9 inline comments as done.Sep 4 2017, 4:57 AM

Thanks Gabor! Some additional comments in line.

include/clang/CrossTU/CrossTranslationUnit.h
118 ↗(On Diff #113740)

Is this comment correct? Does it return a map?

128 ↗(On Diff #113740)

I found this comment confusing. Is 'Unit' the separate ASTUnit or it the current one?

134 ↗(On Diff #113740)

If this is public API, it deserves a comment. If not, can we make it private?

lib/CrossTU/CrossTranslationUnit.cpp
35 ↗(On Diff #113740)

Is this FIXME still relevant? What will need to be transitioned to llvm::Error for it to be removed? Can you make the comment a bit more specific so that future maintainers will know when/how to address the FIXME?

194 ↗(On Diff #113740)

What is this cast for? Is it to suppress an analyzer dead store false positive? I thought we got all those!

199 ↗(On Diff #113740)

If I understand correctly, there is no way to call loadExternalAST() without the potential for a diagnostic showing up to the user. Can this diagnostic emission be moved to the client?

249 ↗(On Diff #113740)

Does this method need to take the unit? Can it just get the ASTContext from 'FD'? If not, can we add an assertion that FD has the required relationship to Unit? (And document it in the header doc.)

tools/clang-func-mapping/ClangFnMapGen.cpp
56 ↗(On Diff #113740)

Is getLookupName() here used?

72 ↗(On Diff #113740)

Should this instead reuse the logic for generating a lookup name from CrossTranslationUnitContext::getLookupName()? That way if we change how the lookup name is generated we only have to do it in one place.

xazax.hun updated this revision to Diff 114133.Sep 7 2017, 2:18 AM
xazax.hun marked 8 inline comments as done.
  • Address review comments.
lib/CrossTU/CrossTranslationUnit.cpp
35 ↗(On Diff #113740)

Unfortunately, it is. IndexError needs to implement convertToErrorCode because it is pure virtual in the base class, and that requires the existence of this class. So this can only be removed, once convertToErrorCode is pruned globally from the codebase. I think all of such classes are marked with similar FIXMEs.

194 ↗(On Diff #113740)

The point was to avoid an assertion fail where an error was not checked. The reason, there is no really good way to propagate errors right now. But with this diagnostic emission moved to the client this is no longer required.

Any further reviews?
What are the criteria to accept this patch? Who or how many people should accept this?

compnerd added a subscriber: alexshap.

I'm OK with this going into the repo a is (although it is light on tests!), as long as we have an agreement that you'll be OK with iteration on both the interface and the implementation to handle real-world projects.

More specifically, for this to work well on large/complicated projects we'll need to:

  • Move conflict resolution from index generation where it is now to query time so that clients can pick the best implementation of a function when multiple are found. This is important for projects that have multiple functions with the same name or that build the same file in multiple ways.
  • Change function lookup from needing to traverse over the entire AST.
  • Move away from a linear, flat-file index to something that can handle larger projects more efficiently.

For this last point, I suspect it will be good to adopt whatever Clangd ends up using to support indexing. (There has been some discussion of this relatively recently on the cfe-dev list.)

I'm OK with this going into the repo a is (although it is light on tests!), as long as we have an agreement that you'll be OK with iteration on both the interface and the implementation to handle real-world projects.

I agree that it could have more tests, but while the main features are already tested, more tests are also coming with the first usage of the library (the Static Analyzer). We are also willing to add new tests as the interface/library evolves.

More specifically, for this to work well on large/complicated projects we'll need to:

  • Move conflict resolution from index generation where it is now to query time so that clients can pick the best implementation of a function when multiple are found. This is important for projects that have multiple functions with the same name or that build the same file in multiple ways.

I agree. Right now the way we use this internally, the collisions are handled after index generation by an external tool (CodeChecker).

  • Change function lookup from needing to traverse over the entire AST.
  • Move away from a linear, flat-file index to something that can handle larger projects more efficiently.

    For this last point, I suspect it will be good to adopt whatever Clangd ends up using to support indexing. (There has been some discussion of this relatively recently on the cfe-dev list.)

The index right now is a minimal implementation to make this functionality work. This part was never the bottleneck according to our measurements so far. It would be great to reuse the index format from ClangD in case it supports to generate lightweight indexes that only contain the data we need for analysis. (We only need an index of function definitions for the analyzer and nothing else. I suspect other clients also would not need other information.)

I think it would be a waste of effort to change the lookup strategy or move the collision handling mechanisms to index generation until we do not know how the index will work. But I also think it wouldn't be good to block this until ClanD indexing reaching a mature state.

All in all, we are willing to reuse as much of ClangD indexing as we can in the future, but I think it is also more aligned with the LLVM Developer Policy to have something working committed and do the development in the repo rather than working separately on a fork.

xazax.hun updated this revision to Diff 116195.Sep 21 2017, 8:30 AM
  • Unittests now creates temporary files at the correct location
  • Temporary files are also cleaned up when the process is terminated
  • Absolute paths are handled correctly by the library
dcoughlin accepted this revision.Sep 21 2017, 2:35 PM

But I also think it wouldn't be good to block this until ClanD indexing reaching a mature state.

I agree 100%.

All in all, we are willing to reuse as much of ClangD indexing as we can in the future, but I think it is also more aligned with the LLVM Developer Policy to have something working committed and do the development in the repo rather than working separately on a fork.

This sounds good to me. I just wanted to make sure that you were onboard with changing this index to use the common indexing infrastructure in the future. (Assuming it makes sense to do so, of course).

This revision is now accepted and ready to land.Sep 21 2017, 2:35 PM
This revision was automatically updated to reflect the committed changes.
r.stahl added a subscriber: r.stahl.Oct 6 2017, 8:24 AM
  • Unittests now creates temporary files at the correct location
  • Temporary files are also cleaned up when the process is terminated
  • Absolute paths are handled correctly by the library

I think there is an issue with this change.

First, the function map generation writes absolute paths to the map file. Now when the parseCrossTUIndex function parses it, it will no longer prepend the paths with the CTUDir and therefore expect the precompiled AST files at the exact path of the source file. This seems like a bad requirement, because the user would have to overwrite his source files.

If I understand your intention correctly, a fix would be to replace the is_absolute check with a check for CTUDir == "" in the parseCrossTUIndex function.

  • Unittests now creates temporary files at the correct location
  • Temporary files are also cleaned up when the process is terminated
  • Absolute paths are handled correctly by the library

I think there is an issue with this change.

First, the function map generation writes absolute paths to the map file. Now when the parseCrossTUIndex function parses it, it will no longer prepend the paths with the CTUDir and therefore expect the precompiled AST files at the exact path of the source file. This seems like a bad requirement, because the user would have to overwrite his source files.

If I understand your intention correctly, a fix would be to replace the is_absolute check with a check for CTUDir == "" in the parseCrossTUIndex function.

Good catch, that is right!