Hi,
We would like to ask for your comments (RFC) on a patch
that introduces a unique hash identifier for defects reported by the Clang Static Analyzer. This identifier is written for every report into the PLIST output of Clang SA in the form: “<key>bug_id_1</key><string>MD5_HASH</string>”.
We are working on a report database & viewer for Clang SA, which we also presented at the EuroLLVM 2015 conference in London.
http://llvm.org/devmtg/2015-04/slides/Clang_static_analysis_toolset_final.pdf
Unique hash identifiers are useful for implementing the following use cases:
- Listing new defects compared to a baseline. This is called the “differential view”.
- Permanently suppressing false positive reports.
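Both use cases reduce to set operations on the identifiers. The following is a minimal sketch, not code from the patch or from our viewer: the function name new_defects and the short placeholder IDs are made up for illustration.

```python
def new_defects(baseline_ids, current_ids, suppressed_ids=frozenset()):
    """Return IDs reported in the current run that are neither in the
    baseline run nor on the suppression list."""
    return set(current_ids) - set(baseline_ids) - set(suppressed_ids)


# Placeholder IDs; real ones would be the MD5 strings from the PLIST files.
baseline = {"a1", "b2", "c3"}
current = {"b2", "c3", "d4", "e5"}
suppressed = {"e5"}  # marked as false positive by a user

differential_view = new_defects(baseline, current, suppressed)
```

This only works if the identifier stays stable across unrelated edits, which is what the design goals below are about.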
Design goals for the bug identifier:
a) The identifier must be unique: no two different defects may be identified by the same ID.
b) The identifier must not change when the code changes in ways that do not affect the reported fault. Such changes include inserting empty lines or adding unrelated code before or after the position of the report. Therefore the line number cannot be included in the identifier.
We propose to use a hash function to generate the unique identifier.
Since the line number cannot be used, we must use semantic information as the input to this hash.
The suggested patch generates an MD5 hash based on the following information:
- File name
• So a bug in a copy-pasted function with the same signature in another file gets a different ID.
- Content of the line where the bug is
• So if anything changes in the immediate surroundings of the bug, the ID changes. We think it is likely that changes on the same line will semantically affect the bug.
- Position (column) within the line
• To be able to differentiate between bugs within the same line reported by the same checker
- Unique name of the checker
• So that we are able to differentiate between reported faults for the same position, but generated by different checkers
- Signature of the enclosing function declaration, type declaration or namespace
• Due to overloaded functions and copy-pasted implementations, it is likely that the same fault is found in two different overloaded functions. These reports must have different IDs, thus we take the signature of the enclosing function into consideration. If the bug position is not within the scope of a function, we use the fully qualified name of the enclosing scope (type name or namespace). The global namespace is represented by an empty string.
- Optional Extra field
• There may be cases when a checker would like to report multiple problems for the same position. In this case the checker writer can add a differentiator field in the checker implementation.
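To make the list above concrete, here is a rough sketch of how such an ID could be computed. It is not the code from the patch: the function name bug_id, the field order, the separator and the sample values are all assumptions made for illustration.

```python
import hashlib

# Hypothetical separator, chosen only so that adjacent fields cannot run together.
FIELD_SEPARATOR = "\x1f"


def bug_id(file_name, line_content, column, checker_name, enclosing_scope, extra=""):
    """MD5 over the fields listed above. The line number is deliberately
    not an input, so shifting the line up or down leaves the ID unchanged."""
    fields = [file_name, line_content, str(column), checker_name, enclosing_scope, extra]
    payload = FIELD_SEPARATOR.join(fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Made-up report data.
report = dict(
    file_name="div.cpp",
    line_content="  return a / b;",
    column=12,
    checker_name="core.DivideZero",
    enclosing_scope="int divide(int, int)",
)

original = bug_id(**report)
# Same report after unrelated lines were inserted above it: the inputs are identical.
after_shift = bug_id(**report)
# Same position, different checker.
other_checker = bug_id(**{**report, "checker_name": "core.NullDereference"})
# Same line text in an overloaded function.
other_overload = bug_id(**{**report, "enclosing_scope": "long divide(long, long)"})
# Same line text at global scope (empty string).
global_scope = bug_id(**{**report, "enclosing_scope": ""})
```

In this sketch original and after_shift are equal, while each of the other three variants yields a different ID.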
The current code already contains an identifier generator similar to the one suggested above. That implementation takes into consideration only
• the name of the enclosing scope
• and the relative line number within the enclosing scope.
This information is insufficient as the basis of the hash for the reasons described above.
We included a version of the hash in the name of the key (<key>bug_id_1</key>) in the PLIST output to identify the hash generator algorithm. This way it will be possible to introduce a new hash calculation algorithm if needed.
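A consumer of the PLIST output could pick up the version from the key name roughly as follows. This is only a sketch under assumptions: we assume the reports are dictionaries inside a top-level "diagnostics" array, the function name read_bug_ids is made up, and the sample hash is a placeholder.

```python
import plistlib
import re

# Matches bug_id_1, bug_id_2, ... and captures the algorithm version.
BUG_ID_KEY = re.compile(r"^bug_id_(\d+)$")

# Minimal hand-written sample; a real Clang SA PLIST contains many more keys.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<plist version="1.0"><dict>'
    b'<key>diagnostics</key><array><dict>'
    b'<key>description</key><string>Division by zero</string>'
    b'<key>bug_id_1</key><string>0123456789abcdef0123456789abcdef</string>'
    b'</dict></array>'
    b'</dict></plist>'
)


def read_bug_ids(plist_bytes):
    """Return a list of (algorithm_version, hash) pairs, one per versioned key found."""
    root = plistlib.loads(plist_bytes, fmt=plistlib.FMT_XML)
    found = []
    for report in root.get("diagnostics", []):
        for key, value in report.items():
            match = BUG_ID_KEY.match(key)
            if match:
                found.append((int(match.group(1)), value))
    return found


ids = read_bug_ids(SAMPLE)
```

A database that stores the version next to the hash can then avoid comparing IDs produced by different algorithms.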
Please rename BugId to issue hash everywhere, including the file names.