Page MenuHomePhabricator

[Clang Static Analyzer] Bug identification
ClosedPublic

Authored by xazax.hun on Jun 8 2015, 2:21 AM.

Details

Summary

Hi,

We would like to ask for your comments and opinion (RFC) a patch
that introduces a unique hash identifier for defects reported by Clang Static Analyzer. This unique identifier is generated for every report into the PLIST file output of Clang SA in the form: “<key>bug_id_1</key><string>MD5_HASH</string>”.

We work on a report database & viewer for Clang SA, which we also presented on the EuroLLVM 2015 conf in London.
http://llvm.org/devmtg/2015-04/slides/Clang_static_analysis_toolset_final.pdf

Unique hash identifiers are useful to implement the following use-cases:

  1. list new defects compared to a baseline. This is called “differential view”.
  2. Being able to permanently suppress false positive reports.

Design goals for the bug identifier:
a) The identifier must be unique. So there must not exist two different defects identified by the same ID.
b) The identifier must not change if there are changes in the code not affecting the reported fault. Such changes are: inserting empty lines, or adding unrelated code before or after the position of the report. Therefore line number cannot be included in the identifier.

We propose to use a hash function to generate the unique identifier.
Since line number cannot be used we must utilize semantic information as the source of this hash.

The suggested patch generates an MD5 hash based on the following information:

  1. File name

• So a bug in a copy pasted function with the same signature has a different id.

  1. Content of the line where the bug is

• So if anything changes in the close environment of the bug, it changes the ID. We think that it is likely that the changes in the same line will semantically affect the bug.

  1. Position (column) within the line

• To be able to differentiate between bugs within the same line reported by the same checker

  1. Unique name of the checker

• So that we are able to differentiate between reported faults for the same position, but generated by different checkers

  1. Signature of the enclosing functiondecl, type declaration or namespace

• Due to overloaded functions and copy pasted implementations, it is likely that the same fault is found in two different overloaded functions. These reports must have different IDs, thus we take the signature of the enclosing function into consideration. If the bug position is not within the scope of a function, we use the fully qualified name of the enclosing scope (type name or namespace). The global namespace is represented by an empty string.

  1. Optional Extra field

• There may be cases when a checker would like to report multiple problems for the same position. In this case the checker writer can add a differentiator field in the checker implementation.

In the current code there exists a similar identifier generator to the one suggested above. That implementation takes into consideration only
• the name of the enclosing scope
• and the relative line number within the enclosing scope.
This source of information is insufficient as the base of the hash for the reasons described above.

We included a version of the hash in the name of the key (<key>bug_id_1</key>) in the PLIST output to identify the hash generator algorithm. This way it will be possible to introduce a new hash calculation algorithm if needed.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
babati updated this revision to Diff 29603.Jul 13 2015, 1:45 PM

Changed fuzzy extra field to BugType.
Fixed location.
Updated to the trunk.

Some minor style comments.

include/clang/StaticAnalyzer/Core/BugReporter/BugReporter.h
146 ↗(On Diff #27669)

Ampersand should be next to the variable name. On some other places too.

lib/StaticAnalyzer/Core/BugId.cpp
2 ↗(On Diff #29603)

The top of the header should be one line and should be within the 80 column limit.

Just for reference, this is the tool that wants to utilize the unique identifiers: https://github.com/Ericsson/codechecker

babati updated this revision to Diff 30808.Jul 28 2015, 5:43 AM

Added line normalization.
Updated to the trunk.

please consider stripping the line from comments and normalize white spaces before hashing and calculating column position.
This will make the hash more robust, as a re-identation or adding comments to a line will not spoil the hashes.

I changed the patch and the normalized line is used instead of the raw content of line. So, the whitespaces and the comments are removed from the line before the hashing.

By redundant, I mean that this information is already encoded in the report; even if it's not part of the issue id. I can see this argument go either way. However, if we do decide to include the filename, we would need to change clang/utils/analyzer/CmpRuns.py and the current issue_hash so that it's all consistent.

The hash should be a unique identifier of a concreate defect. If a hash identifies multiple deffects in different files at the same time, that must be considered as a fault (from the user perspective).
If the user suppresses a fault then he discovers later that with that suppression 2 or more other bugs „disappeared”.

The filename should be part of the hash because there will be hash clash if for example:
-there are multiple main() functions in the codebase with the same signature (this is likely)
-and there is a same line with a defect in each of them
then the same bug hash would be generated.

Including the filename in the hash would decrease the likelyhood of such cases.

This definitely needs tests!

I agree that the file name should be used for bug identification. I am not sure if it should be included in the hash by the analyzer or if it should be included by the processing script. Have you looked at CmpRuns.py and SATestBuild.py?

On the other hand, the tools will be much faster if they do not have to process more information from the report to identify the bugs.

The file name is another field in the same report; I do not see how this would result in a much slower scripts.

Here is one problem to consider:
What is included as "filename"? Is it the full path to the file where the issue is found? Is it just the file name without the path?

Does your solution support uniquing the issues when the same project is analyzed from different directories?
How will you handle issues coming from 2 different projects that have the same file name?

Our testing scrips that is based on CmpRuns has the notion of multiple projects and bug reports coming from different projects are considered to be different. We include project name and the location of the file within the project to identify the issue. Not including a filename in the hash allows for that model.

babati updated this revision to Diff 31636.Aug 10 2015, 12:27 AM

Removed filename.
Updated to the latest trunk.

We wanted to include the filename without path in the hash. This would allow us to compare two runs of the same project (located in different directories) purely based on the hashes.
This gives good enough results when comparing different versions of the same project, or storing bug suppression for a single project.
(Limitation: You would not be able to detect a new bug if it appaers in a new file with same name and same content as an existing one. (for example a copied main.c) But we think this is a rare case.)

If the filename is not part of the hash, then the comparison script can get somewhat slower, as you need to compare also the extra file field. But this is acceptable.

So finally, I agree that leaving out the filename from the hash allows for greater flexibility at post processing.

Do you think we need any other improvements to this patch?

Hi Babati,

I'm sorry for the late appreciation response. I can see the Bug ID in HTML.
I have also tested CmpRuns.py and it works also fine as Anna explained.
CmpRuns.py works great with plist files, but I personally want to compare two reports with HTML results without plist as well.

So my use case prefers to have a uniq hash number in each HTML report.
And I was thinking to modify the bug report name from "report-77e15b.html" to "report-<BugID>.html" so that we can easily compare two reports. It can just check whether the same name files exist or not. If we need post processing to generate uniq ID by extracting file name, comparison would be a little bit more complicated.

As far as I observed by using CmpRuns.py, it can greatly compare two reports with plist, but I don't know how I can pick the HTML reports that are matched with CmpRuns.py's result.
For example, even if CmpRuns.py shows there are 3 different bug reports of out 10 bugs, it's not easy to open the 3 different HTML bug reports in browser. It just shows the number of different bugs and those bug type descriptions.

I hope anyone can correct me if I'm wrong or misunderstand some points.

We need a way to test this functionality. One way of testing this would be to write unit tests. Gabor has already added such tests for the static analyzer.

honggyu.kim,

You are right, CmpRuns.py does not work with HTML files.

The list of HTML reports is produced by scan-build and there is no facility there to search for newly generated reports. It is also not clear that there should be one HTML file per report. This is what we have now but the current approach does not scale well for large files with numerous reports (since we create a copy of the file for each report.) We do not have a proper design for issue uniquing with HTML interface in tree.

honggyu.kim added a comment.EditedAug 10 2015, 7:03 PM

honggyu.kim,

You are right, CmpRuns.py does not work with HTML files.

The list of HTML reports is produced by scan-build and there is no facility there to search for newly generated reports.

Thanks for the confirmation.

It is also not clear that there should be one HTML file per report. This is what we have now but the current approach does not scale well for large files with numerous reports (since we create a copy of the file for each report.) We do not have a proper design for issue uniquing with HTML interface in tree.

Yes, because we generate one HTML file per report, the number of bug reports can be hugely increased sometimes. In order to have multiple bug reports in a single HTML file just like plist, we need to write JavaScript code per HTML report to switch between multiple reports in a single file.

But I think it's not a critical issue as of now and it can be considered at the next step.
I hope to have a way to compare bug reports with HTML reports without plist as well otherwise we need some facility to match the result of CmpRuns.py with HTML reports.

As I mentioned, I think the simplist way of comparing HTML reports is to have the BugID in each HTML report so that we can just compare HTML reports whether the file exists or not.

I appreciate all your feedback. Thanks.

zaks.anna requested changes to this revision.Aug 14 2015, 6:45 PM
zaks.anna edited edge metadata.

This patch needs tests.

We should treat the improvements to HTML uniquing separately from this patch.

But I think it's not a critical issue as of now and it can be considered at the next step.

I think it is very relevant to the design of the solution. For example, you could keep the information about the reports in the plist files and use those to render the reports in HTML.

As I mentioned, I think the simplist way of comparing HTML reports is to have the BugID in each HTML report so that we can just compare HTML reports >whether the file exists or not.

I believe scan-build creates and names those files. How do you propose to pass this information from the clang invocation to scan-build? Can you check if the BugID is already part of the HTML report itself? That might be a much simpler solution.

(One thing to keep in mind is that right now you get more reports with plist output. Because HTML cannot currently display cross file paths, this reports are not reported.)

This revision now requires changes to proceed.Aug 14 2015, 6:45 PM

Sorry for the late answer.
I would like to write the comment for HTML report comparison first.

As I mentioned, I think the simplist way of comparing HTML reports is to have the BugID in each HTML report so that we can just compare HTML reports >whether the file exists or not.

I believe scan-build creates and names those files. How do you propose to pass this information from the clang invocation to scan-build? Can you check if the BugID is already part of the HTML report itself? That might be a much simpler solution.

I meant that having BugID in the file name of HTML report would be beneficial when it comes to comparison. For example, report-<BugID_1234>.html and report-<BugID_1234>.html can be easily compared by their file names without observing the inside.

As you mentioned, there is a way to generate HTML report that has such file name with "-analyzer-config stable-report-filename=true" option. It is described in scan-build help message:

-analyzer-config <options>

  Provide options to pass through to the analyzer's -analyzer-config flag.
  Several options are separated with comma: 'key1=val1,key2=val2'

  Available options:
    * stable-report-filename=true or false (default)
      Switch the page naming to:
      report-<filename>-<function/method name>-<id>.html
      instead of report-XXXXXX.html

It seems that I can easily compare the result with HTML file names with this option.

honggyu.kim added a comment.EditedSep 7 2015, 6:46 AM

I would like to also write about bug identification methods.
As I observed the current CmpRuns.py script, the IssueIdentifier is defined as follows:

def getIssueIdentifier(self) :
    id = self.getFileName() + "+"
    if 'issue_context' in self._data :
      id += self._data['issue_context'] + "+"
    if 'issue_hash' in self._data :
      id += str(self._data['issue_hash'])
    return id

https://github.com/llvm-mirror/clang/blob/master/utils/analyzer/CmpRuns.py#L69-L75

It has 3 items to generate a bug identification.
(1) file name
(2) function name - issue_context
(3) line offset from the beginning of function - issue_hash

As of now, we generate issue_hash by simply calculating the line offset from the first line of the function.

FullSourceLoc UL(SM->getExpansionLoc(UPDLoc.asLocation()),
                 *SM);
FullSourceLoc UFunL(SM->getExpansionLoc(
  D->getUniqueingDecl()->getBody()->getLocStart()), *SM);
o << "  <key>issue_hash</key><string>"
  << UL.getExpansionLineNumber() - UFunL.getExpansionLineNumber()
  << "</string>\n";

https://github.com/llvm-mirror/clang/blob/master/lib/StaticAnalyzer/Core/PlistDiagnostics.cpp#L423-L452

On the other hand, this patch generates BugID as follows:

llvm::SmallString<32> clang::GetIssueHash(const SourceManager *SM,
                                          FullSourceLoc &L,
                                          StringRef CheckerName,
                                          StringRef HashField, const Decl *D) {
  static llvm::StringRef Delimiter = "$";
  
  return GetHashOfContent(
      (llvm::Twine(CheckerName) + Delimiter + GetEnclosingDeclContextSignature(D) +
       Delimiter + std::to_string(L.getExpansionColumnNumber()) + Delimiter +
       NormalizeLine(SM, L, D) + 
       Delimiter + HashField.str()).str());
}

It has 6 items to generate a bug identification.
(1) file name (removed now)
(2) checker name
(3) function name - GetEnclosingDeclContextSignature(D)
(4) column number
(5) source line string after removing whitespace - NormalizeLine(SM, L, D)
(6) bug type - D->getBugType()

I think even if this patch is not accepted, we need to accept some of the methods suggested by this patch.
Current CmpRuns.py cannot distinguish the following 2 different bugs.

BUG 1. garbage return value

1 int main()
2 {
3   int a;
4   return a;
5 }

test.c:4:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~

BUG 2. garbage assignment

1 int main()
2 {
3   int a;
4   int b = a;
5   return b;
6 }

test.c:4:3: warning: Assigned value is garbage or undefined
  int b = a;
  ^~~~~   ~

In this case, getIssueIdentifier() returns the same ID for both cases as below:
<filename> + <function name> + <line offset from function>
test.c + main + 2

We cannot distinguish BUG1 and BUG2 with the current CmpRuns.py, so at least we need to add checker information from <check_name>.

BUG 3. a single line of comment is added based on BUG 1 code.

1 int main()
2 {
3   // main function
4   int a;
5   return a;
6 }

test.c:5:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~

If we compare BUG3 with BUG1, CmpRuns.py shows those bugs are different even though only a single line of comment is added without actual modification.

REMOVED: 'test.c:4:3, Logic error: Undefined or garbage value returned to caller'
ADDED: 'test.c:5:3, Logic error: Undefined or garbage value returned to caller'
TOTAL REPORTS: 1
TOTAL DIFFERENCES: 2

I think we need to enhance issue_hash generation method in order to avoid those cases.

honggyu.kim,

Uniquing HTML reports is out of the scope of this patch and should be discussed elsewhere (either send a design idea to cfe-dev, send a patch for review, or file a bugzilla request).

I agree that this patch is a definite improvement to issue identification; however, as I mentioned earlier, it needs tests.

honggyu.kim,

Uniquing HTML reports is out of the scope of this patch and should be discussed elsewhere (either send a design idea to cfe-dev, send a patch for review, or file a bugzilla request).

I agree that this patch is a definite improvement to issue identification; however, as I mentioned earlier, it needs tests.

Hi Anna,

I have created another patch based on Babati's work in order to use the existing infrastructure(CmpRuns.py).
Please have a look at D12906 and give me your feedback.
Thank you very much for your comment.

Thanks! Will do.

Hi Babati,

Can you please rebase this patch based on D12673?
It is just a whitespace cleanup patch and I wrote D12906 based on this patch after applying whitespace cleanup at the end of each line.
D12906 can be applied after applying this patch first.
Thank you.

Hi Babati,

We at Sony are interested in this feature so that our tools can suppress undesirable bug warnings. You described at the end of the summary that you have thought about introducing new hash calculation algorithms if needed. How do you expect this to work? i.e. would bug_id_1 always be generated along with new improved bug_ids in the same plist file or would you expect the new bug_ids to replace old ones. I am hoping that the analyzer will always keep generating old bug_ids so that we can maintain backwards compatibility.

Regards,
Phillip

SN Systems - Sony Computer Entertainment

Hi

Sorry for the late answer.

Can you please rebase this patch based on D12673?

Yes, I can. The patch will comes soon.

How do you expect this to work? i.e. would bug_id_1 always be generated along with new improved bug_ids in the same plist file or would you expect the new bug_ids to replace old ones.

I think, the newly introduced bugids should be generated along with the older ones.

I am hoping that the analyzer will always keep generating old bug_ids so that we can maintain backwards compatibility.

I agree with this.

Even if the analzyer no longer generates new hashes it should be easy to maintain old hash algorithms out of tree.

Hi Babati,

As far as I can see, the following comments from June 15th have not been addressed. It would be good if you could address them in the latest revision.

"I would be interested in either replacing "issue_hash" or adding "issue_hash_bug_line_content" (or something like it) instead of adding another completely differently named field with very similar information. I see no reason for having both. I am not sure if we have any users of "issue_hash" right now, who will suffer from the change. Maybe we could have "issue_hash", "issue_hash_1"(offset based), and "issue_hash_2"(content of line) and add another field "issue_hash_version" that describes the version "issue_hash" is using?

This needs tests!!!"

Hi,

Regarding testing:
I think we should create a RecursiveASTvistor based "test checker" that matches every statement and declaration and reports a bug there.
Then we could create a test file similar to what we have in /tools/clang/test/Analysis/diagnostics/report-issues-within-main-file.cpp
where the expected plist output can be written at the end of the file.

I am not sure though where to register this test "test checker". Should it be a dynamically loadable checker similar to /tools/clang/examples/analyzer-plugin/MainCallChecker.cpp or it should be a debug checker like (debug.DumpCalls)?

The advantage of the dynamic lib based solution is that we would not need to statically add it to clang-sa.

What do you think?

Regards,
Daniel

xazax.hun commandeered this revision.Sep 25 2015, 6:37 AM
xazax.hun added a reviewer: babati.
xazax.hun updated this revision to Diff 35721.Sep 25 2015, 6:41 AM
xazax.hun edited edge metadata.

Unfortunately Bence is can no longer work on this patch, because he had to switch to another team within Ericsson. I took the authorship of this patch.

This version has several changes since the last one:

  • Updated to latest trunk.
  • The name of the hash field was changed to issue_hash_2.
  • Correctly handle prototypeless C functions.
  • Eliminated a string copy when generating the hash.
  • Added a debug checker to dump the context of the hash.
  • Made the lit tests pass.
  • Style fixes.
xazax.hun updated this revision to Diff 35722.Sep 25 2015, 6:45 AM
xazax.hun edited edge metadata.
  • Updated patch to have more context.

Could you address this:

Could you ask on the open source list if people are using "issue_hash" and if they are Ok with us renaming it.

I'd suggest to suffix each issue hash field with the description of that hash. For example, we would remove "issue_hash" and replace it with "issue_hash_function_offset" and add "issue_hash_content_of _line_in_context".

include/clang/StaticAnalyzer/Core/BugId.h
1 ↗(On Diff #35722)

Please, rename BugId into issue hash everywhere, including the file names.

23 ↗(On Diff #35722)

This is an API. Could you add descriptions to the parameter and/or use better parameter names. For example, what's HashField?

We also need to describe how the hash is computed; for example, how L is used...

lib/StaticAnalyzer/Checkers/DebugCheckers.cpp
239

Awesome!

lib/StaticAnalyzer/Core/BugId.cpp
12 ↗(On Diff #35722)

Let's sort the includes.

29 ↗(On Diff #35722)

Can/Should we use some existing machinery for this? For example, mangleName().

38 ↗(On Diff #35722)

Why are these necessary?

52 ↗(On Diff #35722)

Why do we need these in the hash?

126 ↗(On Diff #35722)

ObjCMethod is not a FunctionDecl. Please, add a test case.

198 ↗(On Diff #35722)

Including the checker name here is not perfect because checker name can easily change from one release of clang to another. For example, when a checker moves to another package.

I think the best approach here would be to give checkers some unique identifiers and using those here. We can discuss the more if you are interested in solving this problem.

lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
242

This check UPDLoc.isValid() should probably move here; no need to create 2 FullSourceLocs. It's a bit confusing to read otherwise. (Same in the code below.)

lib/StaticAnalyzer/Core/PlistDiagnostics.cpp
19

Is this needed?

test/Analysis/model-file.cpp
43–88

Extra space in the whole file.

xazax.hun marked 9 inline comments as done.Oct 2 2015, 7:30 AM
xazax.hun added inline comments.
lib/StaticAnalyzer/Core/BugId.cpp
29 ↗(On Diff #35722)

Generating mangled names requires ASTContext which is not available during the error reporting. BugReporter does have the ASTContext, so it would not be a big change to add it to the DiagnosticConsumers though. And I think the mangled name might contain too much redundant information. The only relevant information here is the fully qualified name and the parts of the signature that can be ocerloaded on e.g.: constness. Using this method might also be easier to debug, since the IssueString is more readable.

38 ↗(On Diff #35722)

You are right, it is not possible to overload on these properties, so it is safe (and even beneficial) to remove them from the hash.

52 ↗(On Diff #35722)

We do not need these information (see my previous comment). I removed them from the hash.

198 ↗(On Diff #35722)

I definitely agree. It would be great to have a unique identifier for checkers that is independent from the name/package. It is not a trivial problem however, since we probably want to support plugins too. I can think of a similar solution like git commit ids, but I think this problem might be out of the scope of this patch. But I am happy to start a discussion on the mailing list and create a new patch to solve this.

lib/StaticAnalyzer/Core/PlistDiagnostics.cpp
19

Removed.

test/Analysis/model-file.cpp
43–88

Whatever I do it looks like the whitespaces are not alligned well for this file. My guess s that this plist might be generated with an old version of the static analyzer. I think it is better to actually update it to the old formatting than doing a diff by hand, or writing a script to do just that.

xazax.hun updated this revision to Diff 36363.Oct 2 2015, 7:33 AM
xazax.hun marked 3 inline comments as done.
  • Updated to latest trunk.
  • Style fixes.
  • Addressed comments.
  • Added tests for Objective-C constructs.
  • Fixed the handling of Objective-C methods.

Generating mangled names requires ASTContext which is not available during the error reporting. BugReporter does have the ASTContext, so it would not
be a big change to add it to the DiagnosticConsumers though. And I think the mangled name might contain too much redundant information. The only
relevant information here is the fully qualified name and the parts of the signature that can be ocerloaded on e.g.: constness. Using this method might also
be easier to debug, since the IssueString is more readable.

I think it'd be fine to pass ASTContext to the DiagnosticConsumers. It would be useful to find out exactly what extra information the mangled names have that we do not want to see in the issue_hash.

One issue that I see with the current approach is that we do not differentiate methods from different classes/namespaces. Is this by design?

I definitely agree. It would be great to have a unique identifier for checkers that is independent from the name/package. It is not a trivial problem however,
since we probably want to support plugins too. I can think of a similar solution like git commit ids, but I think this problem might be out of the scope of this
patch. But I am happy to start a discussion on the mailing list and create a new patch to solve this.

Sounds good to me. I agree that this is out of scope for this patch.

include/clang/StaticAnalyzer/Core/IssueHash.h
33

UniqueLoc -> IssueLoc ?

utils/analyzer/CmpRuns.py
74

We should use the new hash here.

xazax.hun marked 2 inline comments as done.Oct 6 2015, 1:18 AM

Generating mangled names requires ASTContext which is not available during the error reporting. BugReporter does have the ASTContext, so it would not
be a big change to add it to the DiagnosticConsumers though. And I think the mangled name might contain too much redundant information. The only
relevant information here is the fully qualified name and the parts of the signature that can be ocerloaded on e.g.: constness. Using this method might also
be easier to debug, since the IssueString is more readable.

I think it'd be fine to pass ASTContext to the DiagnosticConsumers. It would be useful to find out exactly what extra information the mangled names have that we do not want to see in the issue_hash.

I looked into the name mangling and found the following downsides:

  • The name mangling is different when clang is running in msvc compatible mode. If we want to consistent hashes on all platforms, we should not rely on mangled names.
  • The name mangling contains the calling convention which is redundant information.
  • Some attributes have effect on name mangling.
  • Linkage is included in mangled name.

In general using mangled name might cause unexpected behaviour for users. E.g. a user might not expect that making a function extern "C" changes the hashes in the function.

One issue that I see with the current approach is that we do not differentiate methods from different classes/namespaces. Is this by design?

The implementation do differentiate. It uses qualified names whenever a name is queried, which contains all of the enclosing namespaces and classes.

I definitely agree. It would be great to have a unique identifier for checkers that is independent from the name/package. It is not a trivial problem however,
since we probably want to support plugins too. I can think of a similar solution like git commit ids, but I think this problem might be out of the scope of this
patch. But I am happy to start a discussion on the mailing list and create a new patch to solve this.

Sounds good to me. I agree that this is out of scope for this patch.

xazax.hun updated this revision to Diff 36586.Oct 6 2015, 1:24 AM
  • Updated to latest trunk.
  • Addressed the comments.
zaks.anna accepted this revision.Oct 6 2015, 9:57 AM
zaks.anna edited edge metadata.

Thanks for working on this!

LGTM, Anna.

This revision is now accepted and ready to land.Oct 6 2015, 9:57 AM

Thank you for the review!

Before committing this I would like to have a policy regarding future changes and document it inside the IssueHash header.
My proposed policy is the following:

  • Do not change the calculation of issue hash unless we have a very good reason to do so.
  • Every time the way of calculation changes the name of the hash in the plist should be changed (this makes it possible for users to patch clang to generate both old and new hash for those who are relying on the old hash).

I have some questions though:

  • Should we require the generation of old hashes once a change is introduced, or should we expect users who rely on old hash to maintain the old hash generation as an out of tree patch?
  • The hash calculation WILL change in the near future once we figured out how to identify checkers properly (but I think it will not make sense to rename the hash for this change). For this reason I think we should mark this feature as experimental, until that change is introduced. What is the recommended way, to do that? Generating a comment to the plist? Just adding a comment to the headers? Only mention it in the commit log?

What do you think?

  • Should we require the generation of old hashes once a change is introduced, or should we expect users who rely on old hash to maintain the old hash generation as an out of tree patch?

I will likely release the analyzer with all the previous hashes generated by default. I am happy to enable old hashes out of tree, as long as enabling is a small change.

  • The hash calculation WILL change in the near future once we figured out how to identify checkers properly (but I think it will not make sense to rename the hash for this change). For this reason I think we should mark this feature as experimental, until that change is introduced. What is the recommended way, to do that? Generating a comment to the plist? Just adding a comment to the headers? Only mention it in the commit log?

How close is "the near future"? I would like to start using the hashing feature in the next couple of weeks. If your checker identification improvements are a long time out, I would like you to submit the current hash as non-experimental.

Best regards,
Phillip
SN Systems - Sony Computer Entertainment

  • Should we require the generation of old hashes once a change is introduced, or should we expect users who rely on old hash to maintain the old hash generation as an out of tree patch?

I will likely release the analyzer with all the previous hashes generated by default. I am happy to enable old hashes out of tree, as long as enabling is a small change.

In case the clang maintainers do not want to support old hashes, I think it should be as easy to maintain them as out of tree patches as possible.

  • The hash calculation WILL change in the near future once we figured out how to identify checkers properly (but I think it will not make sense to rename the hash for this change). For this reason I think we should mark this feature as experimental, until that change is introduced. What is the recommended way, to do that? Generating a comment to the plist? Just adding a comment to the headers? Only mention it in the commit log?

How close is "the near future"? I would like to start using the hashing feature in the next couple of weeks. If your checker identification improvements are a long time out, I would like you to submit the current hash as non-experimental.

As soon as I get some feedback for this mail: http://lists.llvm.org/pipermail/cfe-dev/2015-October/045368.html . I think Anna and her team is busy right now due to the LLVM Meeting. If I do not get a response, I will commit this patch as is, and address those questions in a separate commit. Once the problem with the checker identifier is resolved, I think this feature should no longer be treated as experimental.

Best regards,
Phillip
SN Systems - Sony Computer Entertainment

How close is "the near future"? I would like to start using the hashing feature in the next couple of weeks. If your checker identification improvements are a long time out, I would like you to submit the current hash as non-experimental.

As soon as I get some feedback for this mail: http://lists.llvm.org/pipermail/cfe-dev/2015-October/045368.html . I think Anna and her team is busy right now due to the LLVM Meeting.

IMO the idea makes good sense and I can benefit from it when moving checkers from experimental packages to other packages.

If I do not get a response, I will commit this patch as is, and address those questions in a separate commit.

Thanks.

Should we require the generation of old hashes once a change is introduced, or should we expect users
who rely on old hash to maintain the old hash generation as an out of tree patch?

We should maintain the old hashes, at least for a while, if there are people who rely on them.

However, we should not treat every commit that changes the hash as creating a new hash that we have to maintain forever. We should determine the point when the newly developed hash is considered stable/good enough and commit to maintaining it at that time. The way I see it is that the experimental hashes could be committed, improved upon, and tested by various "consumers". After all the planned features are done and bugs are fixed, we can assume that the hash is no longer experimental.

We could have process around this as well if the community thinks it's useful. For example, adding experimental to the name of the hash until we commit to it if you think that people might get confused. Another thought would be to tie the hashes to the open source llvm releases, however, those are not very frequent.

The hash calculation WILL change in the near future once we figured out how to identify checkers
properly (but I think it will not make sense to rename the hash for this change). For this reason I think we > should mark this feature as experimental, until that change is introduced.

I agree.

If we plan on fixing the checker name problem soon, I think it's worthwhile waiting for it.
I did not have time to think about the issue much (I am traveling at a conference this week); however, we can come up with a more flexible solution than just stripping out the package names. (We might also want to consider how checker plug-ins would work with this. Though, that is not critical.)

What is the recommended way, to do that? Generating a comment to the plist?

That'd be hard to miss...

Hi Gabor,

I was away from the office for 3 weeks so I couldn't catch up the issues but you almost completed the patch now.
Thanks for this work!

I just found a tiny problem and described it in the line comment so please have a look at it.

utils/analyzer/CmpRuns.py
73

I think we should also modify 'issue_hash' to 'issue_hash_content_of_line_in_context' in if statement.

xazax.hun closed this revision.Oct 22 2015, 5:00 AM

Committed in r251011.

Hi Gabor,

I was away from the office for 3 weeks so I couldn't catch up the issues but you almost completed the patch now.
Thanks for this work!

I just found a tiny problem and described it in the line comment so please have a look at it.

Thank you for spotting this! It is fixed in the committed version.

We are working on tools that use the new hash for bug suppression. There seems to be no way to predict the names of future hashes. We have products (that will use the bug identification) that are on a different release schedule to our clang compiler. These tools will not be able to take advantage of new hashes, unless they know the future hash names.

Regarding the previous discussions on the hash name:

Maybe we could have "issue_hash", "issue_hash_1"(offset based), and "issue_hash_2"(content of line) and add another field "issue_hash_version" that describes the version "issue_hash" is using?

This would allow tools to discern the order of the hashes from the plist file. It would also be possible to identify new hashes and those that are no longer supported. These are features we would like in order to make our tools as flexible as possible.

I'd suggest to suffix each issue hash field with the description of that hash. For example, we would remove "issue_hash" and replace it with "issue_hash_function_offset" and add "issue_hash_content_of _line_in_context".

This makes forwards compatibility difficult, since there is no way to predict the names of future hashes (As far as I understand).

To resolve the forwards compatibility issues, what are peoples opinions on:

  • Having a consistent name with an incrementing number (i.e. issue_hash_1)?
  • Adding an ordered list of all the hash names to the plist file?

Regards,
Sean Eveson
SN Systems - Sony Computer Entertainment

Thank you for pointing this out. I think it would be great to be forward compatible.
Right now one could use a heuristic such as for every hash the key has the "hash" string in its name.
This information might not be sufficient, however, since it would be great to have some kind of ordering between the generated hashes, so a tool could choose the latest (and possibly greatest) one.

To resolve the forwards compatibility issues, what are peoples opinions on:

  • Having a consistent name with an incrementing number (i.e. issue_hash_1)?
  • Adding an ordered list of all the hash names to the plist file?

A third alternative would be to have both semantic names (containing hash) and a number suffix which indicates the ordering.

A third alternative would be to have both semantic names (containing hash) and a number suffix which indicates the ordering.

Yes that would work, but I don't understand the benefit of having the semantic names in the plist file.

This makes forwards compatibility difficult, since there is no way to predict the names of future hashes
(As far as I understand).

Can you describe what you are trying to achieve?

We can agree that all issue hashes start with "issue_hash" prefix. If you find an entry with "issue_hash" prefix and unknown suffix, you would know that it's new. It would be the same as a number you have not seen so far. What do you plan to do when a new hash is detected?

The reason I like names more than the numbers is that we may use different solutions for issue hash generation and some users might prefer one over the other. It is not necessarily clear which one is the best. Numbers would obfuscate the heuristic used to produce the hash and the quality of the hash and would be mainly based on the time when the hash was introduced.

A third alternative would be to have both semantic names (containing hash) and a number suffix
which indicates the ordering.

If there is a minor enhancement to the existing issue hash method, appending the version number to it is fine by me. Though, this might be confusing in it's own right..

The reason I like names more than the numbers is that we may use different solutions for issue hash generation and some users might prefer one over the other. It is not necessarily clear which one is the best. Numbers would obfuscate the heuristic used to produce the hash and the quality of the hash and would be mainly based on the time when the hash was introduced.

Thanks, I understand where you are coming from now.

This makes forwards compatibility difficult, since there is no way to predict the names of future hashes
(As far as I understand).

Can you describe what you are trying to achieve?

Just for the sake of explaining, lets say in 3 subsequent Analyzer releases the hashes are called “hash_1”, “hash_2” and “hash_3”.

In the first release the suppression tool will record hash_1 to suppress a warning. Some developers will upgrade to the second and third release, where others may jump straight to the third release. Additionally some developers may temporarily downgrade to the second after upgrading to the third release. The suppression tool (which may not be upgraded during this period) needs to maintain suppressions between these version upgrades/downgrades. Also, some developers may upgrade before others, where they all share one repository of suppressions.

Say there is an issue with hash_1 and two warnings collide; this is fixed by hash_2. When the user upgrades to the 2nd release, the suppression tool will record hash_2 on top of hash_1. The tool will then suppress using hash_2 if it is present in the plist file, ignoring hash_1. If the user downgrades and hash_2 is not in the plist file the tool will suppress using hash_1. For this to work the tool needs to know the order in which to look at the hashes.

We can agree that all issue hashes start with "issue_hash" prefix. If you find an entry with "issue_hash" prefix and unknown suffix, you would know that it's new. It would be the same as a number you have not seen so far.

The tool would not be able to establish the order of multiple produced hashes when it was first run.

A third alternative would be to have both semantic names (containing hash) and a number suffix
which indicates the ordering.

If there is a minor enhancement to the existing issue hash method, appending the version number to it is fine by me. Though, this might be confusing in it's own right..

This would work for us, although I agree it might be a little confusing. Would issue_hash_<number>_<name> be better? e.g. "issue_hash_1_content_of _line_in_context".

Just for the sake of explaining, lets say in 3 subsequent Analyzer releases the hashes are called “hash_1”, “hash_2” and “hash_3”.

In the first release the suppression tool will record hash_1 to suppress a warning. Some developers will upgrade to the second and third release, where
others may jump straight to the third release. Additionally some developers may temporarily downgrade to the second after upgrading to the third
release. The suppression tool (which may not be upgraded during this period) needs to maintain suppressions between these version
upgrades/downgrades. Also, some developers may upgrade before others, where they all share one repository of suppressions.

Say there is an issue with hash_1 and two warnings collide; this is fixed by hash_2. When the user upgrades to the 2nd release, the suppression tool will
record hash_2 on top of hash_1. The tool will then suppress using hash_2 if it is present in the plist file, ignoring hash_1. If the user downgrades and
hash_2 is not in the plist file the tool will suppress using hash_1. For this to work the tool needs to know the order in which to look at the hashes.

I still do not understand how your system would work.

If you have multiple users using a bug suppression system, I would design such system using only a single hash version across all users; using a mix seems error prone.. Once all of your users upgrade to a version of the analyzer where a new hash version is available, you upgrade the hash in the database to reflect that.

Let's talk about your example.
You have user1 using hash_version_1 and user2 using hash_version_2. If the second user suppresses an issue and hash_version_2 is used to record that, how will user1 be able to identity that issue?

Also, as I've mentioned before, the newest issue hash would not necessarily be the best.

This would work for us, although I agree it might be a little confusing. Would issue_hash_<number>_<name> be better? e.g.
"issue_hash_1_content_of _line_in_context".

We could have a list of issue hashes and each item in the list containing type_of_hash, hash_version, hash. However, I am not yet convinced that this complexity is justified.

If you have multiple users using a bug suppression system, I would design such system using only a single hash version across all users; using a mix seems error prone.. Once all of your users upgrade to a version of the analyzer where a new hash version is available, you upgrade the hash in the database to reflect that.

We want users to be able to take advantage of the new hash when they upgrade to a new version. There may be a large amount of time between the first user(s) to upgrade and the last. On top of this there is no way to identify when all the users have upgraded.

Let's talk about your example.
You have user1 using hash_version_1 and user2 using hash_version_2. If the second user suppresses an issue and hash_version_2 is used to record that, how will user1 be able to identity that issue?

For each suppressed warning we will store all of the produced hashes, then the tool can compare any generated hash against the stored hashes.

Also, as I've mentioned before, the newest issue hash would not necessarily be the best.

We could have a list of issue hashes and each item in the list containing type_of_hash, hash_version, hash. However, I am not yet convinced that this complexity is justified.

I did not consider that the hashes might be used for something other than suppression. In that case, our problem is identifying which hashes to use for suppression and knowing which order to use them in. Do you have any suggestions?