This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Bug identification("issue_hash") change for CmpRuns.py
AbandonedPublic

Authored by honggyu.kim on Sep 16 2015, 7:55 AM.

Details

Summary

This patch is based on D10305 in order to make it work the existing infrastructure.
By applying this patch, two different bug reports can be compared with existing CmpRuns.py.

Currently, "issue_hash" in plist file is just "line offset from the beginning of function".
But it even cannot distinguish those kind of simple cases that are completely different bugs.

BUG 1. garbage return value

1 int main()
2 {
3   int a;
4   return a;
5 }

test.c:4:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~

BUG 2. garbage assignment

1 int main()
2 {
3   int a;
4   int b = a;
5   return b;
6 }

test.c:4:3: warning: Assigned value is garbage or undefined
  int b = a;
  ^~~~~   ~

Moreover, The following bug is regarded as different from BUG 1.

BUG 3. a single line of comment is added based on BUG 1 code.

1 int main()
2 {
3   // main function
4   int a;
5   return a;
6 }

test.c:5:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~

The comparison result is as follows:

REMOVED: 'test.c:4:3, Logic error: Undefined or garbage value returned to caller'
ADDED: 'test.c:5:3, Logic error: Undefined or garbage value returned to caller'
TOTAL REPORTS: 1
TOTAL DIFFERENCES: 2

This patch generates the "issue_hash" with the following information:

  1. column number
  2. source line string after removing whitespace
  3. bug type (bug message)

This patch cannot be the final solution, but it enhances "issue_hash" to distinguish such kind of cases by generating stronger hash value.

Diff Detail

Event Timeline

honggyu.kim retitled this revision from to [RFC] Bug identification("issue_hash") change for CmpRuns.py.
honggyu.kim updated this object.

I think we can enhance the bug comparison method one by one rather than making a big change at once.
The point of this patch is to make "issue_hash" field in plist more robust while still using existing CmpRuns.py for comparison.

I would like to hear any kind of comment.
Thank you.

honggyu.kim updated this object.Sep 16 2015, 8:00 AM

issue_hash is generated with hash value rather than line offset from function as below:

<key>issue_hash</key><string>765b04830932fef5631bf3c48c4d2db2</string>
zaks.anna edited edge metadata.Sep 17 2015, 6:28 PM

This new patch does not seem to build on top of http://reviews.llvm.org/D10305 but is an alternative way of generating the hash that reuses a lot of the building blocks from the other patch. What is the reason for that?

(It also addresses your comment to this patch and creates the GetIssueHash subroutine as well as addresses my comment and uses the computed hash in issue_hash, which is used by CmpRuns. Thanks!)

I can see two very differen directions on the two version of this patch. I think this is bad and we should pick one.

In the other version we started to exclude some of the stuff (like filename) from the hash, since it is available already in the plist and gives the processing scripts more freedom. While this patch adds bug type to the hash.

I was thinking a bit more, what would be the best way to determine what should we include in the hash.

In order to determine that first we need to define the scope of the hash itself. There are several sensible choices such as:

  • identify bugs that were generated by a specific checker in a translation unit. (i.e. check name is not included in the hash)
  • identify bugs that has the same bugtype in the same translation unit.
  • identify bugs in a translation unit.
  • identify bug within a project.
  • identify bugs globally.
  • etc...

Note that the scope of the hash does not necesserily affects the scope of diffing results, since the information omitted from the hash can be processed separately.

If we exclude something from the hash that is available in the plist we give flexibility to the tools that consume plists, but also give more possibility to diverge in the way bugs are identified.

Just to make it simpler to understand, original strings of issue_hash are as below with this patch:

BUG 1

3$returna;$Garbage return value

BUG 2

3$intb=a;$Assigned value is garbage or undefined

BUG 3

3$returna;$Garbage return value

This consists of the 3 fields:

(1)$(2)$(3)

(1) column number
(2) source line string after removing whitespace
(3) bug type (bug message)

And GetHashOfContent generates 32 letters of hash values based on those strings.
By comparing hash values of those strings, this patch allows CmpRuns.py distinguish such cases.

$ CmpRuns.py bug1/ bug2/
REMOVED: 'bug.c:4:3, Logic error: Undefined or garbage value returned to caller'
ADDED: 'bug.c:4:3, Logic error: Assigned value is garbage or undefined'
TOTAL REPORTS: 1
TOTAL DIFFERENCES: 2

$ CmpRuns.py bug1/ bug3/
TOTAL REPORTS: 1
TOTAL DIFFERENCES: 0

Since CmpRuns.py also compares "file name" and "function name" with "issue_hash" in plist file, we uses 5 different information to compare bug reports.

You made a comment while I was writing other comment :)

This new patch does not seem to build on top of http://reviews.llvm.org/D10305 but is an alternative way of generating the hash that reuses a lot of the building blocks from the other patch. What is the reason for that?

Yes, this patch is not on top of D10305, I just copied some code from it, so this patch doesn't have dependency with it.

(It also addresses your comment to this patch and creates the GetIssueHash subroutine as well as addresses my comment and uses the computed hash in issue_hash, which is used by CmpRuns. Thanks!)

I just tried not to change the existing working behaviour.
Most of code are written by Babati, I just recomposed them. Thanks!

(3) bug type (bug message)

As far as I remember the other version of this patch exclude any message that is displayed to the user deliberately, to make the hash more resilient to changes to those strings. As always, there is a trade off here, and in order to decide we need to have the exact scope of the hash. If we expect to only compare hashes that was generated with the same version of clang, it is great to include that in the hash. If we want to compare hashes that was generated by different versions of clang, this might not be a good idea, depending on how often are those messages changes. The original patch included the check_name in the hash. Unfortunately that might change too.

honggyu.kim updated this object.
honggyu.kim edited edge metadata.

Hi Anna and Gábor,

I modified this patch based on D10305 following your comments.
I think we can distinguish issue_hash and Bug ID. issue_hash is considered as a subset of Bug ID here.

  1. issue_hash contains

(1) column number
(2) source line string after removing whitespace
(3) bug type (bug message)

And it is used in plist, then CmpRuns.py can combine file name, function name with issue_hash.
This patch also added checker name that is already in plist("check_name") to CmpRuns.py for better comparison

  1. Bug ID contains

(1) file name
(2) checker name
(3) function name
(4) column number
(5) source line string after removing whitespace
(6) bug type (bug message)

Bug ID is for uniquing the identifier without further post processing. This is printed to HTML file inside so that HTML reports can be compared with a single identifier.

I would like to get more feedback in order to enhance this method.
Thanks very much for the feedback :)

I agree with Gabor and do not think we should have 2 entries in the plist file for bug identification. This is confusing and I do not see a need for this.

Any comparison script can check for 2 fields in a plist file and concatenate them. We've already discussed this during the patch review for the other patch. For example, the latest version does not include the file name.

I agree with Gabor and do not think we should have 2 entries in the plist file for bug identification. This is confusing and I do not see a need for this.

Any comparison script can check for 2 fields in a plist file and concatenate them. We've already discussed this during the patch review for the other patch. For example, the latest version does not include the file name.

I didn't put 2 entries in plist, I made issue_hash as it was while adding bugid to html report only.
So this patch makes issue_hash based on Babati's patch after removing redundent fields. And it also makes BugID that contains all the information so that it doesn't require post processing.
BugId is for HTML report comparison.

issue_hash is a subset of BugId and CmpRuns.py script will collect more fields with issue_hash so that it contains all information just like BugId here.

Thanks.

zaks.anna requested changes to this revision.Sep 29 2015, 5:56 PM
zaks.anna edited edge metadata.

I find it very confusing to have 2 different fields for bug identification - issue_hash and BugId. Why do we need to treat the HTML reports differently from plist files?

This revision now requires changes to proceed.Sep 29 2015, 5:56 PM

Should this be closed?

honggyu.kim abandoned this revision.Oct 21 2015, 4:18 AM

Hi Anna,

I was away from the office for 3 weeks. Sorry for the late answer.

I find it very confusing to have 2 different fields for bug identification - issue_hash and BugId. Why do we need to treat the HTML reports differently from plist files?

I have read Gabor's patch in D10305, and I see that patch has now accepted. I also agree with his approach and your comment for checker identification as well.

But the reason why I made BugId was for HTML report comparison. As you mentioned before, CmpRuns.py does not work with HTML files, but we need to know which HTML report is newly generated among the entire HTML bug reports.

We can get limited source location information with current CmpRuns.py. But I think we can also compare two different sets of HTML bug reports, then show which HTML reports are newly generated. HTML report can be more helpful for the developers who made the new bugs rather than a single line of bug information in text format.

For example, you could keep the information about the reports in the plist files and use those to render the reports in HTML.

If you're okay with adding HTML file name in plist for each bug, I will prepare a new patch for that.
Thanks for the review!

In D10305#224956, @zaks.anna wrote:
For example, you could keep the information about the reports in the plist files and use those to
render the reports in HTML.

If you're okay with adding HTML file name in plist for each bug, I will prepare a new patch for that.
Thanks for the review!

I think you misunderstood my comment. I am not talking about using the existing HTML files here but rather having an HTML viewer, which you could use to browse source code. This viewer would be extended to read the bug reports from the plist files and display them. Currently, we create an html file with source code + report info for each bug report. This does not scale when you have a lot of reports on a single large file (ex: sqlite).

What I describe above is a larger project. What workflow are you trying to support? I think adding the issue hash to the HTML file is fine if you find it to be useful for your workflow...

In D10305#224956, @zaks.anna wrote:
For example, you could keep the information about the reports in the plist files and use those to
render the reports in HTML.

If you're okay with adding HTML file name in plist for each bug, I will prepare a new patch for that.
Thanks for the review!

I think you misunderstood my comment. I am not talking about using the existing HTML files here but rather having an HTML viewer, which you could use to browse source code. This viewer would be extended to read the bug reports from the plist files and display them. Currently, we create an html file with source code + report info for each bug report. This does not scale when you have a lot of reports on a single large file (ex: sqlite).

What I describe above is a larger project. What workflow are you trying to support? I think adding the issue hash to the HTML file is fine if you find it to be useful for your workflow...

Hi Anna & Kim,

we recognized these scalability issues you just described and that's why we created CodeChecker https://github.com/Ericsson/codechecker/
tool. A HTML report viewer for Clang SA.

Reports are stored in a postgresql db. Each source file is only stored once (unlike scanbuild), supports bug suppression and diff view between runs etc.
And it works pretty fast with many-million lines of code projects.

If you have time, give it a try. we are happy to get feedback...
Regards,
Daniel

Hi Daniel,

Have you considered contributing this work to clang/llvm?

Hi Anna,

I think you misunderstood my comment. I am not talking about using the existing HTML files here but rather having an HTML viewer, which you could use to browse source code. This viewer would be extended to read the bug reports from the plist files and display them. Currently, we create an html file with source code + report info for each bug report. This does not scale when you have a lot of reports on a single large file (ex: sqlite).

I know you already mentioned about the scalability issue but I was thinking to use the existing implementation since I thought we didn't have the HTML viewer you mentioned as of now. (before I saw Daniel's comment)

What I describe above is a larger project. What workflow are you trying to support? I think adding the issue hash to the HTML file is fine if you find it to be useful for your workflow...

My workflow is as follows:

  1. manage buildbot server to get an event whenever a registered action happened for Linux kernel
  2. when an event is received, run clang static analyzer for the current Linux kernel
  3. put the each report in the place where anyone can access through web server

Whenever buildbot runs clang static analyzer for Linux kernel and it shows more than 600 bugs. I've received many feedback from my colleagues. They want to see only diff set in HTML format because it has too many bug reports.

That's why I was thinking about connecting plist info to HTML files regardless of using CmpRuns.py or not.

Hi Anna & Kim,

Hi Daniel,

we recognized these scalability issues you just described and that's why we created CodeChecker https://github.com/Ericsson/codechecker/
tool. A HTML report viewer for Clang SA.

Reports are stored in a postgresql db. Each source file is only stored once (unlike scanbuild), supports bug suppression and diff view between runs etc.
And it works pretty fast with many-million lines of code projects.

If you have time, give it a try. we are happy to get feedback...

I have installed CodeChecker after fixing some problems due to the environmental difference. (I use ubuntu 12.04)

(checker_env) $ CodeChecker server --dbusername test_user --dbport 5432 -w ~/checker_workspace -v 8080
[WARNING] - WARNING! No suppress file was given, suppressed results will be only stored in the database.
[INFO] - Checking for database
[INFO] - Waiting for client requests on [localhost:8080]

It seems that it works fine with localhost but cannot connect it from remote browsers.

$ curl localhost:8080
   ... works fine ...

$ curl <ip addr>:8080
curl: (7) couldn't connect to host

Can I use CodeChecker Viewer as a web server and connect it from outside of server?
If you give me some more guide, I will have a look at it more.
Thank you!

Hi Anna & Kim,

Hi Daniel,

we recognized these scalability issues you just described and that's why we created CodeChecker https://github.com/Ericsson/codechecker/
tool. A HTML report viewer for Clang SA.

Reports are stored in a postgresql db. Each source file is only stored once (unlike scanbuild), supports bug suppression and diff view between runs etc.
And it works pretty fast with many-million lines of code projects.

If you have time, give it a try. we are happy to get feedback...

I have installed CodeChecker after fixing some problems due to the environmental difference. (I use ubuntu 12.04)

(checker_env) $ CodeChecker server --dbusername test_user --dbport 5432 -w ~/checker_workspace -v 8080
[WARNING] - WARNING! No suppress file was given, suppressed results will be only stored in the database.
[INFO] - Checking for database
[INFO] - Waiting for client requests on [localhost:8080]

It seems that it works fine with localhost but cannot connect it from remote browsers.

$ curl localhost:8080
   ... works fine ...

$ curl <ip addr>:8080
curl: (7) couldn't connect to host

Can I use CodeChecker Viewer as a web server and connect it from outside of server?
If you give me some more guide, I will have a look at it more.
Thank you!

Hi!

By default it is disabled for security reasons. You can enable by passing a command line option to the code checker, something like --not-host-only.

Regards,
Gabor

Hi,

its a good idea to include in LLVM/Clang i will propose it

Hi Daniel,

Have you considered contributing this work to clang/llvm?

It's a good idea I will propose this at cfe-dev.

Daniel

By default it is disabled for security reasons. You can enable by passing a command line option to the code checker, something like --not-host-only.

Thanks, I can remotely access now with --not-host-only.
But I cannot see the correct diff result when I click "Diff" button with two different Run Ids in List of runs menu.
Is it working properly with your bug identification patch? (It has changed "issue_hash" field)
Please correct me if I did something wrong.

And here is one more question. I'm just wondering how CodeChecker runs clang static analyzer. Does it override CC and CXX just like scan-build?
I can only see the actual build command with "-b" option as below:

CodeChecker check --dbusername test_user --dbport 5432 -n test_project_check -w ~/checker_workspace -b "cd my_test_project && make clean && make"

Thank you!

Hi,

You are right the diff is is based on the hash. We already tried to
use an earlier hash generator (before the patch was introduced), which
generates a slightly different plist, that is why the current version
does not work with the patch.
We will fix CodeChecker to use new hash tag introduced in the final
patch (we did not change it so far because we didn't know what will be
the accepted naming convention and plist format).

We use LD_PRELOAD technique the log all the compiler calls, so no CC
or CXX environment variable orverride is necessary.
With the 'export CODECHECKER_VERBOSE=debug' enviromnet variable you
can see the analyzer commands.

Br,
Gyorgy

Hi,

You are right the diff is is based on the hash. We already tried to
use an earlier hash generator (before the patch was introduced), which
generates a slightly different plist, that is why the current version
does not work with the patch.
We will fix CodeChecker to use new hash tag introduced in the final
patch (we did not change it so far because we didn't know what will be
the accepted naming convention and plist format).

Hi, thanks for the explanation. I will try again when it is modified based on the final patch.

We use LD_PRELOAD technique the log all the compiler calls, so no CC
or CXX environment variable orverride is necessary.
With the 'export CODECHECKER_VERBOSE=debug' enviromnet variable you
can see the analyzer commands.

I just cross-compiled Linux kernel using CodeChecker but I don't get the proper analysis result. I have fixed cross-compilation issue by adding --analyzer-target option to scan-build in D10356.
Could you please explain more about LD_PRELOAD technique? I would like to more understand how it works.

Kind regards,

Honggyu

Hi,

Sorry for my late answer.
We fixed the plist parser for the CodeChecker it shoud process the
plist generated by the current trunk clang with the issue hash tags.

Right now CodeChecker does not support forwarding additional options
directly to clang (uses only the compiler options from the original
build command extended with options required for the static analyzer).
We already discussed such features. I think they will be implemented.
If you have any requirements feel free to create a new issue on our
github page, we can discuss them further there.

With the LD_PRELOAD (linux specific) technique we load a shared object
before every executable during the original build. In this shared
object we filter out the compiler calls (gcc, g++, clang ... it is
configurable) and create a compile command json file, like cmake. We
use these compilation commands to run the clang static analyzer.
With this technique we can create the compilation command json file
build system independent.

The official LD_PRELOAD documentation can be found here:
http://man7.org/linux/man-pages/man8/ld.so.8.html

Best regards,
Gyorgy

Please, move the conversation on how to configure/use CodeChecker in the particular setting elsewhere. I think it could be useful to the clang dev list; however, this patch review is not the proper place for it.