This is an archive of the discontinued LLVM Phabricator instance.

[analyzer] Fix HTML report deduplication.
ClosedPublic

Authored by NoQ on Jun 29 2021, 8:09 PM.

Details

Summary

Folks, I deleted some code.

Some of it was... Perl code.

Not sure if anybody noticed but scan-build didn't deduplicate reports correctly. Its algorithm was extremely primitive (basically it hashed each html file and deleted files with duplicate hash). This obviously didn't work with cross-file reports (i.e., bug path starts in the main file but ends in a header) which we intended to deduplicate by the end-of-path location, because the main file name is included in the report and therefore affects md5. D42269 doesn't help either. Long story short, it didn't honor the modern advancements in IssueHash at all.

I could make scan-build read the issue hash but I think my solution is even more elegant so hear me out. I put the issue hash into the report filename instead, replacing the random section. When clang tries to emit a duplicate report, it'll simply fail because the file with that name is already present. Moreover, as per tradition, I reduce the issue hash to the first 6 characters, so bug report filenames look exactly like before (report-123abc.html), except now they're automatically stable and deterministic! Such truncation is, of course, entirely cosmetic and absolutely unnecessary but I think it's pretty cool.

The flag -analyzer-config stable-report-filename=true now becomes redundant because reports are stable anyway (in fact, they weren't actually stable before the patch even with the flag, because they depended on the race between clang invocations to emit the reports; I changed it to include a snippet of the issue hash as well instead of the race-y index). That said, I'm conflicted about removing the flag entirely because it also produces more verbose/readable filenames that people seem to enjoy. I think we should enable it by default instead, as soon as we make sure it doesn't produce extremely long filenames.

scan-build --generate-index-only no longer does deduplication, as seen on the updated test "rebuild_index". If deduplication is desired, it can typically be achieved with a simple

find -name '*.html' -exec mv "{}" . \;

Diff Detail

Event Timeline

NoQ created this revision.Jun 29 2021, 8:09 PM
NoQ requested review of this revision.Jun 29 2021, 8:09 PM

AFAIK from CodeChecker's side we are fine with this change. I think we are not using the stable-report-filename analyzer config either.
Aside from these, the less Perl the better, I guess.

clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
314

I don't frequently see stringstreams in the codebase.
Why don't you use the llvm alternative?

std::string s;
llvm::raw_string_ostream os(s);
vsavchenko accepted this revision.Jun 30 2021, 6:42 AM

This is incredible! Thanks for addressing it! I've encountered this many times.

clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
314

nit: whatever stream you choose, can you please start it with a capital letter? 💅

323–324

Can we make a mirror for this option and mark the other one as deprecated?

This revision is now accepted and ready to land.Jun 30 2021, 6:42 AM
ASDenysPetrov accepted this revision.EditedAug 17 2021, 8:27 AM

Nice work! Unfortunately I'm not able to run tests on my Windows env, but I've run you tests files manually. It works for me.

P.S. BTW, is there any workarounds to make current tests supported on Windows? I know there is REQUIRES instruction (https://llvm.org/docs/TestingGuide.html#constraining-test-execution) but I didn't find any sufficient description of it.

clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
323–324

Nice idea. I'm in favor of a mirorring.

337

Do you think 6 trimmed characters are enough to avoid collisions?

NoQ updated this revision to Diff 368917.Aug 26 2021, 9:59 AM
NoQ marked 4 inline comments as done.

Address review comments!

NoQ added a comment.Aug 26 2021, 9:59 AM

Nice work! Unfortunately I'm not able to run tests on my Windows env, but I've run you tests files manually. It works for me.

P.S. BTW, is there any workarounds to make current tests supported on Windows? I know there is REQUIRES instruction (https://llvm.org/docs/TestingGuide.html#constraining-test-execution) but I didn't find any sufficient description of it.

Yes, you can always remove the REQUIRES: directive. The problem with generally enabling these tests on Windows is that scan-build is written in Perl and the Perl interpreter isn't shipped with Windows by default. Nothing else in LLVM is written in Perl so requiring a Perl installation to run LLVM tests just for this single use case is counter-productive.

clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
323–324

The new option is -analyzer-config verbose-report-filename=true and the old option is preserved and acts exactly the same; the help text says that it's deprecated.

337

It's basically same as before. Previously these 6 digits were completely random, now they're chunks of MD5 hashes of the real thing which is probably as random.

That's 16.7 million variants. The probability of collision on 6 digits with 100 warnings is around 0.0001%, with 1000 warnings it goes up to 3% (Birthday Problem). Even though 1000 warnings are a real possibility on a huge project (eg., LLVM or WebKit), this is way above the point where you care about collisions.

I think there's nothing that prevents us from bumping it (at least in the non-verbose filename mode; in the verbose filename mode it may cause slightly more breakages in the case where it's close to hitting the file system's filename length limit). Bumping from 6 digits to 8 digits would reduce the collision probability on 1000 warnings to 0.0001%. I'll think about it a bit more :)

This revision was landed with ongoing or failed builds.Aug 26 2021, 1:39 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 26 2021, 1:39 PM
clang/test/Analysis/scan-build/rebuild_index/report-3.html