run-clang-tidy.py is the parallel executor for clang-tidy. Due to the
common header-inclusion problem in C++/C, diagnostics that are usually emitted
in class declarations are emitted every time their corresponding header is
included.
This results in a *VERY* high amount of spam and renders the output basically
useless for bigger projects.
With this patch run-clang-tidy.py gets another option that enables
deduplication of all emitted diagnostics (off by default). This is achieved by parsing the
diagnostic output from each clang-tidy invocation, identifying warnings and
errors, and collecting lines until the next occurrence of an error or warning. Each collected
diagnostic is hashed and the hash is stored in a set. A new diagnostic is only
emitted if its hash is not already in the set.
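The scheme described above can be sketched roughly as follows. This is a minimal illustration and not the code from the patch: the names `DIAG_START`, `split_diagnostics` and `deduplicate` are hypothetical, and the sketch assumes that a diagnostic starts at any line containing "warning: " or "error: " and that everything up to the next such line (code snippet, caret, notes) belongs to it.

```python
import hashlib
import re

# Assumption: a diagnostic begins at a line containing "warning: " or "error: ".
DIAG_START = re.compile(r"(warning|error): ")


def split_diagnostics(output):
    """Split clang-tidy output into blocks that each start at a warning/error line."""
    blocks = []
    current = []
    for line in output.splitlines():
        if DIAG_START.search(line):
            # A new diagnostic starts, so the previous one is complete.
            if current:
                blocks.append("\n".join(current))
            current = [line]
        elif current:
            # Continuation line (source snippet, caret, note) of the current diagnostic.
            current.append(line)
        # Lines before the first diagnostic are ignored in this sketch.
    if current:
        blocks.append("\n".join(current))
    return blocks


def deduplicate(output, seen):
    """Return only the diagnostics whose SHA256 hash is not yet in `seen`."""
    fresh = []
    for block in split_diagnostics(output):
        digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(block)
    return fresh
```

In a parallel runner the `seen` set is shared between all worker threads, so calls to `deduplicate` would have to be guarded by a lock.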
Numbers to show the issue
I am currently creating a buildbot for running clang-tidy over real-world projects, and some of my experience comes from there. I reproduced one specific case for this test. It is not made up and is not even the worst case I have seen.
Running clang-tidy's misc module over llvm/lib:
/fast_data2/llvm/tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py \
  -checks=-*,misc-* \
  -header-filter=".*" \
  -clang-tidy-binary /fast_data2/llvm/build_clang_fast/bin/clang-tidy \
  -fix \
  lib/ \
  2>/fast_data2/rct_dedup_lib.err.misc \
  1>/fast_data2/rct_dedup_lib.out.misc
produces over 300MB of diagnostic output. The run-clang-tidy.py script consumes up to 0.8% of the 32GB of RAM on my machine (about 0.26GB).
373K Nov 5 22:48 rct_lib.err.misc
306M Nov 5 22:48 rct_lib.out.misc
Doing the same analysis but with -deduplication enabled results in 5.4MB of diagnostic output (about 57 times less, almost two orders of magnitude!) and run-clang-tidy.py only consumes up to 0.5% of the 32GB of RAM (about 0.16GB).
373K Nov 5 23:13 rct_dedup_lib.err.misc
5,4M Nov 5 23:13 rct_dedup_lib.out.misc
Notes
The difference in RAM usage for the run-clang-tidy.py script seems suspicious, as one would expect the deduplication overhead to need more RAM than only printing the output.
It might be a memory leak in the script or some other effect. To my surprise we are better off deduplicating. I did not measure run-time differences, but I suspect the run time decreases as well, as piping hundreds of MB through stdout in Python is probably slower.
I found multiple checks that are specifically prone to producing *A LOT* of spam, e.g. bugprone-macro-parentheses. I collected statistics in my buildbot, where the spammy checks easily produced 100 times the output they needed to (consistent with the finding in the llvm/lib example).
Running modules with spam-prone checks over the whole of LLVM resulted in ~GB of log output. I could not measure more because my buildbot just refused to give me the full log files.
Correctness
I checked the log files against the output of grep "warning: " | sort | uniq -c | sort -n -r. It showed that every diagnostic in the deduplicated output occurred exactly once.
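For reference, the same check can be sketched in Python. This is only an illustration of the grep pipeline above, not part of the patch; the function name `duplicated_warnings` is hypothetical, and like the grep it only looks at lines containing "warning: ".

```python
from collections import Counter


def duplicated_warnings(log_text):
    """Return every warning line that occurs more than once, with its count."""
    counts = Counter(
        line for line in log_text.splitlines() if "warning: " in line
    )
    return {line: n for line, n in counts.items() if n > 1}
```

An empty result for a deduplicated log file means that every warning line occurred exactly once.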
The hashing is done with SHA256, which is considered to be secure, so no collisions are expected. For this use case MD5 might even be viable, but judging by the htop output
the 16 cores of my machine were all fully loaded, so there doesn't seem to be a performance issue from too-slow hashing or similar (the parsing is done within the lock, so there is no parallelization there!).
This symlink is required for my unit tests. I don't know how to add the new tests to the lit test suite, so there is no change there yet. A bit of guidance there would be nice :)