This is an archive of the discontinued LLVM Phabricator instance.

[clang-tidy] add deduplication support for run-clang-tidy.py
AbandonedPublic

Authored by JonasToth on Nov 6 2018, 2:19 AM.

Details

Summary

run-clang-tidy.py is the parallel executor for clang-tidy. Due to the
common header-inclusion problem in C++/C diagnostics that are usually emitted
in class declarations are emitted every time their corresponding header is
included.

This results in a *VERY* high amount of spam and renders the output basically
useles for bigger projects.
With this patch run-clang-tidy.py gets another option that enables
deduplication of all emitted diagnostics(by default off). This is achieved with parsing the
diagnostic output from each clang-tidy invocation, identifying warnings and
error and parsing until the next occurrence of an error or warning. The collected
diagnostic is hashed and stored in a set. Every new diagnostic will only be
emitted if its hash is not in the set already.

Numbers to show the issue

I am currently creating a buildbot for running clang-tidy over real world projects. Some experience comes from there, I reproduced one specific case for this test. It is not made up and not even the worst I could see.

Running clang-tidys misc-module over llvm/lib:

/fast_data2/llvm/tools/clang/tools/extra/clang-tidy/tool/run-clang-tidy.py \
    -checks=-*,misc-* \
    -header-filter=".*" \
    -clang-tidy-binary /fast_data2/llvm/build_clang_fast/bin/clang-tidy \
    -fix \
    lib/ \
    2>/fast_data2/rct_dedup_lib.err.misc \
    1>/fast_data2/rct_dedup_lib.out.misc

produces over 300MB of diagnostic output. The run-clang-tidy.py script consumes up to 0.8%*32GB of RAM on my machine.

373K Nov  5 22:48 rct_lib.err.misc
306M Nov  5 22:48 rct_lib.out.misc

Doing the same analysis but with -deduplication enabled results in 5.4MB of diagnostic output (two orders of magnitude less!) and run-clang-tidy.py only consumes up to 0.5%*32GB of RAM.

373K Nov  5 23:13 rct_dedup_lib.err.misc
5,4M Nov  5 23:13 rct_dedup_lib.out.misc

Notes

The difference in RAM usage for the run-clang-tidy.py script seems suspicious as one would expect the duplication overhead should need more RAM as only printing the stuff out.
It might be a memory leak in the script of some other effect. To my surprise we are better of deduplicating. I did not measure run-time differences but I suspect they decrease as well, as piping hundreds of MB through stdout in python is probably slower.

I found multiple checks that are specifically prone to producing *A LOT* of spam, e.g. bugprone-macro-parentheses. I did statistics in my buildbot where the spammy checks easily had 100x times the output then they needed to have (consistent with the finding in the llvm/lib example).
Running modules with spam-prone checks over the whole of LLVM resulted in ~GB of log-output. I could measure more because my buildbot just refused to give me the full log-files.

Correctness

I did check against a grep "warning: " | sort | uniq -c | sort -n -r output for the log-files. They showed every diagnostic in the deduplicated output occured exactly once.
The hashing is done with SHA256 with is considered to be secure, so there are no collision expected. For this use-case MD5 might even be viable, but by inspecting htop output
the 16 cores of my machine were all fully loaded, so there doesn't seem to be a performance issue from to slow hashing or similar (the parsing is done within the lock, so no parallelization there!).

Event Timeline

JonasToth created this revision.Nov 6 2018, 2:19 AM
JonasToth updated this revision to Diff 172726.EditedNov 6 2018, 2:21 AM
  • spurious change in my git?
JonasToth edited the summary of this revision. (Show Details)Nov 6 2018, 2:41 AM
JonasToth added a project: Restricted Project.
JonasToth added inline comments.
clang-tidy/tool/run_clang_tidy.py
1

This simlink is required for my unittests, I don't know how to add the added tests in the lit test-suite so there is no change there yet. A bit of guidance there would be nice :)

hokein added a comment.Nov 6 2018, 7:20 AM

Thanks for the patch and nice improvements.

Some initial thoughts:

  • The output of clang-tidy diagnostic is YAML, and YAML is not an space-efficient format (just for human readability). If you want to save space further, we might consider using some compressed formats, e.g. llvm::bitcode. Given the reduced YAML result (5.4MB) is promising, this might not matter.
  • clang-tidy itself doesn't do deduplication, and run-clang-tidy.py seems an old way of running clang-tidy in parallel. The python script seems become more complicated now. We have AllTUsToolExecutor right now, which supports running clang tools on a compilation database in parallel, so another option would be to use AllTUsToolExecutor in clang-tidy, and we can do deduplication inside clang-tidy binary (in reduce phase), which should be faster than the python script (spawn new clang-tidy processes and do round-trip of all the data through YAML-on-disk).

This feature seems like a good idea. I started writing it too some months ago, but then I changed tactic and worked on distributing the refactor over the network instead. As far as I know, your deduplication would not work with a distributed environment.

However, it seems that both features can exist.

You use a regex to parse the clang output. Why not use the already-machine-readable yaml output and de-duplicate based on that? I think the design would be something like:

  • Run clang-tidy in a quiet mode which only exports yaml and does not issue diagnostics
  • Read the yaml in your python script
  • Add the entries to your already-seen cache
  • For any entry which was not already there
    • Write the entries to a new yaml file
    • Use clang-apply-replacements --issue-diags the_new_file.yaml to actually cause the new diagnostics to be issued (they were omitted from the clang-tidy run).

This avoids fragile parsing of the output from clang, instead relying on the machine-readable format.

I think clang-apply-replacements already does de-duplication, so it's possible that could take more responsibility.

Also, I think your test content is too big. I suggest trying to write more contained tests for this.

JonasToth added a comment.EditedNov 6 2018, 8:24 AM
  • The output of clang-tidy diagnostic is YAML, and YAML is not an space-efficient format (just for human readability). If you want to save space further, we might consider using some compressed formats, e.g. llvm::bitcode. Given the reduced YAML result (5.4MB) is promising, this might not matter.

The output were normal diagnostics written to stdout, deduplication happens from there (see the test-cases). The files i created were just through piping to filter some of the noise.
Without de-duplication its very hard to get something useful out of a run with many checks activated for bigger projects (e.g. Blender and OpenCV are useless to try, because they have some commonly used macros with a check-violation. The buildbot filled 30GB of RAM before it crashed and couldn't even finish the analysis of the project. Similar for LLVM).

I would like to try the simple deduplication first and see if space is still an issue. After all I want to just read the diagnostic and see whats happening instantly and a more compressed format might not help there.

  • clang-tidy itself doesn't do deduplication, and run-clang-tidy.py seems an old way of running clang-tidy in parallel. The python script seems become more complicated now. We have AllTUsToolExecutor right now, which supports running clang tools on a compilation database in parallel, so another option would be to use AllTUsToolExecutor in clang-tidy, and we can do deduplication inside clang-tidy binary (in reduce phase), which should be faster than the python script (spawn new clang-tidy processes and do round-trip of all the data through YAML-on-disk).

I agree that AllTUsToolExecutor would be better instead of the python script, but i think getting this done takes longer, then just patching the script now. From the patch here (it is an by-default off option as well) it is easier to test all pieces of clang-tidy. From there we can easily migrate to something better then `run-clang-tidy.py´.

The deduplication within clang-tidy would be the best option! But for full deduplication the parallelization must happen first.

The python script seems become more complicated now.

A bit, yes. The actual calling of clang-tidy and other parts are not touched. Just the parser adds additional complexity, which is covered in the unit tests. I don't think this solution lives for ever, but its fast and effective. Its optional and by default off.

Thank you for the comment!

This feature seems like a good idea. I started writing it too some months ago, but then I changed tactic and worked on distributing the refactor over the network instead. As far as I know, your deduplication would not work with a distributed environment.

I agree that it would probably not work. It might enable a two-stage deduplication, but I don't know if that would be viable.

However, it seems that both features can exist.

You use a regex to parse the clang output. Why not use the already-machine-readable yaml output and de-duplicate based on that? I think the design would be something like:

  • Run clang-tidy in a quiet mode which only exports yaml and does not issue diagnostics
  • Read the yaml in your python script
  • Add the entries to your already-seen cache
  • For any entry which was not already there
    • Write the entries to a new yaml file
    • Use clang-apply-replacements --issue-diags the_new_file.yaml to actually cause the new diagnostics to be issued (they were omitted from the clang-tidy run).

This avoids fragile parsing of the output from clang, instead relying on the machine-readable format.

In principle this approach seems more robust and I am not claiming my approach is robust at all :)
The point hokein raised should be considered first in my opinion. If clang-tidy itself is already parallel we should definitely deduplicate there. This is something I would put more
effort in. The proposed solution is more a hack to get my buildbot running and find transformation bugs and provide real-world data for checks we implement. :)

I think clang-apply-replacements already does de-duplication, so it's possible that could take more responsibility.

Yes, the emitted fixes are deduplicated but i think we need something even if no fixes are involved.

Also, I think your test content is too big. I suggest trying to write more contained tests for this.

I wanted to have a mix of both real snippets and some unit-tests on short examples. Do you think its enough if i shorten the list of fields that the CSA output contains for the padding checker?

Thank you for the comment!

This feature seems like a good idea. I started writing it too some months ago, but then I changed tactic and worked on distributing the refactor over the network instead. As far as I know, your deduplication would not work with a distributed environment.

I agree that it would probably not work. It might enable a two-stage deduplication, but I don't know if that would be viable.

Yes, I think the distributed refactoring would benefit from the design I outlined - issuing diagnostics from the yaml files.

In principle this approach seems more robust and I am not claiming my approach is robust at all :)
The point hokein raised should be considered first in my opinion. If clang-tidy itself is already parallel we should definitely deduplicate there. This is something I would put more
effort in. The proposed solution is more a hack to get my buildbot running and find transformation bugs and provide real-world data for checks we implement. :)

Yes, I think it makes sense to do something more robust. I understand you're yak shaving here a bit while trying to reach a higher goal.

The AllTUsToolExecutor idea is worth exploring - it would mean we could remove the threading from run-clang-tidy.py.

I don't think we should get a self-confessed hack in just because it's already written.

However, AllTUsToolExecutor seems to not create output replacement files at all, which is not distributed-refactoring-friendly.

I think clang-apply-replacements already does de-duplication, so it's possible that could take more responsibility.

Yes, the emitted fixes are deduplicated but i think we need something even if no fixes are involved.

Maybe my suggestion was not clear. The yaml file generated by clang-tidy contains not only replacements, but all diagnostics, even without a fixit.

So, running clang-apply-replacements --issue-diags the_new_file.yaml would issue the warnings/fixit hints by processing the yaml and issuing the diagnostics the way clang-tidy would have done (See in my proposed design that we silence clang-tidy).

Note also that the --issue-diags option does not yet exist. I'm proposing adding it.

Also, I think your test content is too big. I suggest trying to write more contained tests for this.

I wanted to have a mix of both real snippets and some unit-tests on short examples. Do you think its enough if i shorten the list of fields that the CSA output contains for the padding checker?

It seems that the bulk of the testing part of this commit is parsing a real-world log that you made. I guess if you remove the parsing (by taking a machine-readable approach) that bulk will disappear anyway.

Maybe my suggestion was not clear. The yaml file generated by clang-tidy contains not only replacements, but all diagnostics, even without a fixit.

So, running clang-apply-replacements --issue-diags the_new_file.yaml would issue the warnings/fixit hints by processing the yaml and issuing the diagnostics the way clang-tidy would have done (See in my proposed design that we silence clang-tidy).

Note also that the --issue-diags option does not yet exist. I'm proposing adding it.

At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.
I think just relying on clang-apply-replacements is not ok.
If we do a hybrid: on-the-fly deduplication within clang-tidy/run-clang-tidy.py and a potential final deduplication with clang-apply-replacments we get the best of both worlds.
That fits the distributed use-case as well(? I don't use a distributed system for these things as my projects are too small), because the first stage is local, the second stage central after all local workers are done.

It seems that the bulk of the testing part of this commit is parsing a real-world log that you made. I guess if you remove the parsing (by taking a machine-readable approach) that bulk will disappear anyway.

That is true. The lat bit you have to convince me is on-the-flight output to see whats going on. I personally usually just take the raw textual representation and grep/scroll through it to see whats going on. It might be a bit of a tension between large-scale and small/medium scale applications.

So, running clang-apply-replacements --issue-diags the_new_file.yaml would issue the warnings/fixit hints by processing the yaml and issuing the diagnostics the way clang-tidy would have done (See in my proposed design that we silence clang-tidy).

Note also that the --issue-diags option does not yet exist. I'm proposing adding it.

At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.
I think just relying on clang-apply-replacements is not ok.

I think my proposal is still unclear to you. Sorry about that.

I am proposing on-the-fly de-duplication, but without regex parsing of the diagnostic output of clang-tidy.

My proposal is still the same as I wrote before, but maybe what I write below will be more clear. Sorry if this seems condescendingly detailed. I don't know where the misunderstanding is coming from, so I err on the side of detail:

Imagine you have two cpp files which both include the same header. Imagine also that all 3 files are missing 1 override keyword and you run the modernize-use-override check on the two translation units using run-clang-tidy.py.

Here is what I propose happens:

  • First, assume that the two translation units are processed serially, just to simplify the process as described. You will see that parallelizing does not change anything.
  • clang-tidy gets run on file1.cpp in a way that it does not write diagnostics (and fixes) to stdout/stderr, but only generates a yaml file representing the diagnostics (warnings and fixes).
  • Two diagnostics are created - one for the missing override in file1.cpp and one for the missing override in shared_header.h
  • the on-the-fly deduplication cache is empty, so both diagnostics get added to the on-the-fly deduplication cache
  • Because both were added to the cache, both diagnostics get written to a temporary file foo.yaml
  • clang-appy-replacements --issue-diags foo.yaml is run (some other tool could be used for this step, but CAR seems right to me)
  • clang-appy-replacements --issue-diags foo.yaml causes the two diagnostics to be issued, exactly as they would have been issued by clang-tidy.
  • clang-appy-replacements --issue-diags foo.yaml DOES NOT actually change the source code. It only emits diagnostics
  • Next process file2.cpp
  • Processing file2.cpp results in two diagnostics - one for the missing override in file2.cpp and one for the missing override in shared_header..h
  • NOTE: The diagnostic for the missing override in shared_header.h is a duplicate
  • NOTE: The diagnostic for the missing override in file2.cpp is NOT a duplicate
  • Try to add both to the on-the-fly deduplication cache
  • Discover that the diagnostic for file2.cpp IS NOT a duplicate. Add it to a tempoary bar.yaml (named not to conflict with any other temporary file! This is built into temporary file APIs).
  • Discover that the diagnostic for shared_header.h IS a duplicate. DO NOT write it to the temporary file
  • Run clang-appy-replacements --issue-diags bar.yaml
  • Run clang-appy-replacements --issue-diags bar.yaml causes ONLY the diagnostic for file2.cpp to be issued because that is all that is in the file.
  • At the end, you have a de-duplicated yaml structure in memory. Write it to the file provided to the --export-fixes parameter of run-clang-tidy.py.

Do you understand the proposal now?

This means that you get

  • A deduplicated fixes file
  • De-duplicated diagnostics issued - which means you can process them in your CI system etc.

That fits the distributed use-case as well?

If the distributed system processes more than one file at a time on the remote computer.

I'm not aware of any distributed systems that work that way. I think they all process single files at a time.

However, when the resulting yaml files are sent back to your computer, your computer can deduplicate the diagnostics issued (in my design).

Do you understand the proposal now?

Yes better, I was under the impression that clang-apply-replaments is run on the end and the YAMLs are kept until then. Now its clear.
I assume --issue-diags produce the same result as the normal diagnostic engine. That could work, yes.

clang-tidy does not have a quiet mode though. It has the -quiet option which just does not emit how many warnings were created and suppressed.
Do you have these things already in the pipeline?

Reducing log file size is good idea, but I think will be also good idea to count duplicates. This will allow to concentrate clean-up efforts on place where most of warnings originate.

Reducing log file size is good idea, but I think will be also good idea to count duplicates. This will allow to concentrate clean-up efforts on place where most of warnings originate.

Places that emit a lot of diagnostics, still do. I think the amount of duplicated warnings does not show an urgency with the unique warning.

Do you understand the proposal now?

Yes better, I was under the impression that clang-apply-replaments is run on the end and the YAMLs are kept until then. Now its clear.
I assume --issue-diags produce the same result as the normal diagnostic engine. That could work, yes.

clang-tidy does not have a quiet mode though. It has the -quiet option which just does not emit how many warnings were created and suppressed.
Do you have these things already in the pipeline?

Please let me clarify a bit: -export-fixes _DOES_ emit all warnings to yaml, but clang-tidy still prints the diagnostics out, even in -quiet mode. So a useful deduplication would require changes to the clang-tidy itself if going the YAML route.

Do you understand the proposal now?

Yes better, I was under the impression that clang-apply-replaments is run on the end and the YAMLs are kept until then. Now its clear.
I assume --issue-diags produce the same result as the normal diagnostic engine. That could work, yes.

Great, glad we got that misunderstanding sorted out.

clang-tidy does not have a quiet mode though. It has the -quiet option which just does not emit how many warnings were created and suppressed.
Do you have these things already in the pipeline?

No, I have not started on these.

At least the clang-tidy quiet mode is trivial to implement. Maybe instead of --quiet we could have --stdout=<output_format> where output_format can be one of none, diag, yaml and in the future possibly json (requested here: http://lists.llvm.org/pipermail/cfe-dev/2018-October/059944.html) or cbor, to address the binary output suggestion from @hokein.

At least the clang-tidy quiet mode is trivial to implement. Maybe instead of --quiet we could have --stdout=<output_format> where output_format can be one of none, diag, yaml and in the future possibly json (requested here: http://lists.llvm.org/pipermail/cfe-dev/2018-October/059944.html) or cbor, to address the binary output suggestion from @hokein.

Yes, it would slighlty duplicate export-fixes but i think that is not a big issue.
I think the first steps would be using parallel execution within clang-tidy itself.
After that we can extract the deduplication logic from apply-replacements (if it is actually suitable!) or deduplicate in the diag() emitting phase. A thing we have to keep in mind, that deduplication must happen to the whole warning: blaaa\n note: blaa \n note: 'aslkdjad' here comes the only change between two different warnings construct.
If we don't group the diagnostics properly we have a bad time.

That said, would you agree to have the parser-based deduplication as an developer-only optin solution for now? :)

That said, would you agree to have the parser-based deduplication as an developer-only optin solution for now? :)

If you're suggesting proceeding with this regex based solution, I don't think that's a good idea. Why commit a hack which people will object to ever removing? Just see if we can do the right thing instead.

hokein added a comment.Nov 7 2018, 4:13 AM

If you're suggesting proceeding with this regex based solution, I

don't think that's a good idea. Why commit a hack which people will object to ever removing? Just see if we can do the right thing instead.

+1, my main concern is the complexity of the patch and maintenance burden of the python script.

  • The output of clang-tidy diagnostic is YAML, and YAML is not an space-efficient format (just for human readability). If you want to save space further, we might consider using some compressed formats, e.g. llvm::bitcode. Given the reduced YAML result (5.4MB) is promising, this might not matter.

The output were normal diagnostics written to stdout, deduplication happens from there (see the test-cases). The files i created were just through piping to filter some of the noise.
Without de-duplication its very hard to get something useful out of a run with many checks activated for bigger projects (e.g. Blender and OpenCV are useless to try, because they have some commonly used macros with a check-violation. The buildbot filled 30GB of RAM before it crashed and couldn't even finish the analysis of the project. Similar for LLVM).

I would like to try the simple deduplication first and see if space is still an issue. After all I want to just read the diagnostic and see whats happening instantly and a more compressed format might not help there.

I misthought that the output was the -export-fixes, but what you mean is the stdout of clang-tidy.

Could you please explain your motivation of catching clang-tidy stdout? --export-fixes emits everything of diagnostic to YAML even the diagnostic doesn't have fixes. I guess the reason is that you want code snippets that you could show to users? If so, I think this is a separate UX problem, since we have everything in the emitted YAML, and we could construct whatever messages we want from it. A simpler approach maybe:

  1. run clang-tidy in parallel on whole project, and emits a deduplicated result (fixes.yaml).
  2. run a postprocessing in your buildbot that constructs diagnostic messages from fixes.yaml, and store it somewhere.
  3. do whatever you want with output from 1) and 2).

Step 1 could be done in upstream, probably via AllTUsExecutor, and deduplication can be done on the fly based on <CheckName>::<FilePath>::<FileOffset>; we still need clang-apply-replacement to deduplicate replacements; I'm happy to help with this. Step 2 could be done by your own, just a simple script.

At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.

That's why I suggest using some sort of other space-efficient formats to store the fixes. My intuition is that the final deduplicated result shouldn't be too large (even for YAML), because 1) no duplication 2) these are actual diagnostics in code, a healthy codebase shouldn't contain lots of problem 3) you have mentioned that you use it for small projects :)

@hokein you and I seem to be making the same proposal :)

Could you please explain your motivation of catching clang-tidy stdout? --export-fixes emits everything of diagnostic to YAML even the diagnostic doesn't have fixes. I guess the reason is that you want code snippets that you could show to users? If so, I think this is a separate UX problem, since we have everything in the emitted YAML, and we could construct whatever messages we want from it.

A bit for pragmatic reasons and a bit precaution.

  • You are right with the code-snippet. I want to check for false-positives in new clang-tidy checks, if I can just scroll through and see the code-snippet in question it is just practical.
  • diagnostics in template-code might emit multiple warnings at the same code-position. There is a realistic chance that warning: xy happened here will be the same for all template-instantiations, and only the note: type 'XY' does not match gives the differentiating hint. If dedup happens _ONLY_ on the first warning I fear we might loose valid diagnostics! I did re-evaluate and it seems that the emitted yaml does not include the notes. That is an issues, for example the CSA output relies on the emitted notes that explain the path to the bug.
  • I originally implemented it for my buildbot which parses the check-name, location and so on and then gives an ordered output for each check in a module and so on. I extracted the essence for deduplication. -export-fixes still emits the clang-tidy diagnostics, so for my concrete use-case YAML based de-duplication brings no value in its current form, as my BB still struggles with the amount of stdout.
  1. run clang-tidy in parallel on whole project, and emits a deduplicated result (fixes.yaml).
  2. run a postprocessing in your buildbot that constructs diagnostic messages from fixes.yaml, and store it somewhere.
  3. do whatever you want with output from 1) and 2).

Step 1 could be done in upstream, probably via AllTUsExecutor, and deduplication can be done on the fly based on <CheckName>::<FilePath>::<FileOffset>; we still need clang-apply-replacement to deduplicate replacements; I'm happy to help with this. Step 2 could be done by your own, just a simple script.

At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.

That's why I suggest using some sort of other space-efficient formats to store the fixes. My intuition is that the final deduplicated result shouldn't be too large (even for YAML), because 1) no duplication 2) these are actual diagnostics in code, a healthy codebase shouldn't contain lots of problem 3) you have mentioned that you use it for small projects :)

To 3) I do use it for all kinds of projects, LLVM and Blender are currently the biggest ones. I want to go for LibreOffice, Chromium and so on as well. But right now the amount of noise is the biggest obstacle. My goal is not to check if the project is healthy/provide a service for the project, but check if _we_ have bugs in our checks and if code-transformation is correct, false positives, too much output, misleading messages, ...

To 2) LLVM is very chatty as well, I don't consider LLVM to be a bad code-base. Take readability-braces-around-statements for example. I want to test if the check transform all possible places correctly, LLVM does not follow this style and LLVM has a lot of big headers that implement functionality that are transitively included a lot. LLVM is the one that overflowed my 32GB of RAM :)

To 1) I do agree and the data presented support that. I suspect that Yaml-to-stdout Ratio is maybe 2/3:1? So in the analyzed case we end up with 10-15MB of data instead of ~600MB(all Yaml). Space optimization is something we can tackle after-wards as it does not seem to be pathological after deduplication.

In general: I somewhat consider this patch as rejected, I will still use it locally for my BB, but I think this revision should be closed. We could move the discussion here to the mailing-list if you want. It is ok to continue here as well, as we already started to make plans :)
My opinion is, that we should put as much of the deduplication into clang-tidy itself and not rely on tools like run-clang-tidy.py if we can.

So for me step 1. would be providing AllTUsExecutor in clang-tidy and make it parallel itself. For dedup we need hook the diagnostics. CSA has the BugReport class that could be hashed. clang-tidy currently doesn't have this, maybe a similar approach (or the same?) would help us out.

Push some fixes i did while working with this script, for reference of others
potentially looking at this patch.

JonasToth updated this revision to Diff 175459.Nov 27 2018, 5:25 AM
  • after countless attempts of fixing the unicode problem, it is finally done.
  • Remove all unnecessary whitespace
  • remove the xxx warnings generated. as well

This setup now runs in my BB and is a good approximation (for me) how the
deduplication should work in the future.

Still just for reference and documentation.

  • make the script more useable in my buildbot context
  • reduce the test-files
  • fix unicode issues I encountered while using

LLVM is very chatty as well, I don't consider LLVM to be a bad code-base. Take readability-braces-around-statements for example.

Do we need a llvm-elide-braces-for-small-statements?

This would make a great pre-review check

LLVM is very chatty as well, I don't consider LLVM to be a bad code-base. Take readability-braces-around-statements for example.

Do we need a llvm-elide-braces-for-small-statements?

This would make a great pre-review check

IMHO wouldn't hurt. It could even be a readability, one. But we need general thought on how to deal with conflicting checks, in this case especially.
Maybe we could extend the readability-braces-around-statements check with its AntiCheck and make it configurable. Therefore collision cant be emitted?

If you're suggesting proceeding with this regex based solution, I

don't think that's a good idea. Why commit a hack which people will object to ever removing? Just see if we can do the right thing instead.

+1, my main concern is the complexity of the patch and maintenance burden of the python script.

I think these are reasonable concerns and to a degree I share them. At the same time, I worry we may be leaving useful functionality behind in favor of functionality that doesn't exist and doesn't appear to be moving forward. If we were to move forward with this patch, nothing prevents us from surfacing it more naturally later when we have the infrastructure in place for the better solution, correct?

At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.

That's why I suggest using some sort of other space-efficient formats to store the fixes. My intuition is that the final deduplicated result shouldn't be too large (even for YAML), because 1) no duplication 2) these are actual diagnostics in code, a healthy codebase shouldn't contain lots of problem 3) you have mentioned that you use it for small projects :)

Re: #2, I don't think that assertion is true in practice. I expect there are plenty of projects that contain a lot of clang-tidy diagnostics, especially given that clang-tidy checks tend to have higher false positive rates. Even if clang-tidy checks were not so chatty, "shouldn't" and "don't" are very different measurements.

I'm not suggesting to plow full-steam-ahead with this patch or that the concerns raised here are invalid, but at the same time, I think it does solve a real problem and it would be a shame to lose a workable solution because something better might be possible. If work is taking place to actually implement that something better, then that's a different matter of course. I get the impression though that "something better" is an extensive amount of work compared to what's in front of us; am I misunderstanding?

My opinion is, that we should put as much of the deduplication into clang-tidy itself and not rely on tools like run-clang-tidy.py if we can.

Strong +1. TBH, I was unaware people used run-clang-tidy.py. ;-)

lebedev.ri resigned from this revision.Jul 10 2019, 2:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2019, 2:42 PM
JonasToth abandoned this revision.Feb 21 2022, 9:26 AM

won't happen anymore realistically.