This is an archive of the discontinued LLVM Phabricator instance.

Fix overlapping replacements in clang-tidy.
ClosedPublic

Authored by angelgarcia on Oct 7 2015, 10:02 AM.

Details

Summary

Prevent clang-tidy from applying fixes to errors that overlap with other errors' fixes, with one exception: if one fix is completely contained inside another one, then we can apply the big one.

Diff Detail

Event Timeline

angelgarcia retitled this revision from to Fix overlapping replacements in clang-tidy..
angelgarcia updated this object.
angelgarcia added reviewers: klimek, bkramer.
angelgarcia added subscribers: alexfh, cfe-commits.
klimek added inline comments.Oct 13 2015, 2:22 AM
clang-tidy/ClangTidyDiagnosticConsumer.cpp
417–418

These need to be documented.

420

I'd name this Queue instead (reading later I had no idea what this was).

432–435

Why are you calling this "Sit"?

474

Why do we need to sort?

Fix comments.

These need to be documented.

Done.

I'd name this Queue instead (reading later I had no idea what this was).

Done.

Why are you calling this "Sit"?

I didn't even know how to describe this variable without using examples,
and naming it is harder. More or less, it keeps track of the different
overlapping situations that have been spotted during the process, so "Sit"
stands for that. But I know it is a really poor name and I'd like to change
it, any ideas?

Why do we need to sort?

It's a remnant of a different approach that I tried before. Removed.

klimek edited edge metadata.Oct 14 2015, 3:13 AM
clang-tidy/ClangTidyDiagnosticConsumer.cpp
437–439

The description makes it unclear whether the indices stand for the set, or the dimensions stand for the sets (I can see it's the dimensions from where it is set, and that the indices are bools, but that's unclear without that / hard to grasp).
Here's a slightly denormalized way, that uses early-exit and might be easier to understand (more opinions would be nice, though).
(note that I probably messed up the FirstInsideSecond setting somewhere :)

bool Overlaps = false;
bool FirstInsideSecond = false;

// In the loop:
if (Count[0] != 0 && Count[1] != 0) {
  Overlaps = true;
} else if (Overlaps && ((Count[0] != 0 && FirstInsideSecond) ||
                        (Count[1] != 0 && !FirstInsideSecond)) {
  return OK_Overlap;
} else if (Count[0] != 0 || Count[1] != 0) {
  FirstInsideSecond = Count[1] != 0;
}

// After the loop:
if (!Overlaps)
  return OK_Disjoint;
return FirstInsideSecond ? OK_FirstInsideSecond : OK_SecondInsideFirst;
klimek added inline comments.Oct 14 2015, 3:24 AM
clang-tidy/ClangTidyDiagnosticConsumer.cpp
437–439

Ok, what I proposed doesn't work.

I start to like your solution more. Perhaps we just need to document it better:
a) call it Coverage
b) introduce enum with Covered and Empty (or similar)
c) document:
Coverage[Covered][Covered]: both set 1 and set 2 cover an area;
Coverage[Covered][Empty]: set 1 covers an area set 2 doesn't cover;
Coverage[Empty][Covered]: set 2 covers an area set 1 doesn't cover;

angelgarcia edited edge metadata.

Add an enum and rename "Sit" to "Coverage" to improve readability.

klimek added inline comments.Oct 14 2015, 4:36 AM
clang-tidy/ClangTidyDiagnosticConsumer.cpp
443–447

I'd just call out the 3 cases, like I suggested in my previous comment ("etc" sounds for me like there's more than 1 case left ;)

460

I'd say:
// If both sets never cover the same range, there is no overlap.

471–474

I'd make that:
return Coverage[Empty][Covered] ? OK_FirstInsideSecond : OK_SecondInsideFirst;
(and adapt the comment)

klimek added inline comments.Oct 14 2015, 4:54 AM
clang-tidy/ClangTidyDiagnosticConsumer.cpp
494–495

I'm somewhat concerned about the quadratic runtime here, for cases where somebody messes up a {} pair in a header and we have 1000s of errors.
Can we do a two step algorithm:

  • go over all error-lists, compute bounding rectangles for each
  • sort errors
  • only do the squared algorithm for errors where the bounding rectangles overlap

Use bounding boxes to reduce complexity.

LG in general now, looks to me like we have very few tests though.
My favorite strategy to make sure I have enough tests is to comment out code (or do mutations) as long as the tests still pass. Then write tests that fail with the mutation, then undo the mutation.

clang-tidy/ClangTidyDiagnosticConsumer.cpp
491–492

I think a comment at the top of this function outlining the full algorithm would be nice (without lots of details).

518–520

Document what goes into the pair - I assumed 'first, second' for a moment, and then was confused.

djasper added inline comments.
clang-tidy/ClangTidyDiagnosticConsumer.cpp
494–495

From a brief look, it seems like it might be slightly easier, to do what this is currently doing, but skip the bounding-box-calculation step. Simple do the same with each interval of each fix. We are just interested to efficiently calculate, which fixes overlap at all.

Now, I am not sure whether this is going to be more or less efficient:

  • It might be hard / not worth it to calculate which fixes have already been compared so we might compare the same two fixes repeatedly (once for each "touching point").
  • We might not need to carefully compare fixes at all in spite of their bounding boxes overlapping, e.g. two fixes A and B might have 3 entirely disjoint intervals: A, B, A. Then, we don't need to do the sweep at all.

I've done a couple of runs for each version, and these are the results (I
have clang-tidy compiled with the option "RelWithDebInfo"):

$ time clang-tidy -checks=* test.cpp -- -std=c++11

Without looking for overlaps.

Suppressed 23463 warnings (23463 in non-user code).
Use -header-filter=.* to display errors from all non-system headers.

real 2m14.572s
user 2m13.136s
sys 0m0.483s

real 2m15.103s
user 2m13.361s
sys 0m0.687s

Bounding boxes

Suppressed 23463 warnings (23463 in non-user code).
Use -header-filter=.* to display errors from all non-system headers.

real 2m14.208s
user 2m13.051s
sys 0m0.643s

real 2m16.368s
user 2m14.286s
sys 0m0.986s

Quadratic

Suppressed 23463 warnings (23463 in non-user code).
Use -header-filter=.* to display errors from all non-system headers.

real 2m15.130s
user 2m13.627s
sys 0m0.499s

real 2m15.322s
user 2m13.660s
sys 0m0.683s

The time is about the same for all three versions. Note that the first
version doesn't do any sweep at all, and the last version invariably does
(23463 choose 2) = 275244453 sweeps. The amount of time required to do this
seems to be too low compared to the time that clang-tidy takes to output
these diagnostics.
Also, the fact that all three versions take about the same time is a bit
suspicious, but I double-checked that I was doing it right (I did a small
file with would cause overlapping and I checked if the message "note: this
fix will not be applied because it overlaps with another fix" was there
before each run).

I intended to implement Daniel's idea to check out which one was more
efficient, but with these results in sight I don't think it is worth it.

There might be an even easier algorithm:

What if we put all start and end points into a single sorted list. Identical start points are sorted by decreasing end points and identical end points are sorted by increasing start points. Now do a single sweep and simply count the total number of starts vs. the total number of ends, call this C (if you encounter a start point, ++C, if you encounter an end point, --C). If you encounter a start-point and C != 0, mark the interval and corresponding Fixit as inapplicable. Similarly if you encounter an end point and C != 1, mark the fixit as inapplicable. Basically, this does a single sweep over all fixits and marks the ones as bad that have either their start point or their endpoint within a different interval.

(Take this as an additional idea, if you have an existing algorithm that you like better, I am fine with keeping that. Just wanted to give my additional thoughts).

That works pretty well (well, identical end points have to be sorted
decreasingly, if I understand correctly). I see a couple of problems
though, like equal intervals, or consecutive. But the algorithm is way more
efficient than what we currently have, I will try to make it work.

I did several mutations and the only case were a test didn't break was when I removed the sort, but it turned out that we don't need it. I changed the tests to apply the checks in both orders to ensure that a test will break if the errors are not sorted.

I am still thinking about how to complete Daniel's idea.

I cannot find a way to make Daniel's idea work with equal intervals:

In this case, fix A can be applied because B is completely contained inside
it.
A: [a, b)[c, d)
B: [a, b)

This time, we should not apply anyone:
A: [a, b)
B: [a, b)

And here they both have to be discarded again, but for a different reason:
A: [a, b)[c, d)
B: [e, f)[a, b)

The problem is that we have three completely different situations that are
the same from the point of view of interval "[a, b)". The local situation
of individual intervals doesn't provide enough information.

I implemented this, with the following addition: if several errors share
the same interval, I can still apply the biggest one that was not discarded
during the sweep. This way, the first example would work and in the two
other examples it would just apply a random one, as you suggested. But I
found a different case that also fails:
A: [a, b)[c,d)
B: [e, f) such that e < a, b < f < c.

In this case it just discards A. This is not incorrect (as long as we don't
apply both, everything is OK), but it gets away from the idea of "if two
fixes overlap, discard both", which is what Manuel said when I started
this. I don't think that changing that idea is a problem, but right now the
behavior of the tool is so diffuse that maintaining and testing it would be
a bit painful. Also, if we want to allow applying one of the fixes when
they overlap, we have a different problem, it might be better just to start
from scratch and think about how to solve that problem.

New algorithm :)

klimek accepted this revision.Oct 16 2015, 4:24 AM
klimek edited edge metadata.

lg

clang-tidy/ClangTidyDiagnosticConsumer.cpp
488

Perhaps call this OpenIntervals, or if you like it short, just Open.

492–493

s/have/has/

This revision is now accepted and ready to land.Oct 16 2015, 4:24 AM
angelgarcia edited edge metadata.

Remove unused include, fix typo and rename Count to OpenIntervals.

angelgarcia closed this revision.Oct 16 2015, 4:45 AM