This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Prevent extraneous space insertion in bitshift operators
ClosedPublic

Authored by penagos on Apr 19 2021, 11:17 AM.

Details

Summary

This serves to augment the improvements made in https://reviews.llvm.org/D86581. It prevents clang-format from interpreting bitshift operators as template arguments in certain circumstances. This is an attempt at fixing https://bugs.llvm.org/show_bug.cgi?id=49868

Diff Detail

Event Timeline

penagos requested review of this revision.Apr 19 2021, 11:17 AM
penagos created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptApr 19 2021, 11:17 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
penagos edited the summary of this revision. (Show Details)Apr 19 2021, 11:21 AM
Quuxplusone added inline comments.
clang/unittests/Format/FormatTest.cpp
8010

IMO you should use "test < a | b >> c;" as your test case here, to reassure the reader that it doesn't depend on the fact that ... 1; is visibly not a variable declaration.
Personally I'd also like to see "test<test<a | b>> c;" tested on the very next line, to show off the intended difference between the two. (Assuming that I understand the intent of this patch correctly.)
(I also switched to a bitwise operator just for the heck of it; that makes this expression just a very tiny bit less implausible — but still highly implausible, to the point where I question why we're special-casing it.)

clang/unittests/Format/FormatTest.cpp
8010

Btw, a much-bigger-scope way to fix this would be to teach clang-format about "input encoding" versus "output encoding." The only time clang-format should ever be inserting space in the middle of >> is if it's translating C++11-encoded input into C++03-encoded output. If the input is known to already be C++03-encoded, then breaking up an >> token into a pair of > > tokens is guaranteed to introduce a bug.
Right now, my impression is that clang-format has a concept of "output encoding" (i.e. "language mode") but has no way of knowing the "input encoding."

penagos updated this revision to Diff 338641.Apr 19 2021, 2:20 PM
penagos marked an inline comment as done.Apr 19 2021, 2:38 PM
penagos added inline comments.
clang/unittests/Format/FormatTest.cpp
8010

Thanks for the feedback. Your 2 test suggestions make sense to me; I've updated the patch diff. I hadn't considered teaching clang-format input encoding, but that does sound like the preferable long term solution. This patch is intended to be a lightweight fix to fix a very narrow use case.

penagos updated this revision to Diff 338648.Apr 19 2021, 2:45 PM

Update Format test

MyDeveloperDay accepted this revision.Apr 21 2021, 1:53 AM
MyDeveloperDay added inline comments.
clang/lib/Format/TokenAnnotator.cpp
129

I don't really understand what we are saying here?

This revision is now accepted and ready to land.Apr 21 2021, 1:53 AM
penagos added inline comments.Apr 21 2021, 7:26 AM
clang/lib/Format/TokenAnnotator.cpp
129

Effectively we are checking that, barring intervening whitespace, we are analyzing 2 consecutive '>' tokens. If so, we treat such sequence as a binary op in lieu of a closing template angle bracket. If there's another more straightforward way of accomplishing this check, I'm open to that, but this seemed to be the most straightforward way at the time.

Additionally; barring any other feedback, I'll need someone to land this change as I do not have commit access.

krasimir added inline comments.Apr 22 2021, 2:44 AM
clang/lib/Format/TokenAnnotator.cpp
129

I'm worried that this may regress template code. How does this account for cases where two consecutive >-s are really two closing template brackets, e.g.,
std::vector<std::decay_t<int& >> v;?

In particular, one added test case is ambiguous: >> could really be two closing template brackets:
https://godbolt.org/z/v19hj9vKn

I have to say that my general feeling about trying to disambiguate between bitshifts and template closers is: don't try too hard inside clang-format as the heuristics are generally quite brittle and make the code harder to maintain; in cases where clang-format wrongly detects bitshift as templates, users should add parens around the bitshift, which IMO improves readability.

penagos added inline comments.Apr 22 2021, 1:47 PM
clang/lib/Format/TokenAnnotator.cpp
129

As this patch currently stands, it does not disambiguate between bitshift '>>' operators and 2 closing template brackets, so in your snippet, we would no longer insert a space between the '>' characters (despite arguably being the better formatting decision in this case).

I agree with your feeling that user guided disambiguation between bitshift operators and template closing brackets via parens is the ideal solution and also improves readability, but IMO the approach taken by clang-format to format the '>' token should be conservative in that any change made should be non-semantic altering, which is not presently the case. While the case you mentioned would regress, we would no longer potentially alter program semantics. Thinking about this more, would it make sense to modify the actual white-space change generation later on in the analysis to not break up >> sequences of characters in lieu of annotating the tokens differently as the proposed patch is currently doing?

krasimir added inline comments.Apr 26 2021, 2:33 AM
clang/lib/Format/TokenAnnotator.cpp
129

I tried and can't make this misinterpret two consecutive template > as a bit shift, IMO because this check is guarded by the Left->ParentBracket != tok::less condition. Both std::vector<std::decay_t<int&>> v; and test<test<a | b>> c; below are handled correctly.
I'm less worried about regressions in common template cases now.
Thank you for pointing out altering program semantics, I agree.
Please add a comment about this tradeoff and and a bit of the reasoning behind it in code for future reference.

penagos updated this revision to Diff 340656.Apr 26 2021, 2:42 PM

Add justification comment for changes in parseAngle()

penagos added inline comments.Apr 26 2021, 2:47 PM
clang/lib/Format/TokenAnnotator.cpp
129

I had come to the same conclusion when modifying the conditional here; namely the ParentBracket predicate is what catches the case you were alluding to earlier.
I've added a brief comment to parseAngle() to document the need for the change, explaining the conservative nature of the change w.r.t. nested template cases; thank you for the suggestion.

krasimir accepted this revision.Apr 27 2021, 1:26 AM

Friendly reminder that I need someone to land this for me as I do not have commit access.

@penagos, I'll submit this for you.