This is an archive of the discontinued LLVM Phabricator instance.

[Diagnostic] Add ftabstop to -Wmisleading-indentation
ClosedPublic

Authored by Tyker on Dec 4 2019, 3:07 PM.

Details

Summary

this allow much better support of codebases like the linux kernel that mix tabs and spaces.

-ftabstop=Width allow specifying how large tabs are considered to be.

Diff Detail

Event Timeline

Tyker created this revision.Dec 4 2019, 3:07 PM
Tyker updated this revision to Diff 232226.Dec 4 2019, 3:13 PM
Tyker marked an inline comment as done.Dec 4 2019, 3:19 PM
Tyker added inline comments.
clang/lib/Parse/ParseStmt.cpp
1261

this is more of a bug fix. although i was not able to produce a breaking test case. i am not convinced it isn't needed.

Tyker added a comment.Dec 5 2019, 1:52 AM

i ran a build of the linux kernel using the stage1 compiler, about 30% of files crashed the compiler in the optimizer. everything was building fine 2 weeks ago with the same configuration but a previous compiler.
but aside from that from that file that complied fine, there was only one warning left. this warning is exactly the same case we had in llvm's source code.

One case in kernel can be easily fixed.

(This patch looks reasonable).

aaron.ballman added inline comments.Dec 5 2019, 5:00 AM
clang/include/clang/Basic/SourceManager.h
1394 ↗(On Diff #232226)

otherwise -> Otherwise,

I'd probably drop the doxygen style /param as well (none of the other arguments are documented that way and this documentation applies to getSpellingColumnNumber() as well anyway.

1395–1400 ↗(On Diff #232226)

Is there a reason we want these to default to false? Given that the option is intended to be used for calculating correct column numbers in the presence of tabs, I would expect that most places would want this to be true instead of false.

clang/lib/Parse/ParseStmt.cpp
1261

This seems unrelated to the tabstop patch, however. I think it should be commit separately.

Tyker updated this revision to Diff 232406.Dec 5 2019, 11:09 AM
Tyker marked 2 inline comments as done.

fixed aaron's comment

Tyker marked an inline comment as done.Dec 5 2019, 11:09 AM
Tyker added inline comments.
clang/include/clang/Basic/SourceManager.h
1395–1400 ↗(On Diff #232226)

i agree in theory but i am worried about the behavior change and potential performance impact of it.
haven't tested it tho.

aaron.ballman added inline comments.Dec 6 2019, 12:41 PM
clang/include/clang/Basic/SourceManager.h
1395–1400 ↗(On Diff #232226)

I think it's worth experimenting to see if any tests break by switching the default to true.

I also think you should perform the timing tests -- this code is quite possibly on the hot path, and others have expressed concern about regressing performance.

clang/test/Parser/warn-misleading-indentation.cpp
307

Missing newline.

Tyker updated this revision to Diff 233678.Dec 12 2019, 1:52 PM
Tyker marked 2 inline comments as done.
Tyker added inline comments.
clang/include/clang/Basic/SourceManager.h
1395–1400 ↗(On Diff #232226)

i ran the testsuite with UseTabsStopOption = false as a default on clang + clangd + clang-tools. there is 13 test failures in total. including complier 3 crashes. so there is a non negligible behavior change.
the best solution is maybe to make a new getVisualColumnNumber which would be the current getSpellingColumnNumber with UseTabsStopOption = true. there is probably many places for which the 'visual' is more important the the actual number of char.

for the performance side:
the 'patched' clang is with all patches concerning the misleading indentation warning.
the 'master' clang was a commit before my first patch about the misleading indentation warning being committed.
clang was built with -02 -march=native but with dynamic linking and no lto nor pgo.
the test was compiling SemaDecl.cpp (one of the biggest files in the repo) with all normal built flags + -fsyntax-only on an otherwise idle pc.

and here are the results in number of seconds of user time on 100 iteration of building SemaDecl.cpp -fsyntax-only.
patched clang:
Min. : 9.612
1st Qu.: 9.838
Median : 9.943
Mean :10.001
3rd Qu.:10.061
Max. :11.573
master master:
Min. : 9.645
1st Qu.: 9.855
Median : 9.939
Mean : 9.974
3rd Qu.:10.062
Max. :10.736

this gives a p-value of 0.418 which is not statistically significant. so I believe it is safe to assume that the difference is not large.

aaron.ballman added a subscriber: rsmith.

Adding @rsmith to weigh in on the default behavior.

clang/include/clang/Basic/SourceManager.h
1395–1400 ↗(On Diff #232226)

i ran the testsuite with UseTabsStopOption = false as a default on clang + clangd + clang-tools. there is 13 test failures in total. including complier 3 crashes. so there is a non negligible behavior change.

Oof, we should correct at least those crashes, that's pretty bad.

the best solution is maybe to make a new getVisualColumnNumber which would be the current getSpellingColumnNumber with UseTabsStopOption = true. there is probably many places for which the 'visual' is more important the the actual number of char.

I'm still not convinced we shouldn't correct the test failures and crashes, but this would be a viable option. @rsmith -- do you have a preference?

1395–1400 ↗(On Diff #232226)

this gives a p-value of 0.418 which is not statistically significant. so I believe it is safe to assume that the difference is not large.

Thank you for running the timing tests -- the difference in speed seems negligible to me.

Adding @rsmith to weigh in on the default behavior.

If we want correct visual columns, we have more work to do: we need to do UTF-8 decoding and use llvm::sys::locale::columnWidth to compute widths of individual characters (byteToColumn in TextDiagnostic.cpp already does this when printing caret diagnostics).

Maybe instead of changing the source manager, we should expose the byte <-> column map from the diagnostic printer and use it from the misleading indentation warning? Or perhaps we can provide an optimized form of it on the basis that we only care about the column number of the first non-whitespace character in the line (so we only need to deal with tabs and spaces) -- but then it's very clear that this is specific to misleading indentation and not a general computation that belongs in the source manager.

clang/include/clang/Basic/SourceManager.h
1393 ↗(On Diff #233678)

Typo throughout: TabsStop -> TabStop.

clang/lib/Basic/SourceManager.cpp
1175 ↗(On Diff #233678)

Is there a reason we can't still use the cache in this case? It seems like the process here should be:

  1. Find the start of the line (possibly from the cache)
  2. Find the number of columns between the start of the line and the query point (either a simple difference or by walking the line looking for tabs)

... and we should only need to alter the second step here.

1203 ↗(On Diff #233678)

This can read one byte past the end of the buffer (FilePos is allowed to point just past the end of the buffer).

I think there's an off-by-one error in here: suppose we have FilePos == 0, LineStart == 0, and the buffer starts with a tab. The correct column number is 1, but we will return TabStop instead. I think we should be looping while LineStart < FilePos and returning VirtualCharCount + 1. (Column numbers in Clang are 1-based, which is why we need the +1.)

Tyker updated this revision to Diff 234499.Dec 18 2019, 4:36 AM
Tyker marked an inline comment as done.

my intent was not necessarily to create a generic solution but to get a solution that is sufficient to not have false positives on the Wmisleading-indentation warning for code base that have mixes of tabs and spaces.

after taking a look at SourceColumnMap and byteToColumn this seems like way overkill for what i am trying to solve.

Maybe instead of changing the source manager, we should expose the byte <-> column map from the diagnostic printer and use it from the misleading indentation warning?

i did not found such a mapping.

after fixing the bug rsmith spotted. there is no test failures in clang + clang-tool + clangd with UseTabStopOption = true (i found this quite surprising but i double checked). the previous test result were incorrect due to this bug sorry. however I am still not sure that UseTabStopOption = true is the right default for every uses of getSpellingColumnNumber.

one visible example of the impact is the indentation in diagnostics on code indented with tabs.
with ftabstop = 8 (the default)

test.cpp:5:3: warning: misleading indentation; statement is not part of the previous 'if'
                return 0;
                ^
test.cpp:3:2: note: previous statement is here
        if (0)
        ^
1 warning generated.

with ftabstop = 2

test.cpp:5:3: warning: misleading indentation; statement is not part of the previous 'if'
    return 0;
    ^
test.cpp:3:2: note: previous statement is here
  if (0)
  ^
1 warning generated.

i don't know if this is a good thing, i never coded with tabs so i don't know the habits.

clang/lib/Basic/SourceManager.cpp
1203 ↗(On Diff #233678)

nice catch.

after fixing the bug rsmith spotted. there is no test failures in clang + clang-tool + clangd with UseTabStopOption = true (i found this quite surprising but i double checked). the previous test result were incorrect due to this bug sorry. however I am still not sure that UseTabStopOption = true is the right default for every uses of getSpellingColumnNumber.

I think that would be a bad idea. The column numbers are often typically used as offsets (e.g. in diagnostics/etc, that's the expectation) and shouldn't generally be changed to visual offsets.

one visible example of the impact is the indentation in diagnostics on code indented with tabs.

That was already the case -- that's basically the only part of Clang which used TabStop already.

I'm not sure this code really needs to be in SourceManager -- it seems maybe simple enough to just special case within MisleadingIndentationChecker, e.g. something like this function:

// Figure out the visually-apparent column number for a SourceLocation, using
// the TabStop value, from the "-ftabstop" option, to expand tabs.
unsigned GetVisualColumnNum(SourceLocation Loc) {
  SourceManager &SM = P.getPreprocessor().getSourceManager();
  unsigned TabStop = P.getActions().getDiagnostics().getDiagnosticOptions().TabStop;

  unsigned ColNo = SM.getSpellingColumnNumber(Loc);
  if (ColNo == 0 || TabStop == 1)
    return ColNo;

  auto FIDAndOffset = SM.getDecomposedLoc(Loc);

  bool Invalid;
  StringRef BufData = SM.getBufferData(FIDAndOffset.first, &Invalid);
  if (Invalid) return 0;

  const char *EndPos = BufData.data() + FIDAndOffset.second;
  assert(FIDAndOffset.second > ColNo && "Column number smaller than file offset?");

  unsigned VisualColumn = 0; // stored as 0-based column, here.
  // Loop from beginning of line up to Loc's file position, counting columns, expanding tabs
  for (const char *CurPos = EndPos - (ColNo - 1); CurPos != EndPos; ++CurPos) {
    if (*CurPos == '\t')
      // Advance visual column to next tabstop
      VisualColumn += (TabStop - VisualColumn % TabStop);
    else
      VisualColumn++;
  }
  return VisualColumn + 1;
}

BTW, the commit here should mark llvm.org/PR44324 as fixed by it.

Tyker updated this revision to Diff 234738.Dec 19 2019, 9:15 AM

i agree that it is the best solution.

xbolva00 added inline comments.Dec 19 2019, 10:00 AM
clang/lib/Parse/ParseStmt.cpp
1217

Typo

riccibruno added inline comments.Dec 19 2019, 10:21 AM
clang/lib/Parse/ParseStmt.cpp
1217

Also, what about a little comment explaining what getAdjustedColumnNumber does ? It is not immediately clear what "adjusted" means here (adjusted for what ?)

riccibruno added inline comments.Dec 19 2019, 10:23 AM
clang/lib/Parse/ParseStmt.cpp
1217

Oh, and it should static.

aaron.ballman added inline comments.Dec 20 2019, 11:36 AM
clang/lib/Parse/ParseStmt.cpp
1217

And you can add a newline above it to give it some visual separation from the previous function.

1224

Please don't use auto as the type is not spelled out in the initialization.

1235

stored -> Stored

1237

Missing a full stop at the end of the sentence.

1241

Missing full stop.

Tyker updated this revision to Diff 235073.Dec 22 2019, 11:40 AM
Tyker marked 8 inline comments as done.

fixed comments.

This comment was removed by Tyker.
aaron.ballman accepted this revision.Dec 24 2019, 8:17 AM

LGTM, thank you!

This revision is now accepted and ready to land.Dec 24 2019, 8:17 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptDec 30 2019, 12:26 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This broke my builds. A configure test that tries to compile int main (void) { for( int i = 0; i < 9; i++ ); return 0; } (a single line source file), with -Wall enabled, now triggers failed asserts:

clang-10: ../tools/clang/lib/Parse/ParseStmt.cpp:1236: static unsigned int {anonymous}::MisleadingIndentationChecker::getVisualIndentation(clang::SourceManager&, clang::SourceLocation): Assertion `FIDAndOffset.second > ColNo && "Column number smaller than file offset?"' failed.

<cut>

1.      oneline.c:1:49: current parser token 'return'
2.      oneline.c:1:17: parsing function body 'main'
3.      oneline.c:1:17: in compound statement ('{}')

This broke my builds.

I pushed a revert for this for now, to unbreak things.

Tyker updated this revision to Diff 236058.EditedJan 3 2020, 7:32 AM

This update should fixes the issue.
the assert was incorrect. because file offsets are 0-based where as column numbers are 1-based.

however the issue can only occur for on the first line. so we can't have a run line. before it to add a test case.

xbolva00 reopened this revision.Jan 3 2020, 11:22 AM

Can you add a test case for that crash? Otherwise OK.

This revision is now accepted and ready to land.Jan 3 2020, 11:22 AM
Tyker added a comment.EditedJan 3 2020, 12:59 PM

Can you add a test case for that crash? Otherwise OK.

as i said. the bug can only occur for one line files. so we can't have a run line and a the crashing line. so i wasn't able to make a test for it.

Can you add a test case for that crash? Otherwise OK.

as i said. the bug can only occur for one line files. so we can't have a run line and a the crashing line. so i wasn't able to make a test for it.

Is it possible to have the RUN bit at the end of the line in a comment, after the test code itself?

Tyker closed this revision.Jan 3 2020, 4:43 PM

Can you add a test case for that crash? Otherwise OK.

as i said. the bug can only occur for one line files. so we can't have a run line and a the crashing line. so i wasn't able to make a test for it.

Is it possible to have the RUN bit at the end of the line in a comment, after the test code itself?

it is possible.

i committed the fix and the tests.

I just ran into this warning, and I think there's a bit of a discoverability problem related to the width of tabs and -ftabstop. If you have mixed tabs and spaces, and you've correctly specified the tab stop width with -ftabstop, everything works fine. If you haven't specified -ftabstop, you get a warning, and the issue isn't obvious. Depending on your editor settings, the code might look fine at first glance, and the warning text doesn't mention -ftabstop at all. Maybe there's something that could be improved here, if we're printing a warning for two lines with mismatched indentation styles?

Tyker added a comment.Feb 22 2020, 2:49 AM

I just ran into this warning, and I think there's a bit of a discoverability problem related to the width of tabs and -ftabstop. If you have mixed tabs and spaces, and you've correctly specified the tab stop width with -ftabstop, everything works fine. If you haven't specified -ftabstop, you get a warning, and the issue isn't obvious. Depending on your editor settings, the code might look fine at first glance, and the warning text doesn't mention -ftabstop at all. Maybe there's something that could be improved here, if we're printing a warning for two lines with mismatched indentation styles?

i agree that this is not obvious. i added a patch to improve this https://reviews.llvm.org/D75009.