This is an archive of the discontinued LLVM Phabricator instance.

format: Remove redundant calls to guessIsObjC to speed up clang-format on unknown file types
Needs RevisionPublic

Authored by davidvc1 on Nov 30 2021, 6:26 PM.

Details

Summary

Running clang-format on the following input

int lambdas() {
  return [&] { 
  return [&] { 
  return [&] { 
  return [&] { 
  return [&] { 
  return [&] { 
  return [&] { return 3; } (); 
  } (); } (); } (); } (); } (); } (); }

will finish instantly if you pass clang-format a .cpp input with this content,
but hang for tens of seconds if you pass the same via stdin.

Adding some debug statements showed that guessIsObjC was getting called
tens of millions of times in a manner that scales very rapidly with the amount
of nesting (if clang-format just takes a few seconds with that input passed
on stdin, try adding a couple more levels of nesting).

This change moves the recursive guessIsObjC call one level of nesting out of
an inner loop whose iterations don't affect the input to the recursive call. This
resolves the performance issue.

Diff Detail

Event Timeline

davidvc1 requested review of this revision.Nov 30 2021, 6:26 PM
davidvc1 created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptNov 30 2021, 6:26 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Hi @curdeius, I'm new to the project and I'm adding you as a reviewer because I noticed you reviewed a number of recent changes in this directory. Are you a good reviewer for this change? Thanks!

+ @rsmith from CODE_OWNERS.txt: Hi Richard, could you please suggest a reviewer for this change? Thanks!

MyDeveloperDay requested changes to this revision.Jan 17 2022, 9:40 AM

This needs a full context diff and ideally some unit tests

This revision now requires changes to proceed.Jan 17 2022, 9:40 AM

I see indeed that this change fixes the superlinear complexity when formatting code from stdin:

cat test.cpp | time clang-format --style=LLVM

0:01.44 ==   1.44s  //  9 returns
0:13.17 ==  13s     // 10 returns
1:56.63 == 116s     // 11 returns

time clang-format --style=LLVM test.cpp
0.03s-0.05s

And after this change the stdin case takes as much time as with a file.

It's definitely a good change, but I'd like to understand what's the reason, why stdin is handled so much differently?
Have you investigated this?

Also, a sort of performance test (e.g. a test with a timeout) would be nice to have.

@curdeius , the PR description mentions the reason: when you pipe input from stdin, clang-format has to run through the "is it objective C?" codepath, and this is the codepath with the bug.

On the other hand, when you pass a .cpp file as a file path input, it doesn't need to figure out whether the code is objective C.

Herald added a project: Restricted Project. · View Herald TranscriptApr 24 2022, 11:25 AM

We can land this as NFC. @davidvc1 what email do you want to use for the authorship?

Herald added a project: Restricted Project. · View Herald TranscriptThu, Nov 30, 5:29 PM
Herald added a reviewer: rymiel. · View Herald Transcript
owenpan accepted this revision.Thu, Nov 30, 5:29 PM
NOTE: Clang-Format Team Automated Review Comment

It looks like your clang-format review does not contain any unit tests, please try to ensure all code changes have a unit test (unless this is an NFC or refactoring, adding documentation etc..)

Add your unit tests in clang/unittests/Format and you can build with ninja FormatTests. We recommend using the verifyFormat(xxx) format of unit tests rather than EXPECT_EQ as this will ensure you change is tolerant to random whitespace changes (see FormatTest.cpp as an example)

For situations where your change is altering the TokenAnnotator.cpp which can happen if you are trying to improve the annotation phase to ensure we are correctly identifying the type of a token, please add a token annotator test in TokenAnnotatorTest.cpp