This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] SortIncludes should support "@import" lines in Objective-C
AbandonedPublic

Authored by kwk on Mar 10 2022, 3:44 AM.

Details

Summary

[clang-format] SortIncludes should support "@import" lines in Objective-C

Fixes #38995

This is an attempt to modify the regular expression to identify
@import and import alongside the regular #include. The challenging
part was not to support @ in addition to # but how to handle
everything that comes after the include|import keywords. Previously
everything that wasn't " or < was consumed. But as you can see in
this example from the issue #38995, there is no " or < following the
keyword:

@import Foundation;

I experimented with a lot of fancy and useful expressions in this online regex tool only to find out that some
things are simply not supported by the regex implementation in LLVM.

  • For example the beginning [\t\ ]* should be replacable by the horizontal whitespace character \h* but this will break the SortIncludesTest.LeadingWhitespace test.

That's why I've chosen to come back to the basic building blocks.

The essential change in this patch is the change from this regular
expression:

^[\t\ ]*#[\t\ ]*(import|include)[^"<]*(["<][^">]*[">])
        ~                              ~~~~~~~~~~~~~~
        ^                              ^
        |                              |
        only support # prefix not @    |
                                       only support "" and <> as delimiters
                                       no support for C++ modules and ;
                                       ending. Also this allows for ">
                                       or <" or "" or <> which all seems
                                       either off or wrong.

to this:

^[\t\ ]*[@#][\t\ ]*(import|include)([^"/]*("[^"]+")|[^</]*(<[^>]+>)|[\t/\ ]*([^;]+;))
        ~~~~                        ~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~
        ^                                 ^           ^       ^       ^
        |                                 |           |       |       |
        Now support @ and #.            Clearly support "" and <> as ell as an
                                        include name without enclosing characters.
                                        Allows for no mixture of "> or <" or
                                        empty include names. Trailing
                                        comments on include lines as shown in
                                        D124513 are ignored now by
                                        ignoring slashes in the characters that
                                        lead up to a match group.

Here is how I've tested this patch:

ninja clang-Format
ninja FormatTests
./tools/clang/unittests/Format/FormatTests
--gtest_filter=SortIncludesTest*

And if that worked I doubled checked that nothing else broke by running
all format checks:

./tools/clang/unittests/Format/FormatTests

One side effect of this change is it should partially support
C++20 Module
import lines without the optional export in front. Adding
this can be a change on its own that shouldn't be too hard. I say
partially because the @ or # are currently *NOT* optional in the
regular expression.

I see an opportunity to optimized the matching to exclude @include for
example. But eventually these should be caught by the compiler, so...

With my change, the matching group is not at a fixed position any
longer. I decided to choose the last match (group) that is not empty.

Diff Detail

Event Timeline

kwk created this revision.Mar 10 2022, 3:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2022, 3:44 AM
kwk requested review of this revision.Mar 10 2022, 3:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2022, 3:44 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
kwk added a project: Restricted Project.Mar 10 2022, 3:44 AM
clang/lib/Format/Format.cpp
2798

How does that not go out of scope and I have a dangling reference?

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
171–172

I don't see why this is needed. It should match with the previous stuff.

171–172

Do we want that optional?

clang/unittests/Format/SortIncludesTest.cpp
471

I've no idea about obj-c, is it feasible to mix # and @? If so please add a test where the @ is between #s.

curdeius added inline comments.
clang/lib/Format/Format.cpp
2727–2728

I'd rather see handling separately the #import/#include case on one hand and @import on the other.
IIUC, @import is followed by ns1.ns2.something without quotes nor <>. And it ends with a ;.
Trying to put everything together is IMO error-prone and not very readable.

You could just do (pseudo-code):

CppIncludeRegexPattern = (current_regex) | "@\s*import\s+[^;]+;"

(\s is probably not supported, but we could use character class [:space:] for clarity, cf. llvm\lib\Support\regcomp.c)

Not sure whether whitespace is allowed after @.
Could you please post a link to the specification of @import syntax?

Not sure fixing <" / "> issue is worth it, but in all cases I'd prefer seeing it in a different patch.

2798

Why not doing it the other way round, i.e. trimming <>" from include name?

curdeius added inline comments.Mar 10 2022, 6:41 AM
clang/lib/Format/Format.cpp
2727–2728

[:space:] would match newlines and we don't want this in all cases, but [:blank:] should be a good replacement for [\t ].

On a different note, why there's a new line \n and backslash \\ in [\t\n\ \\]?

kwk updated this revision to Diff 414469.Mar 10 2022, 12:47 PM
kwk marked 2 inline comments as done.
  • Make @ or # not optional again
  • Remove [\t\n\ \\]*
  • Properly concat string
kwk added a comment.Mar 10 2022, 12:47 PM

Thank you for your comments! I've addressed many of them and now address the rest.

clang/lib/Format/Format.cpp
2798

Why not doing it the other way round, i.e. trimming <>" from include name?

Because the include name is used for sorting later and if the <> is removed from the name, the <> included will be mixed with the "". That's undesirable.

2798

How does that not go out of scope and I have a dangling reference?

Sorry I was a bit lost with this but with some help I found other places in which concat was used and toStringRef. I hope everything is fine now.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
171–172

I don't see why this is needed. It should match with the previous stuff.

That's why I thought too. I swear I ran the tests before and I wouldn't work without this. But I re-ran the tests and now it works. Thanks for bringing up this very obvious part.

171–172

Do we want that optional?

We don't need it optional for the tests to pass. Only for C++20 Modules, we need it optional. But this is *not* part of this change. I've removed the optional part of this.

clang/unittests/Format/SortIncludesTest.cpp
471

I've no idea about obj-c, is it feasible to mix # and @? If so please add a test where the @ is between #s.

This test is copied from the issue itself. I have no idea about Obj-c as well. The @ already *is* between the #. Did I miss something?

kwk edited the summary of this revision. (Show Details)Mar 10 2022, 12:51 PM
kwk edited the summary of this revision. (Show Details)
kwk edited the summary of this revision. (Show Details)Mar 10 2022, 12:53 PM
kwk planned changes to this revision.Mar 10 2022, 1:08 PM

A test if failing. Need to address this first.

MyDeveloperDay edited the summary of this revision. (Show Details)Mar 11 2022, 1:52 AM
kwk updated this revision to Diff 415487.Mar 15 2022, 10:08 AM

Changed the strategy of how includes are sorted by increasing the priority if an include is of the "new" type: @import Foundation;. This will sort these includes after everything else just as requested in https://github.com/llvm/llvm-project/issues/38995.

kwk edited the summary of this revision. (Show Details)Mar 15 2022, 10:09 AM
kwk updated this revision to Diff 415489.Mar 15 2022, 10:11 AM
kwk edited the summary of this revision. (Show Details)
  • Fix formatting
kwk edited the summary of this revision. (Show Details)Mar 15 2022, 10:13 AM

Some test with a main header and blocks would be nice.

clang/lib/Format/Format.cpp
2813–2816

I don't know about the others, but I prefer C++: std::numeric_limits

kwk updated this revision to Diff 415769.Mar 16 2022, 3:51 AM
  • Use std::numeric_limits<int>::max() instead of INT_MAX
kwk marked an inline comment as done.Mar 16 2022, 3:52 AM

Addressed review comments.

clang/lib/Format/Format.cpp
2813–2816

Sure. That's done now.

Some test with a main header and blocks would be nice.

This one please. :)

MyDeveloperDay requested changes to this revision.Mar 16 2022, 2:58 PM
MyDeveloperDay added inline comments.
clang/lib/Format/Format.cpp
2801

int false?

This revision now requires changes to proceed.Mar 16 2022, 2:58 PM
kwk updated this revision to Diff 416876.Mar 21 2022, 3:05 AM
kwk marked an inline comment as done.
  • Address review comments on grouping and main headers
kwk updated this revision to Diff 416877.Mar 21 2022, 3:06 AM
  • Fix: int -> bool
kwk added a comment.Mar 21 2022, 3:11 AM

Some test with a main header and blocks would be nice.

This one please. :)

Done.

clang/lib/Format/Format.cpp
2801

int false?

Yes, this was just to see if everybody is awake ;). Just kiddin'. Good catch.

kwk added a comment.Mar 21 2022, 3:13 AM
This comment was removed by kwk.
kwk updated this revision to Diff 416878.EditedMar 21 2022, 3:19 AM
  • Fix tests

But please fix the format. :)

kwk updated this revision to Diff 417097.Mar 21 2022, 2:21 PM
kwk marked an inline comment as done.
  • introduce getIncludeNameFromMatches
  • Fix format
kwk marked 3 inline comments as done.Mar 21 2022, 2:24 PM

@HazardyKnusperkeks thank you for the approval. Can you have one last view please? I've introduced a function to get the match instead of repeating the for loop two times.

kwk updated this revision to Diff 417099.Mar 21 2022, 2:32 PM
  • Remove left-over comment
kwk updated this revision to Diff 417101.Mar 21 2022, 2:34 PM
  • Remove duplicate assert
kwk added a comment.Mar 22 2022, 2:37 AM

@MyDeveloperDay can you please have another look now that the patch has additional tests and most comments that still apply have been addressed?

HazardyKnusperkeks added inline comments.
clang/lib/Format/Format.cpp
2733–2736
2738

Is there something like LLVM_UNREACHABLE?

kwk updated this revision to Diff 417379.Mar 22 2022, 1:11 PM
  • Address review comment
kwk marked 2 inline comments as done.Mar 22 2022, 1:13 PM

@HazardyKnusperkeks I've addressed your comments and did an early return together with llvm_unreachable() which is used in the clang/lib/Format in other places as well. I hope this is to your liking. There's no longer an assert. Do you want that as well?

owenpan added inline comments.Mar 22 2022, 5:35 PM
clang/lib/Format/Format.cpp
2731–2734

I think you missed Matches[0].

3083

We should make it static if it's only used in this file.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
171–172

If these are the same as in Format.cpp above, we should move the definitions to HeaderIncludes.h.

owenpan added inline comments.Mar 22 2022, 5:42 PM
clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
171–172

If these are the same as in Format.cpp above, we should move the definitions to HeaderIncludes.h.

I meant we should remove the definitions from Format.cpp and add the declarations to HeaderIncludes.h.

kwk updated this revision to Diff 423254.Apr 16 2022, 12:19 PM
  • Put shared code for finding include names HeaderIncludes
kwk updated this revision to Diff 423255.Apr 16 2022, 12:23 PM
  • Added llvm:: namespace
kwk updated this revision to Diff 423256.Apr 16 2022, 12:26 PM
  • Use proposed reverse loop
kwk updated this revision to Diff 423257.Apr 16 2022, 12:29 PM
kwk marked an inline comment as done.
  • Make function static
kwk marked 3 inline comments as done.Apr 16 2022, 12:30 PM
kwk added inline comments.
clang/lib/Format/Format.cpp
2731–2734

I did miss it and fixed it but it doesn't make much sense to use the first match because that's always the overall string that's being matched. Nevertheless, I've used your porposed reverse loop and the logic is better if it includes even the first match.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
171–172

I think I've done what you intended me to do.

Should we handle #import and @import for Object-C only so as to simply the regex for C++?

clang/include/clang/Tooling/Inclusions/HeaderIncludes.h
136

I think you meant @import bar;, which is valid in Objective-C only. So is #import btw.

139–147

Function names should start with a lowercase letter.

kwk updated this revision to Diff 423582.Apr 19 2022, 3:45 AM
  • Make functions lowercase
kwk marked 2 inline comments as done.Apr 19 2022, 3:51 AM

Should we handle #import and @import for Object-C only so as to simply the regex for C++?

@owenpan my counter question probably reveals how few things I know about clang-format and how much I still have to learn. I only see sortCppIncludes but no sortObjCIncludes in the function list in Format.h. That said, do you want to introduce that or do you want to put logic in sortCppIncludes that knows about the programming language at hand and decides which regex to pick?

kwk updated this revision to Diff 423584.Apr 19 2022, 3:53 AM
  • typo: @include bar; -> @import bar;

The hard part about doing 2 separate cases is where we miss-interpret a .hpp or .h file as being the wrong language (C/C++/ObjC/ObjC++), I don't think I have a problem with a single regex (its not like we change it that often).

Is there a use case for having them separate or not?

MyDeveloperDay accepted this revision.Apr 19 2022, 6:56 AM

Accepting but please wait for @owenpan

This revision is now accepted and ready to land.Apr 19 2022, 6:56 AM
owenpan accepted this revision.Apr 19 2022, 9:00 PM

LGTM as I defer to @MyDeveloperDay.

kwk edited the summary of this revision. (Show Details)Apr 20 2022, 12:00 AM
kwk edited the summary of this revision. (Show Details)Apr 20 2022, 12:05 AM
This revision was landed with ongoing or failed builds.Apr 20 2022, 12:22 AM
This revision was automatically updated to reflect the committed changes.

It appears that this regressed include sorting in the following case, where the contents of test.h show the expected include order and the clang-format behavior before this patch:

% cat test.h
#include <cstdint>

#include "util/bar.h"
#include "util/foo/foo.h"  // foo<T>
% bin/clang-format --version; bin/clang-format -style=google test.h
clang-format version 15.0.0 (https://github.com/llvm/llvm-project.git d46fa023caa2db5a9f1e21dd038bcb626261d958)
#include "util/foo/foo.h"  // foo<T>
#include <cstdint>

#include "util/bar.h"

@kwk could you please take a look

kwk added a comment.Apr 26 2022, 12:01 PM

It appears that this regressed include sorting in the following case, where the contents of test.h show the expected include order and the clang-format behavior before this patch:

% cat test.h
#include <cstdint>

#include "util/bar.h"
#include "util/foo/foo.h"  // foo<T>
% bin/clang-format --version; bin/clang-format -style=google test.h
clang-format version 15.0.0 (https://github.com/llvm/llvm-project.git d46fa023caa2db5a9f1e21dd038bcb626261d958)
#include "util/foo/foo.h"  // foo<T>
#include <cstdint>

#include "util/bar.h"

@kwk could you please take a look

@krasimir I'm on sick leave and will have a look as soon as I'm back. Can it wait that long?

@krasimir I'm on sick leave and will have a look as soon as I'm back. Can it wait that long?

I'm planning to revert this temporarily and add a regression test, you can proceed with this work without time pressure. We aim to release a new clang-format version weekly to ease the process of finding and reporting regressions.

kwk reopened this revision.May 3 2022, 7:24 AM
This revision is now accepted and ready to land.May 3 2022, 7:24 AM
kwk updated this revision to Diff 426701.May 3 2022, 7:26 AM

Address trailing comments on include lines by ignoring slashes in
characters that lead up to the match groups.

^[\t\ ]*[@#][\t\ ]*(import|include)([^"/]*("[^"]+")|[^</]*(<[^>]+>)|[\t/\ ]*([^;]+;))
        ~~~~                        ~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~
        ^                                 ^           ^       ^       ^
        |                                 |           |       |       |
        Now support @ and #.            Clearly support "" and <> as ell as an
                                        include name without enclosing characters.
                                        Allows for no mixture of "> or <" or
                                        empty include names. Trailing
                                        comments on include lines as shown in
                                        D124513 are ignored now by
                                        ignoring slashes in the characters that
                                        lead up to a match group.
kwk added a comment.May 3 2022, 7:32 AM

@krasimir could you please test this patch on your side? I've reopened it and tested it with the following file that contains both // and /* ... */ trailing comments on include lines followed by foo<T> and alike. The includes use <...> and "..." notation to locate files.

*test.h*

#include <cstdint>

#include "util/bar.h"
#include "util/foo/foo.h"  // foo<T>
#include "util/bar/bar.h"  // bar
#include "utilzbar0.h"  // bar0<T>
#include <util/foo/foo3.h>  // foo3<T>
#include <util/bar/bar3.h>  // bar3
#include <util/foo/foo2.h>  // foo2<T>
#include <util/bar/bar2.h>  // bar2
#include "util/bar.h"
#include "util/foo/foo.h"  /* foo<T> */
#include "util/bar/bar.h"  /* bar */
#include "utilzbar0.h"  /* bar0<T> */
#include <util/foo/foo3.h>  /* foo3<T> */
#include <util/bar/bar3.h>  /* bar3 */
#include <util/foo/foo2.h>  /* foo2<T> */
#include <util/bar/bar2.h>  /* bar2 */

Run test with ninja clang-format && bin/clang-format -style=google test.h:

#include <util/bar/bar2.h>  // bar2
#include <util/bar/bar2.h>  /* bar2 */
#include <util/bar/bar3.h>  // bar3
#include <util/bar/bar3.h>  /* bar3 */
#include <util/foo/foo2.h>  // foo2<T>
#include <util/foo/foo2.h>  /* foo2<T> */
#include <util/foo/foo3.h>  // foo3<T>
#include <util/foo/foo3.h>  /* foo3<T> */

#include <cstdint>

#include "util/bar.h"
#include "util/bar/bar.h"  // bar
#include "util/bar/bar.h"  /* bar */
#include "util/foo/foo.h"  // foo<T>
#include "util/foo/foo.h"  /* foo<T> */
#include "utilzbar0.h"     // bar0<T>
#include "utilzbar0.h"     /* bar0<T> */

To me, this looks precisely how it should look like. What do you think?

kwk edited the summary of this revision. (Show Details)May 3 2022, 7:33 AM
kwk planned changes to this revision.May 3 2022, 9:00 AM

Some thing that doesn't work at the moment is ordering an include like this:

#include /*some include*/ "wontwork.h"

I wonder if it worked before. Let's see. Yes it worked. Grrr. I guess I need to do more...

kwk added a comment.May 3 2022, 9:13 AM

I found another problem with both regular expressions (pre mine and mine): They cannot find

/* a comment before */ #include <foo>

Is this a known problem @MyDeveloperDay @owenpan @krasimir ?

@krasimir could you please test this patch on your side?

@kwk thank you! the new version of the patch looks good!

I didn't know before about the issue you found with the /* a comment before */ #include <foo> (and since it's preexisting, if it's tricky to address, it would be reasonable to just file a bug for it and omit fixing it by this work, I think).

Now support @ and #.            Clearly support "" and <> as ell as an

as well as?

@krasimir could you please test this patch on your side?

@kwk thank you! the new version of the patch looks good!

I didn't know before about the issue you found with the /* a comment before */ #include <foo> (and since it's preexisting, if it's tricky to address, it would be reasonable to just file a bug for it and omit fixing it by this work, I think).

+1.

Eitot added a subscriber: Eitot.May 25 2022, 3:10 AM

@kwk have we bottomed out on this issue? its been stale for a bit. If we are not going to work on it further can be "Abandon" the review to get it off our lists?

kwk abandoned this revision.Sep 27 2022, 5:52 AM

@kwk have we bottomed out on this issue? its been stale for a bit. If we are not going to work on it further can be "Abandon" the review to get it off our lists?

We can abandon this patch and I might look into it some other time. But I wanted to rescue bits from this patch that don't change the current behavior. That's why I've created: https://reviews.llvm.org/D134733 . I hope we can approve it to get the transparency in and make a patch like D121370 a bit smaller in the future.