This is an archive of the discontinued LLVM Phabricator instance.

[clang-format][NFC] more centric handling of include name matching
Needs ReviewPublic

Authored by kwk on Sep 27 2022, 5:49 AM.

Details

Summary

Histroy

The foundation for this patch was done in: https://reviews.llvm.org/D121370. I wanted to "rescue" some pieces from
it.

Unlike D121370, this patch doesn't modify the regular expression to identify an include name itself.

  1. It simply provides one instead of two copies of regular expression patterns.
  2. We also provide a function to get the include name from a group of match groups delivered by the shared regular expression that is only instantiated once now. Before when you experimented with the regex for identifying include matches, you'd have to change multiple locations and for example replace Matches[2] with Matches[3] in many locations. This way cumbersome and by having a function clang::tooling::getIncludeNameFromMatches() this makes it more concise and easy to modify.
  3. The include name was trimmed from unwanted characters ("<>) before and this was done repeatedly in multiple places. Now we have: clang::tooling::trimIncludeName() which does it for you, again in one place and not scattered over the code.

Testing

Here's how I've tested this patch:

$ cd llvm-project
$ mkdir build && cd build
$ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Release -DCLANG_BUILD_TOOLS=On -DLLVM_ENABLE_PROJECTS=clang -DCLANG_INCLUDE_TESTS=ON -DLLVM_BUILD_UTILS=ON -G Ninja
$ ninja clangFormat
$ ninja FormatTests
$ ./tools/clang/unittests/Format/FormatTests --gtest_filter=SortIncludesTest*

And if that worked I doubled checked that nothing else broke by running
all format checks:

$ ./tools/clang/unittests/Format/FormatTests

Diff Detail

Event Timeline

kwk created this revision.Sep 27 2022, 5:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2022, 5:49 AM
kwk requested review of this revision.Sep 27 2022, 5:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2022, 5:49 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
410

What if this was overridable via the Style we could experiment with changing the regex

417

‘>= 2’

kwk added a comment.Sep 28 2022, 3:11 AM

@MyDeveloperDay please see my potentially uneducated comments.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
410

Well, right now that was never a requirement and to be honest it isn't very worth exploring this because its not easy to determine which the correct matching group is for the include name.

For example, let's say you would allow this to be configurable from the config file and a user enters the regex from my previous patch (D121370):

^[\t\ ]*[@#][\t\ ]*(import|include)([^"/]*("[^"]+")|[^</]*(<[^>]+>)|[\t/\ ]*([^;]+;))

How would you determine which is the correct matching group? I first thought about taking the last one matching but then I found that my regex was not able to cope with trailing comments. And even in my case there's more than one matching group for the include name. All in all I really don't see any value in outsourcing this to the config file is not something I see value in.

The only thing valuable would be to have a regex for multiple languages. Sorry if this is what you intended. I just wouldn't want to move this out to the config file for the end-user to play with.

417

@MyDeveloperDay correct me if I'm wrong but if an array has size 3 it has indexes 0, 1 and 2. And an array of size 2 only has 0 and 1. So the case =2, which is implied by your suggested`>=2`, is actually an out of bounds access when going Matches[2]. Because that is effectively accessing the third element. The only valid change would be to check for Matches.size() > 2 IMHO and that is the same as Matches.size() >= 3.

I must admit that I had to look at the regex a few times only to realize that it has two non-optional matching groups (...). The third matching group at index 0 is the whole line. So in theory >=3 isn't possible with the current regex. I wanted to give my best to have this logic "survive" a change to the regex in which for example something is added after the matching group of the include name; let's say a comment or something.

I hope I haven't made myself a complete fool.

MyDeveloperDay added inline comments.Sep 28 2022, 1:22 PM
clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
417

No foolery other than my own. I meant to say '> 2', actually the first time I wrote the comment it disappear because it didn't escape the '>'

So one though I had for the config was something like this (ignore regex itself I just put anything in it to show the principle)

---
Language Cpp
IncludeRegex:  ^[\t\ ]*[@#][\t\ ]*(import|include)([^"/]*("[^"]+")|[^</]*(<[^>]+>)|[\t/\ ]*([^;]+;)
IncludeRegexGroup: 1
---
Language: ObjectiveC
IncludeRegex:  ^[\t\ ]*[@#][\t\ ]*(import|include)([^"/]*("[^"]+")|[^</]*(<[^>]+>)|[\t/\ ]*([^;]+;)
IncludeRegexGroup: 2
---
Language: CSharp
IncludeRegex:  using  ([A-z])*
IncludeRegexGroup: 0
---
Language: Carbon
IncludeRegex:  package ([A-z])* api
IncludeRegexGroup: 0
---

Wouldn't this allow different languages to have their own Include/Import regex? Just thinking out loud

It could be more powerful than having 1 regex to rule them all.

While attempting to review this patch, I came to the conclusion that we should first clean up the HeaderIncludes class and Format.cpp a little. I will submit a patch for review ASAP.

owenpan added a project: Restricted Project.Sep 28 2022, 11:43 PM

See D134852.

kwk updated this revision to Diff 463786.Sep 29 2022, 1:14 AM
  • Remove ad-hoc instantiated IncludeRegex objects
kwk added inline comments.Sep 29 2022, 1:31 AM
clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
417

@MyDeveloperDay I like the idea of this approach! But I also see that a single language alone, namely C++, is quite a tough nut to crack with the advent of Modules (https://en.cppreference.com/w/cpp/language/modules). I'm afraid the bit of IncludeRegexGroup would need to be expanded to be either a

  • fixed number (wherever possible)
  • a method as in: last-non-empty-matching-group

Only then I think that your solution would work. Can one have a setting in clang format that allows for mixed input as in number or string? A bit of a hacky solution but probably cutting 100% of all regexes would be to have an enum with these allowed strings: last-non-empty-matching-group,first-non-empty-matching-group,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20.

kwk updated this revision to Diff 464196.EditedSep 30 2022, 2:44 AM
  • Make include regex a static member and not a function (@owenpan did this in D134852)
  • Run clang-format
kwk retitled this revision from [clang-format][chore] transparent #include name regex to [clang-format][NFC] more centric handling of include name matching.Sep 30 2022, 2:58 AM
kwk edited the summary of this revision. (Show Details)

Can you rebase? I'll finish the review after that.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
410

Use ArrayRef instead.

411–415

You can simply assert and return.