Page MenuHomePhabricator

clang-format IncludeBlocks: Regroup determination of "main" for framework-style includes fix
Needs ReviewPublic

Authored by poelmanc on Feb 15 2021, 7:41 PM.

Details

Summary

Most modern libraries and applications include files with relative paths from a root (e.g. #include <foo/bar/baz.h> versus #include "baz.h".) When regrouping, a file's "main include" should be left at the top (given priority 0.) The existing IncludeCategoryManager::isMainHeader() checks only the file stem (e.g. baz.h), and fails to identify main includes with angle brackets despite a comment saying // remove the surrounding "" or <>.

This patch fixes these issues by comparing all directory components present in both the #include string and the file name, and by allowing angle bracket includes to be considered "main". It adds a new IsMainHeader unit test to check behavior of framework-style includes, which would fail without this patch.

Diff Detail

Event Timeline

poelmanc requested review of this revision.Feb 15 2021, 7:41 PM
poelmanc created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 15 2021, 7:41 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
HazardyKnusperkeks added inline comments.
clang/unittests/Tooling/HeaderIncludesTest.cpp
106

I would also add "baz.h", since the test is named IsMainHeader.

I think I'd be concern about the <string.h> case (lets ignore the fact that I should be including <string>

so imagine I'm writing a string class and I have

string.cpp including "string.h"

but as part of my string class because I'm somehow extending std::string

#include "string.h"
#include <string.h>

class mystring : public std::string
{
}

Which file wins?

I think lots of people will always consider <> to mean system paths, I think if there is any chance that <> could change the order then I think it needs to be an option that is off by default.

I don't think this is a bug, I think you are asking for an enhancement.

clang/lib/Tooling/Inclusions/HeaderIncludes.cpp
237

Does this need to be an option?

Thanks for the speedy reply and the great tool. With appropriate Regex and Priority settings, IncludeCategories worked flawlessly for me aside from not identifying the main header. Treating #include "foo/bar/baz.h" as the main header in file bar/random/baz.h seems like a bug, but I certainly see the dangers of changing current <> behavior. I also considered treating <> includes as main headers only if they also contain a forward-slash, e.g.:

if (!IncludeName.startswith("\"") && !IncludeName.contains("/"))
  return false;

That would resolve the <string.h> case, although #include <sys/types.h> in a file anything/sys/types.h would be identified as the main header. So making an option seems like the cleanest solution. Say, bool IncludeIsMainAllowBraces?

Does this need to be an option?

It's easy to add an option, but there are already two main include-related options, so before adding a third I wanted to give this some thought. As someone new to IncludeCategories, I was genuinely impressed at how easy it was to use and how it gave me such complete control over the grouping via regular expressions. Yet in comparison the determination of main headers was less clear and more hard-coded; I had to examine the source code to figure out that the comparison is case-insensitive, it doesn't consider <> includes, only file stems are considered (e.g. the foo/bar in foo/bar/baz.h is ignored), and the behaviors of IncludeIsMainSourceRegex and IncludeIsMainRegex were a bit murky.

That's all fixable with a patch and minor documentation tweaks, but I wanted to consider some alternatives. Users of the IncludeCategories feature are comfortable with regular expressions, so imagine eliminating special handling of main headers entirely, and instead enabling users to write their own IncludeCategories rules for putting main headers first? We'd need to give them some bits of the source file path to use in their Regex, say called ${SourceStem} and ${SourcePath}.

Users unconcerned with formatting foo_test.cc or foo_proto.cc files could get the current functionality by including a simple rule like:

- Regex:           '"(.*\/)?${SourceStem}(\..*)?"'
  Priority:        0
  CaseSensitive:   false

I want case sensitivity, matching at least one component of the path, and angle brackets, so I'd use something like:

- Regex:           '[<"]${SourcePath:1}${SourceStem}\..*[>"]'
  Priority:        0
  CaseSensitive:   true

Then adding a generally-useful ApplyToSourcesRegex feature to apply any category regex only to certain source files, and an ability to trim a regular expression from the end of ${SourceStem}, gives users full control, including mimicking current isMainHeader() behavior with rules like:

- Regex:           '"(.*\/)?${SourceStem-<insert-old-IncludeIsMainRegex>}(\..*)?"'
  Priority:        0
  CaseSensitive:   false
  ApplyToSourcesRegex: `((.*\.(c|cc|cpp|c\+\+|cxx|m|mm))|<insert-old-IncludeIsMainSourceRegex>)`

For backwards-compatibility we'd automatically add the above rule if no special ${Source tokens appear in any rules, and we can deprecate IncludeIsMainRegex and IncludeIsMainSourceRegex at any point. The special tokens for use in Regex are defined as:

  • ${SourcePath:<n>} is a regular expression matching n or more trailing components of the source directory file path, using forward slashes and including any trailing slash (0 <= n < 10)
  • ${SourceStem} is a string starting after the last / in the source file path and includes up to but excluding the first dot
  • ${SourceStem-<re>} is ${SourceStem} with any trailing characters matching regular expression re removed

I have this coded up and it passes the ToolingTests, but wanted to run the idea by others. Thoughts? Should I update this patch with these changes? Start a separate issue? Stick with a new option IncludeIsMainAllowBraces?

I am in favor of a new option, IncludeIsMainAllowBraces. It keeps things simple and does not cause any backwards-compatibility issues. The ${} variables are clever, but if "users unconcerned with formatting ... could get the current functionality by including a simple rule [that includes a variable]", I think it might be asking too much. Being able to write (and read) an include regex without referring to documentation seems ideal to me. You mentioned automatically adding variables in rules for backwards-compatibility, but I'd be a bit concerned about robustness and too much magic going on behind the scenes for users to understand if something goes wrong or has unexpected/unexplained results.

Admittedly I don't fully comprehend the proposed additions and all their ins-and-outs, but I do understand a new, simple option.