This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Refactor #include insertion/deletion functionality into a class.
ClosedPublic

Authored by ioeric on Apr 27 2018, 1:50 AM.

Details

Summary

The class will be moved into libToolingCore as followup.

The new behaviors in this patch:

  • New #include is inserted in the right position in a #include block to

preserver sorted #includes. This is best effort - only works when the
block is already sorted.

  • When inserting multiple #includes to the end of a file which doesn't

end with a "\n" character, a "\n" will be prepended to each #include.
This is a special and rare case that was previously handled. This is now
relaxed to avoid complexity as it's rare in practice.

Diff Detail

Repository
rC Clang

Event Timeline

ioeric created this revision.Apr 27 2018, 1:50 AM

There isn't actually as much code changed as it appears to be, but phabricator doens't understand the diff very well... git diff might be an easier way to review the patch.

This change LG as an extraction of the helper functionality to be reused in clang, clang-tidy, etc.
However, I feel there are potential improvements both to the underlying code and the new APIs that we could make.

I left some comments, trying to focus on interface of the class and keeping the changes that would be required to address them to be purely cosmetic. As chatted offline, any non-trivial changes to the underlying implementation are out of scope of this patch.

preserver sorted #includes.

A typo in the change description:
s/preserver/preserve

This is best effort - only works when the
block is already sorted.

Could we describe behavior for the unsorted case in a sensible way? E.g. is it added to the end of the include list, added after closely related headers, added to a totally random place (but deterministic) place, etc?
It seems like an important case, and I believe we shouldn't completely ignore it and describe what the "best effort" means in that case.

When inserting multiple #includes to the end of a file which doesn't
end with a "\n" character, a "\n" will be prepended to each #include.
This is a special and rare case that was previously handled. This is now
relaxed to avoid complexity as it's rare in practice.

I don't quite get it, could you please elaborate on what has changed here exactly? Maybe give an example?

lib/Format/Format.cpp
1711–1712

I would personally keep the function non-const and not use mutable fields here.
Even though it's logically const, I would strive towards keeping the things const only if there are actually immutable
One particular problem that could be avoided is accidentally calling the const methods concurrently on different threads.

But up to you if you actually want to make this change.

2232

Maybe s/#include headers/#include directives/?
This is how they called in terminology of preprocessing.

2239

Maybe make an API more explicit about the style of includes that should be used?
Make it literally impossible to misuse the API:

enum class IncludeStyle { Quoted, Angle };
/// \p IncludeName is the include path to be inserted, \p Style specifies whether the included path should use quotes or angle brackets.
Optional<Replacement> insert(StringRef IncludeName, IncludeStyle Style = IncludeStyle::Quoted) const;
2244

Maybe use StringRef here?

2257

Have you considered changing the API slightly to allow iterating over all includes in the file and APIs to remove includes pointed by the iterators?
It would give more flexibility, allowing the clients to implement useful things like:

  • Remove all #includes that start with a prefix, e.g. clang/Basic/.....
  • Find and remove all duplicate #include directives.
  • etc, etc.

The current implementation seems tied to a specific use-case that we have in clang-format, i.e. "preprocess once, remove headers in batches".
It feels wasteful to pay overhead of StringMap<vector<Include>> sometimes, e.g. if the code wants to insert just a single header or do other things, like removing duplicate headers.

I.e. I propose to extract the code that parses the source code and produces a list of #include directives with their respective ranges and filenames into a separate function/class.
This would. arguably, improve readability of the include insertion code.

That might not be a trivial change, though, since it requires rewriting some of the code used here, while the current patch seems to focus on simply extracting useful functionality from clang-format for reuse in clangd, clang-tidy, etc.
Let me know what you think.

2259

We should mention that remove only deletes exact matches and won't try to resolve filenames, patch up path separators, etc.

2260

Maybe also add a param of type Optional<IncludeStyle> and

  • only remove includes with the specified style, if specified the value is specified
  • remove both styles the value is llvm::None.
2261–2262

What do we use the FileName for?
On a higher level, it seems we shouldn't need it to implement what HeaderIncludes does.

2270

Could you please add a comment explaining what are those priorities and what this map is used for?

2270

Is this field always equal to Code.drop_front(MinInsertOffset)?
If so, maybe compute it when needed and don't store explicitly?

2272

Could you add a comment explaining how min and max offsets are calculated? I assume minoffset goes after the file comment and maxoffset is where non-preprocessor code starts, but it would be nice to have a short comment to know for sure.

ioeric marked 7 inline comments as done.Apr 27 2018, 5:45 AM

Thanks for reviewing this!

This change LG as an extraction of the helper functionality to be reused in clang, clang-tidy, etc.
However, I feel there are potential improvements both to the underlying code and the new APIs that we could make.

I left some comments, trying to focus on interface of the class and keeping the changes that would be required to address them to be purely cosmetic. As chatted offline, any non-trivial changes to the underlying implementation are out of scope of this patch.

preserver sorted #includes.

A typo in the change description:
s/preserver/preserve

This is best effort - only works when the
block is already sorted.

Could we describe behavior for the unsorted case in a sensible way? E.g. is it added to the end of the include list, added after closely related headers, added to a totally random place (but deterministic) place, etc?
It seems like an important case, and I believe we shouldn't completely ignore it and describe what the "best effort" means in that case.

Sure. I added more comments to the API.

When inserting multiple #includes to the end of a file which doesn't
end with a "\n" character, a "\n" will be prepended to each #include.
This is a special and rare case that was previously handled. This is now
relaxed to avoid complexity as it's rare in practice.

I don't quite get it, could you please elaborate on what has changed here exactly? Maybe give an example?

This is really a corner cases that users might not need to know about. But an example is:
Code: "#include <a>" (without \n at the end). After inserting <x>, #include <a>\n#include <x>\n (this is good). However, if you insert another <y>, the code would become "#include <a>\n#include <x>\n\n#include <y>\n" (note the double newline!).

lib/Format/Format.cpp
1711–1712

Making this non-const would prevent us from making HeaderIncludes::insert const, which I do want to keep. Added a mutex to prevent race-condition...

2239

I think the risk is pretty low, so I'm not sure if it'd be worth spending another variable. From experience, the current callers would likely need to either carry another variable for each include or use the following pattern (as the IncludeName is often already quoted):

... = insert(Include.trime("\"<>"), QuoteStyle);
2244

We need to worry about life-cycle of the underlying string if we use StringRef, and I think it's not trivial in this case, so I think std::string would be a better fit here.

2257

I haven't thought think about these use cases.

These use cases don't seem to be very common though. Generally, I try to make the APIs easy to use for the use cases we have instead designing the APIs for all potential uses. An over-generalized API might be awkward to use for major use cases, and designing for the future also has opportunity cost...

That said, I think when the use cases do come up, it wouldn't be hard to add APIs for that (and change implementation if necessary). But I also think the current APIs would still be useful for users who only care about #include insertion/deletion.

2260

Again, I'm not quite sure if it's worth another parameter for this.

2261–2262

FileName is needed to decide whether a header is a main header.

ioeric updated this revision to Diff 144319.Apr 27 2018, 5:46 AM
ioeric marked 2 inline comments as done.

Addressed review comments.

This is really a corner cases that users might not need to know about. But an example is:
Code: "#include <a>" (without \n at the end). After inserting <x>, #include <a>\n#include <x>\n (this is good). However, if you insert another <y>, the code would become "#include <a>\n#include <x>\n\n#include <y>\n" (note the double newline!).

Even though that's rare in practice, it seems like an actual bug.
How hard would be it be to fix it?

ilya-biryukov added inline comments.May 2 2018, 3:12 AM
lib/Format/Format.cpp
1711–1712

Actually keeping things const and mentioning in the class comments that the class is not thread-safe because it uses regexes seems like a better choice.
Most of the clients shouldn't pay for the mutex operations for the sake of the rare multi-threaded clients.
WDYT?

2239

I still don't feel encoding things in the strings is the right approach here.
It raises some questions that we should handle/assert: what if users don't quote the strings at all? what is the leading and trailing quotes don't match? and so on and so forth.
Being explicit about such things seems like a better choice.

From experience, the current callers would likely need to either carry another variable

It's not surprising, since the class is literally extracted from the middle of the include insertion implementation in clang-format.

2244

We used StringReffor Code, Filename, etc before.
Are they different? Is the lifetime of Include more complicated? When would Include point outside Code?

2257

Ok, I'm happy with leaving the current API as is, even though I do feel some refactoring would help both the readability and allow us to reuse the code for more use-cases.
But does seem out of scope of this patch, so we could wait for an actual use-case to pop up.

2260

The parameter in this case seems to be even more useful, since we have two different behaviors for quoted vs unquoted strings.
So it's really easy to misuse the API and make a mistake, thinking it does the right thing for you.

2261–2262

Can we pass bool IsMainHeader instead? Can we not store it as a field? It seems to be only needed in the constructor, right?

ioeric updated this revision to Diff 144895.May 2 2018, 9:36 AM
ioeric marked 3 inline comments as done.
  • addressed more review comments.
lib/Format/Format.cpp
1711–1712

Honestly, I'm a bit torn on this. Either approach is not perfect but either seems acceptable. Rolled back to const+mutable and added the comment.

2239

Ok, I am convinced ;)

2260

After offline discussion, I was also convinced here :)

2261–2262

This can be only calculated at insert time when IncludeName is provided - we can only tell if a new include is main header when both FileName and name of the new header are provided.

Friendly ping.

I somehow missed the review email, sorry for the delayed response.

Just one nit and one question from me on changed behavior in the tests (quoted vs angled #include).
Otherwise LG, just wanted to make sure the change in behavior is intentional.

lib/Format/Format.cpp
2150

NIT: Maybe we could add some asserts to this function that the passed include name is actually properly quoted.
E.g. starts with '<' or '"', ends with the corresponding char, etc.

So long as this is part of Format.cpp, it's not terribly important, since it's only called on paths that come from regex matches.
If we make it a public later, having asserts would be really useful.
That said, we might address it later when/if we actually make this function public.

unittests/Format/CleanupTest.cpp
864–865

Are we sure we want this behavior change?
The API seems to allow us keeping the old one.

I know there's a FIXME there, but I just wanted to understand better what's the right thing to do here and why.

ioeric updated this revision to Diff 145210.May 4 2018, 9:20 AM
ioeric marked an inline comment as done.
  • Addressed review comment.
This revision is now accepted and ready to land.May 4 2018, 9:34 AM
This revision was automatically updated to reflect the committed changes.