This is an archive of the discontinued LLVM Phabricator instance.

PR36643 Make clang-format support more generic TMarcos
Needs RevisionPublic

Authored by obfuscated on Mar 21 2018, 5:31 PM.

Details

Summary

This patch changes the check for _T to detect TMarcos with a more generic check.
This makes it possible to format wxWidgets projects (where wxT is used a lot) correctly.

Patch by Teodor Petrov

Diff Detail

Event Timeline

obfuscated created this revision.Mar 21 2018, 5:31 PM
alexfh requested changes to this revision.Mar 22 2018, 12:48 AM
alexfh added a subscriber: alexfh.

We can't just treat anything("....") like the _T macro. There should be a whitelist configurable with an option. By default only _T should be handled.

This revision now requires changes to proceed.Mar 22 2018, 12:48 AM
Typz added a subscriber: Typz.Mar 22 2018, 2:31 AM

A generic (or at least extandable) approach to specifying macro behaviors was introduced here: https://reviews.llvm.org/D33440

But this change seem to be on hold, as a configuration based on providing "macro definitions" seem to be preferable and kind-of in the pipe, though with no defined timeline...

A generic (or at least extandable) approach to specifying macro behaviors was introduced here: https://reviews.llvm.org/D33440

I believe, that patch solves a significantly different problem and it won't make it much easier to implement correct handling of _T(x)-like macros (which expand to either x or L ## x depending on some other macro and thus have to be repeated for each fragment of the string literal after splitting).

The most extensible solution I see is to make the list of _T-like macro spellings configurable via an option.

Typz added a comment.Mar 22 2018, 7:07 AM

You are right, I did not mean it would help with the handling of these macros; but it may be related on the configuration-side of things : adding an option for listing these macros may hit the same limitation, and the same mean of storing (in the code) the list of these macros may be used.

We can't just treat anything("....") like the _T macro. There should be a whitelist configurable with an option. By default only _T should be handled.

What cases could break with this version of the patch?
Or you think it is just too dangerous to have this in such generic form?
I'm fine adding an option if this is deemed acceptable.

We can't just treat anything("....") like the _T macro. There should be a whitelist configurable with an option. By default only _T should be handled.

What cases could break with this version of the patch?

It is fine to split _T("veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery loooooooooooooooooooooooooooong striiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiing") as

_T("veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery ")
_T("loooooooooooooooooooooooooooong ")
_T("striiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiing")

and it's incorrect to split it like this:

_T("veeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeery "
   "loooooooooooooooooooooooooooong "
   "striiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiing")

since only the first string literal will get the L prefix in case when _T(x) is defined as L ## x. The same is valid for wxT() as I understand.

However, it's wrong to do so for any function or any macro which uses its argument in any way different from that of _T. E.g. strlen("aaa bbb") is not the same as

strlen("aaa ")
strlen("bbb")

I'm fine adding an option if this is deemed acceptable.

I'd say this should be an option.

Added an option

Attached both commits in a single diff...

alexfh requested changes to this revision.Mar 26 2018, 8:30 AM

Thank you! One more comment from me and I'll leave the rest of the review to Krasimir, who has a better idea of how the configuration options should be named etc.

include/clang/Format/Format.h
1676–1677

I'm not sure "string wrapping macros" is the right term or can be easily understood by the reader of the code. The relevant MSDN page is talking about "generic-text mapping". I'm not sure this is a commonly used term either, but at least there's a precedent. Another relevant term is "encoding prefix". With these in mind, I could suggest following descriptions: "generic-text mapping macros", "string literal encoding prefix macros", "macros expanding to a string literal encoding prefix".

I'd also expand this a bit explaining where these macros are coming from and how they are special. Something along the lines of:

"Some libraries provide macro(s) to change the encoding prefix of string literals depending on configuration, for example, _T() macro on Microsoft platforms. When splitting string literals, the macro should be applied to each fragment of the literal to apply the same encoding prefix to all of them, which requires special treatment from clang-format. This option lists the names of these special macros."

This revision now requires changes to proceed.Mar 26 2018, 8:30 AM

I don't have preferences over names, but I agree with Alex that the option should have more detailed description.

lib/Format/FormatToken.h
138

We don't strictly need this new field. We could do as in the old implementation and check if the string prefix is from a T macro in ContinuationIndenter.

unittests/Format/FormatTest.cpp
10693

Why is the FIXME here? I suggest just use the pattern similar to the other cases here and just keep the test with 2 elements:

Style.TMacros.clear();
std::vector<std::string> ExpectedTMacros = {"_T", "myT"};
CHECK_PARSE("TMacros: [_T, myT]", TMacros, ExpectedTMacros);
obfuscated marked an inline comment as done.Mar 27 2018, 3:28 PM
obfuscated added inline comments.
lib/Format/FormatToken.h
138

Using a flag is more reliable and faster - the checks are done once, so they don't have to be repeated.

From the field layout in the struct and from the usage of bools I could only conclude that this is not a memory nor performance critical part of the code.

unittests/Format/FormatTest.cpp
10693

I've copy pasted this from the foreach macro option.
I've not investigate why this fixme is there.

obfuscated marked an inline comment as done.

Fixed tests. Fixed description.

obfuscated marked 3 inline comments as done.Mar 29 2018, 1:12 AM
obfuscated marked an inline comment as done.
krasimir added inline comments.Mar 29 2018, 2:47 AM
lib/Format/FormatTokenLexer.cpp
386

In the original code, TMacro detection was done as:

(Text.startswith(Prefix = "_T(\"") && Text.endswith(Postfix = "\")"))

In the new code the left part is saved in TMacroStringLiteral, and the right part is checked in ContinuationIndenter. Could you keep them together in FormatTokenLexer.
@alexfh, why are we checking for the right check at all? What would be an example where this is needed to disambiguate?

obfuscated added inline comments.Mar 29 2018, 11:33 AM
lib/Format/FormatTokenLexer.cpp
386

Are you talking about the code in ContinuationIndenter::createBreakableToken?

I don't think I understand how the hole thing works.
Using the debugger I can see that this function is executed first and then createBreakableToken.
So we first detect the tmacro in FormatTokenLexer::tryMerge_TMacro and then use this information in the createBreakableToken to do something with it.

So when I get a TMacroStringLiteral==true in createBreakableToken I know that the token starts with a TMacro and there is no need to use some king of startswith calls. Probably the endswith call is redundant and I can do just a string search backwards...

alexfh added inline comments.Apr 4 2018, 8:05 AM
include/clang/Format/Format.h
1676

"A list of macro names"?

1686–1689

According to the description above, the option contains "a vector of macros", and here it says "these are expected to be macros of the form ...". This all is rather confusing. The above description should refer to names of macros, and this example should be made clearer in what it demonstrates.

lib/Format/FormatTokenLexer.cpp
371

Consider using llvm::find, which is a range adaptor for std::find.

386

In the new code the left part is saved in TMacroStringLiteral, and the right part is checked in ContinuationIndenter. Could you keep them together in FormatTokenLexer.

+1 to keeping these conditions together.

@alexfh, why are we checking for the right check at all? What would be an example where this is needed to disambiguate?

I don't know whether there's reasonable code that would be handled incorrectly without checking for the "), but I'm sure some test case can be generated, that would break clang-format badly without this condition. It also serves as a documentation (similar to asserts).

alexfh requested changes to this revision.Apr 4 2018, 9:28 AM
This revision now requires changes to proceed.Apr 4 2018, 9:28 AM
obfuscated marked 3 inline comments as done.Apr 4 2018, 11:50 AM
obfuscated added inline comments.
lib/Format/FormatTokenLexer.cpp
371

No idea what is range adaptor, but I can probably switch to it... :)

386
In the new code the left part is saved in TMacroStringLiteral, and the right part is checked in ContinuationIndenter. Could you keep them together in FormatTokenLexer.

+1 to keeping these conditions together.

Can you please explain what you mean here?
And what am I supposed to do, because I don't know how to put these conditions together.
Do you want me to search the tmacro vector in ContinuationIndenter::createBreakableToken instead of checking the bool flag?

obfuscated marked an inline comment as done.Apr 10 2018, 7:12 AM

Ping?

alexfh added inline comments.Apr 10 2018, 8:11 AM
lib/Format/FormatTokenLexer.cpp
371

It works with a range (something that has .begin() and .end()) instead of two iterators.

386

Looking at this code again, I see what the checks for _T(" and ") were needed for. If there was a space between _T( and " or between " and ), the code in createBreakableToken() would probably not work correctly (see the FIXME). tryMerge_TMacro() looks at tokens and doesn't care about the whitespace, so checking for TMacroStringLiteral is not equivalent to checking for <macro-name>(" and "). In particular, if you only remove just the first check, a test case like _T ( "...") may start being formatted in some way. We could remove the second check as well, if we manage to make formatting decent. Otherwise, performing these checks (that there are no spaces between _T, (, "..." and )) in tryMerge_TMacro() may be a better option than doing this in createBreakableToken().