Page MenuHomePhabricator

[clang-format] Add a MacroExpander.

Authored by klimek on Jul 7 2020, 4:54 AM.



The MacroExpander allows to expand simple (non-resursive) macro
definitions from a macro identifier token and macro arguments. It
annotates the tokens with a newly introduced MacroContext that keeps
track of the role a token played in expanding the macro in order to
be able to reconstruct the macro expansion from an expanded (formatted)
token stream.

Made Token explicitly copy-able to enable copying tokens from the parsed
macro definition.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
MyDeveloperDay added inline comments.Jul 13 2020, 6:06 AM

are you using this?

Sorry for the long delay, I've made up for it with extra comments :-\

This looks really well-thought-out and I'm rationalizing my pickiness as:

  • this is conceptually complicated
  • I expect this code to live a long time and be read and "modified around" by lots of people

Some of the comments/requests for doc might strictly be more in scope for in later patches (documenting functionality that doesn't exist yet). Those docs would help *me* now but happy if you'd rather briefly explain and add them later.


this is a great example, it might be a little more clear with more distinct chars and some vertical alignment:

Given X(A)=[A], Y(A)=<A>,
                  X({ Y(0)  } ) expands as
                  [ { < 0 > } ]
StartOfExpansion  1   1
ExpandedFrom[0]   X X X X X X X
ExpandedFrom[1]       Y Y Y

You could extend this to cover all the fields and hoist it to be a comment on MacroContext if you like, I think the concreteness helps.


why the asymmetry between start/end?

given ID(x)=X, ID(ID(0)) yields 0 which starts and ends two expansions, right?
Consider making them both integer, even if you don't need it at this point.

(also 64 bits, really?)


this isn't used in this patch - can we leave it out until used?


if you're not extremely concerned about memory layout, I'd consider making this an Optional<MacroContext> with nullopt replacing the current MR_None.

This reduces the number of implicit invariants (AIUI MR_None can't be combined with any other fields being set) and means the name MacroContext more closely fits the thing it's modeling.



I guess it doesn't matter, but copyFrom would seem a little less weird to me in an OOP/encapsulation sense.
I do like this explicit form rather than clone() + move constructor though, as pointer identity is pretty important for tokens.


Tokens -> Expansion? (semantics rather than type)


Dmitri gave a tech talk on dropping comments like these :-)


who's responsible for establishing this?

AIUI this will fail if e.g. Macros contains a string that contains only whitespace, which is a slightly weird precondition.


assert instead? Caller checks this


this assumes the expansion is nonempty, which the grammar doesn't. while{} instead?


weird param name!


This is a slightly spooky buffer name - it's the magic name the PP uses for pasted tokens.
A closer fit for config is maybe "<command line>" (like macro definitions passed with -D).
Is it necessary to use one of clang's magic buffer names at all? If so, comment! Else maybe "<clang-format style>" or something?


is the caller responsible for checking the #args matches #params?
If so, document and assert here?

Looking at the implementation, it seems you don't expand if there are too few args, and expand if there are too many args (ignoring the last ones). Maybe it doesn't matter, but it'd be nice to be more consistent here.

(Probably worth calling out somewhere explicitly that variadic macros are not supported)


This doesn't depend on args, so we could compute this mapping when the Definition is constructed and encapsulate it there.

(Maybe performance doesn't matter, I'd also find this a little clearer. But if the allocation doesn't matter, we shouldn't be using SmallVector...)


skip the parameter -> treat the parameter as empty?
(My first guess was this meant given ID(X)=X, ID() would expand to X.)


nit: Result


"tokens that were not part of the macro argument" --> "tokens from the macro body"?


(I don't know exactly how this is used, but consider whether you mean "do not need to", "should not" or "cannot" here)


this threw me for a loop... it's EOF right?

It's not explicitly mentioned, so maybe either add a comment or && Result.back()->is(tok::eof).

This makes the size-2 less cryptic too.


Why not set StartOfExpansion in the same way, to avoid tracking the First state?

10 ↗(On Diff #275998)

I think this comment is too short (and doesn't really say much that you can't get from the filename). IME many people's mental model of macros is based on how they're used rather than how they formally work, so I think it's worth spelling out even the obvious implementation choices.

I'd like to see:

  • rough description of the scope/goal of the feature (clang-format doesn't see macro definitions, so macro uses tend to be pseudo-parsed as function calls. When this isn't appropriate [example], a macro definition can be provided as part of the style. When such a macro is used in the code, clang-format will expand it as the preprocessor would before determining the role of tokens. [example])
  • explicitly call out here that only a single level of expansion is supported, which is a divergence from the real preprocessor. (This influences both the implementation and the mental model)
  • what the MacroExpander does and how it relates to MacroContext. I think this should give the input and output token streams names, as it's often important to clearly distinguish the two. (TokenBuffer uses "Spelled tokens" and "Expanded tokens" for this, which is not the same as how these terms are used in SourceManager, but related and hasn't been confusing in practice).
  • a rough sketch of how the mismatch between what was annotated and what must be formatted is resolved e.g. -- just guessing here -- the spelled stream is formatted but for macro args, the annotations from the expanded stream are used.

(I'm assuming this is the best file to talk about the implementation of this feature in general - i'm really hoping that there is such a file. If there are a bunch of related utilities you might even consider renaming the header as an umbrella to make them easier to document. This is a question of taste...)

40 ↗(On Diff #275998)

You define "simple" in the patch description as non-recursive - can you just say "non-recursive" here? Or better spell out what that means (no macro can refer to another macro)

44 ↗(On Diff #275998)

nit: I think using is usually considered more readable

49 ↗(On Diff #275998)

This seems to precisely define the grammar but leave me guessing as to the semantics.
I'd at least suggest 'exp' -> 'expansion'.

Personally I'm partial to examples instead :-)

50 ↗(On Diff #275998)

"PI 3.14" or "NOT(X) !X" seems much less familiar than "PI=3.14" or "NOT(X)=!X" as accepted by -D.

It also resolves an ambiguity: is "DISCARD_ERROR ( void ) ( err )" an object or function macro?

The extra redundancy in the grammar should make it easier to detect errors too.

51 ↗(On Diff #275998)

this grammar allows object macros, but disallows function macros with no args. intended?

(FWIW this grammar also allows "ID(X X": an object macro "ID" which expands to "(X X". But the implementation, probably correctly, doesn't)

54 ↗(On Diff #275998)

The signature doesn't allow errors to be reported, which is a little unfortunate but seems hard to fix properly (so that errors are reported when the config is parsed) - the "style is a simple struct" is hard to reconcile with this data structure.

Silent discard on error should probably be mentioned in the constructor.

56 ↗(On Diff #275998)

Why are the macro definitions in an arbitrary specified encoding? I'd hope by the time we've parsed the config, our strings are UTF-8. (On disk, YAML can be UTF-16 per spec, but...)

62 ↗(On Diff #275998)

const (and an expand)

66 ↗(On Diff #275998)

(I can't see the real callsite but...)
If we care about performance here, is 8 a little small? should we have a vector &Out instead?

curdeius added inline comments.

Nit: typo "corresponding Definition".


Why isn't it defaulted?

klimek updated this revision to Diff 281611.Jul 29 2020, 8:34 AM
klimek marked 27 inline comments as done.

Addressed code review comments.

klimek added inline comments.Jul 29 2020, 8:37 AM

Changed to "Body".


I have no clue how this ever worked tbh O.O
Has been reworked as part of the move to use = to separate the macro signature from the body.


Copy-paste gone wrong I assume.


We need source locations, and apparently only:
<built-in>, <inline asm> and <scratch space>
are allowed to have source locations.


Added docs in the class comment for MacroExpander.
(so far I always expand, too few -> empty, too many -> ignore)


Replaced with "are not".

10 ↗(On Diff #275998)

Moved to Macros.h and added file comment that tries to address all of these.

66 ↗(On Diff #275998)

I don't think we particularly care about performance here, but the llvm docs say I should use SmallVector. Happy to bump down to 0 if you feel that the magic 8 is a problem here as a gut-feeling premature optimization.

Somehow I missed the email from your replies.

Mostly nits that you can take or leave, but I think potential bugs around functionlike-vs-objectlike and multiple-expansion of args.


"context" is often pretty vague - "MacroSource" isn't a brilliant name but at least seems to hint at the direction (that the FormatToken is the expanded token and the MacroSource gives information about what it was expanded from)

I don't feel strongly about this though, up to you.


nit: comment -> copyFrom


nit: banner is for wrong filename


this accepts FOO(A,B,)=... as equivalent to FOO(A,B)=.... Not sure if worth fixing.


(nit: I'd probably find this easier to follow as if (equal) else if (eof) else with parseTail inlined, but up to you)


uber-nit: seems like this loop belongs in the caller


nit: this is a copy for what seems like no reason - move Parser.parse() inline to this line?


lookup() returns a value, so this is a copy (with lifetime-extension)
I think you want *find


please use a different name for this variable, or the parameter it shadows, or preferably both!


nit: "part of a macro argument at multiple levels"?
(Current text suggests to me that it can be arg 0 and arg 1 of the same macro)


you're pushing here without copying. This means the original tokens from the ArgsList are mutated. Maybe we own them, but this seems at least wrong for multiple expansion of the same arg.


#define M(X,Y) X Y X

Will expand to:

  • 1, ExpandedArg, ExpandedFrom = [M, M] // should just be one M
  • 2, ExpandedArg, ExpandedFrom = [M]
  • 1, ExpandedArg, ExpandedFrom = [M, M] // this is the same token pointer as the first one

Maybe it would be better if pushToken performed the copy, and returned a mutable pointer to the copy. (If you can make the input const, that would prevent this type of bug)


Is this saying that the functionlike vs objectlike distiction is not preserved?

This doesn't seem safe (unless the *caller* is required to retain this information).

#define NUMBER int
using Calculator = NUMBER(); // does expansion consume () or not?

(Seems a little odd that these pointers to external FormatTokens aren't const... I can believe there's a reason though)

klimek added inline comments.Aug 19 2020, 9:26 AM

Ugh. I'll need to take a deeper look, but generally, the problem is we don't want to copy - we're mutating the data of the token while formatting the expanded token stream, and then re-use that info when formatting the original stream.
We could copy, add a reference to the original token, and then have a step that adapts in the end, and perhaps that's cleaner overall anyway, but will be quite a change.
The alternative is that I'll look into how to specifically handle double-expansion (or ... forbid it).

sammccall added inline comments.Aug 19 2020, 10:12 AM

(or ... forbid it).

I'm starting to think this is the best option.

The downsides seem pretty acceptable to me:

  • it's another wart to document: on the other hand it simplifies the conceptual model, I think it helps users understand the deeper behavior
  • some macros require simplification rather than supplying the actual definition: already crossed this bridge by not supporting macros in macro bodies, variadics, pasting...
  • loses information: one expansion is enough to establish which part of the grammar the arguments form in realistic cases. (Even in pathological cases, preserving the conflicting info only helps you if you have a plan to resolve the conflicts)
  • it's another wart to document:

Are there any others?

klimek added inline comments.Aug 20 2020, 2:32 AM

My main concern is that it's probably the most surprising feature to not support.

Just checking this is waiting on you rather than me...

If the multiple-expansion thing is blocking progress, I think we're much better off getting a limited version of this feature landed than losing momentum trying to solve it.

klimek updated this revision to Diff 292970.Sep 19 2020, 9:23 AM
klimek marked 7 inline comments as done.

Worked in review comments.


MacroSource sounds like it is about the macro source (i.e. the tokens written for the macro).
I'd be happy to rename to MacroExpansion or MacroExpansionInfo or somesuch if you think that helps?


We're generally accepting too much; I'd either want to restrict it fully, or basically be somewhat minimum/forgiving. Given that we can't get errors back to the user, I was aiming for the latter.


I basically like having the implementation match the BNR.
That said, not feeling strongly about it. You're saying you'd duplicate the Def.Body.push_back in the if (eof)?

if (Current->is(tok::equal) {
  // inline parseTail
} else if (Current->is(tok::eof) {
} else {
  return false;

Generally, I personally find it easier to read the early exit.


Reason is that we need the name.


Forbade multi-expansion.


Fixed. I thought we'd get away without it, but it's simple enough to fix and we have enough suprises as is.


We modify the tokens by adding the macro context.

sammccall accepted this revision.Sep 22 2020, 11:33 AM

Ship it!


Oops, accidental "source" pun. MacroOrigin would be another name in the same spirit. But MacroExpansion sounds good to me, too.


oops, right.
std::move() the RHS? (mostly I just find the copies surprising, so draws attention)


nit: this is confusingly a const reference to a non-const pointer... auto * or FormatToken *?


may want a test that uses of an arg after the first are not expanded, because that "guards" a bunch of nasty potential bugs


I guess clang-tidy wants ..._TESTLEXER_H here

This revision is now accepted and ready to land.Sep 22 2020, 11:33 AM
This revision was landed with ongoing or failed builds.Sep 25 2020, 5:09 AM
This revision was automatically updated to reflect the committed changes.
klimek marked 5 inline comments as done.
nridge added a subscriber: nridge.Sep 28 2020, 1:13 PM

What does this change mean for users of clang-format -- better formatting of complicated (e.g. multi-line) macro invocations?

What does this change mean for users of clang-format -- better formatting of complicated (e.g. multi-line) macro invocations?

Nothing from this change is exposed yet, it's part of a series.
The end goal is as you say: perfect formatting of code in and around arbitrary macros, by passing (simplified) macro definitions as configuration.

D88299 is next, Manuel assures me it gets easier from there :-)

What does this change mean for users of clang-format -- better formatting of complicated (e.g. multi-line) macro invocations?

In addition to what Sam said, this also attempts to be an improvement in maintainability. Given this is a fairly complex change, you might ask how this helps :)
The idea is that we bundle the complexity of macro handling in a clearly separated part of the code that can be tested and developed ~on its own.
Currently, we have multiple macro regex settings that then lead to random code throughout clang-format that tries to handle those identifiers special.
Once this is done, we can delete all those settings, as the more generalized macro configuration will supersede them.


Yikes, thanks for catching!


Discussed offline: the above test tests exactly this.