This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Add MacroUnexpander.
ClosedPublic

Authored by klimek on Sep 25 2020, 6:28 AM.

Details

Summary

MacroUnexpander applies the structural formatting of expanded lines into
UnwrappedLines to the corresponding unexpanded macro calls, resulting in
UnwrappedLines for the macro calls the user typed.

Diff Detail

Event Timeline

klimek created this revision.Sep 25 2020, 6:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 25 2020, 6:28 AM
klimek requested review of this revision.Sep 25 2020, 6:28 AM

This is magnificently difficult :-)
I've sunk a fair few hours into understanding it now, and need to send some comments based on the interface+tests.
Some of these will be misguided or answered by the implementation, but I'm afraid I can't fit it all into my head (on a reasonable timeframe) until I understand it better.

clang/lib/Format/FormatToken.h
500

I can't really understand from the comment when this is supposed to be set, and there are no tests of it.

(The comment is vague: is a "parent" the inverse of FormatToken::Children here? Is this scenario when the parents in question are new, or their children are new, or both? What part of the code is "formatting", and why would it otherwise skip the children?)

clang/lib/Format/Macros.h
135

"matches formatted lines" probably describes the hard technical problem it has to solve, but not so much what it does for the caller: what the transformation is between its inputs and its outputs.

Is it something like:

Rewrites expanded code (containing tokens expanded from macros) into spelled code (containing tokens for the macro call itself). Token types are preserved, so macro arguments in the output have semantically-correct types from their expansion. This is the point of expansion/unexpansion: to allow this information to be used in formatting.

[Is it just tokentypes? I guess it's also Role and MustBreakBefore and some other stuff like that?]

143

I'm a bit confused by these arrows. It doesn't seem that they each point to an unwrappedline passed to addLine?

143

This example didn't really help me understand the interface of this class, it seems to be a special case:

  • the input is a single block construct (rather than e.g. a whole program), but it's not clear why that's the case.
  • the output (unexpanded form) consists of exactly a macro call with no leading/trailing tokens, which isn't true in general

If the idea is to provide as input the minimal range of lines that span the macro, we should say so somewhere. But I would like to understand why we're not simply processing the whole file.

148

this says "creates the unwrapped lines" but getResult() returns only one.
Does the plural here refer to the tree? Maybe just use singular or "the unwrappedlinetree"?

155

I get the symmetry between the expander/unexpander classes, but the name is making it harder for me to follow the code.

  • the extra compound+negation in the name makes it hard/slow to understand, as I'm always thinking first about expansion
  • the fact that expansion/unexpansion are not actually opposites completely breaks my intuition. It also creates two different meanings of "unexpanded" that led me down the garden path a bit (e.g. in the test).

Would you consider MacroCollapser? It's artificial, but expand/collapse are well-known antonyms without being the same word.

(Incidentally, I just realized this is why I find "UnwrappedLine" so slippery - the ambiguity between "this isn't wrapped" and "this was unwrapped")

165

I find this hard to follow.

  • again, the "match" part is an implementation detail that sounds interesting but isn't :-)
  • "from a macro" sounds like one in particular, but is actually every macro
  • the bit about "getResult" is a separate point that feels shoehorned in

What about:
"Replaces tokens that were expanded from macros with the original macro calls. The result is stored and available in getResult()"

176

how can this be the case if the input can have multiple lines and the output only one?

Is the return value a synthetic parent of the translated lines?
Or is there a hidden requirement on the caller here that we don't keep feeding in lines unless we're continuing a macro call and therefore know we'll end up with one line?

This stuff could be clarified in docs but again I have to ask, can't we sidestep the whole issue by processing the whole file and returning all the lines?

(This is somewhat answered in the implementation, though that doesn't seem like the right place)

238

The explanation here seems to be proving the converse: if we *didn't* use this representation, then the code wouldn't work. However what I'm missing is an explanation of why it is correct/sensible.

After staring at the tests, I'm still not sure, since the tests seem to postprocess the "natural" output the same way before asserting equality.

My tentative conclusion is it would be clearest to move this "in the end" step to the *caller* of getResult(), as it seems to have more to do with formatting than unexpansion. But I haven't looked in detail at that caller, maybe I'm wrong...

clang/unittests/Format/MacroUnexpanderTest.cpp
1 ↗(On Diff #294294)

All of these tests use both the expander and unexpander (so need consider behavior/bugs of both).
Is it possible to test the unexpander in isolation?

(Context: I'm trying to use the tests to understand what the class does, but the inputs aren't visible)

1 ↗(On Diff #294294)

Having read all the tests here, they seem to follow exactly the same schema:

  1. some macro definitions
  2. some sequence of expand() calls, simulating expanding some macro-heavy code
  3. verify the expanded token sequence, rearranging it into expected UnwrappedLine structure
  4. call unexpand() and check that the result has the expected UnwrappedLine structure

You do this using various fairly-general helpers and DSLs, but don't really combine them in different ways... the tests are somewhat readable and error messages are OK, but if these are important tests, it might be worth looking at a data-driven test. e.g.

Case NestedChildBlocks;
NestedChildBlocks.Macros = {"ID(x)=x", "CALL(x)=f([] { x })"};
NestedChildBlocks.Original = "ID(CALL(CALL(return a * b;)))";
NestedChildBlocks.Expanded = R"cpp(
{
f([] {
  f([] {
    return a * b;
  })
})
}
)cpp"; // indentation shows structure
NestedChildBlocks.Unexpanded = R"cpp(
ID(
  {
  CALL(CALL(return a * b;))
  }
)
)cpp";
NestedChildBlocks.verify();

Definitely involves a bit of reinventing wheels though.

20 ↗(On Diff #294294)

docs for this class/major members?

28 ↗(On Diff #294294)

this name (and all the other "Unexpanded*" in this class) is misleading, because there's a process called "unexpanding" but these aren't the output of it.

I'd suggest "spelled", though I do think renaming the unexpander would also be worthwhile.

93 ↗(On Diff #294294)

this needs docs (it's a cool technique! no need to make the reader decipher it though)

96 ↗(On Diff #294294)

you always consume(lex(...)) - taking a StringRef directly instead would make it clearer that the identity of the temporary tokens doesn't matter

125 ↗(On Diff #294294)

create() is a strange name for the *expander* in an *unexpander* test

141 ↗(On Diff #294294)

FWIW, this seems confusing to me - line() has overloads that are simple, and then this one that does the same nontrivial stitching that the production code does.
The fact that the a/b lines are not sibling sub-lines in line({tokens("a"), tokens("b")})but they *are* sibling tokens in line(lex("a b")) is hard to keep track of in the tests.

If the stitching really is necessary, I think it's important for the expected output to also be shown in its stitched form.

155 ↗(On Diff #294294)

why is this "tokens" and not "chunk"?

207 ↗(On Diff #294294)

you have lots of assertions that this is true, but none that it is false

242 ↗(On Diff #294294)

the high-level point of this test seems to be that by looking the post-expansion context, we can determine that b is a declared pointer so we don't put a space before it.

And everywhere these tokens are mentioned/verified here, the spacing is correct, as if it were propagated... but of course the spacing is actually ignored everywhere and underneath it's just sequences of tokens.
This makes it hard to track how well this it testing what it really wants to test.

Is it possible to mark the asterisk with the correct tokentype at the point where formatting would occur, and then verify that the unexpanded form (i.e. Chunk1.Tokens[1]) still has the tokentype set?
(It's probably possible to prove this by inspection by understanding what U1 contains, how Matcher works etc, but I think it'd still be illustrative)

245 ↗(On Diff #294294)

what does id() mean if not identifier?

406 ↗(On Diff #294294)

because this example isn't realistic, it'd be nice to have the comment here showing what formatting you're simulating

klimek updated this revision to Diff 301253.Oct 28 2020, 5:37 AM
klimek marked 3 inline comments as done.

Adapted based on review comments.

clang/lib/Format/FormatToken.h
500

Rewrote comment.

clang/lib/Format/Macros.h
135

It's not the token info, this we'd trivially have by using the original token sequence which is still lying around (we re-use the same tokens).

Reworded.

143

Why not? That is the intention? Note that high-level we do not pass class definitions in one unwrapped line; if they ever go into a single line it's done in the end by a special merging pass (yes, that's a bit confusing).

143

Re: input is a single construct - that is not the case (see other comment)
Re: leading / trailing tokens: I didn't want to make it too complex in the example.

148

Fixed comment.

155

Happy to rename, but first wanted to check in - I use "unexpanded token stream" quite often to refer to the macro call. Perhaps we should also find different wording for that then?

Perhaps we should call this MacroLineMatcher btw? This is not creating anything new - the resulting token sequence is the "unexpanded token sequence" with the exact same tokens, the special thing is that they're matched into unwrapped lines.

165

I think the match part is important, as it's matching unwrapped lines, which is the heart of the algorithm.

176

Reworded; the reason why we have the single-line anyway is that:

  • a macro call is something we generally want to have in one unwrapped line
  • the tokens (or other macro calls) that go into the same instance of MacroUnexpander only consist of tokens that do not have an unwrapped line break and macro calls

Thus, we want the output to be in a single unwrapped line, as we're otherwise majorly confusing ~everything else in the formatter.

238

This is basically what I wrote before - in the end, that the expanded code creates multiple unwrapped lines is the weird thing, as the input is fundamentally a single unwrapped line (the macro call plus a bit of stuff around it). Thus, it's quite natural for the unexpander to return a single unwrapped line that represents the original structure. Not sure how to best put this in words.

clang/unittests/Format/MacroUnexpanderTest.cpp
1 ↗(On Diff #294294)

Not sure I understand what you mean - everything that's interesting about the inputs is spelled out in the test - namely, what the structure of unwrapped lines going into the unexpander is.

1 ↗(On Diff #294294)

It seems like the main thing this does is replacing the structure how we build unwrapped lines with a DSL that gets parsed into unwrapped lines by indentation? I personally find that significantly less readable unless we'd create really good error messages if the indentation doesn't line up. In your example "Unexpanded" I have problems understanding exactly what goes into what unwrapped line intuitively - for example, {} can be one unwrapped line or 2 different ones. How do we decide when an unwrapped line is finished?

28 ↗(On Diff #294294)

This comment made it clear why Unwrapper is a really bad name, because it's also not about collapsing.

96 ↗(On Diff #294294)

Yeah, I thought that I want Matcher and the test to be more decoupled, but passing in the Lexer is not a big thing, so changed.

141 ↗(On Diff #294294)

Would renaming tokens() to chunk() help with that? The idea is that in the test I mostly need to create a single line from chunks of tokens. Also happy to rename this function if it helps more?

155 ↗(On Diff #294294)

My thought was that Chunk is just a type for a chunk of a line, and I can create one from a set of tokens (via tokens()) or from a set of child unwrapped lines (via children()).

207 ↗(On Diff #294294)

Yeah, that's a white-box assertion - finished() is false the vast majority of the time, so testing the true cases is the important part - happy to add tests if you think it's worth it.

242 ↗(On Diff #294294)

I don't care about anything about the tokens than that they have pointer identity and that the same tokens end up in the right unwrapped lines in the result (thus match also checks token identity).

245 ↗(On Diff #294294)

Sigh, yeah, good point - initially this was used for identifiers, but it really means "lex exactly one token". Do you have a good name that is short (it's used really often) and descriptive?

406 ↗(On Diff #294294)

As before, formatting is (to me) not relevant here - what's relevant is that we don't crash and the resulting unwrapped lines contain the right tokens.

Thanks a lot for all the time explaining, I get this at least at a high level now.
I have trouble following the ins and outs of the algorithm, but I think I understand most of it and tests cover the rest.

Regarding tests: I have trouble reasoning about whether all cases are tested. I wonder if we can easily throw this against (internal) coverage + mutation testing infrastructure? I'm not a massive believer in that stuff usually, but this seems like a good candidate.

Lots of comments around making this easier for confused people like me to understand, partly because I don't understand it well enough to suggest more substantive changes.
Throughout, I think it'd be nice to be explicit about:

  • which structure tokens/lines are part of: spelled/expanded/unexpanded. (Including getting rid of input/output/incoming/original, which AFAICT are synonyms for these)
  • which lines are complete, and which are partial (being added to as we go)

It took me a while to understand the control flow and to tell "who's driving" - I think it was hard for me to see that processNextUnexpanded was basically a leaf, and that unexpand/startUnexpansion/endUnexpansion/continueUnexpansionUntil were the layer above, and add() was the top. Maybe there's naming that would clarify this better, but I don't have great suggestions, and I'm not sure if this is a problem other people would have.
Maybe an example trace showing the internal method calls when parsing a small example might help... but again, not sure.

clang/lib/Format/MacroUnexpander.cpp
52 ↗(On Diff #301253)

this doesn't use any state, right?

it could be static or even private to the impl file.

replacing std::function with a template param allow this loop to be specialized for the one callsite - up to you, maybe it doesn't matter much, but it's not very invasive

77 ↗(On Diff #301253)

I can't work out what "it" refers to in this sentence.

(and "spelled" token stream?)

82 ↗(On Diff #301253)

It would be helpful to complete the example by spelling out which token you're adding, which the correct parent is, and which tokens you need to "expand over" to make it available.

I *think* the answer to the first two is - when you're adding the a then its proper parent is the inner ( in BRACED(BRACED(... but I don't know the last part.

263 ↗(On Diff #301253)

maybe unexpandActiveMacroUntil or so?

Something to hint that this stops at the end of the top-of-stack macro.

(and to avoid the term "unexpansion stream" which isn't used/defined anywhere else)

269 ↗(On Diff #301253)

nit: while

273 ↗(On Diff #301253)

nit: this assert is just the opposite of the while condition, drop it?

322 ↗(On Diff #301253)

want to keep this?

391 ↗(On Diff #301253)

nit: finished()

clang/lib/Format/Macros.h
25–29

This would be a good place to explicitly introduce the name "unexpanded" for what comes out of the unexpander.

This para gives names to the token streams, but not as clearly to the UnwrappedLines parsed out of them.

I think the fact that the tokens *alias* between the streams/lines, and so the final formatting of the expanded lines "writes through" tokentype etc to the unexpanded lines, is an important design point worth emphasizing.

(This part is mostly structure around "what happens", with the data sets secondary - I think I'd personally find the reverse easier to follow but YMMV)

30–31

would s/formats/annotates/ be inaccurate?

This is just my poor understanding of the code, but it wasn't obvious to me that annotation is closely associated with formatting and not with parsing UnwrappedLines.

136

I know it's the common case, but I think saying "the macro call" here is misleading, because it quickly becomes apparent reading the code that the scope *isn't* one macro call, and (at least for me) it's easy to get hung up on not understanding what the scope is. (AIUI the scope is actually one UL of *output*... so the use of plural there is also confusing).

I think a escription could be something like:

Converts a sequence of UnwrappedLines containing expanded macros into a single UnwrappedLine containing the macro calls.
This UnwrappedLine may be broken into child lines, in a way that best conveys the structure of the expanded code.
...
In the simplest case, a spelled UnwrappedLine contains one macro, and after expanding it we have one expanded UnwrappedLine.
In general, macro expansions can span UnwrappedLines, and multiple macros can contribute tokens to the same line.
We keep consuming expanded lines until:

  • all expansions that started have finished (we're not chopping any macros in half)
  • *and* we've reached the end of a *spelled* unwrapped line

A single UnwrappedLine represents this chunk of code.

After this point, the state of the spelled/expanded stream is "in sync" (both at the start of an UnwrappedLine, with no macros open), so the Unexpander can be thrown away and parsing can continue.

(and then launch into an example)

143

Re: input is a single construct - that is not the case (see other comment)

A class is definitely a single construct :-) It sounds like that's not significant to the MacroUnexpander though, so it feels like a red herring to me.

Re: leading / trailing tokens: I didn't want to make it too complex in the example.

That seems fine, I think the complexities of the general case need to be mentioned somewhere because the API reflects them. But you're right, the primary example should be simple.
I think a tricky example (maybe the #define M ; x++ one?) could be given on one of addLine/finish/getResult maybe.

155

I think unexpanded/unexpander are reasonable names, having understood this better, but with caveats.

It's important to distinguish between the pre-expansion state ("spelled"?) and the post-unexpansion state ("unexpanded?").
The UnwrappedLines are vitally different, but the token stream is the same as you point out.

When referring to the token stream, I think "spelled" is probably less confusing (that's where the token stream fundamentally comes from), and explicitly mentioning somewhere that the token sequence encoded by the unexpanded lines is the same original spelled stream.

175

const? or do we not care

175

Maybe comment that when finished() is true, it's possible to call getResult() and stop processing... but also valid to continue calling addLine(), if this isn't a good place to stop.

183

Maybe a note like "this representation is chosen because it can be opaque to the UnwrappedLineParser, but the Formatter treats it appropriately" or something.
I think it should be clear that this representation isn't really "natural" at this layer (or if it is, why).
Maybe an example would help.

190

ASCII art of some sort would help :-)

193

nit: giving getResult() a side-effect but also making it idempotent is a bit clever/confusing.

Either exposing void finalize(); + UnwrappedLine get() const, or UnwrappedLine takeResult() &&, seems a little more obvious.

287

consider calling these NextSpelled and EndSpelled to be really explicit?
Since the type doesn't really clarify which sequence is being referred to.

292

I think this is a confusing use of "unexpanded".

These macros that we're in the process of unexpanding. So the past tense doesn't seem right, they don't seem more "unexpanded" than they do "expanded", at least to me.

Maybe ActiveExpansions or so?

clang/unittests/Format/MacroUnexpanderTest.cpp
541 ↗(On Diff #301253)

nit: Result

sammccall added inline comments.Feb 3 2021, 7:56 AM
clang/lib/Format/MacroUnexpander.cpp
66 ↗(On Diff #301253)

I find using all of spelled/expanded/unexpanded, input/incoming/output/outgoing, and original, unclear.
(Particularly after OriginalParent is propagated around by that name without comment, and because "original" refers to the expanded structure, which is *not* the first one created)
Can we call this "ExpandedParent in the expanded unwrapped line"?

66 ↗(On Diff #301253)

Unexand -> unexpand

80 ↗(On Diff #301253)

(in these examples throughout, #define BRACED(a) {a} would make it clearer that this is a macro definition and not code)

84 ↗(On Diff #301253)

is it possible that you may need to unexpand more than the innermost macro?

e.g. BRACED(ID(ID(BRACED(ID(ID(a)))))) expands to {{a}} and the parents of a and inner { each come from a couple of macro-levels up.

(I don't totally understand the logic here, so the answer's probably "no"...)

103 ↗(On Diff #301253)

I had trouble understanding the what this function does at a high level: i.e. *why* you'd want this.

Maybe:

Adjusts the stack of active (unexpanded) lines so we're ready to push tokens.
The tokens to be pushed are children of ExpandedParent (in the expanded code).
This may entail:
 - creating a new line, if the parent is on the active line
 - popping active lines, if the parent is further up the stack

and s/First/ForceNewLine/ to avoid documenting it?

(impl looks good though once I understood what it does!)

156 ↗(On Diff #301253)

nit: s/find/lookup/, then you only have to deal with pointers and this becomes a bit easier to read IMO

173 ↗(On Diff #301253)

This raises another point: a macro can have an empty body.
AFAICT this fundamentally isn't supported here, as we're driven by the expanded token stream.
I guess it makes sense to handle this elsewhere in clang-format (or even not handle it) but it should be documented somewhere.

237 ↗(On Diff #301253)

assert that this number is equal to StartOfExpansion?

238 ↗(On Diff #301253)

nit: index arithmetic obscures what's going on a bit.
You could write

ArrayRef<FormatToken *> StartedMacros = makeArrayRef(Token->MacroCtx->ExpandedFrom).drop_front(Unexpanded.size());
for (FormatToken *ID : llvm::reverse(StartedMacros)) {
...
}

but up to you

It's not obvious to me *why* we're iterating in reverse order BTW: i.e. why the order of the Unexpanded stack is opposite the order of the ExpandedFrom stack.
So maybe just a comment to reinforce that, like (// ExpandedFrom is outside-in order, we want inside-out)

282 ↗(On Diff #301253)

nit: Token->MacroCtx is checked at the callsite, and startUnexpansion asserts it as a precondition - do that here too for symmetry?

303 ↗(On Diff #301253)

finishe -> finish

310 ↗(On Diff #301253)

I can't work out if this is supposed to say comma or comment :-)

If comma - is that a thing?
If comment - why would a comment be considered part of the unexpansion of a macro invocation, rather than a sibling to the macro? How do we know what trails is a comment - should we have an assertion?

319 ↗(On Diff #301253)

describe the return value, it's not obvious

345 ↗(On Diff #301253)

nit: easier to see the three cases being handled independently (comma, rparen, lparen) if they're all siblings instead of grouping two together.

348 ↗(On Diff #301253)

This line is both subtle and cryptic :-).

// New unwrapped-lines from unexpansion are normally "chained" - treated as
// children of the last token of the previous line. 
// Redirect them to be treated as children of the comma instead.
// (only affects lines added before we push more tokens into MacroStructure.Line)
366 ↗(On Diff #301253)

This feels like a stupid question, but how do we know that this non-macro-parenthesis has anything to do with macros?
(Fairly sure the answer is "processNextUnexpanded() is only called when the token is macro-related", it'd be nice to have this precondition spelled out somewhere)

367 ↗(On Diff #301253)

please assign to the struct fields or use /*Foo=*/ comments

All the data flow around this struct is local but the order of fields isn't :-)

369 ↗(On Diff #301253)

nit: = MacroCallStructure.back().RedirectParentTo
This line + line 361 are the keys to understanding what the RedirectParentFrom/To do, so it'd be helpful for them to explicitly use those variables.

(or use the source expressions for both... but in that case the fields may need docs)

396 ↗(On Diff #301253)

nit: pull out a named reference for Output.Tokens.front() after the size assertion.

assert(NullToken.is(tok::null) && !NullToken.children.empty()) may even obviate the need for a comment :-)

398 ↗(On Diff #301253)

resulting

406 ↗(On Diff #301253)

hmm, when is this possible? are these literal blank lines? where do they come from?

409 ↗(On Diff #301253)

Isn't this just the end of line from the previous iteration of this loop? Why the global map populated by pushToken?
(I feel like i'm missing something obvious here)

411 ↗(On Diff #301253)

nit: again .lookup()

423 ↗(On Diff #301253)

nit: i'd consider calling this appendToken, since lots of the stuff in this class deals with stacks but this doesn't)

(I say a lot of nasty things about java but ArrayList.add is 1000x better name than vector::push_back)

clang/lib/Format/Macros.h
164

any reason for std::map rather than DenseMap, here and elsewhere? (Only good reasons i'm aware of are sorting and pointer stability)

219

you have this "no expansions are open, but we already didn't find any" state.
The effect of this is that finished() returns false so the caller keeps looping.

But a correct caller will never rely on this:

  • the first line a caller feeds must have macro tokens in it, or our output will be garbage AFAICT
  • calling getResult() without feeding in any lines is definitely not correct

It seems we could rather assert on these two conditions, and eliminate the Start/InProgress distinction.
That way incorrect usage is an assertion crash, rather than turning into an infinite loop.

221

similarly, the InProgress/Finalized distinction would be eliminated by making takeResult() destructive, and requiring (through the type system or an assert) that it be called only once.
It doesn't seem that it needs to be part of the NDEBUG runtime control flow.

250

This is very closely related to what you return from "getResult" - not quite the same, but Output vs Result doesn't seem to hint at the difference. Could we use the same name for both?

254

ActiveUnexpandedLines?

(Line is very overloaded here)

279

ParentInExpandedToParentInUnexpanded?

(current name implies that it maps a token to *its* parent. It also uses the input/output names, rather than expanded/unexpanded - it would be nice to be consistent)

315

nit: if you're going to specify SmallVector sizes, I don't understand why you'd size Unexpanded vs MacroCallStructure differently - they're going to be the same size, right?

These days you can leave off the size entirely though, and have it pick a value

clang/unittests/Format/MacroUnexpanderTest.cpp
202 ↗(On Diff #301253)

There's no test for a macro that expands to nothing - is this supported?

207 ↗(On Diff #294294)

Yes - a basic test that it's not always true would be useful I think (maybe the #define M ;x++ case would be useful for showing the expected loop and finished() values)

klimek updated this revision to Diff 390057.Nov 26 2021, 7:32 AM
klimek marked 49 inline comments as done.

Work in review comments.

Noticed I should have waiting with the renaming of the file until the review is done :( Sorry for the extra confusion.

clang/lib/Format/MacroUnexpander.cpp
77 ↗(On Diff #301253)

Changed to "given token" - it refers to the token.

It's reconstructed, not spelled, like the example below explains: we do have hidden tokens that we want to find when we're in the middle of a recursive expansion. I wouldn't call the hidden tokens here "spelled" - would you?

82 ↗(On Diff #301253)

Found out that I was missing a unit test. Added a unit test, and now explained the unit test here in the comment. PTAL.

84 ↗(On Diff #301253)

A token on a higher level must always also be there on a lower level.
Calls in this example are:
BRACED({a})
ID({a})
ID({a})
BRACED(a)
ID(a)
ID(a)
When the next token comes in, we thus always find it in the higher level (open) macro call. When we reconstruct for that token, we then reconstruct multiple macro calls at once, if it is the first token in multiple macro calls.

173 ↗(On Diff #301253)

I think the right point to document and handle it is at the next layer, where we integrate all of this into clang-format.

238 ↗(On Diff #301253)

Oooh, this is nice. s/drop_front/drop_back/. Added comment.

282 ↗(On Diff #301253)

Thanks for spotting, this was from a time where the code looked differently (the assert above also didn't make sense any more in that form).

310 ↗(On Diff #301253)

Added words.

Re: the trailing comment, that's an idiosyncrasy of who clang-format handles trailing comments - it parses them with the token preceding them. We could probably spend more complexity on the call side, trying to break them out, but I'm not sure that's better, given that this point already needs to be fairly reliable against all user code.

348 ↗(On Diff #301253)

This is partially captured in the comment above. Added a shorter description here, let me know if that's not enough.

366 ↗(On Diff #301253)

Added a comment above the restructured !Token->MacroCtx check.

367 ↗(On Diff #301253)

Added a constructor.

369 ↗(On Diff #301253)
  1. Line 361 was not actually necessary; I first thought that was a bug, but then dug in, and noticed that we weren't able to hit that code path (anymore, after restructuring)
  1. Completely renamed the members and documented them, added some more asserts to make things hopefully clearer - thanks for pointing it out, those were clearly underdocumented; I hope it's now more clear.
406 ↗(On Diff #301253)

Looks like you're right; this looks like it was left over from a previous different structure, I'll make sure to fuzz it thoroughly.

409 ↗(On Diff #301253)

Nice catch, the complexity here was also from a previous iteration where the structure of the algorithm was (even) more complex, where we needed to also do stitching for non-top-level lines. Now able to completely get rid of PreviousToken and TokenToPrevious \o/

clang/lib/Format/Macros.h
30–31

Said "annotates and formats" now - yes, the fact that annotating and formatting is so coupled is a admittedly weird choice of the initial design.

136

Thanks, that's a really good write-up!

143

#define M ; x++ seems to be similarly tricky to #define CLASSA(x) class A x
We get multiple calls to addLine, in between which finished() is false.

193

Done.

219

Changed to asserts.

279

Called it SpelledParentToReconstructedParent.

287

Called them SpelledI and SpelledE in llvm tradition.

315

I did not know that, that's awesome!

clang/unittests/Format/MacroUnexpanderTest.cpp
207 ↗(On Diff #294294)

We got a couple of tests like that, added checks for negative finished in between.

sammccall accepted this revision.Jul 11 2022, 10:46 AM

OK, I think we've reached the point where:

  • the impact is very clear, this solves a whole class of clang-format's biggest problems
  • the idea clearly works (there are good tests)
  • the implementation is very well-documented: I can't really improve my understanding further by asking for things to be better explained
  • I can't make clear suggestions to simplify - you've applied all my low-level suggestions and my high-level understanding is poor
  • I still don't feel like I really understand how it works, but that's not really different than the other big pieces of clang-format

So I think all that's left to do here is ship it. It makes me uneasy that the core of clang-format is functionally magic (could anyone other than you and Daniel reproduce it after nuclear apocalypse?) but this doesn't really change that state.

clang/lib/Format/MacroCallReconstructor.cpp
54

if you want to keep these LLVM_DEBUGs, maybe this should be "MCR" or so instead of "unex"?

63

you might want an assertion that Result has one token with one child (it's pretty obvious in finalize() but less so here)

99

nit: I think this would be clearer by naming the result:
if (bool PassedMacroComma = reconstruct...)

(because it's not clear from the name what the function returns, and documenting it would help only a little)

109

(this is the else branch of a negated condition, consider swapping the branches to avoid double-negative)

115

liens -> lines

(unless there's a *really* weird metaphor going on here!)

233

this FIXME looks obsolete?

clang/lib/Format/Macros.h
190

nit: getResult()->takeResult() in comment now

201

you could give this a name like "tail form", and then refer to it in docs of MacroCallReconstructor::Result, in MacroCallReconstructor.cpp:482, etc.

Up to you.

This revision is now accepted and ready to land.Jul 11 2022, 10:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2022, 10:46 AM
klimek marked 7 inline comments as done.Jul 11 2022, 12:48 PM
klimek added inline comments.
clang/lib/Format/Macros.h
201

I'm somewhat unhappy with the term "tail form"; happy to do this in a subsequent change if we find a better name.

klimek updated this revision to Diff 443729.Jul 11 2022, 12:57 PM

Address review comments.

This revision was landed with ongoing or failed builds.Jul 12 2022, 12:12 AM
This revision was automatically updated to reflect the committed changes.
nridge added a subscriber: nridge.Jul 18 2022, 12:44 PM

Does this patch change the formatting behaviour of clang-format?

If so, are there any test cases that show before/after formatting? The MacroCallReconstructorTest unit test seems like it's testing an internal interface.

Does this patch change the formatting behaviour of clang-format?

If so, are there any test cases that show before/after formatting? The MacroCallReconstructorTest unit test seems like it's testing an internal interface.

No, this is prep-work for the real change, which I'm planning to send out soon.

Thanks! (I was intrigued by Sam's "solves a whole class of clang-format's biggest problems" comment :-))

Thanks! (I was intrigued by Sam's "solves a whole class of clang-format's biggest problems" comment :-))

The end-result hopefully will :)

HazardyKnusperkeks added a project: Restricted Project.Jul 27 2022, 7:57 AM

A bit surprising such a big change...

sstwcw added a subscriber: sstwcw.Jul 28 2022, 6:10 PM

It looks like some of the braces in the code should be removed for example those surrounding one-line for bodies. Sorry if it is not my job to point this out, but MyDeveloperDay has not said anything.

@sstwcw thanks for pointing it out. See D134329.

@sstwcw thanks for pointing it out. See D134329.

Thanks Owen, really appreciate this! And sorry for not getting to it yet myself :(

clang/lib/Format/MacroCallReconstructor.cpp
223