This is an archive of the discontinued LLVM Phabricator instance.

[Pseudo] Token/TokenStream, PP directive parser.
ClosedPublic

Authored by sammccall on Feb 7 2022, 10:15 AM.

Details

Summary

The TokenStream class is the representation of the source code that will
be fed into the GLR parser.

This patch allows a "raw" TokenStream to be built by reading source code.
It also supports scanning a TokenStream to find the directive structure.

Next steps (with placeholders in the code): heuristically choosing a
path through #ifs, preprocessing the code by stripping directives and
comments, cooking raw_identifiers.
These will produce a suitable stream to feed into the parser proper.

Diff Detail

Event Timeline

sammccall created this revision.Feb 7 2022, 10:15 AM
sammccall requested review of this revision.Feb 7 2022, 10:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2022, 10:15 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This needs some more detailed unittests than the smoke lex.test added so far.

But I wanted to get a bit of feedback first on whether the general scope/layout is reasonable.
It'd be very possible to split off the PP support out of this patch, I just wasn't sure if what remained was 'interesting' enough.

(LMK if having the tests earlier would help you understand/evaluate what's there)

sammccall updated this revision to Diff 407512.Feb 10 2022, 6:22 AM

Add unit tests.
Clarify finalization is about the mutability of the *sequence*. Add assertions.
Include cooking & cleaning of tokens. (In our prototype this was after PP but I
don't think that makes much sense).
Drop fields/accessors that aren't used yet.
Fix a few bugs revealed by tests.

Thanks, this looks good in general.

I only reviewed the token part, but don't want to block the review. I think it might be nice to split out the preprocessor part from this patch, this would make the scope clearer and review efficient (sorry, quite busy with other stuff this week).

clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h
104
clang/include/clang/Tooling/Syntax/Pseudo/Token.h
61

nit: add default init value.

65

nit: mention this is the line number for the start of token, a token can be multiple-line.

I think the meaning of line number is token-stream-dependent, right? For a lex token-stream, it is raw line number; For a cook token stream, it might take the backslash continued line into account?

90

nit: unsigned -> Token::Index

I'd probably drop it (looks like only one place using it, inclining it should be fine), the name empty is usually a method checking the Range is empty rather than creating an empty Range.

129

nit: isFinalized assert?

131

>= => >, since we have checked above.

144

the mutated version seems not used?

145

assert(isFinalized())

156

I think it is probably good/clearer to have a comment like "Storage = eof + Tokens + eof" around here.

164

trigraps => trigraphs

167

I think it would be clearer to give more details (for example, give some examples) in the comment to clarify -- it took me quite a while to understand "an unescaped newline" (I guess this refers to the backslash continued line?).

And having a PP in the name seems a little confusing, it somehow indicates this flag is PP-specific, e.g. for a simple example without any PP structure int a; the int token has the StartsPPLine being set.

172

I was going to say "add some comments around these enums", then realized that there were comments for them in the above lex() comment, I'd move those comments to here.

clang/lib/Tooling/Syntax/Pseudo/Lex.cpp
66

just want to double-check this is the behavior we want,

From the clang's token comment:

TokenFlags::StartOfLine  // At start of line or only after whitespace
`

so the abc token will have this flag

abc // abc has two leading spaces.
67

If I read the code correctly, StartsPPLine is not PP-specific, the flag will be set for all tokens where they are at start of the line. Having a PP in the flag name seems a little confusing.

abc // abc token will have StartsPPLine set
89

nit: = 0

clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp
109

can we add some line&indentation verifications here?

118

and here as well, line number & indentation.

166

nit: add trailing comments for each lineIndent, e.g. // hello

169

why the the indentation is 0?

comments on preprocessor part.

clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h
50

is the -+ Directive expected?

61

maybe mention it includes comments.

clang/lib/Tooling/Syntax/Pseudo/Preprocess.cpp
19

I'd use a more specific name, maybe PPParser

22

nit: result -> Result

27

nit: kind => Kind

46

#end => #endif.

61

is there any special reason to use a nested while? it looks like we can replace with an if. -- in each iteration, we get a Chunk result (either a Code, or a Directive), something like below should probably work?

while (Tok->Kind != tok::eof) {
  if (!StartsDirective()) {
     // handle Chunk::Code case
     continue;
  } else {
     // handle directive case.
  }
}
78

It took me a while to understand the logic here. It feels more natural to make this case as an early-exit case, and leave the above general case as a default case like

if (ParsingCondBranch && (Kind == Cond::Else || Kind == Cond::End)
  return ...
92

#end => #endif

135

maybe rename to PPKeywordTable, it seems clearer for the purpose

146

nit: Kind

146

looks like this function can be lifted to TokenKinds.h file (there is a similar getTokenName there), maybe a FIXME?

clang/test/Syntax/lex.test
37 ↗(On Diff #407512)

1 tokens seems silly :( maybe it doesn't matter.

38 ↗(On Diff #407512)

Does this extra trailing blank line matter?

clang/unittests/Tooling/Syntax/Pseudo/PreprocessTest.cpp
41

maybe add a mismatched-if testcase.

sammccall marked 26 inline comments as done.

address review comments

sammccall added inline comments.Feb 21 2022, 11:13 AM
clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h
50

Yes, this is Cond.Branches[0].first.

104

I hope so, but I'm not holding my breath.
(It'll happen soon enough that I'm not going to worry about optimizing this though)

clang/include/clang/Tooling/Syntax/Pseudo/Token.h
65

mention this is the line number for the start of token

Done.

I think the meaning of line number is token-stream-dependent, right?

This always refers to the original source code - a backslash continued line won't reduce the line number. Expanded the comment.

90

This will be used in a bunch more places, and inlining it is awkward:

  • often have to extract an expression to a variable to avoid duplicating it
  • it's IMO much less obvious at the callsite the intent is to specify an empty range

You're right about the name though, renamed to emptyAt which hopefully avoids that association.

131

I'm not sure why that change would be an improvement, can you elaborate?

The two assertions are that we own the token, and that it's not the start sentinel. I think changing the condition makes that less clear.

(Reordered them though, as the start-sentinel assertion seems secondary)

167

Oops, this sentence didn't make much sense. The point is that the preprocessor distinguishes between logical and physical lines:

#define X(error) \
  #error // this is not a directive
auto m = X(foo); // string "foo"

This concept of "logical line" and the rules around it are specific to the preprocessor. The rest of the parser is line-agnostic or uses physical line numbers. So I think PP belongs in the name.

Expanded the comment here.

clang/lib/Tooling/Syntax/Pseudo/Lex.cpp
66

Yes. This matches the logic clang uses (Lexer.cpp:657) to decide whether a hash starts a PP directive, which is what this flag is intended to emulate.

67

the semantics are PP-specific, only the preprocessor cares about this unusual definition of "line".

In your example, abc // abc token... is indeed a preprocessor line (though not a directive). In particular:

#define foo
abc // abc token...

The #define directive is terminated by the abc token precisely because it starts a new PP logical line.

clang/lib/Tooling/Syntax/Pseudo/Preprocess.cpp
46

Changed these to End etc instead.
While #endif is the only End directive, that's incidental (and not the case for If or Else)

146

Makes sense, moved it in this patch.

clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp
169

It was reusing the indentation from the continued line. I'm not really sure why I thought this was a good idea though. Removed this logic, indentation is now 2.

hokein accepted this revision.Feb 23 2022, 4:22 AM

Thanks!

clang/include/clang/Tooling/Syntax/Pseudo/Token.h
82

what's the purpose of this static_assert?

clang/lib/Tooling/Syntax/Pseudo/Preprocess.cpp
61

this comment and the one blew seem to be ignored.

clang/test/Syntax/lex.test
2 ↗(On Diff #410340)

looks like the example.c can be moved into this file

This revision is now accepted and ready to land.Feb 23 2022, 4:22 AM
This revision was landed with ongoing or failed builds.Feb 23 2022, 8:52 AM
This revision was automatically updated to reflect the committed changes.
sammccall marked 5 inline comments as done.
Herald added a project: Restricted Project. · View Herald TranscriptMar 16 2022, 7:01 PM