This is an archive of the discontinued LLVM Phabricator instance.

[syntax] Introduce a TokenManager interface.
ClosedPublic

Authored by hokein on Jun 23 2022, 2:12 AM.

Details

Summary

The TokenManager defines Token interfaces for the clang syntax-tree. This is level of
abstraction that the syntax-tree should use to operate on Tokens.

It decouples the syntax-tree from a particular token implementation (TokenBuffer
previously). This enables us to use a different underlying token implementation
for the syntax Leaf node -- in clang pseudoparser, we want to produce a
syntax-tree with its own pseudo::Token rather than syntax::Token.

Diff Detail

Event Timeline

hokein created this revision.Jun 23 2022, 2:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 2:12 AM
hokein requested review of this revision.Jun 23 2022, 2:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2022, 2:12 AM

As discussed offline this has some problems:

  • putting virtual methods on BaseToken gives it a vtable, which makes it (and syntax::Token) large
  • being able to use ArrayRef<syntax::Token> but not ArrayRef<BaseToken> is a bit weird
  • unusual uses of inheritance can be hard to reason about

We suggested rather having Leaf store an opaque ID.
Callers who know what kind of tokens are in use can use this to associate with the original token.
For generic use (e.g dump()), we can have a TokenManager interface which provides the common operations (like getText()). This generalizes where SourceManager is needed today.

hokein updated this revision to Diff 439705.Jun 24 2022, 5:04 AM

Revised to the TokenManager approach:

  • Inroduce a Base Token class (TokenManager) for syntax-tree, the motivation is to allow using different underlying token implementation in syntax-tree
  • Decouple the syntax-tree from the TokenBuffer:
    • syntax-tree structure (Tree.h) doesn't depend on the TokenBuffer, SourceManager Source location etc, it communicates with TokenManager interfaces;
    • syntax-tree Arena is simpler, the token-managing responsiblity is transferred to TokenManager;
    • in SyntaxTree directory, we implement a TokenBuffer-based SyntaxTokenManager, which mangues all token-related stuff
    • For the mutation/replacement computation APIs, currently they only work on a TokenBuffer-based token manager. Asssertion will be raised if it is not satisfied. It is an NFC change

I'm quite happy about the interfaces now, it should be in a good shape for the API review (would be nice to get some initial feedback before I do further cleanup).

ilya-biryukov added inline comments.Jun 24 2022, 6:10 AM
clang/include/clang/Tooling/Syntax/TokenManager.h
24

NIT: maybe explain the TokenManager concept here, the comment seems to be a leftover from the previous revision.

clang/lib/Tooling/Syntax/Tree.cpp
271

Maybe add TokenManager::getKind(Key) right away and remove this FIXME.
This should as simple as cast<syntax::Token>(T)->Kind, right? Or am I missing some complications?

ilya-biryukov added inline comments.Jun 24 2022, 6:15 AM
clang/include/clang/Tooling/Syntax/TokenManager.h
34

I have just realized that we were discussing having opaque index as a key, but there may also be tokens in the syntax tree that are not backed by the TokenBuffer.
Those can be synthesized, e.g. imagine someone wants to change !(A && B) to !A || !B. They will need to synthesize at least the || token as it's not in the source code. There is a way to do this now and it prohibits the use of an index to the TokenBuffer.

So having the opaque pointer is probably ok for now, it should enable the pseudo-parser to build syntax trees.
We might want to add an operation to synthesize tokens into the TokenManager at some point, but that's a discussion for another day.

hokein added inline comments.Jun 24 2022, 12:28 PM
clang/include/clang/Tooling/Syntax/TokenManager.h
34

Those can be synthesized, e.g. imagine someone wants to change !(A && B) to !A || !B. They will need to synthesize at least the || token as it's not in the source code. There is a way to do this now and it prohibits the use of an index to the TokenBuffer.

Yes, this is the exact reason why the Key is an opaque pointer, my first attempt was to use an index integer, but failed -- we already have some APIs doing this stuff (see createLeaf in BuildTree.h), the token can be a synthesized token backed up by the SourceManager...

Personally, I don't like the Key to be an opaque pointer as well, but considering the effort, it seems to be the best approach so far -- it enables the pseudoparser to build syntax trees with a different Token implementation while keeping the rest syntax stuff unchanged.

We might want to add an operation to synthesize tokens into the TokenManager at some point, but that's a discussion for another day.

Agree, we will encounter this in the future, but we're still far away from there (the layering mutation/syntax-tree is not perfect at the moment, mutation still depends on the TokenBuffer). And our initial application of syntax-tree in pseudoparser focuses on the read use-case, we should be fine now.

clang/lib/Tooling/Syntax/Tree.cpp
271

Yeah, the main problem is that we don't have the TokenManager object in the syntax::Node, we somehow need to pass it (e.g. a function parameter), which seems a heavy change. I'm not sure it is worth for this small assert.

ilya-biryukov added inline comments.Jun 27 2022, 1:40 AM
clang/lib/Tooling/Syntax/BuildTree.cpp
571–573

Could we accept a TokenBuffer here directly?
If TokenManager is needed, we can use it directly in the arena.

627–628

Same suggestion here: accept TokenBuffer instead of TokenManager

clang/lib/Tooling/Syntax/Tree.cpp
271

That makes sense. WDYT about the alternative fix: to pass ̀TokenManager` to assertInvariants?
Not necessary to do it now, just thinking about changing the FIXME

clang/unittests/Tooling/Syntax/TreeTestBase.cpp
117

NIT: it´s not breaking anything now, but I suggest putting SyntaxTokenManager after TokenBuffer.
The reason is that it´s the right destruction order, TokenManager has references to TokenBuffer, so it could potentially access it in destructor some time in the future (e.g. imagine asserting something on tokens).

Not that it actually breaks today, but seems like a potential surprising bug in the future if we happen to refactor code in a certain way.

sammccall added inline comments.Jun 27 2022, 2:31 AM
clang/include/clang/Tooling/Syntax/TokenManager.h
10

It's important that we have comments explaining what the concept is *now*, rather than how it changes the code structure from the previous state (decouples tokenbuffer, enables pseudoparser).

(I think this comment is actually fine, but be careful when writing the class comment for TokenManager)

34

Consider uintptr_t instead, which more naturally supports both approaches. (Stashing an integer in a void* is incredibly weird)

38

I wouldn't want to prejudge this: this is a very basic attribute similar to kind/role, and we may want to store it in Leaf and avoid the virtual stuff.

There's certainly enough space, e.g. of the current 16-bit kind use the top bit to denote leaf-or-not and the bottom 15 bits to store kind-or-tokenkind.

clang/include/clang/Tooling/Syntax/Tokens.h
463

conceptually this is just "TokenBuffer implements TokenManager"

The main reason I can see not to actually write that is to avoid the dependency from TokenBuffer (tokens library) to TokenManager (syntax library). But here you've added that dependency anyway.

So I think we'd be better either with TokenBuffer : TokenManager or moving this class to its own header.

465

no need to take SourceManager, TokenBuffer already includes it.
LangOpts is unused here.

475

Empty string seems like the correct return value here to me.
If you want a special case for dump, I think that belongs in dump().

If this is because we currently provide no way to get the token kind, then this should be a FIXME

487

just tokenBuffer(), consistent with TokenBuffer::sourceManager()

490

manager

504

aren't we trying to store these on the syntax arena?
We never do lookups into this map, maybe lexbuffer just allocates storage on the arena('s allocator) instead of using this map?

clang/include/clang/Tooling/Syntax/Tree.h
156

why is this comment removed?

clang/lib/Tooling/Syntax/Tree.cpp
271

per my comment above: Leaf can store the tok::Kind directly and I think it's appropriate to do so.
But maybe fiddly enough that it's worth deferring for one patch

ilya-biryukov added inline comments.Jun 27 2022, 2:40 AM
clang/include/clang/Tooling/Syntax/Tokens.h
463

I would argue they should be separate concepts.

  • TokenBuffer is about storing tokens and mapping between expanded and spelled token streams.
  • SyntaxTokenManager is about implementing relevant certain operations on syntax::Token and hiding the actual token type.

Conceptually those things are different and decoupling them allows for more flexibility and allows reasoning about them independently. In particular, one could imagine having a SyntaxTokenManager without a TokenBuffer if we do not need to deal with two streams of tokens.

I would suggest keeping SyntaxTokenManager as a separate class.

hokein updated this revision to Diff 443173.Jul 8 2022, 2:16 AM
hokein marked 9 inline comments as done.

address review comments.

hokein added inline comments.Jul 8 2022, 2:23 AM
clang/include/clang/Tooling/Syntax/TokenManager.h
38

This sounds good. Adjust the FIXME.

clang/include/clang/Tooling/Syntax/Tokens.h
463

I would suggest keeping SyntaxTokenManager as a separate class.

+1. Moved to a separate file.

465

no need to take SourceManager, TokenBuffer already includes it.

The SourceMgr can be mutated by the class (it stores the underlying tokens for ExtraTokens), while the SourceManager in TokenBuffer is immutable.

LangOpts is unused here.

It is used in SyntaxTokenManager::lexBuffer.

475

Yeah, this special case is for the Leaf node dump. Added a FIXME.

(I even double whether we need this special case at all, do we really want to build a Leaf node for eof token?)

504

aren't we trying to store these on the syntax arena?

We could do that, but I'd try to avoid that. Now the allocator of syntax::Arena is a storage for syntax-tree nodes only. Allocation for token-related stuff is on the SyntaxTokenManager.

We never do lookups into this map, maybe lexbuffer just allocates storage on the arena('s allocator) instead of using this map?

This map is moved from the syntax::Arena.

You're right, there is no usage of the key at the moment, and the only use-case is to create a syntax leaf node that not backed by the source code (for refactoring usecase), it is unclear whether we will use it in the future. I will keep it as-is (this is not the scope of this patch).

clang/unittests/Tooling/Syntax/TreeTestBase.cpp
117

good point!

sammccall added inline comments.Jul 11 2022, 5:03 AM
clang/include/clang/Tooling/Syntax/Nodes.h
23–24

I think Token and TokenKinds are also no longer needed?

clang/include/clang/Tooling/Syntax/SyntaxTokenManager.h
20 ↗(On Diff #443173)

I don't think "syntax" in "syntax token manager" is particularly disambiguating here, both TokenBuffer and TokenManager are part of syntax, so it's not clear what it refers to (and it doens't have any obvious plain-english meaning).

Maybe some combination like TokenBufferTokenManager or BufferTokenManager.
In fact I think best is TokenBuffer::TokenManager, still defined in a separate header, though I'm not sure if you think that's too weird.

clang/include/clang/Tooling/Syntax/TokenManager.h
15

unclear: different from what? it's not clear what "enables" means if there's no default. Maybe replace the sentence with:

"For example, a TokenBuffer captured from a clang parse may track macro expansions and associate tokens with clang's SourceManager, while a pseudo-parser would use a flat array of raw-lexed tokens in memory."

39

This is not a useful comment, either remove it or add more content to make it useful.

In particular the guarantees (or lack thereof) of exactly what this text is would be helpful. (This is some source code that would produce this token, though it may differ from exactly what was spelled in the file when preprocessing is involved)?

clang/include/clang/Tooling/Syntax/Tree.h
9

this is an implementation detail.
At a high level, leaf nodes correspond to tokens. I'd just delete "expanded" from the original comment

41–42

this is not a helpful comment, just remove it?

clang/lib/Tooling/Syntax/BuildTree.cpp
370

need changes to the public API to make this cast valid

clang/lib/Tooling/Syntax/ComputeReplacements.cpp
95–96

Need a change to the public interface to guarantee this cast will succeed.
The cleanest would be just to take the SyntaxTokenManager as a param (moving the cast to the call site - I don't think Arena is otherwise used for anything).

Failing that we at least need to update the contract

hokein updated this revision to Diff 443736.Jul 11 2022, 1:22 PM
hokein marked 7 inline comments as done.

address comments.

hokein updated this revision to Diff 443737.Jul 11 2022, 1:23 PM

more update

clang/include/clang/Tooling/Syntax/SyntaxTokenManager.h
20 ↗(On Diff #443173)

Renamed to TokenBufferTokenManager.

BufferTokenManager name is short, but it has BufferToken prefix, which seems confusing with TokenBuffer. TokenBuffer::TokenManager is weird to me, and doesn't reflect the layering IMO

clang/lib/Tooling/Syntax/BuildTree.cpp
370

Are you suggesting to change all public APIs where there is such a cast usage?

If yes, this seems not a trivial change, I think at least we will change all APIs (buildSyntaxTree, createLeaf, createTree, deepCopyExpandingMacros) in BuildTree.h (by adding a new TokenBufferTokenManager parameter).
And the Arena probably doesn't need to have a TokenManager field (it could be simplified as a single BumpPtrAllocator), as the TokenManager is passed in parallel with the Arena.

I'm happy to do the change, but IMO, the current version doesn't seem too bad for me (if we pass an non-SyntaxTokenManager, it will trigger an assertion in debug mode).

clang/lib/Tooling/Syntax/ComputeReplacements.cpp
95–96

Done, this is a trivial change.

sammccall added inline comments.Jul 11 2022, 2:04 PM
clang/lib/Tooling/Syntax/BuildTree.cpp
370

Are you suggesting to change all public APIs where there is such a cast usage?

Yes, at the very least they should document the requirement ("this arena must use a TokenBuffer").
An unchecked downcast with no indication on the public API that a specific subclass is required just looks like a bug.

hokein updated this revision to Diff 443922.Jul 12 2022, 5:26 AM

remove all TokenBufferTokenManager cast usage, make it as a contract in the APIs.

sammccall accepted this revision.Jul 13 2022, 2:49 AM
This revision is now accepted and ready to land.Jul 13 2022, 2:49 AM
hokein retitled this revision from [syntax] Introduce a BaseToken class. to [syntax] Introduce a TokenManager interface..Jul 13 2022, 6:37 AM
hokein edited the summary of this revision. (Show Details)
hokein updated this revision to Diff 444244.Jul 13 2022, 6:38 AM

update the API changes in clangd part.

Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2022, 6:38 AM
This revision was landed with ongoing or failed builds.Jul 15 2022, 1:31 AM
This revision was automatically updated to reflect the committed changes.

FYI, after this change I get:

/home/david.spickett/llvm-project/clang/include/clang/Tooling/Syntax/TokenBufferTokenManager.h:20:7: warning: 'clang::syntax::TokenBufferTokenManager' has virtual functions but non-virtual destructor [-Wnon-virtual-dtor]
class TokenBufferTokenManager : public TokenManager {
      ^

FYI, after this change I get:

/home/david.spickett/llvm-project/clang/include/clang/Tooling/Syntax/TokenBufferTokenManager.h:20:7: warning: 'clang::syntax::TokenBufferTokenManager' has virtual functions but non-virtual destructor [-Wnon-virtual-dtor]
class TokenBufferTokenManager : public TokenManager {
      ^

Sorry, it is fixed in 30c2406e270cc5dab8da813ce5c54e4bb8c40e49.