This is an archive of the discontinued LLVM Phabricator instance.

add -dump-tokens-bg option for clang
Needs Review · Public

Authored by Anarion-zuo on Jun 7 2023, 6:40 PM.

Details

Summary

This patch is intended to add a -dump-tokens-bg option to clang.

While trying to extract information from code, we found it difficult to use the -dump-tokens option with complex build systems, e.g. this one. The new option is meant to write token dumps to stdout without interfering with other parts of the build.

A dump may look like this. Other formats are possible if anyone cares to propose one.

file token dump begins hello.c
raw_identifier	'int'
raw_identifier	'foo'
l_paren	'('
r_paren	')'
l_brace	'{'
raw_identifier	'int'
raw_identifier	'i'
equal	'='
numeric_constant	'0'
semi	';'
raw_identifier	'int'
raw_identifier	'j'
equal	'='
raw_identifier	'i'
minus	'-'
numeric_constant	'4'
semi	';'
raw_identifier	'int'
raw_identifier	'k'
equal	'='
numeric_constant	'999'
slash	'/'
raw_identifier	'j'
semi	';'
raw_identifier	'return'
numeric_constant	'1'
semi	';'
r_brace	'}'
raw_identifier	'int'
raw_identifier	'main'
l_paren	'('
r_paren	')'
l_brace	'{'
raw_identifier	'foo'
l_paren	'('
r_paren	')'
semi	';'
raw_identifier	'return'
numeric_constant	'0'
semi	';'
r_brace	'}'
eof	''
file token dump ends hello.c
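
For readers who want to consume such a dump, here is a minimal sketch of a parser. It assumes the layout shown above: a "file token dump begins NAME" line, one token per line with the token kind and the single-quoted spelling separated by whitespace, and a closing "file token dump ends NAME" line. The function name `parse_token_dump` is hypothetical and is not part of the patch. Lines outside a begin/end pair (for example, other build output on stdout) are ignored.

```python
BEGIN_MARK = "file token dump begins "
END_MARK = "file token dump ends "


def parse_token_dump(text):
    """Parse dump text into {filename: [(kind, spelling), ...]}.

    Assumes the begin/end markers and the "kind<whitespace>'spelling'"
    token layout shown in the example dump; anything else is skipped.
    """
    dumps = {}
    current = None  # token list of the file being read, or None
    for line in text.splitlines():
        if line.startswith(BEGIN_MARK):
            name = line[len(BEGIN_MARK):].strip()
            current = dumps.setdefault(name, [])
        elif line.startswith(END_MARK):
            current = None
        elif current is not None:
            # Split only once so spellings containing spaces stay intact.
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # blank or malformed line
            kind, quoted = parts
            # Drop the surrounding single quotes, if present.
            if len(quoted) >= 2 and quoted[0] == "'" and quoted[-1] == "'":
                quoted = quoted[1:-1]
            current.append((kind, quoted))
    return dumps
```

Because the markers carry the file name, the output of a whole build can be piped through this function and split per translation unit afterwards.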

Event Timeline

Anarion-zuo created this revision. · Jun 7 2023, 6:40 PM
Herald added a project: Restricted Project. · Jun 7 2023, 6:40 PM
Anarion-zuo requested review of this revision. · Jun 7 2023, 6:40 PM
Herald added a subscriber: cfe-commits.
Anarion-zuo edited the summary of this revision. · Jun 7 2023, 6:56 PM
Anarion-zuo added a reviewer: jansvoboda11.
Anarion-zuo added a reviewer: ChuanqiXu.

It's my first time here, and I don't know if I got everything right.

CodeOwners.rst suggests you are the relevant reviewers. Thanks! @ChuanqiXu @jansvoboda11

What is the intended use case of this?

We intend to build a deep learning model that takes code as input, and noticed that clang does not currently have a convenient way of dumping tokens in huge projects.

The dumped tokens are supposed to be converted to vectors, then given to a deep learning model.
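
The thread does not say how the tokens would be turned into vectors. As one possible first step, and purely as an assumption for illustration, each (kind, spelling) pair could be mapped to an integer ID that a model's embedding layer would later consume. The names `build_vocab` and `encode` below are hypothetical.

```python
def build_vocab(tokens):
    """Assign each distinct (kind, spelling) pair an ID in first-seen order."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unknown_id=-1):
    """Map tokens to integer IDs; tokens missing from vocab get unknown_id."""
    return [vocab.get(token, unknown_id) for token in tokens]


# Tokens as (kind, spelling) pairs, mirroring the start of the example dump.
tokens = [
    ("raw_identifier", "int"),
    ("raw_identifier", "foo"),
    ("l_paren", "("),
    ("r_paren", ")"),
    ("raw_identifier", "int"),
]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
```

A real pipeline would likely normalize identifiers and literals before building the vocabulary, since raw spellings make it grow with every new name.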

I think such a requirement should be met by something like a plugin or an analysis tool that uses clang as a library, instead of being implemented directly in the clang compiler itself.

I see that a plugin is a more elegant way of doing this. Do you still want this as a part of LLVM?

No, I don't feel this is needed.