This is an archive of the discontinued LLVM Phabricator instance.

Carry raw string literal information through to the AST StringLiteral representation
AbandonedPublic

Authored by aaron.ballman on Jan 8 2016, 1:57 PM.

Details

Reviewers
dblaikie
rsmith
Summary

The AST does not currently carry information about whether a string literal is raw or not in the StringLiteral AST representation, which this patch rectifies. It exposes information about whether a string literal is raw, and what prefix (if any) a raw string literal uses. The patch also adds support for pretty printing a raw string literal.

One thing to note, however, is that this is not 100% perfect because of translation phase 6, where adjacent string literals are concatenated. This patch only supports concatenated raw string literals that have the same prefix. It does not support differing prefixes, or adjacent raw & nonraw literals. I felt that these were a sufficiently uncommon edge case to not attempt to support.

Diff Detail

Event Timeline

aaron.ballman retitled this revision from to Carry raw string literal information through to the AST StringLiteral representation.
aaron.ballman updated this object.
aaron.ballman added reviewers: rsmith, dblaikie.
aaron.ballman added a subscriber: cfe-commits.
rsmith edited edge metadata.Jan 27 2016, 6:08 AM
rsmith added a subscriber: rsmith.

What's the benefit of storing this? You can get the same effect by
re-lexing. We don't guarantee that the pretty printed version of the AST
comprises the same sequence of tokens in general.

What's the benefit of storing this? You can get the same effect by
re-lexing. We don't guarantee that the pretty printed version of the AST
comprises the same sequence of tokens in general.

In writing clang-tidy checks, I've had to do re-lexing a number of times and personally I find it to be a complete PITA and very easy to get wrong, plus it results in many review iterations because of all the ways that StringRef enters the picture, whether you're using the plain lexer or the raw lexer, SourceRange or CharSourceRange, yadda yadda yadda.

I thought the whole point of the AST was to do this work once and store the results of the work for tools.

Ping. Richard, I think we've provided justification that warrants this functionality (mostly for clang-tidy checks). Any issues with the code?

Ping. Richard, I think we've provided justification that warrants this functionality (mostly for clang-tidy checks).

I am not convinced. This flag does not seem to make sense for StringLiteral, because it is not a property of the AST-level StringLiteral object; it is a property of the underlying tokens instead (and in general the StringLiteral object corresponds to multiple underlying tokens with differing rawness). We do not store other spelling information from the underlying tokens, such as whether a character was written literally or with an escape sequence, and in AST printing we do not preserve that; I do not see why this case should be any different. (We do store the translated value of the string, but that is a different case, both because it's part of the semantic model of the expression -- we don't want IR generation doing semantic string literal analysis -- and because it's needed frequently enough to justify caching it.)

I'm definitely sympathetic to making StringLiteralParser expose information on whether each token in the string literal is a raw string literal, and if so what prefix it uses. I can also see an argument for exposing an easier interface for constructing a StringLiteralParser from an existing StringLiteral object (note that StringLiteral::getLocationOfByte already does this dance). But I don't see that there's a compelling reason to break out and cache this information separately from its actual point of truth.

If we want to make AST printing round-trip string literals as-written, we should make it re-lex the string literal tokens and print them out verbatim -- making this work for raw vs non-raw strings but not for escapes versus non-escapes, or different flavours of escape sequence, doesn't make a lot of sense to me.

aaron.ballman abandoned this revision.Feb 17 2016, 6:17 AM

Your logic makes sense to me. I am abandoning this revision while I think about alternatives. Thanks!

I'm definitely sympathetic to making StringLiteralParser expose information [...]

I was unaware of this class; so far I have only studied the classes appearing in the AST.

I did notice that the AST shows string literals after concatenation (which makes perfect sense) and sometimes for refactoring you want to treat the individual string literal chunks separately. My clang-tidy check for raw string literals is probably confused by literal concatenation in the AST, so I will go add some tests for those cases.

One area of clang-tidy/refactoring tooling code development that is pretty opaque at the moment is when you should dip down to relexing/reparsing the source text and when you use the AST. So far most (all?) of the documentation on guiding developers to writing such tools talks almost exclusively about the AST. Anything we can do to simplify lexing/parsing tasks when you need to dip lower than the AST would be most welcome.