Page MenuHomePhabricator

[WIP] Implement P2361 Unevaluated string literals
Needs ReviewPublic

Authored by cor3ntin on Jul 10 2021, 6:53 AM.

Details

Summary

This patch proposes to handle in an uniform fashion
the parsing of strings that are never evaluated,
in asm statement, static assert, attrributes, extern,
etc.

Unevaluated strings are UTF-8 internally and so currently
behave as narrow strings, but these things will diverge with
D93031.

The big question both for this patch and the P2361 paper
is whether we risk breaking code by disallowing
encoding prefixes in this context.
I hope this patch may allow to gather some data on that.

Future work:
Improve the rendering of unicode characters, line break
and so forth in static-assert messages

Diff Detail

Unit TestsFailed

TimeTest
2,340 msx64 debian > AddressSanitizer-x86_64-linux.TestCases::strcmp.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/strcmp.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/strcmp.c.tmp
1,630 msx64 debian > HWAddressSanitizer-x86_64.TestCases::wild-free-shadow.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -m64 -gline-tables-only -fsanitize=hwaddress -fuse-ld=lld -fsanitize-hwaddress-experimental-aliasing -mcmodel=large -mllvm -hwasan-generate-tags-with-calls=1 -mllvm -hwasan-globals -mllvm -hwasan-use-short-granules -mllvm -hwasan-instrument-landing-pads=0 -mllvm -hwasan-instrument-personality-functions /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/hwasan/TestCases/wild-free-shadow.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Output/wild-free-shadow.c.tmp && not /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Output/wild-free-shadow.c.tmp 2>&1 | FileCheck /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/hwasan/TestCases/wild-free-shadow.c

Event Timeline

cor3ntin requested review of this revision.Jul 10 2021, 6:53 AM
cor3ntin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2021, 6:53 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin updated this revision to Diff 357714.Jul 10 2021, 6:53 AM

clang-format

cor3ntin updated this revision to Diff 357723.Jul 10 2021, 8:05 AM
  • Support unevaluated str8ing literal in deprecated/nodiscard attributes
  • Fix test formatting
cor3ntin updated this revision to Diff 357741.Jul 10 2021, 11:37 AM

Fix asserts firing

cor3ntin updated this revision to Diff 357746.Jul 10 2021, 12:47 PM

Fix tests, windows compilation, formatting

Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2021, 12:47 PM
cor3ntin updated this revision to Diff 357764.Jul 10 2021, 4:06 PM

Fix avaibility attribute parsing

cor3ntin updated this revision to Diff 357765.Jul 10 2021, 4:07 PM

clang format

aaron.ballman added inline comments.Aug 2 2021, 7:48 AM
clang/include/clang/AST/Expr.h
1846–1847

This should silence some diagnostics about mixed && and || in the same expression.

clang/include/clang/Basic/DiagnosticLexKinds.td
247
249
clang/include/clang/Basic/DiagnosticParseKinds.td
59–60

No need for this diagnostic, you can use err_expected_string_literal from DiagnosticCommonKinds.td instead.

clang/include/clang/Lex/LiteralSupport.h
220

This seems to be unused.

231–234

Alternatively, you could use an in-class initializer and drop the changes to both ctor init lists.

clang/lib/AST/Expr.cpp
1076–1077

Basically unused and is shadowed by a declaration below (on line 1087).

1158–1159
clang/lib/Lex/LiteralSupport.cpp
95–96

Do you intend to miss a bunch of escapes like \' and \r (etc)?

242
1539–1560
1541

This means we're looping over (almost) all the string tokens three times -- once here, once below on line 1562, and again on 1605.

1542
1545

This diagnostic might be somewhat odd for Pascal strings because those sort of have a prefix but it's not really the kind of prefix we're talking about. I don't know of a better way to word the diagnostic though. If you think of a way to improve it, then yay, but otherwise, I think it's fine as-is.

1964
clang/lib/Parse/ParseDecl.cpp
427–429

Hmmm, I'm not certain about these changes.

For some attributes, the standard currently requires accepting any kind of string literal (like [[deprecated]] https://eel.is/c++draft/dcl.attr.deprecated#1). P2361 is proposing to change that, but it's not yet accepted by WG21 (let alone WG14). So giving errors in those cases is a bit of a hard sell -- I think warnings would be a bit more reasonable.

But for other attributes (like annotate), it's a bit less clear whether we should *prevent* literal prefixes because the attribute can be used to have runtime impacts (for example, I can imagine someone using annotate to emit the string literal bytes into the resulting binary). In some cases, I think it's very reasonable (e.g., diagnose_if should behave the same as deprecated and nodiscard because those are purely about generating diagnostics at compile time).

I kind of wonder whether we're going to want to tablegenerate whether the argument needs to be parsed as unevaluated or not on an attribute-by-attribute basis.

460–461

I don't think this is the right way to go about this, but the comments were left above.

cor3ntin added inline comments.Aug 2 2021, 8:24 AM
clang/lib/Lex/LiteralSupport.cpp
95–96

\' is there. I am less sure about '\r' and '\a'. for example. This is something I realized after writing P2361.
what does '\a` in static assert mean? even '\r' is not so obvious

clang/lib/Parse/ParseDecl.cpp
427–429

Yep, I would not expect this to get merge before P2361 but I think the implementation experience is useful and raised a bunch of good questions.
I don't think it ever makes sense to have L outside of literals - but people *might* do it currently, in which case there is a concern about whether it breaks code (I have found no evidence of that though).

If we wanted to inject these strings in the binary - in some form, then we might have to transcode them at that point.
I don't think the user would know if the string would be injected as wide or narrow (or something else) by the compiler.

L is really is want to convert that string _at that point_. in an attribute, strings might have multiple usages so it's better to delay any transcoding.
Does that make sense?

But I agree that a survey of what each attribute expects is in order.

cor3ntin updated this revision to Diff 367409.Aug 19 2021, 12:08 AM
cor3ntin marked 10 inline comments as done.

Address most of Aaron's comments

I addressed most of the comments. I still need to look at the 3 loops thing (I guess if the string is very long it could leave the cache? I am not actually sure it's an issue but maybe I can improve that), and then remain design questions

cor3ntin updated this revision to Diff 367811.Aug 20 2021, 9:26 AM

Fix diagnostic for avaibility attributes

aaron.ballman added a subscriber: jfb.
aaron.ballman added inline comments.
clang/lib/Lex/LiteralSupport.cpp
95–96

Looking at the list again, I think only \a is really of interest here. I know some folks like @jfb have mentioned that \a could be used to generate an alert sound on a terminal, which is a somewhat useful feature for a failed static assertion if you squint at it hard enough.

But the rest of the missing ones do seem more questionable to support.

clang/lib/Parse/ParseDecl.cpp
427–429

Yep, I would not expect this to get merge before P2361 but I think the implementation experience is useful and raised a bunch of good questions.

Absolutely agreed, this is worthwhile effort!

If we wanted to inject these strings in the binary - in some form, then we might have to transcode them at that point.
I don't think the user would know if the string would be injected as wide or narrow (or something else) by the compiler.

My intuition is that a user who writes L"foo" will expect a wide "foo" to appear in the binary in the cases where the string ends up making it that far.

L is really is want to convert that string _at that point_. in an attribute, strings might have multiple usages so it's better to delay any transcoding.
Does that make sense?

Not yet, but I might get there eventually. :-D My concern is that vendor attributes can basically do anything, so there's no reason to assume that any given string literal usage should or should not transcode. I think we have to decide on a case by case basis by letting the attributes specify what they intend in their argument lists. However, my intuition is that *most* attributes will expect unevaluated string literals because the string argument doesn't get passed to LLVM.

cor3ntin added inline comments.Aug 25 2021, 3:46 PM
clang/lib/Parse/ParseDecl.cpp
427–429

The status quo is that everything transcodes.

But not transcoding, we do not destroy any information as to what is in the source.

If an attribute then wants to use the string later in such a way that it needs to transcode to a literal encoding (or something else, for example, one might imagine a fun scenario where literal are ASCII encoded and debug information are EBCDIC encoded), then that can be done, because the string still exists.

Whereas for literal specifically, we assume they will be evaluated by the abstract machine as per phase 5 so we transcode them immediately. which is destructive. we get away with it because the original spelling is in the source if we need it, and currently, literals are also assumed to be (potentially invalid because of \x escape sequences) UTF-8.

There is an alternative design where string literals are not transcoded until lazily evaluated but I'm not sure there is a big motivation for that.

So this PR is exactly trying not to force a specific behavior on attributes that I assume can be displayed, put into some form in the binary, or converted to literal which might represent 3 distinct encodings. The parser leaving them as Unicode is the least opinionated thing the parser can possibly do.
And then each attribute can decide for itself if it needs to transcode, and how to handle any errors if they occur.
An attribute might decide to keep both a Unicode and non-Unicode spelling around if the string has a dual purpose, etc

Question though, Is there a scenario in which \x/\0 would actually be useful in the context of attributes? Because if so, then we might need to do something to allow that.

At @aaron.ballman request, here are some data collection trying to assert the amount of breakage

On a corpus of 78M lines of C++ code corresponding to all the packages in vcpkg

  • Number of strings with encoding prefix in _Pragma: 3/3383 (all in test suits)
  • Number of strings with encoding prefix in deprecated/nodiscard attributes: 0/845
  • Number of strings with encoding prefix in static_assert: 62/92800 (all in in test suits)
  • Number of strings with encoding prefix in extern : 3/39829 (all in in test suits)

Note that, due to the complexity of operating these sort of queries at this scale, this was done by regex and as such is not exact, nor is the approach fully representative of all C++ code.
However, I think it gives a good idea of the risk (or lack thereof) involved