This is an archive of the discontinued LLVM Phabricator instance.

Implement P2361 Unevaluated string literals
ClosedPublic

Authored by cor3ntin on Jul 10 2021, 6:53 AM.

Details

Summary

This patch proposes to handle in an uniform fashion
the parsing of strings that are never evaluated,
in asm statement, static assert, attrributes, extern,
etc.

Unevaluated strings are UTF-8 internally and so currently
behave as narrow strings, but these things will diverge with
D93031.

The big question both for this patch and the P2361 paper
is whether we risk breaking code by disallowing
encoding prefixes in this context.
I hope this patch may allow to gather some data on that.

Future work:
Improve the rendering of unicode characters, line break
and so forth in static-assert messages

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In general, I think this is shaping up nicely and is almost complete. I'm adding some additional reviewer though, as this is a somewhat experimental patch for a WG21 proposal that has not been accepted yet and I want to make sure that I'm not missing something. That may also solve the few open questions that still remain.

clang/lib/AST/Expr.cpp
1187–1188

I'd recommend running the entire patch through clang-format though: https://clang.llvm.org/docs/ClangFormat.html#script-for-patch-reformatting

clang/lib/Lex/LiteralSupport.cpp
1897

Looks like this comment is still missing punctuation.

cor3ntin updated this revision to Diff 375278.Sep 27 2021, 8:27 AM

Formatting and missing punctuation

erichkeane added inline comments.Sep 27 2021, 8:33 AM
clang/include/clang/Basic/DiagnosticLexKinds.td
280

Is there value to combining these two diagnostics with a %select?

clang/lib/Lex/LiteralSupport.cpp
96–97

I might consider rejecting ANY character escape in the less-than-32 part of the table.

For consistency at least, I don't see value in allowing \a if we're rejecting layout things like \t.

119

This is like the 3rd time we're using 'Unevaluated' as a bool parameter. I have a pretty strong preference for making it a scoped-enum in 'Basic' somewhere.

2005

Is this OK? It looks like we're passing a ton of parameters to a diag type that doesn't have any wildcards?

cor3ntin updated this revision to Diff 375319.Sep 27 2021, 10:05 AM

Replace Unevaluated by an enum.

aaron.ballman added inline comments.Sep 27 2021, 10:13 AM
clang/include/clang/Basic/DiagnosticLexKinds.td
280

I waffled when doing this review, so it's funny you mention it. :-D

We could do: an unevaluated string literal cannot %select{have an encoding prefix|be a user-defined literal}0 but there was just enough text in the select that I felt it wasn't critical to combine. But I don't feel strongly either way.

clang/lib/Lex/LiteralSupport.cpp
96–97

But that's just it, we're accepting \t and \n with this code.

2005

Good catch! The first two are not helpful (the diag engine will silently ignore them), but the second two are for underlines in the diagnostic and are useful.

erichkeane added inline comments.Sep 27 2021, 10:20 AM
clang/lib/Lex/LiteralSupport.cpp
96–97

Ah! I missed that this is an allow-list instead of a deny-list. That makes me way more comfortable with this code.

IMO, I'd suggest we we allow '\r' (since wouldn't we have problems on Windows at that point, being unable to accept a printable newline for windows?), but disallow \a for now unless someone comes up with a really good reason to allow it.

cor3ntin updated this revision to Diff 375475.Sep 27 2021, 11:12 PM

Accept \r as an escape sequence n unevaluated string literal

cor3ntin updated this revision to Diff 375476.Sep 27 2021, 11:13 PM

Rename commit

A couple of small things, otherwise I'm happy; but Aaron has some bigger opens above, plus clang-format, plus the modules from Richard.

clang/include/clang/Basic/DiagnosticLexKinds.td
280

I was waffly on this too, so your waffling + my waffling I think is sufficient reason to not deal with this now.

clang/lib/AST/Expr.cpp
1160

minor preference (perhaps 'nit' level) to move this whole CharByteWidth + IsPascal calculation into its own function. This constructor is absurdly long as it is.

clang/lib/Lex/LiteralSupport.cpp
99

For future clarification, the ones from the 'simple' list here: https://en.cppreference.com/w/cpp/language/escape

that we are missing are: \a \b \f and \v.

I personally think I'm ok with that until someone else says they care.

1896

Hrm.... this is unfortunate. Is there no way to combine the loops? I guess (hope?) that hte list of tokens is at least going to be short...

cor3ntin updated this revision to Diff 375569.Sep 28 2021, 7:23 AM

Formatting

cor3ntin updated this revision to Diff 375576.Sep 28 2021, 7:46 AM

Get rid of the extra loop by using a lambda

Some naming nits. There are two open questions also: one about module behavior and one about a TODO comment in the patch. If we don't hear back about the modules question, I think that can be handled in a follow-up.

clang/include/clang/Lex/LiteralSupport.h
215

Slight renaming so nobody thinks this is going to be about wide vs narrow vs u8, etc.

243

We should rename anything mentioning StringKind similarly -- this will also help avoid confusion with the StringKind type in Expr.h.

257

Can we make this private now rather than letting callers access it directly?

clang/lib/Lex/LiteralSupport.cpp
1897

When I hear "check" I think it'll return a value; I think this name is a bit more clear.

2007–2009
cor3ntin updated this revision to Diff 375645.Sep 28 2021, 10:57 AM

Address Aaron's comments

clang/lib/Lex/LiteralSupport.cpp
119

Any suggestion for where to

2007–2009

This are actually used by err_string_concat_mixed_suffix

cor3ntin added inline comments.Sep 28 2021, 10:58 AM
clang/lib/Lex/LiteralSupport.cpp
119

NVM

erichkeane added inline comments.Sep 28 2021, 10:59 AM
clang/lib/Lex/LiteralSupport.cpp
2007–2009

right, i guess it is just super awkward to have unused parameters passed like this. I know we only check the other direction, but seems awkward. Aaron, thoughts?

aaron.ballman added inline comments.Sep 28 2021, 11:56 AM
clang/lib/Lex/LiteralSupport.cpp
2007–2009

I'd split it into two calls at this point. e.g.,

if (UnevaluatedStringHasUDL)
  Diags->Report(TokLoc, diag::err_unevaluated_string_udl) << ...;
else
  Diags->Report(TokLoc, diag::err_string_concat_mixed_suffix) << ...;
cor3ntin updated this revision to Diff 375940.Sep 29 2021, 10:10 AM

Cleanup Diagnostics In LiteralSupport

aaron.ballman accepted this revision.Sep 29 2021, 10:35 AM

LGTM aside from two small nits. As for the modules question, if @rsmith doesn't get back to us, I think it's fine to address that post-commit.

clang/include/clang/Lex/LiteralSupport.h
243

Did this one get missed?

clang/test/CXX/dcl.dcl/p4-0x.cpp
25

Can you add the newline back to the end of the file?

This revision is now accepted and ready to land.Sep 29 2021, 10:35 AM
cor3ntin updated this revision to Diff 375964.Sep 29 2021, 11:07 AM

Fix EOF & unrenamed StringKind

cor3ntin retitled this revision from [WIP] Implement P2361 Unevaluated string literals to Implement P2361 Unevaluated string literals.Sep 30 2021, 5:07 AM
cor3ntin closed this revision.Oct 1 2021, 12:00 PM
cor3ntin reopened this revision.Jun 22 2023, 1:01 PM
This revision is now accepted and ready to land.Jun 22 2023, 1:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2023, 1:01 PM
cor3ntin updated this revision to Diff 533743.Jun 22 2023, 1:01 PM

As approved by CWG
Updates to cxx_status / doc will come later :)

shafik added a subscriber: shafik.Jun 23 2023, 10:09 PM
shafik added inline comments.
clang/lib/Parse/ParseExpr.cpp
3268
cor3ntin updated this revision to Diff 534201.Jun 24 2023, 5:40 AM

Address Shafik's comment

shafik accepted this revision.Jun 26 2023, 9:14 AM

LGTM but I don't see asm covered in the tests.

clang/lib/AST/Expr.cpp
1140

Why not grouped w/ Ordinary above?

1164

Isn't this the same as Length?

1201

Isn't Str.size() the same as ByteLength?

clang/lib/Lex/LiteralSupport.cpp
90

Should we use Is as a prefix here? Right now it should like we are modifying something.

cor3ntin updated this revision to Diff 534613.Jun 26 2023, 9:42 AM

Rename /EscapeValidInUnevaluatedStringLiteral/IsEscapeValidInUnevaluatedStringLiteral

cor3ntin marked 43 inline comments as done.Jun 26 2023, 9:52 AM
cor3ntin added inline comments.
clang/lib/AST/Expr.cpp
1164

Only when CharByteWidth == 1

1201

ByteLength isn't defined in this scope, I guess i could move it.

This also should update the cxx_status page and have a release note.

clang/include/clang/Basic/Attr.td
1411 ↗(On Diff #534201)

What is the plan for non-standard attributes? Are you planning to handle those in a follow-up, or should we be investigating those right now?

clang/include/clang/Parse/Parser.h
1857–1859

Two default bool params is a bad thing but three default bool params seems like we should fix the interface at this point. WDYT?

Also, it's not clear what the new parameter will do, the function could use comments unless fixing the interface makes it sufficiently clear.

clang/lib/AST/Expr.cpp
1140

Specifically because we want the host encoding, not the target encoding.

1164

It is -- I think we can get rid of ByteLength, but it's possible that this exists because of the optimization comment below. I don't insist, but it would be nice to know if we can replace the switch with Length /= CharByteWidth these days.

1189

Add assert(!Pascal && "Can't make an unevaluated Pascal string"); ?

1201

I think it's more clear to use Str.size() because we're copying from Str.data().

1238
clang/lib/Lex/LiteralSupport.cpp
90

+1, I think Is would be an improvement.

aaron.ballman added inline comments.Jun 26 2023, 9:54 AM
clang/lib/Lex/LiteralSupport.cpp
98

We're still missing support for some escape characters from: http://eel.is/c++draft/lex#nt:simple-escape-sequence-char

Just to verify, UCNs have already been handled by the time we get here, so we don't need to care about those, correct?

1917

Doesn't returning here leave the object in a partially-initialized state? That seems bad.

2083–2086

Is there test coverage that we diagnose this properly?

clang/lib/Lex/PPMacroExpansion.cpp
1872–1874

Test coverage for this change?

clang/lib/Lex/Pragma.cpp
776

Pinging @ChuanqiXu for opinions.

clang/lib/Parse/ParseExpr.cpp
3497–3498

I'm surprised we need special logic in ParseExpressionList() for handling unevaluated string literals; I would have expected that to be needed when parsing a string literal. Nothing changed in the grammar for http://eel.is/c++draft/expr.post.general#nt:expression-list (or initializer-list), so these changes seem wrong. Can you explain the changes a bit more?

clang/lib/Sema/SemaDeclAttr.cpp
878–879 ↗(On Diff #534201)

Test coverage for these changes?

clang/lib/Sema/SemaDeclCXX.cpp
16474

Test coverage for changes?

cor3ntin marked 2 inline comments as done.Jun 26 2023, 9:57 AM
cor3ntin added inline comments.
clang/lib/AST/Expr.cpp
1140

an unevaluated string is a sequence of 1-byte even on platforms were sizeof(char) would be 2 or 4. It's never influenced by the target's properties

cor3ntin marked 6 inline comments as done.Jun 26 2023, 10:30 AM
cor3ntin added inline comments.
clang/include/clang/Basic/Attr.td
1411 ↗(On Diff #534201)

I don't feel I'm qualified to answer that. Ideally, attributes that expect string literals that are not evaluated should follow suite.

clang/include/clang/Parse/Parser.h
1857–1859

I'm still not sure that's the best solution. AllowEvaluatedString would only ever be false for attributes, I consider duplicating the function, except it does quite a bit for variadics, which apparently attribute support

Maybe would could have

ParseAttributeArgumentList
ParseExpressionList
ParseExpressionListImpl?

?

clang/lib/AST/Expr.cpp
1164

I think we should.

clang/lib/Lex/LiteralSupport.cpp
98

Just to verify, UCNs have already been handled by the time we get here, so we don't need to care about those, correct?

They are dealt with elsewhere yes (and supported)

2083–2086

What sort of test would you like to see?

clang/lib/Parse/ParseExpr.cpp
3497–3498

We use ParseExpressionList when parsing attribute arguments, and some attributes have unevaluate string as argument - I agree with you that I'd rather find a better solution for attributes, but I came up empty. There is no further reason for this change, and you are right it does not match the grammar.

clang/lib/Sema/SemaDeclAttr.cpp
878–879 ↗(On Diff #534201)

There is one somewhere, I don;t remember where, The reason we need to do that is that Unevaluated StringLiterals don''t have types

clang/lib/Sema/SemaDeclCXX.cpp
16474

There are some in dcl.link/p2.cpp

cor3ntin updated this revision to Diff 534640.Jun 26 2023, 10:40 AM
cor3ntin marked 2 inline comments as done.

Address some of Aaron's comments

ChuanqiXu added inline comments.Jun 26 2023, 7:42 PM
clang/lib/Lex/Pragma.cpp
776

I think the both options (to modify it or not) are acceptable. Because the input here should be the output of the clang itself. See https://github.com/llvm/llvm-project/blob/ebd0b8a0472b865b7eb6e1a32af97ae31d829033/clang/lib/Basic/Module.cpp#L229-L231 and https://github.com/llvm/llvm-project/blob/ebd0b8a0472b865b7eb6e1a32af97ae31d829033/clang/lib/Frontend/Rewrite/FrontendActions.cpp#L238-L240.

We can see there is no deprecated prefix. So while it is acceptable to modify this since its pattern matches the paper, it doesn't matter really since we can control the input completely.

Personally, I prefer to not touch it. Since I feel like this use case doesn't have been used a lot. So the effort here may not be worthy.

aaron.ballman added inline comments.Jun 27 2023, 7:41 AM
clang/include/clang/Basic/Attr.td
1411 ↗(On Diff #534201)

Let's do them in a follow-up. Normally I'd suggest working with @erichkeane on which attributes to apply that to, but he's about to go on a sabbatical and might not have time to help with that. So maybe you can take a first pass at it as best you can and then rope me in to help finalize it, if that'd work for you?

clang/lib/Lex/LiteralSupport.cpp
2083–2086

Pascal strings enabled and using something like [[deprecated("\pOh no, a Pascal string!")]] (or some other unevaluated uses).

clang/lib/Parse/ParseExpr.cpp
3497–3498

I was thinking we'd use a new kind of evaluation context for this. We'd enter the evaluation context when we know we need to parse an expression that is an unevaluated string literal which the string literal parser would pay attention to. This would require knowing up-front when we want to parse an unevaluated string literal, but we should have that information available to us at parse time (I think).

clang/lib/Sema/SemaDeclAttr.cpp
878–879 ↗(On Diff #534201)

Let's try to track that down, but... an unevaluated string literal still has a type, surely? It'd be const char[] for C++?

cor3ntin added inline comments.Jun 27 2023, 8:14 AM
clang/lib/Parse/ParseExpr.cpp
3497–3498

After offline discussion, i think what we want to be doing is to have a

ParseAtttributeArgumentList function that is aware of whether the Nth argument is an unevaluated string - by means of modifying tablegen,
and doing the right parsing accordingly.
It would take care of all attributes automatically.
Alas that's a tad more involved.

aaron.ballman added inline comments.Jun 27 2023, 8:16 AM
clang/lib/Parse/ParseExpr.cpp
3497–3498

+1

I agree it's more involved, but it's also a more general solution that fits nicely in the parser design (we do this sort of thing for other parts of attribute parsing).

cor3ntin added inline comments.Jun 27 2023, 8:29 AM
clang/lib/Sema/SemaDeclAttr.cpp
878–879 ↗(On Diff #534201)

It doesn't because it doesn't exist past phase 6.
It's not unevaluated as in decltype, it's more unevaluated as it's a weird token that never participate in the program, the same way a pragma or an attribute don't have a type.
Note that we can revert that change if we do the whole tablegen thing

The relevant test is in test/SemaCXX/warn-thread-safety-parsing.cpp, L17

cor3ntin updated this revision to Diff 534999.Jun 27 2023, 8:30 AM

Add tests for pascal strings (which are not a thing in C++ apparently)

cor3ntin updated this revision to Diff 535073.Jun 27 2023, 11:19 AM

Parse attribute as unevaluated string if they
are declare StringLiteralArgument in the Attr.td file.

WIP

@aaron.ballman Do we agree on direction before I
fix the remaining broken tests?

There are a few limitations, which I'm hoping not to fix there

  • It doesn't support variadic string arguments
  • checking the type of argument ahead of time seems like a good idea overall, maybe we want to expand that system?
Herald added a project: Restricted Project. · View Herald TranscriptJun 27 2023, 11:19 AM

Parse attribute as unevaluated string if they
are declare StringLiteralArgument in the Attr.td file.

WIP

@aaron.ballman Do we agree on direction before I
fix the remaining broken tests?

Mostly agreed, though I left a comment where I think the direction should change slightly.

There are a few limitations, which I'm hoping not to fix there

  • It doesn't support variadic string arguments

That's reasonable; let's leave them as evaluated strings for now so there's no behavioral change.

  • checking the type of argument ahead of time seems like a good idea overall, maybe we want to expand that system?

I agree; we currently have the common handler checking argument counts (https://github.com/llvm/llvm-project/blob/1e010c5c4fae43c52d6f5f1c8e8920c26bcc6cc7/clang/lib/Sema/SemaAttr.cpp#L1419), but we don't generate any code for checking argument types. But certainly doesn't need to be done as part of your work.

clang/include/clang/Basic/Attr.td
3048 ↗(On Diff #535073)

I don't think we should reuse this flag this way. This flag is for the traditional sense of "unevaluated", but unevaluated string literals are a different kind of beast. I think that should be tracked on the argument level. We can either adjust:

class StringArgument<string name, bit opt = 0> : Argument<name, opt>;

so that it takes another bit for whether the string is unevaluated or not, or we could add a new subclass for UnevaluatedStringArgument. Then ClangAttrEmitter.cpp would look at this information when emitting the switch cases.

llvm/cmake/modules/HandleLLVMOptions.cmake
608 ↗(On Diff #535073)

Spurious change. ;-)

aaron.ballman requested changes to this revision.Jun 27 2023, 12:21 PM

Clearing the "accepted" status so it's not confusing as to the state of things.

This revision now requires changes to proceed.Jun 27 2023, 12:21 PM
cor3ntin added inline comments.Jun 27 2023, 12:40 PM
clang/include/clang/Basic/Attr.td
3048 ↗(On Diff #535073)

This is the previous approach i forgot to fixup everywhere.
My current approach is to always consider StringArgument unevaluated.
I don't think it make sense to have both StringArgument and UnevaluatedStringArgument.
Currently in all the places we accept StringArgument, we check it's a possibly parenthesized StringLiteral

If you want an evaluated string literal, an expression that produce a const char* or something should work

llvm/cmake/modules/HandleLLVMOptions.cmake
608 ↗(On Diff #535073)

I've been battling with that for weeks, that flag completely breaks my IDE, not sure why. It was inevitable that it ended up in a commit :|

aaron.ballman added inline comments.Jun 27 2023, 1:32 PM
clang/include/clang/Basic/Attr.td
3048 ↗(On Diff #535073)

My current approach is to always consider StringArgument unevaluated.
I don't think it make sense to have both StringArgument and UnevaluatedStringArgument.

I think that's potentially a pretty significant change in behavior until we actually evaluate (ahahaha, puns!) all the vendor attributes using a StringArgument. Also, I thought you mentioned you planned to leave variadic string arguments as evaluated strings, so there would be a pretty surprising inconsistency to the behavior there. I would feel more comfortable not changing the behavior of attributes we've not validated are still correct when using unevaluated strings.

cor3ntin updated this revision to Diff 535355.Jun 28 2023, 6:09 AM

Fix tests and handle variadic attributes.

  • With that all normal attributes are handled. Only attributes with custop parsing code and those specified as an enum are left untouched.

Note that I have confirmed that all the change attributes require a StringLiteral and go through checkStringLiteralArgumentAttr

cor3ntin updated this revision to Diff 535356.Jun 28 2023, 6:11 AM

revert accidental changes to cmake

Just 2 small nits, otherwise this all LGTM.

clang/lib/Parse/ParseDecl.cpp
430 ↗(On Diff #535356)

Please put a newline between unchained 'if' statements... it makes tehse really hard to read without it.

It happens a few times here.

clang/lib/Sema/SemaDeclAttr.cpp
878 ↗(On Diff #535356)

Unrelated change here? What is this for?

cor3ntin updated this revision to Diff 535373.Jun 28 2023, 6:42 AM

Address Erich's comments

cor3ntin added inline comments.Jun 28 2023, 7:02 AM
clang/lib/Sema/SemaDeclAttr.cpp
878 ↗(On Diff #535356)

Some test i failed to fully revert. good catch!

I don't think it's correct to assume that all string arguments to attributes are unevaluated, but it is hard to tell where to draw the line sometimes. Backing up a step, as I understand P2361, an unevaluated string is one which is not converted into the execution character set (effectively). Is that correct? If so, then as an example, [[clang::annotate()]] should almost certainly be using an evaluated string because the argument is passed down to LLVM IR and is used in ways we cannot predict. What's more, an unevaluated string cannot have some kinds of escape characters (numeric and conditional escape sequences) and those are currently allowed by clang::annotate and could potentially be used by a backend plugin.

I think other attributes may have similar issues. For example, the alias attribute is a bit of a question mark for me -- that takes a string literal representing an external identifier that is looked up. I'm not certain whether that should be in the execution character set or not, but we do support escape sequences for it: https://godbolt.org/z/v65Yd7a68

I think we need to track evaluated vs not on the argument level so that the attributes in Attr.td can decide which form to use. I think we should default to "evaluated" for any attribute we're on the fence about because that's the current behavior they get today (so we should avoid regressions).

clang/include/clang/Sema/ParsedAttr.h
919 ↗(On Diff #535373)
clang/lib/Parse/ParseDecl.cpp
291 ↗(On Diff #535373)

Comment doesn't match the function name. ;-)

453–454 ↗(On Diff #535373)

What are these lines intended to do? We assign to E but nothing ever reads from it after this assignment and we reset it on the next iteration through the loop.

clang/lib/Parse/ParseExpr.cpp
3497

Can revert these two changes now.

I don't think it's correct to assume that all string arguments to attributes are unevaluated, but it is hard to tell where to draw the line sometimes. Backing up a step, as I understand P2361, an unevaluated string is one which is not converted into the execution character set (effectively). Is that correct? If so, then as an example, [[clang::annotate()]] should almost certainly be using an evaluated string because the argument is passed down to LLVM IR and is used in ways we cannot predict. What's more, an unevaluated string cannot have some kinds of escape characters (numeric and conditional escape sequences) and those are currently allowed by clang::annotate and could potentially be used by a backend plugin.

I think other attributes may have similar issues. For example, the alias attribute is a bit of a question mark for me -- that takes a string literal representing an external identifier that is looked up. I'm not certain whether that should be in the execution character set or not, but we do support escape sequences for it: https://godbolt.org/z/v65Yd7a68

I took a quick pass over our existing attributes, and here's my intuition on them regarding encoding of the literal:

Unevaluated Strings are Fine:
AbiTag
TLSModel
Availability
Deprecated
EnableIf/DiagnoseIf
ObjCRuntimeName
PragmaClangBSSSection/PragmaClangDataSection/PragmaClangRodataSection/PragmaClangRelroSection/PragmaClangTextSection (only created implicitly)
Suppress
Target/TargetVersion/TargetClones
Unavailable
Uuid
WarnUnusedResult
NoSanitize
Capability
Assumption
NoBuiltin (it names a builtin name, so this is probably fine to leave unevaluated?)
AcquireHandle/UseHandle/ReleaseHandle
Error
HLSLResourceBinding

Unevaluated String are Potentially Bad:
Annotate
AnnotateType

Unevaluated String Needs More Thinking (common thread is that they survive to LLVM IR):
Alias
AsmLabel
IFunc
BTFDeclTag/BTFTypeTag (is emitted to DWARF with -g so probably evaluated?)
WebAssemblyExportName/WebAssemblyImportModule/WebAssemblyImportModule
ExternalSourceSymbol
SwiftAsyncName/SwiftAttr/SwiftBridge/SwiftName
Section/CodeSeg/InitSeg
WeakRef
EnforceTCB/EnforceTCBLeaf

There's also the escape sequences issue where use of an escape sequence will go from accepted to rejected in these contexts. I did some hunting to see if I could find uses of numeric escape sequences in asm labels or alias attributes, to see if there's some signs this is done in practice:

Testing we can find numeric escape sequences at all:
https://sourcegraph.com/search?q=context:global+%5C%28%5C%22%5B%5B:alpha:%5D%5D*%28%5C%5C%5B%5B:digit:%5D%5D%2B%29%2B%5B%5B:alpha:%5D%5D*%5C%22%5C%29+lang:C+lang:C%2B%2B&patternType=regexp&case=yes&sm=1&groupBy=repo

Testing we can find asm labels at all:
https://sourcegraph.com/search?q=context:global+asm%5C%28%5C%22%5B%5B:alpha:%5D%5D*%5B%5B:alpha:%5D%5D*%5C%22%5C%29+lang:C+lang:C%2B%2B&patternType=regexp&case=yes&sm=1&groupBy=repo

Testing we can find asm labels with numeric escapes:
https://sourcegraph.com/search?q=context:global+asm%5C%28%5C%22%5B%5B:alpha:%5D%5D*%28%5C%5C%5B%5B:digit:%5D%5D%2B%29%2B%5B%5B:alpha:%5D%5D*%5C%22%5C%29+lang:C+lang:C%2B%2B&patternType=regexp&case=yes&sm=1&groupBy=repo

Testing we can find alias attributes at all:
https://sourcegraph.com/search?q=context:global+alias%5C%28%5C%22%5B%5B:alpha:%5D%5D*%5B%5B:alpha:%5D%5D*%5C%22%5C%29+lang:C+lang:C%2B%2B&patternType=regexp&case=yes&sm=1&groupBy=repo

Testing we can find alias attributes with numeric escapes:
https://sourcegraph.com/search?q=context:global+alias%5C%28%5C%22%5B%5B:alpha:%5D%5D*%28%5C%5C%5B%5B:digit:%5D%5D%2B%29%2B%5B%5B:alpha:%5D%5D*%5C%22%5C%29+lang:C+lang:C%2B%2B&patternType=regexp&case=yes&sm=1&groupBy=repo

I think this leaves me with three open questions:

  • Do we know of any uses of the annotate attribute that rely on the string literal being in the execution character set? I do not know of any but I know this is used by plugins quite often.
  • Do we know of any attributes in the "needs more thinking" list that should have the string literal encoded in the execution character set? I think most of these are for referring to identifiers in source and I expect those would want source character set and not execution character set strings.
  • Do we know of any significant body of code using numeric escape sequences in these string literals that could not be relatively easily modified to compile again? I would be surprised, but I think someone should probably run more of the attributes on the "needs more thinking" list through similar searches on source graph and we can use that as an approximation.

If all these answers come back "no" as best we can figure, then I think we can punt on argument-level handling of this until we add an attribute that really does need an execution-encoded (or numeric escape sequence-using) string literal. I think we've got enough time before the Clang 17 branch to hear if the changes cause problems after we've done this due diligence. WDYT?

I don't think it's correct to assume that all string arguments to attributes are unevaluated, but it is hard to tell where to draw the line sometimes. Backing up a step, as I understand P2361, an unevaluated string is one which is not converted into the execution character set (effectively). Is that correct? If so, then as an example, [[clang::annotate()]] should almost certainly be using an evaluated string because the argument is passed down to LLVM IR and is used in ways we cannot predict. What's more, an unevaluated string cannot have some kinds of escape characters (numeric and conditional escape sequences) and those are currently allowed by clang::annotate and could potentially be used by a backend plugin.

I think other attributes may have similar issues. For example, the alias attribute is a bit of a question mark for me -- that takes a string literal representing an external identifier that is looked up. I'm not certain whether that should be in the execution character set or not, but we do support escape sequences for it: https://godbolt.org/z/v65Yd7a68

I think we need to track evaluated vs not on the argument level so that the attributes in Attr.td can decide which form to use. I think we should default to "evaluated" for any attribute we're on the fence about because that's the current behavior they get today (so we should avoid regressions).

I really don't think it makes sense to have both "unevaluated" and "evaluated" arguments.
We chatted offline and we struggle to find places where escape sequences are used, or examples of attributes intended to be in the execution character set.

My suggestion would be to land the non-attributes changes now, and the attributes bits in early clang 18.
If we find clear example of attributes expecting execution character set, they should be able to be described as an expression, which will be checked as a string literal anyway, hopefully?

In the case of annotate, if these are fed, for example to a debugger, their may need to convert to whatever the debugger expect as encoding, which is not necessarily the execution charset,
Same for plugins, they certainly not expect ebcdic data, for example.
I would expect for example static analyzers and code generator to keep working after the introduction of fexec-charset
So it's important that it remains unevaluated in the front end so that it can be correctly converted to the appropriate encoding of the various consumers. Which doesn't have a single answer

Do we know of any attributes in the "needs more thinking" list that should have the string literal encoded in the execution character set? I think most of these are for referring to identifiers in source and I expect those would want source character set and not execution character set strings.

Identifiers and symbol names are in UTF8, and may get mangle through, for example replacing non-ascii codepoints by UCN. The source character set is never relevant
This address the WebAsm attributes

BTFDeclTag/BTFTypeTag (is emitted to DWARF with -g so probably evaluated?)

Is it correct to assume the debugger file encoding is always the same as the program's ? Probably not!
If need be, we can then transcode the strings when doing codegen for these things

aaron.ballman added a subscriber: Restricted Project.Jul 5 2023, 7:32 AM

I don't think it's correct to assume that all string arguments to attributes are unevaluated, but it is hard to tell where to draw the line sometimes. Backing up a step, as I understand P2361, an unevaluated string is one which is not converted into the execution character set (effectively). Is that correct? If so, then as an example, [[clang::annotate()]] should almost certainly be using an evaluated string because the argument is passed down to LLVM IR and is used in ways we cannot predict. What's more, an unevaluated string cannot have some kinds of escape characters (numeric and conditional escape sequences) and those are currently allowed by clang::annotate and could potentially be used by a backend plugin.

I think other attributes may have similar issues. For example, the alias attribute is a bit of a question mark for me -- that takes a string literal representing an external identifier that is looked up. I'm not certain whether that should be in the execution character set or not, but we do support escape sequences for it: https://godbolt.org/z/v65Yd7a68

I think we need to track evaluated vs not on the argument level so that the attributes in Attr.td can decide which form to use. I think we should default to "evaluated" for any attribute we're on the fence about because that's the current behavior they get today (so we should avoid regressions).

I really don't think it makes sense to have both "unevaluated" and "evaluated" arguments.
We chatted offline and we struggle to find places where escape sequences are used, or examples of attributes intended to be in the execution character set.

In general I agree, but the one scenario that I keep coming back to are attributes like diagnose_if where they take an expression we're evaluating at compile time (condition expression) and a string literal that's not evaluated (warning vs error, diagnostic message itself). But I think the "evaluating at compile time" is part of why I don't think we intend the attribute to be considering the execution character set.

My suggestion would be to land the non-attributes changes now, and the attributes bits in early clang 18.

I think we're almost safe enough to make the attribute changes in Clang 17 so that no attribute uses an evaluated argument, but given that there's less than a month before we make the 17 branch, I think it's probably a good idea to make these changes after the branch point so folks have longer to react. Adding clang-vendors to the review for awareness of the potential for a breaking change.

cor3ntin updated this revision to Diff 538073.Jul 7 2023, 4:11 AM

I removed the changes to attributes.
Nothing else changes except cxx_status/ReleaseNotes.

Unevaluated strings in attributes will be back (in a separate PR)

aaron.ballman accepted this revision.Jul 7 2023, 4:21 AM

LGTM with a minor tweak to the wording on the status page, thank you!

clang/www/cxx_status.html
118–120
This revision is now accepted and ready to land.Jul 7 2023, 4:21 AM
This revision was landed with ongoing or failed builds.Jul 7 2023, 4:30 AM
This revision was automatically updated to reflect the committed changes.
barannikov88 added inline comments.Jul 7 2023, 7:37 AM
clang/docs/ReleaseNotes.rst
138

Looks like a copy&paste bug.

cor3ntin added inline comments.Jul 7 2023, 7:40 AM
clang/docs/ReleaseNotes.rst
138

Nice catch, thanks

@cor3ntin
I've been working on pretty much the same functionality in our downstream fork. I was not aware of the paper, nor of the ongoing work in this direction, and so I unfortunately missed the review.
Thanks for this patch, it significantly reduces the number of changes downstream and makes it easier to merge with upstream in the future.

I have a couple of questions about future work:

  • IIUC the paper initially addressed this issue with #line directive, but the changes were reverted(?). Is there any chance they can get back?
  • Are there any plans for making similar changes to asm statement parsing?

@cor3ntin
I've been working on pretty much the same functionality in our downstream fork. I was not aware of the paper, nor of the ongoing work in this direction, and so I unfortunately missed the review.
Thanks for this patch, it significantly reduces the number of changes downstream and makes it easier to merge with upstream in the future.

I have a couple of questions about future work:

  • IIUC the paper initially addressed this issue with #line directive, but the changes were reverted(?). Is there any chance they can get back?

There is a core issue tracking that, ie the c++ committee was concerned about escape sequences in header names
https://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#2693

I'd be happy to bring that back to clang though, as the concerned is unlikely to be warranted for us.

  • Are there any plans for making similar changes to asm statement parsing?

The direction of the c++ committee is that what's in asm() is now strictly implementation-defined, so we could but last time there were concerns about escape sequences in there too.

I hope this patch may allow to gather some data on that.

@cor3ntin, I have reports that applications having encoding prefixes in static_assert are failing to build. The committee did not adopt the subject paper as a "DR resolution". Is it possible to downgrade to a warning?

I hope this patch may allow to gather some data on that.

@cor3ntin, I have reports that applications having encoding prefixes in static_assert are failing to build. The committee did not adopt the subject paper as a "DR resolution". Is it possible to downgrade to a warning?

You know how frequent it is?

Making it a warning is possible but not straightforward.

Previously with a prefix, it was parsed as a wide (for example) string, and we relied on the fact that L was UTF-16/32 to sometimes print a reasonable diagnostics - and sometimes not https://godbolt.org/z/f3Pj4T5aj
This is going to be worse when we add -fexec-charset:

  • static_assert(true, L"やあ") is going to be ill-formed when coded as, eg, EBCDIC because it's not representable, and if it is representable we need to either output mojibake or convert the string back to UTF-8 which we are currently not doing.

Another solution maybe to lexically ignore prefixes by replacing the string literal token on the fly such that they are still parsed as unevaluated strings and not encoded, i could look into that.

I hope this patch may allow to gather some data on that.

@cor3ntin, I have reports that applications having encoding prefixes in static_assert are failing to build. The committee did not adopt the subject paper as a "DR resolution". Is it possible to downgrade to a warning?

You know how frequent it is?

No, but I am concerned that this came up even before we deployed an LLVM 17-based solution (in pre-release testing). I believe that reverting for LLVM 17 is the prudent course of action.

Previously with a prefix, it was parsed as a wide (for example) string, and we relied on the fact that L was UTF-16/32 to sometimes print a reasonable diagnostics - and sometimes not https://godbolt.org/z/f3Pj4T5aj
This is going to be worse when we add -fexec-charset:

  • static_assert(true, L"やあ") is going to be ill-formed when coded as, eg, EBCDIC because it's not representable, and if it is representable we need to either output mojibake or convert the string back to UTF-8 which we are currently not doing.

This may be the motivation for the prefixes in the applications in the first place in the context of other compilers: They may have needed the prefix to avoid unrepresentable character issues (e.g., if the other compiler rejects the unprefixed string, but manages to emit the error to the terminal because both the terminal and the compiler use the source encoding).

Another solution maybe to lexically ignore prefixes by replacing the string literal token on the fly such that they are still parsed as unevaluated strings and not encoded, i could look into that.

This sounds good.

I hope this patch may allow to gather some data on that.

@cor3ntin, I have reports that applications having encoding prefixes in static_assert are failing to build. The committee did not adopt the subject paper as a "DR resolution". Is it possible to downgrade to a warning?

You know how frequent it is?

No, but I am concerned that this came up even before we deployed an LLVM 17-based solution (in pre-release testing). I believe that reverting for LLVM 17 is the prudent course of action.

Early reports of user code getting tripped up on this is something we should react to while we still can; I'd recommend we change the diagnostic to be a warning that defaults to an error so that users who are caught by the changes can still disable the diagnostic rather than be stuck; for Clang 18, we can explore other solutions to the issue. Would this work for you @hubert.reinterpretcast?

I'd recommend we change the diagnostic to be a warning that defaults to an error so that users who are caught by the changes can still disable the diagnostic rather than be stuck; for Clang 18, we can explore other solutions to the issue. Would this work for you @hubert.reinterpretcast?

I think there are questions about whether an error (or even warning) by default is appropriate. This seems to be a change for C++2c that does not have "DR" treatment from the committee. Considering this a warning controlled by c++2c-compat is a potential direction. Indeed, if we are going to accept the code, we might as well allow it as an extension in C++2c modes. With this line of logic, I don't see why we would want user-side churn of making a migration effort.

I'd recommend we change the diagnostic to be a warning that defaults to an error so that users who are caught by the changes can still disable the diagnostic rather than be stuck; for Clang 18, we can explore other solutions to the issue. Would this work for you @hubert.reinterpretcast?

I think there are questions about whether an error (or even warning) by default is appropriate. This seems to be a change for C++2c that does not have "DR" treatment from the committee. Considering this a warning controlled by c++2c-compat is a potential direction. Indeed, if we are going to accept the code, we might as well allow it as an extension in C++2c modes. With this line of logic, I don't see why we would want user-side churn of making a migration effort.

I will endeavor to have a patch by the beginning of the week.

I think the implementation effort is going to be the same whether it is an error by default or not so we can discuss that. I don't have a strong opinion.
Ideally, that would depend on how many users are affected.

However, I don't think nothing at all is a reasonable expectation here, L in static_assert message does either not work or is ignored. In no case does it do what the user wants https://godbolt.org/z/fYnMqT38P
Text encodings are sufficiently confusing that we should not add to the confusion by not telling users their encodings prefix have no effects.

And, given that prior to c++20 the standard simply ignores encoding prefixes, we could also discuss whether it was ever intended for prefixes to be supported or whether it was an oversight to begin with.