This is an archive of the discontinued LLVM Phabricator instance.

Downgrade the UAX31 diagnostic to a warning which defaults to an error
AbandonedPublic

Authored by aaron.ballman on Aug 29 2022, 11:20 AM.

Details

Reviewers
tahonermann
cor3ntin
Group Reviewers
Restricted Project
Summary

WG21 P1949 and WG14 N2836 were both adopted without any deprecation period for users who were making use of now-invalid characters in their identifiers. This caused some amount of pain for people, especially folks using mathematical symbols in their identifiers, which makes it hard to upgrade to newer Clang versions with this diagnostic as an error.

This patch downgrades the error to be a warning which defaults to an error. This continues to signal to users that the identifiers are in fact invalid, but it gives a grace period for people to update their code bases to the new rules (or, alternatively, time for the Unicode consortium to determine whether some of these characters should be allowed in an identifier in the immutable set). I am tentatively documenting that grace period as being until we ship Clang 18 (which should give users a year or two to migrate their code).

Note, because we implemented the UAX31 rules in Clang 14, I would like to try to land this patch and backport it to Clang 15 so that we avoid having *two* releases with no easy migration path for this subset of users. I think this is reasonable because we're not altering the implementation logic at all, just modifying what diagnostic kind is emitted by default. If someone has a concern about wanting to cherry pick this so late in the cycle, please share those concerns ASAP.

Fixes #54732

Diff Detail

Event Timeline

aaron.ballman created this revision.Aug 29 2022, 11:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 11:20 AM
aaron.ballman requested review of this revision.Aug 29 2022, 11:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 11:20 AM

CC @tstellar and @thieta for any early backporting concerns.

If we don't specifically call out the deprecation period in the diagnostic, usage will grow beyond what we expect. (Most people don't read the release notes; they'll just see it's possible to turn off the error message, and turn it off.)

If we're okay with people deciding their own rules for what they want to allow in identifiers, we can just make the flag permanently available, though, and just drop the whole "deprecation period" discussion. Or alternatively, we could add a separate diagnostic specifically for older standard modes: allow only the characters that were allowed by the older standards, and don't emit it for C++23 and newer. That way, usage naturally goes away as people upgrade their code to newer standard modes.

clang/docs/ReleaseNotes.rst
128

If this makes it into clang 15, we don't want to relnote it for clang 16.

If we don't specifically call out the deprecation period in the diagnostic, usage will grow beyond what we expect. (Most people don't read the release notes; they'll just see it's possible to turn off the error message, and turn it off.)

If we're okay with people deciding their own rules for what they want to allow in identifiers, we can just make the flag permanently available, though, and just drop the whole "deprecation period" discussion. Or alternatively, we could add a separate diagnostic specifically for older standard modes: allow only the characters that were allowed by the older standards, and don't emit it for C++23 and newer. That way, usage naturally goes away as people upgrade their code to newer standard modes.

IIRC, this was accepted as DR, so there are NO 'older standards' of C++ in which allow these characters.

aaron.ballman marked an inline comment as done.Aug 29 2022, 12:10 PM

If we don't specifically call out the deprecation period in the diagnostic, usage will grow beyond what we expect. (Most people don't read the release notes; they'll just see it's possible to turn off the error message, and turn it off.)

I'm okay with that approach (I definitely agree putting the info in the diagnostic will get more eyeballs on it than putting it in the release notes), but at the same time, we use downgradable errors with the intent to turn them into a hard error in the future somewhat frequently and don't add any notice of pending doom for them. These diagnostics are all exposed in a way that makes it clear the code is an error, so anyone who blanket disables the error should not be too terribly surprised when it becomes a hard error later. (Putting the deprecation message into the warning itself has the same problem -- users who disable the warning entirely won't remember what the text of the warning was anyway, and users who inherit build system flags from may never even see the diagnostic messages at all.) So personally, I'm also fine not rewording the diagnostic.

If we're okay with people deciding their own rules for what they want to allow in identifiers, we can just make the flag permanently available, though, and just drop the whole "deprecation period" discussion. Or alternatively, we could add a separate diagnostic specifically for older standard modes: allow only the characters that were allowed by the older standards, and don't emit it for C++23 and newer. That way, usage naturally goes away as people upgrade their code to newer standard modes.

There's quite a bit of discussion on the issue thread that goes into this, but there's pretty strong sentiment from Clang maintainers against making the flag permanently available. This behavior is controlled by standards bodies and allowing users to ignore that effectively makes a custom language subset (goes against our usual policy for language extensions). As Erich mentions, this was adopted as a DR in C++ so there's no older standards there. WG14 doesn't have the notion of DRs any longer and our replacement mechanism wasn't in place when the paper was adopted into C23. Technically that makes it a C23-only change, but I don't see any value to making this a DR in C++ but not in C given that both languages implement the same identifier rules and there's a reasonable user expectation that a valid identifier in C is also valid in C++ and vice versa (otherwise you run into problems linking C and C++ together).

clang/docs/ReleaseNotes.rst
128

Agreed -- I added the release note so folks could review it, but the plan is to either add it directly to the 15.x release notes or if we don't want to backport they'll be typical Clang 16 release notes.

Hmm. 15.0.0 is just a week away - I am not planning any more RCs unless we hit something critical. What's the risk of taking this specific change at this point? Would it make more sense to wait for 15.0.1? (I am guessing it's better if it goes into 15.0.0 or not in 15.x at all).

I am just pushing back because I don't want to introduce to much risk this late in the cycle - but I will defer to you guys since you are the domain experts.

cor3ntin added a comment.EditedAug 29 2022, 12:46 PM

I am strongly against this change. If we do want to do this, we need to restore the original tables from c++11 and check against them not to extend the set of characters to something much wider than was supported before c++23. It should be fairly easy to extract those tables from the git history. (Sorry for the very short comment, I'm currently traveling). A warning for non characters, controls, pua characters and everything not previously accepted isn't sufficient. A warning for things in the old set but not 23 is fine as long as, as other have said, we have a depreciation period.

I'm not really a fan of the argument of "this was accepted as a DR, therefore new versions of clang have to reject code that old versions accepted". Regardless of what the C++ committee does, our commitment to our users is to ensure that code written against "-std=c++11" continues to compile in newer versions of clang.

I don't want to relitigate the whole discussion from the bug, though; just wanted to make sure you considered the possibilities. If you think this is best, I won't object.

I don't have a strong opinion regarding when, or if, the diagnostic is reverted to an always-error. It looks like gcc is not even planning to diagnose identifiers that are ill-formed according to the new rules by default.

With regard to Corentin's opposition to the patch as is, I think it would be acceptable to put this patch in Clang 15 as is and then tighten things up as Corentin suggests for Clang 16. The patch as is is very low risk (it barely even qualifies as a code change). any change to address Corentin's concerns would add some additional risk this late in the release, and the only people impacted are those that opt-in to the new option. I do agree with Corentin's concern that suppressing the error does look like it would have the effect of allowing identifiers that were not previously allowed under the immutable identifier syntax; I think we do want to ensure that identifiers that are in neither immutable identifier syntax nor default identifier syntax are diagnosed (at least in Clang 16).

clang/docs/ReleaseNotes.rst
119–121

Most people reading this won't know what UAX31 is. Linking to it would help some, but the document is fairly inscrutable to the uninitiated. How about something like the following?

The ``-Winvalid-identifier-character`` warning group was added to manage diagnostics
regarding use of invalid identifiers following the adoption of N2836 and P1949 by the
C and C++ committees respectively. This warning default to an error because ...
aaron.ballman marked an inline comment as done.Aug 29 2022, 2:02 PM

I am strongly against this change. If we do want to do this, we need to restore the original tables from c++11 and check against them not to extend the set of characters to something much wider than was supported before c++23. It should be fairly easy to extract those tables from the git history. (Sorry for the very short comment, I'm currently traveling). A warning for non characters, controls, pua characters and everything not previously accepted isn't sufficient. A warning for things in the old set but not 23 is fine as long as, as other have said, we have a depreciation period.

Thanks for this feedback! Ugh, I didn't realize we were going to need to keep around both sets of tables, that potentially adds more expense to this option than I originally realized, depending on how much extra binary size the tables add.

I'm not really a fan of the argument of "this was accepted as a DR, therefore new versions of clang have to reject code that old versions accepted". Regardless of what the C++ committee does, our commitment to our users is to ensure that code written against "-std=c++11" continues to compile in newer versions of clang.

This is the status quo in Clang. When one of the committees says something is a DR, the expectation users have is that the feature was always specified to behave in the fixed way. This case is a bit special though because the old rules were approximately 30+ years old and were "use what you like so long as it's not punctuation, yolo" and the new rules are more restrictive. I think our commitment is a bit weaker than "ensure code continues to compile." We've never promised we wouldn't fix bugs against older standards, and fixing bugs can break otherwise working code. (And WG21 considers this a bug, hence the DR designation.) However, you're definitely right that we're not *obligated* to make the change in older language modes just because the committee said it's a DR. We are obligated to implement the functionality in a conforming manner in newer language modes though.

It's worth noting that the code which is broken by this is largely within one domain (use of math symbols), so the number of users impacted is believed to be relatively small (small enough that we were considering closing the issue as Won't Fix instead of making this change). The reason I decided to propose a relaxation here was to give the handful of impacted users an easier upgrade path.

I don't want to relitigate the whole discussion from the bug, though; just wanted to make sure you considered the possibilities. If you think this is best, I won't object.

Your opinions are very much appreciated on the topic, it's not one with a slam dunk answer (at least that we've found yet).

Hmm. 15.0.0 is just a week away - I am not planning any more RCs unless we hit something critical. What's the risk of taking this specific change at this point? Would it make more sense to wait for 15.0.1? (I am guessing it's better if it goes into 15.0.0 or not in 15.x at all).

I am just pushing back because I don't want to introduce to much risk this late in the cycle - but I will defer to you guys since you are the domain experts.

Given the feedback from Corentin above, I think the window for getting this into Clang 15 is effectively closed. It's not critical (it doesn't fix a crash or a regression); I was mostly hoping it would make it in 15.0 because it doesn't seem appropriate to introduce a new warning flag in a dot release (that would seem to make build systems harder to work with in practice).

@aaron.ballman i am not worried about performance. The old table was fairly compact - as it described large ranges - so a drop in the ocean in terms of binary size. And we only need to check them for the case where we are only going to emit a warning or an error anyway.

See
https://github.com/llvm/llvm-project/commit/4e80636db71a1b6123d15ed1f9eda3979b4292de#diff-c95475d750149050c46892d3a9731fd839737a51a6f19f8e0580acee052e9b26L43

@aaron.ballman I don't think this patch is no longer necessary given that we merged the math profile

I don't think this patch is no longer necessary given that we merged the math profile

I agree; we can revisit this if complaints from users due to use of characters not in the math profile materializes.

aaron.ballman abandoned this revision.Jan 3 2023, 8:34 AM

@aaron.ballman I don't think this patch is no longer necessary given that we merged the math profile

Agreed, I'm going to abandon this for now. Thanks!