This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Reclassify form-feed and vertical tab as vertical WS for the purposes of lexing.
AbandonedPublic

Authored by cor3ntin on Aug 25 2021, 4:39 PM.

Details

Reviewers
None
Summary

According to The Unicode standard, line feed and vertical tab
should be considered line breaks.

See notably http://unicode.org/reports/tr14/#BK

In addition, the C++ standard stipulates:

If there is a form-feed or a vertical-tab character in such a comment,
only whitespace characters shall appear between it
and the new-line that terminates the comment; no diagnostic is required.

Given the amount of ?? I found in the test, it isn't clear that
it was intended for these characters not to be considered
vertical spaces.

This came up in SG16/WG21 in the context of
https://wg21.link/p2348.

Diff Detail

Event Timeline

cor3ntin created this revision.Aug 25 2021, 4:39 PM
cor3ntin requested review of this revision.Aug 25 2021, 4:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2021, 4:39 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin edited the summary of this revision. (Show Details)Aug 26 2021, 1:04 AM
cor3ntin added reviewers: aaron.ballman, rsmith.
cor3ntin updated this revision to Diff 368826.Aug 26 2021, 1:05 AM

Fix tests

Thanks for looking at this — I'm just reviewing the test changes in the minimizer (assuming the code change is the right thing to do).

clang/unittests/Lex/DependencyDirectivesSourceMinimizerTest.cpp
170–175

Please don't modify this assertion to test \v and \f. Instead add new assertions. I'm not sure if they belong in DefineMultilineArgs, or if DefineSpacing might be a better spot, or maybe a new test DefineVerticalWhitespace?

186

Please don't remove this test coverage.

cor3ntin added inline comments.Aug 26 2021, 11:37 AM
clang/unittests/Lex/DependencyDirectivesSourceMinimizerTest.cpp
170–175

I think this test is useful, you are right I should add more somewhere else. I will also add that new test without removing the old one!

cor3ntin updated this revision to Diff 368958.Aug 26 2021, 12:14 PM
  • Add tests and fix existing tests (again)
cor3ntin retitled this revision from Reclassify form-feed and vertical tab as vertical WS for the purposes of lexing. to [WIP] Reclassify form-feed and vertical tab as vertical WS for the purposes of lexing..Aug 27 2021, 12:17 PM
cor3ntin removed reviewers: aaron.ballman, rsmith.
xgupta added a subscriber: xgupta.Aug 27 2021, 12:31 PM

(If you pass --draft flag to arc while updating revision it becomes WIP so you don't need to retitle or remove reviewers. Buildbot also work in that case. And to again ask for review you can use Add Action drop down menu.)

I find the lack of changes to tests other than API unittests to be somewhat concerning.
Indeed, I applied the patch and did not notice behaviour changes in how \f or \v were handled in various contexts sensitive to line-termination.

For example:

clang -fsyntax-only -xc -<<<$'#define X\f int\n#define Y\v x;\ncha\\\v\nr x;\n// \fextern int x;\n'
<stdin>:3:5: warning: backslash and newline separated by space [-Wbackslash-newline-escape]
cha\<U+000B>
    ^
1 warning generated.

It appears phase 3 whitespace conversion is applied to \f and \v and the macro definitions extend to the \ns.
The \v\n sequence is treated as only one newline (backslash escaped) and not two (with only the \v escaped).
It also appears that the // comment is not terminated by the \f.

rsmith added a subscriber: rsmith.Aug 27 2021, 4:37 PM

I assume this is intended to form part of the implementation of https://wg21.link/p2348 and so shouldn't be considered for review right now?

Drive-by observation: under P2348, Clang's behavior of treating \n\r as a single new-line would be "non-standard" (requiring special phase 1 mapping). Is that intentional? \n\r is used as a new-line character on old Mac systems.

I find the lack of changes to tests other than API unittests to be somewhat concerning.
Indeed, I applied the patch and did not notice behaviour changes in how \f or \v were handled in various contexts sensitive to line-termination.

For example:

clang -fsyntax-only -xc -<<<$'#define X\f int\n#define Y\v x;\ncha\\\v\nr x;\n// \fextern int x;\n'
<stdin>:3:5: warning: backslash and newline separated by space [-Wbackslash-newline-escape]
cha\<U+000B>
    ^
1 warning generated.

It appears phase 3 whitespace conversion is applied to \f and \v and the macro definitions extend to the \ns.
The \v\n sequence is treated as only one newline (backslash escaped) and not two (with only the \v escaped).
It also appears that the // comment is not terminated by the \f.

Yep, I noticed that today, \n is hardcoded in many places in the lexer, as it turns out, Hence why I marked the change WIP until I reassess the situation.
I should have been more suspicious than no test broke...

I also observed that gcc behavior seems somewhat consistent with clang.

I'm starting to fear that we might not be able to align compilers to consistency with Unicode.
Also as far as comments are concerned clang is strictly non-conforming (but neither seems to be gcc ?) - or rather as per the standard it produces ill-formed NDR programs, which is interesting.

It might be that the best solution is to abandon this route and instead treat \v and \f as whitespace rather than new-line in the standard and be inconsistent with Unicode.
The is probably not a huge motivation for having \v and \f being line breaking anyway (beyond Unicode saying so).

I assume this is intended to form part of the implementation of https://wg21.link/p2348 and so shouldn't be considered for review right now?

Yes, I wanted your feedback anyway but that was before I realized \n was hardcoded in many places in the lexer, notably in comment parsing. Which makes me question my entire reasoning. Sorry for the noise.

Drive-by observation: under P2348, Clang's behavior of treating \n\r as a single new-line would be "non-standard" (requiring special phase 1 mapping). Is that intentional? \n\r is used as a new-line character on old Mac systems.

Somewhat. \n\r is not described by Unicode so we could either mandate that all implementation support that or leave it as implementation-defined mapping. Correct me if I am wrong, but as the line number is itself implementation-defined, whether there are one or 2 line breaks would not materially affect the standard, either way.
But also, when I looked at it, I was under the impression that older macs used a single \r.

Drive-by observation: under P2348, Clang's behavior of treating \n\r as a single new-line would be "non-standard" (requiring special phase 1 mapping). Is that intentional? \n\r is used as a new-line character on old Mac systems.

Somewhat. \n\r is not described by Unicode so we could either mandate that all implementation support that or leave it as implementation-defined mapping. Correct me if I am wrong, but as the line number is itself implementation-defined, whether there are one or 2 line breaks would not materially affect the standard, either way.

Yes, I suppose that's true. Though if we're nailing down exactly how new-lines are defined and asking every conforming implementation to support UTF-8 and such, maybe it's time to also define how the presumed line number is determined? =)

But also, when I looked at it, I was under the impression that older macs used a single \r.

Oops, you're right, I was thinking of Acorn systems not Macs :)

Drive-by observation: under P2348, Clang's behavior of treating \n\r as a single new-line would be "non-standard" (requiring special phase 1 mapping). Is that intentional? \n\r is used as a new-line character on old Mac systems.

Somewhat. \n\r is not described by Unicode so we could either mandate that all implementation support that or leave it as implementation-defined mapping. Correct me if I am wrong, but as the line number is itself implementation-defined, whether there are one or 2 line breaks would not materially affect the standard, either way.

Yes, I suppose that's true. Though if we're nailing down exactly how new lines are defined and asking every conforming implementation to support UTF-8 and such, maybe it's time to also define how the presumed line number is determined? =)

I'm not sure that we want people to rely on the value on LINE, to be honest.

Anyway, I've been thinking a lot about that and I don't think this change is worth pursuing. I've realized all compilers treat VT and FF as non-line breaking horizontal whitespace and even if that's not consistent with Unicode it's also consistent with other programming languages.
I'll modify the paper to keep treating these codepoints as horizontal whitespaces as this seems the sane, non-disruptive course of action.

I will also try to address your feedback regarding acorn systems on the paper.

Thanks!

cor3ntin abandoned this revision.Aug 28 2021, 4:04 AM