Page MenuHomePhabricator

Enable fexec-charset option
Needs ReviewPublic

Authored by abhina.sreeskantharajan on Dec 10 2020, 5:49 AM.

Details

Summary

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on https://reviews.llvm.org/D88741

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
abhina.sreeskantharajan requested review of this revision.Dec 10 2020, 5:49 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 10 2020, 5:49 AM
rsmith added a subscriber: rsmith.Dec 10 2020, 3:11 PM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

clang/include/clang/Lex/LiteralTranslator.h
30–32

It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.

33

Similarly, use of a global cache here will require you guard it with a mutex.

As an alternative, how about we move all this state to be per-instance state, and store an instance of LiteralTranslator on the Preprocessor?

clang/lib/Lex/LiteralSupport.cpp
233–241

Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.

238

Reinterpreting an unsigned* as a char* like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual char object to hold the value and pass that in instead.

240

What should happen if the result doesn't fit into an unsigned? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single unsigned result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.

1323

Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for L... literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)

clang/test/Driver/cl-options.c
217

Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). ...-NOT: error and ...-NOT: warning would be a bit better, if this is worth testing.

clang/test/Driver/clang_f_opts.c
213

Again, this is not a useful test.

tahonermann added inline comments.Dec 11 2020, 11:14 PM
clang/include/clang/Driver/Options.td
3583–3584

How about substituting "character set", "character encoding", or "charset" for "codepage"?

This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry?

The set of recognized names can be a superset of the names that are actually supported.

clang/include/clang/Lex/LiteralSupport.h
192

Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.

245

Same concern here with respect to persisting the conversion state as a data member.

247

This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.

clang/lib/Driver/ToolChains/Clang.cpp
5970–5985

I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.

clang/lib/Frontend/CompilerInvocation.cpp
3573

I wouldn't expect the cast to std::string to be needed here.

clang/lib/Lex/LiteralSupport.cpp
233–241

For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.

236

What should happen if ResultChar >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems ResultChar should be converted to something.

1754–1766

UCNs will require conversion here.

llvm/lib/Support/Triple.cpp
1028–1029

No support for targeting the z/OS Enhanced ASCII run-time?

Thanks for your quick reviews! I haven't addressed all the comments yet but I plan to address all of them. I put up this patch early because it has a few major changes:

  • moves LiteralTranslator class to Preprocessor instead of being a static global class
  • add isUTFLiteral() function to detect strings like u8"..." and stop translation
  • translate wide string literals to the system charset for now (we don't have an implementation plan for -fwide-charset right now)
  • remove tests that check fexec-charset will not accept non-UTF charsets
abhina.sreeskantharajan marked 9 inline comments as done.Dec 15 2020, 8:40 AM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

clang/include/clang/Driver/Options.td
3583–3584

I've updated the description from codepage to charset.

It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.

clang/include/clang/Lex/LiteralTranslator.h
33

Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.

clang/lib/Driver/ToolChains/Clang.cpp
5970–5985

Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.

clang/lib/Frontend/CompilerInvocation.cpp
3573

Without that cast, I get the following build error:

error: no viable overloaded '='
clang/test/Driver/cl-options.c
217

You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.

llvm/lib/Support/Triple.cpp
1028–1029

We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).

abhina.sreeskantharajan marked 5 inline comments as done.Dec 15 2020, 11:05 AM
tahonermann added inline comments.Dec 16 2020, 8:35 AM
clang/include/clang/Driver/Options.td
3583–3584

Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here:

def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
  HelpText<"Set the execution <charset> for string and character literals.  Supported character encodings include XXX and those supported by the host iconv library.">;
clang/lib/Driver/ToolChains/Clang.cpp
5970–5985

Understood, but what would be the harm in performing a lookup (constructing a CharSetConverter) here?

clang/lib/Frontend/CompilerInvocation.cpp
3573

Ok, rather than a cast, I suggest:

Opts.ExecCharset = Value.str();
clang/lib/Lex/LiteralSupport.cpp
1323–1324

Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option).

1592–1596

The stored TranslationState should not be completely ignored for wide and UTF string literals. The standard permits things like the following.

#pragma rigoot L"bozit"
#pragma rigoot u"bozit"
_Pragma(L"rigoot bozit")
_Pragma(u8"rigoot bozit")

For at least the _Pragma(L"...") case, the C++ standard states the L is ignored, but it doesn't say anything about other encoding prefixes.

1593–1594

Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option).

tahonermann added inline comments.Dec 16 2020, 8:55 AM
clang/test/CodeGen/systemz-charset.c
5

const char * please :)

25

Add validation of UCNs. Something like:

const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
// CHECK: c"\42\B0\59\00"

Thanks for your patience, I've addressed some more comments. Here is the summary of the changes in this patch:

  • add translation for UCN strings, update testcase
  • fix helptext for fexec-charset option in Options.td
  • check for invalid charsets when parsing driver options.
  • fix up char conversion code
abhina.sreeskantharajan marked 11 inline comments as done.Mon, Dec 21, 8:25 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Driver/Options.td
3583–3584

I've updated the HelpText with your suggested description.

clang/include/clang/Lex/LiteralSupport.h
192

Thanks, I've removed this.

245

If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?

clang/lib/Driver/ToolChains/Clang.cpp
5970–5985

I initially thought it will be a performance issue if we are creating the Converter twice, once here and once in the Preprocessor. But I do think its a good idea to diagnose this early. I've modified the code to diagnose and error here.

clang/lib/Frontend/CompilerInvocation.cpp
3573

Thanks, I've applied this change.

clang/lib/Lex/LiteralSupport.cpp
233–241

Right, stateful encodings may be a problem we will need to revisit later as well.

236

This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.

238

Thanks, I've created a char instead.

240

This may be a problem we need to revisit since ResultChar is expecting a char.

1754–1766

I've added code to translate UCN characters and have updated the testcase as well.

abhina.sreeskantharajan marked 8 inline comments as done.Mon, Dec 21, 8:25 AM
clang/lib/Lex/LiteralSupport.cpp
1323–1324

Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?

tahonermann added inline comments.Wed, Dec 23, 10:04 PM
clang/include/clang/Lex/LiteralSupport.h
245

I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.

clang/lib/Driver/ToolChains/Clang.cpp
5982–5984

Thank you for adding this.

clang/lib/Lex/LiteralSupport.cpp
236

Conversion can fail here, particularly in the scenario corresponding to the default switch case above; ResultChar could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting ? (in the execution encoding) seems appropriate to me.

237

As Richard previously noted, this memcpy() needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code units? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a ? (in the execution encoding) seem like acceptable behaviors to me.

1323–1324

Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?

clang/test/CodeGen/systemz-charset.c
17–24

const char* here too please.

clang/test/CodeGen/systemz-charset.cpp
2–26

This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.

clang/test/Driver/clang_f_opts.c
212–220

This looks good. Can tests also be added to validate that the UTF-8, ISO8895-1, and IBM-1047 option arguments are properly recognized?

Thanks for the review! I've addressed most of the comments but I still need to work on the translation issues in CharLiteralParser that was kindly pointed out by Tom and Richard. Here are the summary of changes in this patch:

  • Removed TranslationState as a member of StringLiteralParser and pass it as an argument instead
  • Added an assertion for wide character translation instead of translating them to the system charset
  • Invalid char escapes are changed to '?' and then translated
  • Updated testcases as requested
abhina.sreeskantharajan marked 8 inline comments as done.Tue, Dec 29, 11:39 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralSupport.h
245

Thanks, I've removed this member.

clang/lib/Lex/LiteralSupport.cpp
1323–1324

Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.

1593–1594

I've now added an assertion when translating wide characters.

clang/test/CodeGen/systemz-charset.cpp
2–26

Good idea, I added those testcases as per your suggestion.

abhina.sreeskantharajan marked 4 inline comments as done.Tue, Dec 29, 11:39 AM
abhina.sreeskantharajan marked an inline comment as done.Tue, Dec 29, 12:45 PM
abhina.sreeskantharajan added inline comments.
clang/lib/Lex/LiteralSupport.cpp
236

Thanks, I added the substitution with the '?' character for invalid escapes.

abhina.sreeskantharajan marked an inline comment as done.Tue, Dec 29, 12:45 PM

This patch replaces the memcpy in CharLiteralParser with an assignment. I've added an assertion for cases where the character size increases after translation.

abhina.sreeskantharajan marked 2 inline comments as done.Wed, Dec 30, 6:52 AM
abhina.sreeskantharajan added inline comments.
clang/lib/Lex/LiteralSupport.cpp
237

I replaced memcpy with an assignment. Please let me know if there is a better solution.

240

I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.

abhina.sreeskantharajan marked an inline comment as done.Wed, Dec 30, 6:52 AM
clang/lib/Lex/LiteralSupport.cpp
1592–1596

Please correct me if I'm wrong, these Pragma strings are not parsed through StringLiteralParser, they are parsed in clang/lib/Lex/Pragma.cpp in this function.
void Preprocessor::Handle_Pragma(Token &Tok)
So if they require translation, it would need to be done in that function.