This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms. 
This patch depends on https://reviews.llvm.org/D88741
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?
We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.
| clang/include/clang/Lex/LiteralTranslator.h | ||
|---|---|---|
| 30–32 | It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example. | |
| 33 | Similarly, use of a global cache here will require you guard it with a mutex. As an alternative, how about we move all this state to be per-instance state, and store an instance of LiteralTranslator on the Preprocessor? | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 231–239 | Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct. | |
| 236 | Reinterpreting an unsigned* as a char* like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual char object to hold the value and pass that in instead. | |
| 238 | What should happen if the result doesn't fit into an unsigned? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single unsigned result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host. | |
| 1318 | Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for L... literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?) | |
| clang/test/Driver/cl-options.c | ||
| 217 | Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). ...-NOT: error and ...-NOT: warning would be a bit better, if this is worth testing. | |
| clang/test/Driver/clang_f_opts.c | ||
| 213 | Again, this is not a useful test. | |
| clang/include/clang/Driver/Options.td | ||
|---|---|---|
| 3583–3584 | How about substituting "character set", "character encoding", or "charset" for "codepage"? This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry? The set of recognized names can be a superset of the names that are actually supported. | |
| clang/include/clang/Lex/LiteralSupport.h | ||
| 192 | Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor. | |
| 246 | Same concern here with respect to persisting the conversion state as a data member. | |
| 248 | This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere. | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5970–5985 | I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance. | |
| clang/lib/Frontend/CompilerInvocation.cpp | ||
| 3573 | I wouldn't expect the cast to std::string to be needed here. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 231–239 | For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time. | |
| 234 | What should happen if ResultChar >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems ResultChar should be converted to something. | |
| 1751–1763 | UCNs will require conversion here. | |
| llvm/lib/Support/Triple.cpp | ||
| 1028–1029 | No support for targeting the z/OS Enhanced ASCII run-time? | |
Thanks for your quick reviews! I haven't addressed all the comments yet but I plan to address all of them. I put up this patch early because it has a few major changes:
- moves LiteralTranslator class to Preprocessor instead of being a static global class
- add isUTFLiteral() function to detect strings like u8"..." and stop translation
- translate wide string literals to the system charset for now (we don't have an implementation plan for -fwide-charset right now)
- remove tests that check fexec-charset will not accept non-UTF charsets
Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.
| clang/include/clang/Driver/Options.td | ||
|---|---|---|
| 3583–3584 | I've updated the description from codepage to charset. It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform. | |
| clang/include/clang/Lex/LiteralTranslator.h | ||
| 33 | Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well. | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5970–5985 | Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter. | |
| clang/lib/Frontend/CompilerInvocation.cpp | ||
| 3573 | Without that cast, I get the following build error: error: no viable overloaded '=' | |
| clang/test/Driver/cl-options.c | ||
| 217 | You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter. | |
| llvm/lib/Support/Triple.cpp | ||
| 1028–1029 | We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC). | |
| clang/include/clang/Driver/Options.td | ||
|---|---|---|
| 3583–3584 | Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here: def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">, HelpText<"Set the execution <charset> for string and character literals. Supported character encodings include XXX and those supported by the host iconv library.">; | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5970–5985 | Understood, but what would be the harm in performing a lookup (constructing a CharSetConverter) here? | |
| clang/lib/Frontend/CompilerInvocation.cpp | ||
| 3573 | Ok, rather than a cast, I suggest: Opts.ExecCharset = Value.str(); | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 1318–1319 | Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option). | |
| 1589–1593 | The stored TranslationState should not be completely ignored for wide and UTF string literals. The standard permits things like the following. #pragma rigoot L"bozit" #pragma rigoot u"bozit" _Pragma(L"rigoot bozit") _Pragma(u8"rigoot bozit") For at least the _Pragma(L"...") case, the C++ standard states the L is ignored, but it doesn't say anything about other encoding prefixes. | |
| 1590–1591 | Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option). | |
Thanks for your patience, I've addressed some more comments. Here is the summary of the changes in this patch:
- add translation for UCN strings, update testcase
- fix helptext for fexec-charset option in Options.td
- check for invalid charsets when parsing driver options.
- fix up char conversion code
| clang/include/clang/Driver/Options.td | ||
|---|---|---|
| 3583–3584 | I've updated the HelpText with your suggested description. | |
| clang/include/clang/Lex/LiteralSupport.h | ||
| 192 | Thanks, I've removed this. | |
| 246 | If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member? | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5970–5985 | I initially thought it will be a performance issue if we are creating the Converter twice, once here and once in the Preprocessor. But I do think its a good idea to diagnose this early. I've modified the code to diagnose and error here. | |
| clang/lib/Frontend/CompilerInvocation.cpp | ||
| 3573 | Thanks, I've applied this change. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 231–239 | Right, stateful encodings may be a problem we will need to revisit later as well. | |
| 234 | This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8. | |
| 236 | Thanks, I've created a char instead. | |
| 238 | This may be a problem we need to revisit since ResultChar is expecting a char. | |
| 1751–1763 | I've added code to translate UCN characters and have updated the testcase as well. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
|---|---|---|
| 1318–1319 | Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim? | |
| clang/include/clang/Lex/LiteralSupport.h | ||
|---|---|---|
| 246 | I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments. | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5982–5984 | Thank you for adding this. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 234 | Conversion can fail here, particularly in the scenario corresponding to the default switch case above; ResultChar could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting ? (in the execution encoding) seems appropriate to me. | |
| 235 | As Richard previously noted, this memcpy() needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code units? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a ? (in the execution encoding) seem like acceptable behaviors to me. | |
| 1318–1319 | Perhaps an Internal compiler error to indicate that appropriate support is not yet in place? | |
| clang/test/CodeGen/systemz-charset.c | ||
| 16–23 | const char* here too please. | |
| clang/test/CodeGen/systemz-charset.cpp | ||
| 1–25 | This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047. | |
| clang/test/Driver/clang_f_opts.c | ||
| 212–213 | This looks good. Can tests also be added to validate that the UTF-8, ISO8895-1, and IBM-1047 option arguments are properly recognized? | |
Thanks for the review! I've addressed most of the comments but I still need to work on the translation issues in CharLiteralParser that was kindly pointed out by Tom and Richard. Here are the summary of changes in this patch:
- Removed TranslationState as a member of StringLiteralParser and pass it as an argument instead
- Added an assertion for wide character translation instead of translating them to the system charset
- Invalid char escapes are changed to '?' and then translated
- Updated testcases as requested
| clang/include/clang/Lex/LiteralSupport.h | ||
|---|---|---|
| 246 | Thanks, I've removed this member. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 1318–1319 | Thanks for the suggestion. I've added assertions for wide character translation before we do any translation. | |
| 1590–1591 | I've now added an assertion when translating wide characters. | |
| clang/test/CodeGen/systemz-charset.cpp | ||
| 1–25 | Good idea, I added those testcases as per your suggestion. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
|---|---|---|
| 234 | Thanks, I added the substitution with the '?' character for invalid escapes. | |
This patch replaces the memcpy in CharLiteralParser with an assignment. I've added an assertion for cases where the character size increases after translation.
| clang/lib/Lex/LiteralSupport.cpp | ||
|---|---|---|
| 1589–1593 | Please correct me if I'm wrong, these Pragma strings are not parsed through StringLiteralParser, they are parsed in clang/lib/Lex/Pragma.cpp in this function. | |
Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.
| clang/include/clang/Lex/LiteralSupport.h | ||
|---|---|---|
| 191 | Is the default argument for TranslationState actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a CharLiteralParser with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of CharLiteralParser type constructed are in EvaluateValue() in clang/lib/Lex/PPExpressions.cpp and Sema::ActOnCharacterConstant() in clang/lib/Sema/SemaExpr.cpp. | |
| 251–252 | I don't think a LiteralTranslator object is actually needed in this case. The only use of this constructor that I see is in ModuleMapParser::consumeToken() in clang/lib/Lex/ModuleMap.cpp and, in that case, I don't think any translation is necessary. This suggests that TranslationState is not needed for this constructor either; NoTranslation can be passed to init(). | |
| 262–263 | I don't think getOffsetOfStringByte() should require a ConversionState parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding. | |
| clang/include/clang/Lex/LiteralTranslator.h | ||
| 20–24 | Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps: enum class LiteralConversion {
  None,
  ToSystemCharset,
  ToExecCharset
}; | |
| 31–32 | I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using struct and not include access specifiers. | |
| 36 | Given the converter setters and accessors below, ExecCharsetTables should be a private member. | |
| 38 | getConversionTable() is logically const. Perhaps ExecCharsetTables should be mutable. From a terminology stand point, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest: llvm::CharSetConverter *getCharSetConverter(const char *Encoding) const; | |
| 39 | findOrCreateExecCharsetTable() seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a llvm::CharSetConverter pointer with nullptr indicating lookup/creation failed. This function seems like it should be an implementation detail of the class, not a public interface. | |
| 42–44 | setTranslationTables() is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization. | |
| 45 | I suggest trying a design more like this: class LiteralTranslator {
  std::string SystemEncoding;
  std::string ExecutionEncoding;
public:
  LiteralTranslator(llvm::StringRef SystemEncoding,
                    llvm::StringRef ExecutionEncoding);
  // Retrieve the name for the system encoding.
  llvm::StringRef getSystemEncoding() const;
  // Retrieve the name for the execution encoding.
  llvm::StringRef getExecutionEncoding() const;
  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the system encoding.
  llvm::CharSetConverter* getSystemEncodingConverter() const;
  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the execution encoding.
  llvm::CharSetConverter* getExecutionEncodingConverter() const;
};
LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts,
                                                     const clang::TargetInfo &TInfo,
                                                     clang::DiagnosticsEngine &Diags); | |
| clang/include/clang/Lex/Preprocessor.h | ||
| 145 | I don't see a reason for LT to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member? | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 1317–1319 | Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like: llvm::CharSetConverter *Converter = nullptr;
if (! isUTFLiteral(Kind)) {
  assert(LT);
  Converter = LT->getCharConversionTable(TranslateToExecCharset);
} | |
| 1589–1593 | Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases. | |
| clang/lib/Lex/Preprocessor.cpp | ||
| 88–89 | Per comments elsewhere, please try to make LT a non-pointer non-reference data member. | |
| clang/include/clang/Driver/Options.td | ||
|---|---|---|
| 3583 | Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option | |
Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.
| clang/include/clang/Lex/LiteralSupport.h | ||
|---|---|---|
| 191 | You're right, we don't have any cases that use this arg yet so we can remove it. | |
| 251–252 | Thanks, I've removed it. | |
| clang/include/clang/Lex/Preprocessor.h | ||
| 145 | Thanks, I've changed it to a non-reference non-pointer member. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 1317–1319 | I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes. | |
Hi Tom, @tahonermann I renamed the LiteralTranslator class to LiteralConverter.cpp and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward, the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.
| clang/include/clang/Lex/Preprocessor.h | ||
|---|---|---|
| 145 | Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why LT? What does the T stand for? | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5976 | Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend? | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 234 | This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, \q should produce q -- and take fallback action only if the character is unencodable. Producing a ? seems unlikely to ever be what anyone wants; producing a hard error would seem preferable. | |
| 235 | Can you avoid using std::string here? Eg, pass a StringRef, extending the converter to be able to take one if necessary. | |
| 238 | Is there any guarantee the assertion will not fail? | |
| 1364–1365 | Why is this case not possible? | |
| 1368 | What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character. | |
| 1701–1703 | Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once? | |
| clang/test/Driver/cl-options.c | ||
| 214 | Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.) | |
Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.
| clang/include/clang/Lex/Preprocessor.h | ||
|---|---|---|
| 145 | Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name. | |
| clang/lib/Driver/ToolChains/Clang.cpp | ||
| 5976 | Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests. | |
| clang/lib/Lex/LiteralSupport.cpp | ||
| 234 | Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead? | |
| 1364–1365 | This case should be handled when fwide-exec-charset option is implemented. Until then, we thought it was best to emit a error message that wide literal translation is not supported. | |
Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef
Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)
Thanks for catching that. This sets the clang_literal_encoding to Opts.ExecCharset or defaults to SystemCharset.
We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.
Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.
The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved
"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?
Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8
Thanks for the input! I agree doing the conversion lazily will help avoid hitting these issues since we push translation to a later stage but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can make a extra member in StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string or as an attribute?). Let me know what you think.
Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16
I have opened a new patch https://reviews.llvm.org/D153419 and am closing this revision
Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option
Adding MarshallingInfoString<LangOpts<"ExecCharset">> here should do the trick. You can then delete the option parsing in CompilerInvocation.cpp.