This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on https://reviews.llvm.org/D88741
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.
clang/include/clang/Lex/LiteralSupport.h | ||
---|---|---|
192 | Is the default argument for TranslationState actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a CharLiteralParser with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of CharLiteralParser type constructed are in EvaluateValue() in clang/lib/Lex/PPExpressions.cpp and Sema::ActOnCharacterConstant() in clang/lib/Sema/SemaExpr.cpp. | |
244–245 | I don't think a LiteralTranslator object is actually needed in this case. The only use of this constructor that I see is in ModuleMapParser::consumeToken() in clang/lib/Lex/ModuleMap.cpp and, in that case, I don't think any translation is necessary. This suggests that TranslationState is not needed for this constructor either; NoTranslation can be passed to init(). | |
258–259 | I don't think getOffsetOfStringByte() should require a ConversionState parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding. | |
clang/include/clang/Lex/LiteralTranslator.h | ||
19–23 ↗ | (On Diff #314115) | Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps: enum class LiteralConversion { None, ToSystemCharset, ToExecCharset }; |
30–31 ↗ | (On Diff #314115) | I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using struct and not include access specifiers. |
35 ↗ | (On Diff #314115) | Given the converter setters and accessors below, ExecCharsetTables should be a private member. |
37 ↗ | (On Diff #314115) | getConversionTable() is logically const. Perhaps ExecCharsetTables should be mutable. From a terminology stand point, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest: llvm::CharSetConverter *getCharSetConverter(const char *Encoding) const; |
38 ↗ | (On Diff #314115) | findOrCreateExecCharsetTable() seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a llvm::CharSetConverter pointer with nullptr indicating lookup/creation failed. This function seems like it should be an implementation detail of the class, not a public interface. |
41–43 ↗ | (On Diff #314115) | setTranslationTables() is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization. |
44 ↗ | (On Diff #314115) | I suggest trying a design more like this: class LiteralTranslator { std::string SystemEncoding; std::string ExecutionEncoding; public: LiteralTranslator(llvm::StringRef SystemEncoding, llvm::StringRef ExecutionEncoding); // Retrieve the name for the system encoding. llvm::StringRef getSystemEncoding() const; // Retrieve the name for the execution encoding. llvm::StringRef getExecutionEncoding() const; // Retrieve a converter for converting from the internal encoding (UTF-8) // to the system encoding. llvm::CharSetConverter* getSystemEncodingConverter() const; // Retrieve a converter for converting from the internal encoding (UTF-8) // to the execution encoding. llvm::CharSetConverter* getExecutionEncodingConverter() const; }; LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts, const clang::TargetInfo &TInfo, clang::DiagnosticsEngine &Diags); |
clang/include/clang/Lex/Preprocessor.h | ||
145 | I don't see a reason for LT to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member? | |
clang/lib/Lex/LiteralSupport.cpp | ||
1341–1343 | Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like: llvm::CharSetConverter *Converter = nullptr; if (! isUTFLiteral(Kind)) { assert(LT); Converter = LT->getCharConversionTable(TranslateToExecCharset); } | |
1614–1618 | Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases. | |
clang/lib/Lex/Preprocessor.cpp | ||
88 ↗ | (On Diff #314115) | Per comments elsewhere, please try to make LT a non-pointer non-reference data member. |
clang/include/clang/Driver/Options.td | ||
---|---|---|
4433 | Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option |
Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.
clang/include/clang/Lex/LiteralSupport.h | ||
---|---|---|
192 | You're right, we don't have any cases that use this arg yet so we can remove it. | |
244–245 | Thanks, I've removed it. | |
clang/include/clang/Lex/Preprocessor.h | ||
145 | Thanks, I've changed it to a non-reference non-pointer member. | |
clang/lib/Lex/LiteralSupport.cpp | ||
1341–1343 | I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes. |
Hi Tom, @tahonermann I renamed the LiteralTranslator class to LiteralConverter.cpp and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward, the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.
clang/include/clang/Lex/Preprocessor.h | ||
---|---|---|
145 | Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why LT? What does the T stand for? | |
clang/lib/Driver/ToolChains/Clang.cpp | ||
6238 | Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend? | |
clang/lib/Lex/LiteralSupport.cpp | ||
236 | This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, \q should produce q -- and take fallback action only if the character is unencodable. Producing a ? seems unlikely to ever be what anyone wants; producing a hard error would seem preferable. | |
237 | Can you avoid using std::string here? Eg, pass a StringRef, extending the converter to be able to take one if necessary. | |
240 | Is there any guarantee the assertion will not fail? | |
1383–1384 | Why is this case not possible? | |
1387 | What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character. | |
1735–1737 | Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once? | |
clang/test/Driver/cl-options.c | ||
215 | Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.) |
Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.
clang/include/clang/Lex/Preprocessor.h | ||
---|---|---|
145 | Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name. | |
clang/lib/Driver/ToolChains/Clang.cpp | ||
6238 | Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests. | |
clang/lib/Lex/LiteralSupport.cpp | ||
236 | Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead? | |
1383–1384 | This case should be handled when fwide-exec-charset option is implemented. Until then, we thought it was best to emit a error message that wide literal translation is not supported. |
Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef
Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)
Thanks for catching that. This sets the clang_literal_encoding to Opts.ExecCharset or defaults to SystemCharset.
We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.
Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.
The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved
"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?
Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8
Thanks for the input! I agree doing the conversion lazily will help avoid hitting these issues since we push translation to a later stage but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can make a extra member in StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string or as an attribute?). Let me know what you think.
Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16
I have opened a new patch https://reviews.llvm.org/D153419 and am closing this revision