This is an archive of the discontinued LLVM Phabricator instance.

Enable fexec-charset option
AbandonedPublic

Authored by abhina.sreeskantharajan on Dec 10 2020, 5:49 AM.

Details

Summary

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on https://reviews.llvm.org/D88741

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
abhina.sreeskantharajan requested review of this revision.Dec 10 2020, 5:49 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 10 2020, 5:49 AM
rsmith added a subscriber: rsmith.Dec 10 2020, 3:11 PM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

clang/include/clang/Lex/LiteralTranslator.h
29–31 ↗(On Diff #310863)

It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.

32 ↗(On Diff #310863)

Similarly, use of a global cache here will require you guard it with a mutex.

As an alternative, how about we move all this state to be per-instance state, and store an instance of LiteralTranslator on the Preprocessor?

clang/lib/Lex/LiteralSupport.cpp
233–241

Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.

238

Reinterpreting an unsigned* as a char* like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual char object to hold the value and pass that in instead.

240

What should happen if the result doesn't fit into an unsigned? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single unsigned result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.

1342

Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for L... literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)

clang/test/Driver/cl-options.c
218

Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). ...-NOT: error and ...-NOT: warning would be a bit better, if this is worth testing.

clang/test/Driver/clang_f_opts.c
226

Again, this is not a useful test.

tahonermann added inline comments.Dec 11 2020, 11:14 PM
clang/include/clang/Driver/Options.td
4433–4434

How about substituting "character set", "character encoding", or "charset" for "codepage"?

This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry?

The set of recognized names can be a superset of the names that are actually supported.

clang/include/clang/Lex/LiteralSupport.h
193

Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.

243

Same concern here with respect to persisting the conversion state as a data member.

245

This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.

clang/lib/Driver/ToolChains/Clang.cpp
6232–6241

I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.

clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗(On Diff #310863)

I wouldn't expect the cast to std::string to be needed here.

clang/lib/Lex/LiteralSupport.cpp
233–241

For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.

236

What should happen if ResultChar >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems ResultChar should be converted to something.

1787–1799

UCNs will require conversion here.

llvm/lib/Support/Triple.cpp
1051–1052

No support for targeting the z/OS Enhanced ASCII run-time?

Thanks for your quick reviews! I haven't addressed all the comments yet but I plan to address all of them. I put up this patch early because it has a few major changes:

  • moves LiteralTranslator class to Preprocessor instead of being a static global class
  • add isUTFLiteral() function to detect strings like u8"..." and stop translation
  • translate wide string literals to the system charset for now (we don't have an implementation plan for -fwide-charset right now)
  • remove tests that check fexec-charset will not accept non-UTF charsets
abhina.sreeskantharajan marked 9 inline comments as done.Dec 15 2020, 8:40 AM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

clang/include/clang/Driver/Options.td
4433–4434

I've updated the description from codepage to charset.

It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.

clang/include/clang/Lex/LiteralTranslator.h
32 ↗(On Diff #310863)

Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.

clang/lib/Driver/ToolChains/Clang.cpp
6232–6241

Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.

clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗(On Diff #310863)

Without that cast, I get the following build error:

error: no viable overloaded '='
clang/test/Driver/cl-options.c
218

You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.

llvm/lib/Support/Triple.cpp
1051–1052

We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).

abhina.sreeskantharajan marked 5 inline comments as done.Dec 15 2020, 11:05 AM
tahonermann added inline comments.Dec 16 2020, 8:35 AM
clang/include/clang/Driver/Options.td
4433–4434

Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here:

def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
  HelpText<"Set the execution <charset> for string and character literals.  Supported character encodings include XXX and those supported by the host iconv library.">;
clang/lib/Driver/ToolChains/Clang.cpp
6232–6241

Understood, but what would be the harm in performing a lookup (constructing a CharSetConverter) here?

clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗(On Diff #310863)

Ok, rather than a cast, I suggest:

Opts.ExecCharset = Value.str();
clang/lib/Lex/LiteralSupport.cpp
1342–1343

Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option).

1614–1618

The stored TranslationState should not be completely ignored for wide and UTF string literals. The standard permits things like the following.

#pragma rigoot L"bozit"
#pragma rigoot u"bozit"
_Pragma(L"rigoot bozit")
_Pragma(u8"rigoot bozit")

For at least the _Pragma(L"...") case, the C++ standard states the L is ignored, but it doesn't say anything about other encoding prefixes.

1615–1616

Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of wchar_t (which may be influenced by the -fshort-wchar option).

tahonermann added inline comments.Dec 16 2020, 8:55 AM
clang/test/CodeGen/systemz-charset.c
5

const char * please :)

25

Add validation of UCNs. Something like:

const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
// CHECK: c"\42\B0\59\00"

Thanks for your patience, I've addressed some more comments. Here is the summary of the changes in this patch:

  • add translation for UCN strings, update testcase
  • fix helptext for fexec-charset option in Options.td
  • check for invalid charsets when parsing driver options.
  • fix up char conversion code
abhina.sreeskantharajan marked 11 inline comments as done.Dec 21 2020, 8:25 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Driver/Options.td
4433–4434

I've updated the HelpText with your suggested description.

clang/include/clang/Lex/LiteralSupport.h
193

Thanks, I've removed this.

243

If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?

clang/lib/Driver/ToolChains/Clang.cpp
6232–6241

I initially thought it will be a performance issue if we are creating the Converter twice, once here and once in the Preprocessor. But I do think its a good idea to diagnose this early. I've modified the code to diagnose and error here.

clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗(On Diff #310863)

Thanks, I've applied this change.

clang/lib/Lex/LiteralSupport.cpp
233–241

Right, stateful encodings may be a problem we will need to revisit later as well.

236

This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.

238

Thanks, I've created a char instead.

240

This may be a problem we need to revisit since ResultChar is expecting a char.

1787–1799

I've added code to translate UCN characters and have updated the testcase as well.

abhina.sreeskantharajan marked 8 inline comments as done.Dec 21 2020, 8:25 AM
clang/lib/Lex/LiteralSupport.cpp
1342–1343

Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?

tahonermann added inline comments.Dec 23 2020, 10:04 PM
clang/include/clang/Lex/LiteralSupport.h
243

I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.

clang/lib/Driver/ToolChains/Clang.cpp
6244–6246

Thank you for adding this.

clang/lib/Lex/LiteralSupport.cpp
236

Conversion can fail here, particularly in the scenario corresponding to the default switch case above; ResultChar could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting ? (in the execution encoding) seems appropriate to me.

237

As Richard previously noted, this memcpy() needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code units? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a ? (in the execution encoding) seem like acceptable behaviors to me.

1342–1343

Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?

clang/test/CodeGen/systemz-charset.c
17–24

const char* here too please.

clang/test/CodeGen/systemz-charset.cpp
2–26

This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.

clang/test/Driver/clang_f_opts.c
225–233

This looks good. Can tests also be added to validate that the UTF-8, ISO8895-1, and IBM-1047 option arguments are properly recognized?

Thanks for the review! I've addressed most of the comments but I still need to work on the translation issues in CharLiteralParser that was kindly pointed out by Tom and Richard. Here are the summary of changes in this patch:

  • Removed TranslationState as a member of StringLiteralParser and pass it as an argument instead
  • Added an assertion for wide character translation instead of translating them to the system charset
  • Invalid char escapes are changed to '?' and then translated
  • Updated testcases as requested
abhina.sreeskantharajan marked 8 inline comments as done.Dec 29 2020, 11:39 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralSupport.h
243

Thanks, I've removed this member.

clang/lib/Lex/LiteralSupport.cpp
1342–1343

Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.

1615–1616

I've now added an assertion when translating wide characters.

clang/test/CodeGen/systemz-charset.cpp
2–26

Good idea, I added those testcases as per your suggestion.

abhina.sreeskantharajan marked 4 inline comments as done.Dec 29 2020, 11:39 AM
abhina.sreeskantharajan marked an inline comment as done.Dec 29 2020, 12:45 PM
abhina.sreeskantharajan added inline comments.
clang/lib/Lex/LiteralSupport.cpp
236

Thanks, I added the substitution with the '?' character for invalid escapes.

abhina.sreeskantharajan marked an inline comment as done.Dec 29 2020, 12:45 PM

This patch replaces the memcpy in CharLiteralParser with an assignment. I've added an assertion for cases where the character size increases after translation.

abhina.sreeskantharajan marked 2 inline comments as done.Dec 30 2020, 6:52 AM
abhina.sreeskantharajan added inline comments.
clang/lib/Lex/LiteralSupport.cpp
237

I replaced memcpy with an assignment. Please let me know if there is a better solution.

240

I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.

abhina.sreeskantharajan marked an inline comment as done.Dec 30 2020, 6:52 AM
clang/lib/Lex/LiteralSupport.cpp
1614–1618

Please correct me if I'm wrong, these Pragma strings are not parsed through StringLiteralParser, they are parsed in clang/lib/Lex/Pragma.cpp in this function.
void Preprocessor::Handle_Pragma(Token &Tok)
So if they require translation, it would need to be done in that function.

ping :)
Is there any more feedback on the implementation inside ProcessCharEscape()?

Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.

clang/include/clang/Lex/LiteralSupport.h
192

Is the default argument for TranslationState actually used anywhere? I'm skeptical that a default argument provides a benefit here.

Actually, this diff doesn't include any changes to construct a CharLiteralParser with an explicit argument. It seems this argument isn't actually needed.

The only places I see objects of CharLiteralParser type constructed are in EvaluateValue() in clang/lib/Lex/PPExpressions.cpp and Sema::ActOnCharacterConstant() in clang/lib/Sema/SemaExpr.cpp.

244–245

I don't think a LiteralTranslator object is actually needed in this case. The only use of this constructor that I see is in ModuleMapParser::consumeToken() in clang/lib/Lex/ModuleMap.cpp and, in that case, I don't think any translation is necessary.

This suggests that TranslationState is not needed for this constructor either; NoTranslation can be passed to init().

258–259

I don't think getOffsetOfStringByte() should require a ConversionState parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.

clang/include/clang/Lex/LiteralTranslator.h
19–23 ↗(On Diff #314115)

Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps:

enum class LiteralConversion {
  None,
  ToSystemCharset,
  ToExecCharset
};
30–31 ↗(On Diff #314115)

I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using struct and not include access specifiers.

35 ↗(On Diff #314115)

Given the converter setters and accessors below, ExecCharsetTables should be a private member.

37 ↗(On Diff #314115)

getConversionTable() is logically const. Perhaps ExecCharsetTables should be mutable.

From a terminology stand point, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest:

llvm::CharSetConverter *getCharSetConverter(const char *Encoding) const;
38 ↗(On Diff #314115)

findOrCreateExecCharsetTable() seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a llvm::CharSetConverter pointer with nullptr indicating lookup/creation failed.

This function seems like it should be an implementation detail of the class, not a public interface.

41–43 ↗(On Diff #314115)

setTranslationTables() is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization.

44 ↗(On Diff #314115)

I suggest trying a design more like this:

class LiteralTranslator {
  std::string SystemEncoding;
  std::string ExecutionEncoding;
public:
  LiteralTranslator(llvm::StringRef SystemEncoding,
                    llvm::StringRef ExecutionEncoding);

  // Retrieve the name for the system encoding.
  llvm::StringRef getSystemEncoding() const;
  // Retrieve the name for the execution encoding.
  llvm::StringRef getExecutionEncoding() const;

  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the system encoding.
  llvm::CharSetConverter* getSystemEncodingConverter() const;
  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the execution encoding.
  llvm::CharSetConverter* getExecutionEncodingConverter() const;
};

LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts,
                                                     const clang::TargetInfo &TInfo,
                                                     clang::DiagnosticsEngine &Diags);
clang/include/clang/Lex/Preprocessor.h
145

I don't see a reason for LT to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?

clang/lib/Lex/LiteralSupport.cpp
1341–1343

Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like:

llvm::CharSetConverter *Converter = nullptr;
if (! isUTFLiteral(Kind)) {
  assert(LT);
  Converter = LT->getCharConversionTable(TranslateToExecCharset);
}
1614–1618

Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.

clang/lib/Lex/Preprocessor.cpp
88 ↗(On Diff #314115)

Per comments elsewhere, please try to make LT a non-pointer non-reference data member.

jansvoboda11 added inline comments.Mar 1 2021, 12:00 AM
clang/include/clang/Driver/Options.td
4433

Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option
Adding MarshallingInfoString<LangOpts<"ExecCharset">> here should do the trick. You can then delete the option parsing in CompilerInvocation.cpp.

Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.

abhina.sreeskantharajan marked 6 inline comments as done.Mar 2 2021, 8:53 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralSupport.h
192

You're right, we don't have any cases that use this arg yet so we can remove it.

244–245

Thanks, I've removed it.

clang/include/clang/Lex/Preprocessor.h
145

Thanks, I've changed it to a non-reference non-pointer member.

clang/lib/Lex/LiteralSupport.cpp
1341–1343

I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes.

abhina.sreeskantharajan marked 4 inline comments as done.Mar 2 2021, 8:55 AM

Hi Tom, @tahonermann I renamed the LiteralTranslator class to LiteralConverter.cpp and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward, the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.

rsmith added inline comments.Mar 2 2021, 12:12 PM
clang/include/clang/Lex/Preprocessor.h
145

Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to.

Also: why LT? What does the T stand for?

clang/lib/Driver/ToolChains/Clang.cpp
6238

Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?

clang/lib/Lex/LiteralSupport.cpp
236

This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, \q should produce q -- and take fallback action only if the character is unencodable. Producing a ? seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.

237

Can you avoid using std::string here? Eg, pass a StringRef, extending the converter to be able to take one if necessary.

240

Is there any guarantee the assertion will not fail?

1383–1384

Why is this case not possible?

1387

What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.

1735–1737

Do we need to convert the newline character too?

Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?

clang/test/Driver/cl-options.c
215

Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)

Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.

abhina.sreeskantharajan marked 4 inline comments as done.Mar 4 2021, 6:20 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/Preprocessor.h
145

Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.

clang/lib/Driver/ToolChains/Clang.cpp
6238

Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.

clang/lib/Lex/LiteralSupport.cpp
236

Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?

1383–1384

This case should be handled when fwide-exec-charset option is implemented. Until then, we thought it was best to emit a error message that wide literal translation is not supported.

abhina.sreeskantharajan marked 3 inline comments as done.Mar 4 2021, 6:20 AM
abhina.sreeskantharajan marked an inline comment as done.Mar 4 2021, 6:23 AM
abhina.sreeskantharajan marked an inline comment as done.

Add assertion, add testcase for multi-line raw string

clang/lib/Lex/LiteralSupport.cpp
1387

Right, I'll add a similar assertion to the one we have above.

1735–1737

Yes, we need to convert newlines as well. I think the current behaviour is already converting multi line raw strings correctly. I'll add a testcase for this.

abhina.sreeskantharajan marked 3 inline comments as done.Mar 8 2021, 12:01 PM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralTranslator.h
30–31 ↗(On Diff #314115)

I've made these private.

37 ↗(On Diff #314115)

I've renamed this function to getConverter

abhina.sreeskantharajan marked 2 inline comments as done.Mar 8 2021, 12:01 PM
abhina.sreeskantharajan marked 2 inline comments as done.Mar 15 2021, 5:44 AM

Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef

Accidentally added dependent patch in this one. Removing that

Rebase + set size of char as 1 when creating a StringRef to fix lit failure

ThePhD added a subscriber: ThePhD.Apr 20 2021, 1:35 PM

Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)

abhina.sreeskantharajan added a reviewer: ThePhD.

Thanks for catching that. This sets the clang_literal_encoding to Opts.ExecCharset or defaults to SystemCharset.

cor3ntin added a subscriber: cor3ntin.EditedApr 21 2021, 3:00 PM

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

joerg added a subscriber: joerg.Apr 21 2021, 6:29 PM

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

Thanks for the input! I agree doing the conversion lazily will help avoid hitting these issues since we push translation to a later stage but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can make a extra member in StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string or as an attribute?). Let me know what you think.

jansvoboda11 resigned from this revision.Aug 6 2021, 4:44 AM
srl295 added a subscriber: srl295.Mar 8 2022, 11:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 8 2022, 11:41 AM

@abhina.sreeskantharajan
What is the status of this patch?

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

Thanks! I was beginning to think it is forgotten/abandoned.

I have opened a new patch https://reviews.llvm.org/D153419 and am closing this revision