Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

Enable fexec-charset option
AbandonedPublic

Authored by abhina.sreeskantharajan on Dec 10 2020, 5:49 AM.

Details

Summary

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on https://reviews.llvm.org/D88741

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

ping :)
Is there any more feedback on the implementation inside ProcessCharEscape()?

Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.

clang/include/clang/Lex/LiteralSupport.h
192

Is the default argument for TranslationState actually used anywhere? I'm skeptical that a default argument provides a benefit here.

Actually, this diff doesn't include any changes to construct a CharLiteralParser with an explicit argument. It seems this argument isn't actually needed.

The only places I see objects of CharLiteralParser type constructed are in EvaluateValue() in clang/lib/Lex/PPExpressions.cpp and Sema::ActOnCharacterConstant() in clang/lib/Sema/SemaExpr.cpp.

244–245

I don't think a LiteralTranslator object is actually needed in this case. The only use of this constructor that I see is in ModuleMapParser::consumeToken() in clang/lib/Lex/ModuleMap.cpp and, in that case, I don't think any translation is necessary.

This suggests that TranslationState is not needed for this constructor either; NoTranslation can be passed to init().

258–259

I don't think getOffsetOfStringByte() should require a ConversionState parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.

clang/include/clang/Lex/LiteralTranslator.h
19–23 ↗(On Diff #314115)

Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps:

enum class LiteralConversion {
  None,
  ToSystemCharset,
  ToExecCharset
};
30–31 ↗(On Diff #314115)

I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using struct and not include access specifiers.

35 ↗(On Diff #314115)

Given the converter setters and accessors below, ExecCharsetTables should be a private member.

37 ↗(On Diff #314115)

getConversionTable() is logically const. Perhaps ExecCharsetTables should be mutable.

From a terminology stand point, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest:

llvm::CharSetConverter *getCharSetConverter(const char *Encoding) const;
38 ↗(On Diff #314115)

findOrCreateExecCharsetTable() seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a llvm::CharSetConverter pointer with nullptr indicating lookup/creation failed.

This function seems like it should be an implementation detail of the class, not a public interface.

41–43 ↗(On Diff #314115)

setTranslationTables() is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization.

44 ↗(On Diff #314115)

I suggest trying a design more like this:

class LiteralTranslator {
  std::string SystemEncoding;
  std::string ExecutionEncoding;
public:
  LiteralTranslator(llvm::StringRef SystemEncoding,
                    llvm::StringRef ExecutionEncoding);

  // Retrieve the name for the system encoding.
  llvm::StringRef getSystemEncoding() const;
  // Retrieve the name for the execution encoding.
  llvm::StringRef getExecutionEncoding() const;

  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the system encoding.
  llvm::CharSetConverter* getSystemEncodingConverter() const;
  // Retrieve a converter for converting from the internal encoding (UTF-8)
  // to the execution encoding.
  llvm::CharSetConverter* getExecutionEncodingConverter() const;
};

LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts,
                                                     const clang::TargetInfo &TInfo,
                                                     clang::DiagnosticsEngine &Diags);
clang/include/clang/Lex/Preprocessor.h
145

I don't see a reason for LT to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?

clang/lib/Lex/LiteralSupport.cpp
1341–1343

Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like:

llvm::CharSetConverter *Converter = nullptr;
if (! isUTFLiteral(Kind)) {
  assert(LT);
  Converter = LT->getCharConversionTable(TranslateToExecCharset);
}
1614–1618

Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.

clang/lib/Lex/Preprocessor.cpp
88 ↗(On Diff #314115)

Per comments elsewhere, please try to make LT a non-pointer non-reference data member.

jansvoboda11 added inline comments.Mar 1 2021, 12:00 AM
clang/include/clang/Driver/Options.td
4433

Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option
Adding MarshallingInfoString<LangOpts<"ExecCharset">> here should do the trick. You can then delete the option parsing in CompilerInvocation.cpp.

Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.

abhina.sreeskantharajan marked 6 inline comments as done.Mar 2 2021, 8:53 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralSupport.h
192

You're right, we don't have any cases that use this arg yet so we can remove it.

244–245

Thanks, I've removed it.

clang/include/clang/Lex/Preprocessor.h
145

Thanks, I've changed it to a non-reference non-pointer member.

clang/lib/Lex/LiteralSupport.cpp
1341–1343

I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes.

abhina.sreeskantharajan marked 4 inline comments as done.Mar 2 2021, 8:55 AM

Hi Tom, @tahonermann I renamed the LiteralTranslator class to LiteralConverter.cpp and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward, the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.

rsmith added inline comments.Mar 2 2021, 12:12 PM
clang/include/clang/Lex/Preprocessor.h
145

Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to.

Also: why LT? What does the T stand for?

clang/lib/Driver/ToolChains/Clang.cpp
6238

Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?

clang/lib/Lex/LiteralSupport.cpp
236

This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, \q should produce q -- and take fallback action only if the character is unencodable. Producing a ? seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.

237

Can you avoid using std::string here? Eg, pass a StringRef, extending the converter to be able to take one if necessary.

240

Is there any guarantee the assertion will not fail?

1383–1384

Why is this case not possible?

1387

What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.

1735–1737

Do we need to convert the newline character too?

Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?

clang/test/Driver/cl-options.c
215

Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)

Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.

abhina.sreeskantharajan marked 4 inline comments as done.Mar 4 2021, 6:20 AM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/Preprocessor.h
145

Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.

clang/lib/Driver/ToolChains/Clang.cpp
6238

Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.

clang/lib/Lex/LiteralSupport.cpp
236

Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?

1383–1384

This case should be handled when fwide-exec-charset option is implemented. Until then, we thought it was best to emit a error message that wide literal translation is not supported.

abhina.sreeskantharajan marked 3 inline comments as done.Mar 4 2021, 6:20 AM
abhina.sreeskantharajan marked an inline comment as done.Mar 4 2021, 6:23 AM
abhina.sreeskantharajan marked an inline comment as done.

Add assertion, add testcase for multi-line raw string

clang/lib/Lex/LiteralSupport.cpp
1387

Right, I'll add a similar assertion to the one we have above.

1735–1737

Yes, we need to convert newlines as well. I think the current behaviour is already converting multi line raw strings correctly. I'll add a testcase for this.

abhina.sreeskantharajan marked 3 inline comments as done.Mar 8 2021, 12:01 PM
abhina.sreeskantharajan added inline comments.
clang/include/clang/Lex/LiteralTranslator.h
30–31 ↗(On Diff #314115)

I've made these private.

37 ↗(On Diff #314115)

I've renamed this function to getConverter

abhina.sreeskantharajan marked 2 inline comments as done.Mar 8 2021, 12:01 PM
abhina.sreeskantharajan marked 2 inline comments as done.Mar 15 2021, 5:44 AM

Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef

Accidentally added dependent patch in this one. Removing that

Rebase + set size of char as 1 when creating a StringRef to fix lit failure

ThePhD added a subscriber: ThePhD.Apr 20 2021, 1:35 PM

Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)

abhina.sreeskantharajan added a reviewer: ThePhD.

Thanks for catching that. This sets the clang_literal_encoding to Opts.ExecCharset or defaults to SystemCharset.

cor3ntin added a subscriber: cor3ntin.EditedApr 21 2021, 3:00 PM

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

joerg added a subscriber: joerg.Apr 21 2021, 6:29 PM

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

Thanks for the input! I agree doing the conversion lazily will help avoid hitting these issues since we push translation to a later stage but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can make a extra member in StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string or as an attribute?). Let me know what you think.

jansvoboda11 resigned from this revision.Aug 6 2021, 4:44 AM
srl295 added a subscriber: srl295.Mar 8 2022, 11:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 8 2022, 11:41 AM

@abhina.sreeskantharajan
What is the status of this patch?

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that this patch has landed https://reviews.llvm.org/D148821 to add limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use this instead. This implementation also heavily relies on iconv support which is still being discussed in the CharSet Converter RFC here https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

Thanks! I was beginning to think it is forgotten/abandoned.

I have opened a new patch https://reviews.llvm.org/D153419 and am closing this revision