This is an archive of the discontinued LLVM Phabricator instance.

Differential D93031

Enable fexec-charset option
Abandoned · Public

Authored by abhina.sreeskantharajan on Dec 10 2020, 5:49 AM.

Details

Reviewers

Kai
fanbo-meng
tahonermann
hubert.reinterpretcast
efriedma
SeanP
rsmith
ThePhD
cor3ntin
joerg
jansvoboda11

Summary

This patch enables the fexec-charset option to control the execution charset of string literals. It sets the default internal charset, system charset, and execution charset for z/OS and UTF-8 for all other platforms.
This patch depends on https://reviews.llvm.org/D88741

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Herald added subscribers: dexonsmith, dang, hiraditya, mgorny. Dec 10 2020, 5:49 AM

abhina.sreeskantharajan requested review of this revision. Dec 10 2020, 5:49 AM

Herald added projects: Restricted Project, Restricted Project. Dec 10 2020, 5:49 AM

Herald added subscribers: llvm-commits, cfe-commits.

Harbormaster completed remote builds in B81828: Diff 310863. Dec 10 2020, 6:01 AM

abhina.sreeskantharajan added a parent revision: D88741: [SystemZ/z/OS] Add utility class for char set conversion. Dec 10 2020, 6:24 AM

abhina.sreeskantharajan added reviewers: Kai, fanbo-meng, tahonermann, hubert.reinterpretcast, efriedma, SeanP. Dec 10 2020, 1:02 PM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

clang/include/clang/Lex/LiteralTranslator.h
30–32	It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.
33	Similarly, use of a global cache here will require you guard it with a mutex. As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`?
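A minimal sketch of the per-instance alternative suggested in the comment above. The types `Converter`, `LiteralTranslatorSketch`, and `PreprocessorSketch` are hypothetical stand-ins, not the real `llvm::CharSetConverter`, `LiteralTranslator`, or `Preprocessor`. Because each preprocessor owns its own cache, independent compilations in one process share no mutable state and no mutex is needed:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a charset converter; it only records its target.
struct Converter {
  std::string Target;
};

// The cache is an ordinary data member, not a static or global object.
class LiteralTranslatorSketch {
  std::map<std::string, Converter> Cache;

public:
  // Returns the cached converter for Target, creating it on first use.
  Converter &getOrCreate(const std::string &Target) {
    auto It = Cache.find(Target);
    if (It == Cache.end())
      It = Cache.emplace(Target, Converter{Target}).first;
    return It->second;
  }

  std::size_t size() const { return Cache.size(); }
};

// Hypothetical stand-in for Preprocessor: one translator per instance.
struct PreprocessorSketch {
  LiteralTranslatorSketch Translator;
};
```

Two `PreprocessorSketch` objects created on different threads would each populate only their own `Cache`.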
clang/lib/Lex/LiteralSupport.cpp
231–239	Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
236	Reinterpreting an `unsigned` as a `char` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.
238	What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
1318	Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)
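The selection rule described in the comment above could look roughly like this. It is a sketch only: `LiteralKind` and `targetCharsetFor` are hypothetical names, and the patch under review has no separate wide execution charset yet.

```cpp
#include <optional>
#include <string>

// Hypothetical literal kinds, mirroring the prefixes discussed in the review.
enum class LiteralKind { Ordinary, Wide, UTF8, UTF16, UTF32 };

// Returns the charset name to convert to, or std::nullopt when the literal
// must keep its fixed UTF encoding and no converter should be used.
std::optional<std::string> targetCharsetFor(LiteralKind Kind,
                                            const std::string &NarrowExec,
                                            const std::string &WideExec) {
  switch (Kind) {
  case LiteralKind::UTF8:
  case LiteralKind::UTF16:
  case LiteralKind::UTF32:
    return std::nullopt; // u8"...", u"...", U"..." are never converted.
  case LiteralKind::Wide:
    return WideExec; // L"..." uses the wide execution character set.
  case LiteralKind::Ordinary:
    return NarrowExec; // "..." uses the narrow execution character set.
  }
  return std::nullopt; // Not reached for valid enumerators.
}
```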
clang/test/Driver/cl-options.c
217	Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.
clang/test/Driver/clang_f_opts.c
213	Again, this is not a useful test.

tahonermann added inline comments. Dec 11 2020, 11:14 PM

clang/include/clang/Driver/Options.td
3583–3584	How about substituting "character set", "character encoding", or "charset" for "codepage"? This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry? The set of recognized names can be a superset of the names that are actually supported.
clang/include/clang/Lex/LiteralSupport.h
192	Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.
246	Same concern here with respect to persisting the conversion state as a data member.
248	This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.
clang/lib/Driver/ToolChains/Clang.cpp
5970–5985	I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
clang/lib/Frontend/CompilerInvocation.cpp
3573	I wouldn't expect the cast to `std::string` to be needed here.
clang/lib/Lex/LiteralSupport.cpp
231–239	For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.
234	What should happen if `ResultChar` >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems `ResultChar` should be converted to something.
1751–1763	UCNs will require conversion here.
llvm/lib/Support/Triple.cpp
1028–1029	No support for targeting the z/OS Enhanced ASCII run-time?

Thanks for your quick reviews! I haven't addressed all the comments yet but I plan to address all of them. I put up this patch early because it has a few major changes:

moves LiteralTranslator class to Preprocessor instead of being a static global class
add isUTFLiteral() function to detect strings like u8"..." and stop translation
translate wide string literals to the system charset for now (we don't have an implementation plan for -fwide-charset right now)
remove tests that check fexec-charset will not accept non-UTF charsets

Harbormaster completed remote builds in B82467: Diff 311911. Dec 15 2020, 8:23 AM

In D93031#2447230, @rsmith wrote:

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

clang/include/clang/Driver/Options.td
3583–3584	I've updated the description from codepage to charset. It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.
clang/include/clang/Lex/LiteralTranslator.h
33	Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.
clang/lib/Driver/ToolChains/Clang.cpp
5970–5985	Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
clang/lib/Frontend/CompilerInvocation.cpp
3573	Without that cast, I get the following build error: error: no viable overloaded '='
clang/test/Driver/cl-options.c
217	You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.
llvm/lib/Support/Triple.cpp
1028–1029	We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).

abhina.sreeskantharajan marked 5 inline comments as done. Dec 15 2020, 11:05 AM

tahonermann added inline comments. Dec 16 2020, 8:35 AM

clang/include/clang/Driver/Options.td
3583–3584	Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here:
	def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
	  HelpText<"Set the execution <charset> for string and character literals. Supported character encodings include XXX and those supported by the host iconv library.">;
clang/lib/Driver/ToolChains/Clang.cpp
5970–5985	Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?
clang/lib/Frontend/CompilerInvocation.cpp
3573	Ok, rather than a cast, I suggest: Opts.ExecCharset = Value.str();
clang/lib/Lex/LiteralSupport.cpp
1318–1319	Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
1589–1593	The stored `TranslationState` should not be completely ignored for wide and UTF string literals. The standard permits things like the following.
	#pragma rigoot L"bozit"
	#pragma rigoot u"bozit"
	_Pragma(L"rigoot bozit")
	_Pragma(u8"rigoot bozit")
	For at least the `_Pragma(L"...")` case, the C++ standard states the `L` is ignored, but it doesn't say anything about other encoding prefixes.
1590–1591	Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
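As a rough illustration of the non-z/OS rule stated above, the wide execution encoding can be derived from the width of `wchar_t`. The function name `wideExecEncoding` is hypothetical, and the z/OS wide EBCDIC case is deliberately left out because the review does not name that encoding:

```cpp
#include <optional>
#include <string>

// Maps the target's wchar_t width in bytes to a wide execution encoding name
// for non-z/OS targets. Other widths are reported as unsupported.
std::optional<std::string> wideExecEncoding(unsigned WCharWidthInBytes) {
  switch (WCharWidthInBytes) {
  case 2:
    return std::string("UTF-16"); // e.g. when -fshort-wchar is in effect
  case 4:
    return std::string("UTF-32");
  default:
    return std::nullopt;
  }
}
```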

tahonermann added inline comments. Dec 16 2020, 8:55 AM

clang/test/CodeGen/systemz-charset.c
5	`const char *` please :)
25	Add validation of UCNs. Something like:
	const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
	// CHECK: c"\42\B0\59\00"
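To make the expected bytes in that CHECK line concrete, here is a toy table-driven conversion. The three mappings are taken directly from the suggested test; the table is not a full IBM-1047 mapping, and `convertSketch` is a hypothetical name, not part of the patch:

```cpp
#include <map>
#include <optional>
#include <string>

// Converts code points using only the three mappings implied by the suggested
// CHECK line. Any other code point is reported as unconvertible.
std::optional<std::string> convertSketch(const std::u32string &CodePoints) {
  static const std::map<char32_t, unsigned char> Table = {
      {U'\u00E2', 0x42}, {U'\u00AC', 0xB0}, {U'\u00DF', 0x59}};
  std::string Out;
  for (char32_t CP : CodePoints) {
    auto It = Table.find(CP);
    if (It == Table.end())
      return std::nullopt;
    Out.push_back(static_cast<char>(It->second));
  }
  return Out;
}
```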

Thanks for your patience, I've addressed some more comments. Here is the summary of the changes in this patch:

add translation for UCN strings, update testcase
fix helptext for fexec-charset option in Options.td
check for invalid charsets when parsing driver options.
fix up char conversion code

Harbormaster completed remote builds in B83156: Diff 313112. Dec 21 2020, 8:15 AM

abhina.sreeskantharajan marked 11 inline comments as done. Dec 21 2020, 8:25 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Driver/Options.td
3583–3584	I've updated the HelpText with your suggested description.
clang/include/clang/Lex/LiteralSupport.h
192	Thanks, I've removed this.
246	If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
clang/lib/Driver/ToolChains/Clang.cpp
5970–5985	I initially thought it would be a performance issue if we create the Converter twice, once here and once in the Preprocessor. But I do think it's a good idea to diagnose this early. I've modified the code to diagnose and error here.
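A sketch of the early check agreed on here, under the assumption that validity can only be learned by trying to create a converter. `checkExecCharset`, the `CanCreateConverter` probe, and the message text are all hypothetical and do not reproduce Clang's actual diagnostics:

```cpp
#include <functional>
#include <optional>
#include <string>

// Returns an error message if the requested charset cannot be used, or
// std::nullopt if it is acceptable. CanCreateConverter stands in for
// "try to construct a converter for this name" (an assumption).
std::optional<std::string> checkExecCharset(
    const std::string &Name,
    const std::function<bool(const std::string &)> &CanCreateConverter) {
  if (CanCreateConverter(Name))
    return std::nullopt;
  return "unsupported charset: " + Name; // Illustrative wording only.
}
```

Running this once in the driver reports a bad name a single time, instead of once per compiler instance.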
clang/lib/Frontend/CompilerInvocation.cpp
3573	Thanks, I've applied this change.
clang/lib/Lex/LiteralSupport.cpp
231–239	Right, stateful encodings may be a problem we will need to revisit later as well.
234	This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.
236	Thanks, I've created a char instead.
238	This may be a problem we need to revisit since ResultChar is expecting a char.
1751–1763	I've added code to translate UCN characters and have updated the testcase as well.

abhina.sreeskantharajan marked 8 inline comments as done. Dec 21 2020, 8:25 AM

abhina.sreeskantharajan added inline comments. Dec 21 2020, 8:29 AM

clang/lib/Lex/LiteralSupport.cpp
1318–1319	Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?

abhina.sreeskantharajan added a reviewer: rsmith. Dec 21 2020, 8:32 AM

tahonermann added inline comments. Dec 23 2020, 10:04 PM

clang/include/clang/Lex/LiteralSupport.h
246	I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.
clang/lib/Driver/ToolChains/Clang.cpp
5982–5984	Thank you for adding this.
clang/lib/Lex/LiteralSupport.cpp
234	Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
235	As Richard previously noted, this `memcpy()` needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code units? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.
1318–1319	Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?
clang/test/CodeGen/systemz-charset.c
16–23	`const char*` here too please.
clang/test/CodeGen/systemz-charset.cpp
1–25	This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.
clang/test/Driver/clang_f_opts.c
212–213	This looks good. Can tests also be added to validate that the `UTF-8`, `ISO8859-1`, and `IBM-1047` option arguments are properly recognized?

Thanks for the review! I've addressed most of the comments but I still need to work on the translation issues in CharLiteralParser that were kindly pointed out by Tom and Richard. Here is the summary of changes in this patch:

Removed TranslationState as a member of StringLiteralParser and pass it as an argument instead
Added an assertion for wide character translation instead of translating them to the system charset
Invalid char escapes are changed to '?' and then translated
Updated testcases as requested

Harbormaster completed remote builds in B83668: Diff 313990. Dec 29 2020, 11:34 AM

abhina.sreeskantharajan marked 8 inline comments as done. Dec 29 2020, 11:39 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralSupport.h
246	Thanks, I've removed this member.
clang/lib/Lex/LiteralSupport.cpp
1318–1319	Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.
1590–1591	I've now added an assertion when translating wide characters.
clang/test/CodeGen/systemz-charset.cpp
1–25	Good idea, I added those testcases as per your suggestion.

abhina.sreeskantharajan marked 4 inline comments as done. Dec 29 2020, 11:39 AM

abhina.sreeskantharajan marked an inline comment as done. Dec 29 2020, 12:45 PM

abhina.sreeskantharajan added inline comments.

clang/lib/Lex/LiteralSupport.cpp
234	Thanks, I added the substitution with the '?' character for invalid escapes.

abhina.sreeskantharajan marked an inline comment as done. Dec 29 2020, 12:45 PM

This patch replaces the memcpy in CharLiteralParser with an assignment. I've added an assertion for cases where the character size increases after translation.

Harbormaster completed remote builds in B83747: Diff 314115. Dec 30 2020, 6:50 AM

abhina.sreeskantharajan marked 2 inline comments as done. Dec 30 2020, 6:52 AM

abhina.sreeskantharajan added inline comments.

clang/lib/Lex/LiteralSupport.cpp
235	I replaced memcpy with an assignment. Please let me know if there is a better solution.
238	I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.
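One way to picture the endianness fix is shown in this sketch. `ConvertFn` and `convertCodeUnit` are hypothetical names, and returning `std::nullopt` is a substitute for the assertion described above, chosen so the failure case can be exercised:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical converter signature: internal bytes in, execution bytes out.
using ConvertFn = std::function<std::string(std::string_view)>;

// Converts a single code unit held in an unsigned value. The value is first
// copied into a real char object, so the byte handed to the converter does
// not depend on host endianness. Returns std::nullopt when the converted
// result is not exactly one byte.
std::optional<unsigned> convertCodeUnit(unsigned ResultChar,
                                        const ConvertFn &Convert) {
  char In = static_cast<char>(ResultChar);
  std::string Out = Convert(std::string_view(&In, 1));
  if (Out.size() != 1)
    return std::nullopt;
  return static_cast<unsigned>(static_cast<unsigned char>(Out[0]));
}
```

Reading the first byte of the `unsigned` directly would instead yield the low-order byte only on little-endian hosts.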

abhina.sreeskantharajan marked an inline comment as done. Dec 30 2020, 6:52 AM

abhina.sreeskantharajan added inline comments. Dec 30 2020, 7:22 AM

clang/lib/Lex/LiteralSupport.cpp
1589–1593	Please correct me if I'm wrong, but these Pragma strings are not parsed through StringLiteralParser; they are parsed in clang/lib/Lex/Pragma.cpp in this function:
	void Preprocessor::Handle_Pragma(Token &Tok)
	So if they require translation, it would need to be done in that function.

ping :)
Is there any more feedback on the implementation inside ProcessCharEscape()?

Herald added a reviewer: jansvoboda11. Jan 26 2021, 12:06 PM

Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.

clang/include/clang/Lex/LiteralSupport.h
191	Is the default argument for `TranslationState` actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a `CharLiteralParser` with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of `CharLiteralParser` type constructed are in `EvaluateValue()` in `clang/lib/Lex/PPExpressions.cpp` and `Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`.
251–252	I don't think a `LiteralTranslator` object is actually needed in this case. The only use of this constructor that I see is in `ModuleMapParser::consumeToken()` in `clang/lib/Lex/ModuleMap.cpp` and, in that case, I don't think any translation is necessary. This suggests that `TranslationState` is not needed for this constructor either; `NoTranslation` can be passed to `init()`.
262–263	I don't think `getOffsetOfStringByte()` should require a `ConversionState` parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.
clang/include/clang/Lex/LiteralTranslator.h
20–24	Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps:
	enum class LiteralConversion {
	  None,
	  ToSystemCharset,
	  ToExecCharset
	};
31–32	I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using `struct` and not include access specifiers.
36	Given the converter setters and accessors below, `ExecCharsetTables` should be a private member.
38	`getConversionTable()` is logically `const`. Perhaps `ExecCharsetTables` should be `mutable`. From a terminology standpoint, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest:
	llvm::CharSetConverter getCharSetConverter(const char Encoding) const;
39	`findOrCreateExecCharsetTable()` seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a `llvm::CharSetConverter` pointer with `nullptr` indicating lookup/creation failed. This function seems like it should be an implementation detail of the class, not a public interface.
42–44	`setTranslationTables()` is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization.
45	I suggest trying a design more like this:
	class LiteralTranslator {
	  std::string SystemEncoding;
	  std::string ExecutionEncoding;
	public:
	  LiteralTranslator(llvm::StringRef SystemEncoding,
	                    llvm::StringRef ExecutionEncoding);
	  // Retrieve the name for the system encoding.
	  llvm::StringRef getSystemEncoding() const;
	  // Retrieve the name for the execution encoding.
	  llvm::StringRef getExecutionEncoding() const;
	  // Retrieve a converter for converting from the internal encoding (UTF-8)
	  // to the system encoding.
	  llvm::CharSetConverter* getSystemEncodingConverter() const;
	  // Retrieve a converter for converting from the internal encoding (UTF-8)
	  // to the execution encoding.
	  llvm::CharSetConverter* getExecutionEncodingConverter() const;
	};
	LiteralTranslator createLiteralTranslatorFromOptions(
	    const clang::LangOptions &Opts,
	    const clang::TargetInfo &TInfo,
	    clang::DiagnosticsEngine &Diags);
clang/include/clang/Lex/Preprocessor.h
145	I don't see a reason for `LT` to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?
clang/lib/Lex/LiteralSupport.cpp
1317–1319	Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like:
	llvm::CharSetConverter *Converter = nullptr;
	if (! isUTFLiteral(Kind)) {
	  assert(LT);
	  Converter = LT->getCharConversionTable(TranslateToExecCharset);
	}
1589–1593	Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.
clang/lib/Lex/Preprocessor.cpp
88–89	Per comments elsewhere, please try to make `LT` a non-pointer non-reference data member.

jansvoboda11 added inline comments. Mar 1 2021, 12:00 AM

clang/include/clang/Driver/Options.td
3583	Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option Adding `MarshallingInfoString<LangOpts<"ExecCharset">>` here should do the trick. You can then delete the option parsing in `CompilerInvocation.cpp`.

Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.

abhina.sreeskantharajan marked 6 inline comments as done. Mar 2 2021, 8:53 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralSupport.h
191	You're right, we don't have any cases that use this arg yet so we can remove it.
251–252	Thanks, I've removed it.
clang/include/clang/Lex/Preprocessor.h
145	Thanks, I've changed it to a non-reference non-pointer member.
clang/lib/Lex/LiteralSupport.cpp
1317–1319	I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes.

abhina.sreeskantharajan marked 4 inline comments as done. Mar 2 2021, 8:55 AM

Hi Tom (@tahonermann), I renamed the LiteralTranslator class to LiteralConverter and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward; the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.

Harbormaster completed remote builds in B91586: Diff 327470. Mar 2 2021, 9:40 AM

rsmith added inline comments. Mar 2 2021, 12:12 PM

clang/include/clang/Lex/Preprocessor.h
145	Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why `LT`? What does the `T` stand for?
clang/lib/Driver/ToolChains/Clang.cpp
5976	Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?
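The "last argument wins" convention mentioned here can be sketched as follows. `lastValueOf` is a hypothetical helper that works on plain strings rather than Clang's argument list, and the `=`-joined spelling used in the usage note is an assumption for illustration:

```cpp
#include <optional>
#include <string>
#include <vector>

// Returns the value of the last argument that starts with Prefix, if any.
// Earlier occurrences are overridden rather than checked one by one.
std::optional<std::string> lastValueOf(const std::vector<std::string> &Args,
                                       const std::string &Prefix) {
  std::optional<std::string> Last;
  for (const std::string &Arg : Args)
    if (Arg.compare(0, Prefix.size(), Prefix) == 0)
      Last = Arg.substr(Prefix.size());
  return Last;
}
```

For example, given `-fexec-charset=UTF-8 -c -fexec-charset=IBM-1047`, only `IBM-1047` would be validated and passed on.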
clang/lib/Lex/LiteralSupport.cpp
234	This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, `\q` should produce `q` -- and take fallback action only if the character is unencodable. Producing a `?` seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.
235	Can you avoid using `std::string` here? Eg, pass a `StringRef`, extending the converter to be able to take one if necessary.
238	Is there any guarantee the assertion will not fail?
1364–1365	Why is this case not possible?
1368	What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.
1701–1703	Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?
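A sketch of the policy requested for unknown escapes in this round of comments: keep the escaped character, convert it, and report a hard error only when it is unencodable or does not fit in one converted code unit. `CharConvertFn`, `EscapeResult`, and `processUnknownEscape` are hypothetical names:

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical converter: returns the execution-charset bytes for one
// internal character, or std::nullopt if the character is unencodable.
using CharConvertFn = std::function<std::optional<std::string>(char)>;

struct EscapeResult {
  bool Ok;             // false means a hard error should be diagnosed
  unsigned char Value; // converted code unit, meaningful only when Ok
};

// Handles an unknown escape such as \q: the escaped character is kept as-is
// and converted. Conversion failure or multi-byte output is reported as an
// error instead of being silently replaced with '?'.
EscapeResult processUnknownEscape(char Escaped, const CharConvertFn &Convert) {
  std::optional<std::string> Out = Convert(Escaped);
  if (!Out || Out->size() != 1)
    return {false, 0};
  return {true, static_cast<unsigned char>((*Out)[0])};
}
```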
clang/test/Driver/cl-options.c
214	Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)

Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.

abhina.sreeskantharajan marked 4 inline comments as done. Mar 4 2021, 6:20 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/Preprocessor.h
145	Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.
clang/lib/Driver/ToolChains/Clang.cpp
5976	Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.
clang/lib/Lex/LiteralSupport.cpp
234	Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?
1364–1365	This case should be handled when the fwide-exec-charset option is implemented. Until then, we thought it was best to emit an error message that wide literal translation is not supported.

abhina.sreeskantharajan marked 3 inline comments as done. Mar 4 2021, 6:20 AM

abhina.sreeskantharajan marked an inline comment as done. Mar 4 2021, 6:23 AM

Harbormaster completed remote builds in B92056: Diff 328151. Mar 4 2021, 4:28 PM

Add assertion, add testcase for multi-line raw string

abhina.sreeskantharajan added inline comments. Mar 5 2021, 7:04 AM

clang/lib/Lex/LiteralSupport.cpp
1368	Right, I'll add a similar assertion to the one we have above.
1701–1703	Yes, we need to convert newlines as well. I think the current behaviour is already converting multi line raw strings correctly. I'll add a testcase for this.

Harbormaster completed remote builds in B92310: Diff 328513. Mar 5 2021, 11:45 PM

abhina.sreeskantharajan marked 3 inline comments as done. Mar 8 2021, 12:01 PM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralTranslator.h
31–32	I've made these private.
38	I've renamed this function to getConverter

abhina.sreeskantharajan marked 2 inline comments as done. Mar 8 2021, 12:01 PM

abhina.sreeskantharajan marked 2 inline comments as done. Mar 15 2021, 5:44 AM

Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef

Harbormaster completed remote builds in B97954: Diff 336416. Apr 9 2021, 5:40 AM

Accidentally added dependent patch in this one. Removing that

Harbormaster completed remote builds in B97955: Diff 336417. Apr 9 2021, 6:21 AM

ThePhD mentioned this in D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor. Apr 12 2021, 3:03 PM

Rebase + set size of char as 1 when creating a StringRef to fix lit failure

Harbormaster completed remote builds in B99514: Diff 338565. Apr 19 2021, 11:57 AM

Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)

Thanks for catching that. This update sets the __clang_literal_encoding__ macro to Opts.ExecCharset, or to SystemCharset by default.
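The fallback described in this reply amounts to the following, shown as a sketch. `resolveLiteralEncoding` is a hypothetical helper, and treating an empty string as "option not given" is an assumption:

```cpp
#include <string>

// Returns the encoding name to report: the explicitly requested execution
// charset if one was given, otherwise the system charset.
std::string resolveLiteralEncoding(const std::string &ExecCharset,
                                   const std::string &SystemCharset) {
  return ExecCharset.empty() ? SystemCharset : ExecCharset;
}
```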

Harbormaster completed remote builds in B100076: Diff 339355. Apr 21 2021, 1:31 PM

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

In D93031#2706988, @joerg wrote:

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems worse as assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8

In D93031#2706660, @cor3ntin wrote:

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

Thanks for the input! I agree that doing the conversion lazily will help avoid these issues, since it pushes translation to a later stage, but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can add an extra member to StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string, or as an attribute?). Let me know what you think.
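
The idea of saving the string prior to translation can be illustrated with a minimal sketch. Everything below is hypothetical: `convertToExecCharset`, `ParsedLiteral`, and `parseLiteral` are illustrative names rather than part of this patch or of Clang, and the toy converter merely substitutes `?` for non-ASCII bytes to mimic a lossy conversion.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a charset converter. It replaces every
// non-ASCII byte with '?', which mimics a lossy (destructive) conversion.
std::string convertToExecCharset(const std::string &Utf8) {
  std::string Out;
  for (unsigned char C : Utf8)
    Out.push_back(C < 0x80 ? static_cast<char>(C) : '?');
  return Out;
}

// Hypothetical result type: the UTF-8 form is kept next to the converted
// form, so diagnostics and pretty-printing never need to reverse anything.
struct ParsedLiteral {
  std::string Utf8;      // concatenated, non-encoded literal, in UTF-8
  std::string ExecBytes; // bytes in the execution charset
};

ParsedLiteral parseLiteral(const std::string &Utf8) {
  return {Utf8, convertToExecCharset(Utf8)};
}
```

With this toy converter, two different UTF-8 inputs can produce identical converted bytes, which is why the converted form alone is not enough to recover the text for diagnostics.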

abhina.sreeskantharajan added reviewers: cor3ntin, joerg. Apr 22 2021, 6:32 AM

cor3ntin mentioned this in D105759: Implement P2361 Unevaluated string literals. Jul 10 2021, 6:53 AM

cor3ntin mentioned this in D106577: [clang] Define __STDC_ISO_10646__. Jul 22 2021, 11:05 AM

nigelp-xmos added a subscriber: nigelp-xmos. Aug 3 2021, 9:43 AM

jansvoboda11 resigned from this revision. Aug 6 2021, 4:44 AM

evantypanski added a subscriber: evantypanski. Oct 28 2021, 8:54 AM

srl295 added a subscriber: srl295. Mar 8 2022, 11:41 AM

Herald added a project: Restricted Project. Mar 8 2022, 11:41 AM

tahonermann mentioned this in D135366: [clang][Interp] Implement String- and CharacterLiterals. Oct 10 2022, 2:06 PM

tahonermann mentioned this in D134036: [libc++][format] Implements string escaping.. Oct 10 2022, 2:20 PM

barannikov88 added a subscriber: barannikov88. Feb 11 2023, 2:17 PM

Herald added a subscriber: MaskRay. Feb 11 2023, 2:17 PM

@abhina.sreeskantharajan
What is the status of this patch?

In D93031#4308764, @barannikov88 wrote:

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that https://reviews.llvm.org/D148821 has landed, adding limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use it. This implementation also relies heavily on iconv support, which is still being discussed in the CharSet Converter RFC: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

In D93031#4309546, @abhina.sreeskantharajan wrote:

In D93031#4308764, @barannikov88 wrote:

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that https://reviews.llvm.org/D148821 has landed, adding limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use it. This implementation also relies heavily on iconv support, which is still being discussed in the CharSet Converter RFC: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

Thanks! I was beginning to think it had been forgotten or abandoned.

I have opened a new patch, https://reviews.llvm.org/D153419, and am closing this revision.

cor3ntin mentioned this in rG95f50964fbf5: Implement P2361 Unevaluated string literals. Jul 7 2023, 4:30 AM

Revision Contents

Path                                          Size
clang/include/clang/Basic/LangOptions.h       3 lines
clang/include/clang/Basic/TokenKinds.h        7 lines
clang/include/clang/Driver/Options.td         4 lines
clang/include/clang/Lex/LiteralSupport.h      33 lines
clang/include/clang/Lex/LiteralTranslator.h   46 lines
clang/include/clang/Lex/Preprocessor.h        3 lines
clang/lib/Driver/ToolChains/Clang.cpp         23 lines
clang/lib/Frontend/CompilerInstance.cpp       4 lines
clang/lib/Frontend/CompilerInvocation.cpp     5 lines
clang/lib/Lex/CMakeLists.txt                  1 line
clang/lib/Lex/LiteralSupport.cpp              110 lines
clang/lib/Lex/LiteralTranslator.cpp           70 lines
clang/lib/Lex/Preprocessor.cpp                3 lines
clang/test/CodeGen/systemz-charset.c          32 lines
clang/test/CodeGen/systemz-charset.cpp        25 lines
clang/test/Driver/cl-options.c                7 lines
clang/test/Driver/clang_f_opts.c              4 lines
llvm/include/llvm/ADT/Triple.h                3 lines
llvm/lib/Support/Triple.cpp                   7 lines

Diff 313112

clang/include/clang/Basic/LangOptions.h

...	public:
/// Name of the IR file that contains the result of the OpenMP target		/// Name of the IR file that contains the result of the OpenMP target
/// host code generation.		/// host code generation.
std::string OMPHostIRFile;		std::string OMPHostIRFile;

/// Indicates whether the front-end is explicitly told that the		/// Indicates whether the front-end is explicitly told that the
/// input is a header file (i.e. -x c-header).		/// input is a header file (i.e. -x c-header).
bool IsHeaderFile = false;		bool IsHeaderFile = false;

		/// Name of the exec charset to convert the internal charset to.
		std::string ExecCharset;

LangOptions();		LangOptions();

// Define accessors/mutators for language options of enumeration type.		// Define accessors/mutators for language options of enumeration type.
#define LANGOPT(Name, Bits, Default, Description)		#define LANGOPT(Name, Bits, Default, Description)
#define ENUM_LANGOPT(Name, Type, Bits, Default, Description) \		#define ENUM_LANGOPT(Name, Type, Bits, Default, Description) \
Type get##Name() const { return static_cast<Type>(Name); } \		Type get##Name() const { return static_cast<Type>(Name); } \
void set##Name(Type Value) { Name = static_cast<unsigned>(Value); }		void set##Name(Type Value) { Name = static_cast<unsigned>(Value); }
#include "clang/Basic/LangOptions.def"		#include "clang/Basic/LangOptions.def"
...

clang/include/clang/Basic/TokenKinds.h

	...
	/// constant, string, etc.			/// constant, string, etc.
	inline bool isLiteral(TokenKind K) {			inline bool isLiteral(TokenKind K) {
	return K == tok::numeric_constant || K == tok::char_constant ||			return K == tok::numeric_constant || K == tok::char_constant ||
	K == tok::wide_char_constant || K == tok::utf8_char_constant ||			K == tok::wide_char_constant || K == tok::utf8_char_constant ||
	K == tok::utf16_char_constant || K == tok::utf32_char_constant ||			K == tok::utf16_char_constant || K == tok::utf32_char_constant ||
	isStringLiteral(K) || K == tok::header_name;			isStringLiteral(K) || K == tok::header_name;
	}			}

				/// Return true if this is a utf literal kind.
				inline bool isUTFLiteral(TokenKind K) {
				return K == tok::utf8_char_constant || K == tok::utf8_string_literal ||
				K == tok::utf16_char_constant || K == tok::utf16_string_literal ||
				K == tok::utf32_char_constant || K == tok::utf32_string_literal;
				}

	/// Return true if this is any of tok::annot_* kinds.			/// Return true if this is any of tok::annot_* kinds.
	bool isAnnotation(TokenKind K);			bool isAnnotation(TokenKind K);

	/// Return true if this is an annotation token representing a pragma.			/// Return true if this is an annotation token representing a pragma.
	bool isPragmaAnnotation(TokenKind K);			bool isPragmaAnnotation(TokenKind K);

	} // end namespace tok			} // end namespace tok
	} // end namespace clang			} // end namespace clang
	...

clang/include/clang/Driver/Options.td

	...
	let Flags = [CC1Option, NoDriverOption] in {			let Flags = [CC1Option, NoDriverOption] in {

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Target Options			// Target Options
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	let Flags = [CC1Option, CC1AsOption, NoDriverOption] in {			let Flags = [CC1Option, CC1AsOption, NoDriverOption] in {

				def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
				jansvoboda11 (Done): Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option Adding `MarshallingInfoString<LangOpts<"ExecCharset">>` here should do the trick. You can then delete the option parsing in `CompilerInvocation.cpp`.
				HelpText<"Set the execution <charset> for string and character literals. "
				tahonermann (Done): How about substituting "character set", "character encoding", or "charset" for "codepage"? This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry? The set of recognized names can be a superset of the names that are actually supported.
				abhina.sreeskantharajan (Author, Done): I've updated the description from codepage to charset. It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.
				tahonermann (Done): Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here:
					def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
					HelpText<"Set the execution <charset> for string and character literals. Supported character encodings include XXX and those supported by the host iconv library.">;
				abhina.sreeskantharajan (Author, Done): I've updated the HelpText with your suggested description.
				"Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "
				"and those supported by the host iconv library.">;
	def target_cpu : Separate<["-"], "target-cpu">,			def target_cpu : Separate<["-"], "target-cpu">,
	HelpText<"Target a specific cpu type">;			HelpText<"Target a specific cpu type">;
	def tune_cpu : Separate<["-"], "tune-cpu">,			def tune_cpu : Separate<["-"], "tune-cpu">,
	HelpText<"Tune for a specific cpu type">;			HelpText<"Tune for a specific cpu type">;
	def target_feature : Separate<["-"], "target-feature">,			def target_feature : Separate<["-"], "target-feature">,
	HelpText<"Target specific attributes">;			HelpText<"Target specific attributes">;
	def triple : Separate<["-"], "triple">,			def triple : Separate<["-"], "triple">,
	HelpText<"Specify target triple (e.g. i686-apple-darwin9)">,			HelpText<"Specify target triple (e.g. i686-apple-darwin9)">,
	...

clang/include/clang/Lex/LiteralSupport.h

	...
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_CLANG_LEX_LITERALSUPPORT_H			#ifndef LLVM_CLANG_LEX_LITERALSUPPORT_H
	#define LLVM_CLANG_LEX_LITERALSUPPORT_H			#define LLVM_CLANG_LEX_LITERALSUPPORT_H

	#include "clang/Basic/CharInfo.h"			#include "clang/Basic/CharInfo.h"
	#include "clang/Basic/LLVM.h"			#include "clang/Basic/LLVM.h"
	#include "clang/Basic/TokenKinds.h"			#include "clang/Basic/TokenKinds.h"
				#include "clang/Lex/LiteralTranslator.h"
	#include "llvm/ADT/APFloat.h"			#include "llvm/ADT/APFloat.h"
	#include "llvm/ADT/ArrayRef.h"			#include "llvm/ADT/ArrayRef.h"
	#include "llvm/ADT/SmallString.h"			#include "llvm/ADT/SmallString.h"
	#include "llvm/ADT/StringRef.h"			#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/CharSet.h"
	#include "llvm/Support/DataTypes.h"			#include "llvm/Support/DataTypes.h"

	namespace clang {			namespace clang {

	class DiagnosticsEngine;			class DiagnosticsEngine;
	class Preprocessor;			class Preprocessor;
	class Token;			class Token;
	class SourceLocation;			class SourceLocation;
	...
	class CharLiteralParser {			class CharLiteralParser {
	uint64_t Value;			uint64_t Value;
	tok::TokenKind Kind;			tok::TokenKind Kind;
	bool IsMultiChar;			bool IsMultiChar;
	bool HadError;			bool HadError;
	SmallString<32> UDSuffixBuf;			SmallString<32> UDSuffixBuf;
	unsigned UDSuffixOffset;			unsigned UDSuffixOffset;
	public:			public:
	CharLiteralParser(const char *begin, const char *end,			CharLiteralParser(const char *begin, const char *end, SourceLocation Loc,
	SourceLocation Loc, Preprocessor &PP,			Preprocessor &PP, tok::TokenKind kind,
	tok::TokenKind kind);			ConversionState TranslationState = TranslateToExecCharset);
				tahonermann (Done): Is the default argument for `TranslationState` actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a `CharLiteralParser` with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of `CharLiteralParser` type constructed are in `EvaluateValue()` in `clang/lib/Lex/PPExpressions.cpp` and `Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`.
				abhina.sreeskantharajan (Author, Done): You're right, we don't have any cases that use this arg yet so we can remove it.

				tahonermann (Done): Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.
				abhina.sreeskantharajan (Author, Done): Thanks, I've removed this.
	bool hadError() const { return HadError; }			bool hadError() const { return HadError; }
	bool isAscii() const { return Kind == tok::char_constant; }			bool isAscii() const { return Kind == tok::char_constant; }
	bool isWide() const { return Kind == tok::wide_char_constant; }			bool isWide() const { return Kind == tok::wide_char_constant; }
	bool isUTF8() const { return Kind == tok::utf8_char_constant; }			bool isUTF8() const { return Kind == tok::utf8_char_constant; }
	bool isUTF16() const { return Kind == tok::utf16_char_constant; }			bool isUTF16() const { return Kind == tok::utf16_char_constant; }
	bool isUTF32() const { return Kind == tok::utf32_char_constant; }			bool isUTF32() const { return Kind == tok::utf32_char_constant; }
	bool isMultiChar() const { return IsMultiChar; }			bool isMultiChar() const { return IsMultiChar; }
	uint64_t getValue() const { return Value; }			uint64_t getValue() const { return Value; }
	StringRef getUDSuffix() const { return UDSuffixBuf; }			StringRef getUDSuffix() const { return UDSuffixBuf; }
	unsigned getUDSuffixOffset() const {			unsigned getUDSuffixOffset() const {
	assert(!UDSuffixBuf.empty() && "no ud-suffix");			assert(!UDSuffixBuf.empty() && "no ud-suffix");
	return UDSuffixOffset;			return UDSuffixOffset;
	}			}
	};			};

	/// StringLiteralParser - This decodes string escape characters and performs			/// StringLiteralParser - This decodes string escape characters and performs
	/// wide string analysis and Translation Phase #6 (concatenation of string			/// wide string analysis and Translation Phase #6 (concatenation of string
	/// literals) (C99 5.1.1.2p1).			/// literals) (C99 5.1.1.2p1).
	class StringLiteralParser {			class StringLiteralParser {
	const SourceManager &SM;			const SourceManager &SM;
	const LangOptions &Features;			const LangOptions &Features;
	const TargetInfo &Target;			const TargetInfo &Target;
	DiagnosticsEngine *Diags;			DiagnosticsEngine *Diags;
				LiteralTranslator *LT;

	unsigned MaxTokenLength;			unsigned MaxTokenLength;
	unsigned SizeBound;			unsigned SizeBound;
	unsigned CharByteWidth;			unsigned CharByteWidth;
	tok::TokenKind Kind;			tok::TokenKind Kind;
	SmallString<512> ResultBuf;			SmallString<512> ResultBuf;
	char *ResultPtr; // cursor			char *ResultPtr; // cursor
	SmallString<32> UDSuffixBuf;			SmallString<32> UDSuffixBuf;
	unsigned UDSuffixToken;			unsigned UDSuffixToken;
	unsigned UDSuffixOffset;			unsigned UDSuffixOffset;
	public:			public:
	StringLiteralParser(ArrayRef<Token> StringToks,			StringLiteralParser(
	Preprocessor &PP, bool Complain = true);			ArrayRef<Token> StringToks, Preprocessor &PP, bool Complain = true,
	StringLiteralParser(ArrayRef<Token> StringToks,			ConversionState translationState = TranslateToExecCharset);
	const SourceManager &sm, const LangOptions &features,			StringLiteralParser(ArrayRef<Token> StringToks, const SourceManager &sm,
	const TargetInfo &target,			const LangOptions &features, const TargetInfo &target,
	DiagnosticsEngine *diags = nullptr)			DiagnosticsEngine *diags = nullptr,
				ConversionState translation = TranslateToExecCharset)
	: SM(sm), Features(features), Target(target), Diags(diags),			: SM(sm), Features(features), Target(target), Diags(diags),
	MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),			MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),
	ResultPtr(ResultBuf.data()), hadError(false), Pascal(false) {			ResultPtr(ResultBuf.data()), hadError(false), Pascal(false),
				TranslationState(translation) {
				LT = new LiteralTranslator();
				LT->setTranslationTables(Features, Target, *Diags);
	init(StringToks);			init(StringToks);
	}			}


	bool hadError;			bool hadError;
	bool Pascal;			bool Pascal;
				ConversionState TranslationState;
				tahonermann (Done): Same concern here with respect to persisting the conversion state as a data member.
				abhina.sreeskantharajan (Author, Done): If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
				tahonermann (Done): I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.
				abhina.sreeskantharajan (Author, Done): Thanks, I've removed this member.

	StringRef GetString() const {			StringRef GetString() const {
				tahonermann (Done): This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.
	return StringRef(ResultBuf.data(), GetStringLength());			return StringRef(ResultBuf.data(), GetStringLength());
	}			}
	unsigned GetStringLength() const { return ResultPtr-ResultBuf.data(); }			unsigned GetStringLength() const { return ResultPtr-ResultBuf.data(); }

				tahonermann (Done): I don't think a `LiteralTranslator` object is actually needed in this case. The only use of this constructor that I see is in `ModuleMapParser::consumeToken()` in `clang/lib/Lex/ModuleMap.cpp` and, in that case, I don't think any translation is necessary. This suggests that `TranslationState` is not needed for this constructor either; `NoTranslation` can be passed to `init()`.
				abhina.sreeskantharajan (Author, Done): Thanks, I've removed it.
	unsigned GetNumStringChars() const {			unsigned GetNumStringChars() const {
	return GetStringLength() / CharByteWidth;			return GetStringLength() / CharByteWidth;
	}			}
	/// getOffsetOfStringByte - This function returns the offset of the			/// getOffsetOfStringByte - This function returns the offset of the
	/// specified byte of the string data represented by Token. This handles			/// specified byte of the string data represented by Token. This handles
	/// advancing over escape sequences in the string.			/// advancing over escape sequences in the string.
	///			///
	/// If the Diagnostics pointer is non-null, then this will do semantic			/// If the Diagnostics pointer is non-null, then this will do semantic
	/// checking of the string literal and emit errors and warnings.			/// checking of the string literal and emit errors and warnings.
	unsigned getOffsetOfStringByte(const Token &TheTok, unsigned ByteNo) const;			unsigned getOffsetOfStringByte(const Token &TheTok, unsigned ByteNo) const;

				tahonermann (Done): I don't think `getOffsetOfStringByte()` should require a `ConversionState` parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.
	bool isAscii() const { return Kind == tok::string_literal; }			bool isAscii() const { return Kind == tok::string_literal; }
	bool isWide() const { return Kind == tok::wide_string_literal; }			bool isWide() const { return Kind == tok::wide_string_literal; }
	bool isUTF8() const { return Kind == tok::utf8_string_literal; }			bool isUTF8() const { return Kind == tok::utf8_string_literal; }
	bool isUTF16() const { return Kind == tok::utf16_string_literal; }			bool isUTF16() const { return Kind == tok::utf16_string_literal; }
	bool isUTF32() const { return Kind == tok::utf32_string_literal; }			bool isUTF32() const { return Kind == tok::utf32_string_literal; }
	bool isPascal() const { return Pascal; }			bool isPascal() const { return Pascal; }

	StringRef getUDSuffix() const { return UDSuffixBuf; }			StringRef getUDSuffix() const { return UDSuffixBuf; }
	...

clang/include/clang/Lex/LiteralTranslator.h

This file was added.

				//===--- clang/Lex/LiteralTranslator.h - Translator for Literals -- C++ --==//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_LEX_LITERALTRANSLATOR_H
				#define LLVM_CLANG_LEX_LITERALTRANSLATOR_H

				#include "clang/Basic/Diagnostic.h"
				#include "clang/Basic/LangOptions.h"
				#include "clang/Basic/TargetInfo.h"
				#include "llvm/ADT/StringMap.h"
				#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/CharSet.h"

				enum ConversionState {
				NoTranslation,
				TranslateToSystemCharset,
				TranslateToExecCharset
				};

				tahonermann (Not Done): Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps:
					enum class LiteralConversion { None, ToSystemCharset, ToExecCharset };
				enum CharsetTableStatusCode {
				CharsetTableOk = 1,
				InvalidCharsetTable,
				};

				class LiteralTranslator {
				public:
				llvm::StringRef InternalCharset;
				rsmith (Done): It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.
				tahonermann (Done): I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using `struct` and not include access specifiers.
				abhina.sreeskantharajan (Author, Done): I've made these private.
				llvm::StringRef SystemCharset;
				rsmith (Done): Similarly, use of a global cache here will require you guard it with a mutex. As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`?
				abhina.sreeskantharajan (Author, Done): Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.
				llvm::StringRef ExecCharset;
				llvm::StringMap<llvm::CharSetConverter> ExecCharsetTables;

				tahonermann (Done): Given the converter setters and accessors below, `ExecCharsetTables` should be a private member.
				llvm::CharSetConverter *getConversionTable(const char *Codepage);
				CharsetTableStatusCode findOrCreateExecCharsetTable(const char *To);
				tahonermann (Done): `getConversionTable()` is logically `const`. Perhaps `ExecCharsetTables` should be `mutable`. From a terminology standpoint, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest:
					llvm::CharSetConverter *getCharSetConverter(const char *Encoding) const;
				abhina.sreeskantharajan (Author, Done): I've renamed this function to getConverter
				llvm::CharSetConverter *
				tahonermann (Done): `findOrCreateExecCharsetTable()` seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a `llvm::CharSetConverter` pointer with `nullptr` indicating lookup/creation failed. This function seems like it should be an implementation detail of the class, not a public interface.
				getCharConversionTable(ConversionState TranslationState);
				void setTranslationTables(const clang::LangOptions &Opts,
				const clang::TargetInfo &TInfo,
				clang::DiagnosticsEngine &Diags);
				};
				tahonermann (Not Done): `setTranslationTables()` is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization.

				tahonermann (Not Done): I suggest trying a design more like this:
					class LiteralTranslator {
					  std::string SystemEncoding;
					  std::string ExecutionEncoding;
					public:
					  LiteralTranslator(llvm::StringRef SystemEncoding, llvm::StringRef ExecutionEncoding);
					  // Retrieve the name for the system encoding.
					  llvm::StringRef getSystemEncoding() const;
					  // Retrieve the name for the execution encoding.
					  llvm::StringRef getExecutionEncoding() const;
					  // Retrieve a converter for converting from the internal encoding (UTF-8)
					  // to the system encoding.
					  llvm::CharSetConverter* getSystemEncodingConverter() const;
					  // Retrieve a converter for converting from the internal encoding (UTF-8)
					  // to the execution encoding.
					  llvm::CharSetConverter* getExecutionEncodingConverter() const;
					};
					LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts, const clang::TargetInfo &TInfo, clang::DiagnosticsEngine &Diags);
				#endif
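
The review comments on this header (per-instance state instead of globals, a lookup that returns a converter pointer with `nullptr` on failure, a `mutable` cache behind a logically `const` accessor, and an action-style enumeration) can be combined into one small sketch. This is not the code that was eventually committed. `FakeConverter` is a hypothetical stand-in for `llvm::CharSetConverter`, `LiteralConverterSketch` is an illustrative name, and the list of accepted encoding names is a toy assumption.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for llvm::CharSetConverter; it only records the
// name of the target encoding.
struct FakeConverter {
  std::string To;
};

// Names an action to take rather than a state, as suggested in the review.
enum class LiteralConversion { None, ToSystemCharset, ToExecCharset };

class LiteralConverterSketch {
  std::string SystemEncoding;
  std::string ExecEncoding;
  // Per-instance cache (no global state). It is mutable so that lookups
  // can remain logically const.
  mutable std::map<std::string, std::unique_ptr<FakeConverter>> Cache;

  // Implementation detail: returns nullptr when lookup/creation fails.
  FakeConverter *findOrCreate(const std::string &Name) const {
    auto It = Cache.find(Name);
    if (It != Cache.end())
      return It->second.get();
    // Toy assumption: only these names are recognised in this sketch.
    if (Name != "UTF-8" && Name != "ISO8859-1" && Name != "IBM-1047")
      return nullptr;
    std::unique_ptr<FakeConverter> Conv(new FakeConverter{Name});
    return Cache.emplace(Name, std::move(Conv)).first->second.get();
  }

public:
  LiteralConverterSketch(std::string System, std::string Exec)
      : SystemEncoding(std::move(System)), ExecEncoding(std::move(Exec)) {}

  // Returns nullptr for LiteralConversion::None or an unknown encoding.
  FakeConverter *getConverter(LiteralConversion Action) const {
    switch (Action) {
    case LiteralConversion::None:
      return nullptr;
    case LiteralConversion::ToSystemCharset:
      return findOrCreate(SystemEncoding);
    case LiteralConversion::ToExecCharset:
      return findOrCreate(ExecEncoding);
    }
    return nullptr;
  }
};
```

Because each instance owns its cache, two compilations running in the same process would each hold their own object and would not share converter state. A single instance shared across threads would still need external synchronisation.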

clang/include/clang/Lex/Preprocessor.h

...
#include "clang/Basic/IdentifierTable.h"		#include "clang/Basic/IdentifierTable.h"
#include "clang/Basic/LLVM.h"		#include "clang/Basic/LLVM.h"
#include "clang/Basic/LangOptions.h"		#include "clang/Basic/LangOptions.h"
#include "clang/Basic/Module.h"		#include "clang/Basic/Module.h"
#include "clang/Basic/SourceLocation.h"		#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"		#include "clang/Basic/SourceManager.h"
#include "clang/Basic/TokenKinds.h"		#include "clang/Basic/TokenKinds.h"
#include "clang/Lex/Lexer.h"		#include "clang/Lex/Lexer.h"
		#include "clang/Lex/LiteralTranslator.h"
#include "clang/Lex/MacroInfo.h"		#include "clang/Lex/MacroInfo.h"
#include "clang/Lex/ModuleLoader.h"		#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/ModuleMap.h"		#include "clang/Lex/ModuleMap.h"
#include "clang/Lex/PPCallbacks.h"		#include "clang/Lex/PPCallbacks.h"
#include "clang/Lex/PreprocessorExcludedConditionalDirectiveSkipMapping.h"		#include "clang/Lex/PreprocessorExcludedConditionalDirectiveSkipMapping.h"
#include "clang/Lex/Token.h"		#include "clang/Lex/Token.h"
#include "clang/Lex/TokenLexer.h"		#include "clang/Lex/TokenLexer.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
...	class Preprocessor {
LangOptions &LangOpts;		LangOptions &LangOpts;
const TargetInfo *Target = nullptr;		const TargetInfo *Target = nullptr;
const TargetInfo *AuxTarget = nullptr;		const TargetInfo *AuxTarget = nullptr;
FileManager &FileMgr;		FileManager &FileMgr;
SourceManager &SourceMgr;		SourceManager &SourceMgr;
std::unique_ptr<ScratchBuffer> ScratchBuf;		std::unique_ptr<ScratchBuffer> ScratchBuf;
HeaderSearch &HeaderInfo;		HeaderSearch &HeaderInfo;
ModuleLoader &TheModuleLoader;		ModuleLoader &TheModuleLoader;
		LiteralTranslator *LT = nullptr;
		tahonermann (Done): I don't see a reason for `LT` to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?
		abhina.sreeskantharajan (Author, Done): Thanks, I've changed it to a non-reference non-pointer member.
		rsmith (Done): Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why `LT`? What does the `T` stand for?
		abhina.sreeskantharajan (Author, Done): Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.

/// External source of macros.		/// External source of macros.
ExternalPreprocessorSource *ExternalSource;		ExternalPreprocessorSource *ExternalSource;

/// A BumpPtrAllocator object used to quickly allocate and release		/// A BumpPtrAllocator object used to quickly allocate and release
/// objects internal to the Preprocessor.		/// objects internal to the Preprocessor.
llvm::BumpPtrAllocator BP;		llvm::BumpPtrAllocator BP;

▲ Show 20 Lines • Show All 774 Lines • ▼ Show 20 Lines	public:
SourceManager &getSourceManager() const { return SourceMgr; }		SourceManager &getSourceManager() const { return SourceMgr; }
HeaderSearch &getHeaderSearchInfo() const { return HeaderInfo; }		HeaderSearch &getHeaderSearchInfo() const { return HeaderInfo; }

IdentifierTable &getIdentifierTable() { return Identifiers; }		IdentifierTable &getIdentifierTable() { return Identifiers; }
const IdentifierTable &getIdentifierTable() const { return Identifiers; }		const IdentifierTable &getIdentifierTable() const { return Identifiers; }
SelectorTable &getSelectorTable() { return Selectors; }		SelectorTable &getSelectorTable() { return Selectors; }
Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }		Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }
llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }		llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }
		LiteralTranslator *getLiteralTranslator() { return LT; }

void setExternalSource(ExternalPreprocessorSource *Source) {		void setExternalSource(ExternalPreprocessorSource *Source) {
ExternalSource = Source;		ExternalSource = Source;
}		}

ExternalPreprocessorSource *getExternalSource() const {		ExternalPreprocessorSource *getExternalSource() const {
return ExternalSource;		return ExternalSource;
}		}
▲ Show 20 Lines • Show All 1,476 Lines • Show Last 20 Lines

clang/lib/Driver/ToolChains/Clang.cpp

	Show All 29 Lines
	#include "clang/Driver/Distro.h"			#include "clang/Driver/Distro.h"
	#include "clang/Driver/DriverDiagnostic.h"			#include "clang/Driver/DriverDiagnostic.h"
	#include "clang/Driver/Options.h"			#include "clang/Driver/Options.h"
	#include "clang/Driver/SanitizerArgs.h"			#include "clang/Driver/SanitizerArgs.h"
	#include "clang/Driver/XRayArgs.h"			#include "clang/Driver/XRayArgs.h"
	#include "llvm/ADT/StringExtras.h"			#include "llvm/ADT/StringExtras.h"
	#include "llvm/Config/llvm-config.h"			#include "llvm/Config/llvm-config.h"
	#include "llvm/Option/ArgList.h"			#include "llvm/Option/ArgList.h"
				#include "llvm/Support/CharSet.h"
	#include "llvm/Support/CodeGen.h"			#include "llvm/Support/CodeGen.h"
	#include "llvm/Support/Compiler.h"			#include "llvm/Support/Compiler.h"
	#include "llvm/Support/Compression.h"			#include "llvm/Support/Compression.h"
	#include "llvm/Support/FileSystem.h"			#include "llvm/Support/FileSystem.h"
	#include "llvm/Support/Host.h"			#include "llvm/Support/Host.h"
	#include "llvm/Support/Path.h"			#include "llvm/Support/Path.h"
	#include "llvm/Support/Process.h"			#include "llvm/Support/Process.h"
	#include "llvm/Support/TargetParser.h"			#include "llvm/Support/TargetParser.h"
	▲ Show 20 Lines • Show All 5,915 Lines • ▼ Show 20 Lines
	// -finput_charset=UTF-8 is default. Reject others			// -finput_charset=UTF-8 is default. Reject others
	if (Arg *inputCharset = Args.getLastArg(options::OPT_finput_charset_EQ)) {			if (Arg *inputCharset = Args.getLastArg(options::OPT_finput_charset_EQ)) {
	StringRef value = inputCharset->getValue();			StringRef value = inputCharset->getValue();
	if (!value.equals_lower("utf-8"))			if (!value.equals_lower("utf-8"))
	D.Diag(diag::err_drv_invalid_value) << inputCharset->getAsString(Args)			D.Diag(diag::err_drv_invalid_value) << inputCharset->getAsString(Args)
	<< value;			<< value;
	}			}

	// -fexec_charset=UTF-8 is default. Reject others			// Pass all -fexec-charset options to cc1.
	if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {			std::vector<std::string> vList =
	StringRef value = execCharset->getValue();			Args.getAllArgValues(options::OPT_fexec_charset_EQ);
	if (!value.equals_lower("utf-8"))			// Set the default fexec-charset as the system charset.
	D.Diag(diag::err_drv_invalid_value) << execCharset->getAsString(Args)			CmdArgs.push_back("-fexec-charset");
	<< value;			CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
				for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
				rsmith (Done): Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?
				abhina.sreeskantharajan (Author, Done): Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.
				llvm::ErrorOr<llvm::CharSetConverter> ErrorOrConverter =
				llvm::CharSetConverter::create("UTF-8", it->c_str());
				if (ErrorOrConverter) {
				CmdArgs.push_back("-fexec-charset");
				CmdArgs.push_back(Args.MakeArgString(*it));
				} else {
				D.Diag(clang::diag::err_drv_invalid_value) << "-fexec-charset" << *it;
				}
				tahonermann (Done): Thank you for adding this.
	}			}
				tahonermann (Done): I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
				abhina.sreeskantharajan (Author, Done): Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
				tahonermann (Done): Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?
				abhina.sreeskantharajan (Author, Done): I initially thought it would be a performance issue if we created the Converter twice, once here and once in the Preprocessor. But I do think it's a good idea to diagnose this early. I've modified the code to diagnose and emit an error here.
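
The outcome of the two threads above (use only the last `-fexec-charset` value, and diagnose an unusable name once in the driver) can be sketched in standalone C++. This is a minimal illustration, not the patch's code: `canCreateConverter`, `resolveExecCharset`, `ExecCharsetResult`, and the table of accepted names are all hypothetical stand-ins, and the real check would attempt to construct a `CharSetConverter` rather than consult a fixed list.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for "try to create a converter from UTF-8 to Name".
// The real driver cannot know the supported names in advance; this fixed
// table exists only so the sketch is self-contained.
static bool canCreateConverter(const std::string &Name) {
  std::string Lower(Name);
  std::transform(Lower.begin(), Lower.end(), Lower.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  static const std::vector<std::string> Known = {"utf-8", "ibm-1047",
                                                 "iso8859-1"};
  return std::find(Known.begin(), Known.end(), Lower) != Known.end();
}

struct ExecCharsetResult {
  std::optional<std::string> Charset;   // value to forward to the frontend
  std::vector<std::string> Diagnostics; // errors reported by the driver
};

// Last occurrence wins; with no occurrence, the system default is used.
// The chosen name is validated exactly once, here in the "driver".
static ExecCharsetResult
resolveExecCharset(const std::vector<std::string> &Values,
                   const std::string &SystemDefault) {
  ExecCharsetResult Result;
  const std::string &Chosen = Values.empty() ? SystemDefault : Values.back();
  if (canCreateConverter(Chosen))
    Result.Charset = Chosen;
  else
    Result.Diagnostics.push_back("invalid value '" + Chosen +
                                 "' in '-fexec-charset'");
  return Result;
}
```

Validating in one place means an unrecognized name produces a single error instead of one per compiler instance, at the cost of constructing the converter a second time later in the frontend.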

	RenderDiagnosticsOptions(D, Args, CmdArgs);			RenderDiagnosticsOptions(D, Args, CmdArgs);

	// -fno-asm-blocks is default.			// -fno-asm-blocks is default.
	if (Args.hasFlag(options::OPT_fasm_blocks, options::OPT_fno_asm_blocks,			if (Args.hasFlag(options::OPT_fasm_blocks, options::OPT_fno_asm_blocks,
	false))			false))
	CmdArgs.push_back("-fasm-blocks");			CmdArgs.push_back("-fasm-blocks");

	// -fgnu-inline-asm is default.			// -fgnu-inline-asm is default.
	▲ Show 20 Lines • Show All 1,388 Lines • Show Last 20 Lines

clang/lib/Frontend/CompilerInstance.cpp

//===--- CompilerInstance.cpp ---------------------------------------------===//		//===--- CompilerInstance.cpp ---------------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "clang/Frontend/CompilerInstance.h"		#include "clang/Frontend/CompilerInstance.h"
#include "clang/AST/ASTConsumer.h"		#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"		#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"		#include "clang/AST/Decl.h"
#include "clang/Basic/CharInfo.h"		#include "clang/Basic/CharInfo.h"
#include "clang/Basic/Diagnostic.h"		#include "clang/Basic/Diagnostic.h"
		#include "clang/Basic/DiagnosticDriver.h"
#include "clang/Basic/FileManager.h"		#include "clang/Basic/FileManager.h"
#include "clang/Basic/LangStandard.h"		#include "clang/Basic/LangStandard.h"
#include "clang/Basic/SourceManager.h"		#include "clang/Basic/SourceManager.h"
#include "clang/Basic/Stack.h"		#include "clang/Basic/Stack.h"
#include "clang/Basic/TargetInfo.h"		#include "clang/Basic/TargetInfo.h"
#include "clang/Basic/Version.h"		#include "clang/Basic/Version.h"
#include "clang/Config/config.h"		#include "clang/Config/config.h"
#include "clang/Frontend/ChainedDiagnosticConsumer.h"		#include "clang/Frontend/ChainedDiagnosticConsumer.h"
#include "clang/Frontend/FrontendAction.h"		#include "clang/Frontend/FrontendAction.h"
#include "clang/Frontend/FrontendActions.h"		#include "clang/Frontend/FrontendActions.h"
#include "clang/Frontend/FrontendDiagnostic.h"		#include "clang/Frontend/FrontendDiagnostic.h"
#include "clang/Frontend/LogDiagnosticPrinter.h"		#include "clang/Frontend/LogDiagnosticPrinter.h"
#include "clang/Frontend/SerializedDiagnosticPrinter.h"		#include "clang/Frontend/SerializedDiagnosticPrinter.h"
#include "clang/Frontend/TextDiagnosticPrinter.h"		#include "clang/Frontend/TextDiagnosticPrinter.h"
#include "clang/Frontend/Utils.h"		#include "clang/Frontend/Utils.h"
#include "clang/Frontend/VerifyDiagnosticConsumer.h"		#include "clang/Frontend/VerifyDiagnosticConsumer.h"
#include "clang/Lex/HeaderSearch.h"		#include "clang/Lex/HeaderSearch.h"
		#include "clang/Lex/LiteralTranslator.h"
#include "clang/Lex/Preprocessor.h"		#include "clang/Lex/Preprocessor.h"
#include "clang/Lex/PreprocessorOptions.h"		#include "clang/Lex/PreprocessorOptions.h"
#include "clang/Sema/CodeCompleteConsumer.h"		#include "clang/Sema/CodeCompleteConsumer.h"
#include "clang/Sema/Sema.h"		#include "clang/Sema/Sema.h"
#include "clang/Serialization/ASTReader.h"		#include "clang/Serialization/ASTReader.h"
#include "clang/Serialization/GlobalModuleIndex.h"		#include "clang/Serialization/GlobalModuleIndex.h"
#include "clang/Serialization/InMemoryModuleCache.h"		#include "clang/Serialization/InMemoryModuleCache.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
▲ Show 20 Lines • Show All 430 Lines • ▼ Show 20 Lines	AttachHeaderIncludeGen(*PP, DepOpts,
/*ShowDepth=*/false);		/*ShowDepth=*/false);
}		}

if (DepOpts.ShowIncludesDest != ShowIncludesDestination::None) {		if (DepOpts.ShowIncludesDest != ShowIncludesDestination::None) {
AttachHeaderIncludeGen(*PP, DepOpts,		AttachHeaderIncludeGen(*PP, DepOpts,
/*ShowAllHeaders=*/true, /*OutputPath=*/"",		/*ShowAllHeaders=*/true, /*OutputPath=*/"",
/*ShowDepth=*/true, /*MSStyle=*/true);		/*ShowDepth=*/true, /*MSStyle=*/true);
}		}
		PP->getLiteralTranslator()->setTranslationTables(getLangOpts(), getTarget(),
		getDiagnostics());
}		}

std::string CompilerInstance::getSpecificModuleCachePath() {		std::string CompilerInstance::getSpecificModuleCachePath() {
// Set up the module path, including the hash for the		// Set up the module path, including the hash for the
// module-creation options.		// module-creation options.
SmallString<256> SpecificModuleCache(getHeaderSearchOpts().ModuleCachePath);		SmallString<256> SpecificModuleCache(getHeaderSearchOpts().ModuleCachePath);
if (!SpecificModuleCache.empty() && !getHeaderSearchOpts().DisableModuleHash)		if (!SpecificModuleCache.empty() && !getHeaderSearchOpts().DisableModuleHash)
llvm::sys::path::append(SpecificModuleCache,		llvm::sys::path::append(SpecificModuleCache,
▲ Show 20 Lines • Show All 1,733 Lines • Show Last 20 Lines

clang/lib/Frontend/CompilerInvocation.cpp

Show First 20 Lines • Show All 3,561 Lines • ▼ Show 20 Lines	#include "clang/Basic/LangStandards.def"

Opts.CompatibilityQualifiedIdBlockParamTypeChecking =		Opts.CompatibilityQualifiedIdBlockParamTypeChecking =
Args.hasArg(OPT_fcompatibility_qualified_id_block_param_type_checking);		Args.hasArg(OPT_fcompatibility_qualified_id_block_param_type_checking);

Opts.RelativeCXXABIVTables =		Opts.RelativeCXXABIVTables =
Args.hasFlag(OPT_fexperimental_relative_cxx_abi_vtables,		Args.hasFlag(OPT_fexperimental_relative_cxx_abi_vtables,
OPT_fno_experimental_relative_cxx_abi_vtables,		OPT_fno_experimental_relative_cxx_abi_vtables,
/*default=*/false);		/*default=*/false);

		if (Arg *ExecCharset = Args.getLastArg(OPT_fexec_charset)) {
		StringRef Value = ExecCharset->getValue();
		Opts.ExecCharset = Value.str();
		tahonermann (Done): I wouldn't expect the cast to `std::string` to be needed here.
		abhina.sreeskantharajan (Author, Done): Without that cast, I get the following build error: `error: no viable overloaded '='`
		tahonermann (Done): Ok, rather than a cast, I suggest: `Opts.ExecCharset = Value.str();`
		abhina.sreeskantharajan (Author, Done): Thanks, I've applied this change.
		}
}		}
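
The build error quoted in the thread above is what one would expect if the string-reference type only converts to `std::string` explicitly. The sketch below reproduces that situation with a hypothetical `MiniStringRef` class (it is not `llvm::StringRef`, and the assumption that the real type behaves the same way is inferred from the error message, not confirmed by the patch). `copyValue` is likewise a name invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical non-owning string reference with an explicit conversion.
class MiniStringRef {
  const char *Data;
  std::size_t Length;

public:
  MiniStringRef(const char *D, std::size_t L) : Data(D), Length(L) {}

  // Explicit: usable in a cast, but not in a plain assignment.
  explicit operator std::string() const { return std::string(Data, Length); }

  // Named accessor that returns an owning copy.
  std::string str() const { return std::string(Data, Length); }
};

// `S = Ref;` does not compile, because no implicit conversion exists.
static_assert(!std::is_assignable<std::string &, MiniStringRef>::value,
              "plain assignment should be rejected");

static std::string copyValue(MiniStringRef Value) {
  std::string ExecCharset;
  ExecCharset = Value.str(); // the form suggested in the review
  return ExecCharset;
}
```

The named `str()` call states the copy at the call site without the syntactic noise of a cast, which is presumably why it was preferred.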

static bool isStrictlyPreprocessorAction(frontend::ActionKind Action) {		static bool isStrictlyPreprocessorAction(frontend::ActionKind Action) {
switch (Action) {		switch (Action) {
case frontend::ASTDeclList:		case frontend::ASTDeclList:
case frontend::ASTDump:		case frontend::ASTDump:
case frontend::ASTPrint:		case frontend::ASTPrint:
case frontend::ASTView:		case frontend::ASTView:
▲ Show 20 Lines • Show All 536 Lines • Show Last 20 Lines

clang/lib/Lex/CMakeLists.txt

	# TODO: Add -maltivec when ARCH is PowerPC.			# TODO: Add -maltivec when ARCH is PowerPC.

	set(LLVM_LINK_COMPONENTS support)			set(LLVM_LINK_COMPONENTS support)

	add_clang_library(clangLex			add_clang_library(clangLex
	DependencyDirectivesSourceMinimizer.cpp			DependencyDirectivesSourceMinimizer.cpp
	HeaderMap.cpp			HeaderMap.cpp
	HeaderSearch.cpp			HeaderSearch.cpp
	Lexer.cpp			Lexer.cpp
	LiteralSupport.cpp			LiteralSupport.cpp
				LiteralTranslator.cpp
	MacroArgs.cpp			MacroArgs.cpp
	MacroInfo.cpp			MacroInfo.cpp
	ModuleMap.cpp			ModuleMap.cpp
	PPCaching.cpp			PPCaching.cpp
	PPCallbacks.cpp			PPCallbacks.cpp
	PPConditionalDirectiveRecord.cpp			PPConditionalDirectiveRecord.cpp
	PPDirectives.cpp			PPDirectives.cpp
	PPExpressions.cpp			PPExpressions.cpp
	Show All 13 Lines

clang/lib/Lex/LiteralSupport.cpp

Show First 20 Lines • Show All 87 Lines • ▼ Show 20 Lines

/// ProcessCharEscape - Parse a standard C escape sequence, which can occur in		/// ProcessCharEscape - Parse a standard C escape sequence, which can occur in
/// either a character or a string literal.		/// either a character or a string literal.
static unsigned ProcessCharEscape(const char *ThisTokBegin,		static unsigned ProcessCharEscape(const char *ThisTokBegin,
const char *&ThisTokBuf,		const char *&ThisTokBuf,
const char *ThisTokEnd, bool &HadError,		const char *ThisTokEnd, bool &HadError,
FullSourceLoc Loc, unsigned CharWidth,		FullSourceLoc Loc, unsigned CharWidth,
DiagnosticsEngine *Diags,		DiagnosticsEngine *Diags,
const LangOptions &Features) {		const LangOptions &Features,
		llvm::CharSetConverter *Converter) {
const char *EscapeBegin = ThisTokBuf;		const char *EscapeBegin = ThisTokBuf;

// Skip the '\' char.		// Skip the '\' char.
++ThisTokBuf;		++ThisTokBuf;

// We know that this character can't be off the end of the buffer, because		// We know that this character can't be off the end of the buffer, because
// that would have been \", which would not have been the end of string.		// that would have been \", which would not have been the end of string.
unsigned ResultChar = *ThisTokBuf++;		unsigned ResultChar = *ThisTokBuf++;
		bool Translate = true;
switch (ResultChar) {		switch (ResultChar) {
// These map to themselves.		// These map to themselves.
case '\\': case '\'': case '"': case '?': break;		case '\\': case '\'': case '"': case '?': break;

// These have fixed mappings.		// These have fixed mappings.
case 'a':		case 'a':
// TODO: K&R: the meaning of '\\a' is different in traditional C		// TODO: K&R: the meaning of '\\a' is different in traditional C
ResultChar = 7;		ResultChar = 7;
Show All 24 Lines	case 'r':
break;		break;
case 't':		case 't':
ResultChar = 9;		ResultChar = 9;
break;		break;
case 'v':		case 'v':
ResultChar = 11;		ResultChar = 11;
break;		break;
case 'x': { // Hex escape.		case 'x': { // Hex escape.
		Translate = false;
ResultChar = 0;		ResultChar = 0;
if (ThisTokBuf == ThisTokEnd || !isHexDigit(*ThisTokBuf)) {		if (ThisTokBuf == ThisTokEnd || !isHexDigit(*ThisTokBuf)) {
if (Diags)		if (Diags)
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::err_hex_escape_no_digits) << "x";		diag::err_hex_escape_no_digits) << "x";
HadError = true;		HadError = true;
break;		break;
}		}
Show All 21 Lines	if (Overflow && Diags) // Too many digits to fit in
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::err_escape_too_large) << 0;		diag::err_escape_too_large) << 0;
break;		break;
}		}
case '0': case '1': case '2': case '3':		case '0': case '1': case '2': case '3':
case '4': case '5': case '6': case '7': {		case '4': case '5': case '6': case '7': {
// Octal escapes.		// Octal escapes.
--ThisTokBuf;		--ThisTokBuf;
		Translate = false;
ResultChar = 0;		ResultChar = 0;

// Octal escapes are a series of octal digits with maximum length 3.		// Octal escapes are a series of octal digits with maximum length 3.
// "\0123" is a two digit sequence equal to "\012" "3".		// "\0123" is a two digit sequence equal to "\012" "3".
unsigned NumDigits = 0;		unsigned NumDigits = 0;
do {		do {
ResultChar <<= 3;		ResultChar <<= 3;
ResultChar |= *ThisTokBuf++ - '0';		ResultChar |= *ThisTokBuf++ - '0';
Show All 29 Lines	if (isPrintable(ResultChar))
<< std::string(1, ResultChar);		<< std::string(1, ResultChar);
else		else
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::ext_unknown_escape)		diag::ext_unknown_escape)
<< "x" + llvm::utohexstr(ResultChar);		<< "x" + llvm::utohexstr(ResultChar);
break;		break;
}		}

		if (Translate && Converter) {
		char ByteChar = ResultChar;
		SmallString<8> ResultCharConv;
		Converter->convert(std::string(1, ByteChar), ResultCharConv);
		tahonermann (Done): What should happen if `ResultChar` >= 0x100? IBM-1047 does have representations for other UTF-8 characters. Regardless, it seems `ResultChar` should be converted to something.
		abhina.sreeskantharajan (Author, Done): This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.
		tahonermann (Done): Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
		abhina.sreeskantharajan (Author, Done): Thanks, I added the substitution with the '?' character for invalid escapes.
		rsmith (Not Done): This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, `\q` should produce `q` -- and take fallback action only if the character is unencodable. Producing a `?` seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.
		abhina.sreeskantharajan (Author, Not Done): Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?
		memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
		tahonermann (Done): As Richard previously noted, this `memcpy()` needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code unit? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.
		abhina.sreeskantharajan (Author, Done): I replaced memcpy with an assignment. Please let me know if there is a better solution.
		rsmith (Done): Can you avoid using `std::string` here? Eg, pass a `StringRef`, extending the converter to be able to take one if necessary.
		}
		rsmith (Done): Reinterpreting an `unsigned` as a `char` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.
		abhina.sreeskantharajan (Author, Done): Thanks, I've created a char instead.
return ResultChar;		return ResultChar;
}		}
		rsmith (Done): What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
		abhina.sreeskantharajan (Author, Done): This may be a problem we need to revisit since ResultChar is expecting a char.
		abhina.sreeskantharajan (Author, Done): I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.
		rsmith (Not Done): Is there any guarantee the assertion will not fail?

		rsmith (Not Done): Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
		tahonermann (Not Done): For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.
		abhina.sreeskantharajan (Author, Not Done): Right, stateful encodings may be a problem we will need to revisit later as well.
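
The endianness and "does it fit" concerns raised in the comments above can be illustrated without any LLVM types. The following is a minimal sketch under stated assumptions: `singleCodeUnit` and `packBigEndian` are hypothetical helpers, and the converted bytes are modeled as a plain `std::string`. Copying converted bytes into an `unsigned` with `memcpy` yields a host-dependent value, whereas reading one byte through `unsigned char`, or packing bytes in an explicitly chosen order, does not.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Returns the converted value only when the conversion produced exactly one
// code unit. The caller can turn std::nullopt into a diagnostic instead of
// relying on an assertion that might fail.
static std::optional<unsigned> singleCodeUnit(const std::string &Converted) {
  if (Converted.size() != 1)
    return std::nullopt;
  // Go through unsigned char so bytes >= 0x80 are not sign-extended.
  return static_cast<unsigned char>(Converted[0]);
}

// If several bytes really must share one integer, fix the byte order
// explicitly (most significant byte first) rather than inheriting the
// host's order from memcpy.
static std::optional<std::uint32_t> packBigEndian(const std::string &Bytes) {
  if (Bytes.empty() || Bytes.size() > 4)
    return std::nullopt;
  std::uint32_t Value = 0;
  for (unsigned char Byte : Bytes)
    Value = (Value << 8) | Byte;
  return Value;
}
```

Returning an empty optional for multi-byte or empty results leaves the policy decision (hard error versus fallback) to the caller, which is the open question in the thread.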
static void appendCodePoint(unsigned Codepoint,		static void appendCodePoint(unsigned Codepoint,
llvm::SmallVectorImpl<char> &Str) {		llvm::SmallVectorImpl<char> &Str) {
char ResultBuf[4];		char ResultBuf[4];
char *ResultPtr = ResultBuf;		char *ResultPtr = ResultBuf;
bool Res = llvm::ConvertCodePointToUTF8(Codepoint, ResultPtr);		bool Res = llvm::ConvertCodePointToUTF8(Codepoint, ResultPtr);
(void)Res;		(void)Res;
assert(Res && "Unexpected conversion failure");		assert(Res && "Unexpected conversion failure");
Str.append(ResultBuf, ResultPtr);		Str.append(ResultBuf, ResultPtr);
▲ Show 20 Lines • Show All 993 Lines • ▼ Show 20 Lines
/// \u hex-quad		/// \u hex-quad
/// \U hex-quad hex-quad		/// \U hex-quad hex-quad
/// hex-quad:		/// hex-quad:
/// hex-digit hex-digit hex-digit hex-digit		/// hex-digit hex-digit hex-digit hex-digit
/// \endverbatim		/// \endverbatim
///		///
CharLiteralParser::CharLiteralParser(const char *begin, const char *end,		CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
SourceLocation Loc, Preprocessor &PP,		SourceLocation Loc, Preprocessor &PP,
tok::TokenKind kind) {		tok::TokenKind kind,
		ConversionState TranslationState) {
// At this point we know that the character matches the regex "(L|u|U)?'.*'".		// At this point we know that the character matches the regex "(L|u|U)?'.*'".
HadError = false;		HadError = false;

Kind = kind;		Kind = kind;
		LiteralTranslator *LT = PP.getLiteralTranslator();

const char *TokBegin = begin;		const char *TokBegin = begin;

// Skip over wide character determinant.		// Skip over wide character determinant.
if (Kind != tok::char_constant)		if (Kind != tok::char_constant)
++begin;		++begin;
if (Kind == tok::utf8_char_constant)		if (Kind == tok::utf8_char_constant)
++begin;		++begin;
▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines	CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
} else if (tok::utf16_char_constant == Kind) {		} else if (tok::utf16_char_constant == Kind) {
largest_character_for_kind = 0xFFFF;		largest_character_for_kind = 0xFFFF;
} else if (tok::utf32_char_constant == Kind) {		} else if (tok::utf32_char_constant == Kind) {
largest_character_for_kind = 0x10FFFF;		largest_character_for_kind = 0x10FFFF;
} else {		} else {
largest_character_for_kind = 0x7Fu;		largest_character_for_kind = 0x7Fu;
}		}

		ConversionState State = TranslationState;
		if (Kind == tok::wide_string_literal)
		rsmith (Done): Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)
		State = TranslateToSystemCharset;
		tahonermann (Done): Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
		abhina.sreeskantharajan (Author, Done): Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?
		tahonermann (Done): Perhaps an internal compiler error to indicate that appropriate support is not yet in place?
		abhina.sreeskantharajan (Author, Done): Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.
		tahonermann (Done): Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like: `llvm::CharSetConverter *Converter = nullptr; if (!isUTFLiteral(Kind)) { assert(LT); Converter = LT->getCharConversionTable(TranslateToExecCharset); }`
		abhina.sreeskantharajan (Author, Done): I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes.
		else if (isUTFLiteral(Kind))
		State = NoTranslation;

		llvm::CharSetConverter *Converter =
		LT ? LT->getCharConversionTable(State) : nullptr;
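
The selection rule the reviewers describe in the comments above (no conversion for UTF-8/UTF-16/UTF-32 literals, the narrow execution character set for ordinary literals, and wide literals treated as unsupported until a wide execution charset option exists) can be written as a small standalone decision function. This is only a sketch: `LiteralKind`, `Conversion`, `isUTF`, and `selectConversion` are hypothetical names that do not appear in the patch.

```cpp
// Hypothetical literal categories, standing in for the token kinds.
enum class LiteralKind { Narrow, Wide, UTF8, UTF16, UTF32 };

// What the literal parser should do with the decoded characters.
enum class Conversion {
  None,          // keep the internal (UTF) representation
  ToExecCharset, // convert to the narrow execution character set
  Unsupported    // wide execution charset handling is not implemented
};

static bool isUTF(LiteralKind Kind) {
  return Kind == LiteralKind::UTF8 || Kind == LiteralKind::UTF16 ||
         Kind == LiteralKind::UTF32;
}

static Conversion selectConversion(LiteralKind Kind) {
  if (isUTF(Kind))
    return Conversion::None;
  if (Kind == LiteralKind::Wide)
    return Conversion::Unsupported; // report it rather than guess an encoding
  return Conversion::ToExecCharset;
}
```

Deriving the conversion from the literal kind alone is what makes the extra constructor parameter unnecessary, as suggested in the review.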

while (begin != end) {		while (begin != end) {
// Is this a span of non-escape characters?		// Is this a span of non-escape characters?
if (begin[0] != '\\') {		if (begin[0] != '\\') {
char const *start = begin;		char const *start = begin;
do {		do {
++begin;		++begin;
} while (begin != end && *begin != '\\');		} while (begin != end && *begin != '\\');

Show All 21 Lines	if (begin[0] != '\\') {
HadError = true;		HadError = true;
}		}
} else {		} else {
for (; tmp_out_start < buffer_begin; ++tmp_out_start) {		for (; tmp_out_start < buffer_begin; ++tmp_out_start) {
if (*tmp_out_start > largest_character_for_kind) {		if (*tmp_out_start > largest_character_for_kind) {
HadError = true;		HadError = true;
PP.Diag(Loc, diag::err_character_too_large);		PP.Diag(Loc, diag::err_character_too_large);
}		}
		if (!HadError && Converter) {
		SmallString<1> ConvertedChar;
		Converter->convert(StringRef((char *)tmp_out_start), ConvertedChar);
		rsmith (Done): Why is this case not possible?
		abhina.sreeskantharajan (Author, Done): This case should be handled when the fwide-exec-charset option is implemented. Until then, we thought it was best to emit an error message that wide literal translation is not supported.
		memmove((void *)tmp_out_start, ConvertedChar.data(), 1);
		}
}		}
		rsmith (Done): What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.
		abhina.sreeskantharajan (Author, Done): Right, I'll add a similar assertion to the one we have above.
}		}

continue;		continue;
}		}
// Is this a Universal Character Name escape?		// Is this a Universal Character Name escape?
if (begin[1] == 'u' || begin[1] == 'U') {		if (begin[1] == 'u' || begin[1] == 'U') {
unsigned short UcnLen = 0;		unsigned short UcnLen = 0;
if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,		if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
FullSourceLoc(Loc, PP.getSourceManager()),		FullSourceLoc(Loc, PP.getSourceManager()),
&PP.getDiagnostics(), PP.getLangOpts(), true)) {		&PP.getDiagnostics(), PP.getLangOpts(), true)) {
HadError = true;		HadError = true;
} else if (*buffer_begin > largest_character_for_kind) {		} else if (*buffer_begin > largest_character_for_kind) {
HadError = true;		HadError = true;
PP.Diag(Loc, diag::err_character_too_large);		PP.Diag(Loc, diag::err_character_too_large);
}		}

++buffer_begin;		++buffer_begin;
continue;		continue;
}		}
unsigned CharWidth = getCharWidth(Kind, PP.getTargetInfo());		unsigned CharWidth = getCharWidth(Kind, PP.getTargetInfo());
uint64_t result =		uint64_t result =
ProcessCharEscape(TokBegin, begin, end, HadError,		ProcessCharEscape(TokBegin, begin, end, HadError,
FullSourceLoc(Loc,PP.getSourceManager()),		FullSourceLoc(Loc, PP.getSourceManager()), CharWidth,
CharWidth, &PP.getDiagnostics(), PP.getLangOpts());		&PP.getDiagnostics(), PP.getLangOpts(), nullptr);
*buffer_begin++ = result;		*buffer_begin++ = result;
}		}

unsigned NumCharsSoFar = buffer_begin - &codepoint_buffer.front();		unsigned NumCharsSoFar = buffer_begin - &codepoint_buffer.front();

if (NumCharsSoFar > 1) {		if (NumCharsSoFar > 1) {
if (isWide())		if (isWide())
PP.Diag(Loc, diag::warn_extraneous_char_constant);		PP.Diag(Loc, diag::warn_extraneous_char_constant);
▲ Show 20 Lines • Show All 91 Lines • ▼ Show 20 Lines
/// hexadecimal-escape-sequence hexadecimal-digit		/// hexadecimal-escape-sequence hexadecimal-digit
/// universal-character-name:		/// universal-character-name:
/// \u hex-quad		/// \u hex-quad
/// \U hex-quad hex-quad		/// \U hex-quad hex-quad
/// hex-quad:		/// hex-quad:
/// hex-digit hex-digit hex-digit hex-digit		/// hex-digit hex-digit hex-digit hex-digit
/// \endverbatim		/// \endverbatim
///		///
StringLiteralParser::
StringLiteralParser(ArrayRef<Token> StringToks,		StringLiteralParser::StringLiteralParser(ArrayRef<Token> StringToks,
Preprocessor &PP, bool Complain)		Preprocessor &PP, bool Complain,
		ConversionState translationState)
: SM(PP.getSourceManager()), Features(PP.getLangOpts()),		: SM(PP.getSourceManager()), Features(PP.getLangOpts()),
Target(PP.getTargetInfo()), Diags(Complain ? &PP.getDiagnostics() :nullptr),		Target(PP.getTargetInfo()),
MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),		Diags(Complain ? &PP.getDiagnostics() : nullptr),
ResultPtr(ResultBuf.data()), hadError(false), Pascal(false) {		LT(PP.getLiteralTranslator()), MaxTokenLength(0), SizeBound(0),
		CharByteWidth(0), Kind(tok::unknown), ResultPtr(ResultBuf.data()),
		hadError(false), Pascal(false), TranslationState(translationState) {
init(StringToks);		init(StringToks);
}		}

void StringLiteralParser::init(ArrayRef<Token> StringToks){		void StringLiteralParser::init(ArrayRef<Token> StringToks){
// The literal token may have come from an invalid source location (e.g. due		// The literal token may have come from an invalid source location (e.g. due
// to a PCH error), in which case the token length will be 0.		// to a PCH error), in which case the token length will be 0.
if (StringToks.empty() || StringToks[0].getLength() < 2)		if (StringToks.empty() || StringToks[0].getLength() < 2)
return DiagnoseLexingError(SourceLocation());		return DiagnoseLexingError(SourceLocation());
▲ Show 20 Lines • Show All 63 Lines • ▼ Show 20 Lines	void StringLiteralParser::init(ArrayRef<Token> StringToks){
// Loop over all the strings, getting their spelling, and expanding them to		// Loop over all the strings, getting their spelling, and expanding them to
// wide strings as appropriate.		// wide strings as appropriate.
ResultPtr = &ResultBuf[0]; // Next byte to fill in.		ResultPtr = &ResultBuf[0]; // Next byte to fill in.

Pascal = false;		Pascal = false;

SourceLocation UDSuffixTokLoc;		SourceLocation UDSuffixTokLoc;

		ConversionState State = TranslationState;
		if (Kind == tok::wide_string_literal)
		State = TranslateToSystemCharset;
		tahonermann (Done): Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
		abhina.sreeskantharajan (Author, Done): I've now added an assertion when translating wide characters.
		else if (isUTFLiteral(Kind))
		State = NoTranslation;
		tahonermann (Not Done): The stored `TranslationState` should not be completely ignored for wide and UTF string literals. The standard permits things like the following: `#pragma rigoot L"bozit"`, `#pragma rigoot u"bozit"`, `_Pragma(L"rigoot bozit")`, `_Pragma(u8"rigoot bozit")`. For at least the `_Pragma(L"...")` case, the C++ standard states the `L` is ignored, but it doesn't say anything about other encoding prefixes.
		abhina.sreeskantharajan (Author, Done): Please correct me if I'm wrong, these Pragma strings are not parsed through StringLiteralParser; they are parsed in clang/lib/Lex/Pragma.cpp in this function: `void Preprocessor::Handle_Pragma(Token &Tok)`. So if they require translation, it would need to be done in that function.
		tahonermann (Not Done): Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.

		llvm::CharSetConverter *Converter =
		LT ? LT->getCharConversionTable(State) : nullptr;

for (unsigned i = 0, e = StringToks.size(); i != e; ++i) {		for (unsigned i = 0, e = StringToks.size(); i != e; ++i) {
const char *ThisTokBuf = &TokenBuf[0];		const char *ThisTokBuf = &TokenBuf[0];
// Get the spelling of the token, which eliminates trigraphs, etc. We know		// Get the spelling of the token, which eliminates trigraphs, etc. We know
// that ThisTokBuf points to a buffer that is big enough for the whole token		// that ThisTokBuf points to a buffer that is big enough for the whole token
// and 'spelled' tokens can only shrink.		// and 'spelled' tokens can only shrink.
bool StringInvalid = false;		bool StringInvalid = false;
unsigned ThisTokLen =		unsigned ThisTokLen =
Lexer::getSpelling(StringToks[i], ThisTokBuf, SM, Features,		Lexer::getSpelling(StringToks[i], ThisTokBuf, SM, Features,
▲ Show 20 Lines • Show All 79 Lines • ▼ Show 20 Lines	if (ThisTokBuf[0] == 'R') {
size_t CRLFPos = RemainingTokenSpan.find("\r\n");		size_t CRLFPos = RemainingTokenSpan.find("\r\n");
StringRef BeforeCRLF = RemainingTokenSpan.substr(0, CRLFPos);		StringRef BeforeCRLF = RemainingTokenSpan.substr(0, CRLFPos);
StringRef AfterCRLF = RemainingTokenSpan.substr(CRLFPos);		StringRef AfterCRLF = RemainingTokenSpan.substr(CRLFPos);

// Copy everything before the \r\n sequence into the string literal.		// Copy everything before the \r\n sequence into the string literal.
if (CopyStringFragment(StringToks[i], ThisTokBegin, BeforeCRLF))		if (CopyStringFragment(StringToks[i], ThisTokBegin, BeforeCRLF))
hadError = true;		hadError = true;

		if (!hadError && Converter) {
		SmallString<256> CpConv;
		int ResultLength = BeforeCRLF.size() * CharByteWidth;
		char *Cp = ResultPtr - ResultLength;
		Converter->convert(StringRef(Cp, ResultLength), CpConv);
		memmove(Cp, CpConv.data(), ResultLength);
		ResultPtr = Cp + CpConv.size();
		}
// Point into the \n inside the \r\n sequence and operate on the		// Point into the \n inside the \r\n sequence and operate on the
// remaining portion of the literal.		// remaining portion of the literal.
RemainingTokenSpan = AfterCRLF.substr(1);		RemainingTokenSpan = AfterCRLF.substr(1);
		rsmith (Not Done): Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?
		abhina.sreeskantharajan (Author, Done): Yes, we need to convert newlines as well. I think the current behaviour is already converting multi line raw strings correctly. I'll add a testcase for this.
}		}
} else {		} else {
if (ThisTokBuf[0] != '"') {		if (ThisTokBuf[0] != '"') {
// The file may have come from PCH and then changed after loading the		// The file may have come from PCH and then changed after loading the
// PCH; Fail gracefully.		// PCH; Fail gracefully.
return DiagnoseLexingError(StringToks[i].getLocation());		return DiagnoseLexingError(StringToks[i].getLocation());
}		}
++ThisTokBuf; // skip "		++ThisTokBuf; // skip "
Show All 14 Lines	if (ThisTokBuf[0] == 'R') {
while (ThisTokBuf != ThisTokEnd) {		while (ThisTokBuf != ThisTokEnd) {
// Is this a span of non-escape characters?		// Is this a span of non-escape characters?
if (ThisTokBuf[0] != '\\') {		if (ThisTokBuf[0] != '\\') {
const char *InStart = ThisTokBuf;		const char *InStart = ThisTokBuf;
do {		do {
++ThisTokBuf;		++ThisTokBuf;
} while (ThisTokBuf != ThisTokEnd && ThisTokBuf[0] != '\\');		} while (ThisTokBuf != ThisTokEnd && ThisTokBuf[0] != '\\');

		int Length = ThisTokBuf - InStart;
// Copy the character span over.		// Copy the character span over.
if (CopyStringFragment(StringToks[i], ThisTokBegin,		if (CopyStringFragment(StringToks[i], ThisTokBegin,
StringRef(InStart, ThisTokBuf - InStart)))		StringRef(InStart, ThisTokBuf - InStart)))
hadError = true;		hadError = true;

		if (!hadError && Converter) {
		SmallString<256> CpConv;
		int ResultLength = Length * CharByteWidth;
		char *Cp = ResultPtr - ResultLength;
		Converter->convert(StringRef(Cp, ResultLength), CpConv);
		memmove(Cp, CpConv.data(), ResultLength);
		ResultPtr = Cp + CpConv.size();
		}
continue;		continue;
}		}
// Is this a Universal Character Name escape?		// Is this a Universal Character Name escape?
if (ThisTokBuf[1] == 'u' || ThisTokBuf[1] == 'U') {		if (ThisTokBuf[1] == 'u' || ThisTokBuf[1] == 'U') {
EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd,		char *Cp = ResultPtr;
ResultPtr, hadError,		EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, ResultPtr,
		hadError,
FullSourceLoc(StringToks[i].getLocation(), SM),		FullSourceLoc(StringToks[i].getLocation(), SM),
CharByteWidth, Diags, Features);		CharByteWidth, Diags, Features);

		if (!hadError && Converter) {
		SmallString<8> CpConv;
		Converter->convert(StringRef(Cp), CpConv);
		memmove(Cp, CpConv.data(), CpConv.size());
		ResultPtr = Cp + CpConv.size();
		}
		tahonermann (Done): UCNs will require conversion here.
		abhina.sreeskantharajan (Author, Done): I've added code to translate UCN characters and have updated the testcase as well.
continue;		continue;
}		}
// Otherwise, this is a non-UCN escape character. Process it.		// Otherwise, this is a non-UCN escape character. Process it.
unsigned ResultChar =		unsigned ResultChar =
ProcessCharEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, hadError,		ProcessCharEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, hadError,
FullSourceLoc(StringToks[i].getLocation(), SM),		FullSourceLoc(StringToks[i].getLocation(), SM),
CharByteWidth*8, Diags, Features);		CharByteWidth * 8, Diags, Features, Converter);

if (CharByteWidth == 4) {		if (CharByteWidth == 4) {
// FIXME: Make the type of the result buffer correct instead of		// FIXME: Make the type of the result buffer correct instead of
// using reinterpret_cast.		// using reinterpret_cast.
llvm::UTF32 *ResultWidePtr = reinterpret_cast<llvm::UTF32*>(ResultPtr);		llvm::UTF32 *ResultWidePtr = reinterpret_cast<llvm::UTF32*>(ResultPtr);
*ResultWidePtr = ResultChar;		*ResultWidePtr = ResultChar;
ResultPtr += 4;		ResultPtr += 4;
} else if (CharByteWidth == 2) {		} else if (CharByteWidth == 2) {
▲ Show 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	if (SpellingPtr[0] == 'R') {
++SpellingPtr;		++SpellingPtr;
return SpellingPtr - SpellingStart + ByteNo;		return SpellingPtr - SpellingStart + ByteNo;
}		}

// Skip over the leading quote		// Skip over the leading quote
assert(SpellingPtr[0] == '"' && "Should be a string literal!");		assert(SpellingPtr[0] == '"' && "Should be a string literal!");
++SpellingPtr;		++SpellingPtr;

		ConversionState State = TranslationState;
		if (Kind == tok::wide_string_literal)
		State = TranslateToSystemCharset;
		else if (isUTFLiteral(Kind))
		State = NoTranslation;
		llvm::CharSetConverter *Converter =
		LT ? LT->getCharConversionTable(State) : nullptr;

// Skip over bytes until we find the offset we're looking for.		// Skip over bytes until we find the offset we're looking for.
while (ByteNo) {		while (ByteNo) {
assert(SpellingPtr < SpellingEnd && "Didn't find byte offset!");		assert(SpellingPtr < SpellingEnd && "Didn't find byte offset!");

// Step over non-escapes simply.		// Step over non-escapes simply.
if (*SpellingPtr != '\\') {		if (*SpellingPtr != '\\') {
++SpellingPtr;		++SpellingPtr;
--ByteNo;		--ByteNo;
Show All 9 Lines	if (SpellingPtr[1] == 'u' || SpellingPtr[1] == 'U') {
if (Len > ByteNo) {		if (Len > ByteNo) {
// ByteNo is somewhere within the escape sequence.		// ByteNo is somewhere within the escape sequence.
SpellingPtr = EscapePtr;		SpellingPtr = EscapePtr;
break;		break;
}		}
ByteNo -= Len;		ByteNo -= Len;
} else {		} else {
ProcessCharEscape(SpellingStart, SpellingPtr, SpellingEnd, HadError,		ProcessCharEscape(SpellingStart, SpellingPtr, SpellingEnd, HadError,
FullSourceLoc(Tok.getLocation(), SM),		FullSourceLoc(Tok.getLocation(), SM), CharByteWidth * 8,
CharByteWidth*8, Diags, Features);		Diags, Features, Converter);
--ByteNo;		--ByteNo;
}		}
assert(!HadError && "This method isn't valid on erroneous strings");		assert(!HadError && "This method isn't valid on erroneous strings");
}		}

return SpellingPtr-SpellingStart;		return SpellingPtr-SpellingStart;
}		}

/// Determine whether a suffix is a valid ud-suffix. We avoid treating reserved		/// Determine whether a suffix is a valid ud-suffix. We avoid treating reserved
/// suffixes as ud-suffixes, because the diagnostic experience is better if we		/// suffixes as ud-suffixes, because the diagnostic experience is better if we
/// treat it as an invalid suffix.		/// treat it as an invalid suffix.
bool StringLiteralParser::isValidUDSuffix(const LangOptions &LangOpts,		bool StringLiteralParser::isValidUDSuffix(const LangOptions &LangOpts,
StringRef Suffix) {		StringRef Suffix) {
return NumericLiteralParser::isValidUDSuffix(LangOpts, Suffix) ||		return NumericLiteralParser::isValidUDSuffix(LangOpts, Suffix) ||
Suffix == "sv";		Suffix == "sv";
}		}
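
The hunks above repeat one pattern: copy a span into the result buffer, convert the bytes just written, copy the converted bytes back, and move the write pointer. The following is a minimal standalone sketch of that pattern, not the patch's code. `ConvertFn`, `toyConvert`, and `convertTail` are hypothetical names, and the sketch assumes the copy-back length should be the converted size (the hunks above pass `ResultLength` to `memmove`) and that the buffer has room for any growth.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical stand-in for llvm::CharSetConverter: maps bytes in the
// internal charset to bytes in the target charset.
using ConvertFn = std::function<std::string(const std::string &)>;

// Toy converter used only for illustration: every '-' becomes "--", so the
// converted span can be longer than the input span.
std::string toyConvert(const std::string &In) {
  std::string Out;
  for (char C : In) {
    Out.push_back(C);
    if (C == '-')
      Out.push_back('-');
  }
  return Out;
}

// Converts the last `Length` bytes written before `ResultPtr` and returns
// the new write position. Assumes the buffer can hold the converted bytes.
char *convertTail(char *ResultPtr, std::size_t Length,
                  const ConvertFn &Convert) {
  char *Cp = ResultPtr - Length;
  std::string Converted = Convert(std::string(Cp, Length));
  // Copy the converted size, not the input size, so a conversion that
  // changes the byte count is neither truncated nor over-read.
  std::memmove(Cp, Converted.data(), Converted.size());
  return Cp + Converted.size();
}
```

A caller would pass the pointer just past the fragment it copied and continue writing at the returned pointer.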

clang/lib/Lex/LiteralTranslator.cpp

This file was added.

				//===--- LiteralTranslator.cpp - Translator for String Literals -----------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Lex/LiteralTranslator.h"
				#include "clang/Basic/DiagnosticDriver.h"

				using namespace llvm;

				llvm::CharSetConverter *
				LiteralTranslator::getConversionTable(const char *Codepage) {
				auto TableIter = ExecCharsetTables.find(Codepage);
				if (TableIter != ExecCharsetTables.end())
				return &TableIter->second;
				return nullptr;
				}

				CharsetTableStatusCode
				LiteralTranslator::findOrCreateExecCharsetTable(const char *To) {
				const char *From = InternalCharset.data();
				llvm::CharSetConverter *Converter = getConversionTable(To);
				if (Converter)
				return CharsetTableOk;

				ErrorOr<CharSetConverter> ErrorOrConverter =
				llvm::CharSetConverter::create(From, To);
				if (!ErrorOrConverter)
				return InvalidCharsetTable;
				ExecCharsetTables.insert_or_assign(StringRef(To),
				std::move(*ErrorOrConverter));
				return CharsetTableOk;
				}

				llvm::CharSetConverter *
				LiteralTranslator::getCharConversionTable(ConversionState TranslationState) {
				StringRef CodePage;
				if (TranslationState == TranslateToSystemCharset)
				CodePage = SystemCharset;
				else if (TranslationState == TranslateToExecCharset)
				CodePage = ExecCharset;
				else
				CodePage = InternalCharset;
				return getConversionTable(CodePage.data());
				}

				void LiteralTranslator::setTranslationTables(const clang::LangOptions &Opts,
				const clang::TargetInfo &TInfo,
				clang::DiagnosticsEngine &Diags) {
				using namespace llvm;
				SystemCharset = TInfo.getTriple().getSystemCharset();
				InternalCharset = "UTF-8";
				ExecCharset = Opts.ExecCharset.empty() ? InternalCharset : Opts.ExecCharset;
				// Create translation table between internal and system charset
				if (!InternalCharset.equals(SystemCharset))
				findOrCreateExecCharsetTable(SystemCharset.data());

				// Create translation table between internal and exec charset specified
				// in fexec-charset option.
				if (InternalCharset.equals(ExecCharset))
				return;
				CharsetTableStatusCode RC = findOrCreateExecCharsetTable(ExecCharset.data());

				if (RC != CharsetTableOk)
				Diags.Report(clang::diag::err_drv_invalid_value)
				<< "-fexec-charset" << ExecCharset;
				}
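
The lookup-then-create flow of `findOrCreateExecCharsetTable` can be illustrated without LLVM. This is a sketch under stated assumptions: `ToyConverter`, `createConverter`, and `ToyTranslator` are hypothetical stand-ins, the set of accepted charset names is invented for the example, and the cache is keyed by an owned `std::string` rather than a `StringRef`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for llvm::CharSetConverter.
struct ToyConverter {
  std::string From, To;
};

// Hypothetical factory: only a fixed set of target names is accepted here.
std::optional<ToyConverter> createConverter(const std::string &From,
                                            const std::string &To) {
  if (To == "IBM-1047" || To == "ISO8859-1" || To == "UTF-8")
    return ToyConverter{From, To};
  return std::nullopt;
}

class ToyTranslator {
  std::string Internal = "UTF-8";
  std::map<std::string, ToyConverter> Tables;

public:
  // Returns the cached converter for `To`, or nullptr if none exists yet.
  ToyConverter *lookup(const std::string &To) {
    auto It = Tables.find(To);
    return It == Tables.end() ? nullptr : &It->second;
  }

  // Creates and caches a converter on first use. A false result means the
  // name was rejected; the caller would turn that into a diagnostic.
  bool findOrCreate(const std::string &To) {
    if (lookup(To))
      return true;
    std::optional<ToyConverter> C = createConverter(Internal, To);
    if (!C)
      return false;
    Tables.emplace(To, std::move(*C));
    return true;
  }
};
```

Because `std::map` never relocates its elements, pointers returned by `lookup` stay valid as more converters are added.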

clang/lib/Lex/Preprocessor.cpp

Show First 20 Lines • Show All 79 Lines • ▼ Show 20 Lines	Preprocessor::Preprocessor(std::shared_ptr<PreprocessorOptions> PPOpts,
DiagnosticsEngine &diags, LangOptions &opts,		DiagnosticsEngine &diags, LangOptions &opts,
SourceManager &SM, HeaderSearch &Headers,		SourceManager &SM, HeaderSearch &Headers,
ModuleLoader &TheModuleLoader,		ModuleLoader &TheModuleLoader,
IdentifierInfoLookup *IILookup, bool OwnsHeaders,		IdentifierInfoLookup *IILookup, bool OwnsHeaders,
TranslationUnitKind TUKind)		TranslationUnitKind TUKind)
: PPOpts(std::move(PPOpts)), Diags(&diags), LangOpts(opts),		: PPOpts(std::move(PPOpts)), Diags(&diags), LangOpts(opts),
FileMgr(Headers.getFileMgr()), SourceMgr(SM),		FileMgr(Headers.getFileMgr()), SourceMgr(SM),
ScratchBuf(new ScratchBuffer(SourceMgr)), HeaderInfo(Headers),		ScratchBuf(new ScratchBuffer(SourceMgr)), HeaderInfo(Headers),
TheModuleLoader(TheModuleLoader), ExternalSource(nullptr),		TheModuleLoader(TheModuleLoader), LT(new LiteralTranslator()),
		ExternalSource(nullptr),
		tahonermann (Done): Per comments elsewhere, please try to make `LT` a non-pointer non-reference data member.
// As the language options may have not been loaded yet (when		// As the language options may have not been loaded yet (when
// deserializing an ASTUnit), adding keywords to the identifier table is		// deserializing an ASTUnit), adding keywords to the identifier table is
// deferred to Preprocessor::Initialize().		// deferred to Preprocessor::Initialize().
Identifiers(IILookup), PragmaHandlers(new PragmaNamespace(StringRef())),		Identifiers(IILookup), PragmaHandlers(new PragmaNamespace(StringRef())),
TUKind(TUKind), SkipMainFilePreamble(0, true),		TUKind(TUKind), SkipMainFilePreamble(0, true),
CurSubmoduleState(&NullSubmoduleState) {		CurSubmoduleState(&NullSubmoduleState) {
OwnsHeaderSearch = OwnsHeaders;		OwnsHeaderSearch = OwnsHeaders;

▲ Show 20 Lines • Show All 1,332 Lines • Show Last 20 Lines

clang/test/CodeGen/systemz-charset.c

This file was added.

				// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s
				// RUN: %clang %s -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s

				const char *UpperCaseLetters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
				// CHECK: c"\C1\C2\C3\C4\C5\C6\C7\C8\C9\D1\D2\D3\D4\D5\D6\D7\D8\D9\E2\E3\E4\E5\E6\E7\E8\E9\00"
				tahonermann (Done): `const char *` please :)

				const char *LowerCaseLetters = "abcdefghijklmnopqrstuvwxyz";
				//CHECK: c"\81\82\83\84\85\86\87\88\89\91\92\93\94\95\96\97\98\99\A2\A3\A4\A5\A6\A7\A8\A9\00"

				const char *Digits = "0123456789";
				// CHECK: c"\F0\F1\F2\F3\F4\F5\F6\F7\F8\F9\00"

				const char *SpecialCharacters = " .<(+|&!$*);^-/,%%_>`:#@=";
				// CHECK: c"@KLMNOPZ[\\]^_`akllmnyz{|~\00"

				char *EscapeCharacters = "\a\b\f\n\r\t\v\\\'\"\?";
				//CHECK: c"/\16\0C\15\0D\05\0B\E0}\7Fo\00"

				char *HexCharacters = "\x12\x13\x14";
				//CHECK: c"\12\13\14\00"

				char *OctalCharacters = "\141\142\143";
				//CHECK: c"abc\00"
				tahonermann (Done): `const char*` here too please.

				char singleChar = 'a';
				tahonermann (Done): Add validation of UCNs. Something like: `const char *UcnCharacters = "\u00E2\u00AC\U000000DF";` with `// CHECK: c"\42\B0\59\00"`
				//CHECK: i8 -127

				const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
				//CHECK: c"B\B0Y\00"

				const char *Unicode = "ÿ";
				//CHECK: c"\DF\00"
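
The expected bytes in the letter and digit CHECK lines above follow from the IBM-1047 layout, where letters sit in three non-contiguous runs per case. The sketch below reproduces only those ranges. `asciiToIbm1047` is a hypothetical helper, not part of the patch, and it assumes the host compiler itself uses an ASCII-compatible encoding for character literals.

```cpp
#include <cassert>
#include <cstdint>

// Maps an ASCII letter or digit to its IBM-1047 code point and returns 0
// for anything else. Only the ranges checked by the test file are covered.
std::uint8_t asciiToIbm1047(char C) {
  auto At = [](int Base, int Offset) {
    return static_cast<std::uint8_t>(Base + Offset);
  };
  if (C >= 'A' && C <= 'I') return At(0xC1, C - 'A');
  if (C >= 'J' && C <= 'R') return At(0xD1, C - 'J');
  if (C >= 'S' && C <= 'Z') return At(0xE2, C - 'S');
  if (C >= 'a' && C <= 'i') return At(0x81, C - 'a');
  if (C >= 'j' && C <= 'r') return At(0x91, C - 'j');
  if (C >= 's' && C <= 'z') return At(0xA2, C - 's');
  if (C >= '0' && C <= '9') return At(0xF0, C - '0');
  return 0;
}
```

For example, `'a'` maps to 0x81, which is why the `singleChar` check expects `i8 -127` when the byte is printed as a signed 8-bit value.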

clang/test/CodeGen/systemz-charset.cpp

This file was added.

				// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s

				const char *RawString = R"(Hello\n)";
				//CHECK: c"\C8\85\93\93\96\E0\95\00"

				char UnicodeChar8 = u8'1';
				//CHECK: i8 49
				char16_t UnicodeChar16 = u'1';
				//CHECK: i16 49
				char32_t UnicodeChar32 = U'1';
				//CHECK: i32 49

				const char *UnicodeString8 = u8"Hello";
				//CHECK: c"Hello\00"
				const char16_t *UnicodeString16 = u"Hello";
				//CHECK: [6 x i16] [i16 72, i16 101, i16 108, i16 108, i16 111, i16 0]
				const char32_t *UnicodeString32 = U"Hello";
				//CHECK: [6 x i32] [i32 72, i32 101, i32 108, i32 108, i32 111, i32 0]

				const char *UnicodeRawString8 = u8R"("Hello\")";
				//CHECK: c"\22Hello\\\22\00"
				const char16_t *UnicodeRawString16 = uR"("Hello\")";
				//CHECK: [9 x i16] [i16 34, i16 72, i16 101, i16 108, i16 108, i16 111, i16 92, i16 34, i16 0]
				const char32_t *UnicodeRawString32 = UR"("Hello\")";
				//CHECK: [9 x i32] [i32 34, i32 72, i32 101, i32 108, i32 108, i32 111, i32 92, i32 34, i32 0]
				tahonermann (Done): This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.
				abhina.sreeskantharajan (Author, Done): Good idea, I added those testcases as per your suggestion.
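
The `u8`, `u`, and `U` literals in this test keep their Unicode values because the parser overrides the requested conversion state based on the literal kind. The sketch below isolates that selection. `LiteralKind` and `selectState` are hypothetical simplifications of the token kinds and inline logic in the patch, and the wide-literal branch mirrors this revision of the diff, which the reviewers questioned.

```cpp
#include <cassert>

enum class LiteralKind { Ordinary, Wide, UTF8, UTF16, UTF32 };
enum class ConversionState {
  NoTranslation,
  TranslateToSystemCharset,
  TranslateToExecCharset
};

// True for literals with a u8, u, or U prefix.
bool isUTFLiteral(LiteralKind K) {
  return K == LiteralKind::UTF8 || K == LiteralKind::UTF16 ||
         K == LiteralKind::UTF32;
}

// Picks the conversion state for a literal: UTF literals are never
// translated, wide literals use the system charset (as in this revision),
// and ordinary literals keep whatever the caller requested.
ConversionState selectState(LiteralKind K, ConversionState Requested) {
  if (K == LiteralKind::Wide)
    return ConversionState::TranslateToSystemCharset;
  if (isUTFLiteral(K))
    return ConversionState::NoTranslation;
  return Requested;
}
```

Keeping the selection in one function would also avoid repeating the same `if`/`else if` chain at each call site.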

clang/test/Driver/cl-options.c

	Show First 20 Lines • Show All 203 Lines • ▼ Show 20 Lines
	// RUN: %clang_cl /E /EP /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s			// RUN: %clang_cl /E /EP /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s
	// RUN: %clang_cl /EP /P /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s			// RUN: %clang_cl /EP /P /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s
	// showIncludes_E-NOT: warning: argument unused during compilation: '--show-includes'			// showIncludes_E-NOT: warning: argument unused during compilation: '--show-includes'

	// /source-charset: should warn on everything except UTF-8.			// /source-charset: should warn on everything except UTF-8.
	// RUN: %clang_cl /source-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=source-charset-utf-16 %s			// RUN: %clang_cl /source-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=source-charset-utf-16 %s
	// source-charset-utf-16: invalid value 'utf-16' in '/source-charset:utf-16'			// source-charset-utf-16: invalid value 'utf-16' in '/source-charset:utf-16'

	// /execution-charset: should warn on everything except UTF-8.			// /execution-charset: should warn on invalid charsets.
	// RUN: %clang_cl /execution-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-utf-16 %s			// RUN: %clang_cl /execution-charset:invalid-charset -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-invalid %s
	// execution-charset-utf-16: invalid value 'utf-16' in '/execution-charset:utf-16'			// execution-charset-invalid: invalid value 'invalid-charset' in '-fexec-charset'
				rsmith (Done): Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)
	//			//

	// RUN: %clang_cl /Umymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s			// RUN: %clang_cl /Umymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s
				rsmith (Done): Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.
				abhina.sreeskantharajan (Author, Done): You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.
	// RUN: %clang_cl /U mymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s			// RUN: %clang_cl /U mymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s
	// U: "-U" "mymacro"			// U: "-U" "mymacro"

	// RUN: %clang_cl /validate-charset -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset %s			// RUN: %clang_cl /validate-charset -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset %s
	// validate-charset: -Winvalid-source-encoding			// validate-charset: -Winvalid-source-encoding

	// RUN: %clang_cl /validate-charset- -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset_ %s			// RUN: %clang_cl /validate-charset- -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset_ %s
	// validate-charset_: -Wno-invalid-source-encoding			// validate-charset_: -Wno-invalid-source-encoding
	▲ Show 20 Lines • Show All 470 Lines • Show Last 20 Lines

clang/test/Driver/clang_f_opts.c

	Show First 20 Lines • Show All 203 Lines • ▼ Show 20 Lines
	// CHECK-MAX-O: -O3			// CHECK-MAX-O: -O3

	// RUN: %clang -S -O20 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-O %s			// RUN: %clang -S -O20 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-O %s
	// CHECK-INVALID-O: warning: optimization level '-O20' is not supported; using '-O3' instead			// CHECK-INVALID-O: warning: optimization level '-O20' is not supported; using '-O3' instead

	// RUN: %clang -### -S -finput-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-CHARSET %s			// RUN: %clang -### -S -finput-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-CHARSET %s
	// CHECK-INVALID-CHARSET: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'			// CHECK-INVALID-CHARSET: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'

	// RUN: %clang -### -S -fexec-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s			// RUN: %clang -### -S -fexec-charset=invalid-charset -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
	// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'iso-8859-1' in '-fexec-charset=iso-8859-1'			// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'invalid-charset' in '-fexec-charset'
				rsmith (Done): Again, this is not a useful test.
				tahonermann (Done): This looks good. Can tests also be added to validate that the `UTF-8`, `ISO8859-1`, and `IBM-1047` option arguments are properly recognized?

	// Test that we don't error on these.			// Test that we don't error on these.
	// RUN: %clang -### -S -Werror \			// RUN: %clang -### -S -Werror \
	// RUN: -falign-functions -falign-functions=2 -fno-align-functions \			// RUN: -falign-functions -falign-functions=2 -fno-align-functions \
	// RUN: -fasynchronous-unwind-tables -fno-asynchronous-unwind-tables \			// RUN: -fasynchronous-unwind-tables -fno-asynchronous-unwind-tables \
	// RUN: -fbuiltin -fno-builtin \			// RUN: -fbuiltin -fno-builtin \
	// RUN: -fdiagnostics-show-location=once \			// RUN: -fdiagnostics-show-location=once \
	// RUN: -ffloat-store -fno-float-store \			// RUN: -ffloat-store -fno-float-store \

llvm/include/llvm/ADT/Triple.h

public:
/// component of the triple, or "" if empty.		/// component of the triple, or "" if empty.
StringRef getEnvironmentName() const;		StringRef getEnvironmentName() const;

/// getOSAndEnvironmentName - Get the operating system and optional		/// getOSAndEnvironmentName - Get the operating system and optional
/// environment components as a single string (separated by a '-'		/// environment components as a single string (separated by a '-'
/// if the environment component is present).		/// if the environment component is present).
StringRef getOSAndEnvironmentName() const;		StringRef getOSAndEnvironmentName() const;

		/// getSystemCharset - Get the system charset of the triple.
		StringRef getSystemCharset() const;

/// @}		/// @}
/// @name Convenience Predicates		/// @name Convenience Predicates
/// @{		/// @{

/// Test whether the architecture is 64-bit		/// Test whether the architecture is 64-bit
///		///
/// Note that this tests for 64-bit pointer width, and nothing else. Note		/// Note that this tests for 64-bit pointer width, and nothing else. Note
/// that we intentionally expose only three predicates, 64-bit, 32-bit, and		/// that we intentionally expose only three predicates, 64-bit, 32-bit, and

llvm/lib/Support/Triple.cpp

StringRef Triple::getEnvironmentName() const {
return Tmp.split('-').second; // Strip third component		return Tmp.split('-').second; // Strip third component
}		}

StringRef Triple::getOSAndEnvironmentName() const {		StringRef Triple::getOSAndEnvironmentName() const {
StringRef Tmp = StringRef(Data).split('-').second; // Strip first component		StringRef Tmp = StringRef(Data).split('-').second; // Strip first component
return Tmp.split('-').second; // Strip second component		return Tmp.split('-').second; // Strip second component
}		}
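The two-step split used by `getOSAndEnvironmentName` can be illustrated outside LLVM. The following is a minimal sketch, assuming `std::string_view` in place of `llvm::StringRef`; the helper names `splitFirst` and `osAndEnvironmentName` are hypothetical and not part of the patch. `splitFirst` imitates `StringRef::split(char)`, which returns the whole string and an empty remainder when the separator is absent.

```cpp
#include <cassert>
#include <string_view>
#include <utility>

// Hypothetical helper imitating llvm::StringRef::split(char): splits at the
// first occurrence of Sep and returns {before, after}. If Sep does not occur,
// returns {whole string, ""}.
inline std::pair<std::string_view, std::string_view>
splitFirst(std::string_view S, char Sep = '-') {
  const auto Pos = S.find(Sep);
  if (Pos == std::string_view::npos)
    return {S, std::string_view{}};
  return {S.substr(0, Pos), S.substr(Pos + 1)};
}

// Hypothetical free-function analogue of Triple::getOSAndEnvironmentName():
// drops the first two '-'-separated components (architecture and vendor) and
// returns whatever remains.
inline std::string_view osAndEnvironmentName(std::string_view Triple) {
  std::string_view Tmp = splitFirst(Triple).second; // Strip first component
  return splitFirst(Tmp).second;                    // Strip second component
}
```

For a triple such as `x86_64-unknown-linux-gnu`, the sketch leaves the OS and environment joined by the original `-`; a triple with fewer than three components yields an empty string.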

		// System charset on z/OS is IBM-1047 and UTF-8 otherwise
		StringRef Triple::getSystemCharset() const {
		if (getOS() == llvm::Triple::ZOS)
		return "IBM-1047";
		tahonermann: No support for targeting the z/OS Enhanced ASCII run-time?
		abhina.sreeskantharajan (author): We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).
		return "UTF-8";
		}
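The default-selection rule in `getSystemCharset` can be exercised in isolation. This is a minimal sketch, not the patch's code: the enum `OSKind` is a hypothetical stand-in for `llvm::Triple::OSType`, and `systemCharsetFor` is a hypothetical free function that mirrors only the branch shown in the diff (IBM-1047 for z/OS, UTF-8 otherwise).

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-in for llvm::Triple::OSType; only a few values are
// listed, for illustration.
enum class OSKind { Linux, Win32, ZOS };

// Mirrors the branch in the patch's Triple::getSystemCharset():
// IBM-1047 (EBCDIC) when targeting z/OS, UTF-8 for every other OS.
constexpr std::string_view systemCharsetFor(OSKind OS) {
  if (OS == OSKind::ZOS)
    return "IBM-1047";
  return "UTF-8";
}
```

As the review thread notes, this rule has no case for the z/OS Enhanced ASCII run-time; the sketch keeps the same single-default behaviour.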

static unsigned EatNumber(StringRef &Str) {		static unsigned EatNumber(StringRef &Str) {
assert(!Str.empty() && Str[0] >= '0' && Str[0] <= '9' && "Not a number");		assert(!Str.empty() && Str[0] >= '0' && Str[0] <= '9' && "Not a number");
unsigned Result = 0;		unsigned Result = 0;

do {		do {
// Consume the leading digit.		// Consume the leading digit.
Result = Result*10 + (Str[0] - '0');		Result = Result*10 + (Str[0] - '0');


Revision Contents

Diff 313112
