This is an archive of the discontinued LLVM Phabricator instance.

Differential D93031

Enable fexec-charset option
Abandoned, Public

Authored by abhina.sreeskantharajan on Dec 10 2020, 5:49 AM.


Details

Reviewers

Kai
fanbo-meng
tahonermann
hubert.reinterpretcast
efriedma
SeanP
rsmith
ThePhD
cor3ntin
joerg
jansvoboda11

Summary

This patch enables the fexec-charset option to control the execution charset of string literals. It also sets the default internal charset, system charset, and execution charset, with z/OS-specific defaults on z/OS and UTF-8 on all other platforms.
This patch depends on https://reviews.llvm.org/D88741.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	40 ms	x64 debian > libomptarget.mapping::declare_mapper_nested_default_mappers_array.cpp

Event Timeline


Herald added subscribers: dexonsmith, dang, hiraditya, mgorny. Dec 10 2020, 5:49 AM

abhina.sreeskantharajan requested review of this revision. Dec 10 2020, 5:49 AM

Herald added projects: Restricted Project, Restricted Project. Dec 10 2020, 5:49 AM

Herald added subscribers: llvm-commits, cfe-commits.

Harbormaster completed remote builds in B81828: Diff 310863. Dec 10 2020, 6:01 AM

abhina.sreeskantharajan added a parent revision: D88741: [SystemZ/z/OS] Add utility class for char set conversion. Dec 10 2020, 6:24 AM

abhina.sreeskantharajan added reviewers: Kai, fanbo-meng, tahonermann, hubert.reinterpretcast, efriedma, SeanP. Dec 10 2020, 1:02 PM

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

clang/include/clang/Lex/LiteralTranslator.h
29–31 ↗	(On Diff #310863)	It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.
32 ↗	(On Diff #310863)	Similarly, use of a global cache here will require you guard it with a mutex. As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`?
clang/lib/Lex/LiteralSupport.cpp
233–241	Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
238	Reinterpreting an `unsigned` as a `char` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.
240	What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
1342	Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)
clang/test/Driver/cl-options.c
218	Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.
clang/test/Driver/clang_f_opts.c
226	Again, this is not a useful test.
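
The two endianness remarks on LiteralSupport.cpp (lines 238 and 240) can be illustrated with a minimal sketch. The helper names `lowByteAsChar` and `packBigEndian` are hypothetical and not part of the patch. The point is only that the value is narrowed into a real `char` object rather than read through a reinterpreted pointer, and that multi-byte output is packed with shifts so the result does not depend on the host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper: narrow the value into a real char object instead of
// reading the first byte of an `unsigned` through a reinterpreted pointer
// (which would yield the high-order byte on a big-endian host).
inline char lowByteAsChar(unsigned Value) {
  assert(Value <= 0xFF && "value does not fit in one code unit");
  return static_cast<char>(static_cast<unsigned char>(Value));
}

// Hypothetical helper: pack up to four encoded bytes into one integer using
// shifts, most significant byte first, regardless of host byte order.
inline std::uint32_t packBigEndian(std::string_view Bytes) {
  assert(Bytes.size() <= 4 && "result would not fit in 32 bits");
  std::uint32_t Result = 0;
  for (char C : Bytes)
    Result = (Result << 8) | static_cast<unsigned char>(C);
  return Result;
}
```

A one-byte view of the value can then be formed from the `char` object, for example `char C = lowByteAsChar(V); std::string_view View(&C, 1);`.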

tahonermann added inline comments. Dec 11 2020, 11:14 PM

clang/include/clang/Driver/Options.td
4433–4434	How about substituting "character set", "character encoding", or "charset" for "codepage"? This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry? The set of recognized names can be a superset of the names that are actually supported.
clang/include/clang/Lex/LiteralSupport.h
193	Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.
243	Same concern here with respect to persisting the conversion state as a data member.
245	This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.
clang/lib/Driver/ToolChains/Clang.cpp
6232–6241	I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗	(On Diff #310863)	I wouldn't expect the cast to `std::string` to be needed here.
clang/lib/Lex/LiteralSupport.cpp
233–241	For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.
236	What should happen if `ResultChar` >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems `ResultChar` should be converted to something.
1787–1799	UCNs will require conversion here.
llvm/lib/Support/Triple.cpp
1051–1052	No support for targeting the z/OS Enhanced ASCII run-time?

Thanks for your quick reviews! I haven't addressed all the comments yet but I plan to address all of them. I put up this patch early because it has a few major changes:

move the LiteralTranslator class into Preprocessor instead of keeping it as a static global class
add isUTFLiteral() function to detect strings like u8"..." and stop translation
translate wide string literals to the system charset for now (we don't have an implementation plan for -fwide-charset right now)
remove tests that check fexec-charset will not accept non-UTF charsets
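
As a rough sketch of the `isUTFLiteral()` idea in the list above, the conversion applied to a literal can be chosen from its kind. The `LiteralKind` and `ConversionAction` enumerations are hypothetical stand-ins (Clang itself distinguishes literal kinds by token kind), and the mapping follows the reviewers' request rather than this revision of the patch: no conversion for `u8`/`u`/`U` literals, the wide execution character set for `L` literals, and the narrow execution character set otherwise.

```cpp
#include <cassert>

// Hypothetical stand-in for the literal's encoding prefix.
enum class LiteralKind { Ordinary, Wide, UTF8, UTF16, UTF32 };

// Hypothetical description of what the lexer should do with the literal.
enum class ConversionAction { None, ToExecCharset, ToWideExecCharset };

// u8"...", u"..." and U"..." must keep their UTF encodings.
constexpr bool isUTFLiteral(LiteralKind Kind) {
  return Kind == LiteralKind::UTF8 || Kind == LiteralKind::UTF16 ||
         Kind == LiteralKind::UTF32;
}

// Pick the conversion for a literal of the given kind.
constexpr ConversionAction actionFor(LiteralKind Kind) {
  if (isUTFLiteral(Kind))
    return ConversionAction::None;
  if (Kind == LiteralKind::Wide)
    return ConversionAction::ToWideExecCharset;
  return ConversionAction::ToExecCharset;
}
```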

Harbormaster completed remote builds in B82467: Diff 311911. Dec 15 2020, 8:23 AM

In D93031#2447230, @rsmith wrote:

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of u8"..." / u"..." / U"..." literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

clang/include/clang/Driver/Options.td
4433–4434	I've updated the description from codepage to charset. It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.
clang/include/clang/Lex/LiteralTranslator.h
32 ↗	(On Diff #310863)	Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.
clang/lib/Driver/ToolChains/Clang.cpp
6232–6241	Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗	(On Diff #310863)	Without that cast, I get the following build error: error: no viable overloaded '='
clang/test/Driver/cl-options.c
218	You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.
llvm/lib/Support/Triple.cpp
1051–1052	We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).

abhina.sreeskantharajan marked 5 inline comments as done. Dec 15 2020, 11:05 AM

tahonermann added inline comments. Dec 16 2020, 8:35 AM

clang/include/clang/Driver/Options.td
4433–4434	Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here:
  def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
    HelpText<"Set the execution <charset> for string and character literals. Supported character encodings include XXX and those supported by the host iconv library.">;
clang/lib/Driver/ToolChains/Clang.cpp
6232–6241	Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?
clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗	(On Diff #310863)	Ok, rather than a cast, I suggest: `Opts.ExecCharset = Value.str();`
clang/lib/Lex/LiteralSupport.cpp
1342–1343	Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
1614–1618	The stored `TranslationState` should not be completely ignored for wide and UTF string literals. The standard permits things like the following.
  #pragma rigoot L"bozit"
  #pragma rigoot u"bozit"
  _Pragma(L"rigoot bozit")
  _Pragma(u8"rigoot bozit")
For at least the `_Pragma(L"...")` case, the C++ standard states the `L` is ignored, but it doesn't say anything about other encoding prefixes.
1615–1616	Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
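
The point about the wide execution character set can be sketched as a small selection function. Everything here is an assumption made for illustration: the function name, the idea of keying only on the bit width of `wchar_t`, and the returned encoding names. z/OS would need its own wide EBCDIC case, which is left out.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: pick the default wide execution encoding for
// non-z/OS targets from the width of wchar_t in bits. A 16-bit wchar_t
// (for example with -fshort-wchar) implies UTF-16; a 32-bit one, UTF-32.
inline std::string_view defaultWideExecEncoding(unsigned WCharWidthInBits) {
  assert((WCharWidthInBits == 16 || WCharWidthInBits == 32) &&
         "unexpected wchar_t width");
  return WCharWidthInBits == 16 ? "UTF-16" : "UTF-32";
}
```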

tahonermann added inline comments. Dec 16 2020, 8:55 AM

clang/test/CodeGen/systemz-charset.c
5	`const char *` please :)
25	Add validation of UCNs. Something like:
  const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
  // CHECK: c"\42\B0\59\00"
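
For context on the UCN test suggested above: a UCN names a code point, which is first expressed in the internal UTF-8 encoding and only then converted to the execution charset. The sketch below shows the first step only. `encodeUTF8` is a hypothetical helper, not an LLVM API, and it does not reject surrogate code points. The second step (UTF-8 to IBM-1047) is left to the converter; per the suggested CHECK line it should yield the bytes 42, B0, and 59 for these three characters.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Surrogate code points are not rejected in this sketch.
inline std::string encodeUTF8(std::uint32_t CodePoint) {
  assert(CodePoint <= 0x10FFFF && "not a Unicode code point");
  std::string Out;
  if (CodePoint < 0x80) {
    // One byte: plain ASCII.
    Out += static_cast<char>(CodePoint);
  } else if (CodePoint < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx.
    Out += static_cast<char>(0xC0 | (CodePoint >> 6));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint < 0x10000) {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
    Out += static_cast<char>(0xE0 | (CodePoint >> 12));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    Out += static_cast<char>(0xF0 | (CodePoint >> 18));
    Out += static_cast<char>(0x80 | ((CodePoint >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  }
  return Out;
}
```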

Thanks for your patience, I've addressed some more comments. Here is the summary of the changes in this patch:

add translation for UCN strings, update testcase
fix helptext for fexec-charset option in Options.td
check for invalid charsets when parsing driver options.
fix up char conversion code

Harbormaster completed remote builds in B83156: Diff 313112. Dec 21 2020, 8:15 AM

abhina.sreeskantharajan marked 11 inline comments as done. Dec 21 2020, 8:25 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Driver/Options.td
4433–4434	I've updated the HelpText with your suggested description.
clang/include/clang/Lex/LiteralSupport.h
193	Thanks, I've removed this.
243	If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
clang/lib/Driver/ToolChains/Clang.cpp
6232–6241	I initially thought it would be a performance issue if we created the Converter twice, once here and once in the Preprocessor. But I do think it's a good idea to diagnose this early. I've modified the code to diagnose and emit an error here.
clang/lib/Frontend/CompilerInvocation.cpp
3573 ↗	(On Diff #310863)	Thanks, I've applied this change.
clang/lib/Lex/LiteralSupport.cpp
233–241	Right, stateful encodings may be a problem we will need to revisit later as well.
236	This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.
238	Thanks, I've created a char instead.
240	This may be a problem we need to revisit since ResultChar is expecting a char.
1787–1799	I've added code to translate UCN characters and have updated the testcase as well.

abhina.sreeskantharajan marked 8 inline comments as done. Dec 21 2020, 8:25 AM

abhina.sreeskantharajan added inline comments. Dec 21 2020, 8:29 AM

clang/lib/Lex/LiteralSupport.cpp
1342–1343	Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?

abhina.sreeskantharajan added a reviewer: rsmith. Dec 21 2020, 8:32 AM

tahonermann added inline comments. Dec 23 2020, 10:04 PM

clang/include/clang/Lex/LiteralSupport.h
243	I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.
clang/lib/Driver/ToolChains/Clang.cpp
6244–6246	Thank you for adding this.
clang/lib/Lex/LiteralSupport.cpp
236	Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
237	As Richard previously noted, this `memcpy()` needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code unit? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.
1342–1343	Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?
clang/test/CodeGen/systemz-charset.c
17–24	`const char*` here too please.
clang/test/CodeGen/systemz-charset.cpp
2–26	This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.
clang/test/Driver/clang_f_opts.c
225–233	This looks good. Can tests also be added to validate that the `UTF-8`, `ISO8859-1`, and `IBM-1047` option arguments are properly recognized?

Thanks for the review! I've addressed most of the comments but I still need to work on the translation issues in CharLiteralParser that were kindly pointed out by Tom and Richard. Here is the summary of changes in this patch:

Removed TranslationState as a member of StringLiteralParser and pass it as an argument instead
Added an assertion for wide character translation instead of translating them to the system charset
Invalid char escapes are changed to '?' and then translated
Updated testcases as requested

Harbormaster completed remote builds in B83668: Diff 313990. Dec 29 2020, 11:34 AM

abhina.sreeskantharajan marked 8 inline comments as done. Dec 29 2020, 11:39 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralSupport.h
243	Thanks, I've removed this member.
clang/lib/Lex/LiteralSupport.cpp
1342–1343	Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.
1615–1616	I've now added an assertion when translating wide characters.
clang/test/CodeGen/systemz-charset.cpp
2–26	Good idea, I added those testcases as per your suggestion.

abhina.sreeskantharajan marked 4 inline comments as done. Dec 29 2020, 11:39 AM

abhina.sreeskantharajan marked an inline comment as done. Dec 29 2020, 12:45 PM

abhina.sreeskantharajan added inline comments.

clang/lib/Lex/LiteralSupport.cpp
236	Thanks, I added the substitution with the '?' character for invalid escapes.

abhina.sreeskantharajan marked an inline comment as done. Dec 29 2020, 12:45 PM

This patch replaces the memcpy in CharLiteralParser with an assignment. I've added an assertion for cases where the character size increases after translation.

Harbormaster completed remote builds in B83747: Diff 314115. Dec 30 2020, 6:50 AM

abhina.sreeskantharajan marked 2 inline comments as done. Dec 30 2020, 6:52 AM

abhina.sreeskantharajan added inline comments.

clang/lib/Lex/LiteralSupport.cpp
237	I replaced memcpy with an assignment. Please let me know if there is a better solution.
240	I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.

abhina.sreeskantharajan marked an inline comment as done. Dec 30 2020, 6:52 AM

abhina.sreeskantharajan added inline comments. Dec 30 2020, 7:22 AM

clang/lib/Lex/LiteralSupport.cpp
1614–1618	Please correct me if I'm wrong, but these Pragma strings are not parsed through StringLiteralParser; they are parsed in clang/lib/Lex/Pragma.cpp, in the function `void Preprocessor::Handle_Pragma(Token &Tok)`. So if they require translation, it would need to be done in that function.

ping :)
Is there any more feedback on the implementation inside ProcessCharEscape()?

Herald added a reviewer: jansvoboda11. Jan 26 2021, 12:06 PM

Hi, Abhina. Sorry for the delay getting back to you. I added some more comments.

clang/include/clang/Lex/LiteralSupport.h
192	Is the default argument for `TranslationState` actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a `CharLiteralParser` with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of `CharLiteralParser` type constructed are in `EvaluateValue()` in `clang/lib/Lex/PPExpressions.cpp` and `Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`.
244–245	I don't think a `LiteralTranslator` object is actually needed in this case. The only use of this constructor that I see is in `ModuleMapParser::consumeToken()` in `clang/lib/Lex/ModuleMap.cpp` and, in that case, I don't think any translation is necessary. This suggests that `TranslationState` is not needed for this constructor either; `NoTranslation` can be passed to `init()`.
258–259	I don't think `getOffsetOfStringByte()` should require a `ConversionState` parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.
clang/include/clang/Lex/LiteralTranslator.h
19–23 ↗	(On Diff #314115)	Some naming suggestions... The enumeration is not used to record a state, but rather to indicate an action to take. Also, use of both "conversion" and "translation" could be confusing, so I suggest sticking with one. Perhaps:
  enum class LiteralConversion { None, ToSystemCharset, ToExecCharset };
30–31 ↗	(On Diff #314115)	I don't know the LLVM style guides well, but I suspect a class with all public members should be defined using `struct` and not include access specifiers.
35 ↗	(On Diff #314115)	Given the converter setters and accessors below, `ExecCharsetTables` should be a private member.
37 ↗	(On Diff #314115)	`getConversionTable()` is logically `const`. Perhaps `ExecCharsetTables` should be `mutable`. From a terminology standpoint, this function is misnamed. It doesn't return a table, it returns a converter for an encoding. I suggest:
  llvm::CharSetConverter getCharSetConverter(const char Encoding) const;
38 ↗	(On Diff #314115)	`findOrCreateExecCharsetTable()` seems oddly named since it doesn't return whatever it finds or creates. It seems like this function would be more useful if it returned a `llvm::CharSetConverter` pointer with `nullptr` indicating lookup/creation failed. This function seems like it should be an implementation detail of the class, not a public interface.
41–43 ↗	(On Diff #314115)	`setTranslationTables()` is awkward. It is effectively operating as a constructor for the class, but isn't called at object construction and it does work that goes beyond initialization.
44 ↗	(On Diff #314115)	I suggest trying a design more like this:
  class LiteralTranslator {
    std::string SystemEncoding;
    std::string ExecutionEncoding;
  public:
    LiteralTranslator(llvm::StringRef SystemEncoding,
                      llvm::StringRef ExecutionEncoding);
    // Retrieve the name for the system encoding.
    llvm::StringRef getSystemEncoding() const;
    // Retrieve the name for the execution encoding.
    llvm::StringRef getExecutionEncoding() const;
    // Retrieve a converter for converting from the internal encoding (UTF-8)
    // to the system encoding.
    llvm::CharSetConverter* getSystemEncodingConverter() const;
    // Retrieve a converter for converting from the internal encoding (UTF-8)
    // to the execution encoding.
    llvm::CharSetConverter* getExecutionEncodingConverter() const;
  };
  LiteralTranslator createLiteralTranslatorFromOptions(const clang::LangOptions &Opts,
                                                       const clang::TargetInfo &TInfo,
                                                       clang::DiagnosticsEngine &Diags);
clang/include/clang/Lex/Preprocessor.h
145	I don't see a reason for `LT` to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?
clang/lib/Lex/LiteralSupport.cpp
1341–1343	Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like:
  llvm::CharSetConverter *Converter = nullptr;
  if (!isUTFLiteral(Kind)) {
    assert(LT);
    Converter = LT->getCharConversionTable(TranslateToExecCharset);
  }
1614–1618	Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.
clang/lib/Lex/Preprocessor.cpp
88 ↗	(On Diff #314115)	Per comments elsewhere, please try to make `LT` a non-pointer non-reference data member.
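
One way to read the design suggestions above (a logically `const` getter, a private cache, and `nullptr` on lookup failure) is the sketch below. It is not the patch's implementation: `ToyConverter` stands in for `llvm::CharSetConverter`, the class name `LiteralConverterCache` is hypothetical, and the list of recognized names is invented for the example, since the real set depends on the host iconv library.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-in for llvm::CharSetConverter; it only records the target encoding.
struct ToyConverter {
  std::string Target;
};

class LiteralConverterCache {
  // Per-instance cache, so independent compilations share no state.
  // `mutable` lets the logically-const getter populate it lazily.
  mutable std::map<std::string, std::unique_ptr<ToyConverter>> Converters;

  // Invented for the sketch: pretend only these names can be created.
  static bool isKnown(const std::string &Name) {
    return Name == "UTF-8" || Name == "IBM-1047" || Name == "ISO8859-1";
  }

public:
  // Returns the cached converter for Name, creating it on first use.
  // Returns nullptr when the name is not recognized.
  ToyConverter *getConverter(const std::string &Name) const {
    auto It = Converters.find(Name);
    if (It != Converters.end())
      return It->second.get();
    if (!isKnown(Name))
      return nullptr;
    auto Inserted = Converters.emplace(
        Name, std::make_unique<ToyConverter>(ToyConverter{Name}));
    return Inserted.first->second.get();
  }
};
```

Because the cache lives in the object, an owner such as the preprocessor can hold it as a plain data member, and a failed lookup is reported once to the caller instead of being hidden in global state.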

jansvoboda11 added inline comments. Mar 1 2021, 12:00 AM

clang/include/clang/Driver/Options.td
4433	Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option Adding `MarshallingInfoString<LangOpts<"ExecCharset">>` here should do the trick. You can then delete the option parsing in `CompilerInvocation.cpp`.

Thanks for the feedback! I haven't addressed all the comments yet but I've made major renaming changes and hope to get feedback on it.

abhina.sreeskantharajan marked 6 inline comments as done. Mar 2 2021, 8:53 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralSupport.h
192	You're right, we don't have any cases that use this arg yet so we can remove it.
244–245	Thanks, I've removed it.
clang/include/clang/Lex/Preprocessor.h
145	Thanks, I've changed it to a non-reference non-pointer member.
clang/lib/Lex/LiteralSupport.cpp
1341–1343	I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor which does not pass the Preprocessor. But I have added the remaining changes.

abhina.sreeskantharajan marked 4 inline comments as done. Mar 2 2021, 8:55 AM

Hi Tom (@tahonermann), I renamed the LiteralTranslator class to LiteralConverter and have renamed a lot of the functions. Let me know what you think. I agree that the setConverters function is awkward; the problem stems from initializing the member early in Preprocessor but only being able to create the Converters once we know the target host later in the compilation process.

Harbormaster completed remote builds in B91586: Diff 327470. Mar 2 2021, 9:40 AM

rsmith added inline comments. Mar 2 2021, 12:12 PM

clang/include/clang/Lex/Preprocessor.h
145	Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why `LT`? What does the `T` stand for?
clang/lib/Driver/ToolChains/Clang.cpp
6238	Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?
clang/lib/Lex/LiteralSupport.cpp
236	This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, `\q` should produce `q` -- and take fallback action only if the character is unencodable. Producing a `?` seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.
237	Can you avoid using `std::string` here? Eg, pass a `StringRef`, extending the converter to be able to take one if necessary.
240	Is there any guarantee the assertion will not fail?
1383–1384	Why is this case not possible?
1387	What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.
1735–1737	Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?
clang/test/Driver/cl-options.c
215	Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)
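
The behaviour requested above for unknown escapes and unencodable characters might be sketched as follows. This is an illustration under assumptions, not the patch's code: `ByteConverter` is a hypothetical callback standing in for the real converter, and the returned `ConvertResult` takes the place of what would be a Clang diagnostic.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical converter callback: returns the encoded bytes, or
// std::nullopt when the input cannot be represented in the target charset.
using ByteConverter =
    std::function<std::optional<std::string>(std::string_view)>;

struct ConvertResult {
  bool Ok;
  char Value;        // meaningful only when Ok is true
  std::string Error; // meaningful only when Ok is false
};

// Convert one character of a character literal. The character itself is
// kept (so `\q` still means `q`); an error is reported, rather than a '?'
// substituted, when it is unencodable or needs more than one code unit.
inline ConvertResult convertSingleChar(char C, const ByteConverter &Convert) {
  assert(Convert && "a converter is required");
  // The input is a real char object, viewed as a one-byte string.
  std::optional<std::string> Bytes = Convert(std::string_view(&C, 1));
  if (!Bytes)
    return {false, 0, "character not representable in execution charset"};
  if (Bytes->size() != 1)
    return {false, 0, "character does not fit in one code unit"};
  return {true, (*Bytes)[0], ""};
}
```

Returning a status instead of asserting answers the "is there any guarantee the assertion will not fail?" question by making the failure a reportable condition.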

Addressing some more comments. Updating the argument parsing, lit tests, some more renaming.

abhina.sreeskantharajan marked 4 inline comments as done. Mar 4 2021, 6:20 AM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/Preprocessor.h
145	Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.
clang/lib/Driver/ToolChains/Clang.cpp
6238	Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.
clang/lib/Lex/LiteralSupport.cpp
236	Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?
1383–1384	This case should be handled when the fwide-exec-charset option is implemented. Until then, we thought it was best to emit an error message that wide literal translation is not supported.
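
The "last argument wins, and diagnostics quote the given spelling" convention discussed for Clang.cpp line 6238 can be sketched without the driver's option classes. `CharsetArg` and `lastCharsetArg` are hypothetical names, and the prefix list is supplied by the caller; this is not how the driver's argument list is actually implemented.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result: the value plus the flag as the user spelled it,
// so that a diagnostic can quote the given spelling.
struct CharsetArg {
  std::string Spelling;
  std::string Value;
};

// Return the last argument that starts with any of the given prefixes,
// or std::nullopt when none is present. Earlier occurrences are ignored.
inline std::optional<CharsetArg>
lastCharsetArg(const std::vector<std::string> &Args,
               const std::vector<std::string> &Prefixes) {
  assert(!Prefixes.empty() && "at least one spelling is required");
  std::optional<CharsetArg> Last;
  for (const std::string &Arg : Args)
    for (const std::string &Prefix : Prefixes)
      if (Arg.compare(0, Prefix.size(), Prefix) == 0)
        Last = CharsetArg{Prefix, Arg.substr(Prefix.size())};
  return Last;
}
```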

abhina.sreeskantharajan marked 3 inline comments as done. Mar 4 2021, 6:20 AM

abhina.sreeskantharajan marked an inline comment as done. Mar 4 2021, 6:23 AM

Harbormaster completed remote builds in B92056: Diff 328151. Mar 4 2021, 4:28 PM

Add assertion, add testcase for multi-line raw string

abhina.sreeskantharajan added inline comments. Mar 5 2021, 7:04 AM

clang/lib/Lex/LiteralSupport.cpp
1387	Right, I'll add a similar assertion to the one we have above.
1735–1737	Yes, we need to convert newlines as well. I think the current behaviour is already converting multi line raw strings correctly. I'll add a testcase for this.

Harbormaster completed remote builds in B92310: Diff 328513. Mar 5 2021, 11:45 PM

abhina.sreeskantharajan marked 3 inline comments as done. Mar 8 2021, 12:01 PM

abhina.sreeskantharajan added inline comments.

clang/include/clang/Lex/LiteralTranslator.h
30–31 ↗	(On Diff #314115)	I've made these private.
37 ↗	(On Diff #314115)	I've renamed this function to getConverter

abhina.sreeskantharajan marked 2 inline comments as done. Mar 8 2021, 12:01 PM

abhina.sreeskantharajan marked 2 inline comments as done. Mar 15 2021, 5:44 AM

Rebase + fix CharLiteralParser endian issue by saving the char to a char variable first and then creating a StringRef

Harbormaster completed remote builds in B97954: Diff 336416. Apr 9 2021, 5:40 AM

Accidentally added dependent patch in this one. Removing that

Harbormaster completed remote builds in B97955: Diff 336417. Apr 9 2021, 6:21 AM

ThePhD mentioned this in D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor. Apr 12 2021, 3:03 PM

Rebase + set size of char as 1 when creating a StringRef to fix lit failure

Harbormaster completed remote builds in B99514: Diff 338565. Apr 19 2021, 11:57 AM

Just a tiny comment: could you please make sure the name of the resolved encoding is also propagated to InitPreprocessor.cpp that sets the __clang_literal_encoding__ macro? (https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)

Thanks for catching that. This sets __clang_literal_encoding__ to Opts.ExecCharset or defaults to SystemCharset.

Harbormaster completed remote builds in B100076: Diff 339355. Apr 21 2021, 1:31 PM

We should use the original source form of the string literal when pretty-printing a StringLiteral or CharacterLiteral; there are a bunch of UTF-8 assumptions baked into StmtPrinter that will need revisiting. And we'll need to modify the handful of places that put the contents of StringLiterals into diagnostics (#warning, #error, static_assert) and make them use a different ConversionState, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

The operation is destructive and therefore cannot be reverted.
So I do believe the correct behavior here would indeed be to keep the original spelling around - with *some* of phase 5 applied (replacement of UCNs and replacement of numeric escape sequences).
An alternative would be to do the conversion lazily when the strings are evaluated, rather than during lexing, although that might be more involved

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems a worse assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

In D93031#2706988, @joerg wrote:

"Keeping the original spelling around" would assume that the input is not using a stateful encoding. That seems a worse assumption than giving the canonical output in UTF-8 and shifting the problem to the user's editor?

Right, terrible choice of words
s/original spelling/the concatenated, non-encoded string literal, in UTF-8

Thanks for the input! I agree that doing the conversion lazily would help avoid these issues, since it pushes translation to a later stage, but as you mentioned it will be more involved. I think keeping the original spelling might be the best solution. We can add an extra member to StringLiteralParser to save the string prior to translation. But we would need to go through each use of StringLiteralParser and save the original encoding (possibly print it in the .ll file along with the translated string, or as an attribute?). Let me know what you think.
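A minimal sketch of the "extra member" idea, assuming a hypothetical `ParsedLiteral` result type and `parseLiteral` function (neither is in the patch, and the real StringLiteralParser has a very different interface): the UTF-8 text is saved before the conversion runs, so both forms remain available afterwards.

```cpp
#include <string>
#include <utility>

// Hypothetical result type: carries both forms of one string literal.
struct ParsedLiteral {
  std::string Utf8;      // saved before conversion, for diagnostics and printing
  std::string ExecBytes; // bytes in the execution charset
};

// Convert is any callable mapping a UTF-8 std::string to converted bytes.
template <typename ConvertFn>
ParsedLiteral parseLiteral(std::string Utf8, ConvertFn Convert) {
  std::string Exec = Convert(Utf8);
  return ParsedLiteral{std::move(Utf8), std::move(Exec)};
}
```

The cost of this approach is that every literal is stored twice, which is the trade-off against the lazy variant discussed above.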

abhina.sreeskantharajan added reviewers: cor3ntin, joerg. (Apr 22 2021, 6:32 AM)

cor3ntin mentioned this in D105759: Implement P2361 Unevaluated string literals. (Jul 10 2021, 6:53 AM)

cor3ntin mentioned this in D106577: [clang] Define __STDC_ISO_10646__. (Jul 22 2021, 11:05 AM)

nigelp-xmos added a subscriber: nigelp-xmos. (Aug 3 2021, 9:43 AM)

jansvoboda11 resigned from this revision. (Aug 6 2021, 4:44 AM)

evantypanski added a subscriber: evantypanski. (Oct 28 2021, 8:54 AM)

srl295 added a subscriber: srl295. (Mar 8 2022, 11:41 AM)

Herald added a project: Restricted Project. (Mar 8 2022, 11:41 AM)

tahonermann mentioned this in D135366: [clang][Interp] Implement String- and CharacterLiterals. (Oct 10 2022, 2:06 PM)

tahonermann mentioned this in D134036: [libc++][format] Implements string escaping. (Oct 10 2022, 2:20 PM)

barannikov88 added a subscriber: barannikov88. (Feb 11 2023, 2:17 PM)

Herald added a subscriber: MaskRay. (Feb 11 2023, 2:17 PM)

@abhina.sreeskantharajan
What is the status of this patch?

Hello, I was waiting for the CharSetConverter patch to land. Now that https://reviews.llvm.org/D148821 has landed and added limited EBCDIC <-> UTF-8 conversion support, I have started to refactor my patch to use it. This implementation also relies heavily on iconv support, which is still being discussed in the CharSet Converter RFC here: https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/16

Thanks! I was beginning to think it had been forgotten/abandoned.

I have opened a new patch, https://reviews.llvm.org/D153419, and am closing this revision.

cor3ntin mentioned this in rG95f50964fbf5: Implement P2361 Unevaluated string literals. (Jul 7 2023, 4:30 AM)

Revision Contents

Path                                        Size
clang/docs/LanguageExtensions.rst           3 lines
clang/include/clang/Basic/LangOptions.h     3 lines
clang/include/clang/Basic/TokenKinds.h      7 lines
clang/include/clang/Driver/Options.td       5 lines
clang/include/clang/Lex/LiteralConverter.h  36 lines
clang/include/clang/Lex/LiteralSupport.h    30 lines
clang/include/clang/Lex/Preprocessor.h      3 lines
clang/lib/Driver/ToolChains/Clang.cpp       18 lines
clang/lib/Frontend/CompilerInstance.cpp     4 lines
clang/lib/Frontend/InitPreprocessor.cpp     12 lines
clang/lib/Lex/CMakeLists.txt                1 line
clang/lib/Lex/LiteralConverter.cpp          68 lines
clang/lib/Lex/LiteralSupport.cpp            108 lines
clang/test/CodeGen/systemz-charset.c        35 lines
clang/test/CodeGen/systemz-charset.cpp      46 lines
clang/test/Driver/cl-options.c              7 lines
clang/test/Driver/clang_f_opts.c            12 lines
clang/test/Preprocessor/init-s390x.c        1 line
clang/test/Preprocessor/init-x86.c          2 lines
llvm/include/llvm/ADT/Triple.h              3 lines
llvm/lib/Support/Triple.cpp                 7 lines

Diff 339355

clang/docs/LanguageExtensions.rst

	[380 lines hidden]

	``__clang_version__``			``__clang_version__``
	Defined to a string that captures the Clang marketing version, including the			Defined to a string that captures the Clang marketing version, including the
	Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".			Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".

	``__clang_literal_encoding__``			``__clang_literal_encoding__``
	Defined to a narrow string literal that represents the current encoding of			Defined to a narrow string literal that represents the current encoding of
	narrow string literals, e.g., ``"hello"``. This macro typically expands to			narrow string literals, e.g., ``"hello"``. This macro typically expands to
	"UTF-8" (but may change in the future if the			the charset specified by -fexec-charset if specified, or the system charset.
	``-fexec-charset="Encoding-Name"`` option is implemented.)

	``__clang_wide_literal_encoding__``			``__clang_wide_literal_encoding__``
	Defined to a narrow string literal that represents the current encoding of			Defined to a narrow string literal that represents the current encoding of
	wide string literals, e.g., ``L"hello"``. This macro typically expands to			wide string literals, e.g., ``L"hello"``. This macro typically expands to
	"UTF-16" or "UTF-32" (but may change in the future if the			"UTF-16" or "UTF-32" (but may change in the future if the
	``-fwide-exec-charset="Encoding-Name"`` option is implemented.)			``-fwide-exec-charset="Encoding-Name"`` option is implemented.)

	.. _langext-vectors:			.. _langext-vectors:
	[3,405 lines hidden]
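A small sketch of how user code might query the macro documented above. The helper name `narrowLiteralEncoding` is made up for this example; the `#ifdef` guard is there because the macro is Clang-specific and may be absent on other compilers or older Clang releases.

```cpp
#include <string>

// Hypothetical helper: reports the narrow literal encoding name the compiler
// advertises, or "unknown" when __clang_literal_encoding__ is not defined.
std::string narrowLiteralEncoding() {
#ifdef __clang_literal_encoding__
  return __clang_literal_encoding__;
#else
  return "unknown";
#endif
}
```

With this patch applied, the returned name would be expected to follow -fexec-charset when that option is given.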

clang/include/clang/Basic/LangOptions.h

[336 lines hidden]	public:
/// device variables in host code for single source offloading languages		/// device variables in host code for single source offloading languages
/// like CUDA/HIP.		/// like CUDA/HIP.
std::string CUID;		std::string CUID;

/// Indicates whether the front-end is explicitly told that the		/// Indicates whether the front-end is explicitly told that the
/// input is a header file (i.e. -x c-header).		/// input is a header file (i.e. -x c-header).
bool IsHeaderFile = false;		bool IsHeaderFile = false;

		/// Name of the exec charset to convert the internal charset to.
		std::string ExecCharset;

LangOptions();		LangOptions();

// Define accessors/mutators for language options of enumeration type.		// Define accessors/mutators for language options of enumeration type.
#define LANGOPT(Name, Bits, Default, Description)		#define LANGOPT(Name, Bits, Default, Description)
#define ENUM_LANGOPT(Name, Type, Bits, Default, Description) \		#define ENUM_LANGOPT(Name, Type, Bits, Default, Description) \
Type get##Name() const { return static_cast<Type>(Name); } \		Type get##Name() const { return static_cast<Type>(Name); } \
void set##Name(Type Value) { Name = static_cast<unsigned>(Value); }		void set##Name(Type Value) { Name = static_cast<unsigned>(Value); }
#include "clang/Basic/LangOptions.def"		#include "clang/Basic/LangOptions.def"
[317 lines hidden]

clang/include/clang/Basic/TokenKinds.h

	[84 lines hidden]
	/// constant, string, etc.			/// constant, string, etc.
	inline bool isLiteral(TokenKind K) {			inline bool isLiteral(TokenKind K) {
	return K == tok::numeric_constant || K == tok::char_constant ||			return K == tok::numeric_constant || K == tok::char_constant ||
	K == tok::wide_char_constant || K == tok::utf8_char_constant ||			K == tok::wide_char_constant || K == tok::utf8_char_constant ||
	K == tok::utf16_char_constant || K == tok::utf32_char_constant ||			K == tok::utf16_char_constant || K == tok::utf32_char_constant ||
	isStringLiteral(K) || K == tok::header_name;			isStringLiteral(K) || K == tok::header_name;
	}			}

				/// Return true if this is a utf literal kind.
				inline bool isUTFLiteral(TokenKind K) {
				return K == tok::utf8_char_constant || K == tok::utf8_string_literal ||
				K == tok::utf16_char_constant || K == tok::utf16_string_literal ||
				K == tok::utf32_char_constant || K == tok::utf32_string_literal;
				}

	/// Return true if this is any of tok::annot_* kinds.			/// Return true if this is any of tok::annot_* kinds.
	bool isAnnotation(TokenKind K);			bool isAnnotation(TokenKind K);

	/// Return true if this is an annotation token representing a pragma.			/// Return true if this is an annotation token representing a pragma.
	bool isPragmaAnnotation(TokenKind K);			bool isPragmaAnnotation(TokenKind K);

	} // end namespace tok			} // end namespace tok
	} // end namespace clang			} // end namespace clang
	[20 lines hidden]
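A self-contained sketch of what a predicate like `isUTFLiteral` is useful for, under the assumption that u8/u/U literals keep their Unicode encoding and are therefore exempt from execution charset conversion. `LiteralKind` and `skipsExecCharsetConversion` are hypothetical stand-ins; the real code operates on `tok::TokenKind`, and how the patch treats wide literals is not shown in this excerpt.

```cpp
// Hypothetical stand-in for the literal-related tok::TokenKind values.
enum class LiteralKind { Ordinary, Wide, UTF8, UTF16, UTF32 };

// Mirrors the shape of tok::isUTFLiteral for the stand-in enum.
inline bool isUTFLiteral(LiteralKind K) {
  return K == LiteralKind::UTF8 || K == LiteralKind::UTF16 ||
         K == LiteralKind::UTF32;
}

// Assumption for this sketch: UTF literals are never re-encoded.
inline bool skipsExecCharsetConversion(LiteralKind K) {
  return isUTFLiteral(K);
}
```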

clang/include/clang/Driver/Options.td

	[4,424 lines hidden]
	let Flags = [CC1Option, NoDriverOption] in {			let Flags = [CC1Option, NoDriverOption] in {

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Target Options			// Target Options
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	let Flags = [CC1Option, CC1AsOption, NoDriverOption] in {			let Flags = [CC1Option, CC1AsOption, NoDriverOption] in {

				def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
				jansvoboda11 (inline comment): Could you switch to the option marshalling infrastructure? https://clang.llvm.org/docs/InternalsManual.html#adding-new-command-line-option Adding `MarshallingInfoString<LangOpts<"ExecCharset">>` here should do the trick. You can then delete the option parsing in `CompilerInvocation.cpp`.
				HelpText<"Set the execution <charset> for string and character literals. "
				tahonermann (inline comment): How about substituting "character set", "character encoding", or "charset" for "codepage"? This doesn't state what names are recognized. The ones provided by the system iconv() implementation (as is the case for gcc)? Or all names and aliases specified by the IANA character set registry? The set of recognized names can be a superset of the names that are actually supported.
				abhina.sreeskantharajan (author, inline comment): I've updated the description from codepage to charset. It's hard to specify what charsets are supported because the iconv library differs between targets, so the list will not be the same on every platform.
				tahonermann (inline comment): Being dependent on the host iconv library seems fine by me; that is the case for gcc today. I suggest making that explicit here: def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">, HelpText<"Set the execution <charset> for string and character literals. Supported character encodings include XXX and those supported by the host iconv library.">;
				abhina.sreeskantharajan (author, inline comment): I've updated the HelpText with your suggested description.
				"Supported character encodings include ISO8859-1, UTF-8, IBM-1047 "
				"and those supported by the host iconv library.">,
				MarshallingInfoString<LangOpts<"ExecCharset">>;
	def target_cpu : Separate<["-"], "target-cpu">,			def target_cpu : Separate<["-"], "target-cpu">,
	HelpText<"Target a specific cpu type">,			HelpText<"Target a specific cpu type">,
	MarshallingInfoString<TargetOpts<"CPU">>;			MarshallingInfoString<TargetOpts<"CPU">>;
	def tune_cpu : Separate<["-"], "tune-cpu">,			def tune_cpu : Separate<["-"], "tune-cpu">,
	HelpText<"Tune for a specific cpu type">,			HelpText<"Tune for a specific cpu type">,
	MarshallingInfoString<TargetOpts<"TuneCPU">>;			MarshallingInfoString<TargetOpts<"TuneCPU">>;
	def target_feature : Separate<["-"], "target-feature">,			def target_feature : Separate<["-"], "target-feature">,
	HelpText<"Target specific attributes">,			HelpText<"Target specific attributes">,
	[1,702 lines hidden]
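To make the effect of an execution charset such as IBM-1047 concrete, here is a hand-written sketch that maps ASCII letters, digits, and space to their EBCDIC code units. It is purely illustrative: the patch delegates conversion to `llvm::CharSetConverter` and iconv rather than a table like this, the function names are invented, and the sketch assumes the host compiler itself uses an ASCII-compatible encoding.

```cpp
#include <cstdint>
#include <string>

// Maps one ASCII letter, digit, or space to its IBM-1047 (EBCDIC) code unit.
// Returns 0x3F (EBCDIC SUB) for anything this sketch does not cover.
std::uint8_t asciiToIbm1047(char C) {
  // EBCDIC letters sit in three non-contiguous runs of 9, 9, and 8 letters.
  auto Letter = [](int Index, int A, int J, int S) {
    if (Index < 9)
      return A + Index;
    if (Index < 18)
      return J + (Index - 9);
    return S + (Index - 18);
  };
  if (C >= 'A' && C <= 'Z')
    return static_cast<std::uint8_t>(Letter(C - 'A', 0xC1, 0xD1, 0xE2));
  if (C >= 'a' && C <= 'z')
    return static_cast<std::uint8_t>(Letter(C - 'a', 0x81, 0x91, 0xA2));
  if (C >= '0' && C <= '9')
    return static_cast<std::uint8_t>(0xF0 + (C - '0'));
  if (C == ' ')
    return 0x40;
  return 0x3F;
}

// Converts a whole string, one byte per input character.
std::string toIbm1047(const std::string &Ascii) {
  std::string Out;
  Out.reserve(Ascii.size());
  for (char C : Ascii)
    Out.push_back(static_cast<char>(asciiToIbm1047(C)));
  return Out;
}
```

Under this mapping the literal "Hi 9" becomes the bytes 0xC8 0x89 0x40 0xF9, which is the kind of change a test like systemz-charset.c would be expected to check in the emitted IR.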

clang/include/clang/Lex/LiteralConverter.h

This file was added.

				//===--- clang/Lex/LiteralConverter.h - Translator for Literals -- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_LEX_LITERALCONVERTER_H
				#define LLVM_CLANG_LEX_LITERALCONVERTER_H

				#include "clang/Basic/Diagnostic.h"
				#include "clang/Basic/LangOptions.h"
				#include "clang/Basic/TargetInfo.h"
				#include "llvm/ADT/StringMap.h"
				#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/CharSet.h"

				enum ConversionAction { NoConversion, ToSystemCharset, ToExecCharset };

				class LiteralConverter {
				llvm::StringRef InternalCharset;
				llvm::StringRef SystemCharset;
				llvm::StringRef ExecCharset;
				llvm::StringMap<llvm::CharSetConverter> CharsetConverters;

				public:
				llvm::CharSetConverter *getConverter(const char *Codepage);
				llvm::CharSetConverter *getConverter(ConversionAction Action);
				llvm::CharSetConverter *createAndInsertCharConverter(const char *To);
				void setConvertersFromOptions(const clang::LangOptions &Opts,
				const clang::TargetInfo &TInfo,
				clang::DiagnosticsEngine &Diags);
				};

				#endif
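The header above suggests a cache of converters keyed by charset name, with a create-and-insert path for names not seen before. A standalone sketch of that pattern follows, using `std::map` and `std::function` as stand-ins for `llvm::StringMap` and `llvm::CharSetConverter`. `ConverterCache`, `get`, and `getOrCreate` are hypothetical names, and the real implementation in LiteralConverter.cpp is not shown in this excerpt.

```cpp
#include <functional>
#include <map>
#include <string>

// Stand-in for llvm::CharSetConverter: any UTF-8 -> target-bytes function.
using Converter = std::function<std::string(const std::string &)>;

class ConverterCache {
  std::map<std::string, Converter> Converters;

public:
  // Returns the cached converter for Charset, or nullptr if none exists yet.
  Converter *get(const std::string &Charset) {
    auto It = Converters.find(Charset);
    return It == Converters.end() ? nullptr : &It->second;
  }

  // Returns the existing converter, or builds one with Make and caches it.
  Converter *getOrCreate(const std::string &Charset,
                         const std::function<Converter()> &Make) {
    if (Converter *Existing = get(Charset))
      return Existing;
    return &Converters.emplace(Charset, Make()).first->second;
  }
};
```

`std::map` is used here because pointers to its elements stay valid across later insertions, so callers can hold on to the returned pointer.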

clang/include/clang/Lex/LiteralSupport.h

	[11 lines hidden]
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_CLANG_LEX_LITERALSUPPORT_H			#ifndef LLVM_CLANG_LEX_LITERALSUPPORT_H
	#define LLVM_CLANG_LEX_LITERALSUPPORT_H			#define LLVM_CLANG_LEX_LITERALSUPPORT_H

	#include "clang/Basic/CharInfo.h"			#include "clang/Basic/CharInfo.h"
	#include "clang/Basic/LLVM.h"			#include "clang/Basic/LLVM.h"
	#include "clang/Basic/TokenKinds.h"			#include "clang/Basic/TokenKinds.h"
				#include "clang/Lex/LiteralConverter.h"
	#include "llvm/ADT/APFloat.h"			#include "llvm/ADT/APFloat.h"
	#include "llvm/ADT/ArrayRef.h"			#include "llvm/ADT/ArrayRef.h"
	#include "llvm/ADT/SmallString.h"			#include "llvm/ADT/SmallString.h"
	#include "llvm/ADT/StringRef.h"			#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/CharSet.h"
	#include "llvm/Support/DataTypes.h"			#include "llvm/Support/DataTypes.h"

	namespace clang {			namespace clang {

	class DiagnosticsEngine;			class DiagnosticsEngine;
	class Preprocessor;			class Preprocessor;
	class Token;			class Token;
	class SourceLocation;			class SourceLocation;
	[148 lines hidden]
	class CharLiteralParser {			class CharLiteralParser {
	uint64_t Value;			uint64_t Value;
	tok::TokenKind Kind;			tok::TokenKind Kind;
	bool IsMultiChar;			bool IsMultiChar;
	bool HadError;			bool HadError;
	SmallString<32> UDSuffixBuf;			SmallString<32> UDSuffixBuf;
	unsigned UDSuffixOffset;			unsigned UDSuffixOffset;
	public:			public:
	CharLiteralParser(const char *begin, const char *end,			CharLiteralParser(const char *begin, const char *end, SourceLocation Loc,
	SourceLocation Loc, Preprocessor &PP,			Preprocessor &PP, tok::TokenKind kind);
	tok::TokenKind kind);

				tahonermann (inline comment): Is the default argument for `TranslationState` actually used anywhere? I'm skeptical that a default argument provides a benefit here. Actually, this diff doesn't include any changes to construct a `CharLiteralParser` with an explicit argument. It seems this argument isn't actually needed. The only places I see objects of `CharLiteralParser` type constructed are in `EvaluateValue()` in `clang/lib/Lex/PPExpressions.cpp` and `Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`.
				abhina.sreeskantharajan (author, inline comment): You're right, we don't have any cases that use this arg yet so we can remove it.
	bool hadError() const { return HadError; }			bool hadError() const { return HadError; }
				tahonermann (inline comment): Does the conversion state need to be persisted as a data member? The literal is consumed in the constructor.
				abhina.sreeskantharajan (author, inline comment): Thanks, I've removed this.
	bool isAscii() const { return Kind == tok::char_constant; }			bool isAscii() const { return Kind == tok::char_constant; }
	bool isWide() const { return Kind == tok::wide_char_constant; }			bool isWide() const { return Kind == tok::wide_char_constant; }
	bool isUTF8() const { return Kind == tok::utf8_char_constant; }			bool isUTF8() const { return Kind == tok::utf8_char_constant; }
	bool isUTF16() const { return Kind == tok::utf16_char_constant; }			bool isUTF16() const { return Kind == tok::utf16_char_constant; }
	bool isUTF32() const { return Kind == tok::utf32_char_constant; }			bool isUTF32() const { return Kind == tok::utf32_char_constant; }
	bool isMultiChar() const { return IsMultiChar; }			bool isMultiChar() const { return IsMultiChar; }
	uint64_t getValue() const { return Value; }			uint64_t getValue() const { return Value; }
	StringRef getUDSuffix() const { return UDSuffixBuf; }			StringRef getUDSuffix() const { return UDSuffixBuf; }
	unsigned getUDSuffixOffset() const {			unsigned getUDSuffixOffset() const {
	assert(!UDSuffixBuf.empty() && "no ud-suffix");			assert(!UDSuffixBuf.empty() && "no ud-suffix");
	return UDSuffixOffset;			return UDSuffixOffset;
	}			}
	};			};

	/// StringLiteralParser - This decodes string escape characters and performs			/// StringLiteralParser - This decodes string escape characters and performs
	/// wide string analysis and Translation Phase #6 (concatenation of string			/// wide string analysis and Translation Phase #6 (concatenation of string
	/// literals) (C99 5.1.1.2p1).			/// literals) (C99 5.1.1.2p1).
	class StringLiteralParser {			class StringLiteralParser {
	const SourceManager &SM;			const SourceManager &SM;
	const LangOptions &Features;			const LangOptions &Features;
	const TargetInfo &Target;			const TargetInfo &Target;
	DiagnosticsEngine *Diags;			DiagnosticsEngine *Diags;
				LiteralConverter *LiteralConv;

	unsigned MaxTokenLength;			unsigned MaxTokenLength;
	unsigned SizeBound;			unsigned SizeBound;
	unsigned CharByteWidth;			unsigned CharByteWidth;
	tok::TokenKind Kind;			tok::TokenKind Kind;
	SmallString<512> ResultBuf;			SmallString<512> ResultBuf;
	char *ResultPtr; // cursor			char *ResultPtr; // cursor
	SmallString<32> UDSuffixBuf;			SmallString<32> UDSuffixBuf;
	unsigned UDSuffixToken;			unsigned UDSuffixToken;
	unsigned UDSuffixOffset;			unsigned UDSuffixOffset;
	public:			public:
	StringLiteralParser(ArrayRef<Token> StringToks,			StringLiteralParser(ArrayRef<Token> StringToks, Preprocessor &PP,
	Preprocessor &PP, bool Complain = true);			bool Complain = true,
	StringLiteralParser(ArrayRef<Token> StringToks,			ConversionAction Action = ToExecCharset);
	const SourceManager &sm, const LangOptions &features,			StringLiteralParser(ArrayRef<Token> StringToks, const SourceManager &sm,
	const TargetInfo &target,			const LangOptions &features, const TargetInfo &target,
	DiagnosticsEngine *diags = nullptr)			DiagnosticsEngine *diags = nullptr)
	: SM(sm), Features(features), Target(target), Diags(diags),			: SM(sm), Features(features), Target(target), Diags(diags),
	MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),			LiteralConv(nullptr), MaxTokenLength(0), SizeBound(0), CharByteWidth(0),
	ResultPtr(ResultBuf.data()), hadError(false), Pascal(false) {			Kind(tok::unknown), ResultPtr(ResultBuf.data()), hadError(false),
	init(StringToks);			Pascal(false) {
				init(StringToks, NoConversion);
	}			}


	bool hadError;			bool hadError;
	bool Pascal;			bool Pascal;

				tahonermann (inline comment): Same concern here with respect to persisting the conversion state as a data member.
				abhina.sreeskantharajan (author, inline comment): If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
				tahonermann (inline comment): I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.
				abhina.sreeskantharajan (author, inline comment): Thanks, I've removed this member.
	StringRef GetString() const {			StringRef GetString() const {
	return StringRef(ResultBuf.data(), GetStringLength());			return StringRef(ResultBuf.data(), GetStringLength());
				tahonermann (inline comment): This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.
				tahonermann (inline comment): I don't think a `LiteralTranslator` object is actually needed in this case. The only use of this constructor that I see is in `ModuleMapParser::consumeToken()` in `clang/lib/Lex/ModuleMap.cpp` and, in that case, I don't think any translation is necessary. This suggests that `TranslationState` is not needed for this constructor either; `NoTranslation` can be passed to `init()`.
				abhina.sreeskantharajan (author, inline comment): Thanks, I've removed it.
	}			}
	unsigned GetStringLength() const { return ResultPtr-ResultBuf.data(); }			unsigned GetStringLength() const { return ResultPtr-ResultBuf.data(); }

	unsigned GetNumStringChars() const {			unsigned GetNumStringChars() const {
	return GetStringLength() / CharByteWidth;			return GetStringLength() / CharByteWidth;
	}			}
	/// getOffsetOfStringByte - This function returns the offset of the			/// getOffsetOfStringByte - This function returns the offset of the
	/// specified byte of the string data represented by Token. This handles			/// specified byte of the string data represented by Token. This handles
	/// advancing over escape sequences in the string.			/// advancing over escape sequences in the string.
	///			///
	/// If the Diagnostics pointer is non-null, then this will do semantic			/// If the Diagnostics pointer is non-null, then this will do semantic
	/// checking of the string literal and emit errors and warnings.			/// checking of the string literal and emit errors and warnings.
	unsigned getOffsetOfStringByte(const Token &TheTok, unsigned ByteNo) const;			unsigned getOffsetOfStringByte(const Token &TheTok, unsigned ByteNo) const;

				tahonermann (inline comment): I don't think `getOffsetOfStringByte()` should require a `ConversionState` parameter. If I understand it correctly, this function should be operating on the string in the internal encoding, never in a converted encoding.
	bool isAscii() const { return Kind == tok::string_literal; }			bool isAscii() const { return Kind == tok::string_literal; }
	bool isWide() const { return Kind == tok::wide_string_literal; }			bool isWide() const { return Kind == tok::wide_string_literal; }
	bool isUTF8() const { return Kind == tok::utf8_string_literal; }			bool isUTF8() const { return Kind == tok::utf8_string_literal; }
	bool isUTF16() const { return Kind == tok::utf16_string_literal; }			bool isUTF16() const { return Kind == tok::utf16_string_literal; }
	bool isUTF32() const { return Kind == tok::utf32_string_literal; }			bool isUTF32() const { return Kind == tok::utf32_string_literal; }
	bool isPascal() const { return Pascal; }			bool isPascal() const { return Pascal; }

	StringRef getUDSuffix() const { return UDSuffixBuf; }			StringRef getUDSuffix() const { return UDSuffixBuf; }

	/// Get the index of a token containing a ud-suffix.			/// Get the index of a token containing a ud-suffix.
	unsigned getUDSuffixToken() const {			unsigned getUDSuffixToken() const {
	assert(!UDSuffixBuf.empty() && "no ud-suffix");			assert(!UDSuffixBuf.empty() && "no ud-suffix");
	return UDSuffixToken;			return UDSuffixToken;
	}			}
	/// Get the spelling offset of the first byte of the ud-suffix.			/// Get the spelling offset of the first byte of the ud-suffix.
	unsigned getUDSuffixOffset() const {			unsigned getUDSuffixOffset() const {
	assert(!UDSuffixBuf.empty() && "no ud-suffix");			assert(!UDSuffixBuf.empty() && "no ud-suffix");
	return UDSuffixOffset;			return UDSuffixOffset;
	}			}

	static bool isValidUDSuffix(const LangOptions &LangOpts, StringRef Suffix);			static bool isValidUDSuffix(const LangOptions &LangOpts, StringRef Suffix);

	private:			private:
	void init(ArrayRef<Token> StringToks);			void init(ArrayRef<Token> StringToks, ConversionAction Action);
	bool CopyStringFragment(const Token &Tok, const char *TokBegin,			bool CopyStringFragment(const Token &Tok, const char *TokBegin,
	StringRef Fragment);			StringRef Fragment);
	void DiagnoseLexingError(SourceLocation Loc);			void DiagnoseLexingError(SourceLocation Loc);
	};			};

	} // end namespace clang			} // end namespace clang

	#endif			#endif

clang/include/clang/Lex/Preprocessor.h

[17 lines hidden]
#include "clang/Basic/IdentifierTable.h"		#include "clang/Basic/IdentifierTable.h"
#include "clang/Basic/LLVM.h"		#include "clang/Basic/LLVM.h"
#include "clang/Basic/LangOptions.h"		#include "clang/Basic/LangOptions.h"
#include "clang/Basic/Module.h"		#include "clang/Basic/Module.h"
#include "clang/Basic/SourceLocation.h"		#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"		#include "clang/Basic/SourceManager.h"
#include "clang/Basic/TokenKinds.h"		#include "clang/Basic/TokenKinds.h"
#include "clang/Lex/Lexer.h"		#include "clang/Lex/Lexer.h"
		#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/MacroInfo.h"		#include "clang/Lex/MacroInfo.h"
#include "clang/Lex/ModuleLoader.h"		#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/ModuleMap.h"		#include "clang/Lex/ModuleMap.h"
#include "clang/Lex/PPCallbacks.h"		#include "clang/Lex/PPCallbacks.h"
#include "clang/Lex/PreprocessorExcludedConditionalDirectiveSkipMapping.h"		#include "clang/Lex/PreprocessorExcludedConditionalDirectiveSkipMapping.h"
#include "clang/Lex/Token.h"		#include "clang/Lex/Token.h"
#include "clang/Lex/TokenLexer.h"		#include "clang/Lex/TokenLexer.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
[102 lines hidden]	class Preprocessor {
LangOptions &LangOpts;		LangOptions &LangOpts;
const TargetInfo *Target = nullptr;		const TargetInfo *Target = nullptr;
const TargetInfo *AuxTarget = nullptr;		const TargetInfo *AuxTarget = nullptr;
FileManager &FileMgr;		FileManager &FileMgr;
SourceManager &SourceMgr;		SourceManager &SourceMgr;
std::unique_ptr<ScratchBuffer> ScratchBuf;		std::unique_ptr<ScratchBuffer> ScratchBuf;
HeaderSearch &HeaderInfo;		HeaderSearch &HeaderInfo;
ModuleLoader &TheModuleLoader;		ModuleLoader &TheModuleLoader;
		LiteralConverter LiteralConv;
		tahonermann (inline comment): I don't see a reason for `LT` to be a pointer. Can it be made a reference or, better, a non-reference, non-pointer data member?
		abhina.sreeskantharajan (author, inline comment): Thanks, I've changed it to a non-reference non-pointer member.
		rsmith (inline comment): Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to. Also: why `LT`? What does the `T` stand for?
		abhina.sreeskantharajan (author, inline comment): Thanks for catching this. This was a change I missed when renaming LiteralTranslator to LiteralConverter. I've added a longer name.

/// External source of macros.		/// External source of macros.
ExternalPreprocessorSource *ExternalSource;		ExternalPreprocessorSource *ExternalSource;

/// A BumpPtrAllocator object used to quickly allocate and release		/// A BumpPtrAllocator object used to quickly allocate and release
/// objects internal to the Preprocessor.		/// objects internal to the Preprocessor.
llvm::BumpPtrAllocator BP;		llvm::BumpPtrAllocator BP;

[774 lines hidden]	public:
SourceManager &getSourceManager() const { return SourceMgr; }		SourceManager &getSourceManager() const { return SourceMgr; }
HeaderSearch &getHeaderSearchInfo() const { return HeaderInfo; }		HeaderSearch &getHeaderSearchInfo() const { return HeaderInfo; }

IdentifierTable &getIdentifierTable() { return Identifiers; }		IdentifierTable &getIdentifierTable() { return Identifiers; }
const IdentifierTable &getIdentifierTable() const { return Identifiers; }		const IdentifierTable &getIdentifierTable() const { return Identifiers; }
SelectorTable &getSelectorTable() { return Selectors; }		SelectorTable &getSelectorTable() { return Selectors; }
Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }		Builtin::Context &getBuiltinInfo() { return *BuiltinInfo; }
llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }		llvm::BumpPtrAllocator &getPreprocessorAllocator() { return BP; }
		LiteralConverter &getLiteralConverter() { return LiteralConv; }

void setExternalSource(ExternalPreprocessorSource *Source) {		void setExternalSource(ExternalPreprocessorSource *Source) {
ExternalSource = Source;		ExternalSource = Source;
}		}

ExternalPreprocessorSource *getExternalSource() const {		ExternalPreprocessorSource *getExternalSource() const {
return ExternalSource;		return ExternalSource;
}		}
▲ Show 20 Lines • Show All 1,481 Lines • Show Last 20 Lines

clang/lib/Driver/ToolChains/Clang.cpp


	Show All 30 Lines
	#include "clang/Driver/Distro.h"			#include "clang/Driver/Distro.h"
	#include "clang/Driver/DriverDiagnostic.h"			#include "clang/Driver/DriverDiagnostic.h"
	#include "clang/Driver/Options.h"			#include "clang/Driver/Options.h"
	#include "clang/Driver/SanitizerArgs.h"			#include "clang/Driver/SanitizerArgs.h"
	#include "clang/Driver/XRayArgs.h"			#include "clang/Driver/XRayArgs.h"
	#include "llvm/ADT/StringExtras.h"			#include "llvm/ADT/StringExtras.h"
	#include "llvm/Config/llvm-config.h"			#include "llvm/Config/llvm-config.h"
	#include "llvm/Option/ArgList.h"			#include "llvm/Option/ArgList.h"
				#include "llvm/Support/CharSet.h"
	#include "llvm/Support/CodeGen.h"			#include "llvm/Support/CodeGen.h"
	#include "llvm/Support/Compiler.h"			#include "llvm/Support/Compiler.h"
	#include "llvm/Support/Compression.h"			#include "llvm/Support/Compression.h"
	#include "llvm/Support/FileSystem.h"			#include "llvm/Support/FileSystem.h"
	#include "llvm/Support/Host.h"			#include "llvm/Support/Host.h"
	#include "llvm/Support/Path.h"			#include "llvm/Support/Path.h"
	#include "llvm/Support/Process.h"			#include "llvm/Support/Process.h"
	#include "llvm/Support/TargetParser.h"			#include "llvm/Support/TargetParser.h"
	▲ Show 20 Lines • Show All 6,171 Lines • ▼ Show 20 Lines
	// -finput_charset=UTF-8 is default. Reject others			// -finput_charset=UTF-8 is default. Reject others
	if (Arg *inputCharset = Args.getLastArg(options::OPT_finput_charset_EQ)) {			if (Arg *inputCharset = Args.getLastArg(options::OPT_finput_charset_EQ)) {
	StringRef value = inputCharset->getValue();			StringRef value = inputCharset->getValue();
	if (!value.equals_lower("utf-8"))			if (!value.equals_lower("utf-8"))
	D.Diag(diag::err_drv_invalid_value) << inputCharset->getAsString(Args)			D.Diag(diag::err_drv_invalid_value) << inputCharset->getAsString(Args)
	<< value;			<< value;
	}			}

	// -fexec_charset=UTF-8 is default. Reject others			// Set the default fexec-charset as the system charset.
				CmdArgs.push_back("-fexec-charset");
				CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
	if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {			if (Arg *execCharset = Args.getLastArg(options::OPT_fexec_charset_EQ)) {
	StringRef value = execCharset->getValue();			StringRef value = execCharset->getValue();
	if (!value.equals_lower("utf-8"))			llvm::ErrorOr<llvm::CharSetConverter> ErrorOrConverter =
	D.Diag(diag::err_drv_invalid_value) << execCharset->getAsString(Args)			llvm::CharSetConverter::create("UTF-8", value.data());
	<< value;			if (ErrorOrConverter) {
				CmdArgs.push_back("-fexec-charset");
				CmdArgs.push_back(Args.MakeArgString(value));
				} else {
				D.Diag(diag::err_drv_invalid_value)
				rsmith [Done]: Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?
				abhina.sreeskantharajan (Author) [Done]: Thanks, I've changed it back to get the LastArg only and use the spelling of the argument to fix the diagnostic error message in the driver lit tests.
				<< execCharset->getAsString(Args) << value;
				}
	}			}
				tahonermann [Done]: I think it would be preferable to diagnose an unrecognized character encoding name here if possible. The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
				abhina.sreeskantharajan (Author) [Done]: Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
				tahonermann [Done]: Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?
				abhina.sreeskantharajan (Author) [Done]: I initially thought it would be a performance issue if we created the Converter twice, once here and once in the Preprocessor. But I do think it's a good idea to diagnose this early. I've modified the code to diagnose the error here.
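
Editor's sketch of the two points raised in this thread: look only at the last `-fexec-charset=` argument, and reject an unknown name once, before any compiler instance is created. It is plain C++ with no LLVM dependency. `isKnownCharset`, `lastExecCharset`, `checkExecCharset`, and the fixed list of names are hypothetical stand-ins; the patch itself decides validity by attempting `llvm::CharSetConverter::create`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for "the conversion library accepts this name".
// The comparison is case-insensitive.
inline bool isKnownCharset(std::string Name) {
  std::transform(Name.begin(), Name.end(), Name.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  static const std::vector<std::string> Known = {"utf-8", "ibm-1047"};
  return std::find(Known.begin(), Known.end(), Name) != Known.end();
}

// Returns the value of the last -fexec-charset=<name> argument, or an empty
// string when the option was not given. Earlier occurrences are ignored.
inline std::string lastExecCharset(const std::vector<std::string> &Args) {
  const std::string Prefix = "-fexec-charset=";
  std::string Last;
  for (const std::string &A : Args)
    if (A.compare(0, Prefix.size(), Prefix) == 0)
      Last = A.substr(Prefix.size());
  return Last;
}

// Validates once, up front. Returns an empty string on success and a
// diagnostic message otherwise.
inline std::string checkExecCharset(const std::vector<std::string> &Args) {
  const std::string Value = lastExecCharset(Args);
  if (Value.empty() || isKnownCharset(Value))
    return "";
  return "invalid value '" + Value + "' in '-fexec-charset=" + Value + "'";
}
```

With this shape, an earlier bad value that is later overridden produces no diagnostic, because only the last occurrence is inspected.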

	RenderDiagnosticsOptions(D, Args, CmdArgs);			RenderDiagnosticsOptions(D, Args, CmdArgs);

	// -fno-asm-blocks is default.			// -fno-asm-blocks is default.
	if (Args.hasFlag(options::OPT_fasm_blocks, options::OPT_fno_asm_blocks,			if (Args.hasFlag(options::OPT_fasm_blocks, options::OPT_fno_asm_blocks,
	false))			false))
				tahonermann [Done]: Thank you for adding this.
	CmdArgs.push_back("-fasm-blocks");			CmdArgs.push_back("-fasm-blocks");

	// -fgnu-inline-asm is default.			// -fgnu-inline-asm is default.
	if (!Args.hasFlag(options::OPT_fgnu_inline_asm,			if (!Args.hasFlag(options::OPT_fgnu_inline_asm,
	options::OPT_fno_gnu_inline_asm, true))			options::OPT_fno_gnu_inline_asm, true))
	CmdArgs.push_back("-fno-gnu-inline-asm");			CmdArgs.push_back("-fno-gnu-inline-asm");

	// Enable vectorization per default according to the optimization level			// Enable vectorization per default according to the optimization level
	▲ Show 20 Lines • Show All 1,390 Lines • Show Last 20 Lines

clang/lib/Frontend/CompilerInstance.cpp

//===--- CompilerInstance.cpp ---------------------------------------------===//		//===--- CompilerInstance.cpp ---------------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "clang/Frontend/CompilerInstance.h"		#include "clang/Frontend/CompilerInstance.h"
#include "clang/AST/ASTConsumer.h"		#include "clang/AST/ASTConsumer.h"
#include "clang/AST/ASTContext.h"		#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"		#include "clang/AST/Decl.h"
#include "clang/Basic/CharInfo.h"		#include "clang/Basic/CharInfo.h"
#include "clang/Basic/Diagnostic.h"		#include "clang/Basic/Diagnostic.h"
		#include "clang/Basic/DiagnosticDriver.h"
#include "clang/Basic/FileManager.h"		#include "clang/Basic/FileManager.h"
#include "clang/Basic/LangStandard.h"		#include "clang/Basic/LangStandard.h"
#include "clang/Basic/SourceManager.h"		#include "clang/Basic/SourceManager.h"
#include "clang/Basic/Stack.h"		#include "clang/Basic/Stack.h"
#include "clang/Basic/TargetInfo.h"		#include "clang/Basic/TargetInfo.h"
#include "clang/Basic/Version.h"		#include "clang/Basic/Version.h"
#include "clang/Config/config.h"		#include "clang/Config/config.h"
#include "clang/Frontend/ChainedDiagnosticConsumer.h"		#include "clang/Frontend/ChainedDiagnosticConsumer.h"
#include "clang/Frontend/FrontendAction.h"		#include "clang/Frontend/FrontendAction.h"
#include "clang/Frontend/FrontendActions.h"		#include "clang/Frontend/FrontendActions.h"
#include "clang/Frontend/FrontendDiagnostic.h"		#include "clang/Frontend/FrontendDiagnostic.h"
#include "clang/Frontend/LogDiagnosticPrinter.h"		#include "clang/Frontend/LogDiagnosticPrinter.h"
#include "clang/Frontend/SerializedDiagnosticPrinter.h"		#include "clang/Frontend/SerializedDiagnosticPrinter.h"
#include "clang/Frontend/TextDiagnosticPrinter.h"		#include "clang/Frontend/TextDiagnosticPrinter.h"
#include "clang/Frontend/Utils.h"		#include "clang/Frontend/Utils.h"
#include "clang/Frontend/VerifyDiagnosticConsumer.h"		#include "clang/Frontend/VerifyDiagnosticConsumer.h"
#include "clang/Lex/HeaderSearch.h"		#include "clang/Lex/HeaderSearch.h"
		#include "clang/Lex/LiteralConverter.h"
#include "clang/Lex/Preprocessor.h"		#include "clang/Lex/Preprocessor.h"
#include "clang/Lex/PreprocessorOptions.h"		#include "clang/Lex/PreprocessorOptions.h"
#include "clang/Sema/CodeCompleteConsumer.h"		#include "clang/Sema/CodeCompleteConsumer.h"
#include "clang/Sema/Sema.h"		#include "clang/Sema/Sema.h"
#include "clang/Serialization/ASTReader.h"		#include "clang/Serialization/ASTReader.h"
#include "clang/Serialization/GlobalModuleIndex.h"		#include "clang/Serialization/GlobalModuleIndex.h"
#include "clang/Serialization/InMemoryModuleCache.h"		#include "clang/Serialization/InMemoryModuleCache.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
▲ Show 20 Lines • Show All 484 Lines • ▼ Show 20 Lines	AttachHeaderIncludeGen(*PP, DepOpts,
/*ShowDepth=*/false);		/*ShowDepth=*/false);
}		}

if (DepOpts.ShowIncludesDest != ShowIncludesDestination::None) {		if (DepOpts.ShowIncludesDest != ShowIncludesDestination::None) {
AttachHeaderIncludeGen(*PP, DepOpts,		AttachHeaderIncludeGen(*PP, DepOpts,
/*ShowAllHeaders=*/true, /*OutputPath=*/"",		/*ShowAllHeaders=*/true, /*OutputPath=*/"",
/*ShowDepth=*/true, /*MSStyle=*/true);		/*ShowDepth=*/true, /*MSStyle=*/true);
}		}
		PP->getLiteralConverter().setConvertersFromOptions(getLangOpts(), getTarget(),
		getDiagnostics());
}		}

std::string CompilerInstance::getSpecificModuleCachePath(StringRef ModuleHash) {		std::string CompilerInstance::getSpecificModuleCachePath(StringRef ModuleHash) {
// Set up the module path, including the hash for the module-creation options.		// Set up the module path, including the hash for the module-creation options.
SmallString<256> SpecificModuleCache(getHeaderSearchOpts().ModuleCachePath);		SmallString<256> SpecificModuleCache(getHeaderSearchOpts().ModuleCachePath);
if (!SpecificModuleCache.empty() && !getHeaderSearchOpts().DisableModuleHash)		if (!SpecificModuleCache.empty() && !getHeaderSearchOpts().DisableModuleHash)
llvm::sys::path::append(SpecificModuleCache, ModuleHash);		llvm::sys::path::append(SpecificModuleCache, ModuleHash);
return std::string(SpecificModuleCache.str());		return std::string(SpecificModuleCache.str());
▲ Show 20 Lines • Show All 1,657 Lines • Show Last 20 Lines

clang/lib/Frontend/InitPreprocessor.cpp

Show First 20 Lines • Show All 772 Lines • ▼ Show 20 Lines	#undef TOSTR2
if (LangOpts.MicrosoftExt) {		if (LangOpts.MicrosoftExt) {
if (LangOpts.WChar) {		if (LangOpts.WChar) {
// wchar_t supported as a keyword.		// wchar_t supported as a keyword.
Builder.defineMacro("_WCHAR_T_DEFINED");		Builder.defineMacro("_WCHAR_T_DEFINED");
Builder.defineMacro("_NATIVE_WCHAR_T_DEFINED");		Builder.defineMacro("_NATIVE_WCHAR_T_DEFINED");
}		}
}		}

// Macros to help identify the narrow and wide character sets		// Macros to help identify the narrow and wide character sets. This is set
// FIXME: clang currently ignores -fexec-charset=. If this changes,		// to fexec-charset. If fexec-charset is not specified, the default is the
// then this may need to be updated.		// system charset.
Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");		if (!LangOpts.ExecCharset.empty())
		Builder.defineMacro("__clang_literal_encoding__", LangOpts.ExecCharset);
		else
		Builder.defineMacro("__clang_literal_encoding__",
		TI.getTriple().getSystemCharset());
if (TI.getTypeWidth(TI.getWCharType()) >= 32) {		if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
// FIXME: 32-bit wchar_t signals UTF-32. This may change		// FIXME: 32-bit wchar_t signals UTF-32. This may change
// if -fwide-exec-charset= is ever supported.		// if -fwide-exec-charset= is ever supported.
Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");		Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
} else {		} else {
// FIXME: Less-than 32-bit wchar_t generally means UTF-16		// FIXME: Less-than 32-bit wchar_t generally means UTF-16
// (e.g., Windows, 32-bit IBM). This may need to be		// (e.g., Windows, 32-bit IBM). This may need to be
// updated if -fwide-exec-charset= is ever supported.		// updated if -fwide-exec-charset= is ever supported.
▲ Show 20 Lines • Show All 460 Lines • Show Last 20 Lines

clang/lib/Lex/CMakeLists.txt

	# TODO: Add -maltivec when ARCH is PowerPC.			# TODO: Add -maltivec when ARCH is PowerPC.

	set(LLVM_LINK_COMPONENTS support)			set(LLVM_LINK_COMPONENTS support)

	add_clang_library(clangLex			add_clang_library(clangLex
	DependencyDirectivesSourceMinimizer.cpp			DependencyDirectivesSourceMinimizer.cpp
	HeaderMap.cpp			HeaderMap.cpp
	HeaderSearch.cpp			HeaderSearch.cpp
	Lexer.cpp			Lexer.cpp
				LiteralConverter.cpp
	LiteralSupport.cpp			LiteralSupport.cpp
	MacroArgs.cpp			MacroArgs.cpp
	MacroInfo.cpp			MacroInfo.cpp
	ModuleMap.cpp			ModuleMap.cpp
	PPCaching.cpp			PPCaching.cpp
	PPCallbacks.cpp			PPCallbacks.cpp
	PPConditionalDirectiveRecord.cpp			PPConditionalDirectiveRecord.cpp
	PPDirectives.cpp			PPDirectives.cpp
	Show All 14 Lines

clang/lib/Lex/LiteralConverter.cpp

This file was added.

				//===--- LiteralConverter.cpp - Translator for String Literals -----------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "clang/Lex/LiteralConverter.h"
				#include "clang/Basic/DiagnosticDriver.h"

				using namespace llvm;

				llvm::CharSetConverter *LiteralConverter::getConverter(const char *Codepage) {
				auto Iter = CharsetConverters.find(Codepage);
				if (Iter != CharsetConverters.end())
				return &Iter->second;
				return nullptr;
				}

				llvm::CharSetConverter *
				LiteralConverter::getConverter(ConversionAction Action) {
				StringRef CodePage;
				if (Action == ToSystemCharset)
				CodePage = SystemCharset;
				else if (Action == ToExecCharset)
				CodePage = ExecCharset;
				else
				CodePage = InternalCharset;
				return getConverter(CodePage.data());
				}

				llvm::CharSetConverter *
				LiteralConverter::createAndInsertCharConverter(const char *To) {
				const char *From = InternalCharset.data();
				llvm::CharSetConverter *Converter = getConverter(To);
				if (Converter)
				return Converter;

				ErrorOr<CharSetConverter> ErrorOrConverter =
				llvm::CharSetConverter::create(From, To);
				if (!ErrorOrConverter)
				return nullptr;
				CharsetConverters.insert_or_assign(StringRef(To),
				std::move(*ErrorOrConverter));
				return getConverter(To);
				}

				void LiteralConverter::setConvertersFromOptions(
				const clang::LangOptions &Opts, const clang::TargetInfo &TInfo,
				clang::DiagnosticsEngine &Diags) {
				using namespace llvm;
				SystemCharset = TInfo.getTriple().getSystemCharset();
				InternalCharset = "UTF-8";
				ExecCharset = Opts.ExecCharset.empty() ? InternalCharset : Opts.ExecCharset;
				// Create converter between internal and system charset
				if (!InternalCharset.equals(SystemCharset))
				createAndInsertCharConverter(SystemCharset.data());

				// Create converter between internal and exec charset specified
				// in fexec-charset option.
				if (InternalCharset.equals(ExecCharset))
				return;
				if (!createAndInsertCharConverter(ExecCharset.data())) {
				Diags.Report(clang::diag::err_drv_invalid_value)
				<< "-fexec-charset" << ExecCharset;
				}
				}
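
Editor's sketch of the lookup-then-create pattern that `createAndInsertCharConverter` is built around, written as plain C++ so it can be compiled on its own. `ToyConverter` and `ToyConverterCache` are hypothetical names, and an empty target name stands in for "the conversion library rejected this charset". The point of interest is the return value after insertion: the function hands back a pointer to the entry that now lives in the map.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for llvm::CharSetConverter; it only records names.
struct ToyConverter {
  std::string From, To;
};

class ToyConverterCache {
  std::map<std::string, ToyConverter> Converters;
  std::string Internal = "UTF-8";

public:
  // Returns the cached converter for To, or nullptr if none exists yet.
  ToyConverter *get(const std::string &To) {
    auto It = Converters.find(To);
    return It == Converters.end() ? nullptr : &It->second;
  }

  // Creates the converter on first use and returns a pointer to the cached
  // copy. Returns nullptr when creation fails (modelled by an empty name).
  ToyConverter *getOrCreate(const std::string &To) {
    if (ToyConverter *Existing = get(To))
      return Existing;
    if (To.empty())
      return nullptr;
    auto Inserted = Converters.emplace(To, ToyConverter{Internal, To});
    assert(Inserted.second && "get() above ruled out an existing entry");
    return &Inserted.first->second;
  }
};
```

`std::map` keeps element addresses stable across later insertions, so the returned pointer stays valid for as long as the cache lives.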

clang/lib/Lex/LiteralSupport.cpp

Show First 20 Lines • Show All 87 Lines • ▼ Show 20 Lines

/// ProcessCharEscape - Parse a standard C escape sequence, which can occur in		/// ProcessCharEscape - Parse a standard C escape sequence, which can occur in
/// either a character or a string literal.		/// either a character or a string literal.
static unsigned ProcessCharEscape(const char *ThisTokBegin,		static unsigned ProcessCharEscape(const char *ThisTokBegin,
const char *&ThisTokBuf,		const char *&ThisTokBuf,
const char *ThisTokEnd, bool &HadError,		const char *ThisTokEnd, bool &HadError,
FullSourceLoc Loc, unsigned CharWidth,		FullSourceLoc Loc, unsigned CharWidth,
DiagnosticsEngine *Diags,		DiagnosticsEngine *Diags,
const LangOptions &Features) {		const LangOptions &Features,
		llvm::CharSetConverter *Converter) {
const char *EscapeBegin = ThisTokBuf;		const char *EscapeBegin = ThisTokBuf;

// Skip the '\' char.		// Skip the '\' char.
++ThisTokBuf;		++ThisTokBuf;

// We know that this character can't be off the end of the buffer, because		// We know that this character can't be off the end of the buffer, because
// that would have been \", which would not have been the end of string.		// that would have been \", which would not have been the end of string.
unsigned ResultChar = *ThisTokBuf++;		unsigned ResultChar = *ThisTokBuf++;
		bool Translate = true;
		bool Invalid = false;
switch (ResultChar) {		switch (ResultChar) {
// These map to themselves.		// These map to themselves.
case '\\': case '\'': case '"': case '?': break;		case '\\': case '\'': case '"': case '?': break;

// These have fixed mappings.		// These have fixed mappings.
case 'a':		case 'a':
// TODO: K&R: the meaning of '\\a' is different in traditional C		// TODO: K&R: the meaning of '\\a' is different in traditional C
ResultChar = 7;		ResultChar = 7;
Show All 24 Lines	case 'r':
break;		break;
case 't':		case 't':
ResultChar = 9;		ResultChar = 9;
break;		break;
case 'v':		case 'v':
ResultChar = 11;		ResultChar = 11;
break;		break;
case 'x': { // Hex escape.		case 'x': { // Hex escape.
		Translate = false;
ResultChar = 0;		ResultChar = 0;
if (ThisTokBuf == ThisTokEnd || !isHexDigit(*ThisTokBuf)) {		if (ThisTokBuf == ThisTokEnd || !isHexDigit(*ThisTokBuf)) {
if (Diags)		if (Diags)
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::err_hex_escape_no_digits) << "x";		diag::err_hex_escape_no_digits) << "x";
HadError = true;		HadError = true;
break;		break;
}		}
Show All 21 Lines	if (Overflow && Diags) // Too many digits to fit in
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::err_escape_too_large) << 0;		diag::err_escape_too_large) << 0;
break;		break;
}		}
case '0': case '1': case '2': case '3':		case '0': case '1': case '2': case '3':
case '4': case '5': case '6': case '7': {		case '4': case '5': case '6': case '7': {
// Octal escapes.		// Octal escapes.
--ThisTokBuf;		--ThisTokBuf;
		Translate = false;
ResultChar = 0;		ResultChar = 0;

// Octal escapes are a series of octal digits with maximum length 3.		// Octal escapes are a series of octal digits with maximum length 3.
// "\0123" is a two digit sequence equal to "\012" "3".		// "\0123" is a two digit sequence equal to "\012" "3".
unsigned NumDigits = 0;		unsigned NumDigits = 0;
do {		do {
ResultChar <<= 3;		ResultChar <<= 3;
ResultChar |= *ThisTokBuf++ - '0';		ResultChar |= *ThisTokBuf++ - '0';
Show All 15 Lines	static unsigned ProcessCharEscape(const char *ThisTokBegin,
case '(': case '{': case '[': case '%':		case '(': case '{': case '[': case '%':
// GCC accepts these as extensions. We warn about them as such though.		// GCC accepts these as extensions. We warn about them as such though.
if (Diags)		if (Diags)
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::ext_nonstandard_escape)		diag::ext_nonstandard_escape)
<< std::string(1, ResultChar);		<< std::string(1, ResultChar);
break;		break;
default:		default:
		Invalid = true;
if (!Diags)		if (!Diags)
break;		break;

if (isPrintable(ResultChar))		if (isPrintable(ResultChar))
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::ext_unknown_escape)		diag::ext_unknown_escape)
<< std::string(1, ResultChar);		<< std::string(1, ResultChar);
else		else
Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,		Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
diag::ext_unknown_escape)		diag::ext_unknown_escape)
<< "x" + llvm::utohexstr(ResultChar);		<< "x" + llvm::utohexstr(ResultChar);
break;		break;
}		}

		if (Translate && Converter) {
		// Invalid escapes are written as '?' and then translated.
		char ByteChar = Invalid ? '?' : ResultChar;
		SmallString<8> ResultCharConv;
		tahonermann [Done]: What should happen if `ResultChar` >= 0x100? IBM-1047 does have representation for other UTF-8 characters. Regardless, it seems `ResultChar` should be converted to something.
		abhina.sreeskantharajan (Author) [Done]: This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.
		tahonermann [Done]: Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
		abhina.sreeskantharajan (Author) [Done]: Thanks, I added the substitution with the '?' character for invalid escapes.
		rsmith [Not Done]: This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, `\q` should produce `q` -- and take fallback action only if the character is unencodable. Producing a `?` seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.
		abhina.sreeskantharajan (Author) [Not Done]: Hi @tahonermann, do you also agree we should use the original behaviour or give a hard error instead?
		Converter->convert(StringRef(&ByteChar, 1), ResultCharConv);
		tahonermann [Done]: As Richard previously noted, this `memcpy()` needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code unit? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.
		abhina.sreeskantharajan (Author) [Done]: I replaced memcpy with an assignment. Please let me know if there is a better solution.
		rsmith [Done]: Can you avoid using `std::string` here? Eg, pass a `StringRef`, extending the converter to be able to take one if necessary.
		assert(ResultCharConv.size() == 1 &&
		rsmith [Done]: Reinterpreting an `unsigned` as a `char` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.
		abhina.sreeskantharajan (Author) [Done]: Thanks, I've created a char instead.
		"Char size increased after translation");
		ResultChar = ResultCharConv[0];
		rsmith [Done]: What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
		abhina.sreeskantharajan (Author) [Done]: This may be a problem we need to revisit since ResultChar is expecting a char.
		abhina.sreeskantharajan (Author) [Done]: I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.
		rsmith [Not Done]: Is there any guarantee the assertion will not fail?
		}
		rsmith [Not Done]: Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
		tahonermann [Not Done]: For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence. I suspect support for stateful encodings is not a goal at this time.
		abhina.sreeskantharajan (Author) [Not Done]: Right, stateful encodings may be a problem we will need to revisit later as well.
return ResultChar;		return ResultChar;
}		}
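
Editor's sketch of one escape-translation policy discussed in the threads above, in plain C++. It is not the behaviour of this revision (which substitutes `?`), and the question was left unresolved in review. In the sketch, numeric escapes (`\x..`, octal) are taken literally, simple and unknown escapes keep their character and are translated, and an unencodable value is a hard error. `toyToEbcdic`, `EscapeKind`, and `translateEscape` are hypothetical names; the table is a deliberately tiny subset of IBM-1047 rather than a real converter.

```cpp
#include <cassert>
#include <stdexcept>

// Partial ASCII -> IBM-1047 table covering only the characters used here.
// Returns -1 for anything this toy table cannot encode.
inline int toyToEbcdic(unsigned char C) {
  switch (C) {
  case '\t': return 0x05;
  case '?':  return 0x6F;
  case 'a':  return 0x81;
  case 'q':  return 0x98;
  default:   return -1;
  }
}

enum class EscapeKind {
  Simple,  // e.g. \t, already mapped to its ASCII value by the caller
  Numeric, // \x.. or octal: the value is used as written, never translated
  Unknown  // e.g. \q: the character after the backslash is kept
};

// Returns the execution-charset value for one escape sequence.
// Throws std::runtime_error when the value cannot be encoded.
inline unsigned translateEscape(EscapeKind Kind, unsigned Value) {
  if (Kind == EscapeKind::Numeric)
    return Value;
  // Hold the byte in a real unsigned char instead of reinterpreting the
  // storage of an unsigned, so host endianness plays no role.
  int Converted = -1;
  if (Value < 0x80)
    Converted = toyToEbcdic(static_cast<unsigned char>(Value));
  if (Converted < 0)
    throw std::runtime_error("escape not encodable in execution charset");
  assert(Converted <= 0xFF && "toy table only produces single bytes");
  return static_cast<unsigned>(Converted);
}
```

Under this policy `\q` yields the execution-charset encoding of `q`, and a stray UTF-8 lead byte such as 0xC3 is rejected rather than silently replaced.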

static void appendCodePoint(unsigned Codepoint,		static void appendCodePoint(unsigned Codepoint,
llvm::SmallVectorImpl<char> &Str) {		llvm::SmallVectorImpl<char> &Str) {
char ResultBuf[4];		char ResultBuf[4];
char *ResultPtr = ResultBuf;		char *ResultPtr = ResultBuf;
bool Res = llvm::ConvertCodePointToUTF8(Codepoint, ResultPtr);		bool Res = llvm::ConvertCodePointToUTF8(Codepoint, ResultPtr);
▲ Show 20 Lines • Show All 1,021 Lines • ▼ Show 20 Lines
///		///
CharLiteralParser::CharLiteralParser(const char *begin, const char *end,		CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
SourceLocation Loc, Preprocessor &PP,		SourceLocation Loc, Preprocessor &PP,
tok::TokenKind kind) {		tok::TokenKind kind) {
// At this point we know that the character matches the regex "(L|u|U)?'.*'".		// At this point we know that the character matches the regex "(L|u|U)?'.*'".
HadError = false;		HadError = false;

Kind = kind;		Kind = kind;
		LiteralConverter *LiteralConv = &PP.getLiteralConverter();

const char *TokBegin = begin;		const char *TokBegin = begin;

// Skip over wide character determinant.		// Skip over wide character determinant.
if (Kind != tok::char_constant)		if (Kind != tok::char_constant)
++begin;		++begin;
if (Kind == tok::utf8_char_constant)		if (Kind == tok::utf8_char_constant)
++begin;		++begin;
▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines	CharLiteralParser::CharLiteralParser(const char *begin, const char *end,
} else if (tok::utf16_char_constant == Kind) {		} else if (tok::utf16_char_constant == Kind) {
largest_character_for_kind = 0xFFFF;		largest_character_for_kind = 0xFFFF;
} else if (tok::utf32_char_constant == Kind) {		} else if (tok::utf32_char_constant == Kind) {
largest_character_for_kind = 0x10FFFF;		largest_character_for_kind = 0x10FFFF;
} else {		} else {
largest_character_for_kind = 0x7Fu;		largest_character_for_kind = 0x7Fu;
}		}

		llvm::CharSetConverter *Converter = nullptr;
		if (!isUTFLiteral(Kind) && LiteralConv)
		rsmith [Done]: Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)
		Converter = LiteralConv->getConverter(ToExecCharset);
		tahonermann [Done]: Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
		abhina.sreeskantharajan (Author) [Done]: Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?
		tahonermann [Done]: Perhaps an internal compiler error to indicate that appropriate support is not yet in place?
		abhina.sreeskantharajan (Author) [Done]: Thanks for the suggestion. I've added assertions for wide character translation before we do any translation.
		tahonermann [Done]: Per the comment associated with the constructor declaration, I don't think the new constructor parameter is needed; translation to execution character set is always desired for non-UTF character literals. I think this can be something like: `llvm::CharSetConverter *Converter = nullptr; if (!isUTFLiteral(Kind)) { assert(LT); Converter = LT->getCharConversionTable(TranslateToExecCharset); }`
		abhina.sreeskantharajan (Author) [Done]: I can't add an assertion here because LT might not be created in the case of the second StringLiteralParser constructor, which does not pass the Preprocessor. But I have added the remaining changes.
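
The selection rule the reviewers describe in this thread (no converter for UTF literals, a wide target for `L` literals, the narrow execution charset otherwise) can be sketched as follows. This is a minimal illustration only: `LiteralKind`, `TargetEncoding`, and `pickTargetEncoding` are hypothetical names, not Clang APIs, and the sketch assumes a non-z/OS target where the wide execution character set is UTF-16 or UTF-32 depending on the width of `wchar_t`.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not Clang's token kinds.
enum class LiteralKind { Narrow, Wide, UTF8, UTF16, UTF32 };
enum class TargetEncoding { None, NarrowExec, UTF16, UTF32 };

// Picks the conversion target for a literal kind on a non-z/OS target.
// WCharBits is the width of wchar_t in bits (assumed to be 16 or 32 here).
TargetEncoding pickTargetEncoding(LiteralKind Kind, unsigned WCharBits) {
  assert((WCharBits == 16 || WCharBits == 32) && "unexpected wchar_t width");
  switch (Kind) {
  case LiteralKind::UTF8:
  case LiteralKind::UTF16:
  case LiteralKind::UTF32:
    // UTF literals keep their Unicode encoding: no converter at all.
    return TargetEncoding::None;
  case LiteralKind::Wide:
    // Wide literals follow the size of wchar_t, not the narrow charset.
    return WCharBits == 16 ? TargetEncoding::UTF16 : TargetEncoding::UTF32;
  case LiteralKind::Narrow:
    // Ordinary literals use the narrow execution charset (-fexec-charset).
    return TargetEncoding::NarrowExec;
  }
  return TargetEncoding::None;
}
```

On z/OS the `Wide` case would presumably need a wide EBCDIC target instead, which this sketch does not model.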

while (begin != end) {		while (begin != end) {
// Is this a span of non-escape characters?		// Is this a span of non-escape characters?
if (begin[0] != '\\') {		if (begin[0] != '\\') {
char const *start = begin;		char const *start = begin;
do {		do {
++begin;		++begin;
} while (begin != end && *begin != '\\');		} while (begin != end && *begin != '\\');

[21 lines not shown]	if (begin[0] != '\\') {
HadError = true;		HadError = true;
}		}
} else {		} else {
for (; tmp_out_start < buffer_begin; ++tmp_out_start) {		for (; tmp_out_start < buffer_begin; ++tmp_out_start) {
if (*tmp_out_start > largest_character_for_kind) {		if (*tmp_out_start > largest_character_for_kind) {
HadError = true;		HadError = true;
PP.Diag(Loc, diag::err_character_too_large);		PP.Diag(Loc, diag::err_character_too_large);
}		}
		if (!HadError && Converter) {
		assert(Kind != tok::wide_char_constant &&
		"Wide character translation not supported");
		rsmith [Done]: Why is this case not possible?
		abhina.sreeskantharajan (Author) [Done]: This case should be handled when the fwide-exec-charset option is implemented. Until then, we thought it was best to emit an error message that wide literal translation is not supported.
		char ByteChar = *tmp_out_start;
		SmallString<1> ConvertedChar;
		Converter->convert(StringRef(&ByteChar, 1), ConvertedChar);
		rsmith [Done]: What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.
		abhina.sreeskantharajan (Author) [Done]: Right, I'll add a similar assertion to the one we have above.
		assert(ConvertedChar.size() == 1 &&
		"Char size increased after translation");
		*tmp_out_start = ConvertedChar[0];
		}
}		}
}		}
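
rsmith's suggestion above, rejecting the input rather than asserting when a character does not convert to exactly one output character, could look roughly like the sketch below. `ConvertFn`, `convertSingleByte`, and the two toy converters are hypothetical and exist only for illustration; the real `llvm::CharSetConverter` interface from the patch is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical converter signature: input bytes in, converted bytes out.
using ConvertFn = std::function<std::string(const std::string &)>;

// Converts a single byte. Returns false when the converter does not produce
// exactly one byte, so the caller can emit a diagnostic instead of asserting.
bool convertSingleByte(const ConvertFn &Convert, char In, char &Out) {
  assert(Convert && "converter must be callable");
  const std::string Converted = Convert(std::string(1, In));
  if (Converted.size() != 1)
    return false; // A diagnostic would be reported at this point.
  Out = Converted[0];
  return true;
}

// Toy converter: one byte in, one byte out ('a' becomes 0x81).
std::string toyOneToOne(const std::string &In) {
  std::string Out = In;
  for (char &C : Out)
    if (C == 'a')
      C = '\x81';
  return Out;
}

// Toy converter that doubles its input, to exercise the failure path.
std::string toyExpanding(const std::string &In) { return In + In; }
```

Returning a status keeps the failure visible in release builds, where an `assert` would be compiled out.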

continue;		continue;
}		}
// Is this a Universal Character Name escape?		// Is this a Universal Character Name escape?
if (begin[1] == 'u' || begin[1] == 'U') {		if (begin[1] == 'u' || begin[1] == 'U') {
unsigned short UcnLen = 0;		unsigned short UcnLen = 0;
if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,		if (!ProcessUCNEscape(TokBegin, begin, end, *buffer_begin, UcnLen,
FullSourceLoc(Loc, PP.getSourceManager()),		FullSourceLoc(Loc, PP.getSourceManager()),
&PP.getDiagnostics(), PP.getLangOpts(), true)) {		&PP.getDiagnostics(), PP.getLangOpts(), true)) {
HadError = true;		HadError = true;
} else if (*buffer_begin > largest_character_for_kind) {		} else if (*buffer_begin > largest_character_for_kind) {
HadError = true;		HadError = true;
PP.Diag(Loc, diag::err_character_too_large);		PP.Diag(Loc, diag::err_character_too_large);
}		}

++buffer_begin;		++buffer_begin;
continue;		continue;
}		}
unsigned CharWidth = getCharWidth(Kind, PP.getTargetInfo());		unsigned CharWidth = getCharWidth(Kind, PP.getTargetInfo());
uint64_t result =		uint64_t result =
ProcessCharEscape(TokBegin, begin, end, HadError,		ProcessCharEscape(TokBegin, begin, end, HadError,
FullSourceLoc(Loc,PP.getSourceManager()),		FullSourceLoc(Loc, PP.getSourceManager()), CharWidth,
CharWidth, &PP.getDiagnostics(), PP.getLangOpts());		&PP.getDiagnostics(), PP.getLangOpts(), nullptr);
*buffer_begin++ = result;		*buffer_begin++ = result;
}		}

unsigned NumCharsSoFar = buffer_begin - &codepoint_buffer.front();		unsigned NumCharsSoFar = buffer_begin - &codepoint_buffer.front();

if (NumCharsSoFar > 1) {		if (NumCharsSoFar > 1) {
if (isWide())		if (isWide())
PP.Diag(Loc, diag::warn_extraneous_char_constant);		PP.Diag(Loc, diag::warn_extraneous_char_constant);
[91 lines not shown]
/// hexadecimal-escape-sequence hexadecimal-digit		/// hexadecimal-escape-sequence hexadecimal-digit
/// universal-character-name:		/// universal-character-name:
/// \u hex-quad		/// \u hex-quad
/// \U hex-quad hex-quad		/// \U hex-quad hex-quad
/// hex-quad:		/// hex-quad:
/// hex-digit hex-digit hex-digit hex-digit		/// hex-digit hex-digit hex-digit hex-digit
/// \endverbatim		/// \endverbatim
///		///
StringLiteralParser::
StringLiteralParser(ArrayRef<Token> StringToks,		StringLiteralParser::StringLiteralParser(ArrayRef<Token> StringToks,
Preprocessor &PP, bool Complain)		Preprocessor &PP, bool Complain,
		ConversionAction Action)
: SM(PP.getSourceManager()), Features(PP.getLangOpts()),		: SM(PP.getSourceManager()), Features(PP.getLangOpts()),
Target(PP.getTargetInfo()), Diags(Complain ? &PP.getDiagnostics() :nullptr),		Target(PP.getTargetInfo()),
MaxTokenLength(0), SizeBound(0), CharByteWidth(0), Kind(tok::unknown),		Diags(Complain ? &PP.getDiagnostics() : nullptr),
ResultPtr(ResultBuf.data()), hadError(false), Pascal(false) {		LiteralConv(&PP.getLiteralConverter()), MaxTokenLength(0), SizeBound(0),
init(StringToks);		CharByteWidth(0), Kind(tok::unknown), ResultPtr(ResultBuf.data()),
		hadError(false), Pascal(false) {
		init(StringToks, Action);
}		}

void StringLiteralParser::init(ArrayRef<Token> StringToks){		void StringLiteralParser::init(ArrayRef<Token> StringToks,
		ConversionAction Action) {
// The literal token may have come from an invalid source location (e.g. due		// The literal token may have come from an invalid source location (e.g. due
// to a PCH error), in which case the token length will be 0.		// to a PCH error), in which case the token length will be 0.
if (StringToks.empty() || StringToks[0].getLength() < 2)		if (StringToks.empty() || StringToks[0].getLength() < 2)
return DiagnoseLexingError(SourceLocation());		return DiagnoseLexingError(SourceLocation());

// Scan all of the string portions, remember the max individual token length,		// Scan all of the string portions, remember the max individual token length,
// computing a bound on the concatenated string length, and see whether any		// computing a bound on the concatenated string length, and see whether any
// piece is a wide-string. If any of the string portions is a wide-string		// piece is a wide-string. If any of the string portions is a wide-string
[59 lines not shown]	void StringLiteralParser::init(ArrayRef<Token> StringToks,
// Loop over all the strings, getting their spelling, and expanding them to		// Loop over all the strings, getting their spelling, and expanding them to
// wide strings as appropriate.		// wide strings as appropriate.
ResultPtr = &ResultBuf[0]; // Next byte to fill in.		ResultPtr = &ResultBuf[0]; // Next byte to fill in.

Pascal = false;		Pascal = false;

SourceLocation UDSuffixTokLoc;		SourceLocation UDSuffixTokLoc;

		llvm::CharSetConverter *Converter = nullptr;
		if (!isUTFLiteral(Kind) && LiteralConv)
		Converter = LiteralConv->getConverter(Action);
		tahonermann [Done]: Converting wide string literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
		abhina.sreeskantharajan (Author) [Done]: I've now added an assertion when translating wide characters.

for (unsigned i = 0, e = StringToks.size(); i != e; ++i) {		for (unsigned i = 0, e = StringToks.size(); i != e; ++i) {
		tahonermann [Not Done]: The stored `TranslationState` should not be completely ignored for wide and UTF string literals. The standard permits things like the following: `#pragma rigoot L"bozit"`, `#pragma rigoot u"bozit"`, `_Pragma(L"rigoot bozit")`, `_Pragma(u8"rigoot bozit")`. For at least the `_Pragma(L"...")` case, the C++ standard states the `L` is ignored, but it doesn't say anything about other encoding prefixes.
		abhina.sreeskantharajan (Author) [Done]: Please correct me if I'm wrong, but these Pragma strings are not parsed through StringLiteralParser; they are parsed in clang/lib/Lex/Pragma.cpp in this function: `void Preprocessor::Handle_Pragma(Token &Tok)`. So if they require translation, it would need to be done in that function.
		tahonermann [Not Done]: Ah, ok, good. There are other cases where a string literal is not used to produce a string literal object. See https://wg21.link/p2314 for a table. You may want to audit for those cases.
const char *ThisTokBuf = &TokenBuf[0];		const char *ThisTokBuf = &TokenBuf[0];
// Get the spelling of the token, which eliminates trigraphs, etc. We know		// Get the spelling of the token, which eliminates trigraphs, etc. We know
// that ThisTokBuf points to a buffer that is big enough for the whole token		// that ThisTokBuf points to a buffer that is big enough for the whole token
// and 'spelled' tokens can only shrink.		// and 'spelled' tokens can only shrink.
bool StringInvalid = false;		bool StringInvalid = false;
unsigned ThisTokLen =		unsigned ThisTokLen =
Lexer::getSpelling(StringToks[i], ThisTokBuf, SM, Features,		Lexer::getSpelling(StringToks[i], ThisTokBuf, SM, Features,
&StringInvalid);		&StringInvalid);
[90 lines not shown]	if (ThisTokBuf[0] == 'R') {
size_t CRLFPos = RemainingTokenSpan.find("\r\n");		size_t CRLFPos = RemainingTokenSpan.find("\r\n");
StringRef BeforeCRLF = RemainingTokenSpan.substr(0, CRLFPos);		StringRef BeforeCRLF = RemainingTokenSpan.substr(0, CRLFPos);
StringRef AfterCRLF = RemainingTokenSpan.substr(CRLFPos);		StringRef AfterCRLF = RemainingTokenSpan.substr(CRLFPos);

// Copy everything before the \r\n sequence into the string literal.		// Copy everything before the \r\n sequence into the string literal.
if (CopyStringFragment(StringToks[i], ThisTokBegin, BeforeCRLF))		if (CopyStringFragment(StringToks[i], ThisTokBegin, BeforeCRLF))
hadError = true;		hadError = true;

		if (!hadError && Converter) {
		assert(Kind != tok::wide_string_literal &&
		"Wide character translation not supported");
		SmallString<256> CpConv;
		int ResultLength = BeforeCRLF.size() * CharByteWidth;
		char *Cp = ResultPtr - ResultLength;
		Converter->convert(StringRef(Cp, ResultLength), CpConv);
		memmove(Cp, CpConv.data(), ResultLength);
		ResultPtr = Cp + CpConv.size();
		}
// Point into the \n inside the \r\n sequence and operate on the		// Point into the \n inside the \r\n sequence and operate on the
// remaining portion of the literal.		// remaining portion of the literal.
RemainingTokenSpan = AfterCRLF.substr(1);		RemainingTokenSpan = AfterCRLF.substr(1);
		rsmith [Not Done]: Do we need to convert the newline character too? Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?
		abhina.sreeskantharajan (Author) [Done]: Yes, we need to convert newlines as well. I think the current behaviour is already converting multi-line raw strings correctly. I'll add a testcase for this.
}		}
} else {		} else {
if (ThisTokBuf[0] != '"') {		if (ThisTokBuf[0] != '"') {
// The file may have come from PCH and then changed after loading the		// The file may have come from PCH and then changed after loading the
// PCH; Fail gracefully.		// PCH; Fail gracefully.
return DiagnoseLexingError(StringToks[i].getLocation());		return DiagnoseLexingError(StringToks[i].getLocation());
}		}
++ThisTokBuf; // skip "		++ThisTokBuf; // skip "
[14 lines not shown]	if (ThisTokBuf[0] == 'R') {
while (ThisTokBuf != ThisTokEnd) {		while (ThisTokBuf != ThisTokEnd) {
// Is this a span of non-escape characters?		// Is this a span of non-escape characters?
if (ThisTokBuf[0] != '\\') {		if (ThisTokBuf[0] != '\\') {
const char *InStart = ThisTokBuf;		const char *InStart = ThisTokBuf;
do {		do {
++ThisTokBuf;		++ThisTokBuf;
} while (ThisTokBuf != ThisTokEnd && ThisTokBuf[0] != '\\');		} while (ThisTokBuf != ThisTokEnd && ThisTokBuf[0] != '\\');

		int Length = ThisTokBuf - InStart;
// Copy the character span over.		// Copy the character span over.
if (CopyStringFragment(StringToks[i], ThisTokBegin,		if (CopyStringFragment(StringToks[i], ThisTokBegin,
StringRef(InStart, ThisTokBuf - InStart)))		StringRef(InStart, ThisTokBuf - InStart)))
hadError = true;		hadError = true;

		if (!hadError && Converter) {
		assert(Kind != tok::wide_string_literal &&
		"Wide character translation not supported");
		SmallString<256> CpConv;
		int ResultLength = Length * CharByteWidth;
		char *Cp = ResultPtr - ResultLength;
		Converter->convert(StringRef(Cp, ResultLength), CpConv);
		memmove(Cp, CpConv.data(), ResultLength);
		ResultPtr = Cp + CpConv.size();
		}
continue;		continue;
}		}
// Is this a Universal Character Name escape?		// Is this a Universal Character Name escape?
if (ThisTokBuf[1] == 'u' || ThisTokBuf[1] == 'U') {		if (ThisTokBuf[1] == 'u' || ThisTokBuf[1] == 'U') {
EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd,		char *Cp = ResultPtr;
ResultPtr, hadError,		EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, ResultPtr,
		hadError,
FullSourceLoc(StringToks[i].getLocation(), SM),		FullSourceLoc(StringToks[i].getLocation(), SM),
CharByteWidth, Diags, Features);		CharByteWidth, Diags, Features);

		if (!hadError && Converter) {
		SmallString<8> CpConv;
		Converter->convert(StringRef(Cp), CpConv);
		memmove(Cp, CpConv.data(), CpConv.size());
		ResultPtr = Cp + CpConv.size();
		}
		tahonermann [Done]: UCNs will require conversion here.
		abhina.sreeskantharajan (Author) [Done]: I've added code to translate UCN characters and have updated the testcase as well.
continue;		continue;
}		}
// Otherwise, this is a non-UCN escape character. Process it.		// Otherwise, this is a non-UCN escape character. Process it.
unsigned ResultChar =		unsigned ResultChar =
ProcessCharEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, hadError,		ProcessCharEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, hadError,
FullSourceLoc(StringToks[i].getLocation(), SM),		FullSourceLoc(StringToks[i].getLocation(), SM),
CharByteWidth*8, Diags, Features);		CharByteWidth * 8, Diags, Features, Converter);

if (CharByteWidth == 4) {		if (CharByteWidth == 4) {
// FIXME: Make the type of the result buffer correct instead of		// FIXME: Make the type of the result buffer correct instead of
// using reinterpret_cast.		// using reinterpret_cast.
llvm::UTF32 *ResultWidePtr = reinterpret_cast<llvm::UTF32*>(ResultPtr);		llvm::UTF32 *ResultWidePtr = reinterpret_cast<llvm::UTF32*>(ResultPtr);
*ResultWidePtr = ResultChar;		*ResultWidePtr = ResultChar;
ResultPtr += 4;		ResultPtr += 4;
} else if (CharByteWidth == 2) {		} else if (CharByteWidth == 2) {
[177 lines not shown]	if (SpellingPtr[1] == 'u' || SpellingPtr[1] == 'U') {
if (Len > ByteNo) {		if (Len > ByteNo) {
// ByteNo is somewhere within the escape sequence.		// ByteNo is somewhere within the escape sequence.
SpellingPtr = EscapePtr;		SpellingPtr = EscapePtr;
break;		break;
}		}
ByteNo -= Len;		ByteNo -= Len;
} else {		} else {
ProcessCharEscape(SpellingStart, SpellingPtr, SpellingEnd, HadError,		ProcessCharEscape(SpellingStart, SpellingPtr, SpellingEnd, HadError,
FullSourceLoc(Tok.getLocation(), SM),		FullSourceLoc(Tok.getLocation(), SM), CharByteWidth * 8,
CharByteWidth*8, Diags, Features);		Diags, Features, nullptr);
--ByteNo;		--ByteNo;
}		}
assert(!HadError && "This method isn't valid on erroneous strings");		assert(!HadError && "This method isn't valid on erroneous strings");
}		}

return SpellingPtr-SpellingStart;		return SpellingPtr-SpellingStart;
}		}

/// Determine whether a suffix is a valid ud-suffix. We avoid treating reserved		/// Determine whether a suffix is a valid ud-suffix. We avoid treating reserved
/// suffixes as ud-suffixes, because the diagnostic experience is better if we		/// suffixes as ud-suffixes, because the diagnostic experience is better if we
/// treat it as an invalid suffix.		/// treat it as an invalid suffix.
bool StringLiteralParser::isValidUDSuffix(const LangOptions &LangOpts,		bool StringLiteralParser::isValidUDSuffix(const LangOptions &LangOpts,
StringRef Suffix) {		StringRef Suffix) {
return NumericLiteralParser::isValidUDSuffix(LangOpts, Suffix) ||		return NumericLiteralParser::isValidUDSuffix(LangOpts, Suffix) ||
Suffix == "sv";		Suffix == "sv";
}		}

clang/test/CodeGen/systemz-charset.c

This file was added.

				// RUN: %clang_cc1 %s -emit-llvm -triple s390x-none-zos -fexec-charset IBM-1047 -o - | FileCheck %s
				// RUN: %clang %s -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s

				const char *UpperCaseLetters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
				// CHECK: c"\C1\C2\C3\C4\C5\C6\C7\C8\C9\D1\D2\D3\D4\D5\D6\D7\D8\D9\E2\E3\E4\E5\E6\E7\E8\E9\00"
				tahonermann [Done]: `const char *` please :)

				const char *LowerCaseLetters = "abcdefghijklmnopqrstuvwxyz";
				//CHECK: c"\81\82\83\84\85\86\87\88\89\91\92\93\94\95\96\97\98\99\A2\A3\A4\A5\A6\A7\A8\A9\00"

				const char *Digits = "0123456789";
				// CHECK: c"\F0\F1\F2\F3\F4\F5\F6\F7\F8\F9\00"

				const char *SpecialCharacters = " .<(+|&!$);^-/,%%_>`:#@=";
				// CHECK: c"@KLMNOPZ[\\]^_`akllmnyz{|~\00"

				const char *EscapeCharacters = "\a\b\f\n\r\t\v\\\'\"\?";
				//CHECK: c"/\16\0C\15\0D\05\0B\E0}\7Fo\00"

				const char *InvalidEscape = "\y\z";
				//CHECK: c"oo\00"

				const char *HexCharacters = "\x12\x13\x14";
				//CHECK: c"\12\13\14\00"

				tahonermann [Done]: `const char*` here too please.
				const char *OctalCharacters = "\141\142\143";
				tahonermann [Done]: Add validation of UCNs. Something like: `const char *UcnCharacters = "\u00E2\u00AC\U000000DF";` `// CHECK: c"\42\B0\59\00"`
				//CHECK: c"abc\00"

				const char singleChar = 'a';
				//CHECK: i8 -127

				const char *UcnCharacters = "\u00E2\u00AC\U000000DF";
				//CHECK: c"B\B0Y\00"

				const char *Unicode = "ÿ";
				//CHECK: c"\DF\00"
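
The byte values in the CHECK lines above follow the EBCDIC layout of IBM-1047, where letters sit in three separate runs rather than one contiguous block. The sketch below reproduces only the mappings that appear in this test file (letters, digits, newline, and the four Latin-1 code points used in the UCN and Unicode cases). `toIBM1047` is a hypothetical helper, not the converter from the patch, and it returns -1 for anything it does not cover, including characters that IBM-1047 itself can represent.

```cpp
#include <cassert>

// Returns the IBM-1047 byte for a code point, or -1 if this sketch does not
// cover it. Only mappings visible in the test expectations are included.
int toIBM1047(char32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  // Upper-case letters: three runs starting at 0xC1, 0xD1 and 0xE2.
  if (CP >= U'A' && CP <= U'I') return 0xC1 + static_cast<int>(CP - U'A');
  if (CP >= U'J' && CP <= U'R') return 0xD1 + static_cast<int>(CP - U'J');
  if (CP >= U'S' && CP <= U'Z') return 0xE2 + static_cast<int>(CP - U'S');
  // Lower-case letters: three runs starting at 0x81, 0x91 and 0xA2.
  if (CP >= U'a' && CP <= U'i') return 0x81 + static_cast<int>(CP - U'a');
  if (CP >= U'j' && CP <= U'r') return 0x91 + static_cast<int>(CP - U'j');
  if (CP >= U's' && CP <= U'z') return 0xA2 + static_cast<int>(CP - U's');
  // Digits are contiguous from 0xF0.
  if (CP >= U'0' && CP <= U'9') return 0xF0 + static_cast<int>(CP - U'0');
  switch (CP) {
  case U'\n': return 0x15; // "\n" is checked as \15 above
  case 0x00E2: return 0x42; // checked as 'B' in UcnCharacters
  case 0x00AC: return 0xB0;
  case 0x00DF: return 0x59; // checked as 'Y' in UcnCharacters
  case 0x00FF: return 0xDF; // the Unicode test case
  default: return -1;
  }
}
```

The `singleChar` expectation `i8 -127` is the same 0x81 byte for `'a'`, read as a signed 8-bit value (129 - 256).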

clang/test/CodeGen/systemz-charset.cpp

This file was added.

				// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s

				const char *RawString = R"(Hello\n)";
				//CHECK: c"\C8\85\93\93\96\E0\95\00"

				const char *MultiLineRawString = R"(
				Hello
				There)";
				//CHECK: c"\15\C8\85\93\93\96\15\E3\88\85\99\85\00"

				char UnicodeChar8 = u8'1';
				//CHECK: i8 49
				char16_t UnicodeChar16 = u'1';
				//CHECK: i16 49
				char32_t UnicodeChar32 = U'1';
				//CHECK: i32 49

				const char *EscapeCharacters8 = u8"\a\b\f\n\r\t\v\\\'\"\?";
				//CHECK: c"\07\08\0C\0A\0D\09\0B\\'\22?\00"

				const char16_t *EscapeCharacters16 = u"\a\b\f\n\r\t\v\\\'\"\?";
				//CHECK: [12 x i16] [i16 7, i16 8, i16 12, i16 10, i16 13, i16 9, i16 11, i16 92, i16 39, i16 34, i16 63, i16 0]

				const char32_t *EscapeCharacters32 = U"\a\b\f\n\r\t\v\\\'\"\?";
				//CHECK: [12 x i32] [i32 7, i32 8, i32 12, i32 10, i32 13, i32 9, i32 11, i32 92, i32 39, i32 34, i32 63, i32 0]

				tahonermann [Done]: This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.
				abhina.sreeskantharajan (Author) [Done]: Good idea, I added those testcases as per your suggestion.
				const char *UnicodeString8 = u8"Hello";
				//CHECK: c"Hello\00"
				const char16_t *UnicodeString16 = u"Hello";
				//CHECK: [6 x i16] [i16 72, i16 101, i16 108, i16 108, i16 111, i16 0]
				const char32_t *UnicodeString32 = U"Hello";
				//CHECK: [6 x i32] [i32 72, i32 101, i32 108, i32 108, i32 111, i32 0]

				const char *UnicodeRawString8 = u8R"("Hello\")";
				//CHECK: c"\22Hello\\\22\00"
				const char16_t *UnicodeRawString16 = uR"("Hello\")";
				//CHECK: [9 x i16] [i16 34, i16 72, i16 101, i16 108, i16 108, i16 111, i16 92, i16 34, i16 0]
				const char32_t *UnicodeRawString32 = UR"("Hello\")";
				//CHECK: [9 x i32] [i32 34, i32 72, i32 101, i32 108, i32 108, i32 111, i32 92, i32 34, i32 0]

				const char *UnicodeUCNString8 = u8"\u00E2\u00AC\U000000DF";
				//CHECK: c"\C3\A2\C2\AC\C3\9F\00"
				const char16_t *UnicodeUCNString16 = u"\u00E2\u00AC\U000000DF";
				//CHECK: [4 x i16] [i16 226, i16 172, i16 223, i16 0]
				const char32_t *UnicodeUCNString32 = U"\u00E2\u00AC\U000000DF";
				//CHECK: [4 x i32] [i32 226, i32 172, i32 223, i32 0]
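
The `u8` expectations above are plain UTF-8, which is what shows that UTF literals bypass the IBM-1047 conversion. As a cross-check, a minimal UTF-8 encoder is sketched below; `encodeUTF8` is a hypothetical helper written for this illustration and is not taken from the patch or from LLVM.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode scalar value as UTF-8 (1 to 4 bytes).
std::string encodeUTF8(char32_t CP) {
  assert(CP <= 0x10FFFF && !(CP >= 0xD800 && CP <= 0xDFFF) &&
         "not a Unicode scalar value");
  std::string Out;
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}
```

Encoding U+00E2, U+00AC and U+00DF this way gives the bytes C3 A2, C2 AC and C3 9F, matching the `UnicodeUCNString8` CHECK line, whereas the ordinary literal in systemz-charset.c maps the same code points to single IBM-1047 bytes.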

clang/test/Driver/cl-options.c

	[204 lines not shown]
	// RUN: %clang_cl /E /EP /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s			// RUN: %clang_cl /E /EP /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s
	// RUN: %clang_cl /EP /P /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s			// RUN: %clang_cl /EP /P /showIncludes -### -- %s 2>&1 | FileCheck -check-prefix=showIncludes_E %s
	// showIncludes_E-NOT: warning: argument unused during compilation: '--show-includes'			// showIncludes_E-NOT: warning: argument unused during compilation: '--show-includes'

	// /source-charset: should warn on everything except UTF-8.			// /source-charset: should warn on everything except UTF-8.
	// RUN: %clang_cl /source-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=source-charset-utf-16 %s			// RUN: %clang_cl /source-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=source-charset-utf-16 %s
	// source-charset-utf-16: invalid value 'utf-16' in '/source-charset:utf-16'			// source-charset-utf-16: invalid value 'utf-16' in '/source-charset:utf-16'

	// /execution-charset: should warn on everything except UTF-8.			// /execution-charset: should warn on invalid charsets.
	// RUN: %clang_cl /execution-charset:utf-16 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-utf-16 %s			// RUN: %clang_cl /execution-charset:invalid-charset -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-invalid %s
	// execution-charset-utf-16: invalid value 'utf-16' in '/execution-charset:utf-16'			// execution-charset-invalid: invalid value 'invalid-charset' in '/execution-charset:invalid-charset'
				rsmith [Done]: Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)
	//			//

	// RUN: %clang_cl /Umymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s			// RUN: %clang_cl /Umymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s
				rsmith [Done]: Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.
				abhina.sreeskantharajan (Author) [Done]: You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.
	// RUN: %clang_cl /U mymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s			// RUN: %clang_cl /U mymacro -### -- %s 2>&1 | FileCheck -check-prefix=U %s
	// U: "-U" "mymacro"			// U: "-U" "mymacro"

	// RUN: %clang_cl /validate-charset -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset %s			// RUN: %clang_cl /validate-charset -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset %s
	// validate-charset: -Winvalid-source-encoding			// validate-charset: -Winvalid-source-encoding

	// RUN: %clang_cl /validate-charset- -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset_ %s			// RUN: %clang_cl /validate-charset- -### -- %s 2>&1 | FileCheck -check-prefix=validate-charset_ %s
	// validate-charset_: -Wno-invalid-source-encoding			// validate-charset_: -Wno-invalid-source-encoding
	[498 lines not shown]

clang/test/Driver/clang_f_opts.c

	[216 lines not shown]
	// CHECK-MAX-O: -O3			// CHECK-MAX-O: -O3

	// RUN: %clang -S -O20 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-O %s			// RUN: %clang -S -O20 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-O %s
	// CHECK-INVALID-O: warning: optimization level '-O20' is not supported; using '-O3' instead			// CHECK-INVALID-O: warning: optimization level '-O20' is not supported; using '-O3' instead

	// RUN: %clang -### -S -finput-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-CHARSET %s			// RUN: %clang -### -S -finput-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-CHARSET %s
	// CHECK-INVALID-CHARSET: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'			// CHECK-INVALID-CHARSET: error: invalid value 'iso-8859-1' in '-finput-charset=iso-8859-1'

	// RUN: %clang -### -S -fexec-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s			// RUN: %clang -### -S -fexec-charset=invalid-charset -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
	// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'iso-8859-1' in '-fexec-charset=iso-8859-1'			// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'invalid-charset' in '-fexec-charset=invalid-charset'
				rsmith [Done]: Again, this is not a useful test.

				// Test that we support the following exec charsets.
				// RUN: %clang -### -S -fexec-charset=UTF-8 -o /dev/null %s 2>&1 | FileCheck --check-prefix=INVALID %s
				// RUN: %clang -### -S -fexec-charset=ISO8859-1 -o /dev/null %s 2>&1 | FileCheck --check-prefix=INVALID %s
				// RUN: %clang -### -S -fexec-charset=IBM-1047 -o /dev/null %s 2>&1 | FileCheck --check-prefix=INVALID %s
				// INVALID-NOT: error: invalid value

				tahonermann [Done]: This looks good. Can tests also be added to validate that the `UTF-8`, `ISO8859-1`, and `IBM-1047` option arguments are properly recognized?
	// Test that we don't error on these.			// Test that we don't error on these.
	// RUN: %clang -### -S -Werror \			// RUN: %clang -### -S -Werror \
	// RUN: -falign-functions -falign-functions=2 -fno-align-functions \			// RUN: -falign-functions -falign-functions=2 -fno-align-functions \
	// RUN: -fasynchronous-unwind-tables -fno-asynchronous-unwind-tables \			// RUN: -fasynchronous-unwind-tables -fno-asynchronous-unwind-tables \
	// RUN: -fbuiltin -fno-builtin \			// RUN: -fbuiltin -fno-builtin \
	// RUN: -fdiagnostics-show-location=once \			// RUN: -fdiagnostics-show-location=once \
	// RUN: -ffloat-store -fno-float-store \			// RUN: -ffloat-store -fno-float-store \
	// RUN: -feliminate-unused-debug-types -fno-eliminate-unused-debug-types \			// RUN: -feliminate-unused-debug-types -fno-eliminate-unused-debug-types \
	// RUN: -fgcse -fno-gcse \			// RUN: -fgcse -fno-gcse \
	// RUN: -fident -fno-ident \			// RUN: -fident -fno-ident \
	// RUN: -fimplicit-templates -fno-implicit-templates \			// RUN: -fimplicit-templates -fno-implicit-templates \
	// RUN: -finput-charset=UTF-8 \			// RUN: -finput-charset=UTF-8 \
	// RUN: -fexec-charset=UTF-8 \			// RUN: -fexec-charset=UTF-8 \
	// RUN: -fivopts -fno-ivopts \			// RUN: -fivopts -fno-ivopts \
	// RUN: -fnon-call-exceptions -fno-non-call-exceptions \			// RUN: -fnon-call-exceptions -fno-non-call-exceptions \
	// RUN: -fpermissive -fno-permissive \			// RUN: -fpermissive -fno-permissive \
	// RUN: -fdefer-pop -fno-defer-pop \			// RUN: -fdefer-pop -fno-defer-pop \
	// RUN: -fprefetch-loop-arrays -fno-prefetch-loop-arrays \			// RUN: -fprefetch-loop-arrays -fno-prefetch-loop-arrays \
	// RUN: -fprofile-correction -fno-profile-correction \			// RUN: -fprofile-correction -fno-profile-correction \
	// RUN: -fprofile-values -fno-profile-values \			// RUN: -fprofile-values -fno-profile-values \
	// RUN: -frounding-math -fno-rounding-math \			// RUN: -frounding-math -fno-rounding-math \
	[349 lines not shown]

clang/test/Preprocessor/init-s390x.c

	[196 lines not shown]
	// S390X-ZOS-GNUXX: #define __DLL__ 1			// S390X-ZOS-GNUXX: #define __DLL__ 1
	// S390X-ZOS: #define __LONGNAME__ 1			// S390X-ZOS: #define __LONGNAME__ 1
	// S390X-ZOS: #define __MVS__ 1			// S390X-ZOS: #define __MVS__ 1
	// S390X-ZOS: #define __THW_370__ 1			// S390X-ZOS: #define __THW_370__ 1
	// S390X-ZOS: #define __THW_BIG_ENDIAN__ 1			// S390X-ZOS: #define __THW_BIG_ENDIAN__ 1
	// S390X-ZOS: #define __TOS_390__ 1			// S390X-ZOS: #define __TOS_390__ 1
	// S390X-ZOS: #define __TOS_MVS__ 1			// S390X-ZOS: #define __TOS_MVS__ 1
	// S390X-ZOS: #define __XPLINK__ 1			// S390X-ZOS: #define __XPLINK__ 1
				// S390X-ZOS: #define __clang_literal_encoding__ IBM-1047
	// S390X-ZOS-GNUXX: #define __wchar_t 1			// S390X-ZOS-GNUXX: #define __wchar_t 1

clang/test/Preprocessor/init-x86.c

	// X86_64-CLOUDABI:#define __WCHAR_TYPE__ int			// X86_64-CLOUDABI:#define __WCHAR_TYPE__ int
	// X86_64-CLOUDABI:#define __WCHAR_WIDTH__ 32			// X86_64-CLOUDABI:#define __WCHAR_WIDTH__ 32
	// X86_64-CLOUDABI:#define __WINT_MAX__ 2147483647			// X86_64-CLOUDABI:#define __WINT_MAX__ 2147483647
	// X86_64-CLOUDABI:#define __WINT_TYPE__ int			// X86_64-CLOUDABI:#define __WINT_TYPE__ int
	// X86_64-CLOUDABI:#define __WINT_WIDTH__ 32			// X86_64-CLOUDABI:#define __WINT_WIDTH__ 32
	// X86_64-CLOUDABI:#define __amd64 1			// X86_64-CLOUDABI:#define __amd64 1
	// X86_64-CLOUDABI:#define __amd64__ 1			// X86_64-CLOUDABI:#define __amd64__ 1
	// X86_64-CLOUDABI:#define __clang__ 1			// X86_64-CLOUDABI:#define __clang__ 1
	// X86_64-CLOUDABI:#define __clang_literal_encoding__ {{.*}}			// X86_64-CLOUDABI:#define __clang_literal_encoding__ UTF-8
	// X86_64-CLOUDABI:#define __clang_major__ {{.*}}			// X86_64-CLOUDABI:#define __clang_major__ {{.*}}
	// X86_64-CLOUDABI:#define __clang_minor__ {{.*}}			// X86_64-CLOUDABI:#define __clang_minor__ {{.*}}
	// X86_64-CLOUDABI:#define __clang_patchlevel__ {{.*}}			// X86_64-CLOUDABI:#define __clang_patchlevel__ {{.*}}
	// X86_64-CLOUDABI:#define __clang_version__ {{.*}}			// X86_64-CLOUDABI:#define __clang_version__ {{.*}}
	// X86_64-CLOUDABI:#define __clang_wide_literal_encoding__ {{.*}}			// X86_64-CLOUDABI:#define __clang_wide_literal_encoding__ {{.*}}
	// X86_64-CLOUDABI:#define __llvm__ 1			// X86_64-CLOUDABI:#define __llvm__ 1
	// X86_64-CLOUDABI:#define __x86_64 1			// X86_64-CLOUDABI:#define __x86_64 1
	// X86_64-CLOUDABI:#define __x86_64__ 1			// X86_64-CLOUDABI:#define __x86_64__ 1

llvm/include/llvm/ADT/Triple.h

public:
...
/// component of the triple, or "" if empty.		/// component of the triple, or "" if empty.
StringRef getEnvironmentName() const;		StringRef getEnvironmentName() const;

/// getOSAndEnvironmentName - Get the operating system and optional		/// getOSAndEnvironmentName - Get the operating system and optional
/// environment components as a single string (separated by a '-'		/// environment components as a single string (separated by a '-'
/// if the environment component is present).		/// if the environment component is present).
StringRef getOSAndEnvironmentName() const;		StringRef getOSAndEnvironmentName() const;

		/// getSystemCharset - Get the system charset of the triple.
		StringRef getSystemCharset() const;

/// @}		/// @}
/// @name Convenience Predicates		/// @name Convenience Predicates
/// @{		/// @{

/// Test whether the architecture is 64-bit		/// Test whether the architecture is 64-bit
///		///
/// Note that this tests for 64-bit pointer width, and nothing else. Note		/// Note that this tests for 64-bit pointer width, and nothing else. Note
/// that we intentionally expose only three predicates, 64-bit, 32-bit, and		/// that we intentionally expose only three predicates, 64-bit, 32-bit, and

llvm/lib/Support/Triple.cpp

StringRef Triple::getEnvironmentName() const {
...
return Tmp.split('-').second; // Strip third component		return Tmp.split('-').second; // Strip third component
}		}

StringRef Triple::getOSAndEnvironmentName() const {		StringRef Triple::getOSAndEnvironmentName() const {
StringRef Tmp = StringRef(Data).split('-').second; // Strip first component		StringRef Tmp = StringRef(Data).split('-').second; // Strip first component
return Tmp.split('-').second; // Strip second component		return Tmp.split('-').second; // Strip second component
}		}

		// System charset on z/OS is IBM-1047 and UTF-8 otherwise
		StringRef Triple::getSystemCharset() const {
		if (getOS() == llvm::Triple::ZOS)
		return "IBM-1047";
		Inline comment (tahonermann): No support for targeting the z/OS Enhanced ASCII run-time?
		Inline reply (abhina.sreeskantharajan, author): We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC).
		return "UTF-8";
		}

static unsigned EatNumber(StringRef &Str) {		static unsigned EatNumber(StringRef &Str) {
assert(!Str.empty() && isDigit(Str[0]) && "Not a number");		assert(!Str.empty() && isDigit(Str[0]) && "Not a number");
unsigned Result = 0;		unsigned Result = 0;

do {		do {
// Consume the leading digit.		// Consume the leading digit.
Result = Result*10 + (Str[0] - '0');		Result = Result*10 + (Str[0] - '0');

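The default selection made by the proposed Triple::getSystemCharset() in the Triple.cpp hunk (IBM-1047 on z/OS, UTF-8 everywhere else) can be sketched in isolation. The snippet below is only an illustration and does not use LLVM: OSKind, systemCharsetFor, execCharsetFor, and checkCharsetDefaults are hypothetical names, and the rule that an explicit -fexec-charset value overrides the target default is an assumption based on the patch summary, not code taken from the diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for llvm::Triple::OSType; only the cases needed here.
enum class OSKind { Linux, Win32, ZOS };

// Mirrors the logic of the proposed Triple::getSystemCharset():
// z/OS defaults to IBM-1047 (EBCDIC), every other OS to UTF-8.
inline std::string systemCharsetFor(OSKind OS) {
  if (OS == OSKind::ZOS)
    return "IBM-1047";
  return "UTF-8";
}

// Hypothetical helper (an assumption, not part of the patch as shown):
// a non-empty -fexec-charset value takes precedence over the target default.
inline std::string execCharsetFor(OSKind OS, const std::string &FlagValue) {
  return FlagValue.empty() ? systemCharsetFor(OS) : FlagValue;
}

// Runtime sanity checks for the defaults described above.
inline void checkCharsetDefaults() {
  assert(systemCharsetFor(OSKind::ZOS) == "IBM-1047");
  assert(systemCharsetFor(OSKind::Linux) == "UTF-8");
  assert(execCharsetFor(OSKind::ZOS, "") == "IBM-1047");
  assert(execCharsetFor(OSKind::ZOS, "UTF-8") == "UTF-8");
}
```

Under these assumptions, supporting the Enhanced ASCII run-time raised in the inline comment would amount to one more branch in systemCharsetFor, while IBM-1047 stays the z/OS default, as the author's reply describes.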
