This is an archive of the discontinued LLVM Phabricator instance.

Differential D88741

[SystemZ/z/OS] Add utility class for char set conversion.
Needs Review · Public

Authored by Kai on Oct 2 2020, 8:50 AM.

Details

Reviewers

uweigand
kbarton
yusra.syeda
hubert.reinterpretcast
tahonermann
ZarkoCA

Summary

This class adds support for conversion between different character
sets. The conversion between EBCDIC-1047 and Latin-1/UTF-8 is always
available. If the iconv library is installed, then all conversions
of the iconv library can be used, too.

The iconv functions are not as standardized as one would like.
Challenges found so far:

On some systems, the iconv*() functions are part of the C library. On other systems, a separate library must be linked in.
There are systems with multiple, incompatible implementations of iconv functionality.
Not every system provides an implementation.
Mapping of EBCDIC-1047 to ASCII/Latin-1/UTF-8 and vice versa is done differently on different platforms.
Each implementation supports different mappings.

As a result, the implementation provides a thin wrapper around the
iconv functionality, including a fixed conversion for EBCDIC-1047.

Event Timeline

Kai created this revision.Oct 2 2020, 8:50 AM

Herald added a project: Restricted Project.Oct 2 2020, 8:50 AM

Herald added subscribers: llvm-commits, hiraditya, mgorny.

Kai requested review of this revision.Oct 2 2020, 8:50 AM

Harbormaster completed remote builds in B73800: Diff 295839.Oct 2 2020, 9:26 AM

Drive by as it's years since I've dealt with any of this:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

In D88741#2308899, @echristo wrote:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Sure, done.

yusra.syeda mentioned this in D88749: [SystemZ/z/OS] Add GOFF reader.Oct 2 2020, 12:00 PM

ctetreau added a subscriber: ctetreau.Oct 5 2020, 10:20 AM

ctetreau added inline comments.

llvm/lib/Support/CharSet.cpp
27	These names leak the iconv dependency, and can be error prone if somebody types them wrong. We should have an enum class for the set of char sets that can be converted. If we keep the iconv dependency, it's trivial to map `CharSet::IBM1047 -> "IBM-1047"`, but the compiler can catch the error if you mistype the enum name.
159	If we replace the name strings with enums, then it should be possible to make this function total and remove the `ErrorOr`
191	If I'm understanding this correctly, having iconv provides the possibility of supporting conversions other than to and from ascii, utf8, and ebcdic? I'm concerned that this is going to create a ton of bug reports of the form "CharSetConverter::create returned an error on my machine, but not my coworker's machine!" which will be closed as "operator error, install iconv". I feel like there should be a set of conversions supported by CharSetConverter, and they should work regardless of the presence of iconv. From messages I've seen on the mailing lists, it sounds like there is license uncertainty with linking iconv. Maybe it's best to just not have this?
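
The enum-based mapping suggested for line 27 could look roughly like the sketch below. The enumerator names and the `iconvName` helper are hypothetical and not taken from the patch; only the `IBM-1047` spelling appears in the comment above, and the other name strings are assumptions that may differ between iconv implementations.

```cpp
#include <string_view>

// Hypothetical set of supported character sets (names are illustrative).
enum class CharSet { UTF8, Latin1, IBM1047 };

// Maps an enumerator to a name string that could be passed to iconv_open().
// A mistyped enumerator is a compile error, unlike a mistyped string.
constexpr std::string_view iconvName(CharSet CS) {
  switch (CS) {
  case CharSet::UTF8:
    return "UTF-8";
  case CharSet::Latin1:
    return "ISO-8859-1";
  case CharSet::IBM1047:
    return "IBM-1047";
  }
  return ""; // Not reached for valid enumerators.
}
```

With a closed enumeration like this, a factory taking two `CharSet` values has a finite, known set of inputs, which is what would allow it to drop the `ErrorOr` return type.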

yusra.syeda mentioned this in D89071: [SystemZ/z/OS] Add GOFFObjectFile class and details of GOFF file format.Oct 8 2020, 2:02 PM

yusra.syeda added a child revision: D89071: [SystemZ/z/OS] Add GOFFObjectFile class and details of GOFF file format.Oct 20 2020, 10:12 AM

Minor addition for this change; if this does go through, llvm-specific macros in C-like languages should expose the chosen name. I submitted a patch and a feature request to MSVC and GCC for their own implementation-specific macros as well:

It doesn't need to be synchronized with any other compiler's name or exposed feature: it just needs to exist. This allows folks interested in helping people write portable code that uses the execution character set to preserve the invariants of their code by allowing them to inspect the name and act meaningfully for names they recognize.

Seems reasonable to me.

-eric

Sorry for taking so long - I was on vacation.

Added an enumeration for the basic charsets
Added a named constructor just for the basic charsets
Updated the test cases

I like the addition of the enumeration. It simplifies the source in some parts.
However, I did not drop the function which uses iconv for conversion. This will be needed for some further extensions.

Herald added a subscriber: dexonsmith.Nov 4 2020, 5:28 AM

Harbormaster completed remote builds in B77542: Diff 302822.Nov 4 2020, 5:48 AM

abhina.sreeskantharajan added a subscriber: abhina.sreeskantharajan.Nov 18 2020, 10:00 AM

fanbo-meng added a subscriber: fanbo-meng.Nov 18 2020, 10:01 AM

fanbo-meng added inline comments.

llvm/lib/Support/CharSet.cpp
90	The naming here with "UTF" is ambiguous as it can mean UTF8, UTF16, UTF32. Using UTF8 with the enum names here would be better.

abhina.sreeskantharajan mentioned this in D93031: Enable fexec-charset option .Dec 10 2020, 5:49 AM

abhina.sreeskantharajan added a child revision: D93031: Enable fexec-charset option .Dec 10 2020, 6:24 AM

Drive-by comment as I'm reviewing a downstream patch - I think it's preferable to use llvm::Error and llvm::Expected instead of std::error_code, for reporting errors, as it allows you to provide more contextual information (e.g. why the conversion failed, which characters were invalid etc).

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

yusra.syeda added a child revision: D98437: [SystemZ][z/OS] Add GOFFObjectFile class support for HDR, ESD and END records.Mar 11 2021, 10:16 AM

ThePhD mentioned this in D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor.Apr 12 2021, 3:03 PM

Renamed enum members to include 8 for UTF8.

Kai marked an inline comment as done.Apr 20 2021, 1:13 PM

Kai added inline comments.

llvm/lib/Support/CharSet.cpp
90	Renamed.

Added the path of the <iconv.h> header as an additional include path.

Use the right CMake variable Iconv_INCLUDE_DIRS.

In D88741#2613823, @abhina.sreeskantharajan wrote:

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

Iconv_INCLUDE_DIRS is now in the header file search path. This hopefully fixes the compile error.

Kai added inline comments.Apr 20 2021, 1:45 PM

llvm/lib/Support/CharSet.cpp
191	This can be viewed as the same as compiling a file: it fails on my machine if I do not have the same file in the same place as my coworker. This problem seems acceptable for the clang -fexec-charset patch. iconv is a POSIX specification (POSIX iconv.h), so there is no license problem.

Harbormaster completed remote builds in B99790: Diff 338973.Apr 20 2021, 1:50 PM

Harbormaster completed remote builds in B99793: Diff 338977.Apr 20 2021, 2:08 PM

Harbormaster completed remote builds in B99798: Diff 338983.Apr 20 2021, 2:16 PM

yusra.syeda added a child revision: D103490: [SystemZ][z/OS] Add support for TXT records in the GOFF reader.Jun 1 2021, 1:43 PM

yusra.syeda mentioned this in D103490: [SystemZ][z/OS] Add support for TXT records in the GOFF reader.Jun 28 2021, 1:36 PM

kpn added a subscriber: kpn.Jun 28 2021, 1:39 PM

Introduce build option LLVM_ENABLE_ICONV to optionally turn building with iconv off
Add include path to the iconv header when building with iconv enabled and iconv is not part of the C library. This hopefully fixes the build failure on the Windows bot.

Harbormaster completed remote builds in B114091: Diff 358739.Jul 14 2021, 3:36 PM

Fixed a problem in the CMake files.
This should finally fix the problem seen on the Windows builder.

Harbormaster completed remote builds in B123316: Diff 371730.Sep 9 2021, 3:31 PM

Fix wrong use of #ifdef

Harbormaster completed remote builds in B123328: Diff 371744.Sep 9 2021, 4:43 PM

Rebased on latest main.

Harbormaster completed remote builds in B128460: Diff 379158.Oct 12 2021, 2:24 PM

hubert.reinterpretcast added inline comments.Oct 12 2021, 4:37 PM

llvm/include/llvm/Support/CharSet.h
48	Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.
llvm/lib/Support/CharSet.cpp
31–36	There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching). `ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.
103	The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".
105	The API contract between the table conversion and the `iconv` conversion is inconsistent. `iconv` conversion performs an implementation-defined conversion for valid input characters that do not have a representation in the output codeset. This implementation fails the conversion instead.
114	The coding guidelines say to use preincrement in cases where postincrement is not needed.
141	This looks like a use case for `resize_for_overwrite`.
152	Missing overflow detection for when the most significant bit of `Capacity` is already set.
153	Same comment about `resize_for_overwrite`.
156	The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.
195	Use a C-style cast if necessary, but `reinterpret_cast` is wrong here.
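
To illustrate the restriction described for line 103, here is a minimal sketch of a UTF-8 to ISO-8859-1 step that accepts only ASCII and valid two-byte sequences for [U+0080, U+00FF], and fails on anything else. The function name `utf8ToLatin1` is hypothetical, and this is not the code from the patch.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Converts UTF-8 to ISO-8859-1. Returns std::nullopt if the input contains
// anything other than ASCII or a valid encoding of U+0080..U+00FF.
std::optional<std::string> utf8ToLatin1(std::string_view In) {
  std::string Out;
  Out.reserve(In.size());
  for (std::size_t I = 0; I < In.size(); ++I) {
    const auto C = static_cast<unsigned char>(In[I]);
    if (C < 0x80) {
      Out.push_back(static_cast<char>(C));
      continue;
    }
    // Only the lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF. The lead
    // bytes 0xC0 and 0xC1 would be overlong encodings and are rejected.
    if ((C != 0xC2 && C != 0xC3) || I + 1 >= In.size())
      return std::nullopt;
    const auto Next = static_cast<unsigned char>(In[++I]);
    if ((Next & 0xC0) != 0x80) // Not a continuation byte.
      return std::nullopt;
    Out.push_back(static_cast<char>(((C & 0x03) << 6) | (Next & 0x3F)));
  }
  return Out;
}
```

Failing the conversion here is the behaviour the comment on line 105 contrasts with `iconv`, which may substitute a character instead.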

hubert.reinterpretcast added inline comments.Oct 12 2021, 6:54 PM

llvm/lib/Support/CharSet.cpp
176–179	This is a public constructor. It would be appropriate to check that `CSFrom` and `CSTo` aren't both `CS_IBM1047`...
180–184	Because of the [U+0000, U+00FF] limitation of the `convertWithTable` converter, we should not be getting here if there is a no-op conversion from UTF-8 to UTF-8. Either a no-op converter should be returned, or a request for such a converter is erroneous (`report_fatal_error` here would be friendlier by failing fast).

hubert.reinterpretcast added inline comments.Oct 12 2021, 8:04 PM

llvm/include/llvm/Support/CharSet.h
99	The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null `StringRef` as the method to cause a return to the initial shift state would be consistent with the `iconv` interface.
108	Same comment about null termination and shift state.

hubert.reinterpretcast added inline comments.Oct 12 2021, 8:52 PM

llvm/unittests/Support/CharSetTest.cpp
131	The other cases of "identity conversion" look like they would have suspicious behaviour. If they do, then this test is insufficient.
192	There is no representation in the testing of stateful encodings. Reasonable tests (separately for ISO-2022-JP and IBM-939) include: "Returning to the initial shift state" when in the initial shift state generates an empty output sequence. "Returning to the initial shift state" after the previous conversion ended with a character that requires a shift from the initial shift state generates a non-empty output sequence.
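
The two stateful-encoding tests proposed for line 192 rely on the POSIX `iconv()` convention that a null input buffer with a non-null output buffer writes the sequence returning to the initial shift state. The sketch below shows that call in isolation. The helper name `resetSequenceAfter` is hypothetical, the code assumes a platform where `iconv()` takes a `char **` input argument (as on glibc), and whether a given encoding name is accepted depends on the local iconv implementation.

```cpp
#include <iconv.h>

#include <cstddef>
#include <optional>
#include <string>

// Converts Input, then asks iconv for the bytes that return the output to
// the initial shift state. Returns only those reset bytes, or std::nullopt
// if the conversion is unavailable on this system or a call fails.
std::optional<std::string> resetSequenceAfter(const char *ToCode,
                                              const char *FromCode,
                                              std::string Input) {
  iconv_t CD = iconv_open(ToCode, FromCode);
  if (CD == (iconv_t)-1) // C-style cast: iconv_t need not be a pointer type.
    return std::nullopt;
  char Buffer[64];
  char *In = Input.data();
  std::size_t InLeft = Input.size();
  char *Out = Buffer;
  std::size_t OutLeft = sizeof(Buffer);
  if (InLeft != 0 &&
      iconv(CD, &In, &InLeft, &Out, &OutLeft) == static_cast<std::size_t>(-1)) {
    iconv_close(CD);
    return std::nullopt;
  }
  char *ResetStart = Out;
  // Null input, non-null output: emit the shift sequence to the initial state.
  const std::size_t RC = iconv(CD, nullptr, nullptr, &Out, &OutLeft);
  iconv_close(CD);
  if (RC == static_cast<std::size_t>(-1))
    return std::nullopt;
  return std::string(ResetStart, Out);
}
```

Called with an empty input, the result corresponds to the first proposed test (expected to be empty). Called after a character that requires a shift, it corresponds to the second (expected to be non-empty).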

Changes based on review comments:

Renamed/moved enumeration members for charset names
Using a charset alias algorithm to make using charset names easier
Some source code style updates
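
For reference, the charset alias matching rule from UTS #22 (linked in the review comments above) can be sketched as follows: drop everything except ASCII letters and digits, lowercase the letters, and delete each `0` that is not preceded by a digit. The function name `normalizeCharSetName` is hypothetical, and this is not the implementation from the patch.

```cpp
#include <string>
#include <string_view>

// Normalizes a charset name for alias matching, following UTS #22.
std::string normalizeCharSetName(std::string_view Name) {
  std::string Out;
  for (char C : Name) {
    const bool IsDigit = C >= '0' && C <= '9';
    const bool IsLower = C >= 'a' && C <= 'z';
    const bool IsUpper = C >= 'A' && C <= 'Z';
    if (!IsDigit && !IsLower && !IsUpper)
      continue; // Delete all characters except letters and digits.
    if (IsUpper)
      C = static_cast<char>(C - 'A' + 'a');
    // Delete a zero unless the previously kept character is a digit.
    const bool PrevIsDigit =
        !Out.empty() && Out.back() >= '0' && Out.back() <= '9';
    if (C == '0' && !PrevIsDigit)
      continue;
    Out.push_back(C);
  }
  return Out;
}
```

Under this rule `ISO8859-1` and `ISO-8859-1` both normalize to `iso88591`, so the z/OS spelling and the registered name compare equal.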

Kai added inline comments.Oct 14 2021, 1:04 PM

llvm/include/llvm/Support/CharSet.h
48	Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.
llvm/lib/Support/CharSet.cpp
31–36	Thanks! I added the normalization algorithm, as I think that it makes using the converter easier. Obviously, the `ISO8859-1` name comes from the z/OS implementation of iconv, which accepts `ISO8859-1` or `819` but not the registered name `ISO-8859-1`.
103	Changed.
114	Thanks for catching - my fault.
156	Thanks again, I was not aware of this change.

Harbormaster completed remote builds in B128927: Diff 379800.Oct 14 2021, 1:13 PM

ZarkoCA added a subscriber: ZarkoCA.Oct 14 2021, 7:06 PM

ZarkoCA added inline comments.

llvm/include/llvm/Support/CharSet.h
33	I think enums should start with an uppercase letter as per the style guide.
43	nit: may be more clear to add the ending comment.

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

llvm/include/llvm/Support/CharSet.h
33	Thanks @ZarkoCA. The coding standards don't state this, but the existing practice is that the exception (from the coding standards) about following the C++ STL naming convention in some cases is more broadly applicable for components with a corresponding standard (or proposed standard) facility like this one.

Update table-based algorithm to map Unicode characters outside the range [U+0080, U+00FF] to the SUB character. This is the same behavior as iconv.
Add a new identity transformation which does not alter UTF-8 sequences
Use resize_for_overwrite for memory allocation

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

Yep, I know.

Kai added inline comments.Oct 19 2021, 7:59 AM

llvm/include/llvm/Support/CharSet.h
33	To be precise, the exception is documented here: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (at the end of the paragraph).
llvm/lib/Support/CharSet.cpp
105	Changed implementation to have same behavior as iconv.
141	Changed, thanks.
153	Changed, thanks.
180–184	Added a no-op converter.
195	Changed.

Harbormaster completed remote builds in B129546: Diff 380686.Oct 19 2021, 8:37 AM

hubert.reinterpretcast added inline comments.Oct 19 2021, 8:39 PM

llvm/lib/Support/CharSet.cpp
105	Thanks for making the change. Thinking about it a bit more, I am wondering if there should be a "policy" available for the class to support selecting one behaviour or the other (or at least identifying that there were characters with no output codeset representation encountered). For example, `iconv` conversions that would not round-trip are identifiable via non-error non-zero return codes (indicating the number of input characters that did not have a representation in the output codeset). This information/policy will be useful for implementing the diagnostics required if https://wg21.link/p1854 is adopted.

hubert.reinterpretcast added inline comments.Oct 26 2021, 8:04 PM

llvm/lib/Support/CharSet.cpp
135	The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the `iconv` path to check the return code) would save effort on getting the error handling right. It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater than 0x10FFFF or because it encodes a surrogate code point).
139	I suggest renaming this to `SeqLength` and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).
140	An error should also be flagged when `SkipLen == 1`.
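
The error cases listed for line 135 (overlong encodings, surrogates, values above 0x10FFFF) can be made concrete with a small decoder for a single UTF-8 sequence. This is a sketch under the assumption that the caller wants strict validation. The name `decodeUtf8` is hypothetical, and `SeqLength` merely follows the renaming suggested for line 139.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Decodes the UTF-8 sequence at the start of In. Returns std::nullopt for
// truncated, malformed, or overlong sequences and for values that are not
// UCS scalar values (surrogates or values above U+10FFFF).
std::optional<std::uint32_t> decodeUtf8(std::string_view In) {
  if (In.empty())
    return std::nullopt;
  const auto Lead = static_cast<unsigned char>(In[0]);
  if (Lead < 0x80)
    return std::uint32_t{Lead};
  std::size_t SeqLength = 0;
  std::uint32_t Value = 0;
  std::uint32_t Min = 0; // Smallest value that needs SeqLength bytes.
  if ((Lead & 0xE0) == 0xC0) {
    SeqLength = 2, Value = Lead & 0x1F, Min = 0x80;
  } else if ((Lead & 0xF0) == 0xE0) {
    SeqLength = 3, Value = Lead & 0x0F, Min = 0x800;
  } else if ((Lead & 0xF8) == 0xF0) {
    SeqLength = 4, Value = Lead & 0x07, Min = 0x10000;
  } else {
    return std::nullopt; // Stray continuation byte or invalid lead byte.
  }
  if (In.size() < SeqLength)
    return std::nullopt; // Truncated sequence.
  for (std::size_t I = 1; I < SeqLength; ++I) {
    const auto C = static_cast<unsigned char>(In[I]);
    if ((C & 0xC0) != 0x80)
      return std::nullopt;
    Value = (Value << 6) | (C & 0x3F);
  }
  if (Value < Min)
    return std::nullopt; // Overlong encoding.
  if (Value >= 0xD800 && Value <= 0xDFFF)
    return std::nullopt; // Surrogate code point.
  if (Value > 0x10FFFF)
    return std::nullopt; // Beyond the Unicode code space.
  return Value;
}
```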

hubert.reinterpretcast added inline comments.Oct 26 2021, 8:42 PM

llvm/lib/Support/CharSet.cpp
179	My first comment regarding `resize_for_overwrite` was for this line.
213	This is consistent with `convertWithTable` but not `convertWithIconv` with respect to whether `Result` is being appended to as opposed to it having its contents replaced. All three should be consistent.
llvm/unittests/Support/CharSetTest.cpp
43	Comment should indicate that this is the substitution character.

hubert.reinterpretcast added a subscriber: cor3ntin.Oct 28 2021, 8:52 AM

evantypanski added a subscriber: evantypanski.Oct 28 2021, 8:55 AM

only support reversible conversions
make sure that all converters behave in the same way
updated comments

Kai added inline comments.Nov 3 2021, 12:53 PM

llvm/include/llvm/Support/CharSet.h
99	I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. `resetState()` would not be the better approach.
llvm/lib/Support/CharSet.cpp
135	Ah, I see your point. I reverted the error handling here, and check the return code from `iconv()` instead below. There is no real use case for it. I
152	Changed the handling to detect overflows.
179	Sorry, I noted this later, too.
213	Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.
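
The overflow case raised for line 152 (doubling a capacity whose most significant bit is already set) can be guarded as in the sketch below. The helper name `growCapacity` is hypothetical, and the patch may handle this differently.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Doubles Capacity for a retry with a larger output buffer. Returns
// std::nullopt if doubling would overflow, which is exactly the case where
// the most significant bit of Capacity is already set.
std::optional<std::size_t> growCapacity(std::size_t Capacity) {
  constexpr std::size_t Max = std::numeric_limits<std::size_t>::max();
  if (Capacity > Max / 2)
    return std::nullopt;
  return Capacity == 0 ? std::size_t{1} : Capacity * 2;
}
```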

Harbormaster completed remote builds in B132311: Diff 384560.Nov 3 2021, 3:11 PM

hubert.reinterpretcast added inline comments.Nov 6 2021, 9:01 PM

llvm/CMakeLists.txt
386 ↗	(On Diff #384560)	Minor nit: s/for for/for;
llvm/include/llvm/Support/CharSet.h
43	Minor comment still to address here.
77	Minor nit: s/occuor/occur/;
91	If the current object had a cleanup action on entry to the move assignment operator, then that cleanup action should be run prior to replacing the value.
99	I think that empty string is maybe not unique enough for this operation, so yes, a new method is probably better. However, it may be useful to understand how using `StringRef("", 1u)` as the input fits in.
108	Here too: s/occuor/occur/;
llvm/lib/Support/CharSet.cpp
136
164	It may be useful for the client of the interface to be able to retrieve the shift sequence (if any) to the initial shift state from the current internal output shift state. Although that use case might be achieved by using `StringRef("", 1u)`.

hubert.reinterpretcast added inline comments.Nov 7 2021, 7:48 AM

llvm/lib/Support/CharSet.cpp
198–200	I don't think there's any "partially converted input characters" to flush. There's only shift sequences that may be unnecessary or unwanted.
204–206	I think forcing return to the initial shift state at this level is too rigid. As mentioned in https://reviews.llvm.org/D88741?id=379158#inline-1064326, the recommended behaviour for numeric escapes involves no return to the initial shift state prior to the injection of the values specified. Whether Clang follows that recommendation or not for one platform or the other is a higher level question that should be had in the context of a client of this facility. The facility itself should leave the possibility of resuming with the prior translation's consequent shift state.

hubert.reinterpretcast added inline comments.Nov 7 2021, 9:21 AM

llvm/include/llvm/Support/CharSet.h
103	I think "string" is too ambiguous. Additional sentences to clarify would help: Converts the contents of a StringRef, which conventionally does not include a terminating null character. No additional null termination of the result is attempted.
128	Can refer to the other function to note clarifications re: null termination.

hubert.reinterpretcast added a reviewer: ZarkoCA.Nov 7 2021, 11:17 AM

@Kai, I've completed my review of the current state of this patch. I am out-of-office for three weeks. I hope my comments are clear enough to address and for other reviewers to confirm action/response on.

llvm/unittests/Support/CharSetTest.cpp
192	I suggest trying to write these tests.

hubert.reinterpretcast edited the summary of this revision.Nov 8 2021, 6:35 AM

Fix a couple of review comments.

Kai added inline comments.Nov 10 2021, 1:44 PM

llvm/CMakeLists.txt
386 ↗	(On Diff #384560)	Changed.
llvm/include/llvm/Support/CharSet.h
43	Comment added.
77	Fixed. Thanks for catching!
91	Call to cleanup added.
108	Fixed, too.
llvm/lib/Support/CharSet.cpp
136	Changed.

Harbormaster completed remote builds in B133577: Diff 386303.Nov 10 2021, 2:48 PM

The new changes are improvements. The comments regarding null termination and shift state are still pending resolution.

Rewrite based on review comments

New class hierarchy using pimpl idiom
New handling for flushing

Herald added a project: Restricted Project.Mar 29 2023, 6:44 AM

dexonsmith removed a subscriber: dexonsmith.Mar 29 2023, 9:26 AM

Harbormaster completed remote builds in B222504: Diff 509344.Mar 29 2023, 10:14 AM

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574, but neither gathered a lot of comments.

I would like to better understand the set of circumstances in which the system iconv's IBM1047 transcoder would not be suitable for use by llvm, on the systems that make use of that encoding.
Maintaining these tables won't be free. Are the discrepancies regarding LF handling something you expect to cause issues in practice?

I am of the opinion that we should rely on iconv as much as possible, as I do not think maintaining conversion tables will be a good use of our resources. However, I think people might have widely different opinions, and you will want to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encodings has been an issue in practice, but it's worth addressing. I know some people mentioned that point previously.
IMO, any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of an unsupported encoding.

If we do use iconv though, I would like us to have a better understanding of use cases. The patch currently links iconv to all LLVM libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging on Linux distributions.

Speaking of, maybe the LLVM Foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges, but we should make sure.

@aaron.ballman also reminded me that while iconv is generally available on POSIX-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of Win32 APIs may be challenging, but we might want to think about ways to allow or facilitate the use of iconv in Windows builds. At least I would like us to give some consideration to that point. Indeed, Windows would logically be the other big motivation to teach Clang about encodings beyond UTF-8.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

On a slightly more technical note, I'm concerned about the attempt at providing genericity in the CharSetConverterTable class. As Hubert alluded to, a table of bytes only works for specific, stateless, single-byte encodings that have a reasonable mapping to one another.
If we do agree that we want a homegrown IBM1047<->UTF-8 conversion class, maybe we should have a class that does only that instead of trying to future-proof and expecting to add more tables, as I hope that we won't try to support more encodings without iconv.
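
To make the point about byte tables concrete: a table-driven converter is just one lookup per input byte, which is why it only fits stateless single-byte encodings. The sketch below fills in a deliberately partial IBM-1047 to ISO-8859-1 table (space, digits, and basic Latin letters only); a real table defines all 256 entries. The names `makeEbcdicToLatin1` and `translate` are hypothetical, and leaving the remaining entries as 0x1A (SUB) is an assumption made for illustration.

```cpp
#include <array>
#include <string>
#include <string_view>

using ByteTable = std::array<unsigned char, 256>;

// Builds a partial IBM-1047 -> ISO-8859-1 table. Target values are written
// numerically so the sketch does not depend on the host character set.
ByteTable makeEbcdicToLatin1() {
  ByteTable T;
  T.fill(0x1A); // Entries not covered by this sketch map to SUB.
  auto Run = [&T](unsigned From, unsigned To, unsigned Count) {
    for (unsigned I = 0; I < Count; ++I)
      T[From + I] = static_cast<unsigned char>(To + I);
  };
  Run(0x40, 0x20, 1);  // space
  Run(0xF0, 0x30, 10); // 0-9
  Run(0xC1, 0x41, 9);  // A-I
  Run(0xD1, 0x4A, 9);  // J-R
  Run(0xE2, 0x53, 8);  // S-Z
  Run(0x81, 0x61, 9);  // a-i
  Run(0x91, 0x6A, 9);  // j-r
  Run(0xA2, 0x73, 8);  // s-z
  return T;
}

// Applies a single-byte table to every byte of the input.
std::string translate(std::string_view In, const ByteTable &Table) {
  std::string Out;
  Out.reserve(In.size());
  for (unsigned char C : In)
    Out.push_back(static_cast<char>(Table[C]));
  return Out;
}
```

Nothing in `translate` can represent a shift state or a multi-byte character, which is the limitation behind the suggestion to keep the class specific to IBM-1047.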

CharSetConverterIdentity will never be a very efficient thing to use as implemented, as it will perform a copy that may not be useful. Have you considered either asserting that the encodings are distinct, or providing an error in that case instead of making a no-op copying converter?

llvm/include/llvm/Support/CharSet.h
94	Please use `std::unique_ptr` here, that way you won't have to manually manage memory in the destructor and move assignment operator
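
A minimal sketch of what this suggestion buys with a pimpl-style design: when the implementation pointer is a `std::unique_ptr`, the destructor and move operations can be left to the compiler. All class and function names below are hypothetical and are not the ones used in the patch.

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical implementation interface hidden behind the public class.
class ConverterImplBase {
public:
  virtual ~ConverterImplBase() = default;
  virtual std::string convert(std::string_view In) = 0;
};

// Public wrapper. No user-written destructor or move assignment is needed:
// std::unique_ptr releases the old implementation automatically.
class Converter {
  std::unique_ptr<ConverterImplBase> Impl;

public:
  explicit Converter(std::unique_ptr<ConverterImplBase> I)
      : Impl(std::move(I)) {}
  std::string convert(std::string_view In) { return Impl->convert(In); }
};

// Trivial implementation used only to exercise the wrapper.
class IdentityImpl final : public ConverterImplBase {
public:
  std::string convert(std::string_view In) override { return std::string(In); }
};

Converter makeIdentityConverter() {
  return Converter(std::make_unique<IdentityImpl>());
}
```

Move-assigning one `Converter` to another destroys the target's previous implementation first, which also covers the cleanup concern raised earlier for the move assignment operator.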

Everybody0523 added a subscriber: Everybody0523.Mar 30 2023, 11:45 AM

Use std::unique_ptr
clang-format the file

Harbormaster completed remote builds in B223098: Diff 510158.Mar 31 2023, 6:55 PM

In D88741#4232746, @cor3ntin wrote:

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574, but neither gathered a lot of comments.

+1, neither of those RFCs really showed a strong community consensus behind the idea, so I think another RFC would be appropriate. Please be sure to address the items in https://clang.llvm.org/get_involved.html#criteria explicitly in the RFC where appropriate.

I am of the opinion that we should rely on iconv as much as possible, as I do not think maintaining conversion tables will be a good use of our resources. However, I think people might have widely different opinions, and you will want to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encodings has been an issue in practice, but it's worth addressing. I know some people mentioned that point previously.
IMO, any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of an unsupported encoding.

If we do use iconv though, I would like us to have a better understanding of use cases. The patch currently links iconv to all LLVM libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging on Linux distributions.

Speaking of, maybe the LLVM Foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges, but we should make sure.

Agreed -- any time we bring in a third-party dependency, we should be going through an explicit license review just to make sure we're in the clear.

@aaron.ballman also reminded me that while iconv is generally available on POSIX-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of Win32 APIs may be challenging, but we might want to think about ways to allow or facilitate the use of iconv in Windows builds. At least I would like us to give some consideration to that point. Indeed, Windows would logically be the other big motivation to teach Clang about encodings beyond UTF-8.

This is my big concern. iconv isn't installed on Windows by default, so this means we either need to dynamically link against iconv (which means Windows users now have extra dependencies they need to deal with and this can negatively impact downstream projects that ship an iconv version because of versioning differences) or we need to statically link against iconv (which means Windows users now have a significantly larger binary to deal with, and it still can negatively impact downstream projects that also statically link against iconv). That said, I also don't think we want to try to replace iconv with our own set of APIs. This should be explicitly discussed in the new RFC.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

@aaron.ballman because of the LGPL nature of iconv we might be restricted in our ability to link iconv statically (but IANAL).
However, we might actually have options there. ICU is generally available on Windows (https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-); that's something that could be investigated.
The ICU converter API is similar to that of iconv (https://unicode-org.github.io/icu/userguide/conversion/converters.html#creating-a-converter), so it could be surfaced the same way (i.e., through a name); both libraries use (potentially a superset of) IANA names.

CharSetConverterIdentity will never be a very efficient thing to use as implemented, as it will perform a copy that may not be useful. Have you considered either asserting that the encodings are distinct, or providing an error in that case instead of making a no-op copying converter?

We did consider that, but ultimately decided against it. The reason is that we want to be able to provide a create method that is guaranteed to return a functional converter (i.e., is guaranteed not to fail). However, if it is greatly preferable to have it assert/error in the event of an "identity" conversion, we can make that change too.

Based on comments here and on the new RFC, we've decided to remove the use of iconv and limit this patch to supporting conversions between EBCDIC and UTF-8.

Strip down implementation to just having 2 functions to convert between EBCDIC-1047 and UTF-8.

Harbormaster completed remote builds in B226377: Diff 514638.Apr 18 2023, 7:28 AM

Fix error in unit test configuration.

Harbormaster completed remote builds in B226387: Diff 514654.Apr 19 2023, 11:34 AM

yusra.syeda removed a child revision: D98437: [SystemZ][z/OS] Add GOFFObjectFile class support for HDR, ESD and END records.Apr 24 2023, 12:39 PM

abhina.sreeskantharajan mentioned this in D153417: New CharSetConverter wrapper class for ConverterEBCDIC.Jun 27 2023, 5:52 AM

Revision Contents

Path                                       Size
llvm/cmake/config-ix.cmake                 8 lines
llvm/include/llvm/Config/config.h.cmake    3 lines
llvm/include/llvm/Support/CharSet.h        117 lines
llvm/lib/Support/CMakeLists.txt            7 lines
llvm/lib/Support/CharSet.cpp               203 lines
llvm/unittests/Support/CMakeLists.txt      1 line
llvm/unittests/Support/CharSetTest.cpp     191 lines

Diff 338983

llvm/cmake/config-ix.cmake

[188 lines hidden]
  else()
  set(LLVM_ENABLE_TERMINFO 0)
  endif()

  check_library_exists(xar xar_open "" HAVE_LIBXAR)
  if(HAVE_LIBXAR)
  set(XAR_LIB xar)
  endif()

+ # Check for iconv.
+ find_package(Iconv)
+ if(Iconv_FOUND)
+ set(HAVE_ICONV 1)
+ else()
+ set(HAVE_ICONV 0)
+ endif()
+
  # function checks
  check_symbol_exists(arc4random "stdlib.h" HAVE_DECL_ARC4RANDOM)
  find_package(Backtrace)
  set(HAVE_BACKTRACE ${Backtrace_FOUND})
  set(BACKTRACE_HEADER ${Backtrace_HEADER})

  # Prevent check_symbol_exists from using API that is not supported for a given
  # deployment target.
[480 lines hidden]

llvm/include/llvm/Config/config.h.cmake

[...]
#cmakedefine HAVE_GETPAGESIZE ${HAVE_GETPAGESIZE}

/* Define to 1 if you have the `getrlimit' function. */
#cmakedefine HAVE_GETRLIMIT ${HAVE_GETRLIMIT}

/* Define to 1 if you have the `getrusage' function. */
#cmakedefine HAVE_GETRUSAGE ${HAVE_GETRUSAGE}

+ /* Define to 1 if you have the iconv library functions. */
+ #cmakedefine HAVE_ICONV ${HAVE_ICONV}

/* Define to 1 if you have the `isatty' function. */
#cmakedefine HAVE_ISATTY 1

/* Define to 1 if you have the `edit' library (-ledit). */
#cmakedefine HAVE_LIBEDIT ${HAVE_LIBEDIT}

/* Define to 1 if you have the `pfm' library (-lpfm). */
#cmakedefine HAVE_LIBPFM ${HAVE_LIBPFM}
[...]

llvm/include/llvm/Support/CharSet.h

This file was added.

//===-- CharSet.h - Utility class to convert between char sets ----*- C++ -*-=//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

///

/// \file

/// This file provides a utility class to convert between different character

/// set encodings.

///

//===----------------------------------------------------------------------===//

#ifndef LLVM_SUPPORT_CHARSET_H

#define LLVM_SUPPORT_CHARSET_H

#include "llvm/ADT/StringRef.h"

#include "llvm/Config/config.h"

#include "llvm/Support/ErrorOr.h"

#include <functional>

#include <string>

#include <system_error>

namespace llvm {

template <typename T> class SmallVectorImpl;

/// Utility class to convert between different character set encodings.

/// The class always supports converting between EBCDIC 1047 and Latin-1/UTF-8.

/// If the iconv library is available, then arbitrary conversions are supported.

/// TODO Add Windows support.

ZarkoCAUnsubmitted

Not Done

namespace text_encoding {

- enum class id {

+ enum class Id {

/// UTF-8 character set encoding.

I think enums should start with an uppercase letter as per the style guide.

hubert.reinterpretcastUnsubmitted

Not Done

Thanks @ZarkoCA. The coding standards don't state this, but the existing practice is that the exception (from the coding standards) about following the C++ STL naming convention in some cases is more broadly applicable for components with a corresponding standard (or proposed standard) facility like this one.

KaiAuthorUnsubmitted

Done

To be precise, the exception is documented here: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (at the end of the paragraph).

class CharSetConverter {

public:

using ConverterFunc =

std::function<std::error_code(StringRef, SmallVectorImpl<char> &)>;

using CleanupFunc = std::function<void(void)>;

private:

ConverterFunc Convert;

CleanupFunc Cleanup;

ZarkoCAUnsubmitted

Not Done

IBM1047

};

- }

+ } // end namespace text_encoding

/// Utility class to convert between different character set encodings.

nit: may be more clear to add the ending comment.

hubert.reinterpretcastUnsubmitted

Not Done

Minor comment still to address here.

KaiAuthorUnsubmitted

Done

Comment added.

public:

enum CharSetNames {

/// UTF-8 character set encoding.

CS_UTF8,

hubert.reinterpretcastUnsubmitted

Not Done

Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.

KaiAuthorUnsubmitted

Done

Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.

/// ISO 8859-1 (Latin-1) character set encoding.

CS_LATIN1,

/// IBM EBCDIC 1047 character set encoding.

CS_IBM1047

};

private:

CharSetConverter(ConverterFunc Convert, CleanupFunc Cleanup)

: Convert(Convert), Cleanup(Cleanup) {}

public:

/// Creates a CharSetConverter instance.

/// \param[in] CSFrom name of the source character encoding

/// \param[in] CSTo name of the target character encoding

/// \return a CharSetConverter instance

static CharSetConverter create(CharSetNames CSFrom, CharSetNames CSTo);

/// Creates a CharSetConverter instance.

/// Returns std::errc::invalid_argument in case the requested conversion is

/// not supported.

/// \param[in] CSFrom name of the source character encoding

/// \param[in] CSTo name of the target character encoding

/// \return a CharSetConverter instance or an error code

static ErrorOr<CharSetConverter> create(StringRef CSFrom, StringRef CSTo);

CharSetConverter(const CharSetConverter &) = delete;

CharSetConverter &operator=(const CharSetConverter &) = delete;

hubert.reinterpretcastUnsubmitted

Not Done

Minor nit: s/occuor/occur/;

KaiAuthorUnsubmitted

Done

Fixed. Thanks for catching!

CharSetConverter(CharSetConverter &&Other) {

this->Convert = Other.Convert;

this->Cleanup = Other.Cleanup;

Other.Cleanup = nullptr;

}

CharSetConverter &operator=(CharSetConverter &&Other) {

this->Convert = Other.Convert;

this->Cleanup = Other.Cleanup;

Other.Cleanup = nullptr;

return *this;

}

~CharSetConverter() {

hubert.reinterpretcastUnsubmitted

Not Done

If the current object had a cleanup action on entry to the move assignment operator, then that cleanup action should be run prior to replacing the value.

KaiAuthorUnsubmitted

Done

Call to cleanup added.

if (Cleanup)

Cleanup();

}
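The cleanup concern raised for the move assignment operator can be illustrated in isolation. The following is a minimal sketch, not the code from the patch: `ConverterHandle` is a hypothetical stand-in that keeps only a `std::function` cleanup callback, and it assumes the fix is simply to run the currently held cleanup before taking over the other object's.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for the reviewed class; only cleanup handling is shown.
class ConverterHandle {
  std::function<void()> Cleanup;

public:
  explicit ConverterHandle(std::function<void()> Cleanup)
      : Cleanup(std::move(Cleanup)) {}

  ConverterHandle(const ConverterHandle &) = delete;
  ConverterHandle &operator=(const ConverterHandle &) = delete;

  ConverterHandle(ConverterHandle &&Other)
      : Cleanup(std::move(Other.Cleanup)) {
    // A moved-from std::function has an unspecified state, so reset it.
    Other.Cleanup = nullptr;
  }

  ConverterHandle &operator=(ConverterHandle &&Other) {
    if (this != &Other) {
      // Release the resource owned so far before taking over the new one.
      if (Cleanup)
        Cleanup();
      Cleanup = std::move(Other.Cleanup);
      Other.Cleanup = nullptr;
    }
    return *this;
  }

  ~ConverterHandle() {
    if (Cleanup)
      Cleanup();
  }
};
```

With this shape each cleanup callback runs exactly once, either when its owner is overwritten by a move assignment or when the final owner is destroyed.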

cor3ntinUnsubmitted

Not Done

Please use std::unique_ptr here, that way you won't have to manually manage memory in the destructor and move assignment operator

/// Converts a string.

/// \param[in] Source source string

/// \param[in,out] Result container for converted string

/// \return error code in case something went wrong

hubert.reinterpretcastUnsubmitted

Not Done

The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null StringRef as the method to cause a return to the initial shift state would be consistent with the iconv interface.

KaiAuthorUnsubmitted

Done

I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. resetState() would not be the better approach.

hubert.reinterpretcastUnsubmitted

Not Done

I think that empty string is maybe not unique enough for this operation, so yes, a new method is probably better. However, it may be useful to understand how using StringRef("", 1u) as the input fits in.

std::error_code convert(StringRef Source,

SmallVectorImpl<char> &Result) const {

return Convert(Source, Result);

}

hubert.reinterpretcastUnsubmitted

Not Done

I think "string" is too ambiguous. Additional sentences to clarify would help:
Converts the contents of a StringRef, which conventionally does not include a terminating null character. No additional null termination of the result is attempted.

/// Converts a string.

/// \param[in] Source source string

/// \param[in,out] Result container for converted string

/// \return error code in case something went wrong

hubert.reinterpretcastUnsubmitted

Not Done

Same comment about null termination and shift state.

hubert.reinterpretcastUnsubmitted

Not Done

Here too: s/occuor/occur/;

KaiAuthorUnsubmitted

Done

Fixed, too.

std::error_code convert(const std::string &Source,

SmallVectorImpl<char> &Result) const {

return convert(StringRef(Source), Result);

}

};

} // namespace llvm

#endif

hubert.reinterpretcastUnsubmitted

Not Done

Can refer to the other function to note clarifications re: null termination.

llvm/lib/Support/CMakeLists.txt

[...]
if (MSVC)
set (delayload_flags delayimp -delayload:shell32.dll -delayload:ole32.dll)
endif()

# Link Z3 if the user wants to build it.
if(LLVM_WITH_Z3)
set(system_libs ${system_libs} ${Z3_LIBRARIES})
endif()

+ # Link iconv library if it is an external library.
+ if(Iconv_FOUND AND NOT Iconv_IS_BUILT_IN)
+ set(system_libs ${system_libs} ${Iconv_LIBRARIES})
+ endif()

# Override the C runtime allocator on Windows and embed it into LLVM tools & libraries
if(LLVM_INTEGRATED_CRT_ALLOC)
if (CMAKE_BUILD_TYPE AND NOT ${LLVM_USE_CRT_${uppercase_CMAKE_BUILD_TYPE}} MATCHES "^(MT|MTd)$")
message(FATAL_ERROR "LLVM_INTEGRATED_CRT_ALLOC only works with /MT or /MTd. Use LLVM_USE_CRT_${uppercase_CMAKE_BUILD_TYPE} to set the appropriate option.")
endif()

string(REGEX REPLACE "(/|\\\\)$" "" LLVM_INTEGRATED_CRT_ALLOC "${LLVM_INTEGRATED_CRT_ALLOC}")

[...]
add_llvm_component_library(LLVMSupport
BinaryStreamReader.cpp
BinaryStreamRef.cpp
BinaryStreamWriter.cpp
BlockFrequency.cpp
BranchProbability.cpp
BuryPointer.cpp
CachePruning.cpp
circular_raw_ostream.cpp
+ CharSet.cpp
Chrono.cpp
COM.cpp
CodeGenCoverage.cpp
CommandLine.cpp
Compression.cpp
CRC.cpp
ConvertUTF.cpp
ConvertUTFWrapper.cpp
[...]
# System
Watchdog.cpp

ADDITIONAL_HEADER_DIRS
Unix
Windows
${LLVM_MAIN_INCLUDE_DIR}/llvm/ADT
${LLVM_MAIN_INCLUDE_DIR}/llvm/Support
${Backtrace_INCLUDE_DIRS}
+ ${Iconv_INCLUDE_DIRS}

LINK_LIBS
${system_libs} ${imported_libs} ${delayload_flags}

LINK_COMPONENTS
Demangle
)

[...]

llvm/lib/Support/CharSet.cpp

This file was added.

//===-- CharSet.cpp - Utility class to convert between char sets --*- C++ -*-=//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

///

/// \file

/// This file provides utility classes to convert between different character

/// set encoding.

///

//===----------------------------------------------------------------------===//

#include "llvm/Support/CharSet.h"

#include "llvm/ADT/SmallVector.h"

#include <algorithm>

#include <system_error>

#ifdef HAVE_ICONV

#include <iconv.h>

#endif

using namespace llvm;

namespace {

ctetreauUnsubmitted

Done

These names leak the iconv dependency, and can be error prone if somebody types them wrong. We should have an enum class for the set of char sets that can be converted. If we keep the iconv dependency, it's trivial to map CharSet::IBM1047 -> "IBM-1047", but the compiler can catch the error if you mistype the enum name.

// Maps the charset name to enum constant if possible.

Optional<CharSetConverter::CharSetNames> getKnownCharSet(StringRef CSName) {

#define CSNAME(CS, STR) \

if (CSName == STR) \

return CS

CSNAME(CharSetConverter::CS_UTF8, "UTF-8");

CSNAME(CharSetConverter::CS_LATIN1, "ISO8859-1");

CSNAME(CharSetConverter::CS_IBM1047, "IBM-1047");

#undef CSNAME

hubert.reinterpretcastUnsubmitted

Not Done

There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).

ISO8859-1 is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.

KaiAuthorUnsubmitted

Done

Thanks! I added the normalization algorithm, as I think that it makes using the converter easier.

Obviously, the ISO8859-1 name comes from the z/OS implementation of iconv, which accepts `ISO8859-1` or 819 but not the registered name ISO-8859-1.

return None;

}
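The charset alias matching referenced in the review comment can be sketched as a small normalization step applied to both names before comparing them. This is a hedged illustration of the loose matching rule as described in the linked Unicode report (drop everything except ASCII letters and digits, lowercase the letters, and drop each `0` that is not preceded by a digit); the function name `normalizeCharSetName` is hypothetical and is not taken from the patch.

```cpp
#include <string>

// Normalizes a charset name for loose comparison (hypothetical helper).
std::string normalizeCharSetName(const std::string &Name) {
  std::string Result;
  for (char C : Name) {
    // Map ASCII uppercase letters to lowercase.
    if (C >= 'A' && C <= 'Z')
      C = static_cast<char>(C - 'A' + 'a');
    const bool IsLetter = C >= 'a' && C <= 'z';
    const bool IsDigit = C >= '0' && C <= '9';
    // Delete everything that is not an ASCII letter or digit.
    if (!IsLetter && !IsDigit)
      continue;
    // Delete a zero unless the last kept character is a digit.
    const bool PrevIsDigit =
        !Result.empty() && Result.back() >= '0' && Result.back() <= '9';
    if (C == '0' && !PrevIsDigit)
      continue;
    Result.push_back(C);
  }
  return Result;
}
```

Under this rule `ISO8859-1` and `ISO-8859-1` normalize to the same key, so a lookup keyed on the normalized name would accept both the z/OS spelling and the registered name.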

// Character conversion between Enhanced ASCII and EBCDIC (IBM-1047).

const unsigned char ISO88591ToIBM1047[256] = {

0x00, 0x01, 0x02, 0x03, 0x37, 0x2d, 0x2e, 0x2f, 0x16, 0x05, 0x15, 0x0b,

0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x3c, 0x3d, 0x32, 0x26,

0x18, 0x19, 0x3f, 0x27, 0x1c, 0x1d, 0x1e, 0x1f, 0x40, 0x5a, 0x7f, 0x7b,

0x5b, 0x6c, 0x50, 0x7d, 0x4d, 0x5d, 0x5c, 0x4e, 0x6b, 0x60, 0x4b, 0x61,

0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0x7a, 0x5e,

0x4c, 0x7e, 0x6e, 0x6f, 0x7c, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7,

0xc8, 0xc9, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7, 0xd8, 0xd9, 0xe2,

0xe3, 0xe4, 0xe5, 0xe6, 0xe7, 0xe8, 0xe9, 0xad, 0xe0, 0xbd, 0x5f, 0x6d,

0x79, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x91, 0x92,

0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6,

0xa7, 0xa8, 0xa9, 0xc0, 0x4f, 0xd0, 0xa1, 0x07, 0x20, 0x21, 0x22, 0x23,

0x24, 0x25, 0x06, 0x17, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x09, 0x0a, 0x1b,

0x30, 0x31, 0x1a, 0x33, 0x34, 0x35, 0x36, 0x08, 0x38, 0x39, 0x3a, 0x3b,

0x04, 0x14, 0x3e, 0xff, 0x41, 0xaa, 0x4a, 0xb1, 0x9f, 0xb2, 0x6a, 0xb5,

0xbb, 0xb4, 0x9a, 0x8a, 0xb0, 0xca, 0xaf, 0xbc, 0x90, 0x8f, 0xea, 0xfa,

0xbe, 0xa0, 0xb6, 0xb3, 0x9d, 0xda, 0x9b, 0x8b, 0xb7, 0xb8, 0xb9, 0xab,

0x64, 0x65, 0x62, 0x66, 0x63, 0x67, 0x9e, 0x68, 0x74, 0x71, 0x72, 0x73,

0x78, 0x75, 0x76, 0x77, 0xac, 0x69, 0xed, 0xee, 0xeb, 0xef, 0xec, 0xbf,

0x80, 0xfd, 0xfe, 0xfb, 0xfc, 0xba, 0xae, 0x59, 0x44, 0x45, 0x42, 0x46,

0x43, 0x47, 0x9c, 0x48, 0x54, 0x51, 0x52, 0x53, 0x58, 0x55, 0x56, 0x57,

0x8c, 0x49, 0xcd, 0xce, 0xcb, 0xcf, 0xcc, 0xe1, 0x70, 0xdd, 0xde, 0xdb,

0xdc, 0x8d, 0x8e, 0xdf};

const unsigned char IBM1047ToISO88591[256] = {

0x00, 0x01, 0x02, 0x03, 0x9c, 0x09, 0x86, 0x7f, 0x97, 0x8d, 0x8e, 0x0b,

0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x9d, 0x0a, 0x08, 0x87,

0x18, 0x19, 0x92, 0x8f, 0x1c, 0x1d, 0x1e, 0x1f, 0x80, 0x81, 0x82, 0x83,

0x84, 0x85, 0x17, 0x1b, 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x05, 0x06, 0x07,

0x90, 0x91, 0x16, 0x93, 0x94, 0x95, 0x96, 0x04, 0x98, 0x99, 0x9a, 0x9b,

0x14, 0x15, 0x9e, 0x1a, 0x20, 0xa0, 0xe2, 0xe4, 0xe0, 0xe1, 0xe3, 0xe5,

0xe7, 0xf1, 0xa2, 0x2e, 0x3c, 0x28, 0x2b, 0x7c, 0x26, 0xe9, 0xea, 0xeb,

0xe8, 0xed, 0xee, 0xef, 0xec, 0xdf, 0x21, 0x24, 0x2a, 0x29, 0x3b, 0x5e,

0x2d, 0x2f, 0xc2, 0xc4, 0xc0, 0xc1, 0xc3, 0xc5, 0xc7, 0xd1, 0xa6, 0x2c,

0x25, 0x5f, 0x3e, 0x3f, 0xf8, 0xc9, 0xca, 0xcb, 0xc8, 0xcd, 0xce, 0xcf,

0xcc, 0x60, 0x3a, 0x23, 0x40, 0x27, 0x3d, 0x22, 0xd8, 0x61, 0x62, 0x63,

0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xab, 0xbb, 0xf0, 0xfd, 0xfe, 0xb1,

0xb0, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, 0x70, 0x71, 0x72, 0xaa, 0xba,

0xe6, 0xb8, 0xc6, 0xa4, 0xb5, 0x7e, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78,

0x79, 0x7a, 0xa1, 0xbf, 0xd0, 0x5b, 0xde, 0xae, 0xac, 0xa3, 0xa5, 0xb7,

0xa9, 0xa7, 0xb6, 0xbc, 0xbd, 0xbe, 0xdd, 0xa8, 0xaf, 0x5d, 0xb4, 0xd7,

0x7b, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0xad, 0xf4,

0xf6, 0xf2, 0xf3, 0xf5, 0x7d, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50,

0x51, 0x52, 0xb9, 0xfb, 0xfc, 0xf9, 0xfa, 0xff, 0x5c, 0xf7, 0x53, 0x54,

0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0xb2, 0xd4, 0xd6, 0xd2, 0xd3, 0xd5,

0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0xb3, 0xdb,

0xdc, 0xd9, 0xda, 0x9f};

enum { NoUTF8 = 0x0, SrcIsUTF8 = 0x1, DstIsUTF8 = 0x2 };

fanbo-mengUnsubmitted

Done

The naming here with "UTF" is ambiguous as it can mean UTF8, UTF16, UTF32. Using UTF8 with the enum names here would be better.

KaiAuthorUnsubmitted

Done

Renamed.

std::error_code convertWithTable(const unsigned char *Table, unsigned Flags,

StringRef Source,

SmallVectorImpl<char> &Result) {

const unsigned char *Ptr =

reinterpret_cast<const unsigned char *>(Source.data());

size_t Length = Source.size();

while (Length--) {

unsigned char Ch = *Ptr++;

// Handle UTF-8 2-byte-sequences in input.

if (Flags & SrcIsUTF8) {

if (Ch >= 128) {

// Only two-byte sequences can be decoded.

if (Ch != 0xc2 && Ch != 0xc3)

hubert.reinterpretcastUnsubmitted

Not Done

The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".

KaiAuthorUnsubmitted

Done

Changed.

return std::make_error_code(std::errc::illegal_byte_sequence);

// Is buffer truncated?

hubert.reinterpretcastUnsubmitted

Not Done

The API contract between the table conversion and the iconv conversion is inconsistent. iconv conversion performs an implementation-defined conversion for valid input characters that do not have a representation in the output codeset. This implementation fails the conversion instead.

KaiAuthorUnsubmitted

Done

Changed implementation to have same behavior as iconv.

hubert.reinterpretcastUnsubmitted

Not Done

Thanks for making the change. Thinking about it a bit more, I am wondering if there should be a "policy" available for the class to support selecting one behaviour or the other (or at least identifying that there were characters with no output codeset representation encountered). For example, iconv conversions that would not round-trip are identifiable via non-error non-zero return codes (indicating the number of input characters that did not have a representation in the output codeset).

This information/policy will be useful for implementing the diagnostics required if https://wg21.link/p1854 is adopted.

if (!Length)

return std::make_error_code(std::errc::invalid_argument);

unsigned char Ch2 = *Ptr++;

// Is second byte well-formed?

if ((Ch2 & 0xc0) != 0x80)

return std::make_error_code(std::errc::illegal_byte_sequence);

Ch = Ch2 | (Ch << 6);

Length--;

}

hubert.reinterpretcastUnsubmitted

Not Done

The coding guidelines say to use preincrement in cases where postincrement is not needed.

KaiAuthorUnsubmitted

Done

Thanks for catching - my fault.

}

// Translate the character.

Ch = Table ? Table[Ch] : Ch;

// Handle UTF-8 2-byte-sequences in output.

if (Flags & DstIsUTF8) {

if (Ch >= 128) {

// First byte prefixed with either 0xc2 or 0xc3.

Result.push_back(static_cast<char>(0xc0 | (Ch >> 6)));

// Second byte is either the same as the ASCII byte or ASCII byte -64.

Ch = Ch & 0xbf;

}

Result.push_back(static_cast<char>(Ch));

}

return std::error_code();

}
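The two-byte UTF-8 handling discussed in the comments on this function covers exactly the scalar values U+0080 to U+00FF, whose lead byte is either 0xc2 or 0xc3. A minimal standalone sketch of that mapping follows; the helper names `encodeLatin1AsUTF8` and `decodeUTF8ToLatin1` are hypothetical, and the sketch assumes the strict behaviour in which anything outside [U+0000, U+00FF], malformed, or truncated is rejected.

```cpp
#include <cstddef>
#include <string>

// Encodes one ISO 8859-1 code unit (U+0000..U+00FF) as UTF-8.
std::string encodeLatin1AsUTF8(unsigned char Ch) {
  std::string Out;
  if (Ch >= 0x80) {
    // Lead byte is 0xc2 for U+0080..U+00BF and 0xc3 for U+00C0..U+00FF.
    Out.push_back(static_cast<char>(0xc0 | (Ch >> 6)));
    // Continuation byte carries the low six bits.
    Out.push_back(static_cast<char>(0x80 | (Ch & 0x3f)));
  } else {
    Out.push_back(static_cast<char>(Ch));
  }
  return Out;
}

// Decodes the UTF-8 sequence starting at Pos into one ISO 8859-1 code unit.
// Returns false for truncated input, malformed continuation bytes, overlong
// forms (0xc0, 0xc1), and any scalar value above U+00FF.
bool decodeUTF8ToLatin1(const std::string &In, std::size_t &Pos,
                        unsigned char &Out) {
  if (Pos >= In.size())
    return false;
  unsigned char Ch = static_cast<unsigned char>(In[Pos++]);
  if (Ch < 0x80) {
    Out = Ch;
    return true;
  }
  if (Ch != 0xc2 && Ch != 0xc3)
    return false;
  if (Pos >= In.size())
    return false;
  unsigned char Ch2 = static_cast<unsigned char>(In[Pos++]);
  if ((Ch2 & 0xc0) != 0x80)
    return false;
  Out = static_cast<unsigned char>(((Ch & 0x03) << 6) | (Ch2 & 0x3f));
  return true;
}
```

Restricting the lead byte to 0xc2 and 0xc3 rejects the overlong forms that start with 0xc0 or 0xc1 without a separate check.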

#ifdef HAVE_ICONV

std::error_code convertWithIconv(iconv_t ConvDesc, StringRef Source,

SmallVectorImpl<char> &Result) {

// Setup the input.

hubert.reinterpretcastUnsubmitted

Not Done

The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the iconv path to check the return code) would save effort on getting the error handling right.

It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater than 0x10FFFF or because it encodes a surrogate code point).

KaiAuthorUnsubmitted

Done

Ah, I see your point. I reverted the error handling here, and check the return code from iconv() instead below. There is no real use case for it. I

size_t InputLength = Source.size();

hubert.reinterpretcastUnsubmitted

Not Done

Ch = Ch2 | (Ch << 6);

- Length--;

+ --Length;

}

// Translate the character.

KaiAuthorUnsubmitted

Done

Changed.

char *Input = const_cast<char *>(Source.data());

// Setup the output. We directly write into the SmallVector.

size_t Capacity = Result.capacity();

hubert.reinterpretcastUnsubmitted

Not Done

I suggest renaming this to SeqLength and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).

Result.resize(Capacity);

hubert.reinterpretcastUnsubmitted

Not Done

An error should also be flagged when SkipLen == 1.

char *Output = static_cast<char *>(Result.data());

hubert.reinterpretcastUnsubmitted

Not Done

This looks like a use case for resize_for_overwrite.

KaiAuthorUnsubmitted

Done

Changed, thanks.

size_t OutputLength = Capacity;

while (iconv(ConvDesc, &Input, &InputLength, &Output, &OutputLength) ==

static_cast<size_t>(-1)) {

if (errno == E2BIG) {

// No space left in output buffer. Double the size of the underlying

// memory in the SmallVectorImpl, adjust pointer and length and continue

// the conversion.

const size_t Used = Capacity - OutputLength;

Capacity *= 2;

Result.resize(Capacity);

hubert.reinterpretcastUnsubmitted

Not Done

Missing overflow detection for when the most significant bit of Capacity is already set.

KaiAuthorUnsubmitted

Done

Changed the handling to detect overflows.

Output = static_cast<char *>(Result.data()) + Used;

hubert.reinterpretcastUnsubmitted

Not Done

Same comment about resize_for_overwrite.

KaiAuthorUnsubmitted

Done

Changed, thanks.

OutputLength = Capacity - Used;

} else

// Some other error occurred.

hubert.reinterpretcastUnsubmitted

Not Done

The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.

KaiAuthorUnsubmitted

Done

Thanks again, I was not aware of this change.

return std::error_code(errno, std::generic_category());

}

ctetreauUnsubmitted

Done

If we replace the name strings with enums, then it should be possible to make this function total and remove the ErrorOr

// Re-adjust size to actual size.

Result.resize(Capacity - OutputLength);

return std::error_code();

}

#endif
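The overflow remark on the `Capacity *= 2` step in the E2BIG branch can be illustrated with a small guard. This is only a sketch of one possible check, not the change that was made to the patch; `doubleCapacity` is a hypothetical helper name. It also treats a capacity of zero specially, since doubling zero would never grow the buffer.

```cpp
#include <cstddef>
#include <limits>

// Doubles Capacity in place. Returns false and leaves Capacity unchanged if
// doubling would overflow std::size_t.
bool doubleCapacity(std::size_t &Capacity) {
  const std::size_t Max = std::numeric_limits<std::size_t>::max();
  if (Capacity > Max / 2)
    return false;
  // An empty buffer grows to one byte so that the retry loop makes progress.
  Capacity = Capacity ? Capacity * 2 : 1;
  return true;
}
```

A caller in the retry loop would report an error such as `std::errc::not_enough_memory` (an assumption, the review does not name the error code) when the helper returns false, instead of resizing.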

hubert.reinterpretcastUnsubmitted

Not Done

It may be useful for the client of the interface to be able to retrieve the shift sequence (if any) to the initial shift state from the current internal output shift state. Although that use case might be achieved by using StringRef("", 1u).

} // namespace

CharSetConverter CharSetConverter::create(CharSetNames CSFrom,

CharSetNames CSTo) {

unsigned Flags = NoUTF8;

if (CSFrom == CS_UTF8)

Flags |= SrcIsUTF8;

if (CSTo == CS_UTF8)

Flags |= DstIsUTF8;

const unsigned char *Table = nullptr;

if (CSFrom == CS_IBM1047)

Table = IBM1047ToISO88591;

if (CSTo == CS_IBM1047)

Table = ISO88591ToIBM1047;

return CharSetConverter{

hubert.reinterpretcastUnsubmitted

Not Done

This is a public constructor. It would be appropriate to check that CSFrom and CSTo aren't both CS_IBM1047...

hubert.reinterpretcastUnsubmitted

Not Done

My first comment regarding resize_for_overwrite was for this line.

KaiAuthorUnsubmitted

Done

Sorry, I noted this later, too.

[Table, Flags](StringRef Source, SmallVectorImpl<char> &Result) {

return convertWithTable(Table, Flags, Source, Result);

},

nullptr};

}

hubert.reinterpretcastUnsubmitted

Not Done

Because of the [U+0000, U+00FF] limitation of the convertWithTable converter, we should not be getting here if there is a no-op conversion from UTF-8 to UTF-8. Either a no-op converter should be returned, or a request for such a converter is erroneous (report_fatal_error here would be friendlier by failing fast).

KaiAuthorUnsubmitted

Done

Added a no-op converter.

ErrorOr<CharSetConverter> CharSetConverter::create(StringRef CSFrom,

StringRef CSTo) {

Optional<CharSetConverter::CharSetNames> From = getKnownCharSet(CSFrom);

Optional<CharSetConverter::CharSetNames> To = getKnownCharSet(CSTo);

if (From && To)

return create(*From, *To);

ctetreauUnsubmitted

Not Done

If I'm understanding this correctly, having iconv provides the possibility of supporting conversions other than to and from ascii, utf8, and ebcdic? I'm concerned that this is going to create a ton of bug reports of the form "CharSetConverter::create returned an error on my machine, but not my coworker's machine!" which will be closed as "operator error, install iconv". I feel like there should be a set of conversions supported by CharSetConverter, and they should work regardless of the presence of iconv.

From messages I've seen on the mailing lists, it sounds like there is license uncertainty with linking iconv. Maybe it's best to just not have this?

KaiAuthorUnsubmitted

Done

This can be viewed as the same as compiling a file: it fails on my machine if I do not have the same file in the same place as my coworker. This problem seems acceptable for the clang -fexec-charset patch.

iconv is a POSIX specification (POSIX iconv.h), so there is no license problem.

#ifdef HAVE_ICONV

iconv_t ConvDesc = iconv_open(CSTo.str().c_str(), CSFrom.str().c_str());

if (ConvDesc == reinterpret_cast<iconv_t>(-1))

return std::error_code(errno, std::generic_category());

hubert.reinterpretcastUnsubmitted

Not Done

Use a C-style cast if necessary, but reinterpret_cast is wrong here.

KaiAuthorUnsubmitted

Done

Changed.

return CharSetConverter{

[ConvDesc](StringRef Source, SmallVectorImpl<char> &Result) {

return convertWithIconv(ConvDesc, Source, Result);

},

[ConvDesc]() { iconv_close(ConvDesc); }};

hubert.reinterpretcastUnsubmitted

Not Done

I don't think there's any "partially converted input characters" to flush. There's only shift sequences that may be unnecessary or unwanted.

#endif

return std::make_error_code(std::errc::invalid_argument);

}

No newline at end of file

hubert.reinterpretcastUnsubmitted

Not Done

This is consistent with convertWithTable but not convertWithIconv with respect to whether Result is being appended to as opposed to it having its contents replaced. All three should be consistent.

KaiAuthorUnsubmitted

Done

Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.

llvm/unittests/Support/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS			set(LLVM_LINK_COMPONENTS
	Support			Support
	)			)

	add_llvm_unittest(SupportTests			add_llvm_unittest(SupportTests
	AlignmentTest.cpp			AlignmentTest.cpp
	AlignOfTest.cpp			AlignOfTest.cpp
	AllocatorTest.cpp			AllocatorTest.cpp
	AnnotationsTest.cpp			AnnotationsTest.cpp
	ARMAttributeParser.cpp			ARMAttributeParser.cpp
	ArrayRecyclerTest.cpp			ArrayRecyclerTest.cpp
	Base64Test.cpp			Base64Test.cpp
	BinaryStreamTest.cpp			BinaryStreamTest.cpp
	BlockFrequencyTest.cpp			BlockFrequencyTest.cpp
	BranchProbabilityTest.cpp			BranchProbabilityTest.cpp
	CachePruningTest.cpp			CachePruningTest.cpp
				CharSetTest.cpp
	CrashRecoveryTest.cpp			CrashRecoveryTest.cpp
	Casting.cpp			Casting.cpp
	CheckedArithmeticTest.cpp			CheckedArithmeticTest.cpp
	Chrono.cpp			Chrono.cpp
	CommandLineTest.cpp			CommandLineTest.cpp
	CompressionTest.cpp			CompressionTest.cpp
	ConvertUTFTest.cpp			ConvertUTFTest.cpp
	CRCTest.cpp			CRCTest.cpp
	(105 lines not shown)

llvm/unittests/Support/CharSetTest.cpp

This file was added.

				//===- unittests/Support/CharSetTest.cpp - Charset conversion tests -------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/Support/CharSet.h"
				#include "llvm/ADT/SmallString.h"
				#include "gtest/gtest.h"
				using namespace llvm;

				namespace {

				// String "Hello World!\n" (the trailing newline is 0x0a in ASCII and 0x15 in EBCDIC-1047)
				static const char HelloA[] =
				"\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21\x0a";
				static const char HelloE[] =
				"\xC8\x85\x93\x93\x96\x40\xE6\x96\x99\x93\x84\x5A\x15";

				// String "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
				static const char ABCStrA[] =
				"\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\x4F\x50\x51\x52"
				"\x53\x54\x55\x56\x57\x58\x59\x5A\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6A"
				"\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7A";
				static const char ABCStrE[] =
				"\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9"
				"\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91"
				"\x92\x93\x94\x95\x96\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9";

				// String "¡¢£AÄÅÆEÈÉÊaàáâãäeèéêë"
				static const char AccentUTF[] =
				"\xc2\xa1\xc2\xa2\xc2\xa3\x41\xc3\x84\xc3\x85\xc3\x86\x45\xc3\x88\xc3\x89"
				"\xc3\x8a\x61\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\x65\xc3\xa8\xc3\xa9"
				"\xc3\xaa\xc3\xab";
				static const char AccentE[] = "\xaa\x4a\xb1\xc1\x63\x67\x9e\xc5\x74\x71\x72"
				"\x81\x44\x45\x42\x46\x43\x85\x54\x51\x52\x53";

				TEST(CharSet, FromASCII) {
				// Hello string.
				StringRef Src(HelloA);
				SmallString<64> Dst;
hubert.reinterpretcast (Not Done): Comment should indicate that this is the substitution character.

				CharSetConverter Conv = CharSetConverter::create(
				CharSetConverter::CS_LATIN1, CharSetConverter::CS_IBM1047);
				std::error_code EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(HelloE, static_cast<std::string>(Dst).c_str());

				// ABC string.
				Src = ABCStrA;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(ABCStrE, static_cast<std::string>(Dst).c_str());
				}

				TEST(CharSet, ToASCII) {
				// Hello string.
				StringRef Src(HelloE);
				SmallString<64> Dst;

				CharSetConverter Conv = CharSetConverter::create(CharSetConverter::CS_IBM1047,
				CharSetConverter::CS_LATIN1);
				std::error_code EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(HelloA, static_cast<std::string>(Dst).c_str());

				// ABC string.
				Src = ABCStrE;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(ABCStrA, static_cast<std::string>(Dst).c_str());
				}

				TEST(CharSet, FromUTF8) {
				// Hello string.
				StringRef Src(HelloA);
				SmallString<64> Dst;

				CharSetConverter Conv = CharSetConverter::create(
				CharSetConverter::CS_UTF8, CharSetConverter::CS_IBM1047);
				std::error_code EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(HelloE, static_cast<std::string>(Dst).c_str());

				// ABC string.
				Src = ABCStrA;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(ABCStrE, static_cast<std::string>(Dst).c_str());

				// Accent string.
				Src = AccentUTF;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(AccentE, static_cast<std::string>(Dst).c_str());
				}

				TEST(CharSet, ToUTF8) {
				// Hello string.
				StringRef Src(HelloE);
				SmallString<64> Dst;

				CharSetConverter Conv = CharSetConverter::create(CharSetConverter::CS_IBM1047,
				CharSetConverter::CS_UTF8);
				std::error_code EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(HelloA, static_cast<std::string>(Dst).c_str());

				// ABC string.
				Src = ABCStrE;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(ABCStrA, static_cast<std::string>(Dst).c_str());

				// Accent string.
				Src = AccentE;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(AccentUTF, static_cast<std::string>(Dst).c_str());
				}

				TEST(CharSet, Identity) {
				// Hello string.
hubert.reinterpretcast (Not Done): The other cases of "identity conversion" look like they would have suspicious behaviour. If they do, then this test is insufficient.
				StringRef Src(HelloA);
				SmallString<64> Dst;

				CharSetConverter Conv = CharSetConverter::create(CharSetConverter::CS_LATIN1,
				CharSetConverter::CS_LATIN1);
				std::error_code EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(HelloA, static_cast<std::string>(Dst).c_str());

				// ABC string.
				Src = ABCStrA;
				Dst.clear();
				EC = Conv.convert(Src, Dst);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(ABCStrA, static_cast<std::string>(Dst).c_str());
				}

				TEST(CharSet, RoundTrip) {
				ErrorOr<CharSetConverter> ConvToUTF16 =
				CharSetConverter::create("IBM-1047", "UTF-16");
				// Stop test if conversion is not supported (no underlying iconv support).
				if (!ConvToUTF16) {
				ASSERT_EQ(ConvToUTF16.getError(),
				std::make_error_code(std::errc::invalid_argument));
				return;
				}
				ErrorOr<CharSetConverter> ConvToUTF32 =
				CharSetConverter::create("UTF-16", "UTF-32");
				// Stop test if conversion is not supported (no underlying iconv support).
				if (!ConvToUTF32) {
				ASSERT_EQ(ConvToUTF32.getError(),
				std::make_error_code(std::errc::invalid_argument));
				return;
				}
				ErrorOr<CharSetConverter> ConvToEBCDIC =
				CharSetConverter::create("UTF-32", "IBM-1047");
				// Stop test if conversion is not supported (no underlying iconv support).
				if (!ConvToEBCDIC) {
				ASSERT_EQ(ConvToEBCDIC.getError(),
				std::make_error_code(std::errc::invalid_argument));
				return;
				}

				// Setup source string.
				char SrcStr[256];
				for (size_t I = 0; I < 256; ++I)
				SrcStr[I] = (I + 1) % 256;

				SmallString<99> Dst1Str, Dst2Str, Dst3Str;

				std::error_code EC = ConvToUTF16->convert(StringRef(SrcStr), Dst1Str);
				EXPECT_TRUE(!EC);
				EC = ConvToUTF32->convert(Dst1Str, Dst2Str);
				EXPECT_TRUE(!EC);
				EC = ConvToEBCDIC->convert(Dst2Str, Dst3Str);
				EXPECT_TRUE(!EC);
				EXPECT_STREQ(SrcStr, static_cast<std::string>(Dst3Str).c_str());
				}

				} // namespace
hubert.reinterpretcast (Not Done): There is no representation in the testing of stateful encodings. Reasonable tests (separately for ISO-2022-JP and IBM-939) include:
- "Returning to the initial shift state" when in the initial shift state generates an empty output sequence.
- "Returning to the initial shift state" after the previous conversion ended with a character that requires a shift from the initial shift state generates a non-empty output sequence.

hubert.reinterpretcast (Not Done): I suggest trying to write these tests.
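
The two suggested checks can be sketched directly against iconv, outside the CharSetConverter wrapper. This is only an illustration under stated assumptions: the helper resetSequenceAfter is hypothetical, the input is assumed to be UTF-8, and only ISO-2022-JP is exercised (IBM-939 is omitted because its availability varies between iconv implementations). The sketch relies on the POSIX rule that calling iconv with a null input buffer and a non-null output buffer stores the sequence that returns the output to the initial shift state.

```cpp
#include <cassert>
#include <iconv.h>
#include <optional>
#include <string>

// Hypothetical helper (not part of the patch): converts UTF-8 Input to the
// encoding To, then asks iconv to return to the initial shift state and
// returns only the bytes emitted by that reset call. Returns std::nullopt if
// the encoding is unavailable or any iconv call fails.
std::optional<std::string> resetSequenceAfter(const char *To,
                                              std::string Input) {
  assert(To && "target charset name must not be null");
  iconv_t CD = iconv_open(To, "UTF-8");
  if (CD == (iconv_t)-1)
    return std::nullopt;

  char Buffer[64];
  char *In = Input.data();
  size_t InLeft = Input.size();
  char *Out = Buffer;
  size_t OutLeft = sizeof(Buffer);

  if (InLeft != 0 && iconv(CD, &In, &InLeft, &Out, &OutLeft) == (size_t)-1) {
    iconv_close(CD);
    return std::nullopt;
  }

  // Null input with a non-null output buffer requests the shift sequence
  // that returns the output to the initial shift state.
  char *ResetBegin = Out;
  size_t Rc = iconv(CD, nullptr, nullptr, &Out, &OutLeft);
  iconv_close(CD);
  if (Rc == (size_t)-1)
    return std::nullopt;
  return std::string(ResetBegin, Out);
}
```

A test along these lines would expect an empty result after converting plain ASCII such as "A", and a non-empty result after converting a character that needs a shift, for example U+3042 encoded as "\xE3\x81\x82". When the encoding is not available the helper returns std::nullopt, so a test can skip instead of failing, in the same way the RoundTrip test stops early.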

Revision Contents

Diff 338983
