Download Raw Diff

Details

Reviewers

uweigand
kbarton
yusra.syeda
hubert.reinterpretcast
tahonermann
ZarkoCA

Summary

This class adds support for conversion between different character
sets. The conversion between EBCDIC-1047 and Latin-1/UTF-8 is always
available. If the iconv library is installed, then all conversions
of the iconv library can be used, too.

Use of iconv functions is not as standardized as wanted.
Challenges found so far:

On some functions, the iconv*() functions are part of the C library. On other systems, a separate library must be linked in.
There are systems with multiple, incompatible implementations of iconv functionality.
Not each system provides an implementation.
Mapping of EBCDIC-1047 to ASCII/Latin-1/UTF-8 and vice versa is done differently on different platforms.
Each implementation supports different mappings.

As result, the implementation provides a thin wrapper around the
iconv functionality, including a fixed conversion for EBCDIC-1047.

Diff Detail

Unit TestsFailed

	Time	Test
	130 ms	x64 debian > LLVM.Examples/OrcV2Examples::lljit-with-thinlto-summaries.test
	20 ms	x64 debian > LLVM.Examples/OrcV2Examples::orcv2-cbindings-add-object-file.test
	30 ms	x64 debian > LLVM.Examples/OrcV2Examples::orcv2-cbindings-basic-usage.test
	30 ms	x64 debian > LLVM.Examples/OrcV2Examples::orcv2-cbindings-lazy.test
	30 ms	x64 debian > LLVM.Examples/OrcV2Examples::orcv2-cbindings-removable-code.test

Event Timeline

Kai created this revision.Oct 2 2020, 8:50 AM

Herald added a project: Restricted Project. · View Herald TranscriptOct 2 2020, 8:50 AM

Herald added subscribers: llvm-commits, hiraditya, mgorny. · View Herald Transcript

Kai requested review of this revision.Oct 2 2020, 8:50 AM

Harbormaster completed remote builds in B73800: Diff 295839.Oct 2 2020, 9:26 AM

Drive by as it's years since I've dealt with any of this:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

In D88741#2308899, @echristo wrote:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Sure, done.

yusra.syeda mentioned this in D88749: [SystemZ/z/OS] Add GOFF reader.Oct 2 2020, 12:00 PM

ctetreau added a subscriber: ctetreau.Oct 5 2020, 10:20 AM

ctetreau added inline comments.

llvm/lib/Support/CharSet.cpp
26 ↗	(On Diff #295839)	These names leak the iconv dependency, and can be error prone if somebody types them wrong. We should have an enum class for the set of char sets that can be converted. If we keep the iconv dependency, it's trivial to map `CharSet::IBM1047 -> "IBM-1047"`, but the compiler can catch the error if you mistype the enum name.
158 ↗	(On Diff #295839)	If we replace the name strings with enums, then it should be possible to make this function total and remove the `ErrorOr`
190 ↗	(On Diff #295839)	If I'm understanding this correctly, having iconv provides the possibility of supporting conversions other than to and from ascii, utf8, and ebcdic? I'm concerned that this is going to create a ton of bug reports of the form "CharSetConverter::create returned an error on my machine, but not my coworker's machine!" which will be closed as "operator error, install iconv". I feel like there should be a set of conversions supported by CharSetConverter, and they should work regardless of the presense of iconv. From messages I've seen on the mailing lists, it sounds like there is license uncertainty with linking iconv. Maybe it's best to just not have this?

yusra.syeda mentioned this in D89071: [SystemZ/z/OS] Add GOFFObjectFile class and details of GOFF file format.Oct 8 2020, 2:02 PM

yusra.syeda added a child revision: D89071: [SystemZ/z/OS] Add GOFFObjectFile class and details of GOFF file format.Oct 20 2020, 10:12 AM

Minor addition for this change; if this does go through, llvm-specific macros in C-like languages should expose the chosen name. I submitted a patch and a feature request to MSVC and GCC for their own implementation-specific macros as well:

It doesn't need to be synchronized with any other compiler's name or exposed feature: it just needs to exist. This allows folks interested in helping people write portable code that uses the execution character set preserve the invariants of their code by allowing them to inspect the name and act meaningfully for names they recognize.

Seems reasonable to me.

-eric

Sorry for taking so long - I was on vacation.

Added an enumeration for the basic charsets
Added an named constructor just for the basic charsets
Updated the test cases

I like the addition of the enumeration. It simplifies the source in some parts.
However, I did not drop the function which uses iconv for conversion. This will be needed for some further extensions.

Herald added a subscriber: dexonsmith. · View Herald TranscriptNov 4 2020, 5:28 AM

Harbormaster completed remote builds in B77542: Diff 302822.Nov 4 2020, 5:48 AM

abhina.sreeskantharajan added a subscriber: abhina.sreeskantharajan.Nov 18 2020, 10:00 AM

fanbo-meng added a subscriber: fanbo-meng.Nov 18 2020, 10:01 AM

fanbo-meng added inline comments.

llvm/lib/Support/CharSet.cpp
89 ↗	(On Diff #302822)	The naming here with "UTF" is ambiguous as it can mean UTF8, UTF16, UTF32. Using UTF8 with the enum names here would be better.

abhina.sreeskantharajan mentioned this in D93031: Enable fexec-charset option .Dec 10 2020, 5:49 AM

abhina.sreeskantharajan added a child revision: D93031: Enable fexec-charset option .Dec 10 2020, 6:24 AM

Drive-by comment as I'm reviewing a downstream patch - I think it's preferable to use llvm::Error and llvm::Expected instead of std::error_code, for reporting errors, as it allows you to provide more contextual information (e.g. why the conversion failed, which characters were invalid etc).

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

yusra.syeda added a child revision: D98437: [SystemZ][z/OS] Add GOFFObjectFile class support for HDR, ESD and END records.Mar 11 2021, 10:16 AM

ThePhD mentioned this in D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor.Apr 12 2021, 3:03 PM

Renamed enum members to include 8 for UTF8.

Kai marked an inline comment as done.Apr 20 2021, 1:13 PM

Kai added inline comments.

llvm/lib/Support/CharSet.cpp
89 ↗	(On Diff #302822)	Renamed.

Added the path of the <iconv.h> header as additional include path,

Use the right CMake variable Iconv_INCLUDE_DIRS.

In D88741#2613823, @abhina.sreeskantharajan wrote:

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

Iconv_INCLUDE_DIRS is now in the header file search path. This hopefully fixes the compile error.

Kai added inline comments.Apr 20 2021, 1:45 PM

llvm/lib/Support/CharSet.cpp
190 ↗	(On Diff #295839)	This can be viewed as the same as compiling a file: it fails on my machine if I do not have the same file in the same place as my coworker. This problem seems acceptable for the clang -fexec-charset patch. iconv is a POSIX specification (POSIX iconv.h), so there is no license problem.

Harbormaster completed remote builds in B99790: Diff 338973.Apr 20 2021, 1:50 PM

Harbormaster completed remote builds in B99793: Diff 338977.Apr 20 2021, 2:08 PM

Harbormaster completed remote builds in B99798: Diff 338983.Apr 20 2021, 2:16 PM

yusra.syeda added a child revision: D103490: [SystemZ][z/OS] Add support for TXT records in the GOFF reader.Jun 1 2021, 1:43 PM

yusra.syeda mentioned this in D103490: [SystemZ][z/OS] Add support for TXT records in the GOFF reader.Jun 28 2021, 1:36 PM

kpn added a subscriber: kpn.Jun 28 2021, 1:39 PM

Introduces build option LLVM_ENABLE_ICONV to optionally turn building with iconv off
Add include path to the iconv header when building with iconv enabled and not part of C library This hopefully fixes the build failure on the Windows bot.

Harbormaster completed remote builds in B114091: Diff 358739.Jul 14 2021, 3:36 PM

Fixed a problem in the CMake files.
This should finally fix the problem seen on the Windows builder.

Harbormaster completed remote builds in B123316: Diff 371730.Sep 9 2021, 3:31 PM

Fix wrong use of #ifdef

Harbormaster completed remote builds in B123328: Diff 371744.Sep 9 2021, 4:43 PM

Rebased on latest main.

Harbormaster completed remote builds in B128460: Diff 379158.Oct 12 2021, 2:24 PM

hubert.reinterpretcast added inline comments.Oct 12 2021, 4:37 PM

llvm/include/llvm/Support/CharSet.h
47 ↗	(On Diff #371744)	Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.
llvm/lib/Support/CharSet.cpp
30–35 ↗	(On Diff #371744)	There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching). `ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.
102 ↗	(On Diff #371744)	The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".
104 ↗	(On Diff #371744)	The API contract between the table conversion and the `iconv` conversion is inconsistent. `iconv` conversion performs an implementation-defined conversion for valid input characters that does not have a representation in the output codeset. This implementation fails the conversion instead.
113 ↗	(On Diff #371744)	The coding guidelines say to use preincrement in cases where postincrement is not needed.
140 ↗	(On Diff #371744)	This looks like a use case for `resize_for_overwrite`.
151 ↗	(On Diff #371744)	Missing overflow detection for when the most significant bit of `Capacity` is already set.
152 ↗	(On Diff #371744)	Same comment about `resize_for_overwrite`.
155 ↗	(On Diff #371744)	The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.
194 ↗	(On Diff #371744)	Use a C-style cast if necessary, but `reinterpret_cast` is wrong here.

hubert.reinterpretcast added inline comments.Oct 12 2021, 6:54 PM

llvm/lib/Support/CharSet.cpp
175–178 ↗	(On Diff #379158)	This is a public constructor. It would be appropriate to check that `CSFrom` and `CSTo` aren't both `CS_IBM1047`...
179–183 ↗	(On Diff #379158)	Because of the [U+0000, U+00FF] limitation of the `convertWithTable` converter, we should not be getting here if there is a no-op conversion from UTF-8 to UTF-8. Either a no-op converter should be returned, or a request for such a converter is erroneous (`report_fatal_error` here would be friendlier by failing fast).

hubert.reinterpretcast added inline comments.Oct 12 2021, 8:04 PM

llvm/include/llvm/Support/CharSet.h
98 ↗	(On Diff #379158)	The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null `StringRef` as the method to cause a return to the initial shift state would be consistent with the `iconv` interface.
107 ↗	(On Diff #379158)	Same comment about null termination and shift state.

hubert.reinterpretcast added inline comments.Oct 12 2021, 8:52 PM

llvm/unittests/Support/CharSetTest.cpp
130 ↗	(On Diff #379158)	The other cases of "identity conversion" look like they would have suspicious behaviour. If they do, then this test is insufficient.
191 ↗	(On Diff #379158)	There is no representation in the testing of stateful encodings. Reasonable tests (separately for ISO-2022-JP and IBM-939) include: "Returning to the initial shift state" when in the initial shift state generates an empty output sequence. "Returning to the initial shift state" after the previous conversion ended with a character that requires a shift from the initial shift state generates a non-empty output sequence.

Changes based on review comments:

Renamed/moved enumeration members for charset names
Using a charset alias algorithm to make using charset names easier
Some source code style updates

Kai added inline comments.Oct 14 2021, 1:04 PM

llvm/include/llvm/Support/CharSet.h
47 ↗	(On Diff #371744)	Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.
llvm/lib/Support/CharSet.cpp
30–35 ↗	(On Diff #371744)	Thanks! I added the normalization algorithm, as I think that it makes using the converter easier. Obviously, the `ISO8859-1` name comes from the z/OS implementation of iconv, which accepts 'ISO8859-1` or `819` but not the registered name `ISO-8859-1`.
102 ↗	(On Diff #371744)	Changed.
113 ↗	(On Diff #371744)	Thanks for catching - my fault.
155 ↗	(On Diff #371744)	Thanks again, I was not aware of this change.

Harbormaster completed remote builds in B128927: Diff 379800.Oct 14 2021, 1:13 PM

ZarkoCA added a subscriber: ZarkoCA.Oct 14 2021, 7:06 PM

ZarkoCA added inline comments.

llvm/include/llvm/Support/CharSet.h
32 ↗	(On Diff #379800)	I think enums should start with an uppercase letter as per the style guide.
42 ↗	(On Diff #379800)	nit: may be more clear to add the ending comment.

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

llvm/include/llvm/Support/CharSet.h
32 ↗	(On Diff #379800)	Thanks @ZarkoCA. The coding standards don't state this, but the existing practice is that the exception (from the coding standards) about following the C++ STL naming convention in some cases is more broadly applicable for components with a corresponding standard (or proposed standard) facility like this one.

Update table-based algorithm to map Unicode characters outside the range [U+0080, U+00FF] to the SUB character. This is the same behavior as iconv.
Add a new identify transformation which does not alter UTF-8 sequences
Use resize_for_overwrite for memory allocation

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

Yep, I know.

Kai added inline comments.Oct 19 2021, 7:59 AM

llvm/include/llvm/Support/CharSet.h
32 ↗	(On Diff #379800)	To be precise, the exception is documented here: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (at the end of the paragraph).
llvm/lib/Support/CharSet.cpp
179–183 ↗	(On Diff #379158)	Added a no-op converter.
104 ↗	(On Diff #371744)	Changed implementation to have same behavior as iconv.
140 ↗	(On Diff #371744)	Changed, thanks.
152 ↗	(On Diff #371744)	Changed, thanks.
194 ↗	(On Diff #371744)	Changed.

Harbormaster completed remote builds in B129546: Diff 380686.Oct 19 2021, 8:37 AM

hubert.reinterpretcast added inline comments.Oct 19 2021, 8:39 PM

llvm/lib/Support/CharSet.cpp
104 ↗	(On Diff #371744)	Thanks for making the change. Thinking about it a bit more, I am wondering if there should be a "policy" available for the class to support selecting one behaviour or the other (or at least identifying that there were characters with no output codeset representation encountered). For example, `iconv` conversions that would not round-trip are identifiable via non-error non-zero return codes (indicating the number of input characters that did not have a representation in the output codeset). This information/policy will be useful for implementing the diagnostics required if https://wg21.link/p1854 is adopted.

hubert.reinterpretcast added inline comments.Oct 26 2021, 8:04 PM

llvm/lib/Support/CharSet.cpp
134 ↗	(On Diff #380686)	The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the `iconv` path to check the return code) would save effort on getting the error handling right. It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater then 0x10FFFF or because it encodes a surrogate code point).
138 ↗	(On Diff #380686)	I suggest renaming this to `SeqLength` and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).
139 ↗	(On Diff #380686)	An error should also be flagged when `SkipLen == 1`.

hubert.reinterpretcast added inline comments.Oct 26 2021, 8:42 PM

llvm/lib/Support/CharSet.cpp
178 ↗	(On Diff #380686)	My first comment regarding `resize_for_overwrite` was for this line.
212 ↗	(On Diff #380686)	This is consistent with `convertWithTable` but not `convertWithIconv` with respect to whether `Result` is being appended to as opposed to it having its contents replaced. All three should be consistent.
llvm/unittests/Support/CharSetTest.cpp
42 ↗	(On Diff #380686)	Comment should indicate that this is the substitution character.

hubert.reinterpretcast added a subscriber: cor3ntin.Oct 28 2021, 8:52 AM

evantypanski added a subscriber: evantypanski.Oct 28 2021, 8:55 AM

only support reversible conversions
make sure that all converters behave in the same way
updated comments

Kai added inline comments.Nov 3 2021, 12:53 PM

llvm/include/llvm/Support/CharSet.h
98 ↗	(On Diff #379158)	I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. `resetState()` would not be the better approach.
llvm/lib/Support/CharSet.cpp
134 ↗	(On Diff #380686)	Ah, I see your point. I reverted the error handling here, and check the return code from `iconv()` instead below. There is no real use case for it. I
178 ↗	(On Diff #380686)	Sorry, I noted this later, too.
212 ↗	(On Diff #380686)	Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.
151 ↗	(On Diff #371744)	Changed the handling to detect overflows.

Harbormaster completed remote builds in B132311: Diff 384560.Nov 3 2021, 3:11 PM

hubert.reinterpretcast added inline comments.Nov 6 2021, 9:01 PM

llvm/CMakeLists.txt
386 ↗	(On Diff #384560)	Minor nit: s/for for/for;
llvm/include/llvm/Support/CharSet.h
76 ↗	(On Diff #384560)	Minor nit: s/occuor/occur/;
90 ↗	(On Diff #384560)	If the current object had a cleanup action on entry to the move assignment operator, then that cleanup action should be run prior to replacing the value.
107 ↗	(On Diff #384560)	Here too: s/occuor/occur/;
98 ↗	(On Diff #379158)	I think that empty string is maybe not unique enough for this operation, so yes, a new method is probably better. However, it may be useful to understand how using `StringRef("", 1u)` as the input fits in.
42 ↗	(On Diff #379800)	Minor comment still to address here.
llvm/lib/Support/CharSet.cpp
135 ↗	(On Diff #384560)
163 ↗	(On Diff #384560)	It may be useful for the client of the interface to be able to retrieve the shift sequence (if any) to the initial shift state from the current internal output shift state. Although that use case might be achieved by using `StringRef("", 1u)`.

hubert.reinterpretcast added inline comments.Nov 7 2021, 7:48 AM

llvm/lib/Support/CharSet.cpp
197–199 ↗	(On Diff #384560)	I don't think there's any "partially converted input characters" to flush. There's only shift sequences that may be unnecessary or unwanted.
203–205 ↗	(On Diff #384560)	I think forcing return to the initial shift state at this level is too rigid. As mentioned in https://reviews.llvm.org/D88741?id=379158#inline-1064326, the recommended behaviour for numeric escapes involves no return to the initial shift state prior to the injection of the values specified. Whether Clang follows that recommendation or not for one platform or the other is a higher level question that should be had in the context of a client of this facility. The facility itself should leave the possibility of resuming with the prior translation's consequent shift state.

hubert.reinterpretcast added inline comments.Nov 7 2021, 9:21 AM

llvm/include/llvm/Support/CharSet.h
102 ↗	(On Diff #384560)	I think "string" is too ambiguous. Additional sentences to clarify would help: Converts the contents of a StringRef, which conventionally does not include a terminating null character. No additional null termination of the result is attempted.
127 ↗	(On Diff #384560)	Can refer to the other function to note clarifications re: null termination.

hubert.reinterpretcast added a reviewer: ZarkoCA.Nov 7 2021, 11:17 AM

@Kai, I've completed my review of the current state of this patch. I am out-of-office for three weeks. I hope my comments are clear enough to address and for other reviewers to confirm action/response on.

llvm/unittests/Support/CharSetTest.cpp
191 ↗	(On Diff #379158)	I suggest trying to write these tests.

hubert.reinterpretcast edited the summary of this revision. (Show Details)Nov 8 2021, 6:35 AM

Fix a couple of review comments.

Kai added inline comments.Nov 10 2021, 1:44 PM

llvm/CMakeLists.txt
386 ↗	(On Diff #384560)	Changed.
llvm/include/llvm/Support/CharSet.h
76 ↗	(On Diff #384560)	Fixed. Thanks for catching!
90 ↗	(On Diff #384560)	Call to cleanup added.
107 ↗	(On Diff #384560)	Fixed, too.
42 ↗	(On Diff #379800)	Comment added.
llvm/lib/Support/CharSet.cpp
135 ↗	(On Diff #384560)	Changed.

Harbormaster completed remote builds in B133577: Diff 386303.Nov 10 2021, 2:48 PM

The new changes are improvements. The comments regarding null termination and shift state are still pending resolution.

Rewrite based on review comments

New class hierarchy using pimpl idiom
New handling for flushing

Herald added a project: Restricted Project. · View Herald TranscriptMar 29 2023, 6:44 AM

dexonsmith removed a subscriber: dexonsmith.Mar 29 2023, 9:26 AM

Harbormaster completed remote builds in B222504: Diff 509344.Mar 29 2023, 10:14 AM

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574 , but neither gather a lot of comments.

I would like to better understand the set of circumstances in which the system iconv's IBM1047 transcoder would not be suitable for use by llvm, on the systems that make use of that encoding.
Maintaining these tables won't be free. Are the discrepancies regarding LF handling something you expect to cause issues in practice?

I am of the opinion that we should rely on iconv as much as possible, as i do not think maintaining conversion table will be a good use of our resources, however i think people might have widely different opinions and you will want
to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encondings has been an issue in practice but it's worth addressing. I know some people mentioned that point previously.
IMO, Any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of unsupported encoding.

If we do use iconv though, i would like us to have a better understanding of use cases, The patch currently links iconv to all llvm libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging
on linux distributions.

Speaking of, maybe the LLVM foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges but we
should to make sure.

@aaron.ballman also reminded me that if iconv is generally available on posix-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of win32 APIs may be challenging, but we might want to think about
ways to allow or facilitate the use of iconv in Windows build. At least i would like us to give some consideration to that point, Indeed, Windows would, logically be the other big motivation to teach Clang about encodings beyond UTF-8.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

On a slightly more technical aspects, I'm concerned about the attempt at providing genericity in the CharSetConverterTable class - as Hubert alluded to, a table of bytes only works for specific, stateless, single byte encoding that have a reasonable mapping to one another.
If we do agree that we want a homegrown IBM1047<->UTF-8 conversion class, maybe we should have a class that does only that instead of trying to future proof and expect to add more tables. As i hope that we won't try to support more encoding without iconv.

CharSetConverterIdentity will never be a very efficient thing to use as implemented as it will perform a copy that may not be be useful. Have you consider either asserting that the encoding are distincts, or providing an error in that case instead of making a no-op copying converter?

llvm/include/llvm/Support/CharSet.h
93 ↗	(On Diff #509344)	Please use `std::unique_ptr` here, that way you won't have to manually manage memory in the destructor and move assignment operator

Everybody0523 added a subscriber: Everybody0523.Mar 30 2023, 11:45 AM

Use std::unique_ptr
clang-format the file

Harbormaster completed remote builds in B223098: Diff 510158.Mar 31 2023, 6:55 PM

In D88741#4232746, @cor3ntin wrote:

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574 , but neither gather a lot of comments.

+1, neither of those RFCs really showed a strong community consensus behind the idea, so I think another RFC would be appropriate. Please be sure to address the items in https://clang.llvm.org/get_involved.html#criteria explicitly in the RFC where appropriate.

I am of the opinion that we should rely on iconv as much as possible, as i do not think maintaining conversion table will be a good use of our resources, however i think people might have widely different opinions and you will want
to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encondings has been an issue in practice but it's worth addressing. I know some people mentioned that point previously.
IMO, Any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of unsupported encoding.

If we do use iconv though, i would like us to have a better understanding of use cases, The patch currently links iconv to all llvm libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging
on linux distributions.

Speaking of, maybe the LLVM foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges but we
should to make sure.

Agreed -- any time we bring in a third-party dependency, we should be going through an explicit license review just to make sure we're in the clear.

@aaron.ballman also reminded me that if iconv is generally available on posix-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of win32 APIs may be challenging, but we might want to think about
ways to allow or facilitate the use of iconv in Windows build. At least i would like us to give some consideration to that point, Indeed, Windows would, logically be the other big motivation to teach Clang about encodings beyond UTF-8.

This is my big concern. iconv isn't installed on Windows by default, so this means we either need to dynamically link against iconv (which means Windows users now have extra dependencies they need to deal with and this can negatively impact downstream projects that ship an iconv version because of versioning differences) or we need to statically link against iconv (which means Windows users now have a significantly larger binary to deal with, and it still can negatively impact downstream projects that also statically link against iconv). That said, I also don't think we want to try to replace iconv with our own set of APIs. This should be explicitly discussed in the new RFC.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

@aaron.ballman because of the LGPL nature of iconv we night be restricted in our ability to link iconv statistically (but IANAL).
However, we might actually have options there. ICU is generally available on windows https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu- - that's something that could be investiguated.
ICU converter API is similar to that of iconv https://unicode-org.github.io/icu/userguide/conversion/converters.html#creating-a-converter so, it could be surfaced the same way (ie through a name), both libs use a (potentially superset) of IANA names

CharSetConverterIdentity will never be a very efficient thing to use as implemented as it will perform a copy that may not be be useful. Have you consider either asserting that the encoding are distincts, or providing an error in that case instead of making a no-op copying converter?

We did consider that, but ultimately decided against it. The reason for that is so we are able to create a create method that is guaranteed to return a functional converter (ie. is guaranteed to not throw an error). However, if it is greatly preferable to have it assert/error in the event of an "identity" conversion, we can make that change too.

Based on comments here and on the new RFC, we've decided to remove the use of iconv and limit this patch to supporting conversions between EBCDIC and UTF-8.

Strip down implementation to just having 2 functions to convert between EBCDIC-1047 and UTF-8.

Harbormaster completed remote builds in B226377: Diff 514638.Apr 18 2023, 7:28 AM

Fix error in unit test configuration.

Harbormaster completed remote builds in B226387: Diff 514654.Apr 19 2023, 11:34 AM

yusra.syeda removed a child revision: D98437: [SystemZ][z/OS] Add GOFFObjectFile class support for HDR, ESD and END records.Apr 24 2023, 12:39 PM

abhina.sreeskantharajan mentioned this in D153417: New CharSetConverter wrapper class for ConverterEBCDIC.Jun 27 2023, 5:52 AM

This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ/z/OS] Add utility class for char set conversion.
Needs ReviewPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 514654

llvm/include/llvm/Support/ConvertEBCDIC.h

llvm/lib/Support/CMakeLists.txt

llvm/lib/Support/ConvertEBCDIC.cpp

llvm/unittests/Support/CMakeLists.txt

llvm/unittests/Support/ConvertEBCDICTest.cpp

This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ/z/OS] Add utility class for char set conversion.Needs ReviewPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 514654

llvm/include/llvm/Support/ConvertEBCDIC.h

llvm/lib/Support/CMakeLists.txt

llvm/lib/Support/ConvertEBCDIC.cpp

llvm/unittests/Support/CMakeLists.txt

llvm/unittests/Support/ConvertEBCDICTest.cpp

[SystemZ/z/OS] Add utility class for char set conversion.
Needs ReviewPublic