This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ/z/OS] Add utility class for char set conversion.
Needs ReviewPublic

Authored by Kai on Oct 2 2020, 8:50 AM.

Details

Summary

This class adds support for conversion between different character
sets. The conversion between EBCDIC-1047 and Latin-1/UTF-8 is always
available. If the iconv library is installed, then all conversions
of the iconv library can be used, too.

Use of iconv functions is not as standardized as wanted.
Challenges found so far:

  • On some functions, the iconv*() functions are part of the C library. On other systems, a separate library must be linked in.
  • There are systems with multiple, incompatible implementations of iconv functionality.
  • Not each system provides an implementation.
  • Mapping of EBCDIC-1047 to ASCII/Latin-1/UTF-8 and vice versa is done differently on different platforms.
  • Each implementation supports different mappings.

As result, the implementation provides a thin wrapper around the
iconv functionality, including a fixed conversion for EBCDIC-1047.

Diff Detail

Event Timeline

Kai created this revision.Oct 2 2020, 8:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 2 2020, 8:50 AM
Kai requested review of this revision.Oct 2 2020, 8:50 AM

Drive by as it's years since I've dealt with any of this:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Kai added a comment.Oct 2 2020, 11:11 AM

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Sure, done.

ctetreau added inline comments.
llvm/lib/Support/CharSet.cpp
27

These names leak the iconv dependency, and can be error prone if somebody types them wrong. We should have an enum class for the set of char sets that can be converted. If we keep the iconv dependency, it's trivial to map CharSet::IBM1047 -> "IBM-1047", but the compiler can catch the error if you mistype the enum name.

159

If we replace the name strings with enums, then it should be possible to make this function total and remove the ErrorOr

191

If I'm understanding this correctly, having iconv provides the possibility of supporting conversions other than to and from ascii, utf8, and ebcdic? I'm concerned that this is going to create a ton of bug reports of the form "CharSetConverter::create returned an error on my machine, but not my coworker's machine!" which will be closed as "operator error, install iconv". I feel like there should be a set of conversions supported by CharSetConverter, and they should work regardless of the presense of iconv.

From messages I've seen on the mailing lists, it sounds like there is license uncertainty with linking iconv. Maybe it's best to just not have this?

ThePhD added a subscriber: ThePhD.Oct 24 2020, 6:23 PM

Minor addition for this change; if this does go through, llvm-specific macros in C-like languages should expose the chosen name. I submitted a patch and a feature request to MSVC and GCC for their own implementation-specific macros as well:

It doesn't need to be synchronized with any other compiler's name or exposed feature: it just needs to exist. This allows folks interested in helping people write portable code that uses the execution character set preserve the invariants of their code by allowing them to inspect the name and act meaningfully for names they recognize.

Seems reasonable to me.

-eric

Kai updated this revision to Diff 302822.Nov 4 2020, 5:28 AM

Sorry for taking so long - I was on vacation.

  • Added an enumeration for the basic charsets
  • Added an named constructor just for the basic charsets
  • Updated the test cases

I like the addition of the enumeration. It simplifies the source in some parts.
However, I did not drop the function which uses iconv for conversion. This will be needed for some further extensions.

fanbo-meng added inline comments.
llvm/lib/Support/CharSet.cpp
90

The naming here with "UTF" is ambiguous as it can mean UTF8, UTF16, UTF32. Using UTF8 with the enum names here would be better.

Drive-by comment as I'm reviewing a downstream patch - I think it's preferable to use llvm::Error and llvm::Expected instead of std::error_code, for reporting errors, as it allows you to provide more contextual information (e.g. why the conversion failed, which characters were invalid etc).

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

Kai updated this revision to Diff 338973.Apr 20 2021, 1:12 PM
Kai marked 2 inline comments as done.

Renamed enum members to include 8 for UTF8.

Kai marked an inline comment as done.Apr 20 2021, 1:13 PM
Kai added inline comments.
llvm/lib/Support/CharSet.cpp
90

Renamed.

Kai updated this revision to Diff 338977.Apr 20 2021, 1:23 PM
Kai marked an inline comment as done.

Added the path of the <iconv.h> header as additional include path,

Kai updated this revision to Diff 338983.Apr 20 2021, 1:37 PM

Use the right CMake variable Iconv_INCLUDE_DIRS.

Kai added a comment.Apr 20 2021, 1:39 PM

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.

Iconv_INCLUDE_DIRS is now in the header file search path. This hopefully fixes the compile error.

Kai added inline comments.Apr 20 2021, 1:45 PM
llvm/lib/Support/CharSet.cpp
191

This can be viewed as the same as compiling a file: it fails on my machine if I do not have the same file in the same place as my coworker. This problem seems acceptable for the clang -fexec-charset patch.

iconv is a POSIX specification (POSIX iconv.h), so there is no license problem.

kpn added a subscriber: kpn.Jun 28 2021, 1:39 PM
Kai updated this revision to Diff 358739.Jul 14 2021, 2:02 PM
  • Introduces build option LLVM_ENABLE_ICONV to optionally turn building with iconv off
  • Add include path to the iconv header when building with iconv enabled and not part of C library This hopefully fixes the build failure on the Windows bot.
Kai updated this revision to Diff 371730.Sep 9 2021, 3:11 PM

Fixed a problem in the CMake files.
This should finally fix the problem seen on the Windows builder.

Kai updated this revision to Diff 371744.Sep 9 2021, 4:12 PM

Fix wrong use of #ifdef

Kai updated this revision to Diff 379158.Oct 12 2021, 1:05 PM

Rebased on latest main.

llvm/include/llvm/Support/CharSet.h
48

Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.

llvm/lib/Support/CharSet.cpp
31–36

There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).

ISO8859-1 is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.

103

The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".

105

The API contract between the table conversion and the iconv conversion is inconsistent. iconv conversion performs an implementation-defined conversion for valid input characters that does not have a representation in the output codeset. This implementation fails the conversion instead.

114

The coding guidelines say to use preincrement in cases where postincrement is not needed.

141

This looks like a use case for resize_for_overwrite.

152

Missing overflow detection for when the most significant bit of Capacity is already set.

153

Same comment about resize_for_overwrite.

156

The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.

195

Use a C-style cast if necessary, but reinterpret_cast is wrong here.

llvm/lib/Support/CharSet.cpp
176–179

This is a public constructor. It would be appropriate to check that CSFrom and CSTo aren't both CS_IBM1047...

180–184

Because of the [U+0000, U+00FF] limitation of the convertWithTable converter, we should not be getting here if there is a no-op conversion from UTF-8 to UTF-8. Either a no-op converter should be returned, or a request for such a converter is erroneous (report_fatal_error here would be friendlier by failing fast).

llvm/include/llvm/Support/CharSet.h
99

The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null StringRef as the method to cause a return to the initial shift state would be consistent with the iconv interface.

108

Same comment about null termination and shift state.

llvm/unittests/Support/CharSetTest.cpp
131

The other cases of "identity conversion" look like they would have suspicious behaviour. If they do, then this test is insufficient.

192

There is no representation in the testing of stateful encodings.

Reasonable tests (separately for ISO-2022-JP and IBM-939) include:

  • "Returning to the initial shift state" when in the initial shift state generates an empty output sequence.
  • "Returning to the initial shift state" after the previous conversion ended with a character that requires a shift from the initial shift state generates a non-empty output sequence.
Kai updated this revision to Diff 379800.Oct 14 2021, 12:29 PM

Changes based on review comments:

  • Renamed/moved enumeration members for charset names
  • Using a charset alias algorithm to make using charset names easier
  • Some source code style updates
Kai added inline comments.Oct 14 2021, 1:04 PM
llvm/include/llvm/Support/CharSet.h
48

Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.

llvm/lib/Support/CharSet.cpp
31–36

Thanks! I added the normalization algorithm, as I think that it makes using the converter easier.

Obviously, the ISO8859-1 name comes from the z/OS implementation of iconv, which accepts 'ISO8859-1` or 819 but not the registered name ISO-8859-1.

103

Changed.

114

Thanks for catching - my fault.

156

Thanks again, I was not aware of this change.

ZarkoCA added inline comments.
llvm/include/llvm/Support/CharSet.h
33

I think enums should start with an uppercase letter as per the style guide.

43

nit: may be more clear to add the ending comment.

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

llvm/include/llvm/Support/CharSet.h
33

Thanks @ZarkoCA. The coding standards don't state this, but the existing practice is that the exception (from the coding standards) about following the C++ STL naming convention in some cases is more broadly applicable for components with a corresponding standard (or proposed standard) facility like this one.

Kai updated this revision to Diff 380686.Oct 19 2021, 7:17 AM
  • Update table-based algorithm to map Unicode characters outside the range [U+0080, U+00FF] to the SUB character. This is the same behavior as iconv.
  • Add a new identify transformation which does not alter UTF-8 sequences
  • Use resize_for_overwrite for memory allocation
Kai added a comment.Oct 19 2021, 7:18 AM

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

Yep, I know.

Kai added inline comments.Oct 19 2021, 7:59 AM
llvm/include/llvm/Support/CharSet.h
33

To be precise, the exception is documented here: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (at the end of the paragraph).

llvm/lib/Support/CharSet.cpp
105

Changed implementation to have same behavior as iconv.

141

Changed, thanks.

153

Changed, thanks.

180–184

Added a no-op converter.

195

Changed.

llvm/lib/Support/CharSet.cpp
105

Thanks for making the change. Thinking about it a bit more, I am wondering if there should be a "policy" available for the class to support selecting one behaviour or the other (or at least identifying that there were characters with no output codeset representation encountered). For example, iconv conversions that would not round-trip are identifiable via non-error non-zero return codes (indicating the number of input characters that did not have a representation in the output codeset).

This information/policy will be useful for implementing the diagnostics required if https://wg21.link/p1854 is adopted.

llvm/lib/Support/CharSet.cpp
135

The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the iconv path to check the return code) would save effort on getting the error handling right.

It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater then 0x10FFFF or because it encodes a surrogate code point).

139

I suggest renaming this to SeqLength and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).

140

An error should also be flagged when SkipLen == 1.

llvm/lib/Support/CharSet.cpp
179

My first comment regarding resize_for_overwrite was for this line.

213

This is consistent with convertWithTable but not convertWithIconv with respect to whether Result is being appended to as opposed to it having its contents replaced. All three should be consistent.

llvm/unittests/Support/CharSetTest.cpp
43

Comment should indicate that this is the substitution character.

Kai updated this revision to Diff 384560.Nov 3 2021, 12:40 PM
  • only support reversible conversions
  • make sure that all converters behave in the same way
  • updated comments
Kai added inline comments.Nov 3 2021, 12:53 PM
llvm/include/llvm/Support/CharSet.h
99

I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. resetState() would not be the better approach.

llvm/lib/Support/CharSet.cpp
135

Ah, I see your point. I reverted the error handling here, and check the return code from iconv() instead below. There is no real use case for it. I

152

Changed the handling to detect overflows.

179

Sorry, I noted this later, too.

213

Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.

llvm/CMakeLists.txt
386 ↗(On Diff #384560)

Minor nit: s/for for/for;

llvm/include/llvm/Support/CharSet.h
43

Minor comment still to address here.

77

Minor nit: s/occuor/occur/;

91

If the current object had a cleanup action on entry to the move assignment operator, then that cleanup action should be run prior to replacing the value.

99

I think that empty string is maybe not unique enough for this operation, so yes, a new method is probably better. However, it may be useful to understand how using StringRef("", 1u) as the input fits in.

108

Here too: s/occuor/occur/;

llvm/lib/Support/CharSet.cpp
136
164

It may be useful for the client of the interface to be able to retrieve the shift sequence (if any) to the initial shift state from the current internal output shift state. Although that use case might be achieved by using StringRef("", 1u).

llvm/lib/Support/CharSet.cpp
198–200

I don't think there's any "partially converted input characters" to flush. There's only shift sequences that may be unnecessary or unwanted.

204–206

I think forcing return to the initial shift state at this level is too rigid.

As mentioned in https://reviews.llvm.org/D88741?id=379158#inline-1064326, the recommended behaviour for numeric escapes involves no return to the initial shift state prior to the injection of the values specified.

Whether Clang follows that recommendation or not for one platform or the other is a higher level question that should be had in the context of a client of this facility. The facility itself should leave the possibility of resuming with the prior translation's consequent shift state.

llvm/include/llvm/Support/CharSet.h
103

I think "string" is too ambiguous. Additional sentences to clarify would help:
Converts the contents of a StringRef, which conventionally does not include a terminating null character. No additional null termination of the result is attempted.

128

Can refer to the other function to note clarifications re: null termination.

@Kai, I've completed my review of the current state of this patch. I am out-of-office for three weeks. I hope my comments are clear enough to address and for other reviewers to confirm action/response on.

llvm/unittests/Support/CharSetTest.cpp
192

I suggest trying to write these tests.

hubert.reinterpretcast edited the summary of this revision. (Show Details)Nov 8 2021, 6:35 AM
Kai updated this revision to Diff 386303.Nov 10 2021, 1:41 PM

Fix a couple of review comments.

Kai added inline comments.Nov 10 2021, 1:44 PM
llvm/CMakeLists.txt
386 ↗(On Diff #384560)

Changed.

llvm/include/llvm/Support/CharSet.h
43

Comment added.

77

Fixed. Thanks for catching!

91

Call to cleanup added.

108

Fixed, too.

llvm/lib/Support/CharSet.cpp
136

Changed.

The new changes are improvements. The comments regarding null termination and shift state are still pending resolution.

Kai updated this revision to Diff 509344.Mar 29 2023, 6:43 AM

Rewrite based on review comments

  • New class hierarchy using pimpl idiom
  • New handling for flushing
Herald added a project: Restricted Project. · View Herald TranscriptMar 29 2023, 6:44 AM

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574 , but neither gather a lot of comments.

I would like to better understand the set of circumstances in which the system iconv's IBM1047 transcoder would not be suitable for use by llvm, on the systems that make use of that encoding.
Maintaining these tables won't be free. Are the discrepancies regarding LF handling something you expect to cause issues in practice?

I am of the opinion that we should rely on iconv as much as possible, as i do not think maintaining conversion table will be a good use of our resources, however i think people might have widely different opinions and you will want
to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encondings has been an issue in practice but it's worth addressing. I know some people mentioned that point previously.
IMO, Any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of unsupported encoding.

If we do use iconv though, i would like us to have a better understanding of use cases, The patch currently links iconv to all llvm libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging
on linux distributions.

Speaking of, maybe the LLVM foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges but we
should to make sure.

@aaron.ballman also reminded me that if iconv is generally available on posix-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of win32 APIs may be challenging, but we might want to think about
ways to allow or facilitate the use of iconv in Windows build. At least i would like us to give some consideration to that point, Indeed, Windows would, logically be the other big motivation to teach Clang about encodings beyond UTF-8.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

On a slightly more technical aspects, I'm concerned about the attempt at providing genericity in the CharSetConverterTable class - as Hubert alluded to, a table of bytes only works for specific, stateless, single byte encoding that have a reasonable mapping to one another.
If we do agree that we want a homegrown IBM1047<->UTF-8 conversion class, maybe we should have a class that does only that instead of trying to future proof and expect to add more tables. As i hope that we won't try to support more encoding without iconv.

CharSetConverterIdentity will never be a very efficient thing to use as implemented as it will perform a copy that may not be be useful. Have you consider either asserting that the encoding are distincts, or providing an error in that case instead of making a no-op copying converter?

llvm/include/llvm/Support/CharSet.h
94

Please use std::unique_ptr here, that way you won't have to manually manage memory in the destructor and move assignment operator

Kai updated this revision to Diff 510158.Mar 31 2023, 6:03 PM
  • Use std::unique_ptr
  • clang-format the file

Thanks for working on this.

Before starting a more in depth review on that, I think this is big enough that we want to see an RFC with wider consensus and interest from the community as far as maintaining this.
I know of https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56592/2 and https://discourse.llvm.org/t/rfc-adding-a-char-set-converter-to-support-library/56574 , but neither gather a lot of comments.

+1, neither of those RFCs really showed a strong community consensus behind the idea, so I think another RFC would be appropriate. Please be sure to address the items in https://clang.llvm.org/get_involved.html#criteria explicitly in the RFC where appropriate.

I am of the opinion that we should rely on iconv as much as possible, as i do not think maintaining conversion table will be a good use of our resources, however i think people might have widely different opinions and you will want
to make sure there is community buy-in on that point.

We should note that GCC has had good success using iconv, and I don't think the platform-specific availability of encondings has been an issue in practice but it's worth addressing. I know some people mentioned that point previously.
IMO, Any user confusion in that regard could be addressed by providing a sufficiently high-quality diagnostic in the presence of unsupported encoding.

If we do use iconv though, i would like us to have a better understanding of use cases, The patch currently links iconv to all llvm libraries, which might be overkill if the only project using it is Clang, and I wonder how that affects packaging
on linux distributions.

Speaking of, maybe the LLVM foundation will be able to provide legal opinions on linking to iconv on various platforms. Maybe something @tstellar would be able to help with. I would not expect particular challenges but we
should to make sure.

Agreed -- any time we bring in a third-party dependency, we should be going through an explicit license review just to make sure we're in the clear.

@aaron.ballman also reminded me that if iconv is generally available on posix-ish platforms, it is usually not available on Windows. Building iconv-like facilities on top of win32 APIs may be challenging, but we might want to think about
ways to allow or facilitate the use of iconv in Windows build. At least i would like us to give some consideration to that point, Indeed, Windows would, logically be the other big motivation to teach Clang about encodings beyond UTF-8.

This is my big concern. iconv isn't installed on Windows by default, so this means we either need to dynamically link against iconv (which means Windows users now have extra dependencies they need to deal with and this can negatively impact downstream projects that ship an iconv version because of versioning differences) or we need to statically link against iconv (which means Windows users now have a significantly larger binary to deal with, and it still can negatively impact downstream projects that also statically link against iconv). That said, I also don't think we want to try to replace iconv with our own set of APIs. This should be explicitly discussed in the new RFC.

Are there any other supported platforms for which iconv availability would be a concern?

I think all of these points should be addressed in an RFC so that we have a clear plan moving forward.

+1

@aaron.ballman because of the LGPL nature of iconv we night be restricted in our ability to link iconv statistically (but IANAL).
However, we might actually have options there. ICU is generally available on windows https://learn.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu- - that's something that could be investiguated.
ICU converter API is similar to that of iconv https://unicode-org.github.io/icu/userguide/conversion/converters.html#creating-a-converter so, it could be surfaced the same way (ie through a name), both libs use a (potentially superset) of IANA names

CharSetConverterIdentity will never be a very efficient thing to use as implemented as it will perform a copy that may not be be useful. Have you consider either asserting that the encoding are distincts, or providing an error in that case instead of making a no-op copying converter?

We did consider that, but ultimately decided against it. The reason for that is so we are able to create a create method that is guaranteed to return a functional converter (ie. is guaranteed to not throw an error). However, if it is greatly preferable to have it assert/error in the event of an "identity" conversion, we can make that change too.

Based on comments here and on the new RFC, we've decided to remove the use of iconv and limit this patch to supporting conversions between EBCDIC and UTF-8.

Kai updated this revision to Diff 514638.Apr 18 2023, 7:22 AM

Strip down implementation to just having 2 functions to convert between EBCDIC-1047 and UTF-8.

Kai updated this revision to Diff 514654.Apr 18 2023, 8:46 AM

Fix error in unit test configuration.