Page MenuHomePhabricator

[SystemZ/z/OS] Add utility class for char set conversion.
Needs ReviewPublic

Authored by Kai on Oct 2 2020, 8:50 AM.

Details

Summary

This class adds support for conversion between different character
sets. The conversion between EBCDIC-1047 and Latin-1/UTF-8 is always
available. If the iconv library is installed, then all conversions
of the iconv library can be used, too.

Use of iconv functions is not as standardized as wanted.
Challenges found so far:

  • On some functions, the iconv*() functions are part of the C library. On other systems, a separate library must be linked in.
  • There are systems with multiple, incompatible implementations of iconv functionality.
  • Not each system provides an implementation.
  • Mapping of EBCDIC-1047 to ASCII/Latin-1/UTF-8 and vice versa is done differently on different platforms.
  • Each implementation supports different mappings.

As result, the implementation provides a thin wrapper around the
iconv functionality, including a fixed conversion for EBCDIC-1047.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kpn added a subscriber: kpn.Jun 28 2021, 1:39 PM
Kai updated this revision to Diff 358739.Jul 14 2021, 2:02 PM
  • Introduces build option LLVM_ENABLE_ICONV to optionally turn building with iconv off
  • Add include path to the iconv header when building with iconv enabled and not part of C library This hopefully fixes the build failure on the Windows bot.
Kai updated this revision to Diff 371730.Sep 9 2021, 3:11 PM

Fixed a problem in the CMake files.
This should finally fix the problem seen on the Windows builder.

Kai updated this revision to Diff 371744.Sep 9 2021, 4:12 PM

Fix wrong use of #ifdef

Kai updated this revision to Diff 379158.Oct 12 2021, 1:05 PM

Rebased on latest main.

llvm/include/llvm/Support/CharSet.h
48

Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.

llvm/lib/Support/CharSet.cpp
31–36

There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).

ISO8859-1 is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.

103

The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".

105

The API contract between the table conversion and the iconv conversion is inconsistent. iconv conversion performs an implementation-defined conversion for valid input characters that does not have a representation in the output codeset. This implementation fails the conversion instead.

114

The coding guidelines say to use preincrement in cases where postincrement is not needed.

141

This looks like a use case for resize_for_overwrite.

152

Missing overflow detection for when the most significant bit of Capacity is already set.

153

Same comment about resize_for_overwrite.

156

The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.

195

Use a C-style cast if necessary, but reinterpret_cast is wrong here.

llvm/lib/Support/CharSet.cpp
176–179

This is a public constructor. It would be appropriate to check that CSFrom and CSTo aren't both CS_IBM1047...

180–184

Because of the [U+0000, U+00FF] limitation of the convertWithTable converter, we should not be getting here if there is a no-op conversion from UTF-8 to UTF-8. Either a no-op converter should be returned, or a request for such a converter is erroneous (report_fatal_error here would be friendlier by failing fast).

llvm/include/llvm/Support/CharSet.h
99

The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null StringRef as the method to cause a return to the initial shift state would be consistent with the iconv interface.

108

Same comment about null termination and shift state.

llvm/unittests/Support/CharSetTest.cpp
131

The other cases of "identity conversion" look like they would have suspicious behaviour. If they do, then this test is insufficient.

192

There is no representation in the testing of stateful encodings.

Reasonable tests (separately for ISO-2022-JP and IBM-939) include:

  • "Returning to the initial shift state" when in the initial shift state generates an empty output sequence.
  • "Returning to the initial shift state" after the previous conversion ended with a character that requires a shift from the initial shift state generates a non-empty output sequence.
Kai updated this revision to Diff 379800.Oct 14 2021, 12:29 PM

Changes based on review comments:

  • Renamed/moved enumeration members for charset names
  • Using a charset alias algorithm to make using charset names easier
  • Some source code style updates
Kai added inline comments.Oct 14 2021, 1:04 PM
llvm/include/llvm/Support/CharSet.h
48

Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.

llvm/lib/Support/CharSet.cpp
31–36

Thanks! I added the normalization algorithm, as I think that it makes using the converter easier.

Obviously, the ISO8859-1 name comes from the z/OS implementation of iconv, which accepts 'ISO8859-1` or 819 but not the registered name ISO-8859-1.

103

Changed.

114

Thanks for catching - my fault.

156

Thanks again, I was not aware of this change.

ZarkoCA added inline comments.
llvm/include/llvm/Support/CharSet.h
33

I think enums should start with an uppercase letter as per the style guide.

43

nit: may be more clear to add the ending comment.

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

llvm/include/llvm/Support/CharSet.h
33

Thanks @ZarkoCA. The coding standards don't state this, but the existing practice is that the exception (from the coding standards) about following the C++ STL naming convention in some cases is more broadly applicable for components with a corresponding standard (or proposed standard) facility like this one.

Kai updated this revision to Diff 380686.Oct 19 2021, 7:17 AM
  • Update table-based algorithm to map Unicode characters outside the range [U+0080, U+00FF] to the SUB character. This is the same behavior as iconv.
  • Add a new identify transformation which does not alter UTF-8 sequences
  • Use resize_for_overwrite for memory allocation
Kai added a comment.Oct 19 2021, 7:18 AM

I've looked at the new changes and verified that the recent updates are an improvement. There remain unaddressed comments.

Yep, I know.

Kai added inline comments.Oct 19 2021, 7:59 AM
llvm/include/llvm/Support/CharSet.h
33

To be precise, the exception is documented here: https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (at the end of the paragraph).

llvm/lib/Support/CharSet.cpp
105

Changed implementation to have same behavior as iconv.

141

Changed, thanks.

153

Changed, thanks.

180–184

Added a no-op converter.

195

Changed.

llvm/lib/Support/CharSet.cpp
105

Thanks for making the change. Thinking about it a bit more, I am wondering if there should be a "policy" available for the class to support selecting one behaviour or the other (or at least identifying that there were characters with no output codeset representation encountered). For example, iconv conversions that would not round-trip are identifiable via non-error non-zero return codes (indicating the number of input characters that did not have a representation in the output codeset).

This information/policy will be useful for implementing the diagnostics required if https://wg21.link/p1854 is adopted.

llvm/lib/Support/CharSet.cpp
135

The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the iconv path to check the return code) would save effort on getting the error handling right.

It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater then 0x10FFFF or because it encodes a surrogate code point).

139

I suggest renaming this to SeqLength and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).

140

An error should also be flagged when SkipLen == 1.

llvm/lib/Support/CharSet.cpp
179

My first comment regarding resize_for_overwrite was for this line.

213

This is consistent with convertWithTable but not convertWithIconv with respect to whether Result is being appended to as opposed to it having its contents replaced. All three should be consistent.

llvm/unittests/Support/CharSetTest.cpp
43

Comment should indicate that this is the substitution character.

Kai updated this revision to Diff 384560.Nov 3 2021, 12:40 PM
  • only support reversible conversions
  • make sure that all converters behave in the same way
  • updated comments
Kai added inline comments.Nov 3 2021, 12:53 PM
llvm/include/llvm/Support/CharSet.h
99

I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. resetState() would not be the better approach.

llvm/lib/Support/CharSet.cpp
135

Ah, I see your point. I reverted the error handling here, and check the return code from iconv() instead below. There is no real use case for it. I

152

Changed the handling to detect overflows.

179

Sorry, I noted this later, too.

213

Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.

llvm/CMakeLists.txt
394

Minor nit: s/for for/for;

llvm/include/llvm/Support/CharSet.h
43

Minor comment still to address here.

77

Minor nit: s/occuor/occur/;

91

If the current object had a cleanup action on entry to the move assignment operator, then that cleanup action should be run prior to replacing the value.

99

I think that empty string is maybe not unique enough for this operation, so yes, a new method is probably better. However, it may be useful to understand how using StringRef("", 1u) as the input fits in.

108

Here too: s/occuor/occur/;

llvm/lib/Support/CharSet.cpp
136
164

It may be useful for the client of the interface to be able to retrieve the shift sequence (if any) to the initial shift state from the current internal output shift state. Although that use case might be achieved by using StringRef("", 1u).

llvm/lib/Support/CharSet.cpp
198–200

I don't think there's any "partially converted input characters" to flush. There's only shift sequences that may be unnecessary or unwanted.

204–206

I think forcing return to the initial shift state at this level is too rigid.

As mentioned in https://reviews.llvm.org/D88741?id=379158#inline-1064326, the recommended behaviour for numeric escapes involves no return to the initial shift state prior to the injection of the values specified.

Whether Clang follows that recommendation or not for one platform or the other is a higher level question that should be had in the context of a client of this facility. The facility itself should leave the possibility of resuming with the prior translation's consequent shift state.

llvm/include/llvm/Support/CharSet.h
103

I think "string" is too ambiguous. Additional sentences to clarify would help:
Converts the contents of a StringRef, which conventionally does not include a terminating null character. No additional null termination of the result is attempted.

128

Can refer to the other function to note clarifications re: null termination.

@Kai, I've completed my review of the current state of this patch. I am out-of-office for three weeks. I hope my comments are clear enough to address and for other reviewers to confirm action/response on.

llvm/unittests/Support/CharSetTest.cpp
192

I suggest trying to write these tests.

hubert.reinterpretcast edited the summary of this revision. (Show Details)Nov 8 2021, 6:35 AM
Kai updated this revision to Diff 386303.Nov 10 2021, 1:41 PM

Fix a couple of review comments.

Kai added inline comments.Nov 10 2021, 1:44 PM
llvm/CMakeLists.txt
394

Changed.

llvm/include/llvm/Support/CharSet.h
43

Comment added.

77

Fixed. Thanks for catching!

91

Call to cleanup added.

108

Fixed, too.

llvm/lib/Support/CharSet.cpp
136

Changed.

The new changes are improvements. The comments regarding null termination and shift state are still pending resolution.