Page MenuHomePhabricator

[SystemZ/z/OS] Add utility class for char set conversion.
Needs ReviewPublic

Authored by Kai on Oct 2 2020, 8:50 AM.



This class adds support for conversion between different character
sets. The conversion between EBCDIC-1047 and Latin-1/UTF-8 is always
available. If the iconv library is installed, then all conversions
of the iconv library can be used, too.

Use of iconv functions is not as standarddized as wanted.
Challenges found so far:

  • On some functions, the iconv*() functions are part of the C library. On other systems, a separate library must be linked in.
  • Their are systems with multiple, incompatible implementations of iconv functionality.
  • Not each system provides an implementation.
  • Mapping of EBCDIC-1047 to ASCII/Latin-1/UTF-8 and vice versa is done differently on different platforms.
  • Each implementation supports different mappings.

As result, the implementation provides a thin wrapper around the
iconv functionality, including a fixed conversion for EBDIC-1047.

Diff Detail

Unit TestsFailed

380 mslinux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp
Script: -- : 'RUN: at line 3'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang --driver-mode=g++ -m64 -gline-tables-only -fsanitize=hwaddress -fuse-ld=lld -mcmodel=large -mllvm -hwasan-globals -mllvm -hwasan-use-short-granules -mllvm -hwasan-instrument-landing-pads=0 -mllvm -hwasan-instrument-personality-functions /mnt/disks/ssd0/agent/llvm-project/compiler-rt/test/hwasan/TestCases/sizes.cpp -nostdlib++ -lstdc++ -o /mnt/disks/ssd0/agent/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Output/sizes.cpp.tmp

Event Timeline

Kai created this revision.Oct 2 2020, 8:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 2 2020, 8:50 AM
Kai requested review of this revision.Oct 2 2020, 8:50 AM

Drive by as it's years since I've dealt with any of this:

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Kai added a comment.Oct 2 2020, 11:11 AM

Seems ok to me. Be good to send your RFC as well to clang-dev to get a view into -finput-charset translations as well if we wanted to go there.

Sure, done.

ctetreau added inline comments.

These names leak the iconv dependency, and can be error prone if somebody types them wrong. We should have an enum class for the set of char sets that can be converted. If we keep the iconv dependency, it's trivial to map CharSet::IBM1047 -> "IBM-1047", but the compiler can catch the error if you mistype the enum name.


If we replace the name strings with enums, then it should be possible to make this function total and remove the ErrorOr


If I'm understanding this correctly, having iconv provides the possibility of supporting conversions other than to and from ascii, utf8, and ebcdic? I'm concerned that this is going to create a ton of bug reports of the form "CharSetConverter::create returned an error on my machine, but not my coworker's machine!" which will be closed as "operator error, install iconv". I feel like there should be a set of conversions supported by CharSetConverter, and they should work regardless of the presense of iconv.

From messages I've seen on the mailing lists, it sounds like there is license uncertainty with linking iconv. Maybe it's best to just not have this?

ThePhD added a subscriber: ThePhD.Oct 24 2020, 6:23 PM

Minor addition for this change; if this does go through, llvm-specific macros in C-like languages should expose the chosen name. I submitted a patch and a feature request to MSVC and GCC for their own implementation-specific macros as well:

It doesn't need to be synchronized with any other compiler's name or exposed feature: it just needs to exist. This allows folks interested in helping people write portable code that uses the execution character set preserve the invariants of their code by allowing them to inspect the name and act meaningfully for names they recognize.

Seems reasonable to me.


Kai updated this revision to Diff 302822.Nov 4 2020, 5:28 AM

Sorry for taking so long - I was on vacation.

  • Added an enumeration for the basic charsets
  • Added an named constructor just for the basic charsets
  • Updated the test cases

I like the addition of the enumeration. It simplifies the source in some parts.
However, I did not drop the function which uses iconv for conversion. This will be needed for some further extensions.

fanbo-meng added inline comments.

The naming here with "UTF" is ambiguous as it can mean UTF8, UTF16, UTF32. Using UTF8 with the enum names here would be better.

Drive-by comment as I'm reviewing a downstream patch - I think it's preferable to use llvm::Error and llvm::Expected instead of std::error_code, for reporting errors, as it allows you to provide more contextual information (e.g. why the conversion failed, which characters were invalid etc).

The CI on Windows is failing because it seems that the iconv header exists but the path is unknown. Is it possible to set a variable to Iconv_INCLUDE_DIRS and use that instead of writing out <iconv.h> ? I'm seeing this failure on the CI on my fexec-charset patch as well.