This is an archive of the discontinued LLVM Phabricator instance.

Adding iconv support to CharSetConverter class
Needs ReviewPublic

Authored by abhina.sreeskantharajan on Jun 21 2023, 6:19 AM.

Details

Summary

This patch adds iconv support to the CharSetConverter class proposed here https://reviews.llvm.org/D153417.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2023, 6:19 AM
abhina.sreeskantharajan requested review of this revision.Jun 21 2023, 6:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2023, 6:19 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
michaelplatings added inline comments.
clang/lib/Basic/CMakeLists.txt
62

This doesn't look like an idomatic way to link a library. Could you use target_link_libraries instead?

Use target_link_libraries instead

abhina.sreeskantharajan marked an inline comment as done.Jun 21 2023, 11:38 AM
abhina.sreeskantharajan added inline comments.
clang/lib/Basic/CMakeLists.txt
62

Thanks, I've made this change

This doesn't really address the concerns from https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/17 about consistency. It's bad if different hosts support a different set of charsets/charset names, and it's really bad if a given charset name has different meanings on different hosts.

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

This doesn't really address the concerns from https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/17 about consistency. It's bad if different hosts support a different set of charsets/charset names, and it's really bad if a given charset name has different meanings on different hosts.

I agree, in particular we know that the people most likely to use this feature are windows users, and iconv doesn't do anything for them.

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

Not sure how useful this would be, UTF-16/UTF-32 facilities are used directly when they are needed, and utf-16 source input files are rare.

abhina.sreeskantharajan marked an inline comment as done.Jun 22 2023, 6:32 AM

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms. However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

Even if we do decide we have to use platform-specific facilities because there's no suitable library, I think we should at least have a hardcoded set of encodings we recognize, so we aren't passing arbitrary encoding names directly from the command-line to the iconv() call.

Do you have a list of specific encodings you care about?

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

Not sure how useful this would be, UTF-16/UTF-32 facilities are used directly when they are needed, and utf-16 source input files are rare.

UTF-16 source files see some usage on Windows. Not sure exactly how common it is, but I think certain versions of Visual Studio defaulted to UTF-16... obviously, people who know what they're doing avoid encoding their files that way. I just noted it because some of the unit-tests were using UTF-16/UTF-32.

Even if we do decide we have to use platform-specific facilities because there's no suitable library, I think we should at least have a hardcoded set of encodings we recognize, so we aren't passing arbitrary encoding names directly from the command-line to the iconv() call.

Do you have a list of specific encodings you care about?

I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour. Unfortunately, I don't think we are able to provide a hardcoded list because the locales installed can differ between machines and the only thing we can guarantee is -fexec-charset=utf-8 is always supported.

-fexec-charset=charset
Set the execution character set, used for string and character constants. The
default is UTF-8. charset can be any encoding supported by the system’s iconv
library routine.

I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour

"gcc did it" doesn't mean the issues raised don't exist; it just means the gcc developers care less about those issues (and they use GNU iconv on Windows, which we can't do).

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

FYI the primary discussion on this topic is here now, https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512/1

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

We currently have no plan or resources allocated towards porting ICU on z/OS. Our users also rely on iconv for the system locales, but (and please correct me if I'm wrong) it seems like ICU does not use system locales so this may not meet our users' needs. So we would still prefer to have iconv support available at the very least for z/OS, even if ICU is the preferred default.

I'll also post the same reply on the RFC so we can continue the discussion there

michaelplatings resigned from this revision.Jul 7 2023, 9:14 AM