This patch adds iconv support to the CharSetConverter class proposed here https://reviews.llvm.org/D153417.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
clang/lib/Basic/CMakeLists.txt | ||
---|---|---|
62 | This doesn't look like an idomatic way to link a library. Could you use target_link_libraries instead? |
clang/lib/Basic/CMakeLists.txt | ||
---|---|---|
62 | Thanks, I've made this change |
This doesn't really address the concerns from https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/17 about consistency. It's bad if different hosts support a different set of charsets/charset names, and it's really bad if a given charset name has different meanings on different hosts.
Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?
I agree, in particular we know that the people most likely to use this feature are windows users, and iconv doesn't do anything for them.
Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?
Not sure how useful this would be, UTF-16/UTF-32 facilities are used directly when they are needed, and utf-16 source input files are rare.
Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms. However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?
Even if we do decide we have to use platform-specific facilities because there's no suitable library, I think we should at least have a hardcoded set of encodings we recognize, so we aren't passing arbitrary encoding names directly from the command-line to the iconv() call.
Do you have a list of specific encodings you care about?
UTF-16 source files see some usage on Windows. Not sure exactly how common it is, but I think certain versions of Visual Studio defaulted to UTF-16... obviously, people who know what they're doing avoid encoding their files that way. I just noted it because some of the unit-tests were using UTF-16/UTF-32.
I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour. Unfortunately, I don't think we are able to provide a hardcoded list because the locales installed can differ between machines and the only thing we can guarantee is -fexec-charset=utf-8 is always supported.
-fexec-charset=charset Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system’s iconv library routine.
I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour
"gcc did it" doesn't mean the issues raised don't exist; it just means the gcc developers care less about those issues (and they use GNU iconv on Windows, which we can't do).
Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.
I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.
However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?
ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).
FYI the primary discussion on this topic is here now, https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512/1
We currently have no plan or resources allocated towards porting ICU on z/OS. Our users also rely on iconv for the system locales, but (and please correct me if I'm wrong) it seems like ICU does not use system locales so this may not meet our users' needs. So we would still prefer to have iconv support available at the very least for z/OS, even if ICU is the preferred default.
I'll also post the same reply on the RFC so we can continue the discussion there
This doesn't look like an idomatic way to link a library. Could you use target_link_libraries instead?