This is an archive of the discontinued LLVM Phabricator instance.

Adding iconv support to CharSetConverter class
Needs Review · Public

Authored by abhina.sreeskantharajan on Jun 21 2023, 6:19 AM.

Details

Reviewers

aaron.ballman
Everybody0523
tahonermann
nikic
cor3ntin
jcranmer
efriedma
michaelplatings

Summary

This patch adds iconv support to the CharSetConverter class proposed here https://reviews.llvm.org/D153417.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

abhina.sreeskantharajan created this revision. · Jun 21 2023, 6:19 AM

Herald added a project: Restricted Project. · Jun 21 2023, 6:19 AM

abhina.sreeskantharajan requested review of this revision. · Jun 21 2023, 6:19 AM

Herald added a project: Restricted Project. · Jun 21 2023, 6:19 AM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B240225: Diff 533236. · Jun 21 2023, 6:20 AM

abhina.sreeskantharajan added a parent revision: D153417: New CharSetConverter wrapper class for ConverterEBCDIC. · Jun 21 2023, 6:20 AM

abhina.sreeskantharajan mentioned this in D153417: New CharSetConverter wrapper class for ConverterEBCDIC. · Jun 21 2023, 6:42 AM

abhina.sreeskantharajan edited the summary of this revision.

abhina.sreeskantharajan added reviewers: aaron.ballman, Everybody0523, tahonermann, nikic, cor3ntin, jcranmer.

Herald added a subscriber: StephenFan. · Jun 21 2023, 6:44 AM

michaelplatings added a subscriber: michaelplatings. · Jun 21 2023, 8:00 AM

michaelplatings added inline comments.

clang/lib/Basic/CMakeLists.txt
62	This doesn't look like an idiomatic way to link a library. Could you use target_link_libraries instead?

abhina.sreeskantharajan marked an inline comment as done. · Jun 21 2023, 11:38 AM

abhina.sreeskantharajan added inline comments.

clang/lib/Basic/CMakeLists.txt
62	Thanks, I've made this change

This doesn't really address the concerns from https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/17 about consistency. It's bad if different hosts support a different set of charsets/charset names, and it's really bad if a given charset name has different meanings on different hosts.

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

Harbormaster completed remote builds in B240316: Diff 533352. · Jun 21 2023, 4:43 PM

In D153418#4438766, @efriedma wrote:

This doesn't really address the concerns from https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795/17 about consistency. It's bad if different hosts support a different set of charsets/charset names, and it's really bad if a given charset name has different meanings on different hosts.

I agree. In particular, we know that the people most likely to use this feature are Windows users, and iconv doesn't do anything for them.

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

Not sure how useful this would be, UTF-16/UTF-32 facilities are used directly when they are needed, and utf-16 source input files are rare.

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms. However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

Even if we do decide we have to use platform-specific facilities because there's no suitable library, I think we should at least have a hardcoded set of encodings we recognize, so we aren't passing arbitrary encoding names directly from the command-line to the iconv() call.

Do you have a list of specific encodings you care about?
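One way to picture the "hardcoded set of encodings" suggestion is sketched below. It normalizes a name with the UTS #22 charset alias matching rules (the same algorithm the patch's normalizeCharSetName cites) and then checks it against a fixed table. The helper names normalizeName and isRecognizedEncoding and the three table entries are hypothetical, chosen only for illustration; nothing in the patch defines them.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: UTS #22 alias matching. Keep only ASCII letters and
// digits, lowercase the letters, and drop each '0' not preceded by a digit.
std::string normalizeName(const std::string &Name) {
  std::string Out;
  bool PrevDigit = false;
  for (char Ch : Name) {
    const bool IsDigit = Ch >= '0' && Ch <= '9';
    const bool IsUpper = Ch >= 'A' && Ch <= 'Z';
    const bool IsLower = Ch >= 'a' && Ch <= 'z';
    if (!IsDigit && !IsUpper && !IsLower)
      continue; // punctuation and non-ASCII bytes are ignored
    if (IsUpper)
      Ch = static_cast<char>(Ch - 'A' + 'a');
    if (Ch != '0' || PrevDigit) {
      PrevDigit = IsDigit;
      Out.push_back(Ch);
    }
  }
  return Out;
}

// Hypothetical allowlist check: only names in this table would ever be
// forwarded to iconv_open(). The entries are illustrative assumptions.
bool isRecognizedEncoding(const std::string &Name) {
  static const char *const Known[] = {"utf8", "ibm1047", "iso88591"};
  const std::string Normalized = normalizeName(Name);
  assert(Normalized.size() <= Name.size() && "normalization only deletes");
  for (const char *K : Known)
    if (Normalized == K)
      return true;
  return false;
}
```

With a table like this, isRecognizedEncoding("IBM-1047") and isRecognizedEncoding("ibm01047") both succeed, while a name outside the table is rejected before it reaches the host's iconv.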

In D153418#4440420, @cor3ntin wrote:

In D153418#4438766, @efriedma wrote:

Can we use the existing conversion utilities in LLVM for UTF-16/UTF-32?

Not sure how useful this would be, UTF-16/UTF-32 facilities are used directly when they are needed, and utf-16 source input files are rare.

UTF-16 source files see some usage on Windows. Not sure exactly how common it is, but I think certain versions of Visual Studio defaulted to UTF-16... obviously, people who know what they're doing avoid encoding their files that way. I just noted it because some of the unit-tests were using UTF-16/UTF-32.

In D153418#4441603, @efriedma wrote:

Even if we do decide we have to use platform-specific facilities because there's no suitable library, I think we should at least have a hardcoded set of encodings we recognize, so we aren't passing arbitrary encoding names directly from the command-line to the iconv() call.

Do you have a list of specific encodings you care about?

I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour. Unfortunately, I don't think we are able to provide a hardcoded list because the locales installed can differ between machines and the only thing we can guarantee is -fexec-charset=utf-8 is always supported.

-fexec-charset=charset
Set the execution character set, used for string and character constants. The
default is UTF-8. charset can be any encoding supported by the system’s iconv
library routine.

abhina.sreeskantharajan added a reviewer: efriedma. · Jun 22 2023, 10:45 AM

I think this is reasonable since gcc's fexec-charset option also says the name can be any encoding supported by the iconv library (copy pasted below from https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc.pdf) so this would match that behaviour

"gcc did it" doesn't mean the issues raised don't exist; it just means the gcc developers care less about those issues (and they use GNU iconv on Windows, which we can't do).

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

In D153418#4478766, @tahonermann wrote:

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

FYI the primary discussion on this topic is here now, https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512/1

michaelplatings removed a subscriber: michaelplatings. · Jul 7 2023, 12:16 AM

In D153418#4478766, @tahonermann wrote:

Please correct me if I'm wrong, I'm not too familiar with icu4c, but I think adding support for ICU would be the better long-term solution since it seems to allow the same behaviour across different platforms.

I tend to agree. Additionally, as more Unicode support is added to C++, the likelihood that we'll want to use other functionality from ICU increases.

However, the issue on the z/OS platform is that there currently isn't support for this library so iconv seems to be the only solution we can use until we do get support. So would an alternative be to use iconv only on z/OS (and hopefully this is a temporary solution until icu is supported on z/OS) and use icu on all other platforms?

ICU isn't supported on z/OS because the historical z/OS compiler (xlC) never gained support for C++11 or later so support for z/OS was dropped when ICU moved to C++11. Now that IBM has embraced LLVM and Clang, I would expect it to be possible to build ICU for z/OS again with moderate porting effort. It would be great if someone from IBM could confirm whether such an effort is underway (@hubert.reinterpretcast?).

We currently have no plan or resources allocated towards porting ICU on z/OS. Our users also rely on iconv for the system locales, but (and please correct me if I'm wrong) it seems like ICU does not use system locales so this may not meet our users' needs. So we would still prefer to have iconv support available at the very least for z/OS, even if ICU is the preferred default.

I'll also post the same reply on the RFC so we can continue the discussion there

michaelplatings resigned from this revision. · Jul 7 2023, 9:14 AM

Revision Contents

Path                                         Size
clang/include/clang/Basic/CharSet.h          7 lines
clang/include/clang/Config/config.h.cmake    3 lines
clang/lib/Basic/CMakeLists.txt               14 lines
clang/lib/Basic/CharSet.cpp                  138 lines
clang/unittests/Basic/CharSetTest.cpp        173 lines

Diff 533352

clang/include/clang/Basic/CharSet.h

[... 45 lines hidden ...]	public:
/// - std::errc::illegal_byte_sequence: The input contains an invalid		/// - std::errc::illegal_byte_sequence: The input contains an invalid
/// multibyte sequence.		/// multibyte sequence.
/// - std::errc::invalid_argument: The input contains an incomplete		/// - std::errc::invalid_argument: The input contains an incomplete
/// multibyte sequence.		/// multibyte sequence.
///		///
/// In case of an error, the result string contains the successfully converted		/// In case of an error, the result string contains the successfully converted
/// part of the input string.		/// part of the input string.
///		///
		/// If the Source parameter has a zero length, then no conversion is
		/// performed. Instead, the internal conversion state of iconv is reset to
		/// the initial state if iconv is used for the conversion. Otherwise it is a
		/// no-op.
virtual std::error_code convert(StringRef Source,		virtual std::error_code convert(StringRef Source,
SmallVectorImpl<char> &Result,		SmallVectorImpl<char> &Result,
bool ShouldAutoFlush) const = 0;		bool ShouldAutoFlush) const = 0;

/// Restore the conversion to the original state.		/// Restore the conversion to the original state.
/// \return error code in case something went wrong		/// \return error code in case something went wrong
///		///
/// If the original character set or the destination character set		/// If the original character set or the destination character set
[... 13 lines hidden ...]	enum class id {

/// IBM EBCDIC 1047 character set encoding.		/// IBM EBCDIC 1047 character set encoding.
IBM1047		IBM1047
};		};
} // end namespace text_encoding		} // end namespace text_encoding

/// Utility class to convert between different character set encodings.		/// Utility class to convert between different character set encodings.
/// The class always supports converting between EBCDIC 1047 and Latin-1/UTF-8.		/// The class always supports converting between EBCDIC 1047 and Latin-1/UTF-8.
		/// If the iconv library is available, then arbitrary conversions are supported.
		/// TODO Add Windows support.
class CharSetConverter {		class CharSetConverter {
// details::CharSetConverterImplBase *Converter;		// details::CharSetConverterImplBase *Converter;
std::unique_ptr<details::CharSetConverterImplBase> Converter;		std::unique_ptr<details::CharSetConverterImplBase> Converter;

CharSetConverter(std::unique_ptr<details::CharSetConverterImplBase> Converter)		CharSetConverter(std::unique_ptr<details::CharSetConverterImplBase> Converter)
: Converter(std::move(Converter)) {}		: Converter(std::move(Converter)) {}

public:		public:
[... 69 lines hidden ...]
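The error codes and the reset behaviour documented for convert() correspond to plain iconv() semantics: EILSEQ for an invalid sequence, EINVAL for an incomplete one, and a call with null buffers to return the descriptor to its initial state. The sketch below illustrates this outside of LLVM. The helpers convertOnce and resetState are hypothetical, and the target name "UCS-4LE" is an assumption picked because common iconv implementations accept it.

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <string>

// Hypothetical helper: runs one iconv() call over Source and returns the
// errno value on failure, or 0 on success. Result holds the converted bytes.
int convertOnce(iconv_t CD, const std::string &Source, std::string &Result) {
  assert(CD != (iconv_t)-1 && "descriptor must be open");
  char *In = const_cast<char *>(Source.data());
  size_t InLeft = Source.size();
  // Four output bytes per input byte is ample for UTF-8 to UCS-4.
  Result.assign(4 * Source.size() + 4, '\0');
  char *Out = &Result[0];
  size_t OutLeft = Result.size();
  const size_t Ret = iconv(CD, &In, &InLeft, &Out, &OutLeft);
  const int Err = (Ret == (size_t)-1) ? errno : 0;
  Result.resize(Result.size() - OutLeft);
  return Err;
}

// Hypothetical helper: resets the descriptor to its initial conversion state
// without producing output, which is what a zero-length Source triggers in
// the documented convert() contract.
int resetState(iconv_t CD) {
  return iconv(CD, nullptr, nullptr, nullptr, nullptr) == (size_t)-1 ? errno
                                                                      : 0;
}
```

A caller would typically open the descriptor with iconv_open("UCS-4LE", "UTF-8"), treat EINVAL from convertOnce as truncated input, and call resetState before reusing the descriptor for unrelated text.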

clang/include/clang/Config/config.h.cmake

[... 51 lines hidden ...]
	#define GCC_INSTALL_PREFIX "${GCC_INSTALL_PREFIX}"			#define GCC_INSTALL_PREFIX "${GCC_INSTALL_PREFIX}"

	/* Define if we have libxml2 */			/* Define if we have libxml2 */
	#cmakedefine CLANG_HAVE_LIBXML ${CLANG_HAVE_LIBXML}			#cmakedefine CLANG_HAVE_LIBXML ${CLANG_HAVE_LIBXML}

	/* Define if we have sys/resource.h (rlimits) */			/* Define if we have sys/resource.h (rlimits) */
	#cmakedefine CLANG_HAVE_RLIMITS ${CLANG_HAVE_RLIMITS}			#cmakedefine CLANG_HAVE_RLIMITS ${CLANG_HAVE_RLIMITS}

				/* Define if iconv library is available */
				#cmakedefine HAVE_ICONV ${HAVE_ICONV}

	/* Linker version detected at compile time. */			/* Linker version detected at compile time. */
	#cmakedefine HOST_LINK_VERSION "${HOST_LINK_VERSION}"			#cmakedefine HOST_LINK_VERSION "${HOST_LINK_VERSION}"

	/* pass --build-id to ld */			/* pass --build-id to ld */
	#cmakedefine ENABLE_LINKER_BUILD_ID			#cmakedefine ENABLE_LINKER_BUILD_ID

	/* enable x86 relax relocations by default */			/* enable x86 relax relocations by default */
	#cmakedefine01 ENABLE_X86_RELAX_RELOCATIONS			#cmakedefine01 ENABLE_X86_RELAX_RELOCATIONS
[... 13 lines hidden ...]

clang/lib/Basic/CMakeLists.txt

[... 45 lines hidden ...]	set_source_files_properties("${version_inc}"
PROPERTIES GENERATED TRUE		PROPERTIES GENERATED TRUE
HEADER_FILE_ONLY TRUE)		HEADER_FILE_ONLY TRUE)

if(CLANG_VENDOR)		if(CLANG_VENDOR)
set_source_files_properties(Version.cpp		set_source_files_properties(Version.cpp
PROPERTIES COMPILE_DEFINITIONS "CLANG_VENDOR=\"${CLANG_VENDOR} \"")		PROPERTIES COMPILE_DEFINITIONS "CLANG_VENDOR=\"${CLANG_VENDOR} \"")
endif()		endif()

		# Link iconv library if it is an external library.
		find_package(Iconv)
		if(Iconv_FOUND)
		set(HAVE_ICONV 1)
		else()
		set(HAVE_ICONV 0)
		endif()
		if(Iconv_FOUND AND NOT Iconv_IS_BUILT_IN)
		target_link_libraries(clangBasic
		PRIVATE
		${Iconv_LIBRARIES}
		)
		endif()

add_clang_library(clangBasic		add_clang_library(clangBasic
Attributes.cpp		Attributes.cpp
Builtins.cpp		Builtins.cpp
CLWarnings.cpp		CLWarnings.cpp
CharInfo.cpp		CharInfo.cpp
CharSet.cpp		CharSet.cpp
CodeGenOptions.cpp		CodeGenOptions.cpp
Cuda.cpp		Cuda.cpp
[... 74 lines hidden ...]

clang/lib/Basic/CharSet.cpp

[... 16 lines hidden ...]
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/Support/ConvertEBCDIC.h"		#include "llvm/Support/ConvertEBCDIC.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include <algorithm>		#include <algorithm>
#include <limits>		#include <limits>
#include <system_error>		#include <system_error>

		#ifdef HAVE_ICONV
		#include <iconv.h>
		#endif

using namespace llvm;		using namespace llvm;

// Normalize the charset name with the charset alias matching algorithm proposed		// Normalize the charset name with the charset alias matching algorithm proposed
// in https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching.		// in https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching.
void normalizeCharSetName(StringRef CSName, SmallVectorImpl<char> &Normalized) {		void normalizeCharSetName(StringRef CSName, SmallVectorImpl<char> &Normalized) {
bool PrevDigit = false;		bool PrevDigit = false;
for (auto Ch : CSName) {		for (auto Ch : CSName) {
if (isAlnum(Ch)) {		if (isAlnum(Ch)) {
[... 59 lines hidden ...]	std::error_code CharSetConverterTable::flush() const {
return std::error_code();		return std::error_code();
}		}

std::error_code		std::error_code
CharSetConverterTable::flush(SmallVectorImpl<char> &Result) const {		CharSetConverterTable::flush(SmallVectorImpl<char> &Result) const {
return std::error_code();		return std::error_code();
}		}

		#ifdef HAVE_ICONV
		class CharSetConverterIconv : public details::CharSetConverterImplBase {
		iconv_t ConvDesc;

		public:
		CharSetConverterIconv(iconv_t ConvDesc) : ConvDesc(ConvDesc) {}

		std::error_code convert(StringRef Source, SmallVectorImpl<char> &Result,
		bool ShouldAutoFlush) const override;
		std::error_code flush() const override;
		std::error_code flush(SmallVectorImpl<char> &Result) const override;
		};

		std::error_code CharSetConverterIconv::convert(StringRef Source,
		SmallVectorImpl<char> &Result,
		bool ShouldAutoFlush) const {
		// Setup the input. Use nullptr to reset iconv state if input length is zero.
		size_t InputLength = Source.size();
		char *Input = InputLength ? const_cast<char *>(Source.data()) : nullptr;
		// Setup the output. We directly write into the SmallVector.
		size_t Capacity = Result.capacity();
		Result.resize_for_overwrite(Capacity);
		char *Output = InputLength ? static_cast<char *>(Result.data()) : nullptr;
		size_t OutputLength = Capacity;

		size_t Ret;

		// Handle errors returned from iconv().
		auto HandleError = [&Capacity, &Output, &OutputLength, &Result](size_t Ret) {
		if (Ret == static_cast<size_t>(-1)) {
		// An error occurred. Check if we can gracefully handle it.
		if (errno == E2BIG && Capacity < std::numeric_limits<size_t>::max()) {
		// No space left in output buffer. Double the size of the underlying
		// memory in the SmallVectorImpl, adjust pointer and length and continue
		// the conversion.
		const size_t Used = Capacity - OutputLength;
		Capacity = (Capacity < std::numeric_limits<size_t>::max() / 2)
		? 2 * Capacity
		: std::numeric_limits<size_t>::max();
		Result.resize_for_overwrite(Capacity);
		Output = static_cast<char *>(Result.data()) + Used;
		OutputLength = Capacity - Used;
		return std::error_code();
		} else {
		// Some other error occurred.
		return std::error_code(errno, std::generic_category());
		}
		} else {
		// A positive return value indicates that some characters were converted
		// in a nonreversible way, that is, replaced with a SUB symbol. Returning
		// an error in this case makes sure that both conversion routines behave
		// in the same way.
		return std::make_error_code(std::errc::illegal_byte_sequence);
		}
		};

		// Convert the string.
		while ((Ret = iconv(ConvDesc, &Input, &InputLength, &Output, &OutputLength)))
		if (auto EC = HandleError(Ret))
		return EC;
		if (ShouldAutoFlush) {
		while ((Ret = iconv(ConvDesc, nullptr, nullptr, &Output, &OutputLength)))
		if (auto EC = HandleError(Ret))
		return EC;
		}

		// Re-adjust size to actual size.
		Result.resize(Capacity - OutputLength);
		return std::error_code();
		}

		std::error_code CharSetConverterIconv::flush() const {
		size_t Ret = iconv(ConvDesc, nullptr, nullptr, nullptr, nullptr);
		if (Ret == static_cast<size_t>(-1)) {
		return std::error_code(errno, std::generic_category());
		}
		return std::error_code();
		}

		std::error_code
		CharSetConverterIconv::flush(SmallVectorImpl<char> &Result) const {
		char *Output = Result.data();
		size_t OutputLength = Result.capacity();
		size_t Capacity = Result.capacity();
		Result.resize_for_overwrite(Capacity);

		// Handle errors returned from iconv().
		auto HandleError = [&Capacity, &Output, &OutputLength, &Result](size_t Ret) {
		if (Ret == static_cast<size_t>(-1)) {
		// An error occurred. Check if we can gracefully handle it.
		if (errno == E2BIG && Capacity < std::numeric_limits<size_t>::max()) {
		// No space left in output buffer. Increase the size of the underlying
		// memory in the SmallVectorImpl by 2 bytes, adjust pointer and length
		// and continue the conversion.
		const size_t Used = Capacity - OutputLength;
		Capacity = (Capacity < std::numeric_limits<size_t>::max() - 2)
		? 2 + Capacity
		: std::numeric_limits<size_t>::max();
		Result.resize_for_overwrite(Capacity);
		Output = static_cast<char *>(Result.data()) + Used;
		OutputLength = Capacity - Used;
		return std::error_code();
		} else {
		// Some other error occurred.
		return std::error_code(errno, std::generic_category());
		}
		} else {
		// A positive return value indicates that some characters were converted
		// in a nonreversible way, that is, replaced with a SUB symbol. Returning
		// an error in this case makes sure that both conversion routines behave
		// in the same way.
		return std::make_error_code(std::errc::illegal_byte_sequence);
		}
		};

		size_t Ret;
		while ((Ret = iconv(ConvDesc, nullptr, nullptr, &Output, &OutputLength)))
		if (auto EC = HandleError(Ret))
		return EC;

		// Re-adjust size to actual size.
		Result.resize(Capacity - OutputLength);
		return std::error_code();
		}

		#endif // HAVE_ICONV
} // namespace		} // namespace

CharSetConverter CharSetConverter::create(text_encoding::id CPFrom,		CharSetConverter CharSetConverter::create(text_encoding::id CPFrom,
text_encoding::id CPTo) {		text_encoding::id CPTo) {

assert(CPFrom != CPTo && "Text encodings should be distinct");		assert(CPFrom != CPTo && "Text encodings should be distinct");

ConversionType Conversion;		ConversionType Conversion;
if (CPFrom == text_encoding::id::UTF8 && CPTo == text_encoding::id::IBM1047)		if (CPFrom == text_encoding::id::UTF8 && CPTo == text_encoding::id::IBM1047)
Conversion = UTFToIBM1047;		Conversion = UTFToIBM1047;
else		else
Conversion = IBM1047ToUTF;		Conversion = IBM1047ToUTF;
std::unique_ptr<details::CharSetConverterImplBase> Converter =		std::unique_ptr<details::CharSetConverterImplBase> Converter =
std::make_unique<CharSetConverterTable>(Conversion);		std::make_unique<CharSetConverterTable>(Conversion);
return CharSetConverter(std::move(Converter));		return CharSetConverter(std::move(Converter));
}		}

ErrorOr<CharSetConverter> CharSetConverter::create(StringRef CSFrom,		ErrorOr<CharSetConverter> CharSetConverter::create(StringRef CSFrom,
StringRef CSTo) {		StringRef CSTo) {
std::optional<text_encoding::id> From = getKnownCharSet(CSFrom);		std::optional<text_encoding::id> From = getKnownCharSet(CSFrom);
std::optional<text_encoding::id> To = getKnownCharSet(CSTo);		std::optional<text_encoding::id> To = getKnownCharSet(CSTo);
if (From && To)		if (From && To)
return create(*From, *To);		return create(*From, *To);
		#if HAVE_ICONV
		iconv_t ConvDesc = iconv_open(CSTo.str().c_str(), CSFrom.str().c_str());
		if (ConvDesc == (iconv_t)-1)
		return std::error_code(errno, std::generic_category());
		std::unique_ptr<details::CharSetConverterImplBase> Converter =
		std::make_unique<CharSetConverterIconv>(ConvDesc);
		return CharSetConverter(std::move(Converter));
		#endif
return std::make_error_code(std::errc::invalid_argument);		return std::make_error_code(std::errc::invalid_argument);
}		}

clang/unittests/Basic/CharSetTest.cpp

[... 34 lines hidden ...]	static const char AccentUTF[] =
"\xc3\x8a\x61\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\x65\xc3\xa8\xc3\xa9"		"\xc3\x8a\x61\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\x65\xc3\xa8\xc3\xa9"
"\xc3\xaa\xc3\xab";		"\xc3\xaa\xc3\xab";
static const char AccentE[] = "\xaa\x4a\xb1\xc1\x63\x67\x9e\xc5\x74\x71\x72"		static const char AccentE[] = "\xaa\x4a\xb1\xc1\x63\x67\x9e\xc5\x74\x71\x72"
"\x81\x44\x45\x42\x46\x43\x85\x54\x51\x52\x53";		"\x81\x44\x45\x42\x46\x43\x85\x54\x51\x52\x53";

// String with Cyrillic character ya.		// String with Cyrillic character ya.
static const char CyrillicUTF[] = "\xd0\xaf";		static const char CyrillicUTF[] = "\xd0\xaf";

		// String "Earth地球".
		// ISO-2022-JP: Sequence ESC $ B (\x1B\x24\x42) switches to JIS X 0208-1983, and
		// sequence ESC ( B (\x1B\x28\x42) switches back to ASCII.
		// IBM-939: Byte 0x0E shifts from single byte to double byte, and 0x0F shifts
		// back.
		static const char EarthUTF[] = "\x45\x61\x72\x74\x68\xe5\x9c\xb0\xe7\x90\x83";
		// Identical to above, except the final character (球) has its last byte taken
		// away from it.
		static const char EarthUTFBroken[] = "\x45\x61\x72\x74\x68\xe5\x9c\xb0\xe7\x90";
		static const char EarthISO2022[] =
		"\x45\x61\x72\x74\x68\x1B\x24\x42\x43\x4F\x35\x65\x1B\x28\x42";
		static const char EarthISO2022ShiftBack[] =
		"\x45\x61\x72\x74\x68\x1B\x24\x42\x43\x4F\x35\x65";
		static const char EarthIBM939[] =
		"\xc5\x81\x99\xa3\x88\x0e\x45\xc2\x48\xdb\x0f";
		static const char ShiftBackOnly[] = "\x1B\x28\x42";

		// String "地球".
		static const char EarthKanjiOnlyUTF[] = "\xe5\x9c\xb0\xe7\x90\x83";
		static const char EarthKanjiOnlyISO2022[] =
		"\x1B\x24\x42\x43\x4F\x35\x65\x1b\x28\x42";
		static const char EarthKanjiOnlyIBM939[] = "\x0e\x45\xc2\x48\xdb\x0f";

TEST(CharSet, FromUTF8) {		TEST(CharSet, FromUTF8) {
// Hello string.		// Hello string.
StringRef Src(HelloA);		StringRef Src(HelloA);
SmallString<64> Dst;		SmallString<64> Dst;

CharSetConverter Conv = CharSetConverter::create(text_encoding::id::UTF8,		CharSetConverter Conv = CharSetConverter::create(text_encoding::id::UTF8,
text_encoding::id::IBM1047);		text_encoding::id::IBM1047);
std::error_code EC = Conv.convert(Src, Dst, true);		std::error_code EC = Conv.convert(Src, Dst, true);
[... 42 lines hidden ...]	TEST(CharSet, ToUTF8) {

// Accent string.		// Accent string.
Src = AccentE;		Src = AccentE;
EC = Conv.convert(Src, Dst, true);		EC = Conv.convert(Src, Dst, true);
EXPECT_TRUE(!EC);		EXPECT_TRUE(!EC);
EXPECT_STREQ(AccentUTF, static_cast<std::string>(Dst).c_str());		EXPECT_STREQ(AccentUTF, static_cast<std::string>(Dst).c_str());
}		}

		TEST(CharSet, RoundTrip) {
		ErrorOr<CharSetConverter> ConvToUTF16 =
		CharSetConverter::create("IBM-1047", "UTF-16");
		// Stop test if conversion is not supported (no underlying iconv support).
		if (!ConvToUTF16) {
		ASSERT_EQ(ConvToUTF16.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}
		ErrorOr<CharSetConverter> ConvToUTF32 =
		CharSetConverter::create("UTF-16", "UTF-32");
		// Stop test if conversion is not supported (no underlying iconv support).
		if (!ConvToUTF32) {
		ASSERT_EQ(ConvToUTF32.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}
		ErrorOr<CharSetConverter> ConvToEBCDIC =
		CharSetConverter::create("UTF-32", "IBM-1047");
		// Stop test if conversion is not supported (no underlying iconv support).
		if (!ConvToEBCDIC) {
		ASSERT_EQ(ConvToEBCDIC.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		// Setup source string.
		char SrcStr[256];
		for (size_t I = 0; I < 256; ++I)
		SrcStr[I] = (I + 1) % 256;

		SmallString<99> Dst1Str, Dst2Str, Dst3Str;

		std::error_code EC = ConvToUTF16->convert(StringRef(SrcStr), Dst1Str, true);
		EXPECT_TRUE(!EC);
		EC = ConvToUTF32->convert(Dst1Str, Dst2Str, true);
		EXPECT_TRUE(!EC);
		EC = ConvToEBCDIC->convert(Dst2Str, Dst3Str, true);
		EXPECT_TRUE(!EC);
		EXPECT_STREQ(SrcStr, static_cast<std::string>(Dst3Str).c_str());
		}

		TEST(CharSet, ShiftState2022) {
		// Earth string.
		StringRef Src(EarthUTF);
		SmallString<64> Dst;

		ErrorOr<CharSetConverter> ConvTo2022 =
		CharSetConverter::create("UTF-8", "ISO-2022-JP");
		// Stop test if conversion is not supported (no underlying iconv support).
		if (!ConvTo2022) {
		ASSERT_EQ(ConvTo2022.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		// Check that the string is properly converted.
		std::error_code EC = ConvTo2022->convert(Src, Dst, true);
		EXPECT_TRUE(!EC);
		EXPECT_STREQ(EarthISO2022, static_cast<std::string>(Dst).c_str());
		}

		TEST(CharSet, ShiftState2022Flush) {
		StringRef Src0(EarthUTFBroken);
		StringRef Src1(EarthKanjiOnlyUTF);
		SmallString<64> Dst0;
		SmallString<64> Dst1;
		ErrorOr<CharSetConverter> ConvTo2022Flush =
		CharSetConverter::create("UTF-8", "ISO-2022-JP");
		if (!ConvTo2022Flush) {
		ASSERT_EQ(ConvTo2022Flush.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		// This should emit an error; there is a malformed multibyte character in the
		// input string.
		std::error_code EC0 = ConvTo2022Flush->convert(Src0, Dst0, true);
		EXPECT_TRUE(EC0);
		std::error_code EC1 = ConvTo2022Flush->flush();
		EXPECT_TRUE(!EC1);
		std::error_code EC2 = ConvTo2022Flush->convert(Src1, Dst1, true);
		EXPECT_TRUE(!EC2);
		EXPECT_STREQ(EarthKanjiOnlyISO2022, static_cast<std::string>(Dst1).c_str());
		}

		TEST(CharSet, ShiftStateIBM939) {
		// Earth string.
		StringRef Src(EarthUTF);
		SmallString<64> Dst;

		ErrorOr<CharSetConverter> ConvToIBM939 =
		CharSetConverter::create("UTF-8", "IBM-939");
		// Stop test if conversion is not supported (no underlying iconv support).
		if (!ConvToIBM939) {
		ASSERT_EQ(ConvToIBM939.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		// Check that the string is properly converted.
		std::error_code EC = ConvToIBM939->convert(Src, Dst, true);
		EXPECT_TRUE(!EC);
		EXPECT_STREQ(EarthIBM939, static_cast<std::string>(Dst).c_str());
		}

		TEST(CharSet, ShiftStateIBM939Flush) {
		StringRef Src0(EarthUTFBroken);
		StringRef Src1(EarthKanjiOnlyUTF);
		SmallString<64> Dst0;
		SmallString<64> Dst1;
		ErrorOr<CharSetConverter> ConvTo939Flush =
		CharSetConverter::create("UTF-8", "IBM-939");
		if (!ConvTo939Flush) {
		ASSERT_EQ(ConvTo939Flush.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		// This should emit an error; there is a malformed multibyte character in the
		// input string.
		std::error_code EC0 = ConvTo939Flush->convert(Src0, Dst0, true);
		EXPECT_TRUE(EC0);
		std::error_code EC1 = ConvTo939Flush->flush();
		EXPECT_TRUE(!EC1);
		std::error_code EC2 = ConvTo939Flush->convert(Src1, Dst1, true);
		EXPECT_TRUE(!EC2);
		EXPECT_STREQ(EarthKanjiOnlyIBM939, static_cast<std::string>(Dst1).c_str());
		}

		TEST(CharSet, ShiftState2022Flush1) {
		StringRef Src0(EarthUTF);
		SmallString<64> Dst0;
		SmallString<64> Dst1;
		ErrorOr<CharSetConverter> ConvTo2022Flush =
		CharSetConverter::create("UTF-8", "ISO-2022-JP");
		if (!ConvTo2022Flush) {
		ASSERT_EQ(ConvTo2022Flush.getError(),
		std::make_error_code(std::errc::invalid_argument));
		return;
		}

		std::error_code EC0 = ConvTo2022Flush->convert(Src0, Dst0, false);
		EXPECT_TRUE(!EC0);
		EXPECT_STREQ(EarthISO2022ShiftBack, static_cast<std::string>(Dst0).c_str());
		std::error_code EC1 = ConvTo2022Flush->flush(Dst1);
		EXPECT_TRUE(!EC1);
		EXPECT_STREQ(ShiftBackOnly, static_cast<std::string>(Dst1).c_str());
		}

} // namespace		} // namespace
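The tests above return early when create() fails with invalid_argument. That covers both a build without iconv and a host whose iconv does not know the requested charset, because POSIX specifies EINVAL for an unsupported conversion in iconv_open(). A minimal sketch of that mapping follows. The helper openConverter and the bogus charset name in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <system_error>

// Hypothetical helper: opens a conversion descriptor and reports failure as
// a std::error_code, the same way the patch maps errno after iconv_open().
std::error_code openConverter(const char *From, const char *To, iconv_t &CD) {
  assert(From && To && "encoding names must be non-null");
  CD = iconv_open(To, From); // note the argument order: destination first
  if (CD == (iconv_t)-1)
    return std::error_code(errno, std::generic_category());
  return std::error_code();
}
```

Calling openConverter("UTF-8", "no-such-charset-xyz", CD) is expected to yield an error that compares equal to std::errc::invalid_argument, which is why a test cannot distinguish "iconv is missing" from "this host lacks the charset" by the error code alone.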
