This is an archive of the discontinued LLVM Phabricator instance.

non-Unicode response file support on Windows
AbandonedPublic

Authored by ygao on Jan 22 2015, 6:11 PM.

Details

Reviewers: None

Summary

Hi community,
I have a customer who wants to use Japanese characters in the response
file (Shift-JIS encoded) on Windows. I am attempting to implement it in
lib/Support. It would be very helpful if someone can take a look and give
some feedback.
Thanks in advance,

Diff Detail

Event Timeline

ygao updated this revision to Diff 18647. Jan 22 2015, 6:11 PM

ygao retitled this revision from to non-Unicode response file support on Windows.

ygao updated this object.

ygao edited the test plan for this revision.

ygao added subscribers: Unknown Object (MLST), rnk, majnemer and 3 others.

The main issue, I think, is that there is not a lot of precedent for this on Windows:

GNU tools only use the local codepage, but that is probably a consequence of simplistically using the CRT read function.
Clang supports UTF-16 and UTF-8.
Microsoft's own tools require UTF-16.

The part of this patch that I don't expect to be controversial is skipping the UTF-8 BOM if one is present. Can you move that to an independent patch?

Also, it needs a testcase. Something as simple as running llvm-as @file_with_utf8_bom should do.
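
For readers following the thread, the BOM-skipping step discussed above can be sketched in standalone C++. This is only an illustration under assumptions: skipUTF8BOM is a hypothetical name, and it uses std::string_view rather than the ArrayRef/StringRef types in the patch.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper for illustration; not part of LLVM.
// Returns the input with a leading UTF-8 byte order mark (EF BB BF) removed,
// or the input unchanged when no BOM is present.
std::string_view skipUTF8BOM(std::string_view Bytes) {
  constexpr std::string_view BOM = "\xEF\xBB\xBF";
  // substr clamps the count, so inputs shorter than three bytes are safe.
  if (Bytes.substr(0, BOM.size()) == BOM)
    Bytes.remove_prefix(BOM.size());
  return Bytes;
}
```

The view returned still points into the caller's buffer, so the buffer has to outlive it, much as the StringRef in the patch points into the file's memory buffer.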

Inline comments on llvm/lib/Support/ConvertUTFWrapper.cpp:
90 (hasUTF8ByteOrderMark): This can be a static helper in CommandLine.cpp.
97 (convertANSIToUTF8String): This can be a static helper in CommandLine.cpp.

The main change here is that BOM-less response files on Windows will use the current codepage rather than defaulting to UTF-8. That seems unfortunate, as it does not promote "UTF8 everywhere", but it's probably the more Windows-y thing to do.

I agree with Rafael, this behavior would be out of line with other tools. Perhaps an even stranger side effect of this change would be that we assume the current code page for response files but not for source files, which is an odd exception.

Assuming that the file is in codepage encoding instead of UTF-8 sounds like a step backwards.

ygao mentioned this in D7156: (Part 1/2) non-Unicode response file on Windows: UTF-8 BOM. Jan 23 2015, 3:44 PM

Hi Rafael, Reid, David, thanks for the reviews. I separated the UTF-8 BOM part of the patch into D7156.
It sounds like the support for Shift-JIS encoded files may be specific only to Sony platforms, is that true? I am kinda curious to hear from someone developing in a Japanese environment. In writing the original patch, I was making the assumption that if a text file on Windows does not start with the BOM sequence, then it is using the current codepage. That seems to be the case for files created from Notepad, but of course these files can be created from many different sources. If this assumption is not valid, then I wonder what would be a good way to differentiate a UTF-8 file from a current-codepage one; maybe a command-line option?
Thoughts and advice are appreciated,

From: Rafael Espíndola [mailto:rafael.espindola@gmail.com]
Sent: Friday, January 23, 2015 4:07 PM
Thanks for splitting the patch!
When Rafael Auler implemented the bits for *writing* response files from clang, I think the observed behavior was

GNU tools use the current codepage.

MS Tools use UTF-16 only.

Clang uses UTF-16 or UTF-8 (non-BOM)

The first part of your patch adds support for UTF-8 BOM, which I think is a strict improvement.
The change to assume current codepage in a tool that can handle UTF is what I think is problematic, since there is no precedent for it (that I know of).
Response files are small (relative to the work they cause), so maybe one option would be to try to check if the file is UTF-8 and fall back to current codepage if that fails.
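
To make the "check if the file is UTF-8" idea concrete, here is a minimal standalone sketch of a UTF-8 well-formedness check. It is an assumption-laden illustration, not LLVM's implementation: isValidUTF8 is a hypothetical name, and the function rejects overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a UTF-8 legality check; not LLVM's code.
bool isValidUTF8(std::string_view S) {
  std::size_t I = 0;
  const std::size_t N = S.size();
  while (I < N) {
    const unsigned char B0 = static_cast<unsigned char>(S[I]);
    if (B0 < 0x80) { // ASCII byte, always valid
      ++I;
      continue;
    }
    std::size_t Len = 0; // total bytes in this sequence
    unsigned Min = 0;    // smallest code point allowed for this length
    unsigned CP = 0;     // decoded code point
    if ((B0 & 0xE0) == 0xC0) { Len = 2; Min = 0x80; CP = B0 & 0x1F; }
    else if ((B0 & 0xF0) == 0xE0) { Len = 3; Min = 0x800; CP = B0 & 0x0F; }
    else if ((B0 & 0xF8) == 0xF0) { Len = 4; Min = 0x10000; CP = B0 & 0x07; }
    else return false; // stray continuation byte or invalid lead byte
    if (N - I < Len)
      return false; // sequence truncated at end of input
    for (std::size_t K = 1; K < Len; ++K) {
      const unsigned char B = static_cast<unsigned char>(S[I + K]);
      if ((B & 0xC0) != 0x80)
        return false; // expected a continuation byte (10xxxxxx)
      CP = (CP << 6) | (B & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (CP < Min || CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
      return false;
    I += Len;
  }
  return true;
}
```

For example, the Shift-JIS bytes 93 FA 96 7B (the characters 日本) fail this check because 0x93 cannot start a UTF-8 sequence, while the UTF-8 bytes E6 97 A5 E6 9C AC for the same text pass. The check is a heuristic only: pure ASCII is valid under both interpretations, and some codepage byte sequences can happen to be well-formed UTF-8.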

I think I just found Rafael Auler’s commit, r217792 (right?). And I confirm that mingw on Windows (tested with MinGW-W64 4.9.2) accepts system codepage-encoded response files (but not UTF-8). I guess what's new here is to try to support both UTF-8 and system codepage.

So why is this patch needed for Shift-JIS then? Wouldn't Shift-JIS be the system codepage?

Yes, but Clang assumes UTF-8 (Windows or Linux) when no BOM is present. So, even though the system code page is Japanese, when Clang opens input files, it will use a function that interprets the file name strings as UTF8.

Modern Windows applications should always use UTF-16. In the Windows API, to use the old system code page you must explicitly call functions with the suffix A (ANSI), while the functions recommended for internationalization support end with W (wide) (this transition began back in Windows 95!). I was surprised to see that GNU tools needed files in the system code page. IIRC VS tools will always use UTF-16.

Oh, I think I misinterpreted gao's comment as being about clang, but now that I reread it, it is about mingw (and not clang running on mingw or whatever I thought he said).

maybe one option would be to try to check if the file is UTF-8 and fall back to current codepage if that fails.

Cleaned up and rebased the patch.
I figured that what Rafael (Espindola) meant earlier is probably to run isLegalUTF8String() before making the assumption about the system code page?
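
One possible reading of that ordering, sketched in standalone C++: look for a UTF-16 BOM, then a UTF-8 BOM, then test for legal UTF-8, and only then assume the system code page. All names here are hypothetical, and the UTF-8 legality check is passed in as a callable so that the sketch does not pretend to reproduce isLegalUTF8String() itself.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names for illustration only; none of these are LLVM APIs.
enum class ResponseFileEncoding { UTF16, UTF8WithBOM, UTF8, SystemCodePage };

// IsLegalUTF8 is any callable taking std::string_view and returning bool.
template <typename Pred>
ResponseFileEncoding classifyResponseFile(std::string_view S,
                                          Pred IsLegalUTF8) {
  // UTF-16 BOM in either byte order (FF FE or FE FF).
  if (S.size() >= 2 && ((S[0] == '\xff' && S[1] == '\xfe') ||
                        (S[0] == '\xfe' && S[1] == '\xff')))
    return ResponseFileEncoding::UTF16;
  // UTF-8 BOM (EF BB BF).
  if (S.size() >= 3 && S[0] == '\xef' && S[1] == '\xbb' && S[2] == '\xbf')
    return ResponseFileEncoding::UTF8WithBOM;
  // No BOM: prefer UTF-8 whenever the bytes are well-formed UTF-8.
  if (IsLegalUTF8(S))
    return ResponseFileEncoding::UTF8;
  // Last resort: treat the bytes as system code page text.
  return ResponseFileEncoding::SystemCodePage;
}
```

With this ordering, existing BOM-less UTF-8 response files keep their current meaning, and the code page conversion only applies to input that could not have been UTF-8 anyway.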

ygao abandoned this revision. Feb 3 2015, 3:19 PM

Revision Contents

Path                                      Size
llvm/include/llvm/Support/ConvertUTF.h    19 lines
llvm/lib/Support/CommandLine.cpp          14 lines
llvm/lib/Support/ConvertUTFWrapper.cpp    54 lines

Diff 18647

llvm/include/llvm/Support/ConvertUTF.h

Context not available.
 bool hasUTF16ByteOrderMark(ArrayRef<char> SrcBytes);

 /**
+ * Returns true if a blob of text starts with a UTF-8 byte order mark.
+ * UTF-8 BOM is a sequence of bytes on Windows and is not affected by the host
+ * system's endianness.
+ */
+bool hasUTF8ByteOrderMark(ArrayRef<char> SrcBytes);
+
+#ifdef LLVM_ON_WIN32
+/**
+ * Converts a stream of raw bytes assumed to be encoded in ANSI code page (aka
+ * Windows system locale) into a UTF8 std::string.
+ *
+ * \param [in] SrcBytes A buffer of what is assumed to be ANSI-encoded text.
+ * \param [out] Out Converted UTF-8 is stored here on success.
+ * \returns true on success
+ */
+bool convertANSIToUTF8String(ArrayRef<char> SrcBytes, std::string &Out);
+#endif
+
+/**
  * Converts a stream of raw bytes assumed to be UTF16 into a UTF8 std::string.
  *
  * \param [in] SrcBytes A buffer of what is assumed to be UTF-16 encoded text.
Context not available.

llvm/lib/Support/CommandLine.cpp

Context not available.
       return false;
     Str = StringRef(UTF8Buf);
   }
+  // If we see UTF-8 BOM sequence at the beginning of a file, we shall remove
+  // these bytes before parsing.
+  // Reference: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
+  else if (hasUTF8ByteOrderMark(BufRef))
+    Str = StringRef(BufRef.data() + 3, BufRef.size() - 3);
+#ifdef LLVM_ON_WIN32
+  // Otherwise, this might be a hand-written text file encoded in the system's
+  // default code page.
+  else {
+    if (!convertANSIToUTF8String(BufRef, UTF8Buf))
+      return false;
+    Str = StringRef(UTF8Buf);
+  }
+#endif

   // Tokenize the contents into NewArgv.
   Tokenizer(Str, Saver, NewArgv, MarkEOLs);
Context not available.

llvm/lib/Support/ConvertUTFWrapper.cpp

Context not available.
 #include <string>
 #include <vector>

+#ifdef LLVM_ON_WIN32
+#include "Windows/WindowsSupport.h"
+#endif

 namespace llvm {

 bool ConvertUTF8toWide(unsigned WideCharWidth, llvm::StringRef Source,
Context not available.
           (S[0] == '\xfe' && S[1] == '\xff')));
 }

+// It is called byte order marker but the UTF-8 BOM is actually not affected
+// by the host system's endianness.
+bool hasUTF8ByteOrderMark(ArrayRef<char> S) {
+  return (S.size() >= 3 &&
+          S[0] == '\xef' && S[1] == '\xbb' && S[2] == '\xbf');
+}
+
+#ifdef LLVM_ON_WIN32
+// Convert system-locale encoded string to UTF8
+bool convertANSIToUTF8String(ArrayRef<char> SrcBytes, std::string &Out) {
+  assert(Out.empty());
+
+  if (SrcBytes.empty())
+    return true;
+
+  SmallVector<wchar_t, 128> utf16;
+  SmallVector<char, 128> utf8;
+
+  int len = ::MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, SrcBytes.data(),
+                                  SrcBytes.size(), utf16.begin(), 0);
+  if (len == 0)
+    return false;
+
+  utf16.reserve(len + 1);
+  utf16.set_size(len);
+
+  len = ::MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, SrcBytes.data(),
+                              SrcBytes.size(), utf16.begin(), utf16.size());
+  if (len == 0)
+    return false;
+
+  len = ::WideCharToMultiByte(CP_UTF8, 0, utf16.begin(), utf16.size(),
+                              utf8.begin(), 0, nullptr, nullptr);
+  if (len == 0)
+    return false;
+
+  utf8.reserve(len + 1);
+  utf8.set_size(len);
+
+  len = ::WideCharToMultiByte(CP_UTF8, 0, utf16.begin(), utf16.size(),
+                              utf8.data(), utf8.size(), nullptr, nullptr);
+  if (len == 0)
+    return false;
+
+  Out.resize(utf8.size());
+  std::copy(utf8.begin(), utf8.end(), Out.begin());
+  return true;
+}
+#endif // LLVM_ON_WIN32

 bool convertUTF16ToUTF8String(ArrayRef<char> SrcBytes, std::string &Out) {
   assert(Out.empty());

Context not available.