This is an archive of the discontinued LLVM Phabricator instance.

non-Unicode response file support on Windows
AbandonedPublic

Authored by ygao on Jan 22 2015, 6:11 PM.

Details

Reviewers
None
Summary

Hi community,
I have a customer who wants to use Japanese characters in the response
file (Shift-JIS encoded) on Windows. I am attempting to implement it in
lib/Support. It would be very helpful if someone can take a look and give
some feedback.
Thanks in advance,

  • Gao

Diff Detail

Event Timeline

ygao updated this revision to Diff 18647.Jan 22 2015, 6:11 PM
ygao retitled this revision from to non-Unicode response file support on Windows.
ygao updated this object.
ygao edited the test plan for this revision. (Show Details)
ygao added subscribers: Unknown Object (MLST), rnk, majnemer and 3 others.

The main issue I think is that there is not a lot of precedent for this on windows:

  • GNU tools only use the local codepage, but that is probably a consequence of simplistically using the crt read function.
  • Clang support UTF-16 and UTF-8.
  • Microsoft own tools require UTF-16.

The part of this patch that I don't expect to be controversial is skipping the utf-8 bom if one is present. Can you move that to an independent patch?

Also, it needs a testcase. Something as simple an running llvm-as @file_with_utf8_bom should do.

llvm/lib/Support/ConvertUTFWrapper.cpp
90

This can be a static helper in CommandLine.cpp.

97

This can be a static helper in CommandLine.cpp

rnk added a comment.Jan 23 2015, 11:10 AM

The main change here is that BOM-less response files on Windows will use the current codepage rather than defaulting to UTF-8. That seems unfortunate, as it does not promote "UTF8 everywhere", but it's probably the more Windows-y thing to do.

I agree with Rafael, this behavior would be out of line with other tools. Perhaps an even strange side effect of this change would be us assuming the current code page for response files but not source files, a strange exception.

Assuming that the file is in codepage encoding instead of UTF-8 sounds like a step backwards.

ygao added a comment.Jan 23 2015, 3:55 PM

Hi Rafael, Reid, David, thanks for the reviews. I separated the UTF-8 BOM part of the patch into D7156.
It sounds like the support for Shift-JIS encoded files may be specific only to Sony platforms, is that true? I am kinda curious to hear from someone developing in Japanese environment. In writing the original patch, I was making the assumption that if a text file on Windows does not start with the BOM sequence, then it is using the current codepage. That seems to be the case for files created from Notepad, but of course these files can be created from many different sources. If this assumption is not valid, then I wonder what would be a good way to differentiate a UTF-8 file from a current-codepage one; maybe a command-line option?
Thoughts and advice are appreciated,

  • Gao
ygao added a comment.Jan 23 2015, 8:56 PM

From: Rafael Espíndola [mailto:rafael.espindola@gmail.com]
Sent: Friday, January 23, 2015 4:07 PM
Thanks for splitting the patch!
When Rafael Auler implemented the bits for *writing* response files from clang, I think the observed behavior was

  • GNU tools use the current codepage.
  • MS Tools use UTF-16 only.
  • Clang uses UTF-16 or UTF-8 (non-BOM)

The first part of you patch adds support for UTF-8 BOM, which I think is a strict improvement.
The change to assume current codepage in a tool that can handle utf is what I think is problematic, since there is no precedent for it (that I know of).
Response files are small (relative to the work they cause), so maybe one options would be to try to check if the file is UTF-8 and fallback to current codepage if that fails.

I think I just found Rafael Auler’s commit, r217792 (right?). And I confirm that mingw on Windows (tested with MinGW-W64 4.9.2) accepts system codepage-encoded response files (but not UTF-8). I guess what's new here is to try to support both UTF-8 and system codepage.

silvas added a subscriber: silvas.Jan 24 2015, 8:11 AM
In D7133#113113, @ygao wrote:

From: Rafael Espíndola [mailto:rafael.espindola@gmail.com]
Sent: Friday, January 23, 2015 4:07 PM
Thanks for splitting the patch!
When Rafael Auler implemented the bits for *writing* response files from clang, I think the observed behavior was

  • GNU tools use the current codepage.
  • MS Tools use UTF-16 only.
  • Clang uses UTF-16 or UTF-8 (non-BOM)

The first part of you patch adds support for UTF-8 BOM, which I think is a strict improvement.
The change to assume current codepage in a tool that can handle utf is what I think is problematic, since there is no precedent for it (that I know of).
Response files are small (relative to the work they cause), so maybe one options would be to try to check if the file is UTF-8 and fallback to current codepage if that fails.

I think I just found Rafael Auler’s commit, r217792 (right?). And I confirm that mingw on Windows (tested with MinGW-W64 4.9.2) accepts system codepage-encoded response files (but not UTF-8). I guess what's new here is to try to support both UTF-8 and system codepage.

So why is this patch needed for Shift-JIS then? Wouldn't Shift-JIS be the system codepage?

Yes, but Clang assumes UTF-8 (Windows or Linux) when no BOM is present. So, even though the system code page is Japanese, when Clang opens input files, it will use a function that interprets the file name strings as UTF8.

Modern Windows applications should always use UTF-16. In the Windows API, to use functions that uses the old system code page you must explicitly call functions with the suffix A (ansi), while the recommended functions for internationalization support ends with W (wide) (this transition began back in Windows 95!). I was surprised to see that GNU tools needed files in system code page. IIRC VS tools will always use UTF16.

Oh, I think I misinterpreted gao's comment as being about clang, but now that I reread it, it is about mingw (and not clang running on mingw or whatever I thought he said).

ygao updated this revision to Diff 18794.Jan 26 2015, 7:03 PM

maybe one options would be to try to check if the file is UTF-8 and fallback to current codepage if that fails.

Cleaned up and rebased the patch.
I figured that what Rafael (Espindola) meant earlier is probably to run isLegalUTF8String() before making the assumption about the system code page?

ygao abandoned this revision.Feb 3 2015, 3:19 PM