This is an archive of the discontinued LLVM Phabricator instance.

Add writeFileWithSystemEncoding to LibLLVMSupport
ClosedPublic

Authored by rafaelauler on Aug 13 2014, 8:07 PM.

Details

Reviewers
rafael
Summary

This patch adds to LLVMSupport the capability of writing files with international characters encoded in the current system encoding. This is relevant for Windows, where we can either use UTF16 or the current code page (the legacy Windows international characters). On UNIX, the file is always saved in UTF8.

This patch also fixes a bug in the Unix version of argumentsFitWithinSystemLimits(). Both functions will be used in a patch for clang to thoroughly support response files creation when calling other tools, addressing PR15171. On Windows, to correctly support internationalization, we need the ability to write response files both in UTF16 or the current code page, depending on the tool we will call. GCC for mingw, for instance, requires files to be encoded in the current code page. MSVC tools requires files to be encoded in UTF16.

Diff Detail

Event Timeline

rafaelauler retitled this revision from to Add writeFileWithSystemEncoding to LibLLVMSupport.
rafaelauler updated this object.
rafaelauler edited the test plan for this revision. (Show Details)
rafaelauler added a reviewer: rafael.
rafaelauler added a subscriber: Unknown Object (MLST).

The functionality introduced in this patch will be used by http://reviews.llvm.org/D4897

rnk added a subscriber: rnk.Aug 14 2014, 4:59 PM
rnk added a comment.Aug 14 2014, 5:02 PM

Does gcc ignore things like UTF BOMs in response files? Clang sniffs for the UTF-16 BOM when parsing response files on Windows. The best case would be that we give gcc a UTF-8 response file with a BOM, IMO.

Hi Rafael and Reid,

Thanks for sharing your opinion, I appreciate it. I organized a table with the testings I made in my Windows system. I encoded a response file with international characters in different encodings and tested them on different tools. Here are my findings:

ToolUTF8-no-BOMUTF8-BOMUTF16-BOMCurrent Code Page (ISO-8859-1 in my system)
GCC 4.8.1 MinGWFailFailFailWorks
LD 2.24 MinGWFailFailFailWorks
GCC 4.8.3 CygwinWorksFailFailFail
LD 2.24.51 CygwinWorksFailFailFail

For Cygwin, I used bash and, for MinGW programs, the Windows command prompt.

This led me to believe that:

  • GNU tools on Cygwin or any UNIX system accepts plain UTF8 without any BOM. Using BOM will confuse the tool. No other encoding is understood.
  • GNU tools on MinGW only accepts the current code page of the system. Using any other encoding, with or without BOM, is not understood.

That's why I designed my patch the way it is. On Windows native or MinGW, it uses current CP or UTF16 with BOM (for MSVC tools). On UNIX (including cygwin), it always uses UTF8 without BOM.

I supposed that all GNU tools work in this way and extended the information on all Clang Tool objects related to GNU to follow this as well. This is the meaning of using the enum member ResponseFileSupport::FullWithoutUTF16 in all GNU tools (no UTF16 means that it will use UTF8 on UNIX and Current code page on Windows).

I will update the comments in this patch to make this clear. I will also open a bug in binutils requesting them to implement UTF8/UTF16 response files on Windows/MinGW.

Best regards,
Rafael Auler

In this new patch, I implemented Rafael's suggestion of writing an encoding enum. Thus, I changed the last parameter of writeFileWithEncoding to be an encoding enum member.

rnk added a comment.Aug 22 2014, 10:34 AM

Looks pretty good. Let me know if you need help landing it when you send a revised patch.

include/llvm/Support/Program.h
142 ↗(On Diff #12579)

The contents parameter should be a StringRef. Most callers will have the length handy.

lib/Support/Unix/Program.inc
461–465 ↗(On Diff #12579)

How about returning a std::error_code and using std::errc::io_error here and in the Windows implementation?

lib/Support/Windows/Program.inc
465–468 ↗(On Diff #12579)

Similarly, we can just propagate ec here if we return std::error_code.

482 ↗(On Diff #12579)

ditto

unittests/Support/ProgramTest.cpp
275 ↗(On Diff #12579)

Hm, looks like we can't convert from SmallString to Twine. Twine is usually an implementation detail. Can you call .str() here instead?

rafael edited edge metadata.Aug 22 2014, 10:42 AM

Only a nit in addition to what Reid noticed. LGTM otherwise.

include/llvm/Support/Program.h
141 ↗(On Diff #12579)

Please include a quick summary of your table. At least say that what requires us to use EM_CurrentCodePage is that gnu tools (ld, and gcc at least) on mingw only work with the current code page.

BTW, have you tested with mingw-w64 too? If they support UTF8 or UTF16 and the old mingw does not that seems like a reason to push for a switch at some point.

Just one extra nit in addition to what Reid noticed. LGTM otherwise.

Thanks for your suggestions, I will submit a revised patch soon. Rafael, regarding your comment, I didn't add a table, but I did add a comment describing which encoding each tool use in the clang-side of the patch at http://reviews.llvm.org/D4897, file include/clang/Driver/Tool.h. However, if you want, I can add the table here too. What do you think?

rafaelauler edited edge metadata.

Implemented rnk's and rafael's suggestions. Will update the clang part to D4897.

Rafael, I finished debugging our arabic test case. I tested a filename with the character 0xd6 and it worked OK just using regular cp 720. My initial command line test was failing because I was using "touch.exe" to create the file, and it was creating the file with the wrong name (the response file was OK, but the filename used a different character).

Almost there. Thanks for testing all the strange codepage combinations!

include/llvm/Support/Program.h
146 ↗(On Diff #12872)

Please don't pass in the ErrMsg string. The caller can use EC.message() method. That is an old API design we try to avoid. I see it is used because raw_fd_osstream was never updated to avoid it :-(

Suggestion: pass in a file descriptor. That way you can use a raw_fd_ostream with a saner interface and we don't spread the use of std::string pointers for error messages to other APIs.

Another option: keep the filename, but use openFileForWrite + the fd raw_fd_ostream constructor.

lib/Support/Unix/Program.inc
479 ↗(On Diff #12872)

This bug fix could be in another patch, no?

lib/Support/Windows/Program.inc
171 ↗(On Diff #12872)

This refactoring could be in another patch, no?

Hi Rafael,

Thanks for this extra round of review, I answered your questions below. I will send an updated patch shortly.

include/llvm/Support/Program.h
146 ↗(On Diff #12872)

I will use your updated constructor from r216293, thanks for updating it!

lib/Support/Unix/Program.inc
479 ↗(On Diff #12872)
lib/Support/Windows/Program.inc
171 ↗(On Diff #12872)

Patch addressing reviewers' concerns. I uploaded the clang side in D4897, currently being reviewed by Sean.

Added the struct "EncodingStrategy", used to represent how the user wants to encode her file in different OSes. This was used to allow the removal of several ifdefs in the clang side of the patch at D4897.

rafael added inline comments.Aug 26 2014, 7:41 AM
include/llvm/Support/Program.h
140 ↗(On Diff #12943)

When is the UnixEncoding not UTF8?

If the problem was just the assert, I would suggest just writing the Unix version as

​std::error_code llvm::sys::writeFileWithEncoding(const char *FileName,

    		​                                                 StringRef Contents,
		​                                                     EncodingStrategy /*ignored*/) {

and documenting UTF8 is always used on Unix. You can even name enum WindowsEncodingMethod to make it explicit that is why we use on Windows.

154 ↗(On Diff #12943)

FileName can be a StringRef now, no?

Introduce "WindowsEncodingMethod" to stress that the file encoding is only relevant on Windows systems. Remove the old "EncodingStrategy", addressing Rafael's concerns.

I don't know why, but phabricator didn't send the email here updating this
thread. Anyway, I sent a new patch addressing your concerns. It is
available at http://reviews.llvm.org/D4896.

Description:

Introduce "WindowsEncodingMethod" to stress that the file encoding is only
relevant on Windows systems. Remove the old "EncodingStrategy", addressing
Rafael's concerns.

rafael accepted this revision.Aug 26 2014, 2:15 PM
rafael edited edge metadata.

LGTM

This revision is now accepted and ready to land.Aug 26 2014, 2:15 PM
rafaelauler edited edge metadata.

I realize this patch has been accepted, but I am updating it in light of the discussion in D4897, to update the LLVM side of it.

I will copy and paste the description that I put in D4897, since it reflects the modifications made on both sides (LLVM and Clang):

This update refactors the new code in Job.cpp to use streams, no longer building and managing its own char buffers. Thanks to Sean’s suggestion, the code now is much more simple. However, I wanted to build a stream that could directly write to the response file with the correct encoding, but raw_fd_ostream lacks the capability to write files with different encodings. Therefore, I added this capability to raw_fd_ostream with a small modification and introduced a new constructor variant that creates buffers that, when flushed to a file, is written in a different encoding. If you think it is inadequate to have such a feature in raw_fd_ostream, I can work on creating my own stream class to do so.

I also refactored part of the write function in raw_fd_ostream out of this class, to avoid duplicating code in this implementation. Then, I changed by writeFileWithEncoding() function to write in a given file descriptor, which is assumed to be opened, rather than opening the file by my own and then closing it. This enabled me to make raw_fd_ostream work with different encodings with little extra code.

I also refactored part of the write function in raw_fd_ostream out of this class, to avoid duplicating code in this implementation.

Yes, this class is way too central to get an extra feature just for
response files.

You can maybe add a new streamer class, but I must say I am not sure
it is worth it. Response files are relatively small, so I would
probably start with what you had before: build a buffer and then dump
it. This also avoids having to handle incomplete characters
(getIncompleteUTFBytes)

Cheers,
Rafael

Hi Rafael,

No problem, I will just revert it to my last patch and rebase. This is the updated patch. In the Clang side, I will just use a raw_string_ostream and only call writeFileWithEncoding when the full buffer is written.

Best regards,
Rafael Auler

Do you need me to commit this?