Use GetArgumentVector to retrieve the utf-8 encoded arguments on all platforms
AbandonedPublic

Authored by asmith on Apr 11 2018, 6:48 PM.

Details

Summary

After running the lld unit tests on Windows, it was discovered that various llvm tools on Windows are not handling unicode command lines. An example of a test that does not work on Windows because of this issue is lld/test/ELF/format-binary-non-ascii.s.

This problem was originally fixed in r329468 by modifying Windows/Path.inc to try and use the current codepage. Eli Friedman recommended instead modifying the llvm tools to properly handle unicode command lines by calling sys::Process::GetArgumentVector.

With this change and the change in D45551, lld/test/ELF/format-binary-non-ascii.s passes.

Diff Detail

asmith created this revision.Apr 11 2018, 6:48 PM
asmith edited the summary of this revision. (Show Details)
asmith changed the repository for this revision from rL LLVM to rLLD LLVM Linker.
ruiu added a comment.Apr 11 2018, 7:13 PM

lld/test/ELF/format-binary-non-ascii.s may not run properly on Windows if the expected encoding for the command line option is not UTF-8, as the test file contains a pathname containing a non-ASCII character encoded in UTF-8 encoding. That is not immediately an issue on Windows.

I think always handling command line options as if they were encoded in UTF-8 is actually wrong.

Are you sure that this is really what you want to do?

Any particular reason why you think this is wrong?

At least a few people recommend that you always treat char* as UTF-8 on
Windows.

http://utf8everywhere.org

ruiu added a comment.Apr 11 2018, 9:30 PM

That website recommends we handle all strings as UTF-8 whether externally or internally, and if you need to handle strings encoded not in UTF-8, you convert them to UTF-8 first. I'd agree that principle.

But the thing I'm pointing out is different from that. If your Windows code page setting is ISO-8859-1, for example, I think the command line arguments are supposed to be encoded in ISO-8859-1. Passing a UTF-8 string as an argument is not valid if that's the case. Look at lld/test/ELF/format-binary-non-ascii.s. It contains raw UTF-8 string as a command line argument. That test doesn't make sense unless your code page is UTF-8.

ruiu added a comment.Apr 11 2018, 9:31 PM

What are you actually trying to fix with the patch? Did you just find that lld/test/ELF/format-binary-non-ascii.s didn't pass on your machine and trying to fix it?

If you look at the log for the test you will see we previously fixed it to work with Python 3. The test itself does not work on Windows for the reason your described.

Now look at D45551 which is the other part of these changes. You will see that we are reverting the change that we made previously to deal with the default code page on Windows. Eli pointed out it would be better to call GetArgumentVector and convert the command line to a UTF8 string.

Is your viewpoint that this test is bogus on Windows and should be marked XFAIL?

There are a couple of things going on here:

  1. The test itself is verifying that each of the tools can properly handle non-ASCII characters in the path. This indicates that each of the tools *should* handle non-ASCII characters in their inputs correctly
  2. The test only passes on Windows with Python 2 because the lit code for Python 2 uses a byte array to pass the input thus breaking the one Unicode pound character into two characters. In Python 3, the pound character is passed correctly as a single character causing the test to fail on Windows. This is the same behavior as the command line in Windows - the tools fail to decode non-ASCII characters, so while we caught this in a test, this is a bug in the tools assuming they should handle non-ASCII characters correctly.

The reason the test fails on Windows without either the change that Aaron is undoing or the change to use GetArgumentVector is exactly because the inputs are not UTF-8 encoded, but rather encoded with the windows default code page. This means that we cannot treat the input as always UTF-8 encoded.

The original fix that we submitted was to modify the Windows Path.inc to try to decode the input as UTF-8 first (which is the Unix default encoding) and then try the windows default code page. This impacts the way any tool that uses Path.inc on Windows behaves and made *all* of the tools (llvm-mc, ld.lld, llvm-objdump, etc.) handle non-ASCII characters correctly whether they were encoded as UTF-8 or with the windows default code page. The drawback of this, as Eli pointed out, is that now the tools are *guessing* what the right encoding is. This happens to work a lot more frequently than assuming UTF-8 but it is not entirely correct.

The fix the Eli suggested was to use GetArgumentVector. On Windows, this uses a combination of GetCommandLineW and CommandLineToArgW to return the arguments are proper UTF-8. Now we are no longer guessing what the encoding is and the tools again handle non-ASCII characters correctly. The drawback here is that every tool needs to be updated to use GetArgumentVector in order to properly handle non-ASCII characters on Windows. In this case, to get this one test to work, we have to touch three tools - ld.lld, llvm-mc and llvm-objdump. I briefly looked through the other tools and there's at least another half dozen that will need the same change to work correctly on Windows.

At the end of the day, using GetArgumentVector is the *correct* change which needs to be applied to all tools, but using the windows default code page is the change that works almost always for all the tools. In order to keep the code cleaner, we've opted for the correct change with the understanding that we had to update at least the three tools to keep the test passing on Windows with Python 3 and that eventually the rest of the tools will need the same update.

ruiu added a comment.Apr 12 2018, 11:22 AM

Stella,

Thank you for the detailed explanation! That's very helpful. I agree that we shouldn't make a guess on character encoding of command line arguments, instead, we should use Windows APIs to get them in the correct encoding and then convert them to UTF-8. That definitely looks like the right direction.

That being said, I believe that the failing lld/test/ELF/format-binary-non-ascii.s test should just be disabled on Windows. The test itself is written in UTF-8, and if your code page is, say, CP943, the test file could be converted multiple times as follows:

When Python 3 passes the command line argument to an external command, I believe it converts arguments to the current code page, which is CP943
The command gets arguments using GetCommandLineW and convert it from CP943 to UTF-8

The round trip doesn't guarantee that the pound sign (£; U+00A3) to be as-is. It might be converted to Full Width Pound Sign (£; U+FFE1). We might be handle this correctly in all code pages, but I don't think it's worth to do. The easiest way to avoid a mess is to just not run that particular test on Windows.

ruiu added a comment.Apr 12 2018, 11:25 AM

So, I mean I think what this patch is correct, but orthogonal to that, we should disable the failing test on Windows so that we don't think too hard about character conversion round trip issues.

When Python 3 passes the command line argument to an external command, I believe it converts arguments to the current code page

That's not correct; Python 3 uses CreateProcessW, which accepts Unicode (UTF-16) directly, and if the called process uses GetCommandLineW, there's never any conversion. Python 2 uses CreateProcessA, which has the issues you've noted.

Speaking of the *test* itself, the fact that Python 2 uses CreateProcessA on Windows is just one of the problems.

This function (from utils\lit\lit\util.py) is the reason the test passes on Windows without any of the changes when run in Python 2. What this ends up doing is calling return b.encode('utf-8') on the inputs in the test (since they are Unicode and not simple strings), so the path being passed to llvm-mc, ld.lld and llvm-objdump in the test no longer contains the pound sign (£), but rather contains two separate bytes representing pound. Neither of the bytes is really a proper ASCII character, so in a way, the test is still verifying that the tools will accept non-ASCII characters. I think this defeats the purpose of the test, however, since we cannot confirm through it that *any* non-ASCII character will be treated correctly and the result doesn't contain the character we started with in the first place. Python 3 (through its addition of str as Unicode) does pass the character correctly as pound to the tools.

This is a limitation of the current lit infrastructure and for the test to really be valid, lit would need some additional changes. I do think that as long as the tools are expected to support non-ASCII characters, this test and others like it should exist to verify that it is the case. In fact, all tools that are expected to support non-ASCII characters should have at least one test. (BTW, I am not sure whether there are other tests currently that are similarly impacted by the to_string function since not all tests run on Windows and we use Python 3 only for our Windows testing.)

def to_string(b):
    """Return the parameter as type 'str', possibly encoding it.

    In Python2, the 'str' type is the same as 'bytes'. In Python3, the
    'str' type is (essentially) Python2's 'unicode' type, and 'bytes' is
    distinct.

    """
    if isinstance(b, str):
        # In Python2, this branch is taken for types 'str' and 'bytes'.
        # In Python3, this branch is taken only for 'str'.
        return b
    if isinstance(b, bytes):
        # In Python2, this branch is never taken ('bytes' is handled as 'str').
        # In Python3, this is true only for 'bytes'.
        try:
            return b.decode('utf-8')
        except UnicodeDecodeError:
            # If the value is not valid Unicode, return the default
            # repr-line encoding.
            return str(b)

    # By this point, here's what we *don't* have:
    #
    #  - In Python2:
    #    - 'str' or 'bytes' (1st branch above)
    #  - In Python3:
    #    - 'str' (1st branch above)
    #    - 'bytes' (2nd branch above)
    #
    # The last type we might expect is the Python2 'unicode' type. There is no
    # 'unicode' type in Python3 (all the Python3 cases were already handled). In
    # order to get a 'str' object, we need to encode the 'unicode' object.
    try:
        return b.encode('utf-8')
    except AttributeError:
        raise TypeError('not sure how to convert %s to %s' % (type(b), str))

I would say it's not worth investing any effort into making the test work correctly on Python2 on Windows; as long as Python3 is doing the right thing, that's good enough. (If we really wanted to, we could avoid the charset conversion in Python2 by calling CreateProcessW using ctypes.)

ruiu added a comment.Apr 12 2018, 1:41 PM

I think I'm convinced. We probably should enable that particular test if Python 3.

tools/lld/lld.cpp
129–130

Why is this function's signature so complex? How can it fail?

tools/lld/lld.cpp
129–130

On Windows, this calls several functions which can fail:

  • GetCommandLineW
  • CommandLineToArgvW
  • WideCharToMultiByte
ruiu added inline comments.Apr 12 2018, 3:13 PM
tools/lld/lld.cpp
129–130

IIUC, you want to do the same thing for all llvm tools that have their own main() functions. And I think that you always exit on error. If that's the case, I'd reduce the amount of boilerplate by moving code.

tools/lld/lld.cpp
129–130

We don't intend on making the change to all tools at this time - just the ones needed to make the test pass - in order to scope the work.

GetArgumentVector is used from a variety of places not all of which share the same code exit behavior (though most do). The same is true for PrintStackTraceOnErrorSignal and PrettyStackTraceProgram. Since all of these are frequently used together (and it's probably an omission of they are not), I agree it would be a great idea to create a simpler pattern than can be used by tools. To properly make the change, we would have to update all the tools to use the new pattern, which is also out of our scope right now. Perhaps we should open a bug/task to be addressed later?

ruiu accepted this revision.Apr 12 2018, 3:28 PM

GetArgumentVector is used from a variety of places not all of which share the same code exit behavior (though most do). The same is true for PrintStackTraceOnErrorSignal and PrettyStackTraceProgram. Since all of these are frequently used together (and it's probably an omission of they are not), I agree it would be a great idea to create a simpler pattern than can be used by tools. To properly make the change, we would have to update all the tools to use the new pattern, which is also out of our scope right now. Perhaps we should open a bug/task to be addressed later?

Yeah. Ideally we probably should have something like InitLLVM X(int Argc, const char **Argv) which does all these common process initializations all at once. But as you said that can be done later in another patch.

Thank you very much for answering all my questions! LGTM

This revision is now accepted and ready to land.Apr 12 2018, 3:28 PM
asmith abandoned this revision.Apr 14 2018, 1:54 PM