Page MenuHomePhabricator

Add Support for Creating and Deleting Unicode Files and Directories in Lit

Authored by jmittert on Jan 15 2019, 4:03 PM.



This enables lit to work with unicode file names via mkdir, rm, and redirection. Lit still uses utf-8 internally, but converts to utf-16 on Windows, or just utf-8 bytes on everything else.

Taking my best guess at a reviewer.

Diff Detail


Event Timeline

jmittert created this revision.Jan 15 2019, 4:03 PM
smeenai added a subscriber: smeenai.

Could you upload this with context? The easiest way is to use arcanist ( Otherwise use -U9999 when generating the diff to upload to Phabricator.

jmittert updated this revision to Diff 181935.Jan 15 2019, 4:50 PM

Recreated with context/arc diff

1110 ↗(On Diff #181935)

I'm not 100% sure here, but this looks very redundant and non-maintainable. Why not always open the file in binary mode?

jmittert updated this revision to Diff 182096.Jan 16 2019, 10:38 AM

Updated to open the script file in binary mode regardless of platform.

1087 ↗(On Diff #182096)

That's nce, but now, you need to use the b prefix for all strings, and for decoding o f the joined command, right?

serge-sans-paille requested changes to this revision.Jan 17 2019, 1:39 AM
This revision now requires changes to proceed.Jan 17 2019, 1:39 AM
jmittert updated this revision to Diff 182329.Jan 17 2019, 9:57 AM

Ugh, I had broken something somewhere else which caused the tests look like
they were always passing...

I've now properly fixed the binary mode.

serge-sans-paille added inline comments.
444 ↗(On Diff #182329)

Bad news: I've been discussing with a Python Core Dev and he acknowledge this function to be ultra dangerous. We should find a way to have everything right without this tricky call.

My take would be to first have the whole script work in unicode mode (i.e. in Python3) and then do the minimal work to have it work on Python2. Does that makes sense to you?

This revision is now accepted and ready to land.Jan 22 2019, 11:34 AM
serge-sans-paille requested changes to this revision.Jan 22 2019, 11:34 AM
This revision now requires changes to proceed.Jan 22 2019, 11:34 AM
jmittert marked an inline comment as done.Jan 22 2019, 3:06 PM
jmittert added inline comments.
444 ↗(On Diff #182329)

Let me check that I understand correctly. Right now, lit uses strings on python3 which are unicode aware, and strings/bytes on python2 which are not unicode aware which causes issues when utf8 characters are passed to os.path on Windows.

What about alternatively adding a to_unicode a la the to_string and to_bytes in which would return str on python3, and unicode on python2?

Then adding something like

s = some_string
if windows:
    s = to_unicode(s)

before each call to os.path.some_func?

444 ↗(On Diff #182329)

Your first statement is correct. However I don't understand why there's anything windows specific in there. Let think about it this way: in Python2, strings where used to represent two different kind of objects : raw bytes, and unicode string. They are different types in Python3. So my advice would be:

  1. have the code work in Python3, inserting bytes to string conversion whenever needed. These conversion should not be platform specific.
  1. try to run the same code under Python2 and see where it fails.

Does that look like a reasonable path to you?

jmittert marked an inline comment as done.Jan 24 2019, 10:57 AM
jmittert added inline comments.
444 ↗(On Diff #182329)

Hmm, I was under the impression that that was exactly what I've done here.

Lit works as of now with python3 on both Windows and Unix. The "convertToLocalEncoding" is essentially the fix up for when I ran it in python2 and saw where it failed. In order for the os.path functions to handle unicode properly on python 2, they need to take a unicode string on Windows. Otherwise, the os.path functions create/open a file with the literal utf-8 bytes which causes a broken filename for every other application that expects utf-16 on Windows. On Unix, utf-8 means things work fine in python2 as is. Passing a unicode type to os.path on unix/python2 gives an error about not being able to decode to ascii, but just passing bytes works.

Right now, with python 2 lit passes utf-8 bytes around (the python2 string/byte type) which works for Unix, but needs converting on Windows. What I have right now shouldn't result in any functional changes except for python2 with Windows.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2019, 3:18 PM

@jmittert I tried the following (simpler) patch on Linux and it seems to work nice for both Python2 and Python3

Index: lit/
--- lit/	(revision 353501)
+++ lit/	(working copy)
@@ -345,7 +345,7 @@
     exitCode = 0
     for dir in args:
         if not os.path.isabs(dir):
-            dir = os.path.realpath(os.path.join(cmd_shenv.cwd, dir))
+            dir = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, dir)))
         if parent:
@@ -599,7 +599,7 @@
     exitCode = 0
     for path in args:
         if not os.path.isabs(path):
-            path = os.path.realpath(os.path.join(cmd_shenv.cwd, path))
+            path = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, path)))
         if force and not os.path.exists(path):
@@ -695,7 +695,7 @@
             # Make sure relative paths are relative to the cwd.
             redir_filename = os.path.join(cmd_shenv.cwd, name)
-            fd = open(redir_filename, mode)
+            fd = open(to_bytes(redir_filename), mode)
         # Workaround a Win32 and/or subprocess bug when appending.
         # FIXME: Actually, this is probably an instance of PR6753.

What do you think of this path?

jmittert added a comment.EditedFeb 8 2019, 11:59 AM

What do you think of this path?

This doesn't work on Windows with Python 2 because it to_bytes doesn't convert the bytes to UTF16. It will work on Python 3 with Windows because py3 strings are already unicode aware. Running with python 2 creates the garbled 中文 directory on Windows because it tries to interpret the UTF8 as UTF16. Running with python 3 properly produces the 中文 directory.

For example, adding a quick test with

# RUN: mkdir -p  c:/Users/jmittertreiner/Output/中文


S:\build\Ninja-DebugAssert\llbuild-windows-amd64> dir C:\Users\jmittertreiner\Output\                                                                                     
    Directory: C:\Users\jmittertreiner\Output                                                                                                                             
Mode                LastWriteTime         Length Name                                                                                                                     
----                -------------         ------ ----                                                                                                                     
d-----         2/8/2019   9:48 AM                中文             <-- Running with Python 2                                                                                                      
d-----         2/8/2019   9:49 AM                中文              <-- Running with Python 3
compnerd added inline comments.Feb 11 2019, 10:07 AM
444 ↗(On Diff #182329)

@serge-sans-paille this is a nifty workaround for Windows. When on the python side, a unicode object is passed, the C side will use the W versions of the file APIs rather than A versions which is required to access any non-ASCII file path. Additionally, the use of the W variant of the APIs permits the use of NT style paths (\\?\ prefixed paths) which bypass the Win32 layer and go right to the kernel to avoid the 261 character limit.

jmittert marked an inline comment as done.Feb 12 2019, 1:15 PM
jmittert added inline comments.
444 ↗(On Diff #182329)


@compnerd mentioned to me that you wanted to have this function changed to be called something along the lines of "convertToUnicode". I should mention though that this only converts to Unicode on Windows, On non Windows platforms it converts to bytes, not the unicode type. Think that convertToPlatformEncoding is maybe more descriptive instead?

jmittert marked an inline comment as done.Feb 21 2019, 11:40 AM
jmittert added inline comments.
444 ↗(On Diff #182329)
efriedma added inline comments.
440 ↗(On Diff #182329)

The inner "if" is redundant; on Python2, bytes is an alias for str, so both sides have the same meaning.

This function seems dangerous in the sense that it isn't clear what type of data it's expecting as input. Probably you should have separate functions for each of the possible cases: input is bytes, input is str, or input is Python2 unicode/Python3 str. (Not sure which of those you actually need, but those are the reasonable possibilities, I think.)

jmittert marked an inline comment as done.Feb 22 2019, 10:20 AM
jmittert added inline comments.
440 ↗(On Diff #182329)

I'm fairly certain the if isn't redundant: on python 2, if we get bytes(/str) we want to decode that to a unicode. In python 2, .decode returns a unicode, not a str like in python 3.

Part of the reason it has confusing inputs is because what input it gets depends on they python version. It gets bytes/str on python 2 and str on python 3. I figured it was safer to handle all text cases (python 2 bytes/str/unicode, python 3 str/bytes). At least that way it's idempotent. I do agree it's dangerous in that the type it outputs is different based on the platform and version, but I've tried to limit use of it to only right at the edge before os calls.

efriedma added inline comments.Feb 22 2019, 11:59 AM
440 ↗(On Diff #182329)

on python 2, if we get bytes(/str) we want to decode that to a unicode

I didn't mean it's a no-op, just that isinstance(text, str) and isinstance(text, bytes) always return the same result on Python2, so you don't need the explicit version_info check.

I figured it was safer to handle all text cases

Making the function handle any string-like input is "safer" in the sense that you're less likely to get a runtime error in this function, but it makes the code harder to understand, and it's more likely to lead to confusing results if some other part of the code isn't handling strings consistently.

jmittert updated this revision to Diff 187995.Feb 22 2019, 2:54 PM

Rather than using the ambiguous (and not particularly safe)
convertToLocalEncoding, define a to_unicode and use that as well as the
existing to_bytes depending on the platform.

jmittert marked an inline comment as done.Feb 22 2019, 2:57 PM
jmittert added inline comments.
440 ↗(On Diff #182329)

Ah, I see what you mean, yeah, that makes sense to do.

I've removed the convertToLocalEncoding and instead added to_unicode, which I think is significantly less confusing.

@jmittert sorry for the long delay, but I'm finally fine with this patch now. I like how it explicitly emphasizes on the Windows/Linux difference. However the patch needs to be rebased against master, can you update it?

This revision was not accepted when it landed; it landed in state Needs Review.Feb 28 2019, 11:16 AM
This revision was automatically updated to reflect the committed changes.