Details

Reviewers

MaggieYi
rnk
serge-sans-paille
zturner
modocache

Commits

rG54be909aa08d: Add Support for Creating and Deleting Unicode Files and Directories in Lit
rL355122: Add Support for Creating and Deleting Unicode Files and Directories in Lit

Summary

This enables lit to work with unicode file names via mkdir, rm, and redirection. Lit still uses utf-8 internally, but converts to utf-16 on Windows, or just utf-8 bytes on everything else.

Taking my best guess at a reviewer.

Diff Detail

Repository: rL LLVM

Event Timeline

jmittert created this revision. Jan 15 2019, 4:03 PM

Herald added a reviewer: serge-sans-paille. Jan 15 2019, 4:03 PM

Herald added subscribers: llvm-commits, delcypher.

Could you upload this with context? The easiest way is to use arcanist (https://llvm.org/docs/Phabricator.html). Otherwise use -U9999 when generating the diff to upload to Phabricator.

Recreated with context/arc diff

Harbormaster completed remote builds in B26888: Diff 181935. Jan 15 2019, 4:50 PM

serge-sans-paille added inline comments. Jan 16 2019, 7:08 AM

utils/lit/lit/TestRunner.py
1110 ↗	(On Diff #181935)	I'm not 100% sure here, but this looks very redundant and non-maintainable. Why not always open the file in binary mode?

Updated to open the script file in binary mode regardless of platform.

Harbormaster completed remote builds in B26924: Diff 182096. Jan 16 2019, 10:38 AM

serge-sans-paille added inline comments. Jan 17 2019, 1:39 AM

utils/lit/lit/TestRunner.py
1087 ↗	(On Diff #182096)	That's nice, but now you need to use the `b` prefix for all strings, and for decoding of the joined command, right?

serge-sans-paille requested changes to this revision. Jan 17 2019, 1:39 AM

This revision now requires changes to proceed. Jan 17 2019, 1:39 AM

Ugh, I had broken something somewhere else which caused the tests to look like
they were always passing...

I've now properly fixed the binary mode.
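
For readers following the binary-mode discussion, here is a minimal sketch of the issue on Python 3: a file opened with 'wb' only accepts bytes, so the joined command text has to be encoded before it is written. The helper name `write_shell_script` and its `binary` flag are hypothetical stand-ins for illustration, not the code in TestRunner.py.

```python
import os
import tempfile


def write_shell_script(path, commands, binary):
    """Write `commands` as a brace-grouped bash script.

    `binary` is a hypothetical flag standing in for a `mode == 'wb'` check.
    """
    body = '{ ' + '; } &&\n{ '.join(commands) + '; }\n'
    if binary:
        # A file opened with 'wb' accepts only bytes on Python 3, so the
        # text is encoded once, right before the write. Binary mode also
        # keeps '\n' from being translated to '\r\n' on Windows.
        with open(path, 'wb') as f:
            f.write(body.encode('utf-8'))
    else:
        with open(path, 'w') as f:
            f.write(body)


script = os.path.join(tempfile.mkdtemp(), 'script.sh')
write_shell_script(script,
                   ['mkdir -p Output/中文', 'rm -r Output/中文'],
                   binary=True)
with open(script, 'rb') as f:
    written = f.read()
```

Encoding at the single write site avoids sprinkling `b` prefixes over every literal that feeds the script.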

Harbormaster completed remote builds in B26996: Diff 182329. Jan 17 2019, 9:58 AM

serge-sans-paille accepted this revision. Jan 22 2019, 11:34 AM

serge-sans-paille added inline comments.

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	Bad news: I've been discussing with a Python Core Dev and he acknowledged this function to be ultra dangerous. We should find a way to have everything right without this tricky call. My take would be to first have the whole script work in unicode mode (i.e. in Python3) and then do the minimal work to have it work on Python2. Does that make sense to you?

This revision is now accepted and ready to land. Jan 22 2019, 11:34 AM

serge-sans-paille requested changes to this revision. Jan 22 2019, 11:34 AM

This revision now requires changes to proceed. Jan 22 2019, 11:34 AM

jmittert marked an inline comment as done. Jan 22 2019, 3:06 PM

jmittert added inline comments.

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	Let me check that I understand correctly. Right now, lit uses strings on python3 which are unicode aware, and strings/bytes on python2 which are not unicode aware, which causes issues when utf8 characters are passed to os.path on Windows. What about alternatively adding a to_unicode a la the to_string and to_bytes in util.py which would return str on python3, and unicode on python2? Then adding something like

    s = some_string
    if windows:
        s = to_unicode(s)
    os.path.some_func(s)

before each call to os.path.some_func?

serge-sans-paille added inline comments. Jan 24 2019, 2:42 AM

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	Your first statement is correct. However, I don't understand why there's anything Windows-specific in there. Let's think about it this way: in Python2, strings were used to represent two different kinds of objects: raw bytes and unicode strings. They are different types in Python3. So my advice would be: (1) have the code work in Python3, inserting bytes-to-string conversions whenever needed; these conversions should not be platform specific; (2) try to run the same code under Python2 and see where it fails. Does that look like a reasonable path to you?

jmittert marked an inline comment as done. Jan 24 2019, 10:57 AM

jmittert added inline comments.

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	Hmm, I was under the impression that that was exactly what I've done here. Lit works as of now with python3 on both Windows and Unix. The "convertToLocalEncoding" is essentially the fix up for when I ran it in python2 and saw where it failed. In order for the os.path functions to handle unicode properly on python 2, they need to take a `unicode` string on Windows. Otherwise, the os.path functions create/open a file with the literal utf-8 bytes which causes a broken filename for every other application that expects utf-16 on Windows. On Unix, utf-8 means things work fine in python2 as is. Passing a `unicode` type to os.path on unix/python2 gives an error about not being able to decode to ascii, but just passing bytes works. Right now, with python 2 lit passes utf-8 bytes around (the python2 string/byte type) which works for Unix, but needs converting on Windows. What I have right now shouldn't result in any functional changes except for python2 with Windows.

@serge-sans-paille ping

Herald added a project: Restricted Project. Feb 6 2019, 3:18 PM

@jmittert I tried the following (simpler) patch on Linux and it seems to work nicely for both Python2 and Python3

Index: lit/TestRunner.py
===================================================================
--- lit/TestRunner.py	(revision 353501)
+++ lit/TestRunner.py	(working copy)
@@ -345,7 +345,7 @@
     exitCode = 0
     for dir in args:
         if not os.path.isabs(dir):
-            dir = os.path.realpath(os.path.join(cmd_shenv.cwd, dir))
+            dir = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, dir)))
         if parent:
             lit.util.mkdir_p(dir)
         else:
@@ -599,7 +599,7 @@
     exitCode = 0
     for path in args:
         if not os.path.isabs(path):
-            path = os.path.realpath(os.path.join(cmd_shenv.cwd, path))
+            path = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, path)))
         if force and not os.path.exists(path):
             continue
         try:
@@ -695,7 +695,7 @@
         else:
             # Make sure relative paths are relative to the cwd.
             redir_filename = os.path.join(cmd_shenv.cwd, name)
-            fd = open(redir_filename, mode)
+            fd = open(to_bytes(redir_filename), mode)
         # Workaround a Win32 and/or subprocess bug when appending.
         #
         # FIXME: Actually, this is probably an instance of PR6753.

What do you think of this path?

What do you think of this path?

This doesn't work on Windows with Python 2 because to_bytes doesn't convert the bytes to UTF16. It will work on Python 3 with Windows because py3 strings are already unicode aware. Running with python 2 creates the garbled ä¸æ–‡ directory on Windows because it tries to interpret the UTF8 as UTF16. Running with python 3 properly produces the 中文 directory.

For example, adding a quick test with

# RUN: mkdir -p  c:/Users/jmittertreiner/Output/中文

Produces

S:\build\Ninja-DebugAssert\llbuild-windows-amd64> dir C:\Users\jmittertreiner\Output\


    Directory: C:\Users\jmittertreiner\Output


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----         2/8/2019   9:48 AM                ä¸æ–‡             <-- Running with Python 2
d-----         2/8/2019   9:49 AM                中文              <-- Running with Python 3
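
The garbled name in the listing can be reproduced without a Windows machine. The sketch below assumes the byte string is read back in the Windows-1252 code page (an assumption about that machine's ANSI code page, not something stated in the thread): the six UTF-8 bytes of 中文 then come out as six unrelated Latin characters, one of which is an invisible soft hyphen.

```python
# -*- coding: utf-8 -*-
name = u'中文'

# What lit passes around on Python 2: the UTF-8 encoding of the name.
utf8_bytes = name.encode('utf-8')

# Assumption: a byte-oriented file API reads those bytes as Windows-1252.
garbled = utf8_bytes.decode('cp1252')

# Decoding with the right codec recovers the original name.
round_trip = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))
print(repr(garbled))
```

Handing the OS a unicode object instead of the encoded bytes sidesteps the misreading entirely, which is what the Python 3 run in the listing does.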

compnerd added a subscriber: compnerd. Feb 8 2019, 5:15 PM

compnerd added inline comments. Feb 11 2019, 10:07 AM

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	@serge-sans-paille this is a nifty workaround for Windows. When a unicode object is passed on the Python side, the C side will use the `W` versions of the file APIs rather than the `A` versions, which is required to access any non-ASCII file path. Additionally, the use of the `W` variant of the APIs permits the use of NT style paths (`\\?\` prefixed paths) which bypass the Win32 layer and go right to the kernel to avoid the 260 character limit.
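
As an aside on the `\\?\` paths mentioned above, here is a rough sketch of how such a prefix could be applied to a unicode path. The helper `to_extended_length_path` is hypothetical and is not part of lit or of this patch; it uses `ntpath` so that it behaves the same on any host platform. Because the prefix turns off the usual Win32 path normalization, the path is normalized and required to be absolute first.

```python
import ntpath

EXTENDED_PREFIX = u'\\\\?\\'


def to_extended_length_path(path):
    """Hypothetical helper: add the extended-length prefix to a Windows path."""
    if path.startswith(EXTENDED_PREFIX):
        return path
    # Forward slashes and '..' are not interpreted once the prefix is
    # present, so normalize before prefixing.
    path = ntpath.normpath(path)
    if not ntpath.isabs(path):
        raise ValueError('extended-length paths must be absolute')
    if path.startswith(u'\\\\'):
        # UNC paths take the form \\?\UNC\server\share\...
        return EXTENDED_PREFIX + u'UNC\\' + path[2:]
    return EXTENDED_PREFIX + path


local = to_extended_length_path(u'C:/Users/example/Output/中文')
remote = to_extended_length_path(u'//server/share/dir')
```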

jmittert marked an inline comment as done. Feb 12 2019, 1:15 PM

jmittert added inline comments.

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	@serge-sans-paille @compnerd mentioned to me that you wanted to have this function changed to be called something along the lines of "convertToUnicode". I should mention, though, that this only converts to Unicode on Windows; on non-Windows platforms it converts to bytes, not the unicode type. Do you think convertToPlatformEncoding would be more descriptive?

jmittert marked an inline comment as done. Feb 21 2019, 11:40 AM

jmittert added inline comments.

utils/lit/lit/util.py
444 ↗	(On Diff #182329)	@serge-sans-paille ping again

efriedma added a subscriber: efriedma. Feb 21 2019, 5:30 PM

efriedma added inline comments.

utils/lit/lit/util.py
440 ↗	(On Diff #182329)	The inner "if" is redundant; on Python2, `bytes` is an alias for `str`, so both sides have the same meaning. This function seems dangerous in the sense that it isn't clear what type of data it's expecting as input. Probably you should have separate functions for each of the possible cases: input is `bytes`, input is `str`, or input is Python2 `unicode`/Python3 `str`. (Not sure which of those you actually need, but those are the reasonable possibilities, I think.)

jmittert marked an inline comment as done. Feb 22 2019, 10:20 AM

jmittert added inline comments.

utils/lit/lit/util.py
440 ↗	(On Diff #182329)	I'm fairly certain the if isn't redundant: on python 2, if we get `bytes`(/`str`) we want to decode that to a `unicode`. In python 2, `.decode` returns a `unicode`, not a `str` like in python 3. Part of the reason it has confusing inputs is that the input it gets depends on the python version. It gets bytes/str on python 2 and str on python 3. I figured it was safer to handle all text cases (python 2 `bytes`/`str`/`unicode`, python 3 `str`/`bytes`). At least that way it's idempotent. I do agree it's dangerous in that the type it outputs is different based on the platform and version, but I've tried to limit use of it to only right at the edge before os calls.

efriedma added inline comments. Feb 22 2019, 11:59 AM

utils/lit/lit/util.py
440 ↗	(On Diff #182329)	"on python 2, if we get bytes(/str) we want to decode that to a unicode": I didn't mean it's a no-op, just that `isinstance(text, str)` and `isinstance(text, bytes)` always return the same result on Python2, so you don't need the explicit version_info check. "I figured it was safer to handle all text cases": Making the function handle any string-like input is "safer" in the sense that you're less likely to get a runtime error in this function, but it makes the code harder to understand, and it's more likely to lead to confusing results if some other part of the code isn't handling strings consistently.
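
To make the point about the version check concrete, here is a small sketch comparing a version-gated decode with a plain `isinstance(text, bytes)` check. Both function names are hypothetical, and the first is only a guess at the shape of the code under discussion, not a copy of Diff #182329. Since `bytes` is an alias for `str` on Python 2, the two should agree on every input under either interpreter.

```python
import sys


def decode_with_version_check(text):
    # Version-gated form: branch on the interpreter, then on the type.
    if sys.version_info[0] == 2:
        if isinstance(text, str):
            return text.decode('utf-8')
    elif isinstance(text, bytes):
        return text.decode('utf-8')
    return text


def decode_without_version_check(text):
    # `bytes` is an alias for `str` on Python 2, so one check covers both.
    if isinstance(text, bytes):
        return text.decode('utf-8')
    return text


samples = [b'\xe4\xb8\xad\xe6\x96\x87', u'\u4e2d\u6587', b'plain', u'plain']
results = [(decode_with_version_check(s), decode_without_version_check(s))
           for s in samples]
```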

Rather than using the ambiguous (and not particularly safe)
convertToLocalEncoding, define a to_unicode and use that as well as the
existing to_bytes depending on the platform.

Harbormaster completed remote builds in B28432: Diff 187995. Feb 22 2019, 2:54 PM

jmittert marked an inline comment as done. Feb 22 2019, 2:57 PM

jmittert added inline comments.

utils/lit/lit/util.py
440 ↗	(On Diff #182329)	Ah, I see what you mean, yeah, that makes sense to do. I've removed the convertToLocalEncoding and instead added to_unicode, which I think is significantly less confusing.
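
A rough sketch of the resulting pattern, for illustration: keep text as-is internally and convert only at the edge, right before an os call, choosing `to_unicode` on Windows and `to_bytes` elsewhere. The wrapper `os_path` is a hypothetical name, and `to_bytes` here is a simplified stand-in for the helper in lit/util.py.

```python
# -*- coding: utf-8 -*-
import os
import platform
import tempfile

kIsWindows = platform.system() == 'Windows'


def to_bytes(s):
    # Simplified stand-in for lit.util.to_bytes.
    if isinstance(s, bytes):
        return s
    return s.encode('utf-8')


def to_unicode(s):
    # Taken for 'str' and 'bytes' on Python 2, only for 'bytes' on Python 3.
    if isinstance(s, bytes):
        return s.decode('utf-8')
    return s


def os_path(path):
    """Hypothetical wrapper: the type the OS file APIs handle correctly."""
    return to_unicode(path) if kIsWindows else to_bytes(path)


# Convert both components so os.path.join never mixes bytes and text.
base = os_path(tempfile.mkdtemp())
target = os.path.join(base, os_path(u'中文'))
os.mkdir(target)
created = os.listdir(base)  # entries come back in the same type as `base`
os.rmdir(target)
os.rmdir(base)
```

Converting the cwd and the argument together matters because `os.path.join` raises on mixed bytes and text under Python 3.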

@jmittert sorry for the long delay, but I'm finally fine with this patch now. I like how it explicitly emphasizes the Windows/Linux difference. However, the patch needs to be rebased against master; can you update it?

Rebased!

Harbormaster completed remote builds in B28563: Diff 188424. Feb 26 2019, 11:19 AM

This revision was not accepted when it landed; it landed in state Needs Review. Feb 28 2019, 11:16 AM

Closed by commit rL355122: Add Support for Creating and Deleting Unicode Files and Directories in Lit (authored by serge_sans_paille).

This revision was automatically updated to reflect the committed changes.

Diff 188763

llvm/trunk/utils/lit/lit/TestRunner.py

@@ ... @@ try:
     from StringIO import StringIO
 except ImportError:
     from io import StringIO

 from lit.ShCommands import GlobItem
 import lit.ShUtil as ShUtil
 import lit.Test as Test
 import lit.util
-from lit.util import to_bytes, to_string
+from lit.util import to_bytes, to_string, to_unicode
 from lit.BooleanExpression import BooleanExpression

 class InternalShellError(Exception):
     def __init__(self, command, message):
         self.command = command
         self.message = message

 kIsWindows = platform.system() == 'Windows'
@@ ... @@ for o, a in opts:
             assert False, "unhandled option"

     if len(args) == 0:
         raise InternalShellError(cmd, "Error: 'mkdir' is missing an operand")

     stderr = StringIO()
     exitCode = 0
     for dir in args:
+        cwd = cmd_shenv.cwd
+        dir = to_unicode(dir) if kIsWindows else to_bytes(dir)
+        cwd = to_unicode(cwd) if kIsWindows else to_bytes(cwd)
         if not os.path.isabs(dir):
-            dir = os.path.realpath(os.path.join(cmd_shenv.cwd, dir))
+            dir = os.path.realpath(os.path.join(cwd, dir))
         if parent:
             lit.util.mkdir_p(dir)
         else:
             try:
                 os.mkdir(dir)
             except OSError as err:
                 stderr.write("Error: 'mkdir' command failed, %s\n" % str(err))
                 exitCode = 1
@@ ... @@ def on_rm_error(func, path, exc_info):
         # path contains the path of the file that couldn't be removed
         # let's just assume that it's read-only and remove it.
         os.chmod(path, stat.S_IMODE( os.stat(path).st_mode) | stat.S_IWRITE)
         os.remove(path)

     stderr = StringIO()
     exitCode = 0
     for path in args:
+        cwd = cmd_shenv.cwd
+        path = to_unicode(path) if kIsWindows else to_bytes(path)
+        cwd = to_unicode(cwd) if kIsWindows else to_bytes(cwd)
         if not os.path.isabs(path):
-            path = os.path.realpath(os.path.join(cmd_shenv.cwd, path))
+            path = os.path.realpath(os.path.join(cwd, path))
         if force and not os.path.exists(path):
             continue
         try:
             if os.path.isdir(path):
                 if not recursive:
                     stderr.write("Error: %s is a directory\n" % path)
                     exitCode = 1
                 shutil.rmtree(path, onerror = on_rm_error if force else None)
@@ ... @@ for (index, r) in enumerate(redirects):
             fd = tempfile.TemporaryFile(mode=mode)
         elif kIsWindows and name == '/dev/tty':
             # Simulate /dev/tty on Windows.
             # "CON" is a special filename for the console.
             fd = open("CON", mode)
         else:
             # Make sure relative paths are relative to the cwd.
             redir_filename = os.path.join(cmd_shenv.cwd, name)
+            redir_filename = to_unicode(redir_filename) \
+                    if kIsWindows else to_bytes(redir_filename)
             fd = open(redir_filename, mode)
         # Workaround a Win32 and/or subprocess bug when appending.
         #
         # FIXME: Actually, this is probably an instance of PR6753.
         if mode == 'a':
             fd.seek(0, 2)
         # Mutate the underlying redirect list so that we can redirect stdout
         # and stderr to the same place without opening the file twice.
@@ ... @@ if isWin32CMDEXE:
             f.write('@echo on\n')
         else:
             f.write('@echo off\n')
         f.write('\n@if %ERRORLEVEL% NEQ 0 EXIT\n'.join(commands))
     else:
         for i, ln in enumerate(commands):
             commands[i] = re.sub(kPdbgRegex, ": '\\1'; ", ln)
         if test.config.pipefail:
-            f.write('set -o pipefail;')
+            f.write(b'set -o pipefail;' if mode == 'wb' else 'set -o pipefail;')
         if litConfig.echo_all_commands:
-            f.write('set -x;')
+            f.write(b'set -x;' if mode == 'wb' else 'set -x;')
-        f.write('{ ' + '; } &&\n{ '.join(commands) + '; }')
-        f.write('\n')
+        if sys.version_info > (3,0) and mode == 'wb':
+            f.write(bytes('{ ' + '; } &&\n{ '.join(commands) + '; }', 'utf-8'))
+        else:
+            f.write('{ ' + '; } &&\n{ '.join(commands) + '; }')
+        f.write(b'\n' if mode == 'wb' else '\n')
     f.close()

     if isWin32CMDEXE:
         command = ['cmd','/c', script]
     else:
         if bashPath:
             command = [bashPath, script]
         else:

llvm/trunk/utils/lit/lit/util.py

@@ ... @@ def to_string(b):
     # 'unicode' type in Python3 (all the Python3 cases were already handled). In
     # order to get a 'str' object, we need to encode the 'unicode' object.
     try:
         return b.encode('utf-8')
     except AttributeError:
         raise TypeError('not sure how to convert %s to %s' % (type(b), str))


+def to_unicode(s):
+    """Return the parameter as type which supports unicode, possibly decoding
+    it.
+
+    In Python2, this is the unicode type. In Python3 it's the str type.
+
+    """
+    if isinstance(s, bytes):
+        # In Python2, this branch is taken for both 'str' and 'bytes'.
+        # In Python3, this branch is taken only for 'bytes'.
+        return s.decode('utf-8')
+    return s
+
+
 def detectCPUs():
     """Detects the number of CPUs on a system.

     Cribbed from pp.

     """
     # Linux, Unix and MacOS:
     if hasattr(os, 'sysconf'):

llvm/trunk/utils/lit/tests/Inputs/shtest-shell/rm-unicode-0.txt

@@ -0,0 +1,7 @@
+# Check removing unicode
+#
+# RUN: mkdir -p Output/中文
+# RUN: echo "" > Output/中文/你好.txt
+# RUN: rm Output/中文/你好.txt
+# RUN: echo "" > Output/中文/你好.txt
+# RUN: rm -r Output/中文

llvm/trunk/utils/lit/tests/shtest-shell.py

@@ ... @@
 # CHECK: error: command failed with exit status: 1
 # CHECK: ***

 # CHECK: FAIL: shtest-shell :: rm-error-3.txt
 # CHECK: * TEST 'shtest-shell :: rm-error-3.txt' FAILED *
 # CHECK: Exit Code: 1
 # CHECK: ***

+# CHECK: PASS: shtest-shell :: rm-unicode-0.txt
 # CHECK: PASS: shtest-shell :: sequencing-0.txt
 # CHECK: XFAIL: shtest-shell :: sequencing-1.txt
 # CHECK: PASS: shtest-shell :: valid-shell.txt
 # CHECK: Failing Tests (27)

This is an archive of the discontinued LLVM Phabricator instance.

Add Support for Creating and Deleting Unicode Files and Directories in Lit
ClosedPublic
