This is an archive of the discontinued LLVM Phabricator instance.

Differential D56754

Add Support for Creating and Deleting Unicode Files and Directories in Lit
ClosedPublic

Authored by jmittert on Jan 15 2019, 4:03 PM.

Details

Reviewers

MaggieYi
rnk
serge-sans-paille
zturner
modocache

Commits

rG54be909aa08d: Add Support for Creating and Deleting Unicode Files and Directories in Lit
rL355122: Add Support for Creating and Deleting Unicode Files and Directories in Lit

Summary

This enables lit to work with unicode file names via mkdir, rm, and redirection. Lit still uses utf-8 internally, but converts to utf-16 on Windows, or just utf-8 bytes on everything else.

Taking my best guess at a reviewer.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 26924
Build 26923: arc lint + arc unit

Event Timeline

jmittert created this revision.Jan 15 2019, 4:03 PM

Herald added a reviewer: serge-sans-paille.Jan 15 2019, 4:03 PM

Herald added subscribers: llvm-commits, delcypher.

Could you upload this with context? The easiest way is to use arcanist (https://llvm.org/docs/Phabricator.html). Otherwise use -U9999 when generating the diff to upload to Phabricator.

Recreated with context/arc diff

Harbormaster completed remote builds in B26888: Diff 181935.Jan 15 2019, 4:50 PM

serge-sans-paille added inline comments.Jan 16 2019, 7:08 AM

utils/lit/lit/TestRunner.py
1109	I'm not 100% sure here, but this looks very redundant and non-maintainable. Why not always open the file in binary mode?

Updated to open the script file in binary mode regardless of platform.

Harbormaster completed remote builds in B26924: Diff 182096.Jan 16 2019, 10:38 AM

serge-sans-paille added inline comments.Jan 17 2019, 1:39 AM

utils/lit/lit/TestRunner.py
1087	That's nice, but now you need to use the `b` prefix for all strings, and for decoding of the joined command, right?

serge-sans-paille requested changes to this revision.Jan 17 2019, 1:39 AM

This revision now requires changes to proceed.Jan 17 2019, 1:39 AM

Ugh, I had broken something somewhere else which caused the tests to look like
they were always passing...

I've now properly fixed the binary mode.
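
As a rough illustration of the binary-mode approach discussed above (not the code from this revision), the sketch below builds the script as text, encodes it once to UTF-8, and writes it with `'wb'` so no platform newline translation occurs. The helper name `write_script` and the temporary file location are assumptions made for this example.

```python
import os
import tempfile


def write_script(path, commands, pipefail=False):
    """Write a shell script as UTF-8 bytes with LF line endings.

    Hypothetical helper for illustration; lit's executeScript differs.
    """
    lines = []
    if pipefail:
        lines.append('set -o pipefail;')
    # Same joining style as the diff: each command in its own { ...; } group.
    lines.append('{ ' + '; } &&\n{ '.join(commands) + '; }')
    data = '\n'.join(lines) + '\n'
    # Binary mode: the bytes written are exactly the bytes encoded here.
    with open(path, 'wb') as f:
        f.write(data.encode('utf-8'))


tmp_dir = tempfile.mkdtemp()
script_path = os.path.join(tmp_dir, 'example.script')
write_script(script_path,
             [u'mkdir -p Output/中文', u'rm -r Output/中文'],
             pipefail=True)

with open(script_path, 'rb') as f:
    written = f.read()
```

Encoding in one place keeps the rest of the function working with ordinary text strings, which avoids sprinkling `b` prefixes over every literal.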

Harbormaster completed remote builds in B26996: Diff 182329.Jan 17 2019, 9:58 AM

serge-sans-paille accepted this revision.Jan 22 2019, 11:34 AM

serge-sans-paille added inline comments.

utils/lit/lit/util.py
444	Bad news: I've been discussing this with a Python core dev, and he acknowledged that this function is ultra dangerous. We should find a way to have everything right without this tricky call. My take would be to first have the whole script work in unicode mode (i.e. in Python3) and then do the minimal work to have it work on Python2. Does that make sense to you?

This revision is now accepted and ready to land.Jan 22 2019, 11:34 AM

serge-sans-paille requested changes to this revision.Jan 22 2019, 11:34 AM

This revision now requires changes to proceed.Jan 22 2019, 11:34 AM

jmittert marked an inline comment as done.Jan 22 2019, 3:06 PM

jmittert added inline comments.

utils/lit/lit/util.py
444	Let me check that I understand correctly. Right now, lit uses strings on python3, which are unicode aware, and strings/bytes on python2, which are not unicode aware; that causes issues when utf8 characters are passed to os.path on Windows. What about alternatively adding a to_unicode, a la the to_string and to_bytes in util.py, which would return str on python3 and unicode on python2? Then adding something like `s = some_string; if windows: s = to_unicode(s); os.path.some_func(s)` before each call to os.path.some_func?

serge-sans-paille added inline comments.Jan 24 2019, 2:42 AM

utils/lit/lit/util.py
444	Your first statement is correct. However, I don't understand why there's anything Windows specific in there. Let's think about it this way: in Python2, strings were used to represent two different kinds of objects: raw bytes and unicode strings. They are different types in Python3. So my advice would be: have the code work in Python3, inserting bytes-to-string conversions wherever needed. These conversions should not be platform specific. Then try to run the same code under Python2 and see where it fails. Does that look like a reasonable path to you?

jmittert marked an inline comment as done.Jan 24 2019, 10:57 AM

jmittert added inline comments.

utils/lit/lit/util.py
444	Hmm, I was under the impression that that was exactly what I've done here. Lit works as of now with python3 on both Windows and Unix. The "convertToLocalEncoding" is essentially the fix up for when I ran it in python2 and saw where it failed. In order for the os.path functions to handle unicode properly on python 2, they need to take a `unicode` string on Windows. Otherwise, the os.path functions create/open a file with the literal utf-8 bytes which causes a broken filename for every other application that expects utf-16 on Windows. On Unix, utf-8 means things work fine in python2 as is. Passing a `unicode` type to os.path on unix/python2 gives an error about not being able to decode to ascii, but just passing bytes works. Right now, with python 2 lit passes utf-8 bytes around (the python2 string/byte type) which works for Unix, but needs converting on Windows. What I have right now shouldn't result in any functional changes except for python2 with Windows.
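
The Unix half of the comment above (UTF-8 bytes passed straight through to the OS) can be illustrated with a small sketch. It is written for Python 3, where `bytes` and `str` are distinct; the scratch directory and variable names are assumptions for this example and do not come from lit.

```python
import os
import tempfile

# Hypothetical scratch directory, converted to a bytes path.
base = os.fsencode(tempfile.mkdtemp())

# UTF-8 bytes for the directory name used in this review's test file.
encoded_name = u'中文'.encode('utf-8')
target = os.path.join(base, encoded_name)

os.mkdir(target)            # bytes path in ...
created = os.listdir(base)  # ... bytes names out, with no decoding step
os.rmdir(target)
os.rmdir(base)
```

Because the bytes are handed to the OS unchanged, no ASCII decode is attempted, which matches the behaviour described for Unix.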

@serge-sans-paille ping

Herald added a project: Restricted Project.Feb 6 2019, 3:18 PM

@jmittert I tried the following (simpler) patch on Linux and it seems to work nicely for both Python2 and Python3

Index: lit/TestRunner.py
===================================================================
--- lit/TestRunner.py	(revision 353501)
+++ lit/TestRunner.py	(working copy)
@@ -345,7 +345,7 @@
     exitCode = 0
     for dir in args:
         if not os.path.isabs(dir):
-            dir = os.path.realpath(os.path.join(cmd_shenv.cwd, dir))
+            dir = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, dir)))
         if parent:
             lit.util.mkdir_p(dir)
         else:
@@ -599,7 +599,7 @@
     exitCode = 0
     for path in args:
         if not os.path.isabs(path):
-            path = os.path.realpath(os.path.join(cmd_shenv.cwd, path))
+            path = os.path.realpath(to_bytes(os.path.join(cmd_shenv.cwd, path)))
         if force and not os.path.exists(path):
             continue
         try:
@@ -695,7 +695,7 @@
         else:
             # Make sure relative paths are relative to the cwd.
             redir_filename = os.path.join(cmd_shenv.cwd, name)
-            fd = open(redir_filename, mode)
+            fd = open(to_bytes(redir_filename), mode)
         # Workaround a Win32 and/or subprocess bug when appending.
         #
         # FIXME: Actually, this is probably an instance of PR6753.

What do you think of this path?

> What do you think of this path?

This doesn't work on Windows with Python 2 because to_bytes doesn't convert the bytes to UTF-16. It will work on Python 3 with Windows because py3 strings are already unicode aware. Running with python 2 creates the garbled ä¸æ–‡ directory on Windows because it tries to interpret the UTF-8 as UTF-16. Running with python 3 properly produces the 中文 directory.

For example, adding a quick test with

# RUN: mkdir -p  c:/Users/jmittertreiner/Output/中文

Produces

S:\build\Ninja-DebugAssert\llbuild-windows-amd64> dir C:\Users\jmittertreiner\Output\                                                                                     
                                                                                                                                                                          
                                                                                                                                                                          
    Directory: C:\Users\jmittertreiner\Output                                                                                                                             
                                                                                                                                                                          
                                                                                                                                                                          
Mode                LastWriteTime         Length Name                                                                                                                     
----                -------------         ------ ----                                                                                                                     
d-----         2/8/2019   9:48 AM                ä¸æ–‡             <-- Running with Python 2                                                                                                      
d-----         2/8/2019   9:49 AM                中文              <-- Running with Python 3
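
The garbled name in the listing can be reproduced without Windows. The sketch below assumes the machine's ANSI code page is Windows-1252 (an assumption; the thread does not say which code page was active) and decodes the UTF-8 bytes of the directory name with it.

```python
# Directory name from the RUN line above.
name = u'中文'

# What lit on Python 2 passes around: the UTF-8 encoding of the name.
utf8_bytes = name.encode('utf-8')

# Assumption: the bytes are read as Windows-1252 before conversion to UTF-16.
misread = utf8_bytes.decode('cp1252')

# Decoding with the right codec recovers the original name.
roundtrip = utf8_bytes.decode('utf-8')
```

One of the six resulting characters is a soft hyphen (U+00AD), which most terminals do not display, so the directory name appears shorter than six characters in the listing.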

compnerd added a subscriber: compnerd.Feb 8 2019, 5:15 PM

compnerd added inline comments.Feb 11 2019, 10:07 AM

utils/lit/lit/util.py
444	@serge-sans-paille this is a nifty workaround for Windows. When a unicode object is passed on the Python side, the C side uses the `W` versions of the file APIs rather than the `A` versions, which is required to access any non-ASCII file path. Additionally, the `W` variants of the APIs permit NT-style paths (`\\?\` prefixed paths), which bypass the Win32 layer and go right to the kernel, avoiding the 260-character (MAX_PATH) limit.
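
For readers unfamiliar with the `\\?\` form mentioned above, here is a minimal sketch of how an absolute Windows path could be rewritten into it. The function name `to_extended_length_path` is hypothetical and nothing like it is part of this revision; `ntpath` is used so the example runs on any platform.

```python
import ntpath


def to_extended_length_path(path):
    """Return an absolute Windows path in its \\\\?\\ prefixed form.

    Hypothetical helper for illustration only.
    """
    if path.startswith('\\\\?\\'):
        return path  # already in extended-length form
    if not ntpath.isabs(path):
        raise ValueError('extended-length paths must be absolute')
    # The prefix disables normalisation, so forward slashes must be
    # converted to backslashes before it is added.
    path = ntpath.normpath(path)
    if path.startswith('\\\\'):
        # UNC share: \\server\share\... becomes \\?\UNC\server\share\...
        return '\\\\?\\UNC\\' + path[2:]
    return '\\\\?\\' + path


local = to_extended_length_path('C:/Users/jmittertreiner/Output')
share = to_extended_length_path('\\\\server\\share\\dir')
```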

jmittert marked an inline comment as done.Feb 12 2019, 1:15 PM

jmittert added inline comments.

utils/lit/lit/util.py
444	@serge-sans-paille @compnerd mentioned to me that you wanted to have this function changed to be called something along the lines of "convertToUnicode". I should mention though that this only converts to Unicode on Windows. On non-Windows platforms it converts to bytes, not the unicode type. Do you think convertToPlatformEncoding might be more descriptive instead?

jmittert marked an inline comment as done.Feb 21 2019, 11:40 AM

jmittert added inline comments.

utils/lit/lit/util.py
444	@serge-sans-paille ping again

efriedma added a subscriber: efriedma.Feb 21 2019, 5:30 PM

efriedma added inline comments.

utils/lit/lit/util.py
440	The inner "if" is redundant; on Python2, `bytes` is an alias for `str`, so both sides have the same meaning. This function seems dangerous in the sense that it isn't clear what type of data it's expecting as input. Probably you should have separate functions for each of the possible cases: input is `bytes`, input is `str`, or input is Python2 `unicode`/Python3 `str`. (Not sure which of those you actually need, but those are the reasonable possibilities, I think.)

jmittert marked an inline comment as done.Feb 22 2019, 10:20 AM

jmittert added inline comments.

utils/lit/lit/util.py
440	I'm fairly certain the if isn't redundant: on python 2, if we get `bytes`(/`str`) we want to decode that to a `unicode`. In python 2, `.decode` returns a `unicode`, not a `str` like in python 3. Part of the reason it has confusing inputs is that the input it gets depends on the python version. It gets bytes/str on python 2 and str on python 3. I figured it was safer to handle all text cases (python 2 `bytes`/`str`/`unicode`, python 3 `str`/`bytes`). At least that way it's idempotent. I do agree it's dangerous in that the type it outputs is different based on the platform and version, but I've tried to limit use of it to only right at the edge before os calls.

efriedma added inline comments.Feb 22 2019, 11:59 AM

utils/lit/lit/util.py
440	> on python 2, if we get bytes(/str) we want to decode that to a unicode

I didn't mean it's a no-op, just that `isinstance(text, str)` and `isinstance(text, bytes)` always return the same result on Python2, so you don't need the explicit version_info check.

> I figured it was safer to handle all text cases

Making the function handle any string-like input is "safer" in the sense that you're less likely to get a runtime error in this function, but it makes the code harder to understand, and it's more likely to lead to confusing results if some other part of the code isn't handling strings consistently.
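
One way to read the suggestion above is a function that accepts exactly one input type and needs no version check. The sketch below is an illustration under that reading, not code from the patch; the name `decode_utf8` is made up for this example.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes to text and reject any other input type.

    `bytes` is the same type as `str` on Python 2, so this single
    isinstance check covers Python 2 `str` and Python 3 `bytes`
    without consulting sys.version_info.
    """
    if not isinstance(data, bytes):
        raise TypeError('expected bytes, got %s' % type(data).__name__)
    return data.decode('utf-8')


decoded = decode_utf8(u'你好.txt'.encode('utf-8'))

# On Python 3, passing text instead of bytes is reported immediately
# rather than being silently passed through.
try:
    decode_utf8(u'你好.txt')
    rejected = False
except TypeError:
    rejected = True
```

Failing loudly on the wrong type trades a possible runtime error for earlier detection of inconsistent string handling elsewhere in the code.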

Rather than using the ambiguous (and not particularly safe)
convertToLocalEncoding, define a to_unicode and use that as well as the
existing to_bytes depending on the platform.
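
A minimal sketch of what that platform split could look like is given below. The names `to_unicode` and `to_bytes` come from the update message above, but their bodies here are assumptions, and `to_os_path` with its `is_windows` parameter is a hypothetical wrapper added only to make the example testable.

```python
import platform


def to_bytes(s):
    """Return s as bytes, encoding text as UTF-8 (assumed behaviour)."""
    if isinstance(s, bytes):
        return s
    return s.encode('utf-8')


def to_unicode(s):
    """Return s as text, decoding UTF-8 bytes (assumed behaviour)."""
    if isinstance(s, bytes):
        return s.decode('utf-8')
    return s


def to_os_path(path, is_windows=None):
    """Hypothetical wrapper: pick the form to hand to os.* calls."""
    if is_windows is None:
        is_windows = platform.system() == 'Windows'
    # Windows: text, so the wide-character file APIs are used.
    # Elsewhere: UTF-8 bytes, passed through to the OS unchanged.
    return to_unicode(path) if is_windows else to_bytes(path)


windows_form = to_os_path(b'\xe4\xb8\xad\xe6\x96\x87', is_windows=True)
posix_form = to_os_path(u'中文', is_windows=False)
```

Keeping the conversion at the call site, right before the OS call, makes the platform difference visible instead of hiding it inside one function whose return type varies.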

Harbormaster completed remote builds in B28432: Diff 187995.Feb 22 2019, 2:54 PM

jmittert marked an inline comment as done.Feb 22 2019, 2:57 PM

jmittert added inline comments.

utils/lit/lit/util.py
440	Ah, I see what you mean, yeah, that makes sense to do. I've removed the convertToLocalEncoding and instead added to_unicode, which I think is significantly less confusing.

@jmittert sorry for the long delay, but I'm finally fine with this patch now. I like how it explicitly emphasizes the Windows/Linux difference. However, the patch needs to be rebased against master. Can you update it?

Rebased!

Harbormaster completed remote builds in B28563: Diff 188424.Feb 26 2019, 11:19 AM

This revision was not accepted when it landed; it landed in state Needs Review.Feb 28 2019, 11:16 AM

Closed by commit rL355122: Add Support for Creating and Deleting Unicode Files and Directories in Lit (authored by serge_sans_paille).

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
utils/lit/lit/TestRunner.py	21 lines
utils/lit/lit/util.py	18 lines
utils/lit/tests/Inputs/shtest-shell/rm-unicode-0.txt	7 lines
utils/lit/tests/shtest-shell.py	1 line

Diff 182096

utils/lit/lit/TestRunner.py

Show First 20 Lines • Show All 338 Lines • ▼ Show 20 Lines	for o, a in opts:
assert False, "unhandled option"		assert False, "unhandled option"

if len(args) == 0:		if len(args) == 0:
raise InternalShellError(cmd, "Error: 'mkdir' is missing an operand")		raise InternalShellError(cmd, "Error: 'mkdir' is missing an operand")

stderr = StringIO()		stderr = StringIO()
exitCode = 0		exitCode = 0
for dir in args:		for dir in args:
		dir = lit.util.convertToLocalEncoding(dir)
		cwd = lit.util.convertToLocalEncoding(cmd_shenv.cwd)
if not os.path.isabs(dir):		if not os.path.isabs(dir):
dir = os.path.realpath(os.path.join(cmd_shenv.cwd, dir))		dir = os.path.realpath(os.path.join(cwd, dir))
if parent:		if parent:
lit.util.mkdir_p(dir)		lit.util.mkdir_p(dir)
else:		else:
try:		try:
os.mkdir(dir)		os.mkdir(dir)
except OSError as err:		except OSError as err:
stderr.write("Error: 'mkdir' command failed, %s\n" % str(err))		stderr.write("Error: 'mkdir' command failed, %s\n" % str(err))
exitCode = 1		exitCode = 1
▲ Show 20 Lines • Show All 236 Lines • ▼ Show 20 Lines	def on_rm_error(func, path, exc_info):
# path contains the path of the file that couldn't be removed		# path contains the path of the file that couldn't be removed
# let's just assume that it's read-only and remove it.		# let's just assume that it's read-only and remove it.
os.chmod(path, stat.S_IMODE( os.stat(path).st_mode) \| stat.S_IWRITE)		os.chmod(path, stat.S_IMODE( os.stat(path).st_mode) \| stat.S_IWRITE)
os.remove(path)		os.remove(path)

stderr = StringIO()		stderr = StringIO()
exitCode = 0		exitCode = 0
for path in args:		for path in args:
		path = lit.util.convertToLocalEncoding(path)
		cwd = lit.util.convertToLocalEncoding(cmd_shenv.cwd)
if not os.path.isabs(path):		if not os.path.isabs(path):
path = os.path.realpath(os.path.join(cmd_shenv.cwd, path))		path = os.path.realpath(os.path.join(cwd, path))
if force and not os.path.exists(path):		if force and not os.path.exists(path):
continue		continue
try:		try:
if os.path.isdir(path):		if os.path.isdir(path):
if not recursive:		if not recursive:
stderr.write("Error: %s is a directory\n" % path)		stderr.write("Error: %s is a directory\n" % path)
exitCode = 1		exitCode = 1
shutil.rmtree(path, onerror = on_rm_error if force else None)		shutil.rmtree(path, onerror = on_rm_error if force else None)
▲ Show 20 Lines • Show All 79 Lines • ▼ Show 20 Lines	for (index, r) in enumerate(redirects):
fd = tempfile.TemporaryFile(mode=mode)		fd = tempfile.TemporaryFile(mode=mode)
elif kIsWindows and name == '/dev/tty':		elif kIsWindows and name == '/dev/tty':
# Simulate /dev/tty on Windows.		# Simulate /dev/tty on Windows.
# "CON" is a special filename for the console.		# "CON" is a special filename for the console.
fd = open("CON", mode)		fd = open("CON", mode)
else:		else:
# Make sure relative paths are relative to the cwd.		# Make sure relative paths are relative to the cwd.
redir_filename = os.path.join(cmd_shenv.cwd, name)		redir_filename = os.path.join(cmd_shenv.cwd, name)
fd = open(redir_filename, mode)		fd = open(lit.util.convertToLocalEncoding(redir_filename), mode)
# Workaround a Win32 and/or subprocess bug when appending.		# Workaround a Win32 and/or subprocess bug when appending.
#		#
# FIXME: Actually, this is probably an instance of PR6753.		# FIXME: Actually, this is probably an instance of PR6753.
if mode == 'a':		if mode == 'a':
fd.seek(0, 2)		fd.seek(0, 2)
# Mutate the underlying redirect list so that we can redirect stdout		# Mutate the underlying redirect list so that we can redirect stdout
# and stderr to the same place without opening the file twice.		# and stderr to the same place without opening the file twice.
r[2] = fd		r[2] = fd
▲ Show 20 Lines • Show All 368 Lines • ▼ Show 20 Lines
def executeScript(test, litConfig, tmpBase, commands, cwd):		def executeScript(test, litConfig, tmpBase, commands, cwd):
bashPath = litConfig.getBashPath()		bashPath = litConfig.getBashPath()
isWin32CMDEXE = (litConfig.isWindows and not bashPath)		isWin32CMDEXE = (litConfig.isWindows and not bashPath)
script = tmpBase + '.script'		script = tmpBase + '.script'
if isWin32CMDEXE:		if isWin32CMDEXE:
script += '.bat'		script += '.bat'

# Write script file		# Write script file
mode = 'w'		f = open(script, 'wb')
if litConfig.isWindows and not isWin32CMDEXE:
mode += 'b' # Avoid CRLFs when writing bash scripts.
f = open(script, mode)
if isWin32CMDEXE:		if isWin32CMDEXE:
for i, ln in enumerate(commands):		for i, ln in enumerate(commands):
commands[i] = re.sub(kPdbgRegex, "echo '\\1' > nul && ", ln)		commands[i] = re.sub(kPdbgRegex, "echo '\\1' > nul && ", ln)
if litConfig.echo_all_commands:		if litConfig.echo_all_commands:
f.write('@echo on\n')		f.write('@echo on\n')
else:		else:
f.write('@echo off\n')		f.write('@echo off\n')
f.write('\n@if %ERRORLEVEL% NEQ 0 EXIT\n'.join(commands))		f.write('\n@if %ERRORLEVEL% NEQ 0 EXIT\n'.join(commands))
else:		else:
for i, ln in enumerate(commands):		for i, ln in enumerate(commands):
commands[i] = re.sub(kPdbgRegex, ": '\\1'; ", ln)		commands[i] = re.sub(kPdbgRegex, ": '\\1'; ", ln)
if test.config.pipefail:		if test.config.pipefail:
f.write('set -o pipefail;')		f.write('set -o pipefail;\n')
if litConfig.echo_all_commands:		if litConfig.echo_all_commands:
f.write('set -x;')		f.write('set -x;\n')
f.write('{ ' + '; } &&\n{ '.join(commands) + '; }')		f.write('{ ' + '; } &&\n{ '.join(commands) + '; }\n')
f.write('\n')		f.write('\n')
f.close()		f.close()

if isWin32CMDEXE:		if isWin32CMDEXE:
command = ['cmd','/c', script]		command = ['cmd','/c', script]
else:		else:
if bashPath:		if bashPath:
command = [bashPath, script]		command = [bashPath, script]
else:		else:
command = ['/bin/sh', script]		command = ['/bin/sh', script]
if litConfig.useValgrind:		if litConfig.useValgrind:
# FIXME: Running valgrind on sh is overkill. We probably could just		# FIXME: Running valgrind on sh is overkill. We probably could just
# run on clang with no real loss.		# run on clang with no real loss.
command = litConfig.valgrindArgs + command		command = litConfig.valgrindArgs + command
▲ Show 20 Lines • Show All 467 Lines • Show Last 20 Lines

utils/lit/lit/util.py

Show First 20 Lines • Show All 418 Lines • ▼ Show 20 Lines	try:
for child in children_iterator:		for child in children_iterator:
try:		try:
child.kill()		child.kill()
except psutil.NoSuchProcess:		except psutil.NoSuchProcess:
pass		pass
psutilProc.kill()		psutilProc.kill()
except psutil.NoSuchProcess:		except psutil.NoSuchProcess:
pass		pass

		def convertToLocalEncoding(text):
		"""This function converts utf-8 text into a format which the local system prefers
		to work with.
		"""
		if platform.system() == 'Windows':
		if sys.version_info < (3,0):
		# On Windows and Python2, we want to use 'unicode' so it gets converted
		# to UTF-16
		return text.decode('utf-8') if isinstance(text, str) else test
		else:
		# On Windows and Python3, we want to use a unicode string so it gets
		# converted to UTF-16
		return text.decode('utf-8') if isinstance(text, bytes) else test

		# On non Windows, we just want bytes so python don't try to
		# convert it to ascii
		return text if isinstance(text, bytes) else test.encode('utf-8')

utils/lit/tests/Inputs/shtest-shell/rm-unicode-0.txt

This file was added.

				# Check creating and removing unicode files
				#
				# RUN: mkdir -p Output/中文
				# RUN: echo "" > Output/中文/你好.txt
				# RUN: rm Output/中文/你好.txt
				# RUN: echo "" > Output/中文/你好.txt
				# RUN: rm -r Output/中文

utils/lit/tests/shtest-shell.py

	Show First 20 Lines • Show All 218 Lines • ▼ Show 20 Lines
	# CHECK: error: command failed with exit status: 1			# CHECK: error: command failed with exit status: 1
	# CHECK: ***			# CHECK: ***

	# CHECK: FAIL: shtest-shell :: rm-error-3.txt			# CHECK: FAIL: shtest-shell :: rm-error-3.txt
	# CHECK: * TEST 'shtest-shell :: rm-error-3.txt' FAILED *			# CHECK: * TEST 'shtest-shell :: rm-error-3.txt' FAILED *
	# CHECK: Exit Code: 1			# CHECK: Exit Code: 1
	# CHECK: ***			# CHECK: ***

				# CHECK: PASS: shtest-shell :: rm-unicode-0.txt
	# CHECK: PASS: shtest-shell :: sequencing-0.txt			# CHECK: PASS: shtest-shell :: sequencing-0.txt
	# CHECK: XFAIL: shtest-shell :: sequencing-1.txt			# CHECK: XFAIL: shtest-shell :: sequencing-1.txt
	# CHECK: PASS: shtest-shell :: valid-shell.txt			# CHECK: PASS: shtest-shell :: valid-shell.txt
	# CHECK: Failing Tests (27)			# CHECK: Failing Tests (27)
