This is an archive of the discontinued LLVM Phabricator instance.

[lit] Fix some convoluted logic around Unicode encoding, and de-duplicate across modules that used it.
ClosedPublic

Authored by dlj on Jun 28 2017, 5:33 PM.

Details

Reviewers

zturner
modocache

Commits

rGd59c9cd539ab: [lit] Fix some convoluted logic around Unicode encoding, and de-duplicate…
rL306625: [lit] Fix some convoluted logic around Unicode encoding, and de-duplicate…

Summary

In Python2 and Python3, the various (non-)?Unicode string types are sort of
spaghetti. Python2 has unicode support tacked on via the 'unicode' type, which
is distinct from 'str' (which are bytes). Python3 takes the "unicode-everywhere"
approach, with 'str' representing a Unicode string.

Both have a 'bytes' type. In Python3, it is the only way to represent raw bytes.
However, in Python2, 'bytes' is an alias for 'str'. This leads to interesting
problems when an interface requires a precise type, but has to run under both
Python2 and Python3.

The previous logic appeared to be correct in all cases, but went through more
layers of indirection than necessary. This change does the necessary conversions
in one shot, with documentation about which paths might be taken in Python2 or
Python3.
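The type relationships described above can be checked on whichever interpreter is running. This is only an illustrative sketch, not part of the change; the helper name `describe_string_types` is hypothetical.

```python
import sys

def describe_string_types():
    """Report how 'str' and 'bytes' relate on the running interpreter."""
    return {
        # True on Python2 ('bytes' is an alias for 'str'), False on Python3.
        'bytes_is_str': bytes is str,
        # A plain string literal is 'bytes' on Python2 but text on Python3.
        'literal_is_bytes': isinstance('abc', bytes),
        'major_version': sys.version_info[0],
    }

info = describe_string_types()
print(info)
```

Code that needs a precise type therefore cannot rely on a single isinstance check meaning the same thing under both versions.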

Diff Detail

Repository: rL LLVM

Event Timeline

dlj created this revision. Jun 28 2017, 5:33 PM

Herald added a subscriber: sanjoy. Jun 28 2017, 5:33 PM

LGTM! Thanks for all the cleanups!

utils/lit/lit/util.py
51 ↗	(On Diff #104554)	Typo, I think: "uncode" should be "unicode".

This revision is now accepted and ready to land. Jun 28 2017, 5:55 PM

Fix spelling: uncode -> unicode.

utils/lit/lit/util.py
51 ↗	(On Diff #104554)	Ah, good catch. You are correct.

Closed by commit rL306625: [lit] Fix some convoluted logic around Unicode encoding, and de-duplicate… (authored by dlj). Jun 28 2017, 6:04 PM

This revision was automatically updated to reflect the committed changes.

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

In D34793#794757, @chapuni wrote:

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

Sorry for the trouble, reverted.

In D34793#794765, @dlj wrote:

In D34793#794757, @chapuni wrote:

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

Sorry for the trouble, reverted.

I re-applied this in r306643. There are some tests that simply yield binary output, and I was missing a conversion in googletest. I verified with Python 2.7 and Python 3.4.3, both on Linux.
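The kind of missing conversion mentioned here can be illustrated as follows. This is a sketch under assumptions, not the code from r306643: the helper name `list_lines` is hypothetical, and the child process merely stands in for a gtest executable. On Python3, subprocess.check_output returns 'bytes', so text operations such as `'x' in line` raise a TypeError unless the output is decoded first.

```python
import subprocess
import sys

def list_lines(args):
    """Run a command and return its stdout as a list of text lines."""
    output = subprocess.check_output(args)
    # On Python3 the output is 'bytes'; on Python2 it is already 'str'.
    if not isinstance(output, str):
        output = output.decode('utf-8')
    # splitlines() drops the line terminators, including '\r\n'.
    return output.splitlines()

# Stand-in for a gtest binary: print two lines of listing-like text.
lines = list_lines([sys.executable, '-c',
                    "print('FooTest.'); print('  Bar')"])
print(lines)
```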

@dlj Great, thanks!

Seems it also fixes D34464.

The change LGTM, but please keep it on topic (see below).

llvm/trunk/utils/lit/lit/util.py
70–71	This isn't really related to str vs unicode is it? Can you please commit such changes separately and not "hide" them in a change such as this so we can revert and reason about them separately if necessary? (That said this particular change seems simple enough to me that post-commit review is enough without phab).

dlj marked an inline comment as done. Jun 28 2017, 10:57 PM

dlj added inline comments.

llvm/trunk/utils/lit/lit/util.py
70–71	It's actually quite related... the capture() function basically did nothing but call subprocess.Popen, then pass the output through the strange encode/decode loop. So this was the only remaining (indirect) usage of convert_string.
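For illustration, the behaviour that capture() provided (raise CalledProcessError on a non-zero exit, keep the process output) is available from subprocess.check_output directly. The wrapper name `run_or_report` below is hypothetical and exists only for this sketch.

```python
import subprocess
import sys

def run_or_report(args):
    """Return (ok, output) using subprocess.check_output directly."""
    try:
        # stderr=subprocess.STDOUT folds stderr into the captured output.
        out = subprocess.check_output(args, stderr=subprocess.STDOUT)
        return True, out
    except subprocess.CalledProcessError as exc:
        # exc.output holds whatever the process wrote before exiting.
        return False, exc.output

ok, out = run_or_report([sys.executable, '-c', 'import sys; sys.exit(3)'])
print(ok)
```

Note that the output is returned undecoded, so any caller that needs text still has to convert it.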

In D34793#794883, @chapuni wrote:

@dlj Great, thanks!

Seems it also fixes D34464.

Interesting... to_string now has to fall back to str(bytes) in Python3 when there is an invalid input. In that case, the resulting string looks more like the output of repr(), which is not what one would want for a filename.
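A short sketch of the fallback behaviour described above, using a made-up filename that is not valid UTF-8. The `errors='replace'` variant is shown only as one possible alternative and is an assumption, not something this change does.

```python
invalid = b'caf\xe9.txt'  # Latin-1 encoded name; not valid UTF-8.

try:
    decoded = invalid.decode('utf-8')
except UnicodeDecodeError:
    decoded = None

# On Python3, str() of a 'bytes' object gives repr-like text such as
# "b'caf\\xe9.txt'", which is not usable as a filename.
# On Python2, str() simply returns the same byte string.
fallback = str(invalid)

# Possible alternative (assumption): substitute U+FFFD for bad bytes.
lossy = invalid.decode('utf-8', errors='replace')

print(repr(fallback))
print(repr(lossy))
```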

It's not clear to me why Python's behaviour of treating *filenames* as unicode is actually the right choice.

Strictly speaking, I think the only well-defined filename encoding that covers all platforms targeted by Clang is the one defined by the Posix spec:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282

(But of course, our supported OSes do support broader character sets.)

I'll think more about what to_string should do, but I'll also leave a comment on the other review thread.
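As a rough illustration of how narrow that baseline is, the sketch below checks a name against the POSIX portable filename character set. It assumes the linked section is the one defining that set (A-Z, a-z, 0-9, period, underscore, hyphen); the function name `is_portable_filename` is hypothetical.

```python
import re

# Assumption: the portable set is A-Z, a-z, 0-9, '.', '_' and '-'.
# \Z anchors at the very end, so a trailing newline is rejected too.
_PORTABLE_NAME = re.compile(r'[A-Za-z0-9._-]+\Z')

def is_portable_filename(name):
    """Return True if every character of 'name' is in the portable set."""
    return _PORTABLE_NAME.match(name) is not None

print(is_portable_filename('util.py'))
print(is_portable_filename(u'caf\xe9.txt'))
```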

MatzeB added inline comments. Jun 29 2017, 10:23 AM

llvm/trunk/utils/lit/lit/util.py
70–71	ok, makes sense.

Revision Contents

Path	Size
llvm/trunk/utils/lit/lit/formats/googletest.py	29 lines
llvm/trunk/utils/lit/lit/util.py	82 lines

Diff 104566

llvm/trunk/utils/lit/lit/formats/googletest.py

[24 lines not shown]	def getGTestTests(self, path, litConfig, localConfig):
Return the tests available in gtest executable.		Return the tests available in gtest executable.

Args:		Args:
path: String path to a gtest executable		path: String path to a gtest executable
litConfig: LitConfig instance		litConfig: LitConfig instance
localConfig: TestingConfig instance"""		localConfig: TestingConfig instance"""

try:		try:
lines = lit.util.capture([path, '--gtest_list_tests'],		output = subprocess.check_output([path, '--gtest_list_tests'],
env=localConfig.environment)		env=localConfig.environment)
if kIsWindows:		except subprocess.CalledProcessError as exc:
lines = lines.replace('\r', '')		litConfig.warning(
lines = lines.split('\n')		"unable to discover google-tests in %r: %s. Process output: %s"
except Exception as exc:		% (path, sys.exc_info()[1], exc.output))
out = exc.output if isinstance(exc, subprocess.CalledProcessError) else ''
litConfig.warning("unable to discover google-tests in %r: %s. Process output: %s"
% (path, sys.exc_info()[1], out))
raise StopIteration		raise StopIteration

nested_tests = []		nested_tests = []
for ln in lines:		for ln in output.splitlines(False): # Don't keep newlines.
		if 'Running main() from gtest_main.cc' in ln:
		# Upstream googletest prints this to stdout prior to running
		# tests. LLVM removed that print statement in r61540, but we
		# handle it here in case upstream googletest is being used.
		continue

# The test name list includes trailing comments beginning with		# The test name list includes trailing comments beginning with
# a '#' on some lines, so skip those. We don't support test names		# a '#' on some lines, so skip those. We don't support test names
# that use escaping to embed '#' into their name as the names come		# that use escaping to embed '#' into their name as the names come
# from C++ class and method names where such things are hard and		# from C++ class and method names where such things are hard and
# uninteresting to support.		# uninteresting to support.
ln = ln.split('#', 1)[0].rstrip()		ln = ln.split('#', 1)[0].rstrip()
if not ln.lstrip():		if not ln.lstrip():
continue		continue

if 'Running main() from gtest_main.cc' in ln:
# Upstream googletest prints this to stdout prior to running
# tests. LLVM removed that print statement in r61540, but we
# handle it here in case upstream googletest is being used.
continue

index = 0		index = 0
while ln[index*2:index*2+2] == '  ':		while ln[index*2:index*2+2] == '  ':
index += 1		index += 1
while len(nested_tests) > index:		while len(nested_tests) > index:
nested_tests.pop()		nested_tests.pop()

ln = ln[index*2:]		ln = ln[index*2:]
if ln.endswith('.'):		if ln.endswith('.'):
[77 lines not shown]

llvm/trunk/utils/lit/lit/util.py

import errno		import errno
import itertools		import itertools
import math		import math
import os		import os
import platform		import platform
import signal		import signal
import subprocess		import subprocess
import sys		import sys
import threading		import threading

def to_bytes(str):		def to_bytes(s):
# Encode to UTF-8 to get binary data.		"""Return the parameter as type 'bytes', possibly encoding it.
if isinstance(str, bytes):
return str
return str.encode('utf-8')

def to_string(bytes):
if isinstance(bytes, str):
return bytes
return to_bytes(bytes)

def convert_string(bytes):		In Python2, the 'bytes' type is the same as 'str'. In Python3, they are
		distinct.
		"""
		if isinstance(s, bytes):
		# In Python2, this branch is taken for both 'str' and 'bytes'.
		# In Python3, this branch is taken only for 'bytes'.
		return s
		# In Python2, 's' is a 'unicode' object.
		# In Python3, 's' is a 'str' object.
		# Encode to UTF-8 to get 'bytes' data.
		return s.encode('utf-8')

		def to_string(b):
		"""Return the parameter as type 'str', possibly encoding it.

		In Python2, the 'str' type is the same as 'bytes'. In Python3, the
		'str' type is (essentially) Python2's 'unicode' type, and 'bytes' is
		distinct.
		"""
		if isinstance(b, str):
		# In Python2, this branch is taken for types 'str' and 'bytes'.
		# In Python3, this branch is taken only for 'str'.
		return b
		if isinstance(b, bytes):
		# In Python2, this branch is never taken ('bytes' is handled as 'str').
		# In Python3, this is true only for 'bytes'.
		return b.decode('utf-8')

		# By this point, here's what we don't have:
		#
		# - In Python2:
		# - 'str' or 'bytes' (1st branch above)
		# - In Python3:
		# - 'str' (1st branch above)
		# - 'bytes' (2nd branch above)
		#
		# The last type we might expect is the Python2 'unicode' type. There is no
		# 'unicode' type in Python3 (all the Python3 cases were already handled). In
		# order to get a 'str' object, we need to encode the 'unicode' object.
try:		try:
return to_string(bytes.decode('utf-8'))		return b.encode('utf-8')
except AttributeError: # 'str' object has no attribute 'decode'.		except AttributeError:
return str(bytes)		raise TypeError('not sure how to convert %s to %s' % (type(b), str))
except UnicodeError:
return str(bytes)

def detectCPUs():		def detectCPUs():
"""		"""
Detects the number of CPUs on a system. Cribbed from pp.		Detects the number of CPUs on a system. Cribbed from pp.
"""		"""
# Linux, Unix and MacOS:		# Linux, Unix and MacOS:
if hasattr(os, "sysconf"):		if hasattr(os, "sysconf"):
if "SC_NPROCESSORS_ONLN" in os.sysconf_names:		if "SC_NPROCESSORS_ONLN" in os.sysconf_names:
# Linux & Unix:		# Linux & Unix:
ncpus = os.sysconf("SC_NPROCESSORS_ONLN")		ncpus = os.sysconf("SC_NPROCESSORS_ONLN")
if isinstance(ncpus, int) and ncpus > 0:		if isinstance(ncpus, int) and ncpus > 0:
return ncpus		return ncpus
else: # OSX:		else: # OSX:
return int(capture(['sysctl', '-n', 'hw.ncpu']))		return int(subprocess.check_output(['sysctl', '-n', 'hw.ncpu'],
		stderr=subprocess.STDOUT))
# Windows:		# Windows:
if "NUMBER_OF_PROCESSORS" in os.environ:		if "NUMBER_OF_PROCESSORS" in os.environ:
ncpus = int(os.environ["NUMBER_OF_PROCESSORS"])		ncpus = int(os.environ["NUMBER_OF_PROCESSORS"])
if ncpus > 0:		if ncpus > 0:
# With more than 32 processes, process creation often fails with		# With more than 32 processes, process creation often fails with
# "Too many open files". FIXME: Check if there's a better fix.		# "Too many open files". FIXME: Check if there's a better fix.
return min(ncpus, 32)		return min(ncpus, 32)
return 1 # Default		return 1 # Default
[11 lines not shown]	def mkdir_p(path):
try:		try:
os.mkdir(path)		os.mkdir(path)
except OSError:		except OSError:
e = sys.exc_info()[1]		e = sys.exc_info()[1]
# Ignore EEXIST, which may occur during a race condition.		# Ignore EEXIST, which may occur during a race condition.
if e.errno != errno.EEXIST:		if e.errno != errno.EEXIST:
raise		raise

def capture(args, env=None):
"""capture(command) - Run the given command (or argv list) in a shell and
return the standard output. Raises a CalledProcessError if the command
exits with a non-zero status."""
p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
env=env)
out, err = p.communicate()
out = convert_string(out)
err = convert_string(err)
if p.returncode != 0:
raise subprocess.CalledProcessError(cmd=args,
returncode=p.returncode,
output="{}\n{}".format(out, err))
return out

def which(command, paths = None):		def which(command, paths = None):
"""which(command, [paths]) - Look up the given command in the paths string		"""which(command, [paths]) - Look up the given command in the paths string
(or the PATH environment variable, if unspecified)."""		(or the PATH environment variable, if unspecified)."""

if paths is None:		if paths is None:
paths = os.environ.get('PATH','')		paths = os.environ.get('PATH','')

# Check for absolute match first.		# Check for absolute match first.
[135 lines not shown]	try:

out,err = p.communicate(input=input)		out,err = p.communicate(input=input)
exitCode = p.wait()		exitCode = p.wait()
finally:		finally:
if timerObject != None:		if timerObject != None:
timerObject.cancel()		timerObject.cancel()

# Ensure the resulting output is always of string type.		# Ensure the resulting output is always of string type.
out = convert_string(out)		out = to_string(out)
err = convert_string(err)		err = to_string(err)

if hitTimeOut[0]:		if hitTimeOut[0]:
raise ExecuteCommandTimeoutException(		raise ExecuteCommandTimeoutException(
msg='Reached timeout of {} seconds'.format(timeout),		msg='Reached timeout of {} seconds'.format(timeout),
out=out,		out=out,
err=err,		err=err,
exitCode=exitCode		exitCode=exitCode
)		)
[66 lines not shown]