This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
utils/lit/lit/
-
lit/
-
lit/
3
Test.py

Differential D34464

lit: Make sure testnames are unicode strings
AbandonedPublic

Authored by MatzeB on Jun 21 2017, 10:53 AM.

Download Raw Diff

Details

Reviewers

modocache
• ddunbar
chapuni

Summary

This fixes the problem reported in D33468

The version check feels a bit unfortunate (on the other hand I didn't want to introduce a dependency on the external six module just for this). Maybe someone knows a better way?

I'm also not sure how to unittest this: I do not want to put a file with a non-ascii name into the repository. Can leave it without a test?

Diff Detail

Event Timeline

MatzeB created this revision.Jun 21 2017, 10:53 AM

Herald added a subscriber: mcrosier. · View Herald TranscriptJun 21 2017, 10:53 AM

For the record if someone wants to reproduce:

$ touch test/CodeGen/X86/hällö.ll
$ build/bin/llvm-lit -s test/CodeGen/X86/hällö.ll

Without this patch:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 405, in _handle_results
    cache[job]._set(i, obj)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 572, in _set
    self._callback(self._value)
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/run.py", line 344, in consume_test_result
    self.display.update(test_with_result)
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/main.py", line 50, in update
    test.getFullName())
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/ProgressBar.py", line 271, in update
    self.term.CLEAR_EOL + message)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 21: ordinal not in range(128)

MatzeB mentioned this in D33468: Enable unicode output on terminals..Jun 21 2017, 10:56 AM

chapuni added inline comments.Jun 21 2017, 2:58 PM

utils/lit/lit/Test.py
232	Could you make it work without checking version? (I haven't tested yet)

MatzeB added inline comments.Jun 21 2017, 3:04 PM

utils/lit/lit/Test.py
232	No I could not. It seems on python2 you get `str` objects back from listdir() sometimes, while on python3 you always get a (unicode) string back. Calling decode on a string in python3 is not allowed.

I have reproduced the issue, and confirmed this works.

Yet another option; Could we make ProgressBar accept unicode?
The difference is; sys.stdout.write() doesn't accept unicode (by default) but print() accepts.

In D34464#787900, @chapuni wrote:

I have reproduced the issue, and confirmed this works.

Yet another option; Could we make ProgressBar accept unicode?
The difference is; sys.stdout.write() doesn't accept unicode (by default) but print() accepts.

The thing is doing sys.stdout.write(u'Hällö\n') works perfectly fine as python knows how to interpret the unicode object. As far as I understand it, what we have here is that the filename is rather a sequence of bytes and python refuses to simply interpret that byte string as utf-8. Even a stream of bytes can still be printed perfectly fine to stdout. What breaks in the ProgressBar is that we try to concatenate a unicode string with a string of bytes. That's the point where python needs to know the encoding or it can't create a unicode string for the result.

This illustrates it I think (note that this is about python2; things are different in python3)

import sys
bla = 'H\xc3\xa4ll\xc3\xb6\n'
# All of these work
sys.stdout.write(u"I say: ")
sys.stdout.write(bla)
sys.stdout.write(u"I say: " + bla.decode('utf-8'))
# This fails
sys.stdout.write(u"I say: " + bla)

If you want to solve the problem in ProgressBar then we have to make sure we do not concatenate strings to much and rather use separate write calls. However I rather prefer the solution here where we decode the filename to a unicode object upfront.

It works.

--- a/llvm/utils/lit/lit/ProgressBar.py
+++ b/llvm/utils/lit/lit/ProgressBar.py
@@ -268,7 +268,8 @@ class ProgressBar:
             self.BOL + self.term.UP + self.term.CLEAR_EOL +
             (self.bar % (prefix, '='*n, '-'*(barWidth-n), suffix)) +
             self.XNL +
-            self.term.CLEAR_EOL + message)
+            self.term.CLEAR_EOL)
+        sys.stdout.write(message)
         if not self.term.XN:
             sys.stdout.flush()

I think, in the future, we may handle filepaths as unicode string in lit.

chapuni mentioned this in D34793: [lit] Fix some convoluted logic around Unicode encoding, and de-duplicate across modules that used it..Jun 28 2017, 9:50 PM

dlj added a subscriber: dlj.Jun 29 2017, 12:22 AM

dlj added inline comments.

utils/lit/lit/Test.py
232	Python3's listdir can return bytes as well, if a bytes object was passed as the path. The simplest path forward is probably something like this: testpath = os.pathsep.join(self.path_in_suite) if isinstance(testpath, bytes): testpath = testpath.decode(sys.getfilesystemencoding(), 'surrogateescape') return ... That should always give you a Unicode-like object (Py2 unicode or Py3 str). The surrogateescape behaviour matches how Py3 handles odd filenames: https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Python/bltinmodule.c#L35 It's a subtle point, but by using Py3's listdir, the filenames could contain non-utf-8-encodable Unicode values. That means that, in any case, whatever prints the filename to the terminal needs to guard against a UnicodeEncodeError (at a minimum, in case of surrogates in the filename).

Sorry, this issue is still in trunk. Py2 crashes.

ping...

ping2...

Herald added a subscriber: delcypher. · View Herald TranscriptFeb 21 2018, 2:01 AM

MatzeB abandoned this revision.Sep 10 2021, 10:13 AM

Herald added a subscriber: wenlei. · View Herald TranscriptSep 10 2021, 10:13 AM

Revision Contents

Path

Size

utils/

lit/

Test.py

10 lines

Diff 103423

utils/lit/lit/Test.py

import os		import os
		import sys
from xml.sax.saxutils import escape		from xml.sax.saxutils import escape
from json import JSONEncoder		from json import JSONEncoder

from lit.BooleanExpression import BooleanExpression		from lit.BooleanExpression import BooleanExpression

# Test result codes.		# Test result codes.

class ResultCode(object):		class ResultCode(object):
▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines	def setResult(self, result):
elif self.result.code == FAIL:		elif self.result.code == FAIL:
self.result.code = XFAIL		self.result.code = XFAIL
except ValueError as e:		except ValueError as e:
# Syntax error in an XFAIL line.		# Syntax error in an XFAIL line.
self.result.code = UNRESOLVED		self.result.code = UNRESOLVED
self.result.output = str(e)		self.result.output = str(e)

def getFullName(self):		def getFullName(self):
return self.suite.config.name + ' :: ' + '/'.join(self.path_in_suite)		testpath = '/'.join(self.path_in_suite)
		# Test names should be unicode strings. In python2 filepaths are not
		# necessarily unicode strings yet as listdir() tends to return them
		# as str objects when they don't decode cleanly.
		if sys.version_info < (3,0,0):
		chapuniUnsubmitted Not Done Reply Inline Actions Could you make it work without checking version? (I haven't tested yet) chapuni: Could you make it work without checking version? (I haven't tested yet)
		MatzeBAuthorUnsubmitted Not Done Reply Inline Actions No I could not. It seems on python2 you get `str` objects back from listdir() sometimes, while on python3 you always get a (unicode) string back. Calling decode on a string in python3 is not allowed. MatzeB: No I could not. It seems on python2 you get `str` objects back from listdir() sometimes, while…
		dljUnsubmitted Not Done Reply Inline Actions Python3's listdir can return bytes as well, if a bytes object was passed as the path. The simplest path forward is probably something like this: testpath = os.pathsep.join(self.path_in_suite) if isinstance(testpath, bytes): testpath = testpath.decode(sys.getfilesystemencoding(), 'surrogateescape') return ... That should always give you a Unicode-like object (Py2 unicode or Py3 str). The surrogateescape behaviour matches how Py3 handles odd filenames: https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Python/bltinmodule.c#L35 It's a subtle point, but by using Py3's listdir, the filenames could contain non-utf-8-encodable Unicode values. That means that, in any case, whatever prints the filename to the terminal needs to guard against a UnicodeEncodeError (at a minimum, in case of surrogates in the filename). dlj: Python3's listdir can return bytes as well, if a bytes object was passed as the path. The…
		testpath = testpath.decode(sys.getfilesystemencoding(),
		errors='replace')
		return self.suite.config.name + ' :: ' + testpath

def getFilePath(self):		def getFilePath(self):
if self.file_path:		if self.file_path:
return self.file_path		return self.file_path
return self.getSourcePath()		return self.getSourcePath()

def getSourcePath(self):		def getSourcePath(self):
return self.suite.getSourcePath(self.path_in_suite)		return self.suite.getSourcePath(self.path_in_suite)
▲ Show 20 Lines • Show All 127 Lines • Show Last 20 Lines