Page MenuHomePhabricator

lit: Make sure testnames are unicode strings
Needs ReviewPublic

Authored by MatzeB on Jun 21 2017, 10:53 AM.



This fixes the problem reported in D33468

The version check feels a bit unfortunate (on the other hand I didn't want to introduce a dependency on the external six module just for this). Maybe someone knows a better way?

I'm also not sure how to unittest this: I do not want to put a file with a non-ascii name into the repository. Can leave it without a test?

Diff Detail

Event Timeline

MatzeB created this revision.Jun 21 2017, 10:53 AM

For the record if someone wants to reproduce:

$ touch test/CodeGen/X86/hällö.ll
$ build/bin/llvm-lit -s test/CodeGen/X86/hällö.ll

Without this patch:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 801, in __bootstrap_inner
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/", line 405, in _handle_results
    cache[job]._set(i, obj)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/", line 572, in _set
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/", line 344, in consume_test_result
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/", line 50, in update
  File "/Users/mbraun/dev/public_llvm/utils/lit/lit/", line 271, in update
    self.term.CLEAR_EOL + message)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 21: ordinal not in range(128)
chapuni added inline comments.Jun 21 2017, 2:58 PM

Could you make it work without checking version?
(I haven't tested yet)

MatzeB added inline comments.Jun 21 2017, 3:04 PM

No I could not. It seems on python2 you get str objects back from listdir() sometimes, while on python3 you always get a (unicode) string back. Calling decode on a string in python3 is not allowed.

chapuni edited edge metadata.Jun 22 2017, 5:59 AM

I have reproduced the issue, and confirmed this works.

Yet another option; Could we make ProgressBar accept unicode?
The difference is; sys.stdout.write() doesn't accept unicode (by default) but print() accepts.

MatzeB added a comment.EditedJun 22 2017, 11:19 AM

I have reproduced the issue, and confirmed this works.

Yet another option; Could we make ProgressBar accept unicode?
The difference is; sys.stdout.write() doesn't accept unicode (by default) but print() accepts.

The thing is doing sys.stdout.write(u'Hällö\n') works perfectly fine as python knows how to interpret the unicode object. As far as I understand it, what we have here is that the filename is rather a sequence of bytes and python refuses to simply interpret that byte string as utf-8. Even a stream of bytes can still be printed perfectly fine to stdout. What breaks in the ProgressBar is that we try to concatenate a unicode string with a string of bytes. That's the point where python needs to know the encoding or it can't create a unicode string for the result.

This illustrates it I think (note that this is about python2; things are different in python3)

import sys
bla = 'H\xc3\xa4ll\xc3\xb6\n'
# All of these work
sys.stdout.write(u"I say: ")
sys.stdout.write(u"I say: " + bla.decode('utf-8'))
# This fails
sys.stdout.write(u"I say: " + bla)

If you want to solve the problem in ProgressBar then we have to make sure we do not concatenate strings to much and rather use separate write calls. However I rather prefer the solution here where we decode the filename to a unicode object upfront.

It works.

--- a/llvm/utils/lit/lit/
+++ b/llvm/utils/lit/lit/
@@ -268,7 +268,8 @@ class ProgressBar:
             self.BOL + self.term.UP + self.term.CLEAR_EOL +
             ( % (prefix, '='*n, '-'*(barWidth-n), suffix)) +
             self.XNL +
-            self.term.CLEAR_EOL + message)
+            self.term.CLEAR_EOL)
+        sys.stdout.write(message)
         if not self.term.XN:

I think, in the future, we may handle filepaths as unicode string in lit.

dlj added a subscriber: dlj.Jun 29 2017, 12:22 AM
dlj added inline comments.

Python3's listdir can return bytes as well, if a bytes object was passed as the path.

The simplest path forward is probably something like this:

testpath = os.pathsep.join(self.path_in_suite)
if isinstance(testpath, bytes):
  testpath = testpath.decode(sys.getfilesystemencoding(), 'surrogateescape')
return ...

That should always give you a Unicode-like object (Py2 unicode or Py3 str). The surrogateescape behaviour matches how Py3 handles odd filenames:

It's a subtle point, but by using Py3's listdir, the filenames could contain non-utf-8-encodable Unicode values. That means that, in any case, whatever prints the filename to the terminal needs to guard against a UnicodeEncodeError (at a minimum, in case of surrogates in the filename).

Sorry, this issue is still in trunk. Py2 crashes.