This is an archive of the discontinued LLVM Phabricator instance.

Re-submit rL258759: Write the session log file in UTF-8.
AbandonedPublic

Authored by tberghammer on Jan 28 2016, 3:50 AM.

Details

Reviewers
zturner
tfiala
Summary

Re-submit rL258759: Write the session log file in UTF-8.

Previously we were writing in the default encoding, which depends
on the operating system and is not guaranteed to be unicode aware.
On Python 3, this would lead to a situation where writing unicode
text to the log file generates an exception. The fix here is to
write session logs using the proper encoding, which incidentally
fixes another test, so xfail is removed from that.

Diff Detail

Event Timeline

tberghammer retitled this revision from to Re-submit rL258759: Write the session log file in UTF-8..
tberghammer updated this object.
tberghammer added reviewers: zturner, tfiala.
tberghammer added a subscriber: lldb-commits.

I know that this version is not the cleanest implementation but I think this is a very robust because we don't depend on the system encoding anywhere. Please gave it a try and feel free to commit it if if works.

tfiala edited edge metadata.Jan 28 2016, 6:24 AM

I can give this a try later today.

tfiala accepted this revision.Jan 28 2016, 7:08 AM
tfiala edited edge metadata.

Hey Tamas,

I haven't tried it yet, but I won't be into the office until later. Why don't you go ahead and submit it, and if it blows up the OS X testbot, you'll get an email and you (or whoever finds it first) can revert it?

There may be a better way to address it longer term but if this works, it seems okay for now.

-Todd

This revision is now accepted and ready to land.Jan 28 2016, 7:08 AM

I am more worried about Windows and python3 and I don't think we have any build bot running the tests on Windows. I don't really know what is the original issue this commit try to address so I plan to wait until Zachary can check it out (the current ToT works on Linux so I am not worried)

zturner edited edge metadata.Jan 28 2016, 8:45 AM
zturner added a subscriber: zturner.

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

dir(str)

['add', 'class', 'contains', 'delattr', 'dir',
'doc', 'eq', 'format', 'ge', 'getattribute',
'getitem', 'getnewargs', 'gt', 'hash', 'init',
'iter', 'le', 'len', 'lt', 'mod', 'mul', 'ne',
'new', 'reduce', 'reduce_ex', 'repr', 'rmod',
'rmul', 'setattr', 'sizeof', 'str', 'subclasshook',
'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith',
'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric',
'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust',
'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

No decode method. Which gets to the heart of the problem, I couldn't find
a single syntax that worked on all platforms :-/

I think the best thing to do is make a decode_utf8() function that looks
like this:

def decode_utf8(text):

if six.PY2:
    return text.decode("utf-8", errors="replace")
return text

and then write

print(decode_utf8(traceback), file=self.session)

and make sure that the file is opened with io.open(encoding="utf-8")

On second thought I think the best solution is just to decide that we're
always going to use unicode strings everywhere. That means using u""
instead of "" everywhere. It's hard to do this all over the entire
codebase, but we can start with this and see what happens. This is what I
tried to do in my followup patch the other day, but I think I missed the
place where we were writing self.getvalue() and that was still a regular
string. So anywhere where we write a string variable and not a string
literal, we still have to decode it conditionally if we're on python 2

The place where we print the traceback, what happens if you do this:

self.session = io.open(path, mode="w", encoding="utf-8")
traceback.print_exc(file=self.session)

Does that work?

I don't think using unicode strings everywhere is a good idea because when we compare a string coming from the SB API with a string literal we don't want to do a lot of conversion. Also I think enforcing all string literals to be unicode is something what will break all the time.

I tried to change the code to use "self.session = io.open(path, mode="w", encoding="utf-8")" but it was failing with the following stack trace:

Traceback (most recent call last):
File "./test/dotest.py", line 7, in <module>
  lldbsuite.test.run_suite()
File "/mnt/ssd/ll/git/lldb/packages/Python/lldbsuite/test/dotest.py", line 1089, in run_suite
  resultclass=test_result.LLDBTestResult).run(configuration.suite)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/runner.py", line 162, in run
  test(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 65, in __call__
  return self.run(*args, **kwds)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 85, in run
  self._wrapped_run(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 115, in _wrapped_run
  test._wrapped_run(result, debug)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 117, in _wrapped_run
  test(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/case.py", line 433, in __call__
  return self.run(*args, **kwds)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/case.py", line 369, in run
  self.dumpSessionInfo()
File "/mnt/ssd/ll/git/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1857, in dumpSessionInfo
  print("Session info generated @", datetime.datetime.now().ctime(), file=self.session)
TypeError: must be unicode, not str

I converted the strings it was complaining about but I haven't managed to get the traceback printing working without converting the data to utf-8 manually. We are using traceback.format_exception to collect the stack trace what returns a string, then we store it in a variable and print it later.

What if you try traceback.print_exception()? Does it work? Maybe it's
smart enough to detect whether it needs to encode the traceback before
printing.

The traceback calculation is done inside unittest2 so I think we shouldn't change it and we don't have access to the session variable either.

I think I have a good idea how to fix this. in lldbsuite.support we can
add unicode_file.py. Make it implement the file interface so that it looks
just like an io object so it can be used with print(file=), and with the
'with' keyword, etc. Make the open method call io.open() with an encoding,
then make the read / write methods do a python version check before sending
the read / write call to the underlying io object. Then you basically just
do

self.session_file = unicode_file.open(path, mode='w', encoding='utf-8')

and then you can write however you want.

print(u"", self.session_file)
print("", self.session_file)

will both work in any python version

That seems very reasonable.

I'm working on getting the MSVC buildbot using VS 2015, but I'll take a
stab at this after it's done. Feel free to leave everything as is for now,
I dont' think anyone but Windows is broken on this anyway for now.

Sounds great, Zachary. Thanks!

tberghammer abandoned this revision.Feb 9 2016, 5:45 AM

The change was submitted with a different CL