This is an archive of the discontinued LLVM Phabricator instance.

Re-submit rL258759: Write the session log file in UTF-8.
AbandonedPublic

Authored by tberghammer on Jan 28 2016, 3:50 AM.

Download Raw Diff

Details

Reviewers

zturner
tfiala

Summary

Re-submit rL258759: Write the session log file in UTF-8.

Previously we were writing in the default encoding, which depends
on the operating system and is not guaranteed to be unicode aware.
On Python 3, this would lead to a situation where writing unicode
text to the log file generates an exception. The fix here is to
write session logs using the proper encoding, which incidentally
fixes another test, so xfail is removed from that.

Diff Detail

Event Timeline

tberghammer updated this revision to Diff 46256.Jan 28 2016, 3:50 AM

tberghammer retitled this revision from to Re-submit rL258759: Write the session log file in UTF-8..

tberghammer updated this object.

tberghammer added reviewers: zturner, tfiala.

tberghammer added a subscriber: lldb-commits.

I know that this version is not the cleanest implementation but I think this is a very robust because we don't depend on the system encoding anywhere. Please gave it a try and feel free to commit it if if works.

I can give this a try later today.

Hey Tamas,

I haven't tried it yet, but I won't be into the office until later. Why don't you go ahead and submit it, and if it blows up the OS X testbot, you'll get an email and you (or whoever finds it first) can revert it?

There may be a better way to address it longer term but if this works, it seems okay for now.

-Todd

This revision is now accepted and ready to land.Jan 28 2016, 7:08 AM

I am more worried about Windows and python3 and I don't think we have any build bot running the tests on Windows. I don't really know what is the original issue this commit try to address so I plan to wait until Zachary can check it out (the current ToT works on Linux so I am not worried)

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

dir(str)

['add', 'class', 'contains', 'delattr', 'dir',
'doc', 'eq', 'format', 'ge', 'getattribute',
'getitem', 'getnewargs', 'gt', 'hash', 'init',
'iter', 'le', 'len', 'lt', 'mod', 'mul', 'ne',
'new', 'reduce', 'reduce_ex', 'repr', 'rmod',
'rmul', 'setattr', 'sizeof', 'str', 'subclasshook',
'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith',
'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum',
'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric',
'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust',
'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

No decode method. Which gets to the heart of the problem, I couldn't find
a single syntax that worked on all platforms :-/

I think the best thing to do is make a decode_utf8() function that looks
like this:

def decode_utf8(text):

if six.PY2:
    return text.decode("utf-8", errors="replace")
return text

and then write

print(decode_utf8(traceback), file=self.session)

and make sure that the file is opened with io.open(encoding="utf-8")

On second thought I think the best solution is just to decide that we're
always going to use unicode strings everywhere. That means using u""
instead of "" everywhere. It's hard to do this all over the entire
codebase, but we can start with this and see what happens. This is what I
tried to do in my followup patch the other day, but I think I missed the
place where we were writing self.getvalue() and that was still a regular
string. So anywhere where we write a string variable and not a string
literal, we still have to decode it conditionally if we're on python 2

The place where we print the traceback, what happens if you do this:

self.session = io.open(path, mode="w", encoding="utf-8")
traceback.print_exc(file=self.session)

Does that work?

I don't think using unicode strings everywhere is a good idea because when we compare a string coming from the SB API with a string literal we don't want to do a lot of conversion. Also I think enforcing all string literals to be unicode is something what will break all the time.

I tried to change the code to use "self.session = io.open(path, mode="w", encoding="utf-8")" but it was failing with the following stack trace:

Traceback (most recent call last):
File "./test/dotest.py", line 7, in <module>
  lldbsuite.test.run_suite()
File "/mnt/ssd/ll/git/lldb/packages/Python/lldbsuite/test/dotest.py", line 1089, in run_suite
  resultclass=test_result.LLDBTestResult).run(configuration.suite)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/runner.py", line 162, in run
  test(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 65, in __call__
  return self.run(*args, **kwds)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 85, in run
  self._wrapped_run(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 115, in _wrapped_run
  test._wrapped_run(result, debug)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/suite.py", line 117, in _wrapped_run
  test(result)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/case.py", line 433, in __call__
  return self.run(*args, **kwds)
File "/mnt/ssd/ll/git/lldb/third_party/Python/module/unittest2/unittest2/case.py", line 369, in run
  self.dumpSessionInfo()
File "/mnt/ssd/ll/git/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1857, in dumpSessionInfo
  print("Session info generated @", datetime.datetime.now().ctime(), file=self.session)
TypeError: must be unicode, not str

I converted the strings it was complaining about but I haven't managed to get the traceback printing working without converting the data to utf-8 manually. We are using traceback.format_exception to collect the stack trace what returns a string, then we store it in a variable and print it later.

What if you try traceback.print_exception()? Does it work? Maybe it's
smart enough to detect whether it needs to encode the traceback before
printing.

The traceback calculation is done inside unittest2 so I think we shouldn't change it and we don't have access to the session variable either.

I think I have a good idea how to fix this. in lldbsuite.support we can
add unicode_file.py. Make it implement the file interface so that it looks
just like an io object so it can be used with print(file=), and with the
'with' keyword, etc. Make the open method call io.open() with an encoding,
then make the read / write methods do a python version check before sending
the read / write call to the underlying io object. Then you basically just
do

self.session_file = unicode_file.open(path, mode='w', encoding='utf-8')

and then you can write however you want.

print(u"", self.session_file)
print("", self.session_file)

will both work in any python version

That seems very reasonable.

I'm working on getting the MSVC buildbot using VS 2015, but I'll take a
stab at this after it's done. Feel free to leave everything as is for now,
I dont' think anyone but Windows is broken on this anyway for now.

Sounds great, Zachary. Thanks!

The change was submitted with a different CL

Revision Contents

Path

Size

packages/

Python/

lldbsuite/

test/

lang/

cpp/

char1632_t/

TestChar1632T.py

1 line

lldbtest.py

4 lines

Diff 46256

packages/Python/lldbsuite/test/lang/cpp/char1632_t/TestChar1632T.py

Show All 19 Lines	def setUp(self):
# Call super's setUp().		# Call super's setUp().
TestBase.setUp(self)		TestBase.setUp(self)
# Find the line number to break for main.cpp.		# Find the line number to break for main.cpp.
self.source = 'main.cpp'		self.source = 'main.cpp'
self.lines = [ line_number(self.source, '// breakpoint1'),		self.lines = [ line_number(self.source, '// breakpoint1'),
line_number(self.source, '// breakpoint2') ]		line_number(self.source, '// breakpoint2') ]

@expectedFailureIcc # ICC (13.1) does not emit the DW_TAG_base_type for char16_t and char32_t.		@expectedFailureIcc # ICC (13.1) does not emit the DW_TAG_base_type for char16_t and char32_t.
@expectedFailureWindows("llvm.org/pr24489: Name lookup not working correctly on Windows")
def test(self):		def test(self):
"""Test that the C++11 support for char16_t and char32_t works correctly."""		"""Test that the C++11 support for char16_t and char32_t works correctly."""
self.build()		self.build()
exe = os.path.join(os.getcwd(), "a.out")		exe = os.path.join(os.getcwd(), "a.out")

# Create a target by the debugger.		# Create a target by the debugger.
target = self.dbg.CreateTarget(exe)		target = self.dbg.CreateTarget(exe)
self.assertTrue(target, VALID_TARGET)		self.assertTrue(target, VALID_TARGET)
▲ Show 20 Lines • Show All 54 Lines • Show Last 20 Lines

packages/Python/lldbsuite/test/lldbtest.py

Show First 20 Lines • Show All 256 Lines • ▼ Show 20 Lines	def __exit__(self, type, value, tb):
"""		"""
Context management protocol on exit from the body of the with statement.		Context management protocol on exit from the body of the with statement.
If trace is ON, it emits the recordings into stderr. Always add the		If trace is ON, it emits the recordings into stderr. Always add the
recordings to our session object. And close the SixStringIO object, too.		recordings to our session object. And close the SixStringIO object, too.
"""		"""
if self.trace:		if self.trace:
print(self.getvalue(), file=sys.stderr)		print(self.getvalue(), file=sys.stderr)
if self.session:		if self.session:
print(self.getvalue(), file=self.session)		print(self.getvalue().decode('utf8', errors='replace').encode(errors='replace'), file=self.session)
self.close()		self.close()

@add_metaclass(abc.ABCMeta)		@add_metaclass(abc.ABCMeta)
class _BaseProcess(object):		class _BaseProcess(object):

@abc.abstractproperty		@abc.abstractproperty
def pid(self):		def pid(self):
"""Returns process PID if has been launched already."""		"""Returns process PID if has been launched already."""
▲ Show 20 Lines • Show All 1,564 Lines • ▼ Show 20 Lines	def dumpSessionInfo(self):
elif self.__unexpected__:		elif self.__unexpected__:
prefix = 'UnexpectedSuccess'		prefix = 'UnexpectedSuccess'
else:		else:
prefix = 'Success'		prefix = 'Success'

if not self.__unexpected__ and not self.__skipped__:		if not self.__unexpected__ and not self.__skipped__:
for test, traceback in pairs:		for test, traceback in pairs:
if test is self:		if test is self:
print(traceback, file=self.session)		print(traceback.decode('utf8', errors='replace').encode(errors='replace'), file=self.session)

# put footer (timestamp/rerun instructions) into session		# put footer (timestamp/rerun instructions) into session
testMethod = getattr(self, self._testMethodName)		testMethod = getattr(self, self._testMethodName)
if getattr(testMethod, "__benchmarks_test__", False):		if getattr(testMethod, "__benchmarks_test__", False):
benchmarks = True		benchmarks = True
else:		else:
benchmarks = False		benchmarks = False

▲ Show 20 Lines • Show All 960 Lines • Show Last 20 Lines