This is an archive of the discontinued LLVM Phabricator instance.

Always write the session log file in UTF-8
ClosedPublic

Authored by zturner on Jan 29 2016, 1:50 PM.

Details

Summary

Hopefully the 10th time's a charm on this one.

This patch attempts to solve the Python 2 / Python 3 incompatibilities by introducing a new encoded_file abstraction that we use instead of io.open(). The problem with the builtin implementation of io.open is that read and write accept and return unicode objects, which are not always convenient to work with in Python 2. We solve this by making encoded_file.open() return the same object returned by io.open() but with hooked read() and write() methods. These hooked methods will accept binary or text data, and conditionally convert what it gets to a unicode object using the correct encoding. When calling read() it also does any conversion necessary to convert the output back into the native string type of the running python version.

I actually tested this one on both Windows and Linux and it works fine. So hopefully this is closer to a working solution!

Diff Detail

Repository
rL LLVM

Event Timeline

zturner updated this revision to Diff 46426.Jan 29 2016, 1:50 PM
zturner retitled this revision from to Always write the session log file in UTF-8.
zturner updated this object.
zturner added reviewers: tfiala, labath, tberghammer.
zturner added a subscriber: lldb-commits.
tfiala edited edge metadata.Jan 29 2016, 3:10 PM

I'll give it a run!

Running the test suite now. So far so good (it didn't blow up immediately like it has for other attempts). I'll post back when the test run completes.

The only test I see fail with this is:
TestGDBRemoteMemoryRead.py:

ERROR: test_memory_read_dwarf (TestGDBRemoteMemoryRead.MemoryReadTestCase)

Traceback (most recent call last):

File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2289, in dwarf_test_method
  return attrvalue(self)
File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/test/tools/lldb-server/TestGDBRemoteMemoryRead.py", line 37, in test_memory_read
  self.match("process plugin packet send x%x,%x" % (pc, size), ["response:", memory])
File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2615, in match
  self.runCmd(str, msg=msg, trace = (True if trace else False), check = not error)
File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2591, in runCmd
  print(self.res.GetError(), file=sbuf)
File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 267, in __exit__
  print(self.getvalue(), file=self.session)
File "/Users/tfiala/lldb/tot/git-svn/lldb/packages/Python/lldbsuite/support/encoded_file.py", line 32, in impl
  s = s.decode(encoding)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
  return codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x89 in position 90: invalid start byte

Hmm, that almost looks to me like it's finding a real bug. Which it
probably is, because higher up the callstack it's reading the response of a
memory send. I'm guessing that the response includes non-printable
characters (raw memory data) that wasn't being written before but was being
silently ignored, and now it's giving an error.

Can you try passing errors='replace' to encoded_file.open()? Then open the
log file and see what's in it.

Hmm, that almost looks to me like it's finding a real bug. Which it
probably is, because higher up the callstack it's reading the response of a
memory send. I'm guessing that the response includes non-printable
characters (raw memory data) that wasn't being written before but was being
silently ignored, and now it's giving an error.

Can you try passing errors='replace' to encoded_file.open()? Then open the
log file and see what's in it.

Yep, I can give that a try. It'll end up being later (not at a computer now) but I'll give it a shot.

Hmm, that almost looks to me like it's finding a real bug. Which it
probably is, because higher up the callstack it's reading the response of a
memory send. I'm guessing that the response includes non-printable
characters (raw memory data) that wasn't being written before but was being
silently ignored, and now it's giving an error.

Can you try passing errors='replace' to encoded_file.open()? Then open the
log file and see what's in it.

Yep, I can give that a try. It'll end up being later (not at a computer now) but I'll give it a shot.

I'm going to have a look at trying this modification now.

I'm going to have a look at trying this modification now.

I'm getting the same error with the replace.

Here is the patch (okay the whole encoded_file.py) I was able to use to get past this - which ultimately looks to be an error in the match result printing of a raw byte buffer (never meant to be unicode printable) in the result stream (i.e. the error I was getting looked to be purely a by-product of printing match results that succeeded, not failed, but the unicode decoding introduced the actual failure point):

"""
                     The LLVM Compiler Infrastructure

This file is distributed under the University of Illinois Open Source
License. See LICENSE.TXT for details.

Prepares language bindings for LLDB build process.  Run with --help
to see a description of the supported command line arguments.
"""

# Python modules:
import io

# Third party modules
import six

def _encoded_read(old_read, encoding):
    def impl(size):
        result = old_read(size)
        # If this is Python 2 then we need to convert the resulting `unicode` back
        # into a `str` before returning
        if six.PY2:
            result = result.encode(encoding)
        return result
    return impl

def _encoded_write(old_write, encoding):
    def impl(s):
        # If we were asked to write a `str` (in Py2) or a `bytes` (in Py3) decode it
        # as unicode before attempting to write.
        if isinstance(s, six.binary_type):
            try:
                s = s.decode(encoding)
            except UnicodeDecodeError as decode_err:
                import sys
                sys.stderr.write("error: unicode decode failed on raw string '{}': '{}'".format(s, decode_err))
                s = u"Could not decode unicode string, see stderr for details"
        return old_write(s)
    return impl

'''
Create a Text I/O file object that can be written to with either unicode strings or byte strings
under Python 2 and Python 3, and automatically encodes and decodes as necessary to return the
native string type for the current Python version
'''
def open(file, encoding, mode='r', buffering=-1, errors=None, newline=None, closefd=True):
    wrapped_file = io.open(file, mode=mode, buffering=buffering, encoding=encoding,
                           errors=errors, newline=newline, closefd=closefd)
    new_read = _encoded_read(getattr(wrapped_file, 'read'), encoding)
    new_write = _encoded_write(getattr(wrapped_file, 'write'), encoding)
    setattr(wrapped_file, 'read', new_read)
    setattr(wrapped_file, 'write', new_write)
    return wrapped_file

It just adds a try/except block around the Unicode decode. Is is highly likely that might not run on Python 3 - i.e. ping pong this back into a Python 3 error. I may try to bring this up on Windows to see if that does actually happen.

In any event, the right fix here probably is to have displays of matched/expected text for known-to-be binary data *not* try to print results in the expect-string-match code since these are just going to have no way of being valid. The other way to go (perhaps better) would be to put some kind of safe wrapper around the byte compares, so that they are tested as ASCII-ified output, or use an entirely different mechanism here.

If this works on Windows the way I fixed this up, you can go ahead and check this in. (I no longer get failures with the code change I made above). If it doesn't work but you can tweak that slightly, feel free to go ahead and do that as well. In the meantime I am going to see if I can get the binary aspect of the matching handled properly (i.e. not done as string compares). This might be one of my tests. (It's at least in the goop of lldb-server tests that I had written 1.5 to 2 years ago).

In any event, the right fix here probably is to have displays of matched/expected text for known-to-be binary data *not* try to print results in the expect-string-match code since these are just going to have no way of being valid.

i.e. a test fix. I think we still want to keep the unicode decode guard in, as it could easily be something printing binary-based data like we had here, but the culprit under this is basically the method the test is using to verify binary data.

In fact I'd say you could do this, now that I've looked at the test:

diff --git a/packages/Python/lldbsuite/test/tools/lldb-server/TestGDBRemoteMemoryRead.py b/packages/Python/lldbsuite/test/tools/lldb-server/TestGDBRemoteMemoryRead.py
index 7b974e5..9f80abe 100644
--- a/packages/Python/lldbsuite/test/tools/lldb-server/TestGDBRemoteMemoryRead.py
+++ b/packages/Python/lldbsuite/test/tools/lldb-server/TestGDBRemoteMemoryRead.py
@@ -34,7 +34,7 @@ class MemoryReadTestCase(TestBase):
             error = lldb.SBError()
             memory = process.ReadMemory(pc, size, error)
             self.assertTrue(error.Success())
-            self.match("process plugin packet send x%x,%x" % (pc, size), ["response:", memory])
+            # self.match("process plugin packet send x%x,%x" % (pc, size), ["response:", memory])
             self.match("process plugin packet send m%x,%x" % (pc, size), ["response:", binascii.hexlify(memory)])

         process.Continue()

(i.e. comment out the direct memory compare with strings). That'll take a little bit of work to re-work, but the 'm' hexlified version is sufficient for verifying memory reads for now until we can rework the test to compare the bytes directly instead of counting on print-style matches. You could comment it out and file a bug on it to fix.

labath edited edge metadata.Feb 1 2016, 5:03 AM

TestGDBRemoteMemoryRead.py is my test and I gotta admit it is somewhat hackish. If this is the only issue that is stopping this from going through, then please xfail it, and i'll rewrite it in some other way.

tfiala added a subscriber: tfiala.Feb 1 2016, 6:45 AM

Just that x command needs to be commented out. The m command is fine as is.

-Todd

This revision was automatically updated to reflect the committed changes.
tfiala added a comment.Feb 2 2016, 8:19 AM

This change appears to have caused a test error (rather than a test failure) on the OS X builder. See here (master builder step):
http://lab.llvm.org:8080/green/job/LLDB/16318/

and here (build and test run step):
http://lab.llvm.org:8080/green/job/lldb_build_test/16209/

It didn't get reported since the test collector Jenkins step is somehow behaving differently than other scenarios and wasn't marking the build as failed. (I have since fixed this morning).

These look to be some kind of Unicode encoding failure, which means the test is probably doing something that isn't Unicode safe. I'll have a look at it.

tfiala added a comment.Feb 2 2016, 8:21 AM

This change appears to have caused a test error (rather than a test failure) on the OS X builder.

It's not clear to me yet why I didn't see this at home, but my first guess is that I have all my consoles set to utf-8 encoding, so if this is terminal related, I may have masked the issue that way. In any event, I'll have a look.

tfiala added a comment.Feb 2 2016, 9:05 AM

Skipped TestWatchLocation.py test method on OS X with r259526 while I investigate.