This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
packages/Python/lldbsuite/support/
-
Python/
-
lldbsuite/
-
support/
-
encoded_file.py

Differential D49379

lldbsuite: ignore decoding errors in _encoded_write
ClosedPublic

Authored by kbaladurin on Jul 16 2018, 8:05 AM.

Download Raw Diff

Details

Reviewers

labath
stella.stamenova

Summary

This patch fixes random test fails due to UnicodeDecodeError:

  ...
   File "lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2070, in runCmd
    print(self.res.GetError(), file=sbuf)
  File "lldb/packages/Python/lldbsuite/test/lldbtest.py", line 293, in __exit__
    print(self.getvalue(), file=self.session)
  File "lldb/packages/Python/lldbsuite/support/encoded_file.py", line 34, in impl
    s = s.decode(encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 257: invalid start byte

This error occurs when we try to decode string that contains garbage, for example:

runCmd: process continue
output: Process 579 resuming
Process 579 stopped
* thread #2, name = 'test_hello_watc', stop reason = watchpoint 1
    frame #0: 0x000000000040150f func(char_ptr="\x01$\xad�, new_val='\x01') at main.cpp:40
   37           unsigned what = new_val;
   38           printf("new value written to location(%p) = %u\n", char_ptr, what);
   39           *char_ptr = new_val;
-> 40       }
   41
   42       uint32_t
   43       access_pool (bool flag = false)

In such cases we should ignore decoding errors.

Diff Detail

Event Timeline

kbaladurin created this revision.Jul 16 2018, 8:05 AM

Herald added a subscriber: llvm-commits. · View Herald TranscriptJul 16 2018, 8:05 AM

Please do not commit this yet. I've seen errors like this other places and this was generally not the fix - ignoring the errors in a lot of these cases would have resulted in the tests not actually verifying what they were meant check. I'd like to walk through the test case and see if I can figure out whether there's a different solution.

So I went through the code that leads here and I think long-term we need a better solution. We don't really want to be losing data from the session logs (which is what this is writing) and we have similar code for handling Unicode decode/encode in the lit code (so we should merge the two).

Having said that, I think there is a short-term solution that is viable:

You should use 'replace' rather than 'ignore', so that we can at least see in the log that some characters were lost
You should add a comment before the decode along the lines of: When decoding the string, replace invalid characters with (U+FFFD, ‘REPLACEMENT CHARACTER’). This may result in data loss

This revision now requires changes to proceed.Jul 16 2018, 2:40 PM

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

In D49379#1164949, @kbaladurin wrote:

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

I've been debating this with myself yesterday. In the case you encountered, the issue is that the c string we read from the inferior is not valid utf8. This is kind of a problem because it means that it can corrupt subsequent lldb output (most likely the closing " character).

One possibility here would be to do the replacement character substitution directly when printing out strings read from inferior memory. However, I couldn't decide for myself whether I want my debugger to mess with these strings or not...

In D49379#1164975, @labath wrote:

In D49379#1164949, @kbaladurin wrote:

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

I've been debating this with myself yesterday. In the case you encountered, the issue is that the c string we read from the inferior is not valid utf8. This is kind of a problem because it means that it can corrupt subsequent lldb output (most likely the closing " character).

One possibility here would be to do the replacement character substitution directly when printing out strings read from inferior memory. However, I couldn't decide for myself whether I want my debugger to mess with these strings or not...

Also I think we can report about decoding errors like gdb does:

Thread 3 "a.out" hit Breakpoint 1, do_bad_thing_with_location (char_ptr=0x619c20 "\001$\255", <incomplete sequence \373>, new_val=1 '\001') at main.cpp:40
40	}
(gdb) x/10xb char_ptr
0x619c20:	0x01	0x24	0xad	0xfb	0x00	0x00	0x00	0x00
0x619c28:	0x00	0x00

In D49379#1165123, @kbaladurin wrote:

Also I think we can report about decoding errors like gdb does:

Thread 3 "a.out" hit Breakpoint 1, do_bad_thing_with_location (char_ptr=0x619c20 "\001$\255", <incomplete sequence \373>, new_val=1 '\001') at main.cpp:40
40	}
(gdb) x/10xb char_ptr
0x619c20:	0x01	0x24	0xad	0xfb	0x00	0x00	0x00	0x00
0x619c28:	0x00	0x00

That's a great idea !

I think this change is good with the replacement, but if you want to improve it and implement your suggestion for errors, please do :).

This revision is now accepted and ready to land.Jul 17 2018, 8:51 AM

jankratochvil added a subscriber: jankratochvil.Jul 18 2018, 6:29 AM

What is the status? Thank you!

In D49379#1311309, @kbaladurin wrote:

What is the status? Thank you!

It looks like we signed off on your change, so you're free to commit it. :)

In D49379#1311581, @stella.stamenova wrote:

In D49379#1311309, @kbaladurin wrote:

What is the status? Thank you!

It looks like we signed off on your change, so you're free to commit it. :)

Thank you! But I haven't permissions to commit. It would be great if somebody commit it :)

An equivalent patch has been already committed in September by @labath as rL341274.

Revision Contents

Path

Size

packages/

Python/

lldbsuite/

support/

encoded_file.py

4 lines

Diff 155826

packages/Python/lldbsuite/support/encoded_file.py

Context not available.
	# If we were asked to write a `str` (in Py2) or a `bytes` (in Py3) decode it	# If we were asked to write a `str` (in Py2) or a `bytes` (in Py3) decode it
	# as unicode before attempting to write.	# as unicode before attempting to write.
	if isinstance(s, six.binary_type):	if isinstance(s, six.binary_type):
	s = s.decode(encoding)	# When decoding the string, replace invalid characters with (U+FFFD, 'REPLACEMENT CHARACTER').
		# This may result in data loss
		s = s.decode(encoding, 'replace')
	return old_write(s)	return old_write(s)
	return impl	return impl

Context not available.