This is an archive of the discontinued LLVM Phabricator instance.

lldbsuite: ignore decoding errors in _encoded_write
ClosedPublic

Authored by kbaladurin on Jul 16 2018, 8:05 AM.

Details

Summary

This patch fixes random test fails due to UnicodeDecodeError:

  ...
   File "lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2070, in runCmd
    print(self.res.GetError(), file=sbuf)
  File "lldb/packages/Python/lldbsuite/test/lldbtest.py", line 293, in __exit__
    print(self.getvalue(), file=self.session)
  File "lldb/packages/Python/lldbsuite/support/encoded_file.py", line 34, in impl
    s = s.decode(encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 257: invalid start byte

This error occurs when we try to decode string that contains garbage, for example:

runCmd: process continue
output: Process 579 resuming
Process 579 stopped
* thread #2, name = 'test_hello_watc', stop reason = watchpoint 1
    frame #0: 0x000000000040150f func(char_ptr="\x01$\xad�, new_val='\x01') at main.cpp:40
   37           unsigned what = new_val;
   38           printf("new value written to location(%p) = %u\n", char_ptr, what);
   39           *char_ptr = new_val;
-> 40       }
   41
   42       uint32_t
   43       access_pool (bool flag = false)

In such cases we should ignore decoding errors.

Diff Detail

Event Timeline

kbaladurin created this revision.Jul 16 2018, 8:05 AM

Please do not commit this yet. I've seen errors like this other places and this was generally not the fix - ignoring the errors in a lot of these cases would have resulted in the tests not actually verifying what they were meant check. I'd like to walk through the test case and see if I can figure out whether there's a different solution.

stella.stamenova requested changes to this revision.Jul 16 2018, 2:40 PM

So I went through the code that leads here and I think long-term we need a better solution. We don't really want to be losing data from the session logs (which is what this is writing) and we have similar code for handling Unicode decode/encode in the lit code (so we should merge the two).

Having said that, I think there is a short-term solution that is viable:

  1. You should use 'replace' rather than 'ignore', so that we can at least see in the log that some characters were lost
  2. You should add a comment before the decode along the lines of: When decoding the string, replace invalid characters with (U+FFFD, ‘REPLACEMENT CHARACTER’). This may result in data loss
This revision now requires changes to proceed.Jul 16 2018, 2:40 PM

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

I've been debating this with myself yesterday. In the case you encountered, the issue is that the c string we read from the inferior is not valid utf8. This is kind of a problem because it means that it can corrupt subsequent lldb output (most likely the closing " character).

One possibility here would be to do the replacement character substitution directly when printing out strings read from inferior memory. However, I couldn't decide for myself whether I want my debugger to mess with these strings or not...

Thank you for comments, I've updated patch. I agree that it is not complete solution, maybe better to make all tests unicode safe.

I've been debating this with myself yesterday. In the case you encountered, the issue is that the c string we read from the inferior is not valid utf8. This is kind of a problem because it means that it can corrupt subsequent lldb output (most likely the closing " character).

One possibility here would be to do the replacement character substitution directly when printing out strings read from inferior memory. However, I couldn't decide for myself whether I want my debugger to mess with these strings or not...

Also I think we can report about decoding errors like gdb does:

Thread 3 "a.out" hit Breakpoint 1, do_bad_thing_with_location (char_ptr=0x619c20 "\001$\255", <incomplete sequence \373>, new_val=1 '\001') at main.cpp:40
40	}
(gdb) x/10xb char_ptr
0x619c20:	0x01	0x24	0xad	0xfb	0x00	0x00	0x00	0x00
0x619c28:	0x00	0x00

Also I think we can report about decoding errors like gdb does:

Thread 3 "a.out" hit Breakpoint 1, do_bad_thing_with_location (char_ptr=0x619c20 "\001$\255", <incomplete sequence \373>, new_val=1 '\001') at main.cpp:40
40	}
(gdb) x/10xb char_ptr
0x619c20:	0x01	0x24	0xad	0xfb	0x00	0x00	0x00	0x00
0x619c28:	0x00	0x00

That's a great idea !

stella.stamenova accepted this revision.Jul 17 2018, 8:51 AM

I think this change is good with the replacement, but if you want to improve it and implement your suggestion for errors, please do :).

This revision is now accepted and ready to land.Jul 17 2018, 8:51 AM

What is the status? Thank you!

What is the status? Thank you!

It looks like we signed off on your change, so you're free to commit it. :)

What is the status? Thank you!

It looks like we signed off on your change, so you're free to commit it. :)

Thank you! But I haven't permissions to commit. It would be great if somebody commit it :)

An equivalent patch has been already committed in September by @labath as rL341274.