This is an archive of the discontinued LLVM Phabricator instance.

[lit] Fix some convoluted logic around Unicode encoding, and de-duplicate across modules that used it.
ClosedPublic

Authored by dlj on Jun 28 2017, 5:33 PM.

Details

Summary

In Python2 and Python3, the various (non-)?Unicode string types are sort of
spaghetti. Python2 has unicode support tacked on via the 'unicode' type, which
is distinct from 'str' (which are bytes). Python3 takes the "unicode-everywhere"
approach, with 'str' representing a Unicode string.

Both have a 'bytes' type. In Python3, it is the only way to represent raw bytes.
However, in Python2, 'bytes' is an alias for 'str'. This leads to interesting
problems when an interface requires a precise type, but has to run under both
Python2 and Python3.

The previous logic appeared to be correct in all cases, but went through more
layers of indirection than necessary. This change does the necessary conversions
in one shot, with documentation about which paths might be taken in Python2 or
Python3.

Diff Detail

Repository
rL LLVM

Event Timeline

dlj created this revision.Jun 28 2017, 5:33 PM
modocache accepted this revision.Jun 28 2017, 5:55 PM

LGTM! Thanks for all the cleanups!

utils/lit/lit/util.py
51 ↗(On Diff #104554)

Typo, I think: "uncode" should be "unicode".

This revision is now accepted and ready to land.Jun 28 2017, 5:55 PM
dlj updated this revision to Diff 104565.Jun 28 2017, 6:03 PM
dlj marked an inline comment as done.
  • Fix spelling: uncode -> unicode.
utils/lit/lit/util.py
51 ↗(On Diff #104554)

Ah, good catch. You are correct.

This revision was automatically updated to reflect the committed changes.

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

dlj added a comment.Jun 28 2017, 7:23 PM

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

Sorry for the trouble, reverted.

dlj added a comment.Jun 28 2017, 9:38 PM
In D34793#794765, @dlj wrote:

Seems incompatible to py3.
http://bb.pgr.jp/builders/test-llvm-i686-linux-RA/builds/3747

I am investigating. Excuse me if I would revert this.

Sorry for the trouble, reverted.

I re-applied this in r306643. There are some tests that simply yield binary output, and I was missing a conversion in googletest. I verified with Python 2.7 and Python 3.4.3, both on Linux.

chapuni added a subscriber: MatzeB.Jun 28 2017, 9:50 PM

@dlj Great, thanks!

Seems it also fixes D34464.

The change LGTM, but please keep it on topic (see below).

llvm/trunk/utils/lit/lit/util.py
70–71

This isn't really related to str vs unicode is it? Can you please commit such changes separately and not "hide" them in a change such as this so we can revert and reason about them separately if necessary?
(That said this particular change seems simple enough to me that post-commit review is enough without phab).

dlj marked an inline comment as done.Jun 28 2017, 10:57 PM
dlj added inline comments.
llvm/trunk/utils/lit/lit/util.py
70–71

It's actually quite related... the capture() function basically did nothing but call subprocess.Popen, then pass the output through the strange encode/decode loop. So this was the only remaining (indirect) usage of convert_string.

dlj marked an inline comment as done.Jun 28 2017, 11:11 PM

@dlj Great, thanks!

Seems it also fixes D34464.

Interesting... to_string now has to fall back to str(bytes) in Python3 when there is an invalid input. In that case, the resulting string looks more like the output of repr(), which is not what one would want for a filename.

It's not clear to me why Python's behaviour of treating *filenames* as unicode is actually the right choice.

Strictly speaking, I think the only well-defined filename encoding that covers all platforms targeted by Clang is the one defined by the Posix spec:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282

(But of course, our supported OSes do support broader character sets.)

I'll think more about what to_string should do, but I'll also leave a comment on the other review thread.

MatzeB added inline comments.Jun 29 2017, 10:23 AM
llvm/trunk/utils/lit/lit/util.py
70–71

ok, makes sense.