This is an archive of the discontinued LLVM Phabricator instance.

Make git-clang-format python 3 compatible
ClosedPublic

Authored by EricWF on Mar 8 2017, 8:41 PM.

Details

Summary

This patch attempts to make git-clang-format both python2 and python3 compatible. Currently it only works in python2.

Diff Detail

Event Timeline

EricWF created this revision.Mar 8 2017, 8:41 PM
EricWF planned changes to this revision.Mar 8 2017, 8:47 PM

There seem to be a couple cases where it's non-trivial to convert the output from bytes to str. I'll look into this further and update.

mgorny added a subscriber: mgorny.Mar 8 2017, 11:22 PM

There seem to be a couple cases where it's non-trivial to convert the output from bytes to str. I'll look into this further and update.

Generally bytes<->str conversion should be done using .decode() and .encode(), i.e. respecting a particular character encoding in bytes.

There seem to be a couple cases where it's non-trivial to convert the output from bytes to str. I'll look into this further and update.

Generally bytes<->str conversion should be done using .decode() and .encode(), i.e. respecting a particular character encoding in bytes.

Right, and that's what I'm currently doing. The non-trivial cases are where we have an open file object referring to output from a child process instead of a string representing the full output.

kimgr added a subscriber: kimgr.Mar 9 2017, 12:16 AM

Some nits from a Python3 hacker.

tools/clang-format/git-clang-format
144

I find print('template', tail) surprising in Python3, but it could be because we don't use it in our local standards. I'd jump immediately to formatting to make this 2/3-compatible:

print('    %s' % filename)

or in this case maybe just join the strings:

print('    ' + filename)
325

This should not be necessary for iteration -- in Python3 it returns a generator instead of a list, but generators can be iterated.

501–504

No need for parentheses around the string

502

I wonder if file=sys.stderr works with Python 2.7's print statement. Have you tested this with 2.7?

You can use __future__ to get the print function behavior in 2.7 as well:
http://stackoverflow.com/questions/32032697/how-to-use-from-future-import-print-function

522–525

This looks wrong -- where does to_bytes come from?

I can never get my head around Python2's string/bytes philosophy. In Python3 you would do:

  1. stdout and stderr are always bytes stdout = stdout.decode('utf-8') stderr = stderr.decode('utf-8')

(assuming the input *is* utf-8)

Not sure how that plays with Python2, but the current code doesn't look right.

EricWF marked 4 inline comments as done.Mar 9 2017, 12:34 AM
EricWF added inline comments.
tools/clang-format/git-clang-format
325

This was done by the 2to3 tool. I'll have to see what it thinks it was up to. But I agree this seems wrong.

502

It doesn't. I think getting the new behavior from __future__ is the only reasonable way to go.

522–525

This is wrong. And it will be removed in the next set of changes.

EricWF marked 2 inline comments as done.Mar 9 2017, 12:37 AM
EricWF added inline comments.
tools/clang-format/git-clang-format
325

Ah, so this is needed because in python 3 it's a runtime error to remove items from a dictionary while iterating over the keys. So we have to make a copy first.

kimgr added inline comments.Mar 9 2017, 1:08 AM
tools/clang-format/git-clang-format
325

Oh, I see, I didn't catch that. Carry on.

EricWF updated this revision to Diff 91145.Mar 9 2017, 1:31 AM

This still isn't working in python3 due to more bytes vs str issues, but this is my progress so far.

mgorny added inline comments.Mar 9 2017, 8:10 AM
tools/clang-format/git-clang-format
293

Pretty much a nit but using variable names that match type names can be a bit confusing here.

302

Shouldn't this be:

return bytes.decode('utf-8')

?

Otherwise, unless I'm missing something this function will always return the parameter [with no modifications], either in the first conditional if it's of str type, or in the conditional inside to_bytes() if it's of bytes type.

305

Any reason not to use:

br'^\+...'

? i.e. make it bytes immediately instead of converting.

306

This logic looks really weird to me. What is the purpose of having both to_string() and convert_string()? Why do to_bytes() and to_string() use isinstance() to recognize types, and here you rely on exceptions? Why is to_string() called after decoding?

310

I don't think this can ever succeed. If the argument is not valid utf8, it certainly won't be valid ASCII.

327

Since you are using keys_to_delete now, you can remove the list().

328

Another nit. I think it'd be better to just append a single item instead of a list of 1 item ;-).

keys_to_delete.append(filename)
EricWF updated this revision to Diff 91449.Mar 10 2017, 9:00 PM

Add complete implementation.

As far as I've tested this works perfectly in both python2 and python3.

EricWF marked 4 inline comments as done.Mar 10 2017, 9:27 PM
EricWF added inline comments.
tools/clang-format/git-clang-format
293

These functions are all lifted directly from lit/util.py

302

Nope. So in python3 this line will never be hit. but it will be in python2.

Specifically convert_string calls to_string(bytes.decode('utf-8')). Which in python2 calls to_string with the type unicode. We need to de-code the unicode into a string using bytes.encode('utf-8').

I agree it's a bit misleading that it's calling a function called to_bytes though.

305

I tried rb'blah' and that didn't work in python2. However this has since been removed.

306

to_string is called after decoding because in python2 the result of decoding is a unicode type, and we need to encode it a str type. Hense to_string.

EricWF updated this revision to Diff 91452.Mar 10 2017, 9:34 PM
EricWF marked an inline comment as done.
  • rename variables so they don't conflict with type names.
mgorny added inline comments.Mar 10 2017, 11:35 PM
tools/clang-format/git-clang-format
306

No offense intended but this sounds really horrible. In modern Python, everything is either bytes or unicode. The difference basically is that str in py2 was pretty much bytes (except that bytes explicitly removes some operations that are unsuitable for bytestrings), and that str in py3 is equivalent to unicode before.

So if you are specifically converting to str, it means that you want to have two distinct types in py2/py3. Which really sounds like you're doing something wrong.

EricWF added inline comments.Mar 11 2017, 2:06 AM
tools/clang-format/git-clang-format
306

No offence taken. I had to do way to much work to answer your questions accurately which means the code is way too complicated or non-obvious.

However there are a couple of things you got wrong. In python2 everything is normally bytes, str, or unicode. The to_string method converts both unicode and bytes to the type str.

So if you are specifically converting to str, it means that you want to have two distinct types in py2/py3. Which really sounds like you're doing something wrong.

I don't think having the Python library return different types in py2/py3 means something is wrong. In fact that's exactly what's happening and that's exactly what convert_string is trying to fix. In python 2 stdout, stderr, and stdin return strings, but in python 3 they return bytes. convert_string is meant to transform these different types into str.

Regardless I'll remove the call to to_bytes from inside to_string because it's too confusing.

EricWF updated this revision to Diff 91456.Mar 11 2017, 2:08 AM
  • Rewrite to_string for clarities sake.
mgorny added inline comments.Mar 11 2017, 3:10 AM
tools/clang-format/git-clang-format
306

In Python 2, bytes and str is the same type, e.g.:

In [13]: bytes('foo').__class__
Out[13]: str

In other words, in each version of Python there are two kinds of strings: binary strings (std::string in C++) and Unicode strings (alike std::u32string). In Python 2, str is binary string (bytes is also accepted for compatibility) and unicode is the Unicode string. In Python 3, bytes is binary string and str is Unicode string.


Now, let's consider subprocess streams. In Python 2 they use str -- which means binary strings. In Python 3 they use bytes -- i.e. also binary strings. So there's really no change here, except that Python 3 is more strict in handling different types.

So if you plan to use the data as binary data, you can just use the data directly. If you need to use Unicode strings, you can reliably use .decode() and .encode() to convert. If you really believe you need to use str (i.e. two distinct types at the same time), it just means that the code is most likely wrong and partially relies on characteristic of one type, and partially on the other.


For completeness, I should also mention that string types used by standard system streams (i.e. sys.std*) *have* changed between Python 2 and 3. Python 2 is operating them in bytes mode by default, while Python 3 is operating them (transcoding) in str (i.e. unicode) mode. In the latter case, you can use the .buffer attribute to access the underlying bytes stream -- e.g. sys.stdin.buffer will yield bytes.

EricWF updated this revision to Diff 96025.Apr 20 2017, 2:50 PM

I've committed all but the str vs byte changes upstream.

jbcoe accepted this revision.Apr 21 2017, 2:45 AM

LGTM

This revision is now accepted and ready to land.Apr 21 2017, 2:45 AM
EricWF closed this revision.May 25 2017, 8:24 AM