This patch attempts to make git-clang-format both python2 and python3 compatible. Currently it only works in python2.
There seem to be a couple cases where it's non-trivial to convert the output from bytes to str. I'll look into this further and update.
Generally bytes<->str conversion should be done using .decode() and .encode(), i.e. respecting a particular character encoding in bytes.
Right, and that's what I'm currently doing. The non-trivial cases are where we have an open file object referring to output from a child process instead of a string representing the full output.
Some nits from a Python3 hacker.
I find print('template', tail) surprising in Python3, but it could be because we don't use it in our local standards. I'd jump immediately to formatting to make this 2/3-compatible:
print(' %s' % filename)
or in this case maybe just join the strings:
print(' ' + filename)
This should not be necessary for iteration -- in Python3 it returns a generator instead of a list, but generators can be iterated.
No need for parentheses around the string
I wonder if file=sys.stderr works with Python 2.7's print statement. Have you tested this with 2.7?
You can use __future__ to get the print function behavior in 2.7 as well:
This looks wrong -- where does to_bytes come from?
I can never get my head around Python2's string/bytes philosophy. In Python3 you would do:
(assuming the input *is* utf-8)
Not sure how that plays with Python2, but the current code doesn't look right.
This was done by the 2to3 tool. I'll have to see what it thinks it was up to. But I agree this seems wrong.
It doesn't. I think getting the new behavior from __future__ is the only reasonable way to go.
This is wrong. And it will be removed in the next set of changes.
Ah, so this is needed because in python 3 it's a runtime error to remove items from a dictionary while iterating over the keys. So we have to make a copy first.
This still isn't working in python3 due to more bytes vs str issues, but this is my progress so far.
Pretty much a nit but using variable names that match type names can be a bit confusing here.
Shouldn't this be:
Otherwise, unless I'm missing something this function will always return the parameter [with no modifications], either in the first conditional if it's of str type, or in the conditional inside to_bytes() if it's of bytes type.
Any reason not to use:
? i.e. make it bytes immediately instead of converting.
This logic looks really weird to me. What is the purpose of having both to_string() and convert_string()? Why do to_bytes() and to_string() use isinstance() to recognize types, and here you rely on exceptions? Why is to_string() called after decoding?
I don't think this can ever succeed. If the argument is not valid utf8, it certainly won't be valid ASCII.
Since you are using keys_to_delete now, you can remove the list().
Another nit. I think it'd be better to just append a single item instead of a list of 1 item ;-).
Add complete implementation.
As far as I've tested this works perfectly in both python2 and python3.
These functions are all lifted directly from lit/util.py
Nope. So in python3 this line will never be hit. but it will be in python2.
Specifically convert_string calls to_string(bytes.decode('utf-8')). Which in python2 calls to_string with the type unicode. We need to de-code the unicode into a string using bytes.encode('utf-8').
I agree it's a bit misleading that it's calling a function called to_bytes though.
I tried rb'blah' and that didn't work in python2. However this has since been removed.
to_string is called after decoding because in python2 the result of decoding is a unicode type, and we need to encode it a str type. Hense to_string.
No offense intended but this sounds really horrible. In modern Python, everything is either bytes or unicode. The difference basically is that str in py2 was pretty much bytes (except that bytes explicitly removes some operations that are unsuitable for bytestrings), and that str in py3 is equivalent to unicode before.
So if you are specifically converting to str, it means that you want to have two distinct types in py2/py3. Which really sounds like you're doing something wrong.
No offence taken. I had to do way to much work to answer your questions accurately which means the code is way too complicated or non-obvious.
However there are a couple of things you got wrong. In python2 everything is normally bytes, str, or unicode. The to_string method converts both unicode and bytes to the type str.
I don't think having the Python library return different types in py2/py3 means something is wrong. In fact that's exactly what's happening and that's exactly what convert_string is trying to fix. In python 2 stdout, stderr, and stdin return strings, but in python 3 they return bytes. convert_string is meant to transform these different types into str.
Regardless I'll remove the call to to_bytes from inside to_string because it's too confusing.
In Python 2, bytes and str is the same type, e.g.:
In : bytes('foo').__class__ Out: str
In other words, in each version of Python there are two kinds of strings: binary strings (std::string in C++) and Unicode strings (alike std::u32string). In Python 2, str is binary string (bytes is also accepted for compatibility) and unicode is the Unicode string. In Python 3, bytes is binary string and str is Unicode string.
Now, let's consider subprocess streams. In Python 2 they use str -- which means binary strings. In Python 3 they use bytes -- i.e. also binary strings. So there's really no change here, except that Python 3 is more strict in handling different types.
So if you plan to use the data as binary data, you can just use the data directly. If you need to use Unicode strings, you can reliably use .decode() and .encode() to convert. If you really believe you need to use str (i.e. two distinct types at the same time), it just means that the code is most likely wrong and partially relies on characteristic of one type, and partially on the other.
For completeness, I should also mention that string types used by standard system streams (i.e. sys.std*) *have* changed between Python 2 and 3. Python 2 is operating them in bytes mode by default, while Python 3 is operating them (transcoding) in str (i.e. unicode) mode. In the latter case, you can use the .buffer attribute to access the underlying bytes stream -- e.g. sys.stdin.buffer will yield bytes.