Hacker Read

ltbarcly3 | karma 4105 | avg karma 4.68 · 2020-04-20 23:04:26

I guess it seems hard, but if you understand the basics it's just a matter of finding IO and encoding/decoding things appropriately. Python 2 str, unicode and Python 3 str are almost exactly the same as they ever were, and if you are doing binary IO, bytes should work as expected.

It's not that big of a project, even for a moderately large codebase. If you think you can't get it done in a reasonable period of time feel free to hire me as a contractor and I'll knock it out.

int_19h | karma 21203 | avg karma 1.69 · 2020-04-21 05:58:32+00:00

The real problem in practice is all the places that relied on implicit bytes <-> unicode conversions. These are also likely to be broken on unusual inputs due to ASCII being the default encoding, rather than user locale... but that kind of thing is depressingly common even so.

ltbarcly3 | karma 4105 | avg karma 4.68 · 2020-04-21 23:04:11+00:00

there isn't an implicit bytes/unicode conversion in python2.

Do you mean calling unicode('some non unicode string')? That just uses the system default encoding (sys.getdefaultencoding(), which is ASCII unless something changed it with sys.setdefaultencoding()). Just find and replace them and slap in a .decode('UTF-8') or whatever your default encoding was in python2.

Grep for the strings encode, decode, unicode and just mechanically fix them one at a time by making the old implicit behavior explicit. How many times could you be doing that anyway? A few hundred? A thousand? You could even script this pretty reliably and just page through the diff you end up with to eyeball them one at a time.
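A rough sketch of that grep pass as a Python 3 script, for anyone who prefers it to shell. The function name find_candidates and the exact word list are my own assumptions; it only flags lines for review and does not rewrite anything.

```python
import re
from pathlib import Path

# Words worth reviewing by hand. This list is an assumption; extend it as needed.
PATTERN = re.compile(r"\b(encode|decode|unicode)\b")


def find_candidates(root):
    """Yield (path, line_number, stripped_line) for every .py line that
    mentions encode, decode, or unicode as a whole word."""
    for path in sorted(Path(root).rglob("*.py")):
        # errors="replace" so an oddly encoded source file does not stop the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                yield str(path), number, line.strip()
```

Looping over find_candidates("path/to/project") and printing each hit gives you the list to page through one at a time.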

I guess you might mean 'str' + u'unicodestr' or something, but again you can find these pretty easily by rooting out where the non-unicode strings are being produced and fixing the problem there. They are either literals or they are coming from IO or calls to str, right? Anyway, I've done this quite a few times and the main concern I've always had was trying to get the patch in place before people commit too much stuff for me to be able to merge the fixed up branch.

Of course, you could just do it little by little: monkeypatch the default decoding in sys to log a traceback whenever the system default encoding is used, then whack those places as they come up, so you end up with properly handled, explicit unicode decoding before you move away from python2. I bet you could find and fix 95% of the cases in a day of effort.
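One way the logging idea could look, as a minimal sketch rather than a tested recipe: register a codec that behaves like ASCII but records a stack trace on every use. The codec name logged_ascii and the CALLS list are made up for illustration. The block below runs on Python 3, where it only demonstrates the codec itself. On Python 2 the remaining step would be to point the default encoding at it, which I am assuming works via reload(sys) followed by sys.setdefaultencoding("logged_ascii"); that part is not exercised here.

```python
import codecs
import traceback

# Each entry is the formatted stack of one call into the logging codec.
CALLS = []


def _search(name):
    """Codec search function: serve only the hypothetical 'logged_ascii'."""
    if name != "logged_ascii":
        return None
    base = codecs.lookup("ascii")

    def decode(data, errors="strict"):
        CALLS.append("".join(traceback.format_stack(limit=5)))
        return base.decode(data, errors)

    def encode(text, errors="strict"):
        CALLS.append("".join(traceback.format_stack(limit=5)))
        return base.encode(text, errors)

    return codecs.CodecInfo(name="logged_ascii", encode=encode, decode=decode)


codecs.register(_search)

# Python 2 only, and an assumption on my part (not run here):
#   reload(sys); sys.setdefaultencoding("logged_ascii")

text = b"abc".decode("logged_ascii")   # behaves like ASCII, but leaves a trace
```

Writing CALLS out to a file instead of keeping it in memory gives you the log to work through.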

int_19h | karma 21203 | avg karma 1.69 · 2020-04-21 23:35:35+00:00

('str' + u'unicodestr') involves an implicit conversion of the bytes operand on the left side to unicode, yes, so clearly Python 2 has it. And it's not the only such case: the fundamental problem is that the Python/C API for Python 2 implements this for the standard argument parsing functions. So basically any function implemented in native code that invokes PyArg_ParseTuple with the "s" format will also accept Unicode objects, and will implicitly encode them with the default encoding, i.e. ASCII. IIRC the reverse is true for "u". So those conversions happen every time you cross from Python into native code - and all built-in data types, operators, and functions are native, as are large swaths of the standard library.
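To make the concatenation case concrete, here is a small Python 3 sketch that spells out what Python 2 did silently. The helper py2_style_concat is hypothetical and only covers the + operator, assuming the default ASCII encoding.

```python
def py2_style_concat(left, right):
    """Mimic Python 2's 'str' + u'unicode': any bytes operand is first
    decoded as ASCII, then the two text values are joined."""
    if isinstance(left, bytes):
        left = left.decode("ascii")
    if isinstance(right, bytes):
        right = right.decode("ascii")
    return left + right


joined = py2_style_concat(b"str", "unicodestr")

# Non-ASCII bytes are where the implicit version falls over.
try:
    py2_style_concat("caf\u00e9".encode("utf-8"), " au lait")
    broke_on_unusual_input = False
except UnicodeDecodeError:
    broke_on_unusual_input = True

# Python 3 refuses to guess and raises TypeError instead.
try:
    b"str" + "unicodestr"
    py3_refuses = False
except TypeError:
    py3_refuses = True
```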

And yeah, u"" literals aren't hard to find, but the problem is when you get data flowing from different sources, so both sides are variables. For example, one is read from a text file, and another one comes from parsed JSON - so the former is raw bytes, and the latter is Unicode - and you need to combine them together. Like you said, the proper way to do this is to ensure that as soon as data crosses the I/O boundary, it should be of the correct type (i.e. unicode rather than bytes) - which, ironically, is exactly what Python 3 encourages with its changes. But it can be hard to find all such places - you have to actually audit every use of I/O one by one, because in Python 2, the code by itself doesn't always reflect whether it's supposed to be dealing with text or binary data.
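A minimal sketch of the "decode at the boundary" rule for the file plus JSON case, in Python 3. The helper read_text, the UTF-8 default, and the sample data are assumptions for illustration; a real file's encoding has to come from whoever produced it.

```python
import io
import json


def read_text(stream, encoding="utf-8"):
    """Decode once, right where the bytes enter the program, and return text."""
    return stream.read().decode(encoding)


# Stand-ins for a file opened in binary mode and a JSON payload (both made up).
file_like = io.BytesIO("Zo\u00eb".encode("utf-8"))
payload = json.loads('{"greeting": "hello"}')

name = read_text(file_like)                   # bytes -> text at the boundary
message = payload["greeting"] + ", " + name   # both operands are already text
```

Past that one decode call, nothing downstream has to care where either value came from.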

ltbarcly3 | karma 4105 | avg karma 4.68 · 2020-04-29 01:32:05

So I'm back to my idea of monkeypatching the standard libraries to log to a file wherever they resort to the default encoding or an implicit decode, then fixing those spots one by one until the log file stays empty. After a couple of months of not seeing any, you just assume you found them all. You can safely bet that you'll find the last couple in a couple more months, but by then you will already have been on Python 3 for a few months.