Handling binary documents with ASCII-compatible markup

Python 3 has bytes, which are sequences of 8-bit integers, and str, which are sequences of Unicode code points. To go from one to the other, you need to encode or decode, giving an explicit encoding. Many protocols have ASCII markup around data in some other encoding that you don't know. If you know that other encoding is ASCII-compatible, it is useful to be able to parse, split, etc. the markup, while passing the payload through untouched.
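For illustration, here is a made-up record where the markup ("filename: ", the CRLF) is ASCII but the payload contains a byte 0xe9 in some unknown, ASCII-compatible encoding; a strict decode falls over as soon as it reaches the payload:

>>> record = b"filename: r\xe9sum\xe9.txt\r\n"
>>> str(record, encoding="ascii")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

What we want is to decode just enough to manipulate the ASCII markup as str, while shipping the non-ASCII bytes back out exactly as they came in.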

An initial search on the Internet brought up an article by Eric S. Raymond that touches on this, and suggests decoding the data as ISO-8859-1 and handling it as str; the payload can then be losslessly recovered by re-encoding it. ISO-8859-1 has the property that each byte of the input maps to exactly one code point, and that all 256 possible byte values are mapped to a code point. As a result, the following round-trip is lossless:

>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True

A few days later I came across an article by Alyssa Coghlan, which mentions the same idea, and also the existence of the errors="surrogateescape" (PEP 383) error handler (see also codecs in the Python documentation), which is designed to allow exactly what I needed:

>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
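Under the hood, surrogateescape smuggles each undecodable byte into the str as a lone surrogate in the U+DC80–U+DCFF range, and the same handler turns it back into the original byte on encoding:

>>> str(b"abc\xe9", encoding="ascii", errors="surrogateescape")
'abc\udce9'
>>> 'abc\udce9'.encode("ascii", errors="surrogateescape")
b'abc\xe9'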

Alyssa Coghlan has some discussion about the merits of each approach; I can't really say that functionally they have any meaningful difference. As she points out, if you do anything but re-encode the non-ASCII parts with the same codec, you risk getting Mojibake (or, if you're lucky, an explicit UnicodeDecodeError or UnicodeEncodeError rather than silently producing garbage).
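To make that concrete, here is a small sketch (the payload is UTF-8-encoded text, purely for illustration): re-encoding with a different codec than the one used for decoding silently produces different bytes, which is the classic Mojibake; with surrogateescape, the same mistake at least blows up instead:

>>> payload = "café".encode("utf-8")
>>> str(payload, encoding="iso-8859-1")
'cafÃ©'
>>> str(payload, encoding="iso-8859-1").encode("utf-8")
b'caf\xc3\x83\xc2\xa9'
>>> str(payload, encoding="ascii", errors="surrogateescape").encode("utf-8")
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 3: surrogates not allowed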

Performance-wise, let's see:

>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""", setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", setup="import random; by = random.randbytes(10_000)")
125.00223343299876

That's… a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it especially efficient (it maps bytes 0x80–0xff to Unicode code points of the same numeric value, so there is no translation cost, and the resulting in-memory representation is more compact). Trying increasing sizes:

>>> import random
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
... 
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676

That seems to grow faster than O(n): the duration roughly triples when the size doubles, which corresponds to an exponent of log₂ 3 ≈ 1.6. Is it growing like O(n^1.5)?

In contrast, the ISO-8859-1 method seems to have O(n) complexity:

>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
... 
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647
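Putting it all together, here is a minimal sketch of the intended use (the frame format and field names are made up for illustration): decode with ISO-8859-1, parse the ASCII markup as str, and re-encode the payload with the same codec so that it comes back byte-for-byte identical:

>>> frame = b"Content-Type: application/octet-stream\r\nName: r\xe9sum\xe9\r\n\r\n\x00\x01\x82binary payload\xff"
>>> text = str(frame, encoding="iso-8859-1")
>>> headers, _, body = text.partition("\r\n\r\n")
>>> dict(line.split(": ", 1) for line in headers.split("\r\n"))
{'Content-Type': 'application/octet-stream', 'Name': 'résumé'}
>>> body.encode("iso-8859-1")
b'\x00\x01\x82binary payload\xff'

The decoded header value only looks right because this particular payload happens to be Latin-1; with another ASCII-compatible encoding the non-ASCII characters would display as nonsense, but the markup would still parse correctly and the payload would still round-trip unchanged.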

Author: Frédéric Perrin

Date: Friday, 9 August 2024

Unless otherwise stated, the texts on this site are licensed under Creative Commons BY-SA.

This site is produced with 100% grain-fed, free-range free software.