-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode stderr bytes before logging #33
base: master
Are you sure you want to change the base?
Conversation
# log stderr | ||
stderr = proc.streams.stderr # bytes in py3 | ||
stderr = stderr.decode() if isinstance(stderr, bytes) else stderr | ||
emit("stderr for '{}' @ {}\n{}\n".format(ua.name, ua.srcaddr, stderr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels wrong.
stderr was also bytes in py2 (non-unicode-str; there is no future import unicode_literals). But once you decode it, it would be unicode, and conflict with the emit(bytestring.format())
$ cat 1.py
from subprocess import Popen, PIPE
p = Popen(r"/usr/bin/printf '\xc3\xa1' >&2", shell=True, stderr=PIPE, stdout=PIPE)
out, err = p.communicate()
print(repr(err))
stderr = err
print(repr(stderr.decode()))
$ python 1.py
'\xc3\xa1'
Traceback (most recent call last):
File "1.py", line 6, in <module>
print(repr(stderr.decode()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
$ python3 1.py
b'\xc3\xa1'
'á'
But this should work:
stderr = stderr.decode() if not isinstance(stderr, str) else stderr
^-- py2: str==bytes=>noop, py3: str!=bytes=>decode
And the bytestring.format() doesn't choke either.
Howver. Always decoding a bytestring as utf-8 is not so safe. Better do decode with 'replace':
>>> bytes([0x41, 0x81, 0xc3, 0xa1]).decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 1: invalid start byte
>>> bytes([0x41, 0x81, 0xc3, 0xa1]).decode('utf-8', 'replace')
'A�á'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because the default encoding changed from ascii to utf-8 with python3.
To go back to your code snippet above, this works on both py2 and py3:
from subprocess import Popen, PIPE
p = Popen(r"/usr/bin/printf '\xc3\xa1' >&2", shell=True, stderr=PIPE, stdout=PIPE)
out, err = p.communicate()
print(repr(err))
stderr = err
print(repr(stderr.decode('utf-8', 'replace')))
When run:
$ python2 ./foo.py
'\xc3\xa1'
u'\xe1'
$ python3 ./foo.py
b'\xc3\xa1'
'á'
So you don't even need the isinstance check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errata: its utf-8 since python3.5.
And probably more future proof too since python 3.7 is going to lean on utf-8 more:
I mean, its my humble opinion that if your system isn't set up to support utf8 these days, you're running a misconfigured system 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you don't even need the isinstance check.
Well, we still do, because:
print('{}'.format(stderr.decode('utf-8', 'replace')))
Traceback (most recent call last):
File "1.py", line 6, in <module>
print('{}'.format(stderr.decode('utf-8', 'replace')))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
The emit() gets a bytestring in py2, which doesn't do unicode.
I mean, its my humble opinion that if your system isn't set up to support utf8 these days, you're running a misconfigured system
Absolutely, but we don't want to error out because an application sends invalid utf-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because you can't embed a unicode string into a ascii format string (well, implicit encoding happens). Adding a u
to the format string also makes it work on both:
print(u'{}'.format(stderr.decode('utf-8', 'replace')))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point exactly :)
Although I did not dive into emit() to see if it takes an unicodestring in py2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so I think we're on the same page.
Just personally I think I'd rather see the underlying format string adjusted than a if isinstance(...)
expression.
On py3 we get
bytes
back from the stderr output. Decode to a nativestr
before logging to keep a readable output.