Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode stderr bytes before logging #33

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

goodboy
Copy link
Member

@goodboy goodboy commented Dec 4, 2017

On py3 we get bytes back from the stderr output. Decode to a native str before logging to keep a readable output.

@goodboy goodboy requested review from vodik and wdoekes December 4, 2017 15:54
# log stderr
stderr = proc.streams.stderr # bytes in py3
stderr = stderr.decode() if isinstance(stderr, bytes) else stderr
emit("stderr for '{}' @ {}\n{}\n".format(ua.name, ua.srcaddr, stderr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels wrong.

stderr was also bytes in py2 (non-unicode-str; there is no future import unicode_literals). But once you decode it, it would be unicode, and conflict with the emit(bytestring.format())

$ cat 1.py 
from subprocess import Popen, PIPE
p = Popen(r"/usr/bin/printf '\xc3\xa1' >&2", shell=True, stderr=PIPE, stdout=PIPE)
out, err = p.communicate()
print(repr(err))
stderr = err
print(repr(stderr.decode()))
$ python 1.py 
'\xc3\xa1'
Traceback (most recent call last):
  File "1.py", line 6, in <module>
    print(repr(stderr.decode()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
$ python3 1.py 
b'\xc3\xa1'
'á'

But this should work:

stderr = stderr.decode() if not isinstance(stderr, str) else stderr

^-- py2: str==bytes=>noop, py3: str!=bytes=>decode

And the bytestring.format() doesn't choke either.

Howver. Always decoding a bytestring as utf-8 is not so safe. Better do decode with 'replace':

>>> bytes([0x41, 0x81, 0xc3, 0xa1]).decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 1: invalid start byte
>>> bytes([0x41, 0x81, 0xc3, 0xa1]).decode('utf-8', 'replace')
'A�á'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the default encoding changed from ascii to utf-8 with python3.

To go back to your code snippet above, this works on both py2 and py3:

from subprocess import Popen, PIPE
p = Popen(r"/usr/bin/printf '\xc3\xa1' >&2", shell=True, stderr=PIPE, stdout=PIPE)
out, err = p.communicate()
print(repr(err))
stderr = err
print(repr(stderr.decode('utf-8', 'replace')))

When run:

$ python2 ./foo.py 
'\xc3\xa1'
u'\xe1'
$ python3 ./foo.py 
b'\xc3\xa1'
'á'

So you don't even need the isinstance check.

Copy link
Member

@vodik vodik Dec 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errata: its utf-8 since python3.5.

And probably more future proof too since python 3.7 is going to lean on utf-8 more:

I mean, its my humble opinion that if your system isn't set up to support utf8 these days, you're running a misconfigured system 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you don't even need the isinstance check.

Well, we still do, because:

print('{}'.format(stderr.decode('utf-8', 'replace')))
Traceback (most recent call last):
  File "1.py", line 6, in <module>
    print('{}'.format(stderr.decode('utf-8', 'replace')))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

The emit() gets a bytestring in py2, which doesn't do unicode.

I mean, its my humble opinion that if your system isn't set up to support utf8 these days, you're running a misconfigured system

Absolutely, but we don't want to error out because an application sends invalid utf-8.

Copy link
Member

@vodik vodik Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because you can't embed a unicode string into a ascii format string (well, implicit encoding happens). Adding a u to the format string also makes it work on both:

print(u'{}'.format(stderr.decode('utf-8', 'replace')))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point exactly :)

Although I did not dive into emit() to see if it takes an unicodestring in py2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so I think we're on the same page.

Just personally I think I'd rather see the underlying format string adjusted than a if isinstance(...) expression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants