Now encodes line mapping.
As a module-level list named `__mapping__`, and as a compressed delta
encoding for production use as ``__gzmapping__``.

The compressed version is the base-85 representation of a gzipped
stream of packed integer bytes defining the difference in line number
from line to line.
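
For illustration, a minimal round-trip sketch of the scheme described above, assuming Python 3 and that every delta fits in a single byte (0 to 255). The names `encode_deltas` and `decode_deltas` are hypothetical and not part of cinje, and `bytes(deltas)` stands in for the per-character Latin-1 packing used in the commit.

```python
from base64 import b85decode, b85encode
from gzip import compress, decompress


def encode_deltas(numbers):
    """Delta-encode line numbers, pack one byte per delta, gzip, then base-85."""
    previous = 0
    deltas = []
    for value in numbers:
        deltas.append(value - previous)
        previous = value
    # bytes() raises ValueError if any delta falls outside 0..255.
    return b85encode(compress(bytes(deltas))).decode('ascii')


def decode_deltas(text):
    """Invert encode_deltas by accumulating the deltas back into line numbers."""
    total = 0
    numbers = []
    for delta in decompress(b85decode(text)):
        total += delta
        numbers.append(total)
    return numbers


mapping = [0, 2, 2, 2, 5, 6, 6, 9]
assert decode_deltas(encode_deltas(mapping)) == mapping
```

The one-byte-per-delta assumption is the weak point: a jump of more than 255 lines, or a line number that decreases, cannot be packed this way.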
amcgregor committed Dec 6, 2015
1 parent 6dd1800 commit 647d44a
Showing 1 changed file with 24 additions and 7 deletions.
31 changes: 24 additions & 7 deletions cinje/block/module.py
@@ -1,9 +1,27 @@
# encoding: utf-8

from __future__ import unicode_literals

from gzip import compress, decompress
from base64 import b85encode, b85decode
from pprint import pformat
from collections import deque

from ..util import py, Line
from ..util import py, Line, iterate


def red(numbers):
    """Encode the deltas to reduce entropy."""

    line = 0
    deltas = []

    for value in numbers:
        deltas.append(value - line)
        line = value

    return b85encode(compress(b''.join(chr(i).encode('latin1') for i in deltas))).decode('latin1')



class Module(object):
@@ -55,15 +73,14 @@ def __call__(self, context):
context.templates = []

# Snapshot the line number mapping.
# TODO: Run-length encode the line number deltas, 'cause damn, this is a lot of data.
mapping = deque(context.mapping)
mapping.reverse()

mapping = deque(pformat(list(mapping), indent=0, width=105).split('\n'))

yield Line(0, '')
yield Line(0, '__mapping__ = ' + mapping.popleft())
for line in mapping:
yield Line(0, line)

if __debug__:


crosoftzach Mar 10, 2021

This check appears to cause latin-1 encoding to occur when debug is False; special chars will throw an exception not thrown when debug is True.


amcgregor Mar 10, 2021

Author Member

🤔 Is there evidence to back this up? Execution with PYTHONOPTIMIZE set (or -O flag provided) is unusual. My own testing and triage of issues such as #30 does not utilize these flags, largely to ensure the majority of the code flow is tested.

  • This check appears to cause latin-1 encoding to occur when debug is False.

What evidence is there of this? And does this condition only hold true for Python 2? Referring to the generated code from the linked issue, "native" quoted strings (without a b prefix) are used under Python 3, meaning Unicode text. No encoding should take place.

  • Special chars will throw an exception not thrown when debug is True.

Do you have a demonstration of one of these exceptions? If it's at all related to the linked issue, I'm super curious!

Thanks!

Edited to add: Cinje encoding is equivalent to UTF-8 encoding, after translation, as a heads up! Ensure your cinje templates are UTF-8 encoded to prevent issues.
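
One way to check a template before handing it to cinje is to test whether its raw bytes decode as UTF-8. A minimal sketch, assuming Python 3; `decodes_as_utf8` is a hypothetical helper and not part of cinje, and the sample bytes are made up for illustration:

```python
def decodes_as_utf8(data):
    """Return True if the raw template bytes are valid UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical template source, once encoded as UTF-8 and once as Latin-1.
source = '# encoding: cinje\n\ncaf\u00e9\n'

assert decodes_as_utf8(source.encode('utf-8'))
assert not decodes_as_utf8(source.encode('latin-1'))
```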


crosoftzach Mar 10, 2021

I read this code wrong (as an else); you're right, the debug state doesn't seem to be the issue. Here's what we get on some systems:

/usr/local/lib/python3.8/codecs.py:322: in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
        data       = b'# encoding: cinje\n\n: def render_verification_email_text token, host\n    : url = host+"/user/verify_email?token="+...'
        final      = True
        input      = b''
        self       = <cinje.encoding.CinjeIncrementalDecoder object at 0x7fd489fbf730>
/usr/local/lib/python3.8/site-packages/cinje/encoding.py:29: in _buffer_decode
    output = transform(bytes(input).decode('utf8', errors))
        errors     = 'strict'
        final      = True
        input      = b'# encoding: cinje\n\n: def render_verification_email_text token, host\n    : url = host+"/user/verify_email?token="+...'
        self       = <cinje.encoding.CinjeIncrementalDecoder object at 0x7fd489fbf730>
/usr/local/lib/python3.8/site-packages/cinje/encoding.py:15: in transform
    return '\n'.join(str(i) for i in translator.stream)
        input      = '# encoding: cinje\n\n: def render_verification_email_text token, host\n    : url = host+"/user/verify_email?token="+t...'
        translator = Context(Lines(0), 0, {'init', 'buffer'})
/usr/local/lib/python3.8/site-packages/cinje/encoding.py:15: in <genexpr>
    return '\n'.join(str(i) for i in translator.stream)
        .0         = <generator object Context.stream at 0x7fd489fce510>
        i          = Line(0, text, "")
/usr/local/lib/python3.8/site-packages/cinje/util.py:505: in stream
    for line in handler(self):  # This re-indents the code to match, if missing explicit scope.
        handler    = <cinje.block.module.Module object at 0x7fd489fbf910>
        line       = Line(0, text, "")
        mapping    = deque([387, 387, 387, 387, 387, 387, ...])
        root       = True
        self       = Context(Lines(0), 0, {'init', 'buffer'})
/usr/local/lib/python3.8/site-packages/cinje/block/module.py:84: in __call__
    yield Line(0, '__gzmapping__ = b"' + red(mapping).replace('"', '\"') + '"')
        context    = Context(Lines(0), 0, {'init', 'buffer'})
        i          = Line(0, text, "yield "".join(_buffer)")
        imported   = False
        input      = Lines(0)
        line       = Line(3, code, "def render_verification_email_text token, host")
        mapping    = deque([0, 2, 2, 2, 2, 2, ...])
        self       = <cinje.block.module.Module object at 0x7fd489fbf910>
/usr/local/lib/python3.8/site-packages/cinje/block/module.py:22: in red
    return b64encode(compress(b''.join(chr(i).encode('latin1') for i in deltas))).decode('latin1')
        deltas     = [0, 2, 0, 0, 0, 0, ...]
        line       = 387
        numbers    = deque([0, 2, 2, 2, 2, 2, ...])
        value      = 387
/usr/local/lib/python3.8/site-packages/cinje/block/module.py:22: in <genexpr>
    return b64encode(compress(b''.join(chr(i).encode('latin1') for i in deltas))).decode('latin1')
E   UnicodeEncodeError: 'latin-1' codec can't encode character '\u015e' in position 0: ordinal not in range(256)
        .0         = <list_iterator object at 0x7fd489fbf940>
        i          = 350


amcgregor Mar 10, 2021

Author Member

Diving through the stack trace: (triple quotes for blocks of code, BTW 😉)

  1. data is binary, the pure source file.
  2. input is binary, the pure source file passed to the cinje decoder.
  3. The referenced line of code ensures the input is binary, then decodes UTF-8 explicitly:
     output = transform(bytes(input).decode('utf8', errors))
    
  4. input is Unicode text, having been decoded while processing the arguments to the call immediately above.

From the final frame, I'd really need to see the input to the compress call. If there's a delta into the 15E range, that's 350 lines of code that probably don't belong in a template file being "skipped" over. 😕
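
The failure in the final frame can be reproduced in isolation, independent of any template contents. A minimal sketch, assuming Python 3; `packs_into_latin1` is a hypothetical helper, not part of cinje:

```python
def packs_into_latin1(delta):
    """Return True if chr(delta) survives the Latin-1 encoding step used by red()."""
    try:
        chr(delta).encode('latin1')
    except (UnicodeEncodeError, ValueError):
        # UnicodeEncodeError: delta above 255. ValueError: chr() rejects negatives.
        return False
    return True


assert packs_into_latin1(2)
assert packs_into_latin1(255)       # largest value Latin-1 can represent
assert not packs_into_latin1(350)   # chr(350) is '\u015e', as in the traceback
assert not packs_into_latin1(-1)    # a line number going backwards also fails
```

So the exception is about the size of a line-number delta, not about the characters in the template text itself.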


crosoftzach Mar 10, 2021

Running this some more. System is docker "python:3.8-slim-buster" and when PYTHONOPTIMIZE=1 the above exception is thrown. If PYTHONOPTIMIZE=0 it seems to work. If I delete the special chars from the template file, it also appears to work with PYTHONOPTIMIZE=1.
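
To see what PYTHONOPTIMIZE=1 changes, the sketch below spawns a child interpreter with and without the variable set and reports the `__debug__` value it sees. It assumes Python 3.7 or later (for `capture_output`); `debug_flag` is a hypothetical helper.

```python
import os
import subprocess
import sys


def debug_flag(optimize):
    """Run a child interpreter and return the __debug__ value it prints."""
    env = dict(os.environ)
    env.pop('PYTHONOPTIMIZE', None)  # start from a known state
    if optimize:
        env['PYTHONOPTIMIZE'] = '1'  # same effect as passing -O once
    result = subprocess.run(
        [sys.executable, '-c', 'print(__debug__)'],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


print('without PYTHONOPTIMIZE:', debug_flag(False))
print('with PYTHONOPTIMIZE=1: ', debug_flag(True))
```

With the variable set, any code guarded by `if __debug__:` is skipped, which would explain why the two settings exercise different paths through the module.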

yield Line(0, '__mapping__ = [' + ','.join(str(i) for i in mapping) + ']')

yield Line(0, '__gzmapping__ = rb"""' + red(mapping) + '"""')

context.flag.remove('init')
