Improved python module architecture and added algos. #2

Open
wants to merge 1 commit into
base: master
13 changes: 6 additions & 7 deletions README.md
@@ -53,10 +53,9 @@ Add [ruby/lib/](https://github.com/kaitai-io/kaitai_compress/tree/master/ruby/li
 
 | Algorithm | Process name | Arguments | Conforming | Test file extension |
 | - | - | - | - | - |
-| [Brotli](https://en.wikipedia.org/wiki/Brotli) | `brotli` | None | [RFC 7932](https://datatracker.ietf.org/doc/html/rfc7932) | br |
-| [LZ4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) | `lz4` | None | [LZ4 block specification](https://lz4.github.io/lz4/lz4_Block_format.md) | lz4 |
-| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_raw` | None | Raw LZMA stream | lzma_raw |
-| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_lzma` | None | Legacy .lzma file format (AKA alone) | lzma |
-| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_xz` | None | .xz file format | xz |
-| [DEFLATE](https://en.wikipedia.org/wiki/DEFLATE) (AKA zlib) | `zlib` | None | [RFC 1951](https://tools.ietf.org/html/rfc1951) | zlib |
-| [zstd](https://zstd.net) (AKA zstandard) | `zstd` | None | [Spec & ref implementation](http://facebook.github.io/zstd/zstd_manual.html) | zst |
+| [brotli](https://en.wikipedia.org/wiki/brotli) | `brotli` | compression level (`0`-`11`), mode (`generic`, `text`, `font`), log window size, log block size, dictionary | [RFC 7932](https://datatracker.ietf.org/doc/html/rfc7932) | `br` |
+| [LZ4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) | `lz4` | block_size, if should link blocks, compression level (`0`-`16`), if should checksum frame, if should checksum each block | [LZ4 block specification](https://lz4.github.io/lz4/lz4_Block_format.md) | `lz4` |
+| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma` | algorithm version (`1, 2`), compression level (`0-9`, `-1` - don't compress with lzma, but use other filters specified), format (`auto`, `alone`, `raw`, `xz`), checksumming algorithm (`none`, `crc32`, `crc64`, `sha256`), modifiers (`e` for `--extreme`), dictionary size, literal context bit count, literal position bit count, position bit count, match finder (`hc3`, `hc4`, `bt2`, `bt3`, `bt4`), mode (`normal`, `fast`), additional_filters | Raw LZMA stream | `lzma` |
+| [DEFLATE](https://en.wikipedia.org/wiki/DEFLATE) (AKA zlib) | `zlib` | container type (`raw`, `zlib`, `gzip`), log of window size (`9`-`15`), dictionary, compression level (`0`-`9`, `-1` for default), memory level (`0`-`9`), strategy (`filtered`, `huffman_only`), method (currently only `deflated` is supported) | [RFC 1951](https://tools.ietf.org/html/rfc1951) | `zlib`, `gz` |
+| [zstd](https://zstd.net) (AKA zstandard) | `zstd` | format (`zstd1_magicless`, `zstd1`), log of (max) window size, dictionary, compression level (`1` - `22`, `-5` - `-1`), if should write checksum, if should write uncompressed size, if should write dict ID, strategy (`fast`, `dfast`, `greedy`, `lazy`, `lazy2`, `btlazy2`, `btopt`, `btultra`, `btultra2`), hash log size, match min size, chain log size, search log size, overlap log size, target length, if should use long distance matching, ldm hash log size, ldm match min size, ldm bucket size log, ldm hash rate log, job size, force max window | [Spec & ref implementation](http://facebook.github.io/zstd/zstd_manual.html) | `zst` |
+| [bzip2](https://en.wikipedia.org/wiki/bzip2) | `bz2` | compression level (`1` - `9`) to add | [Official repo](https://gitlab.com/federicomenaquintero/bzip2) | `bz2` |
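The `zlib` row lists a container type (`raw`, `zlib`, `gzip`) and a window-size logarithm as separate arguments. In Python's standard `zlib` module both are encoded in the single `wbits` value, so a processor has to fold them together somewhere. The sketch below shows one way that mapping could look; `wbits_for`, `pack` and `unpack` are hypothetical helpers written for illustration and are not taken from this PR.

```python
import zlib


def wbits_for(container: str, log_window_size: int = 15) -> int:
    """Fold a container name and a window-size log into zlib's wbits value."""
    if not 9 <= log_window_size <= 15:
        raise ValueError("log_window_size must be in 9..15")
    if container == "raw":
        return -log_window_size  # negative: raw DEFLATE, no header or trailer
    if container == "zlib":
        return log_window_size  # 9..15: zlib header plus Adler-32 trailer
    if container == "gzip":
        return 16 + log_window_size  # 25..31: gzip header plus CRC-32 trailer
    raise ValueError("unknown container: " + container)


def pack(data: bytes, container: str, level: int = 9) -> bytes:
    """Compress data into the requested container."""
    comp = zlib.compressobj(level=level, wbits=wbits_for(container))
    return comp.compress(data) + comp.flush()


def unpack(packed: bytes, container: str) -> bytes:
    """Decompress data, assuming the caller knows the container type."""
    return zlib.decompress(packed, wbits=wbits_for(container))
```

The same payload packed with `"gzip"` starts with the gzip magic bytes `1f 8b`, while `"raw"` has no header at all, which is why the container type cannot be guessed reliably and has to be passed as an argument.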
Binary file added _test/compressed/25k_uuids.ascii.implode
Binary file not shown.
Binary file added _test/compressed/25k_uuids.binary.implode
Binary file not shown.
Binary file added _test/compressed/25k_uuids.lzham
Binary file not shown.
1 change: 1 addition & 0 deletions _test/compressed/90_a.ascii.implode
@@ -0,0 +1 @@
(binary content, not shown)
Binary file added _test/compressed/90_a.binary.implode
Binary file not shown.
Binary file added _test/compressed/90_a.lzham
Binary file not shown.
1 change: 1 addition & 0 deletions _test/compressed/ascii_text.ascii.implode
@@ -0,0 +1 @@
(binary content, not shown)
Binary file added _test/compressed/ascii_text.binary.implode
Binary file not shown.
Binary file added _test/compressed/ascii_text.lzham
Binary file not shown.
13 changes: 11 additions & 2 deletions _test/generate-data
@@ -4,7 +4,7 @@ for I in uncompressed/*.dat; do
     BASE=$(basename "$I" | sed 's/\.dat$//')
 
     echo "$BASE.lz4"
-    lz4 -9 <$I >compressed/$BASE.lz4
+    lz4 --best -BD <$I >compressed/$BASE.lz4
 
     echo "$BASE.zlib"
     ruby -e 'require "zlib"; $stdout.write(Zlib::deflate($stdin.read))' <$I >compressed/$BASE.zlib
@@ -22,7 +22,7 @@ for I in uncompressed/*.dat; do
     zstd --ultra -22 -f -o compressed/$BASE.zst --format=zstd $I
 
     echo "$BASE.br"
-    brotli <$I -o compressed/$BASE.br
+    brotli -f -o compressed/$BASE.br $I
 
     echo "$BASE.raw.sz"
     python3 -c "import sys, snappy; from pathlib import Path; i = Path(sys.argv[1]); o = Path(sys.argv[2]); o.write_bytes(snappy.compress(i.read_bytes()));" $I compressed/$BASE.raw.sz
@@ -32,4 +32,13 @@ for I in uncompressed/*.dat; do
 
     echo "$BASE.hadoop.sz"
     python3 -c "import sys, snappy; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); snappy.hadoop_stream_compress(i, o); i.close(); o.close();" $I compressed/$BASE.hadoop.sz
+
+    echo "$BASE.lzham"
+    lzhamtest -m4 -d29 -u -x -o -e -h0 c $I compressed/$BASE.lzham
+
+    echo "$BASE.binary.implode" # no official extension
+    python3 -c "import sys, pkimplode; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); pkimplode.compressStreamToStream(i, o, compressionType=pkimplode.CompressionType.binary, dictionarySize=4096); i.close(); o.close();" $I compressed/$BASE.binary.implode
+
+    echo "$BASE.ascii.implode" # no official extension
+    python3 -c "import sys, pkimplode; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); pkimplode.compressStreamToStream(i, o, compressionType=pkimplode.CompressionType.ascii, dictionarySize=4096); i.close(); o.close();" $I compressed/$BASE.ascii.implode
 done
6 changes: 6 additions & 0 deletions _test/ksy/test_implode_ascii.ksy
@@ -0,0 +1,6 @@
+meta:
+  id: test_implode_ascii
+seq:
+  - id: body
+    size-eos: true
+    process: kaitai.compress.implode(4096, 1)
6 changes: 6 additions & 0 deletions _test/ksy/test_implode_binary.ksy
@@ -0,0 +1,6 @@
+meta:
+  id: test_implode_binary
+seq:
+  - id: body
+    size-eos: true
+    process: kaitai.compress.implode(4096, 0)
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_lzma.ksy
@@ -3,4 +3,4 @@ meta:
 seq:
   - id: body
     size-eos: true
-    process: kaitai.compress.lzma_lzma
+    process: kaitai.compress.lzma(1, 9, "alone")
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_raw.ksy
@@ -3,4 +3,4 @@ meta:
 seq:
   - id: body
     size-eos: true
-    process: kaitai.compress.lzma_raw
+    process: kaitai.compress.lzma(2, 9, "raw")
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_xz.ksy
@@ -3,4 +3,4 @@ meta:
 seq:
   - id: body
     size-eos: true
-    process: kaitai.compress.lzma_xz
+    process: kaitai.compress.lzma(2, 9, "xz")
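The three LZMA specs now select the container through a string argument instead of through separate process names. Python's standard `lzma` module handles the three cases differently: `alone` and `xz` streams describe themselves, while a `raw` stream carries no header, so the filter chain must be supplied again when decompressing. The following is a minimal sketch of that difference using only the standard library; `FORMATS`, `codec_kwargs`, `pack` and `unpack` are hypothetical names, and the unified `Lzma` class from this PR may be structured differently.

```python
import lzma

# Container names as used in the test specs. "auto" is left out because
# Python accepts FORMAT_AUTO only when decompressing.
FORMATS = {
    "alone": lzma.FORMAT_ALONE,
    "raw": lzma.FORMAT_RAW,
    "xz": lzma.FORMAT_XZ,
}


def codec_kwargs(fmt: str, preset: int = 6) -> dict:
    """Arguments shared by compress and decompress for a given container."""
    if fmt == "raw":
        # A raw stream has no header, so both sides need the same filter chain.
        filters = [{"id": lzma.FILTER_LZMA2, "preset": preset}]
        return {"format": lzma.FORMAT_RAW, "filters": filters}
    return {"format": FORMATS[fmt]}


def pack(data: bytes, fmt: str, preset: int = 6) -> bytes:
    kwargs = codec_kwargs(fmt, preset)
    if fmt != "raw":
        kwargs["preset"] = preset  # preset and filters are mutually exclusive
    return lzma.compress(data, **kwargs)


def unpack(packed: bytes, fmt: str, preset: int = 6) -> bytes:
    return lzma.decompress(packed, **codec_kwargs(fmt, preset))
```

Calling `unpack(pack(data, "raw"), "raw")` only works because both calls rebuild the same filter chain from the same preset; with `"xz"` or `"alone"` the decompressor reads those parameters from the stream header.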
6 changes: 6 additions & 0 deletions _test/ksy/test_snappy.ksy
@@ -0,0 +1,6 @@
+meta:
+  id: test_snappy
+seq:
+  - id: body
+    size-eos: true
+    process: kaitai.compress.snappy
67 changes: 42 additions & 25 deletions _test/test-python.py
@@ -1,36 +1,53 @@
 #!/usr/bin/env python3
 
-from glob import glob
-from os.path import basename
+from pathlib import Path
 import re
+import unittest
 
 from test_lz4 import TestLz4
 from test_lzma_lzma import TestLzmaLzma
 from test_lzma_raw import TestLzmaRaw
 from test_lzma_xz import TestLzmaXz
 from test_zlib import TestZlib
+from test_snappy import TestSnappy
 from test_brotli import TestBrotli
+from test_zstd import TestZstd
+from test_implode_binary import TestImplodeBinary
+from test_implode_ascii import TestImplodeAscii
 
-for uncompressed_fn in glob('uncompressed/*.dat'):
-    name = re.sub(r'.dat$', '', basename(uncompressed_fn))
-    print(name)
-
-    f = open(uncompressed_fn, 'rb')
-    uncompressed_data = f.read()
-    f.close()
-
-    algs = [
-        (TestLz4, 'lz4'),
-        (TestLzmaLzma, 'lzma'),
-        # (TestLzmaRaw, 'lzma_raw'), # requires filters= to be set
-        (TestLzmaXz, 'xz'),
-        (TestZlib, 'zlib'),
-        (TestBrotli, 'brotli'),
-    ]
-
-    for alg in algs:
-        test_class = alg[0]
-        ext = alg[1]
-
-        obj = test_class.from_file('compressed/%s.%s' % (name, ext))
-        print(obj.body == uncompressed_data)
+cwd = Path(".").absolute()
+this_dir = Path(__file__).absolute().parent.relative_to(cwd)
+compressed_dir = this_dir / "compressed"
+uncompressed_dir = this_dir / "uncompressed"
+
+
+class SimpleTests(unittest.TestCase):
+    def testCompressors(self):
+        for uncompressed_fn in uncompressed_dir.glob("*.dat"):
+            name = uncompressed_fn.stem
+            print(name)
+
+            uncompressed_data = uncompressed_fn.read_bytes()
+
+            algs = [
+                (TestLz4, "lz4"),
+                (TestLzmaLzma, "lzma"),
+                # (TestLzmaRaw, 'lzma_raw'), # requires filters= to be set
+                (TestLzmaXz, "xz"),
+                (TestZlib, "zlib"),
+                (TestSnappy, "snappy"),
+                (TestBrotli, "br"),
+                (TestZstd, "zst"),
+                (TestImplodeBinary, "binary.implode"),
+                (TestImplodeAscii, "ascii.implode"),
+            ]
+
+            for test_class, ext in algs:
+                compressed_fn = compressed_dir / (name + "." + ext)
+                with self.subTest(test_class=test_class, file=compressed_fn):
+                    obj = test_class.from_file(str(compressed_fn))
+                    self.assertEqual(obj.body, uncompressed_data)
+
+
+if __name__ == "__main__":
+    unittest.main()
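The rewritten test replaces `print(obj.body == uncompressed_data)` with `assertEqual` inside `self.subTest(...)`, so a mismatch in one codec is reported as a failure without stopping the loop over the remaining codecs and files. Here is a small self-contained sketch of that pattern; it uses standard-library codecs as stand-ins for the generated `Test*` classes, and `CASES`, `SubTestSketch` and `run_sketch` are hypothetical names.

```python
import bz2
import unittest
import zlib

PAYLOAD = b"a" * 90

# (label, compressed bytes, decompress function): stand-ins for the
# (test_class, extension) pairs used in the real test.
CASES = [
    ("zlib", zlib.compress(PAYLOAD), zlib.decompress),
    ("bz2", bz2.compress(PAYLOAD), bz2.decompress),
]


class SubTestSketch(unittest.TestCase):
    def test_decompressors(self):
        for label, packed, decompress in CASES:
            # Each iteration is reported separately; a failure here does
            # not abort the remaining iterations.
            with self.subTest(codec=label):
                self.assertEqual(decompress(packed), PAYLOAD)


def run_sketch() -> unittest.TestResult:
    """Run the sketch programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubTestSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Passing the class and file name as `subTest` keyword arguments, as the PR does, puts them into the failure message, which makes it clear which compressed file disagreed with its uncompressed source.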
6 changes: 6 additions & 0 deletions python/.gitignore
@@ -0,0 +1,6 @@
+__pycache__
+*.pyc
+*.pyo
+/build
+/dist
+/*.egg-info
15 changes: 9 additions & 6 deletions python/kaitai/compress/__init__.py
@@ -1,6 +1,9 @@
-from .lz4 import Lz4
-from .zlib import Zlib
-from .lzma_raw import LzmaRaw
-from .lzma_lzma import LzmaLzma
-from .lzma_xz import LzmaXz
-from .brotli import Brotli
+from .core import *
+from .algorithms.zlib import Zlib
+from .algorithms.lzma import Lzma
+from .algorithms.lz4 import Lz4
+from .algorithms.brotli import Brotli
+from .algorithms.zstd import Zstd
+from .algorithms.bz2 import Bz2
+from .algorithms.snappy import Snappy
+from .algorithms.implode import Implode
Empty file.
43 changes: 43 additions & 0 deletions python/kaitai/compress/algorithms/brotli.py
@@ -0,0 +1,43 @@
+import typing
+
+from ..core import KaitaiCompressor, ProcessorContextStub
+
+# pylint:disable=arguments-differ
+
+
+class Brotli(KaitaiCompressor):
+    __slots__ = ("compressorParams", "decompressorParams")
+    brotli = None
+
+    def __init__(self, level: typing.Optional[int] = None, mode: typing.Optional[str] = "generic", log_window_size: typing.Optional[int] = None, log_block_size: typing.Optional[int] = None, dictionary: typing.Optional[bytes] = None) -> None:  # pylint:disable=redefined-builtin,too-many-arguments,too-many-locals,unused-argument
+        super().__init__()
+        if self.__class__.brotli is None:
+            import brotli  # pylint:disable=import-outside-toplevel
+
+            self.__class__.brotli = brotli
+        self.compressorParams = {}
+        self.decompressorParams = {}
+
+        if mode is not None:
+            if isinstance(mode, str):
+                mode = getattr(self.__class__.brotli, "MODE_" + mode.upper())
+            self.compressorParams["mode"] = mode
+
+        if level is not None:
+            self.compressorParams["quality"] = level
+
+        if log_window_size is not None:
+            self.compressorParams["lgwin"] = log_window_size
+
+        if log_block_size is not None:
+            self.compressorParams["lgblock"] = log_block_size
+
+        if dictionary is not None:
+            self.decompressorParams["dictionary"] = self.compressorParams["dictionary"] = dictionary
+
+    # new API
+    def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        return ProcessorContextStub(self.__class__.brotli.decompress(bytes(data), **self.decompressorParams))
+
+    def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        return ProcessorContextStub(self.__class__.brotli.compress(data, **self.compressorParams))
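`Brotli` imports its backend inside `__init__` and caches the module on the class, so importing `kaitai.compress` does not fail when the optional `brotli` package is missing, and the import cost is paid once rather than per instance. The sketch below isolates that pattern. It assumes the standard-library `zlib` module as a stand-in backend so that it runs without third-party packages; `LazyCodec` and `backend_name` are hypothetical names, and it returns plain bytes instead of a `ProcessorContextStub`.

```python
import importlib
import typing


class LazyCodec:
    """Stand-in codec that imports its backend on first instantiation."""

    backend_name = "zlib"  # assumption: stdlib module used in place of brotli
    backend = None  # class-level cache, shared by every instance

    def __init__(self, level: typing.Optional[int] = None) -> None:
        cls = self.__class__
        if cls.backend is None:
            # Deferred import: a missing optional dependency only raises
            # here, when the codec is actually requested.
            cls.backend = importlib.import_module(cls.backend_name)
        # Only forward arguments the caller actually set, so the backend's
        # own defaults apply otherwise.
        self.params = {}
        if level is not None:
            self.params["level"] = level

    def unprocess(self, data: typing.Union[bytes, bytearray]) -> bytes:
        return self.__class__.backend.compress(bytes(data), **self.params)

    def process(self, data: typing.Union[bytes, bytearray]) -> bytes:
        return self.__class__.backend.decompress(bytes(data))
```

Building the keyword dictionary from the non-`None` arguments, as both the sketch and the PR do, avoids hard-coding the backend's default values in the wrapper.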
22 changes: 22 additions & 0 deletions python/kaitai/compress/algorithms/bz2.py
@@ -0,0 +1,22 @@
+import bz2
+import typing
+
+from ..core import KaitaiCompressor, ProcessorContextStub
+
+# pylint:disable=arguments-differ
+
+
+class Bz2(KaitaiCompressor):
+    __slots__ = ("level",)
+
+    def __init__(self, level: int = 9, *args, **kwargs) -> None:  # pylint:disable=unused-argument
+        super().__init__()
+        self.level = level
+
+    def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        decompressor = bz2.BZ2Decompressor()
+        return ProcessorContextStub(decompressor.decompress(data))
+
+    def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        compressor = bz2.BZ2Compressor(self.level)
+        return ProcessorContextStub(compressor.compress(data) + compressor.flush())
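`Bz2.process` uses a `BZ2Decompressor` object rather than `bz2.decompress`. One practical consequence is that the decompressor stops at the end of the first bzip2 stream and exposes any remaining bytes through `unused_data`, which can matter when a compressed field is followed by other data. A minimal sketch of that behaviour with the standard library only; `pack` and `unpack_first_stream` are hypothetical helper names.

```python
import bz2
import typing


def pack(data: bytes, level: int = 9) -> bytes:
    """Compress with a streaming compressor, as Bz2.unprocess does."""
    compressor = bz2.BZ2Compressor(level)
    return compressor.compress(data) + compressor.flush()


def unpack_first_stream(data: bytes) -> typing.Tuple[bytes, bytes, bool]:
    """Decompress one bzip2 stream.

    Returns the payload, the bytes found after the stream, and whether the
    end-of-stream marker was reached.
    """
    decompressor = bz2.BZ2Decompressor()
    payload = decompressor.decompress(data)
    return payload, decompressor.unused_data, decompressor.eof
```

If `eof` comes back `False`, the input was truncated; the class in the PR does not check this and would return the partial output.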
37 changes: 37 additions & 0 deletions python/kaitai/compress/algorithms/implode.py
@@ -0,0 +1,37 @@
+import typing
+
+from ..core import KaitaiCompressor, ProcessorContextStub
+
+# pylint:disable=arguments-differ
+
+
+class Implode(KaitaiCompressor):
+    """PKWare implode format"""
+
+    __slots__ = ("dictionarySize", "compressionType")
+
+    def __init__(self, dictionarySize: int = 4096, compressionType: int = 0, *args, **kwargs) -> None:  # pylint:disable=unused-argument
+        super().__init__()
+
+        try:
+            from pklib_base import CompressionType
+        except ImportError:
+            pass
+        else:
+            if isinstance(compressionType, str):
+                compressionType = CompressionType[compressionType.lower()]
+            else:
+                compressionType = CompressionType(compressionType)
+
+        self.compressionType = compressionType
+        self.dictionarySize = dictionarySize
+
+    def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        import pkblast
+
+        return ProcessorContextStub(pkblast.decompressBytesWholeToBytes(data)[1])
+
+    def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        from pkimplode import compressBytesChunkedToBytes
+
+        return ProcessorContextStub(compressBytesChunkedToBytes(data, compressionType=self.compressionType, dictionarySize=self.dictionarySize,))
37 changes: 37 additions & 0 deletions python/kaitai/compress/algorithms/lrzip.py
@@ -0,0 +1,37 @@
+import typing
+from enum import IntEnum
+
+from ..core import KaitaiCompressor, ProcessorContextStub
+
+# pylint:disable=arguments-differ
+
+
+class LRZip(KaitaiCompressor):
+    __slots__ = ("algo",)
+
+    lrzip = None
+    Algos = None
+
+    @classmethod
+    def initLib(cls):
+        import lrzip
+
+        cls.lrzip = lrzip
+        prefix = "LRZIP_MODE_COMPRESS_"
+        cls.Algos = IntEnum("A", sorted(((k[len(prefix) :].lower(), getattr(lrzip, k)) for k in dir(lrzip) if k[: len(prefix)] == prefix), key=lambda x: x[1]))
+
+    def __init__(self, algo: typing.Union[int, str] = "none", *args, **kwargs) -> None:  # pylint:disable=unused-argument
+        if self.__class__.lrzip is None:
+            self.__class__.initLib()
+        if isinstance(algo, str):
+            algo = self.__class__.Algos[algo.lower()]
+        else:
+            algo = self.__class__.Algos(algo)
+        self.algo = algo
+        super().__init__()
+
+    def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        return ProcessorContextStub(self.__class__.lrzip.decompress(data))
+
+    def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        return ProcessorContextStub(self.__class__.lrzip.compress(data, compressMode=self.algo))
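`LRZip.initLib` derives its `Algos` enum at runtime by scanning the backend module for constants that share the `LRZIP_MODE_COMPRESS_` prefix, and `__init__` then accepts either a member name or a numeric value. The sketch below reproduces both steps against a fake namespace so it runs without the `lrzip` package. The constant values in `fake_lib` are invented for illustration, and `enum_from_prefix` and `coerce` are hypothetical helpers rather than part of the PR.

```python
import types
from enum import IntEnum

# Assumption: a stand-in for the real module; the numeric values are made up.
fake_lib = types.SimpleNamespace(
    LRZIP_MODE_COMPRESS_NONE=0,
    LRZIP_MODE_COMPRESS_LZO=1,
    LRZIP_MODE_COMPRESS_ZLIB=2,
    UNRELATED_CONSTANT=99,
)


def enum_from_prefix(module, prefix: str, name: str = "Algos"):
    """Build an IntEnum from every attribute of module starting with prefix."""
    members = sorted(
        (
            (attr[len(prefix):].lower(), getattr(module, attr))
            for attr in dir(module)
            if attr.startswith(prefix)
        ),
        key=lambda item: item[1],  # order members by numeric value
    )
    return IntEnum(name, members)


def coerce(enum_cls, value):
    """Accept a member name (any case) or a numeric value."""
    if isinstance(value, str):
        return enum_cls[value.lower()]  # lookup by name
    return enum_cls(value)  # lookup by value


Algos = enum_from_prefix(fake_lib, "LRZIP_MODE_COMPRESS_")
```

With this in place, `coerce(Algos, "ZLIB")` and `coerce(Algos, 2)` resolve to the same member, which is what lets a `.ksy` file pass either a string or an integer as the process argument.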
42 changes: 42 additions & 0 deletions python/kaitai/compress/algorithms/lz4.py
@@ -0,0 +1,42 @@
+import typing
+
+from ..core import KaitaiCompressor, ProcessorContextStub
+
+# pylint:disable=arguments-differ
+
+
+class Lz4(KaitaiCompressor):
+    __slots__ = ("compressorParams",)
+    lz4Frame = None
+
+    def __init__(self, block_size: typing.Optional[int] = None, should_link_blocks: bool = True, compression_level: typing.Optional[int] = None, frame_checksum: bool = False, block_checksum: bool = False, *args, **kwargs) -> None:  # pylint:disable=unused-argument,too-many-arguments
+        super().__init__()
+        if self.__class__.lz4Frame is None:
+            import lz4.frame  # pylint:disable=import-outside-toplevel
+
+            self.__class__.lz4Frame = lz4.frame
+
+        if compression_level is None:
+            compression_level = self.__class__.lz4Frame.COMPRESSIONLEVEL_MAX
+        if block_size is None:
+            block_size = self.__class__.lz4Frame.BLOCKSIZE_MAX4MB
+        self.compressorParams = {
+            "block_size": block_size,
+            "block_linked": should_link_blocks,
+            "compression_level": compression_level,
+            "content_checksum": frame_checksum,
+            "block_checksum": block_checksum,
+            "return_bytearray": False
+        }
+
+    def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        obj = self.__class__.lz4Frame.LZ4FrameDecompressor(return_bytearray=False)
+        return ProcessorContextStub(obj.decompress(data))
+
+    def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
+        obj = self.__class__.lz4Frame.LZ4FrameCompressor(**self.compressorParams)
+        return ProcessorContextStub(obj.begin(len(data)) + obj.compress(data) + obj.flush())
+
+    def extract_args(self, data: typing.Union[bytes, bytearray]):
+        res = self.__class__.lz4Frame.get_frame_info(data)
+        return (res["block_size"], res["linker"], res["compression_level"], res["content_checksum"], res["block_checksum"])