Improved python module architecture and added algos.
* Split core and algos
* Introduced a new API
* Added zstd, brotli, snappy, lzham, implode and bzip2
* Fixed brotli file extension
* Added parameters for all the algos
* Some improvements in packaging
* Improved testing
KOLANICH committed Apr 5, 2022
1 parent 12f4cff commit 6525ac9
Showing 40 changed files with 797 additions and 77 deletions.
13 changes: 6 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
@@ -53,10 +53,9 @@ Add [ruby/lib/](https://github.com/kaitai-io/kaitai_compress/tree/master/ruby/li

| Algorithm | Process name | Arguments | Conforming | Test file extension |
| - | - | - | - | - |
| [Brotli](https://en.wikipedia.org/wiki/Brotli) | `brotli` | None | [RFC 7932](https://datatracker.ietf.org/doc/html/rfc7932) | br |
| [LZ4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) | `lz4` | None | [LZ4 block specification](https://lz4.github.io/lz4/lz4_Block_format.md) | lz4 |
| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_raw` | None | Raw LZMA stream | lzma_raw |
| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_lzma` | None | Legacy .lzma file format (AKA alone) | lzma |
| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma_xz` | None | .xz file format | xz |
| [DEFLATE](https://en.wikipedia.org/wiki/DEFLATE) (AKA zlib) | `zlib` | None | [RFC 1951](https://tools.ietf.org/html/rfc1951) | zlib |
| [zstd](https://zstd.net) (AKA zstandard) | `zstd` | None | [Spec & ref implementation](http://facebook.github.io/zstd/zstd_manual.html) | zst |
| [Brotli](https://en.wikipedia.org/wiki/Brotli) | `brotli` | compression level (`0`-`11`), mode (`generic`, `text`, `font`), log window size, log block size, dictionary | [RFC 7932](https://datatracker.ietf.org/doc/html/rfc7932) | `br` |
| [LZ4](https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)) | `lz4` | block_size, if should link blocks, compression level (`0`-`16`), if should checksum frame, if should checksum each block | [LZ4 block specification](https://lz4.github.io/lz4/lz4_Block_format.md) | `lz4` |
| [LZMA](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm) | `lzma` | algorithm version (`1, 2`), compression level (`0-9`, `-1` - don't compress with lzma, but use other filters specified), format (`auto`, `alone`, `raw`, `xz`), checksumming algorithm (`none`, `crc32`, `crc64`, `sha256`), modifiers (`e` for `--extreme`), dictionary size, literal context bit count, literal position bit count, position bit count, match finder (`hc3`, `hc4`, `bt2`, `bt3`, `bt4`), mode (`normal`, `fast`), additional_filters | Raw LZMA stream | `lzma` |
| [DEFLATE](https://en.wikipedia.org/wiki/DEFLATE) (AKA zlib) | `zlib` | container type (`raw`, `zlib`, `gzip`), log of window size (`9`-`15`), dictionary, compression level (`0`-`9`, `-1` for default), memory level (`0`-`9`), strategy (`filtered`, `huffman_only`), method (currently only `deflated` is supported) | [RFC 1951](https://tools.ietf.org/html/rfc1951) | `zlib`, `gz` |
| [zstd](https://zstd.net) (AKA zstandard) | `zstd` | format (`zstd1_magicless`, `zstd1`), log of (max) window size, dictionary, compression level (`1` - `22`, `-5` - `-1`), if should write checksum, if should write uncompressed size, if should write dict ID, strategy (`fast`, `dfast`, `greedy`, `lazy`, `lazy2`, `btlazy2`, `btopt`, `btultra`, `btultra2`), hash log size, match min size, chain log size, search log size, overlap log size, target length, if should use long distance matching, ldm hash log size, ldm match min size, ldm bucket size log, ldm hash rate log, job size, force max window | [Spec & ref implementation](http://facebook.github.io/zstd/zstd_manual.html) | `zst` |
| [bzip2](https://en.wikipedia.org/wiki/bzip2) | `bz2` | compression level (`1`-`9`), more to be added | [Official repo](https://gitlab.com/federicomenaquintero/bzip2) | `bz2` |
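The container-type argument documented for the new `zlib` row (`raw`, `zlib`, `gzip`) corresponds, in CPython's stdlib, to the `wbits` parameter of `zlib.compressobj`/`zlib.decompressobj`. A minimal sketch of the three containers using plain stdlib `zlib` (not the `kaitai.compress` wrapper itself):

```python
import zlib

data = b"kaitai " * 100

# wbits selects the container: 15 -> zlib wrapper, -15 -> raw DEFLATE, 31 -> gzip
for name, wbits in (("zlib", 15), ("raw", -15), ("gzip", 31)):
    comp = zlib.compressobj(level=9, wbits=wbits)
    blob = comp.compress(data) + comp.flush()
    decomp = zlib.decompressobj(wbits=wbits)
    assert decomp.decompress(blob) == data
```

The `zlib`/`gz` pair in the "Test file extension" column follows from the same knob: the `zlib` and `gzip` containers differ only in header and checksum, while the compressed payload is the same DEFLATE stream.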
Binary file added _test/compressed/25k_uuids.ascii.implode
Binary file not shown.
Binary file added _test/compressed/25k_uuids.binary.implode
Binary file not shown.
Binary file added _test/compressed/25k_uuids.lzham
Binary file not shown.
1 change: 1 addition & 0 deletions _test/compressed/90_a.ascii.implode
@@ -0,0 +1 @@
(binary data not displayable)
Binary file added _test/compressed/90_a.binary.implode
Binary file not shown.
Binary file added _test/compressed/90_a.lzham
Binary file not shown.
1 change: 1 addition & 0 deletions _test/compressed/ascii_text.ascii.implode
@@ -0,0 +1 @@
(binary data not displayable)
Binary file added _test/compressed/ascii_text.binary.implode
Binary file not shown.
Binary file added _test/compressed/ascii_text.lzham
Binary file not shown.
13 changes: 11 additions & 2 deletions _test/generate-data
@@ -4,7 +4,7 @@ for I in uncompressed/*.dat; do
	BASE=$(basename "$I" | sed 's/\.dat$//')

	echo "$BASE.lz4"
	lz4 -9 <$I >compressed/$BASE.lz4
	lz4 --best -BD <$I >compressed/$BASE.lz4

	echo "$BASE.zlib"
	ruby -e 'require "zlib"; $stdout.write(Zlib::deflate($stdin.read))' <$I >compressed/$BASE.zlib
@@ -22,7 +22,7 @@ for I in uncompressed/*.dat; do
	zstd --ultra -22 -f -o compressed/$BASE.zst --format=zstd $I

	echo "$BASE.br"
	brotli <$I -o compressed/$BASE.br
	brotli -f -o compressed/$BASE.br $I

	echo "$BASE.raw.sz"
	python3 -c "import sys, snappy; from pathlib import Path; i = Path(sys.argv[1]); o = Path(sys.argv[2]); o.write_bytes(snappy.compress(i.read_bytes()));" $I compressed/$BASE.raw.sz
@@ -32,4 +32,13 @@ for I in uncompressed/*.dat; do

	echo "$BASE.hadoop.sz"
	python3 -c "import sys, snappy; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); snappy.hadoop_stream_compress(i, o); i.close(); o.close();" $I compressed/$BASE.hadoop.sz

	echo "$BASE.lzham"
	lzhamtest -m4 -d29 -u -x -o -e -h0 c $I compressed/$BASE.lzham

	echo "$BASE.binary.implode" # no official extension
	python3 -c "import sys, pkimplode; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); pkimplode.compressStreamToStream(i, o, compressionType=pkimplode.CompressionType.binary, dictionarySize=4096); i.close(); o.close();" $I compressed/$BASE.binary.implode

	echo "$BASE.ascii.implode" # no official extension
	python3 -c "import sys, pkimplode; from pathlib import Path; i = Path(sys.argv[1]).open('rb'); o = Path(sys.argv[2]).open('wb'); pkimplode.compressStreamToStream(i, o, compressionType=pkimplode.CompressionType.ascii, dictionarySize=4096); i.close(); o.close();" $I compressed/$BASE.ascii.implode
done
6 changes: 6 additions & 0 deletions _test/ksy/test_implode_ascii.ksy
@@ -0,0 +1,6 @@
meta:
  id: test_implode_ascii
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.implode(4096, 1)
6 changes: 6 additions & 0 deletions _test/ksy/test_implode_binary.ksy
@@ -0,0 +1,6 @@
meta:
  id: test_implode_binary
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.implode(4096, 0)
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_lzma.ksy
@@ -3,4 +3,4 @@ meta:
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.lzma_lzma
    process: kaitai.compress.lzma(1, 9, "alone")
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_raw.ksy
@@ -3,4 +3,4 @@ meta:
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.lzma_raw
    process: kaitai.compress.lzma(2, 9, "raw")
2 changes: 1 addition & 1 deletion _test/ksy/test_lzma_xz.ksy
@@ -3,4 +3,4 @@ meta:
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.lzma_xz
    process: kaitai.compress.lzma(2, 9, "xz")
6 changes: 6 additions & 0 deletions _test/ksy/test_snappy.ksy
@@ -0,0 +1,6 @@
meta:
  id: test_snappy
seq:
  - id: body
    size-eos: true
    process: kaitai.compress.snappy
67 changes: 42 additions & 25 deletions _test/test-python.py
@@ -1,36 +1,53 @@
#!/usr/bin/env python3

from glob import glob
from os.path import basename
from pathlib import Path
import re
import unittest

from test_lz4 import TestLz4
from test_lzma_lzma import TestLzmaLzma
from test_lzma_raw import TestLzmaRaw
from test_lzma_xz import TestLzmaXz
from test_zlib import TestZlib
from test_snappy import TestSnappy
from test_brotli import TestBrotli
from test_zstd import TestZstd
from test_implode_binary import TestImplodeBinary
from test_implode_ascii import TestImplodeAscii

for uncompressed_fn in glob('uncompressed/*.dat'):
	name = re.sub(r'.dat$', '', basename(uncompressed_fn))
	print(name)

	f = open(uncompressed_fn, 'rb')
	uncompressed_data = f.read()
	f.close()

	algs = [
		(TestLz4, 'lz4'),
		(TestLzmaLzma, 'lzma'),
		# (TestLzmaRaw, 'lzma_raw'), # requires filters= to be set
		(TestLzmaXz, 'xz'),
		(TestZlib, 'zlib'),
		(TestBrotli, 'brotli'),
	]

	for alg in algs:
		test_class = alg[0]
		ext = alg[1]

		obj = test_class.from_file('compressed/%s.%s' % (name, ext))
		print(obj.body == uncompressed_data)
cwd = Path(".").absolute()
this_dir = Path(__file__).absolute().parent.relative_to(cwd)
compressed_dir = this_dir / "compressed"
uncompressed_dir = this_dir / "uncompressed"


class SimpleTests(unittest.TestCase):
	def testCompressors(self):
		for uncompressed_fn in uncompressed_dir.glob("*.dat"):
			name = uncompressed_fn.stem
			print(name)

			uncompressed_data = uncompressed_fn.read_bytes()

			algs = [
				(TestLz4, "lz4"),
				(TestLzmaLzma, "lzma"),
				# (TestLzmaRaw, 'lzma_raw'), # requires filters= to be set
				(TestLzmaXz, "xz"),
				(TestZlib, "zlib"),
				(TestSnappy, "snappy"),
				(TestBrotli, "br"),
				(TestZstd, "zst"),
				(TestImplodeBinary, "binary.implode"),
				(TestImplodeAscii, "ascii.implode"),
			]

			for test_class, ext in algs:
				compressed_fn = compressed_dir / (name + "." + ext)
				with self.subTest(test_class=test_class, file=compressed_fn):
					obj = test_class.from_file(str(compressed_fn))
					self.assertEqual(obj.body, uncompressed_data)


if __name__ == "__main__":
	unittest.main()
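The rewritten driver leans on `unittest`'s `self.subTest(...)`: each algorithm/file pair is reported as its own sub-result, so one failing codec no longer aborts the whole loop the way a plain assertion would. A minimal standalone illustration of the pattern (run programmatically rather than via `unittest.main()` to keep it embeddable):

```python
import unittest


class SubTestDemo(unittest.TestCase):
    def test_many_inputs(self):
        # each iteration is a separate sub-result; a failure in one
        # value would be reported individually and the loop continues
        for value in (0, 1, 2, 3):
            with self.subTest(value=value):
                self.assertEqual(value * 2 // 2, value)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubTestDemo)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same mechanism is why the rewrite replaces `print(obj.body == uncompressed_data)` with `self.assertEqual(...)`: a `False` printed to stdout is easy to miss, while a failed assertion inside a `subTest` is recorded with its parameters.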
6 changes: 6 additions & 0 deletions python/.gitignore
@@ -0,0 +1,6 @@
__pycache__
*.pyc
*.pyo
/build
/dist
/*.egg-info
15 changes: 9 additions & 6 deletions python/kaitai/compress/__init__.py
@@ -1,6 +1,9 @@
from .lz4 import Lz4
from .zlib import Zlib
from .lzma_raw import LzmaRaw
from .lzma_lzma import LzmaLzma
from .lzma_xz import LzmaXz
from .brotli import Brotli
from .core import *
from .algorithms.zlib import Zlib
from .algorithms.lzma import Lzma
from .algorithms.lz4 import Lz4
from .algorithms.brotli import Brotli
from .algorithms.zstd import Zstd
from .algorithms.bz2 import Bz2
from .algorithms.snappy import Snappy
from .algorithms.implode import Implode
Empty file.
43 changes: 43 additions & 0 deletions python/kaitai/compress/algorithms/brotli.py
@@ -0,0 +1,43 @@
import typing

from ..core import KaitaiCompressor, ProcessorContextStub

# pylint:disable=arguments-differ


class Brotli(KaitaiCompressor):
	__slots__ = ("compressorParams", "decompressorParams")
	brotli = None

	def __init__(self, level: typing.Optional[int] = None, mode: typing.Optional[str] = "generic", log_window_size: typing.Optional[int] = None, log_block_size: typing.Optional[int] = None, dictionary: typing.Optional[bytes] = None) -> None:  # pylint:disable=redefined-builtin,too-many-arguments,too-many-locals,unused-argument
		super().__init__()
		if self.__class__.brotli is None:
			import brotli  # pylint:disable=import-outside-toplevel

			self.__class__.brotli = brotli
		self.compressorParams = {}
		self.decompressorParams = {}

		if mode is not None:
			if isinstance(mode, str):
				mode = getattr(self.__class__.brotli, "MODE_" + mode.upper())
			self.compressorParams["mode"] = mode

		if level is not None:
			self.compressorParams["quality"] = level

		if log_window_size is not None:
			self.compressorParams["lgwin"] = log_window_size

		if log_block_size is not None:
			self.compressorParams["lgblock"] = log_block_size

		if dictionary is not None:
			self.decompressorParams["dictionary"] = self.compressorParams["dictionary"] = dictionary

	# new API
	def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		return ProcessorContextStub(self.__class__.brotli.decompress(bytes(data), **self.decompressorParams))

	def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		return ProcessorContextStub(self.__class__.brotli.compress(data, **self.compressorParams))
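The `brotli = None` class attribute above implements a lazy-import cache: the optional dependency is imported on first instantiation and memoized on the class, so merely importing the package never pulls it in. The same pattern, sketched with a stdlib module standing in for `brotli` (class and method names here are illustrative only):

```python
class LazySerializer:
    # mirrors the lazy-import pattern: the module is imported once,
    # on first instantiation, and cached as a class attribute so the
    # dependency is optional until the codec is actually used
    json = None

    def __init__(self) -> None:
        if self.__class__.json is None:
            import json  # deferred import, stdlib stand-in for `brotli`

            self.__class__.json = json

    def unprocess(self, obj) -> bytes:
        return self.__class__.json.dumps(obj).encode("utf-8")


a = LazySerializer()
b = LazySerializer()
assert a.__class__.json is b.__class__.json  # imported exactly once, shared
```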
22 changes: 22 additions & 0 deletions python/kaitai/compress/algorithms/bz2.py
@@ -0,0 +1,22 @@
import bz2
import typing

from ..core import KaitaiCompressor, ProcessorContextStub

# pylint:disable=arguments-differ


class Bz2(KaitaiCompressor):
	__slots__ = ("level",)

	def __init__(self, level: int = 9, *args, **kwargs) -> None:  # pylint:disable=unused-argument
		super().__init__()
		self.level = level

	def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		decompressor = bz2.BZ2Decompressor()
		return ProcessorContextStub(decompressor.decompress(data))

	def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		compressor = bz2.BZ2Compressor(self.level)
		return ProcessorContextStub(compressor.compress(data) + compressor.flush())
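Because `bz2` is in the stdlib, the round trip this wrapper performs can be sketched without the package's `KaitaiCompressor`/`ProcessorContextStub` plumbing (the helper names below are illustrative, not part of the API):

```python
import bz2


def bz2_unprocess(data: bytes, level: int = 9) -> bytes:
    # mirrors Bz2.unprocess: incremental compressor, then flush
    compressor = bz2.BZ2Compressor(level)
    return compressor.compress(data) + compressor.flush()


def bz2_process(data: bytes) -> bytes:
    # mirrors Bz2.process: one-shot incremental decompressor
    decompressor = bz2.BZ2Decompressor()
    return decompressor.decompress(data)


original = b"kaitai " * 1000
assert bz2_process(bz2_unprocess(original, level=1)) == original
```

Note the Kaitai convention the whole package follows: `process` is the direction taken when *parsing* (decompression), `unprocess` the direction taken when *serializing* (compression).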
37 changes: 37 additions & 0 deletions python/kaitai/compress/algorithms/implode.py
@@ -0,0 +1,37 @@
import typing

from ..core import KaitaiCompressor, ProcessorContextStub

# pylint:disable=arguments-differ


class Implode(KaitaiCompressor):
	"""PKWare implode format"""

	__slots__ = ("dictionarySize", "compressionType")

	def __init__(self, dictionarySize: int = 4096, compressionType: int = 0, *args, **kwargs) -> None:  # pylint:disable=unused-argument
		super().__init__()

		try:
			from pklib_base import CompressionType
		except ImportError:
			pass
		else:
			if isinstance(compressionType, str):
				compressionType = CompressionType[compressionType.lower()]
			else:
				compressionType = CompressionType(compressionType)

		self.compressionType = compressionType
		self.dictionarySize = dictionarySize

	def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		import pkblast

		return ProcessorContextStub(pkblast.decompressBytesWholeToBytes(data)[1])

	def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		from pkimplode import compressBytesChunkedToBytes

		return ProcessorContextStub(compressBytesChunkedToBytes(data, compressionType=self.compressionType, dictionarySize=self.dictionarySize))
37 changes: 37 additions & 0 deletions python/kaitai/compress/algorithms/lrzip.py
@@ -0,0 +1,37 @@
import typing
from enum import IntEnum

from ..core import KaitaiCompressor, ProcessorContextStub

# pylint:disable=arguments-differ


class LRZip(KaitaiCompressor):
	__slots__ = ("algo",)

	lrzip = None
	Algos = None

	@classmethod
	def initLib(cls):
		import lrzip  # pylint:disable=import-outside-toplevel

		cls.lrzip = lrzip
		prefix = "LRZIP_MODE_COMPRESS_"
		cls.Algos = IntEnum("A", sorted(((k[len(prefix):].lower(), getattr(lrzip, k)) for k in dir(lrzip) if k.startswith(prefix)), key=lambda x: x[1]))

	def __init__(self, algo: typing.Union[int, str] = "none", *args, **kwargs) -> None:  # pylint:disable=unused-argument
		if self.__class__.lrzip is None:
			self.__class__.initLib()
		if isinstance(algo, str):
			algo = self.__class__.Algos[algo.lower()]
		else:
			algo = self.__class__.Algos(algo)
		self.algo = algo
		super().__init__()

	def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		return ProcessorContextStub(self.__class__.lrzip.decompress(data))

	def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		return ProcessorContextStub(self.__class__.lrzip.compress(data, compressMode=self.algo))
42 changes: 42 additions & 0 deletions python/kaitai/compress/algorithms/lz4.py
@@ -0,0 +1,42 @@
import typing

from ..core import KaitaiCompressor, ProcessorContextStub

# pylint:disable=arguments-differ


class Lz4(KaitaiCompressor):
	__slots__ = ("compressorParams",)
	lz4Frame = None

	def __init__(self, block_size: typing.Optional[int] = None, should_link_blocks: bool = True, compression_level: typing.Optional[int] = None, frame_checksum: bool = False, block_checksum: bool = False, *args, **kwargs) -> None:  # pylint:disable=unused-argument,too-many-arguments
		super().__init__()
		if self.__class__.lz4Frame is None:
			import lz4.frame  # pylint:disable=import-outside-toplevel

			self.__class__.lz4Frame = lz4.frame

		if compression_level is None:
			compression_level = self.__class__.lz4Frame.COMPRESSIONLEVEL_MAX
		if block_size is None:
			block_size = self.__class__.lz4Frame.BLOCKSIZE_MAX4MB
		self.compressorParams = {
			"block_size": block_size,
			"block_linked": should_link_blocks,
			"compression_level": compression_level,
			"content_checksum": frame_checksum,
			"block_checksum": block_checksum,
			"return_bytearray": False,
		}

	def process(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		obj = self.__class__.lz4Frame.LZ4FrameDecompressor(return_bytearray=False)
		return ProcessorContextStub(obj.decompress(data))

	def unprocess(self, data: typing.Union[bytes, bytearray]) -> ProcessorContextStub:
		obj = self.__class__.lz4Frame.LZ4FrameCompressor(**self.compressorParams)
		return ProcessorContextStub(obj.begin(len(data)) + obj.compress(data) + obj.flush())

	def extract_args(self, data: typing.Union[bytes, bytearray]):
		res = self.__class__.lz4Frame.get_frame_info(data)
		return (res["block_size"], res["linker"], res["compression_level"], res["content_checksum"], res["block_checksum"])
