Hybrid Text Compression (HTC) uses different techniques to achieve maximum text compression (lossless). It makes use of LZW, Huffman, Run length encoding and Burrows Wheeler Transform to compress different types of files.
It is a dictionary based technique for compressing data created by Abraham Lempel, Jacob Ziv, and Terry Welch. It is the improved version of LZ78.
Huffman code is a type of prefix code which is used for lossless data compression. It was developed by David A. Huffman in 1952. It is an entropy/frequency encoding method.
The BWT is a block-sorting compression algorithm. It rearranges a string into runs of similar character. It was invented by Michael Burrows and David Wheeler in 1994. It is used in Bio-informics. In Next Generation Sequencing, DNA is fragmented into small pieces of which first few bases are sequenced, yielding several millions of reads each 30 to 500 base pairs(“DNA Characters”) long.
It is a form of lossless data compression in which runs of data are stored as a single data value and count, rather than as original run.
- C++
- Python 2.x/3.x
- Matplotlib C++ API
- Boost for reading/writing binary data (in actual/raw bits)
git clone https://github.com/IAMIQBAL/Hybrid-Text-Compression
cd Hybrid-Text-Compression
pacman -S boost
g++ -o main HybridCompressor.cpp
./main
We have written a test class (tests.cpp) which can be used to check the compression ratio and time taken on a scatter plot. The class uses Matplotlib’s C++ Library to plot the scatter plot. The Tests are as follows:
Note: Red = .json | Green = .txt | Blue = .xml, .html
Text, json, html and other formats data have been used for testing purposes.