DeepSqueeze: Deep Semantic Compression for Tabular Data

andrewcrotty/deepsqueeze

DeepSqueeze

This repository contains a basic implementation of DeepSqueeze from our SIGMOD 2020 paper. DeepSqueeze is a semantic compression framework that uses autoencoders to capture complex relationships in real-world tabular datasets, in particular focusing on error-bounded lossy compression of numerical data.
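To make "error-bounded lossy compression" concrete, here is a minimal sketch of one common way to bound the error on a numerical column: uniform quantization into buckets whose width is tied to the column's range. This is an illustration under stated assumptions, not necessarily how deepsqueeze.py does it. The function names `quantize` and `dequantize` are hypothetical, and the sketch assumes the error threshold is a fraction of each column's value range.

```python
import math


def quantize(values, error):
    """Map each value to an integer bucket index.

    Assumption: `error` is a fraction of the column's range, so every bucket
    is 2 * error * (max - min) wide and its midpoint lies within
    error * (max - min) of any value in the bucket.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has zero range; any positive width works in that case.
    width = 2.0 * error * span if span > 0 else 1.0
    codes = [math.floor((v - lo) / width) for v in values]
    return codes, lo, width


def dequantize(codes, lo, width):
    """Reconstruct each value as the midpoint of its bucket."""
    return [lo + (c + 0.5) * width for c in codes]


if __name__ == "__main__":
    column = [0.0, 0.12, 0.5, 0.731, 1.0]
    codes, lo, width = quantize(column, error=0.005)
    restored = dequantize(codes, lo, width)
    worst = max(abs(a - b) for a, b in zip(column, restored))
    print(f"bucket width: {width}, worst-case error: {worst}")
```

The integer codes typically take far fewer distinct values than the original floats, which is what makes them easier to compress afterwards. A smaller error threshold means narrower buckets, more distinct codes, and a larger compressed size.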

This implementation does not include some advanced techniques described in the paper, such as parameter sharing for categorical attributes, gated mixture of experts, and Bayesian optimization for hyperparameter tuning. If you are interested in these aspects of the work, please see this (unaffiliated) implementation: https://github.com/MikeXydas/DeepSqueeze

You can use the following commands to reproduce the results from the paper:

# Corel
python3 deepsqueeze.py data/corel.csv -c brotli -l 11 -e 0.005 -E 1500
python3 deepsqueeze.py corel.csv.tar.gz -d -C data/corel.csv

# Forest
python3 deepsqueeze.py data/forest.csv -c brotli -l 11 -e 0.005 -E 300
python3 deepsqueeze.py forest.csv.tar.gz -d -C data/forest.csv

If you use this code, please cite:

@inproceedings{DBLP:conf/sigmod/IlkhechiCGMFSC20,
  author       = {Amir Ilkhechi and
                  Andrew Crotty and
                  Alex Galakatos and
                  Yicong Mao and
                  Grace Fan and
                  Xiran Shi and
                  Ugur {\c{C}}etintemel},
  editor       = {David Maier and
                  Rachel Pottinger and
                  AnHai Doan and
                  Wang{-}Chiew Tan and
                  Abdussalam Alawini and
                  Hung Q. Ngo},
  title        = {DeepSqueeze: Deep Semantic Compression for Tabular Data},
  booktitle    = {Proceedings of the 2020 International Conference on Management of
                  Data, {SIGMOD} Conference 2020, online conference [Portland, OR, USA],
                  June 14-19, 2020},
  pages        = {1733--1746},
  publisher    = {{ACM}},
  year         = {2020},
  url          = {https://doi.org/10.1145/3318464.3389734},
  doi          = {10.1145/3318464.3389734},
  timestamp    = {Wed, 04 May 2022 13:02:28 +0200},
  biburl       = {https://dblp.org/rec/conf/sigmod/IlkhechiCGMFSC20.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
