NEWS! - 19/11/2023

Our new paper TabuLa: Harnessing Language Models for Tabular Data Synthesis is on arXiv now! The code is published here. Tabula improves tabular data synthesis by leveraging language model structures without the burden of pre-trained model weights. It preprocesses tabular data to shorten the token sequence, which sharply reduces training time while consistently delivering higher-quality synthetic data. Its training time is longer than that of CTAB-GAN+, but the fidelity of the synthetic data is much higher. It also works for high-dimensional categorical columns!

NEWS! - 09/10/2022

The CTAB-GAN+ code is released. CTAB-GAN+ updates CTAB-GAN with new losses (i.e., WGAN+GP) and new feature engineering (i.e., general transform), which make training more stable and efficient. The problem type supports Classification and Regression datasets. You can also set problem_type to None in the CTAB-GAN+ code.

CTAB-GAN

This is the official repository for the paper CTAB-GAN: Effective Table Data Synthesizing. The paper was published at the Asian Conference on Machine Learning (ACML 2021). Please check the PDF on the PMLR website for the newest version of the paper, which adds more content on the time consumption analysis of training CTAB-GAN. If you have any questions, please contact [email protected] for more information.

Prerequisite

The required package versions:

numpy==1.21.0
torch==1.9.1
pandas==1.2.4
sklearn==0.24.1
dython==0.6.4.post1
scipy==1.4.1

Newer versions of the sklearn package have updated sklearn.mixture.BayesianGaussianMixture. Therefore, users should use the sklearn version listed above to run the code successfully!
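A quick way to confirm that an environment matches the pinned versions is sketched below. It is not part of the repository: check_versions and REQUIRED are hypothetical names, and the sketch assumes Python 3.8+ (for importlib.metadata) and that sklearn is installed under its pip distribution name, scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the list above. "scikit-learn" is the pip
# distribution name of the package imported as "sklearn".
REQUIRED = {
    "numpy": "1.21.0",
    "torch": "1.9.1",
    "pandas": "1.2.4",
    "scikit-learn": "0.24.1",
    "dython": "0.6.4.post1",
    "scipy": "1.4.1",
}


def check_versions(required, get_version=version):
    """Return {package: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed. A local build
    suffix such as "+cu111" (common for torch wheels) is ignored.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling check_versions(REQUIRED) returns an empty dict when every pinned package is installed at the listed version.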

Example

Experiment_Script_Adult.ipynb is an example notebook for training CTAB-GAN with the Adult dataset. The dataset is already in the Real_Datasets folder. The evaluation code is also provided.

For large datasets

If your dataset has a large number of columns, our current code may be unable to encode all of your data, because CTAB-GAN wraps the encoded data into an image-like format. To work around this, change lines 341 and 348 in model/synthesizer/ctabgan_synthesizer.py. Each number in the sides list

sides = [4, 8, 16, 24, 32]

is the side length of the image. You can extend the list to [4, 8, 16, 24, 32, 64] or [4, 8, 16, 24, 32, 64, 128] to accept larger datasets.
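To see why extending the list helps, the sketch below works through the arithmetic. It assumes the synthesizer picks the smallest side whose square is at least the width of one encoded row, so each row fits into a side x side image. pick_side and encoded_dim are hypothetical names for illustration, not identifiers from the repository.

```python
def pick_side(encoded_dim, sides=(4, 8, 16, 24, 32)):
    """Return the smallest side such that side * side >= encoded_dim.

    encoded_dim is the assumed width of one encoded row (all encoded
    columns laid side by side). Raises ValueError when even the largest
    side is too small, which is the situation described above.
    """
    for side in sorted(sides):
        if side * side >= encoded_dim:
            return side
    largest = max(sides)
    raise ValueError(
        f"encoded width {encoded_dim} exceeds {largest}x{largest} = "
        f"{largest * largest}; add a larger side to the list"
    )


# With the default list, the widest row that fits is 32 * 32 = 1024 values.
print(pick_side(1000))                            # fits in a 32x32 image
print(pick_side(1500, (4, 8, 16, 24, 32, 64)))    # needs the added 64
```

Under this assumption, adding 64 raises the limit to 4096 encoded values per row, and adding 128 raises it to 16384.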

Bibtex

To cite this paper, you can use this BibTeX entry:

@InProceedings{zhao21,
  title = 	 {CTAB-GAN: Effective Table Data Synthesizing},
  author =       {Zhao, Zilong and Kunar, Aditya and Birke, Robert and Chen, Lydia Y.},
  booktitle = 	 {Proceedings of The 13th Asian Conference on Machine Learning},
  pages = 	 {97--112},
  year = 	 {2021},
  editor = 	 {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume = 	 {157},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf},
  url = 	 {https://proceedings.mlr.press/v157/zhao21a.html}
}
