Query regarding "non_Categorical_columns" #8

Open
yash-rathore-arya opened this issue Jun 8, 2023 · 2 comments

Comments

@yash-rathore-arya

Dataset: Lending Loan Club Dataset (https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing).
I was doing a comparative analysis of using the non_categorical_columns parameter vs. not using it, with respect to training time and the quality of the generated data.

As I understand it, to include a column in "non_categorical_columns" you also need to add that column to "categorical_columns".
So that's what I did:

synthesizer = CTABGAN(
    raw_csv_path=real_path,
    test_ratio=0.20,
    categorical_columns=cat,
    log_columns=[],
    mixed_columns={},
    general_columns=[],
    non_categorical_columns=high_cardinality_cols,
    integer_columns=num,
    problem_type={None: None})

where:
high_cardinality_cols refers to categorical columns with more than 10k unique values (namely 'emp_title'),
cat refers to all categorical columns (including the high-cardinality ones),
num refers to all numeric or integer columns.
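
For reference, the three lists above could be built along these lines. This is only a minimal sketch using the standard library on a tiny in-memory CSV. The sample rows, the `split_columns` helper, and the threshold of 2 unique values are hypothetical stand-ins (the real run used a 10k threshold on the loan data), and a column is treated as numeric only if every value parses as a float.

```python
import csv
import io

# Hypothetical miniature stand-in for the loan CSV; only the column name
# 'emp_title' comes from the issue, the others are invented for illustration.
SAMPLE_CSV = """emp_title,grade,loan_amnt
teacher,A,1000
nurse,B,2500
driver,A,4000
engineer,B,1500
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_columns(rows, high_cardinality_threshold):
    """Split column names into (cat, num, high_cardinality_cols)."""
    cat, num, high_cardinality_cols = [], [], []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        if all(is_number(v) for v in values):
            num.append(name)
            continue
        cat.append(name)
        # High-cardinality columns stay in `cat` as well, as described above.
        if len(set(values)) > high_cardinality_threshold:
            high_cardinality_cols.append(name)
    return cat, num, high_cardinality_cols


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cat, num, high_cardinality_cols = split_columns(rows, high_cardinality_threshold=2)
print(cat, num, high_cardinality_cols)
```

With pandas, `DataFrame.nunique()` gives the same per-column unique counts on the real file.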

I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but the following call in my code fails:
    syn = synthesizer.generate_samples()
The error I encountered is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
160 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
161 if len(diff):
--> 162 raise ValueError("y contains previously unseen labels: %s" % str(diff))
163 y = np.asarray(y)
164 return self.classes_[y]

ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91
92 93 94 98 99 101 103]
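
The traceback shows that `inverse_transform` only accepts integer codes in the range `0 .. len(classes_) - 1`; every reported value is either negative or above the largest valid code. One plausible cause (an assumption, not confirmed in this thread) is that a column listed in non_categorical_columns is generated as a continuous value and then rounded, so it can land outside the encoder's range. The sketch below reproduces the check with a dependency-free stand-in for the encoder and shows one possible workaround, clamping the codes before decoding. `TinyLabelEncoder` and `clip_codes` are hypothetical names, not part of CTAB-GAN+ or scikit-learn.

```python
class TinyLabelEncoder:
    """Dependency-free stand-in that mimics the range check in the traceback."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        return self

    def inverse_transform(self, codes):
        valid = range(len(self.classes_))
        unseen = sorted({c for c in codes if c not in valid})
        if unseen:
            raise ValueError("y contains previously unseen labels: %s" % unseen)
        return [self.classes_[c] for c in codes]


def clip_codes(raw_codes, n_classes):
    """Round generated values and clamp them into 0 .. n_classes - 1."""
    return [min(max(int(round(c)), 0), n_classes - 1) for c in raw_codes]


encoder = TinyLabelEncoder().fit(["driver", "nurse", "teacher"])

# Hypothetical generator output for a label-encoded column modelled as continuous.
generated = [-2.3, 0.4, 1.6, 5.1]

try:
    # Rounding alone leaves -2 and 5, which are outside 0..2, so this raises.
    encoder.inverse_transform([int(round(c)) for c in generated])
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# Clamping first keeps every code decodable.
decoded = encoder.inverse_transform(clip_codes(generated, len(encoder.classes_)))
print(decoded)
```

Clamping silences the error but maps every out-of-range value to the first or last category, which can distort the column's distribution. With NumPy arrays the same clamp can be written with `np.clip`.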

Clearly the error is due to the inverse transformation methods.
@zhao-zilong Can you tell me the reason for the error, along with some methods to determine which columns should go into the non_categorical_columns parameter?

Also, please give a formal definition of these 3 parameters:
general_columns, mixed_columns, log_columns

Thanks in advance!!

@zhao-zilong
Contributor

Hi @yash-rathore-arya, I think your understanding of non_categorical_columns is correct. I will also set up the training like yours. Did you solve this problem?

The definitions of general_columns, mixed_columns and log_columns are all given in our CTAB-GAN+ paper (https://arxiv.org/pdf/2204.00401).

@yash-rathore-arya
Author

Please do set up the training like mine. The dataset is public, and I can also share the Colab notebook link if you wish. The error "y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91 92 93 94 98 99 101 103]" still occurs and I have no idea why. No, I haven't solved the issue yet!
Yes, I have relied on the paper for the definitions of the remaining parameters. Thanks a lot!
