Query regarding "non_Categorical_columns" #8

yash-rathore-arya · 2023-06-08T08:40:39Z

Dataset (Lending Club loan dataset): https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing
I was doing a comparative analysis of using the non_categorical_columns parameter versus not using it, with respect to training time and the quality of the generated data.

I understand that to include a column in "non_categorical_columns", you also need to add it to "categorical_columns".
So that's what I did:

synthesizer = CTABGAN(
    raw_csv_path=real_path,
    test_ratio=0.20,
    categorical_columns=cat,
    log_columns=[],
    mixed_columns={},
    general_columns=[],
    non_categorical_columns=high_cardinality_cols,
    integer_columns=num,
    problem_type={None: None},
)

where:
high_cardinality_cols refers to categorical columns with more than 10k unique values (namely 'emp_title')
cat refers to all categorical columns (including the high-cardinality ones)
num refers to all numeric or integer columns
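
For readers who want to build the same lists, here is a minimal sketch of one way to pick out high-cardinality columns with pandas. The helper name high_cardinality_columns, the toy DataFrame, the column names other than 'emp_title', and the threshold of 10 are illustrative assumptions, not part of CTAB-GAN+ or of the original setup (which used a 10k threshold on the full dataset).

```python
import pandas as pd


def high_cardinality_columns(df, categorical_columns, threshold):
    """Return the categorical columns with more than `threshold` distinct values.

    Hypothetical helper for illustration; it is not part of CTAB-GAN+.
    """
    return [col for col in categorical_columns if df[col].nunique() > threshold]


# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "emp_title": [f"title_{i}" for i in range(50)],  # 50 distinct values
    "grade": ["A", "B", "C", "D", "E"] * 10,         # 5 distinct values
    "loan_amnt": range(50),                          # numeric, not in `cat`
})

cat = ["emp_title", "grade"]
high_cardinality_cols = high_cardinality_columns(df, cat, threshold=10)
print(high_cardinality_cols)
```

Note that the count depends on the rows actually used for training: a column that has more than 10k unique values in the full dataset has far fewer in a 100-row subset.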

I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but an error is raised by this part of the code:
syn = synthesizer.generate_samples()
The error I encountered is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
160 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
161 if len(diff):
--> 162 raise ValueError("y contains previously unseen labels: %s" % str(diff))
163 y = np.asarray(y)
164 return self.classes_[y]

ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91
92 93 94 98 99 101 103]

Clearly, the error comes from the inverse transformation step.
@zhao-zilong Can you tell me the reason for the error, along with some ways to decide which columns should go into the non_categorical_columns parameter?

Also, please give a formal definition of these 3 parameters:
general_columns, mixed_columns, log_columns

Thanks in advance!!

zhao-zilong · 2023-06-18T06:23:31Z

Hi @yash-rathore-arya, I think your understanding of non_categorical_columns is correct; I would also set up the training the way you did. Did you solve this problem?

The definitions of "general_columns, mixed_columns, log_columns" are all given in our CTAB-GAN+ paper (https://arxiv.org/pdf/2204.00401).

yash-rathore-arya · 2023-06-18T10:49:05Z

Please do set up the training like mine. The dataset is public, and I can also share the Colab notebook link if you wish. The error "y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91 92 93 94 98 99 101 103]" still occurs and I have no idea why. No, I haven't solved the issue yet!
Yes, I have relied on the paper for the definitions of the rest of the columns. Thanks a lot!
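
For anyone hitting the same traceback, the ValueError can be reproduced in isolation with scikit-learn's LabelEncoder: inverse_transform rejects any integer code outside 0 .. len(classes_) - 1, and the labels listed in the error are exactly of that kind (negative values and values above the fitted range). The sketch below only demonstrates that check. The class names and generated codes are made up, and the clipping step is an assumed diagnostic workaround, not the repository's fix; it does not explain why out-of-range codes are generated in the first place.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on a small set of categories; they receive the integer codes 0..3.
encoder = LabelEncoder()
encoder.fit(["analyst", "driver", "nurse", "teacher"])

# Made-up "generated" codes: -2 and 5 fall outside the fitted range 0..3.
generated_codes = np.array([-2, 0, 3, 5])

try:
    encoder.inverse_transform(generated_codes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Assumed workaround for inspection only: clip codes into the valid range
# before decoding. This maps every out-of-range code to the first or last
# class, so it distorts the category distribution.
clipped = np.clip(generated_codes, 0, len(encoder.classes_) - 1)
decoded = encoder.inverse_transform(clipped)
print(decoded)
```

Clipping makes generate_samples-style decoding go through so the output can be inspected, but a large share of out-of-range codes would still point to a problem in how the column is encoded or generated.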
