Hi @yash-rathore-arya, I think your understanding of non_categorical_columns is correct, and I will set up the training the same way as yours. Did you solve this problem?
For the definitions of general_columns, mixed_columns, and log_columns, everything is written in our CTAB-GAN+ paper (https://arxiv.org/pdf/2204.00401).
Please do set up the training like mine. The dataset is public, and I can also share the Colab notebook link if you wish. The error "y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91 92 93 94 98 99 101 103]" still occurs and I have no idea why. No, I haven't solved the issue yet!
Yes, I have relied on the paper for the definitions of the remaining parameters. Thanks a lot!
Dataset : https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing : Lending Loan Club Dataset.
I was doing a comparative analysis of using the non_categorical_columns parameter versus not using it, with respect to training time and the quality of the generated data.
I understand that to include a column in "non_categorical_columns", you also need to add that column to "categorical_columns".
So that's what I did:
    synthesizer = CTABGAN(
        raw_csv_path=real_path,
        test_ratio=0.20,
        categorical_columns=cat,
        log_columns=[],
        mixed_columns={},
        general_columns=[],
        non_categorical_columns=high_cardinality_cols,
        integer_columns=num,
        problem_type={None: None})
where:
high_cardinality_cols refers to categorical columns with more than 10k unique values (namely 'emp_title')
cat refers to all categorical columns (including the high-cardinality ones)
num refers to all numeric or integer columns
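For readers who want to reproduce the three lists, here is a minimal sketch of one way to derive them with pandas. It is not taken from the CTAB-GAN+ code base: the helper name split_columns, the toy data, and the rule "every non-numeric column is categorical" are assumptions made for illustration, and the threshold is lowered so the toy frame has a high-cardinality column.

```python
import pandas as pd


def split_columns(df, cardinality_threshold=10_000):
    """Hypothetical helper: build the cat / num / high-cardinality lists.

    Assumption: every non-numeric column is treated as categorical.
    """
    num = df.select_dtypes(include="number").columns.tolist()
    cat = df.select_dtypes(exclude="number").columns.tolist()
    # A categorical column counts as high-cardinality when its number of
    # distinct values exceeds the threshold.
    high = [c for c in cat if df[c].nunique() > cardinality_threshold]
    return cat, num, high


# Toy stand-in for the real CSV (made-up values, not the actual dataset).
df = pd.DataFrame({
    "emp_title": ["nurse", "teacher", "driver", "analyst", "clerk"],
    "grade": ["A", "B", "A", "B", "A"],
    "loan_amnt": [5000, 12000, 8000, 15000, 7000],
})

cat, num, high_cardinality_cols = split_columns(df, cardinality_threshold=3)
```

Note that the number of distinct values depends on how many rows are loaded, so the lists should be computed on the same subset that is used for training.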
I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but an error occurs at this line:
syn = synthesizer.generate_samples()  # part of the code
The error I encountered is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
160 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
161 if len(diff):
--> 162 raise ValueError("y contains previously unseen labels: %s" % str(diff))
163 y = np.asarray(y)
164 return self.classes_[y]
ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91
92 93 94 98 99 101 103]
Clearly, the error comes from the inverse transformation step.
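The same ValueError can be reproduced outside CTAB-GAN+ with scikit-learn's LabelEncoder alone. The sketch below assumes, without having confirmed it in the library code, that a label-encoded column modelled as continuous can come back from the generator with values outside the range of known codes. The category names and generated values are made up, and the clipping step is only one possible workaround, not the maintainers' fix.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on four hypothetical categories; valid integer codes are 0..3.
encoder = LabelEncoder()
encoder.fit(["analyst", "driver", "nurse", "teacher"])

# Hypothetical generator output for a column treated as continuous.
generated = np.array([-2.3, 0.4, 1.6, 5.1])
codes = np.rint(generated).astype(int)  # rounds to -2, 0, 2, 5

# Codes -2 and 5 were never seen during fit, so decoding fails.
try:
    encoder.inverse_transform(codes)
    message = ""
except ValueError as exc:
    message = str(exc)

# Possible workaround: clip the codes into the valid range before decoding.
clipped = np.clip(codes, 0, len(encoder.classes_) - 1)
decoded = encoder.inverse_transform(clipped)
```

Clipping maps every out-of-range code to the first or last category, which can skew the synthetic distribution, so it is better suited to diagnosing the problem than to producing final data.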
@zhao-zilong Can you explain the reason for this error, and suggest some ways to determine which columns should go into the non_categorical_columns parameter?
Also, please give a formal definition of these 3 parameters:
general_columns, mixed_columns, log_columns
Thanks in advance!!