what's your dataset? and how does it works? #1

Open
oblivion120 opened this issue Jan 31, 2019 · 14 comments

Comments

@oblivion120

No description provided.

@voidism
Owner

voidism commented Jan 31, 2019

Hi,
You can modify line 328 in v2_cyclegan.py:

utils = Utils(X_data_path="big_cou.txt", Y_data_path="big_cna.txt")

Change "big_cou.txt" to the path of your X domain data.
Change "big_cna.txt" to the path of your Y domain data.
The data format is simple: one sentence per line, with the words in each sentence separated by spaces.
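
As a minimal sketch of that format (the file name x_domain.txt, the sample sentences, and the helper read_corpus are made up for illustration and are not part of the repo), a file could be written and read back like this:

```python
import os
import tempfile


def read_corpus(path):
    """Read one sentence per line and split each sentence on whitespace."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


# Write a tiny, hypothetical X-domain file in the described format.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "x_domain.txt")
with open(x_path, "w", encoding="utf-8") as f:
    f.write("the market closed higher today\n")
    f.write("officials announced a new policy\n")

x_sentences = read_corpus(x_path)
print(x_sentences[0])
```

For languages without spaces between words, such as Chinese, this means the text has to be word-segmented before it is written to the file.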

Before training the main CycleGAN model, you need to pretrain the generator and reconstructor (they are in the same model):

if args.mode == "pretrain":
    pretrain(model, embedding_layer, utils, int(args.epoch))

Then pretrain the discriminator:

if args.mode == "disc":
    disc = Discriminator(word_dim=utils.emb_mat.shape[1], inner_dim=512, seq_len=20)
    main_model = CycleGAN(disc, model, utils, embedding_layer)
    main_model.to(device)
    main_model.pretrain_disc()

After that, you can run the main CycleGAN section, which loads the pretrained models as initialization:

if args.mode == "cycle":
    disc = Discriminator(word_dim=utils.emb_mat.shape[1], inner_dim=512, seq_len=20)
    main_model = CycleGAN(disc, model, utils, embedding_layer)
    main_model.to(device)
    main_model.load_model(g_file="model_pretrain.ckpt", r_file="model_pretrain.ckpt", d_file="model_disc_pretrain.ckpt")
    main_model.train_model()

See v2_cyclegan.py for more information.
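
Because the "cycle" mode loads model_pretrain.ckpt and model_disc_pretrain.ckpt, it may help to check that both files exist before starting it. The following is only a sketch: missing_checkpoints is a hypothetical helper that is not in the repo, and it assumes the checkpoints sit in the directory you pass in.

```python
import os
import tempfile

# File names taken from the load_model(...) call in the "cycle" branch.
CHECKPOINTS_FOR_CYCLE = ("model_pretrain.ckpt", "model_disc_pretrain.ckpt")


def missing_checkpoints(directory="."):
    """Return the checkpoint files that the cycle stage expects but cannot find."""
    return [
        name
        for name in CHECKPOINTS_FOR_CYCLE
        if not os.path.exists(os.path.join(directory, name))
    ]


# Demonstration in an empty temporary directory (stands in for the repo folder).
work_dir = tempfile.mkdtemp()
before = missing_checkpoints(work_dir)

# Simulate having finished only the generator/reconstructor pretraining.
open(os.path.join(work_dir, "model_pretrain.ckpt"), "w").close()
after = missing_checkpoints(work_dir)

print("missing before pretraining:", before)
print("missing after pretraining:", after)
```

If the returned list is not empty, the corresponding pretraining mode has probably not been run yet.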

@sheetalsh456

Hey,
I can't find any documentation for the jexus module in Python, so I can't run this code because I don't have it installed. Could you please point me to any documentation for it?

@voidism
Owner

voidism commented Oct 23, 2019

@sheetalsh456 Sorry, I forgot about that. I had put this file in my /usr/lib/python/site-packages/ so that I could import it from anywhere.
You can use:
https://gist.github.com/voidism/22691f2f7d9ec0fac2df3884dc3e31d0
The main function of this file is to print time bars like tqdm, but with additional information such as loss or accuracy.
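
To give a rough idea of what such a time bar does, here is a small standalone sketch. It is not the jexus API: format_bar and its parameters are invented for illustration, and the loss values are fake.

```python
import time


def format_bar(step, total, start_time, width=20, **metrics):
    """Build one progress line: bar, step counter, elapsed time, and metrics."""
    filled = int(width * step / total)
    bar = "#" * filled + "-" * (width - filled)
    elapsed = time.time() - start_time
    extras = " ".join(f"{name}={value:.4f}" for name, value in metrics.items())
    return f"[{bar}] {step}/{total} {elapsed:.1f}s {extras}".rstrip()


start = time.time()
for step in range(1, 6):
    fake_loss = 1.0 / step  # placeholder value, not a real training loss
    # "\r" returns to the start of the line so the bar redraws in place.
    print("\r" + format_bar(step, 5, start, loss=fake_loss), end="")
print()
```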

@sheetalsh456

Sure, I included that file, thanks! :)
Also, I'm now getting the following error in the load_embedding() function of utils.py:
No such file or directory: WordEmb/idx2word.npy
I'm guessing there should be an npy file there?

@voidism
Owner

voidism commented Oct 25, 2019

@sheetalsh456 This file holds the word embedding layer weights. It is a numpy array with shape=(vocab_size, embedding_dim). The order of the word vectors in this array should match the way you convert words into indices.
In my experiment I used Chinese word embedding weights trained with skip-gram using gensim. You probably don't want to train this model on a Chinese corpus, so you will need to prepare the file yourself.
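
A minimal sketch of the ordering requirement, assuming you already have one vector per word from some embedding model (the three words and their vectors below are made up, and word_vectors is a plain dict standing in for whatever your embedding tool provides):

```python
import numpy as np

# Hypothetical pretrained vectors with embedding_dim = 3.
word_vectors = {
    "hello": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "world": np.array([0.4, 0.5, 0.6], dtype=np.float32),
    "again": np.array([0.7, 0.8, 0.9], dtype=np.float32),
}

# The index of a word is its position in this list.
idx2word = ["hello", "world", "again"]
word2idx = {word: idx for idx, word in enumerate(idx2word)}

# Stack the vectors in index order so that row i belongs to idx2word[i].
weights = np.stack([word_vectors[word] for word in idx2word])

print(weights.shape)  # (vocab_size, embedding_dim)
```

The point is that row i of the weight matrix has to be the vector of the word that gets index i when sentences are converted to indices.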

@sheetalsh456

Okay, and is this the vocabulary for X_data or Y_data or both?

@voidism
Owner

voidism commented Oct 25, 2019

@sheetalsh456 Both!
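
One way to get a single vocabulary that covers both domains is to collect the words from both corpora into one index. This is only a sketch: build_vocab is a hypothetical helper, the sentences are made up, and any special tokens the repo might expect are not handled here.

```python
def build_vocab(*corpora):
    """Assign each distinct word an index in order of first appearance."""
    idx2word = []
    word2idx = {}
    for corpus in corpora:
        for sentence in corpus:
            for word in sentence:
                if word not in word2idx:
                    word2idx[word] = len(idx2word)
                    idx2word.append(word)
    return idx2word, word2idx


# Tiny made-up corpora: tokenized sentences for the X and Y domains.
x_sentences = [["the", "report", "was", "published"]]
y_sentences = [["well", "the", "report", "came", "out"]]

idx2word, word2idx = build_vocab(x_sentences, y_sentences)
print(len(idx2word), word2idx["report"])
```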

@sheetalsh456

So if I understand correctly, there are 2 npy files.

  1. WordEmb/word2vec_weights.npy : This is a numpy array of shape (vocab_size, embedding_dim), which is what you just described above.
  2. WordEmb/idx2word.npy : This is also a numpy array. But is it a numpy array of strings? And what are its dimensions?

@voidism
Owner

voidism commented Oct 26, 2019

@sheetalsh456 Yes, it is a numpy array of strings, and the words are ordered by their indices.
P.S. Actually, it was just a vocab list at the beginning, but I saved it with np.save("idx2word.npy", vocab_list) so that, for convenience, I didn't need to import pickle to save it.
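
A small sketch of that round trip, using a made-up three-word vocab list and a temporary directory instead of WordEmb/. When a plain list of strings is passed to np.save, NumPy stores it as a 1-D unicode array, so the loaded array has shape (vocab_size,):

```python
import os
import tempfile

import numpy as np

vocab_list = ["hello", "world", "again"]  # made-up vocabulary

path = os.path.join(tempfile.mkdtemp(), "idx2word.npy")
np.save(path, vocab_list)  # the list is converted to a numpy array of strings

idx2word = np.load(path)
print(idx2word.dtype.kind, idx2word.shape)

# The reverse mapping can be rebuilt from the loaded array.
word2idx = {word: idx for idx, word in enumerate(idx2word.tolist())}
```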

@sheetalsh456

Oh okay, makes sense!
Also, do you have any results/graphs of the performance of this Cycle GAN on any sort of text dataset (Chinese works too)? Cycle GAN is said to be really unstable for text, so I'm just curious to see how this one works!

@voidism
Owner

voidism commented Oct 27, 2019

@sheetalsh456 I did not keep the results; it was too long ago. In my experiments, I put formal text such as a news corpus in X_data and informal text such as video subtitles in Y_data. In the end, the model learned to insert some filler words before/after the original formal input text as its output. The model did learn something about transferring from formal to informal, although it produced non-fluent text in many cases.

I think if you are doing easier tasks like sentiment style transfer from positive sentences to negative sentences, the performance may be better.

@sheetalsh456

Hey @voidism thanks a lot! :)

@voidism
Owner

voidism commented Oct 28, 2019

@sheetalsh456 You're welcome!

@MuhammadArsalan155

Thanks, @voidism, for the detailed explanation. I've set up the code, but when I try to pretrain the model, my notebook crashes. How do you manage it?
if args.mode == "pretrain":
    pretrain(model, embedding_layer, utils, int(args.epoch))
