Datasets in .pkl format? #1
Comments
Hey, thanks for your question. Unfortunately, the preprocessed datasets are still quite large, so we don't have the resources to host all of them at the moment. What we can do, however, is add some preprocessing instructions so that you can extract the same features using an open-source tool. We will try to do so in the next few days.
@jazzsaxmafia @kelvinxu I've also encountered the same problem.
@kelvinxu Could you provide some simple information about the pkl files? For example, what's in each pkl file and how it is structured. Thank you very much. Preprocessing instructions would be even nicer, if they won't take too long.
@leo-zhou Just for reference, this is my tentative guess. dictionary.pkl -> (updated based upon the other comments)
@kyunghyuncho Oh, I got it. So that's why you've used
@jnhwkim @kyunghyuncho Thanks a lot!
Thank you very much. I think those were enough for me to set up the data myself.
@jnhwkim, a very minor addition to prevent confusion: dictionary.pkl doesn't load a list but a Python dictionary, in the form
@kelvinxu Yes, you're right. To prevent confusion, I'll update my comment.
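For concreteness, here is a minimal sketch of inspecting dictionary.pkl along the lines of the comments above; the filename comes from this thread, and the word-to-id dict structure is per the comment, not verified against the repo:

```python
import cPickle

# load the word-to-id mapping; per the comments above this is a
# plain Python dict, not a list
with open('dictionary.pkl', 'rb') as f:
    worddict = cPickle.load(f)

print type(worddict)  # expected: <type 'dict'>
# peek at the lowest-id entries (most frequent words, per the later comments)
print sorted(worddict.items(), key=lambda kv: kv[1])[:10]
```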
Any news on the preprocessing instructions, or even an upload of the preprocessed datasets? Great library; a bit more documentation would be welcome, though.
Hey @samim23, the feature extraction procedure was described in the paper (you should extract conv5_4), but I agree that it should be explained and reproduced somewhere here in the repo.
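For reference, a minimal sketch of extracting conv5_4 features with Caffe's Python interface; only the layer name conv5_4 comes from the comment above, while the prototxt/caffemodel filenames and the image path are placeholders for whatever VGG-19 files you downloaded:

```python
import caffe
import numpy as np

# placeholder filenames: substitute your local VGG-19 deploy/weights files
net = caffe.Net('VGG_ILSVRC_19_layers_deploy.prototxt',
                'VGG_ILSVRC_19_layers.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))                        # HWC -> CHW
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))  # BGR means
transformer.set_raw_scale('data', 255)                              # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))                     # RGB -> BGR

img = caffe.io.load_image('example.jpg')  # placeholder image path
net.blobs['data'].data[...] = transformer.preprocess('data', img)
net.forward()

feat = net.blobs['conv5_4'].data[0]     # (512, 14, 14) feature maps
annotations = feat.reshape(512, 196).T  # (196, 512) annotation vectors
```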
Has anyone gotten the dataset conversion working? If so, it would be great if you could share the code. Will be trying this myself as well. |
@asampat3090 I saw you have implemented dataset-conversion code. Can you reproduce the results in Kelvin's paper? Thanks.
Hey guys, anyone succeeded in generating the pkl file? Any link would be very helpful! Thank you. |
@cxj273 I haven't actually tried; I'll try this weekend. @ffmpbgrnn check out my code: I have a generator for flickr_30k, but I haven't documented it much.
@asampat3090 I will have a look. Many thanks! :-) |
@asampat3090 Would your code actually work though? The image ids refer to the whole image collection, whereas you point to an image feature in a subset using the index that is meant for the whole image collection. Or am I missing something? I'm trying to port your code to the COCO dataset. |
@asampat3090 From my understanding, line 54 is wrong. You can't get all the training captions using the training image idx. Correct me if I am wrong. |
Hi, can I ask how large those .pkl files are? I tried to make them for the MSCOCO dataset, and the features from VGG for the training set alone take around 75 GB. I stored them as a scipy.sparse.csr_matrix. According to coco.py, it seems they all get loaded into memory together, so I was wondering if there is anything I'm missing.
@xlhdh It should be around 15 GB. They are all loaded into memory at once, but we unsparsify them one batch at a time. Are you unsparsifying them all at once?
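A minimal sketch of that per-batch unsparsifying, with toy data (196*512 matches the 14x14x512 conv5_4 maps discussed earlier); the function name is made up for illustration:

```python
import numpy as np
import scipy.sparse

# toy stand-in for the full sparse feature matrix (one row per image)
features = scipy.sparse.csr_matrix(
    np.random.rand(100, 196 * 512).astype('float32'))

def get_batch(features, batch_idx):
    # densify only the rows needed for the current minibatch,
    # keeping the full matrix sparse in memory
    return np.asarray(features[batch_idx].todense())

batch = get_batch(features, [0, 3, 7])  # dense (3, 100352) array
```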
@kelvinxu The original features were around 15 GB, but once I pickled them, they got to around 75... and they were csr_matrix from top to toe. I guess I'll look at it again to see if there's a bug!
It's likely because you didn't use "protocol=cPickle.HIGHEST_PROTOCOL", as the default protocol serializes to ASCII and is much less compact.
@kyunghyuncho Thank you, I used the highest protocol (I thought that was the default) and it worked! The only thing I wasn't able to do was dump the image features to disk all at once, so I had to read several files in and assemble them in memory.
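To illustrate the protocol point, a small sketch with toy data (the filenames are arbitrary); the default protocol 0 writes ASCII, while the highest protocol stores the underlying arrays in binary:

```python
import cPickle
import numpy as np
import scipy.sparse

feats = scipy.sparse.csr_matrix(
    np.random.rand(100, 1000).astype('float32'))

# protocol 0 (the default) serializes to ASCII and inflates the file size
with open('feats_ascii.pkl', 'wb') as f:
    cPickle.dump(feats, f)

# the binary protocol, as suggested above, is far more compact
with open('feats_binary.pkl', 'wb') as f:
    cPickle.dump(feats, f, protocol=cPickle.HIGHEST_PROTOCOL)
```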
@cxj273 @gamer13 Sorry for the delay; I'm not sure I quite understood the issue. I suppose there might be a mismatch between the "features" and "caps" variables in "prepare_data" here, but if I understand correctly, you're saying we would need to re-index all of the image ids? If so, did you have any success doing that? I'm still trying to figure it out. UPDATE: I believe I have reindexed it so that the features are referenced properly. Does anyone else have working code?
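A hypothetical sketch of the reindexing being discussed: mapping global image ids to row indices within one split's feature matrix. All names and values below are made up for illustration:

```python
# ids of the images that ended up in this split's feature matrix,
# in row order (values are made up)
train_image_ids = [104, 7, 2301]
id_to_row = dict((img_id, row)
                 for row, img_id in enumerate(train_image_ids))

# captions stored against global image ids...
raw_caps = [(104, 'a dog runs'), (2301, 'two kids play')]
# ...are re-pointed at rows of the split's feature matrix
caps = [(sent, id_to_row[img_id]) for img_id, sent in raw_caps]
print caps  # [('a dog runs', 0), ('two kids play', 2)]
```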
@asampat3090 Thank you for sharing your script. I had trouble running this model and your code was very helpful. I am still struggling, but here are my suggestions for your code:
Thanks. |
@kyunghyuncho @kelvinxu I observe that in function
Yes, the dictionary has IDs in descending frequency order. |
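A minimal sketch of building such a dictionary; the reservation of ids 0 and 1 for special tokens (e.g. end-of-sentence and UNK) is an assumption, not confirmed in this thread:

```python
from collections import Counter

def build_dictionary(captions):
    # count word occurrences over all training captions
    counts = Counter(w for cap in captions for w in cap.lower().split())
    worddict = {}
    # most frequent word gets the smallest id; ids 0 and 1 are
    # assumed reserved for special tokens (e.g. <eos> and UNK)
    for idx, (word, _) in enumerate(counts.most_common()):
        worddict[word] = idx + 2
    return worddict

print build_dictionary(['a dog runs', 'a cat sits'])
# e.g. {'a': 2, 'dog': 3, 'runs': 4, 'cat': 5, 'sits': 6}
```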
Hey all, I've created a script that appears to work for preprocessing. The source is |
Thanks @rowanz |
Hi @intuinno, I'm trying to run your prepare_caffe_and_dictionary_coco.ipynb. Could you please explain what the file dataset_coco.json is? |
I forked @intuinno's work and added some code and a simple doc in README.md (no need for dataset_coco.json).
Just run this one-line script to generate the file
Hello @Lorne0, thank you so much for your code. It helped me a lot in reproducing the project.
Hi @athenspeterlong. Because the pretrained CNN requires 224×224 input, we have to crop the images first before feeding them to the CNN.
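One common recipe for that, as a sketch (the actual preprocessing scripts may crop differently): scale the shorter side, then take a center crop:

```python
from PIL import Image

def resize_and_crop(path, size=256, crop=224):
    # scale the shorter side to `size`, preserving aspect ratio
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = float(size) / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))),
                     Image.BILINEAR)
    # then take a 224x224 center crop for the CNN
    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))
```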
Hi @intuinno, thank you for sharing the preprocessing code. I am using the Flickr8k dataset and was able to build the necessary .pkl files and dictionary using prepare_flickr8k.py. Any idea why this is happening? Thanks.
@Lorne0, I have tried to reproduce your results using your code. When I run prepare_model_coco.py, some errors happen:
Hello, thank you for sharing this great project.
I would like to run the code, but it seems the project does not contain the datasets used. I can get the Flickr or COCO datasets, but I do not know how the data is preprocessed into those .pkl files.
Can I possibly get the data as it is used in the project?
Thank you.