Datasets in .pkl format? #1
Comments
Hey, thanks for your question. Unfortunately, the preprocessed datasets are still quite large, so we don't have the resources to host all of them at the moment. What we can do, however, is add some preprocessing instructions so that you can extract the same features using an open-source tool. We will try to do so in the next few days.
@jazzsaxmafia @kelvinxu I've also encountered the same problem.
@kelvinxu Could you provide some simple information about the pkl files? For example, what's in each pkl file and how it is structured. Thank you very much. Preprocessing instructions would be even nicer, if they won't take too long.
@leo-zhou Just for reference, this is my tentative guess. dictionary.pkl -> (updated based upon the other comments)
@kyunghyuncho Oh, I got it. So that's why you've used
@jnhwkim @kyunghyuncho Thanks a lot!
Thank you very much. I think those were enough for me to set up the data myself.
@jnhwkim, a very minor addition to prevent confusion: dictionary.pkl doesn't load a list but a Python dictionary, in the form
@kelvinxu Yes, you're right. To prevent confusion, I'll update my comment.
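For concreteness, here is a minimal sketch of inspecting dictionary.pkl along the lines of the comments above; the filename comes from this thread, and the word-to-id dict structure is per the comment, not verified against the repo:

```python
import cPickle

# load the word-to-id mapping; per the comments above this is a
# plain Python dict, not a list
with open('dictionary.pkl', 'rb') as f:
    worddict = cPickle.load(f)

print type(worddict)  # expected: <type 'dict'>
# peek at the lowest-id entries (most frequent words, per the later comments)
print sorted(worddict.items(), key=lambda kv: kv[1])[:10]
```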
Any news on the preprocessing instructions, or even an upload of the preprocessed datasets? Great library; a bit more documentation would be welcome, though.
Hey @samim23, the feature extraction procedure was described in the paper (you should extract conv5_4), but I agree that it should be explained and reproduced somewhere here in the repo.
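For reference, a minimal sketch of extracting conv5_4 features with Caffe's Python interface; only the layer name conv5_4 comes from the comment above, while the prototxt/caffemodel filenames and the image path are placeholders for whatever VGG-19 files you downloaded:

```python
import caffe
import numpy as np

# placeholder filenames: substitute your local VGG-19 deploy/weights files
net = caffe.Net('VGG_ILSVRC_19_layers_deploy.prototxt',
                'VGG_ILSVRC_19_layers.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))                        # HWC -> CHW
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))  # BGR means
transformer.set_raw_scale('data', 255)                              # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))                     # RGB -> BGR

img = caffe.io.load_image('example.jpg')  # placeholder image path
net.blobs['data'].data[...] = transformer.preprocess('data', img)
net.forward()

feat = net.blobs['conv5_4'].data[0]     # (512, 14, 14) feature maps
annotations = feat.reshape(512, 196).T  # (196, 512) annotation vectors
```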
Has anyone gotten the dataset conversion working? If so, it would be great if you could share the code. Will be trying this myself as well. |
@asampat3090 I saw you have implemented dataset-conversion code. Can you reproduce the results in Kelvin's paper? Thanks.
Hey guys, anyone succeeded in generating the pkl file? Any link would be very helpful! Thank you. |
@cxj273 I haven't actually tried; I'll try this weekend. @ffmpbgrnn check out my code: I have a generator for flickr_30k, but I haven't documented it much.
@asampat3090 I will have a look. Many thanks! :-) |
@asampat3090 Would your code actually work though? The image ids refer to the whole image collection, whereas you point to an image feature in a subset using the index that is meant for the whole image collection. Or am I missing something? I'm trying to port your code to the COCO dataset. |
@asampat3090 From my understanding, line 54 is wrong. You can't get all the training captions using the training image idx. Correct me if I am wrong. |
Hi, can I ask how large those .pkl files are? I tried to make them for the MSCOCO dataset, and the features from VGG for the training set alone take around 75 GB. I stored them as a scipy.sparse.csr_matrix. According to coco.py, it seems they all get loaded into memory together, so I was wondering if there is anything I'm missing.
@xlhdh It should be around 15 GB. They are all loaded into memory at once, but we unsparsify them one batch at a time. Are you unsparsifying them all at once?
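A minimal sketch of that per-batch unsparsifying, with toy data (196*512 matches the 14x14x512 conv5_4 maps discussed earlier); the function name is made up for illustration:

```python
import numpy as np
import scipy.sparse

# toy stand-in for the full sparse feature matrix (one row per image)
features = scipy.sparse.csr_matrix(
    np.random.rand(100, 196 * 512).astype('float32'))

def get_batch(features, batch_idx):
    # densify only the rows needed for the current minibatch,
    # keeping the full matrix sparse in memory
    return np.asarray(features[batch_idx].todense())

batch = get_batch(features, [0, 3, 7])  # dense (3, 100352) array
```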
@kelvinxu The original features were around 15 GB, but once I pickled them, they got to around 75... and they were csr_matrix from top to toe. I guess I'll look at it again to see if there's a bug!
It's likely because you didn't use "protocol=cPickle.HIGHEST_PROTOCOL", as the default protocol serializes to ASCII and is much less compact.
@kyunghyuncho Thank you, I used the highest protocol (I thought that was the default) and it worked! The only thing I wasn't able to do was dump the image features to disk all at once, so I had to read several files in and assemble them in memory.
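To illustrate the protocol point, a small sketch with toy data (the filenames are arbitrary); the default protocol 0 writes ASCII, while the highest protocol stores the underlying arrays in binary:

```python
import cPickle
import numpy as np
import scipy.sparse

feats = scipy.sparse.csr_matrix(
    np.random.rand(100, 1000).astype('float32'))

# protocol 0 (the default) serializes to ASCII and inflates the file size
with open('feats_ascii.pkl', 'wb') as f:
    cPickle.dump(feats, f)

# the binary protocol, as suggested above, is far more compact
with open('feats_binary.pkl', 'wb') as f:
    cPickle.dump(feats, f, protocol=cPickle.HIGHEST_PROTOCOL)
```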
@cxj273 @gamer13 Sorry for the delay; I'm not sure I quite understood the issue. I suppose there might be a mismatch between the "features" and "caps" variables in "prepare_data" here, but if I understand correctly, you're saying we would need to re-index all of the image ids? If so, did you have any success doing that? I'm still trying to figure it out. UPDATE: I believe I have reindexed it so that the features are referenced properly. Does anyone else have working code?
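A hypothetical sketch of the reindexing being discussed: mapping global image ids to row indices within one split's feature matrix. All names and values below are made up for illustration:

```python
# ids of the images that ended up in this split's feature matrix,
# in row order (values are made up)
train_image_ids = [104, 7, 2301]
id_to_row = dict((img_id, row)
                 for row, img_id in enumerate(train_image_ids))

# captions stored against global image ids...
raw_caps = [(104, 'a dog runs'), (2301, 'two kids play')]
# ...are re-pointed at rows of the split's feature matrix
caps = [(sent, id_to_row[img_id]) for img_id, sent in raw_caps]
print caps  # [('a dog runs', 0), ('two kids play', 2)]
```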
@asampat3090 Thank you for sharing your script. I had trouble running this model and your code was very helpful. I am still struggling, but here are my suggestions for your code:
Thanks. |
@kyunghyuncho @kelvinxu I observe that in function
Yes, the dictionary has IDs in descending frequency order. |
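A minimal sketch of building such a dictionary; the reservation of ids 0 and 1 for special tokens (e.g. end-of-sentence and UNK) is an assumption, not confirmed in this thread:

```python
from collections import Counter

def build_dictionary(captions):
    # count word occurrences over all training captions
    counts = Counter(w for cap in captions for w in cap.lower().split())
    worddict = {}
    # most frequent word gets the smallest id; ids 0 and 1 are
    # assumed reserved for special tokens (e.g. <eos> and UNK)
    for idx, (word, _) in enumerate(counts.most_common()):
        worddict[word] = idx + 2
    return worddict

print build_dictionary(['a dog runs', 'a cat sits'])
# e.g. {'a': 2, 'dog': 3, 'runs': 4, 'cat': 5, 'sits': 6}
```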
Hey all, I've created a script that appears to work for preprocessing. The source is |
Thanks @rowanz |
Hi @intuinno, I'm trying to run your prepare_caffe_and_dictionary_coco.ipynb. Could you please explain what the file dataset_coco.json is? |
I forked @intuinno's work and added some code and a simple doc in README.md (no need for dataset_coco.json).
Just run this one-line script to generate the file
Hello @Lorne0, thank you so much for your code. It helped me a lot in reproducing the project.
Hi @athenspeterlong. Because the pretrained CNN requires 224×224 input, we have to crop the images first before feeding them to the CNN.
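One common recipe for that, as a sketch (the actual preprocessing scripts may crop differently): scale the shorter side, then take a center crop:

```python
from PIL import Image

def resize_and_crop(path, size=256, crop=224):
    # scale the shorter side to `size`, preserving aspect ratio
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = float(size) / min(w, h)
    img = img.resize((int(round(w * scale)), int(round(h * scale))),
                     Image.BILINEAR)
    # then take a 224x224 center crop for the CNN
    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))
```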
Hi @intuinno, thank you for sharing the preprocessing code. I am using the Flickr8k dataset and was able to build the necessary .pkl files and dictionary using prepare_flickr8k.py. Any idea why this is happening? Thanks.
@Lorne0, I have tried to reproduce your results using your code. When I run prepare_model_coco.py, some errors happen:
Hello, thank you for sharing this great project.
I would like to run the code, but it seems the project does not contain the datasets used. I can get the Flickr or COCO datasets, but I do not know how the data is preprocessed into those .pkl files.
Can I possibly get the data as it is used in the project?
Thank you.