how to prepare my own dataset #1

Open
wanghaisheng opened this issue Apr 27, 2018 · 3 comments
Comments

@wanghaisheng

Let's say I want to classify domain documents into about 48 categories. Should I build a dataset like the RVL-CDIP dataset? What is the proper DPI for the document images? Should I convert them to grayscale?
400,000 grayscale images in 16 classes, with 25,000 images per class

@robical
Contributor

robical commented May 1, 2018

Hi,

If you want to test your methodology first, the RVL-CDIP dataset is the easier starting point, since all 400K documents have already been manually classified into 16 classes. If you want to extend the classification granularity of RVL-CDIP to 48 classes, you have several options.
If you want to classify documents from a different, specific domain (let's say English literature documents), you can still start from the CNN weights trained on the RVL-CDIP dataset and retrain the model on your own classes (which you will have to label manually first).
The number of images needed per class depends strongly on the depth of your network. It is not mandatory to use the same architecture as the RVL-CDIP article (a variation of AlexNet, which is quite deep). It is, however, important to keep the number of training examples per class balanced, to avoid introducing bias during training: if the number of training samples is not approximately the same in all classes, the model risks being biased toward the classes that are better represented in the training set.
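As a rough illustration of the balancing point above, here is a minimal sketch that downsamples every class to the size of the rarest one. The function name `balance_by_downsampling` and the `(path, label)` sample format are assumptions made for this example, not part of the RVL-CDIP tooling; downsampling is only one of several ways to balance a training set.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Return a subset of `samples` in which every label occurs equally often.

    `samples` is a list of (path, label) pairs (a hypothetical format).
    Each class is randomly downsampled to the size of the rarest class.
    """
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    # The rarest class sets the per-class quota.
    target = min(len(paths) for paths in by_label.values())

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for label in sorted(by_label):
        for path in rng.sample(by_label[label], target):
            balanced.append((path, label))
    return balanced
```

If discarding images is too costly, weighting the loss per class or oversampling the rare classes are alternatives that aim at the same goal.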
If instead your goal is to make the RVL-CDIP categories more fine-grained, here are some options (not exhaustive, of course):

  1. Use the OCR part of the RVL-CDIP dataset, and apply BoW or semantic quantization (e.g. word2vec) + clustering, in order to obtain subclasses
  2. Apply clustering techniques on various features of each single class and use a sparse visualization technique to check if there is any other obvious additional category
  3. Use 1) with doc2vec, and see whether the dataset can be reorganized per topic; that would be extremely useful in real-life scenarios.
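
Option 1 above can be sketched roughly as follows: build a bag-of-words vector from the OCR text of each document, then group the documents with a small k-means-style loop that uses cosine similarity. This is a toy, pure-Python sketch under stated assumptions; the names `bow`, `cosine`, `mean_vector`, and `cluster` are hypothetical, and a real pipeline would more likely rely on an established library for vectorization and clustering.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: token -> count, using whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def cluster(texts, k, iterations=10):
    """Assign each text to one of k clusters; returns a list of cluster ids."""
    vectors = [bow(t) for t in texts]

    # Farthest-first initialization: start from the first document, then
    # repeatedly add the document least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(farthest)

    assignment = []
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        assignment = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    return assignment
```

Running `cluster` on the OCR text of a single RVL-CDIP class, with `k` set to the number of subclasses you suspect, gives candidate subclasses that still need to be inspected by hand before they are used as labels.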

Hope this is somewhat useful.
Roberto

@wanghaisheng
Author

Really helpful, thanks.

@neerajbhat98

Hi, can you tell me how many GPUs were required for training?
