Speed up bulk processing with Tika #18
@jeremybmerrill My apologies for just responding now. Thanks for your work on this! I like your idea of having a server mode for Yomu and I'm open to changing the API. If you're still up for it, may I know what changes you have in mind and the syntax for wrapping it in a block? I was thinking it would go something like this:
Hi @Erol: My code that does this was merged in via #23 (@rogeriochaves) -- he added tests to my implementation. The syntax of this implementation (below) is not optimal and very un-Rubyish. Block-like syntax would be far better -- I think I didn't do it just because this implementation was easier and I was in a rush. There'd need to be some refactoring to allow a Yomu instance to process more than one file.
I only discovered this through this issue, and it made my processing about 100x faster! Perhaps worth putting a hint in the README?
Is there any way to make this work just for metadata? I can get Yomu to read in many files, but pulling out the metadata is slow going. Thanks.
should work
Thanks for the response, really appreciate it. I'll give that a go :)
Update: Yeah, that did improve the speed noticeably, many many thanks :) Now I just need to figure out why it's so slow running through the "INFO Document is encrypted" lines and I'm set :)
Yomu is great. I'm currently using it to process thousands of documents. Unfortunately, this is very slow, because, right now, Yomu starts the JVM for each document. This takes about 2 seconds per document -- which significantly slows me down.
Tika has thought of this and included "server" mode, where Tika starts as a server and processes whatever documents are thrown at it over a socket. Starting Java in server mode takes a little longer, but only has to happen once.
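To make the start-once, send-many idea concrete, here is a minimal sketch of the pattern. It uses a toy TCP server written in Ruby as a stand-in; it is not Tika and not Yomu's code, and the helper names (`start_toy_server`, `extract_via_socket`, `demo_toy_server`) are made up for illustration. The "extraction" is just upcasing the input.

```ruby
require "socket"

# Stand-in for a long-running extraction server (NOT Tika itself).
# It accepts one connection per document, reads until the client
# half-closes the socket, and writes back a placeholder "result".
def start_toy_server
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  thread = Thread.new do
    begin
      loop do
        client = server.accept
        payload = client.read        # returns once the client calls close_write
        client.write(payload.upcase) # placeholder for real text extraction
        client.close
      end
    rescue IOError, Errno::EBADF
      # the server socket was closed; let the thread finish quietly
    end
  end
  [server, thread]
end

# Send one document to the already-running server and return its reply.
def extract_via_socket(port, data)
  TCPSocket.open("127.0.0.1", port) do |socket|
    socket.write(data)
    socket.close_write # signal end of document
    socket.read
  end
end

# Start the server once, push every document through it, then shut down.
def demo_toy_server(documents)
  server, thread = start_toy_server # start-up cost is paid once
  port = server.addr[1]
  documents.map { |doc| extract_via_socket(port, doc) }
ensure
  server&.close
  thread&.kill
end

puts demo_toy_server(["first document", "second document"]).inspect
```

The point of the sketch is only the shape: the expensive start-up happens once, and each document costs one short-lived socket connection.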
I've modified Yomu to support server mode. The API is the same, but if you want server mode, put this
before your code and
after it.
For processing even only 6 documents, the speed-up is noticeable: 12ish seconds with the current version of Yomu and 4ish with my server version.
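A rough cost model shows why the gap widens with more documents. Only the figure of about 2 seconds of JVM start-up per document comes from the numbers above. The split of the server time into a 3,400 ms one-off start-up and 100 ms per request is an assumption, picked so that 6 documents land at roughly the 4 seconds reported; it is not a measurement.

```ruby
# Rough cost model in milliseconds.
# jvm_start_ms: about 2 s per document, taken from the post.
def per_document_total_ms(count, jvm_start_ms: 2_000)
  count * jvm_start_ms
end

# server_start_ms and request_ms are ASSUMED values for illustration only.
def server_total_ms(count, server_start_ms: 3_400, request_ms: 100)
  server_start_ms + count * request_ms
end

[6, 100, 1_000].each do |count|
  puts format("%5d docs: %9d ms per-process, %9d ms server mode",
              count, per_document_total_ms(count), server_total_ms(count))
end
```

Under these assumptions the per-process cost grows by the full JVM start-up for every document, while server mode only adds the per-request cost.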
In order to preserve the API as-is (tests pass on my branch with no changes), my method isn't terribly elegant (e.g. it uses class variables) and requires the target extraction type (text/html/metadata) to be selected when the server is initialized (this is a Tika constraint). A more elegant and Rubyish way would be to do all the server-based extraction in a block. But this would require changing the API.
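For illustration, the block-based approach could look something like the sketch below. Everything in it is hypothetical: `ExtractionSession` and its methods do not exist in Yomu, and the server start and stop are simulated with a flag. It only shows the shape, where the extraction type is fixed when the session opens and `ensure` guarantees the server is stopped even if the block raises.

```ruby
# Hypothetical stand-in for a server-backed extractor; none of these names
# exist in Yomu. The extraction type is fixed when the session starts,
# mirroring the constraint described above.
class ExtractionSession
  attr_reader :type

  # Start a session, yield it to the block, and always stop it afterwards.
  def self.open(type)
    session = new(type)
    session.start
    yield session
  ensure
    session&.stop # runs even if the block raises
  end

  def initialize(type)
    @type = type
    @running = false
  end

  def start
    # A real implementation would spawn the Tika server process here.
    @running = true
  end

  def stop
    # A real implementation would kill the server process here.
    @running = false
  end

  def running?
    @running
  end

  def extract(path)
    raise "server not running" unless running?
    "#{type} of #{path}" # placeholder result, no real extraction happens
  end
end

results = ExtractionSession.open(:text) do |session|
  %w[a.pdf b.docx].map { |path| session.extract(path) }
end
puts results.inspect
```

Because the session object only lives inside the block, this shape would also avoid the class variables, at the cost of the API change mentioned above.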
If you'd be amenable to this as a patch, @yomu, I'll write tests, edit the docs and submit a PR. I'm happy, too, to submit as is or with the block-based method I mention above, based on what you think is best for the library. Until then, my version is at https://github.com/jeremybmerrill/yomu/tree/feature/servermode