Speed up bulk processing with Tika #18

jeremybmerrill · 2014-08-15T21:15:19Z

Yomu is great. I'm currently using it to process thousands of documents. Unfortunately, this is very slow because Yomu currently starts a new JVM for each document. That takes about 2 seconds per document, which slows me down significantly.

Tika has thought of this and included "server" mode, where Tika starts as a server and processes whatever documents are thrown at it over a socket. Starting Java in server mode takes a little longer, but only has to happen once.

I've modified Yomu to support server mode. The API is the same, but if you want server mode, put this

Yomu.server(:text)

before your code and

Yomu.kill_server!

after it.

For processing even only 6 documents, the speed-up is noticeable: 12ish seconds with the current version of Yomu and 4ish with my server version.

In order to preserve the API as-is (tests pass on my branch with no changes), my method isn't terribly elegant (e.g. class variables) and requires the target extraction type (text/html/metadata) to be selected when the server is started (this is a Tika constraint). A more elegant and Rubyish way would be to do all the server-based extraction in a block. But this would require changing the API.

If you'd be amenable to this as a patch, @yomu, I'll write tests, edit the docs and submit a PR. I'm happy, too, to submit as is or with the block-based method I mention above, based on what you think is best for the library. Until then, my version is at https://github.com/jeremybmerrill/yomu/tree/feature/servermode


Erol · 2014-12-20T03:01:43Z

@jeremybmerrill My apologies for just responding now. Thanks for your work on this!

I like your idea of having a server mode for Yomu and I'm open to changing the API. If you're still up for it, may I know what changes you have in mind and the syntax for wrapping it in a block? I was thinking it would go something like this:

Yomu.start :text do |yomu|
  yomu.read 'path/to/file'
  yomu.read 'path/of/another/file'
end
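For discussion, here is a minimal sketch of how a block form like this might be layered on top of the existing `Yomu.server` / `Yomu.kill_server!` calls. Everything named here is an assumption: `ServerSession`, its `start` and `read` methods, and `StubYomu` are hypothetical and not part of Yomu. The stub stands in for the real class so the sketch runs without Java or Tika; in practice `Yomu` itself would be passed as the backend.

```ruby
# Hypothetical sketch (not Yomu's actual API): a session object that reads
# several files against one running server.
class ServerSession
  # Starts the server, yields a session, and always stops the server
  # afterwards, even if the block raises.
  def self.start(backend, type)
    backend.server(type)
    begin
      yield new(backend, type)
    ensure
      backend.kill_server!
    end
  end

  def initialize(backend, type)
    @backend = backend
    @type = type
  end

  # Extracts the session's type (:text, :html or :metadata) from one file.
  def read(path)
    @backend.new(path).public_send(@type)
  end
end

# Stand-in for Yomu so the sketch runs without Java or Tika installed.
class StubYomu
  def self.server(type); end
  def self.kill_server!; end

  def initialize(path)
    @path = path
  end

  # Returns a fake extraction result; real Yomu would ask the Tika server.
  def text
    "text of #{@path}"
  end
end

texts = ServerSession.start(StubYomu, :text) do |yomu|
  [yomu.read('path/to/file'), yomu.read('path/of/another/file')]
end
```

The `ensure` clause is the main design point: the server is shut down even when one of the reads fails, which the explicit start/kill pair does not guarantee on its own.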

jeremybmerrill · 2014-12-23T16:42:59Z

Hi @Erol:

My code that does this was merged in via #23 (@rogeriochaves) -- he added tests to my implementation.

The syntax of this implementation (below) is not optimal and very un-Rubyish. Block-like syntax would be far better -- I think I didn't do it just because this implementation was easier and I was in a rush. There'd need to be some refactoring to allow a Yomu instance to process more than one file.

Yomu.server(:text)
Yomu.new(filename).text
Yomu.kill_server!

xavriley · 2015-07-08T15:17:52Z

Just discovered this but only from this issue - made my processing about 100x faster! Worth putting a hint in the README perhaps?

hatlord · 2016-03-02T16:18:40Z

Any way to make this work just for metadata? I can get Yomu to read in many files, but it is slow going to pull out the metadata.

Thanks,

jeremybmerrill · 2016-03-02T22:56:58Z

Yomu.server(:metadata)
Yomu.new(filename1).metadata
Yomu.new(filename2).metadata
Yomu.new(filename3).metadata
Yomu.kill_server!

should work
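For a longer list of files, the same calls can be wrapped in a loop. The following is only a sketch under assumptions: `metadata_for` and `MetadataStub` are hypothetical names, and the stub returns made-up metadata so the example runs without Java. With the real library you would pass `Yomu` as the backend.

```ruby
# Hypothetical helper (not part of Yomu): collect metadata for many files
# while a single server is running. `backend` is expected to behave like
# Yomu: respond to server/kill_server! and build objects with #metadata.
def metadata_for(backend, filenames)
  backend.server(:metadata)
  begin
    filenames.each_with_object({}) do |name, results|
      results[name] = backend.new(name).metadata
    end
  ensure
    # Stop the server even if one of the files raises.
    backend.kill_server!
  end
end

# Stand-in for Yomu that records server calls and returns fake metadata.
class MetadataStub
  @events = []

  class << self
    attr_reader :events

    def server(type)
      @events << [:server, type]
    end

    def kill_server!
      @events << [:kill_server!]
    end
  end

  def initialize(filename)
    @filename = filename
  end

  # Made-up metadata for illustration only.
  def metadata
    { "stub_source" => @filename }
  end
end

results = metadata_for(MetadataStub, ["a.pdf", "b.docx"])
```

The result is a hash keyed by filename, and the server is started and stopped exactly once for the whole batch.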

hatlord · 2016-03-03T09:26:26Z

Thanks for the response, really appreciate it. I'll give that a go :)

hatlord · 2016-03-03T09:56:59Z

Update: Yeah that did improve the speed noticeably, many many thanks :)

Now I just need to figure out why it's so slow running through "INFO Document is encrypted" lines and I'm set :)

jeremybmerrill mentioned this issue Aug 21, 2014

adds server mode #19

Closed

rogeriochaves mentioned this issue Nov 10, 2014

adds server mode with tests #21

Closed

rogeriochaves mentioned this issue Dec 19, 2014

Added server mode with tests #23

Merged
