Writing CASes to a zip archive #135

daxenberger · 2015-06-09T14:11:16Z

Originally reported on Google Code with ID 135

DKPro Core 1.6.1 will support writing to ZIP archives using, e.g., BinaryCasWriter. We
should make use of this feature:

[PreprocessingTask]

AnalysisEngineDescription writer = createEngineDescription(BinaryCasWriter.class,
BinaryCasWriter.PARAM_TARGET_LOCATION, "jar:file:" + root + "/archive.zip", 
BinaryCasWriter.PARAM_TYPE_SYSTEM_LOCATION, root + "/typesystem.bin",
BinaryCasWriter.PARAM_FORMAT, "6");

and likewise for the Meta- and FeatureExtractionTasks.

One problem remains: I am not sure whether this makes sense for the BatchTaskCrossValidation,
where we (currently) need to split the overall set of files into several folds (file
sets) that need to be retrieved individually in each fold.

Reported by daxenberger.j on 2014-05-28 12:41:02


daxenberger · 2015-06-09T14:11:16Z

"root" points to the path on the file system. Unless you have a strong reason to store
the type system outside the ZIP, I suggest you remove the "root" from PARAM_TYPE_SYSTEM_LOCATION
and just set it to "typesystem.bin" (no slash). Relative type system locations are
placed inside the ZIP - absolute locations are placed directly on the file system.

Reported by richard.eckart on 2014-05-28 12:42:45

daxenberger · 2015-06-09T14:11:17Z

Thanks for the hint. I don't see a reason to store the typesystem outside the ZIP, so
the location should be relative.

Reported by daxenberger.j on 2014-05-28 12:47:58

daxenberger · 2015-06-09T14:11:17Z

Reported by daxenberger.j on 2014-06-04 16:09:40

Labels added: Milestone-Release0.7.0

daxenberger · 2015-06-09T14:11:18Z

I wonder, didn't we plan to do this in 0.6.0?

Reported by richard.eckart on 2014-06-25 15:04:57

daxenberger · 2015-06-09T14:11:19Z

Because of the problem mentioned in the first post: I'm not sure how to integrate this
with the current Crossvalidation BatchTask.

Reported by daxenberger.j on 2014-06-25 15:09:46

daxenberger · 2015-06-09T14:11:19Z

Ah, I see. It shouldn't be a big problem but it is probably too much for the 0.6.0 release.


The basic principle should remain the same. We'd just need some extra code to extract
the file names for the folds from the ZIP instead of scanning them from the file system.

Reported by richard.eckart on 2014-06-25 15:11:57
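
The idea of taking the fold file names from the ZIP instead of scanning the file system could be sketched roughly as follows. This is only an illustration using the JDK's `java.util.zip` API, not the code used by TC or DKPro Core. The class `ZipFolds`, its method names, the round-robin split, and the `exclude` parameter are assumptions made for this example. The `exclude` parameter is there because a type system file stored inside the same archive under a `.bin` name would otherwise match the suffix as well.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of TC or DKPro Core.
final class ZipFolds {

    // Returns the sorted names of all non-directory entries ending in `suffix`,
    // skipping the entry named `exclude` (e.g. a type system file in the archive).
    static List<String> listEntries(Path zip, String suffix, String exclude)
            throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            return zf.stream()
                    .filter(e -> !e.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(n -> n.endsWith(suffix) && !n.equals(exclude))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Distributes the entry names round-robin over `numFolds` folds.
    static List<List<String>> split(List<String> names, int numFolds) {
        if (numFolds < 1) {
            throw new IllegalArgumentException("numFolds must be >= 1");
        }
        List<List<String>> folds = new ArrayList<>();
        for (int i = 0; i < numFolds; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < names.size(); i++) {
            folds.get(i % numFolds).add(names.get(i));
        }
        return folds;
    }
}
```

Each fold is then a plain list of entry names. How those names would be passed to a reader (for example as include patterns) depends on the reader and is left open here.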

daxenberger · 2015-06-09T14:11:20Z

Reported by daxenberger.j on 2015-01-06 11:40:17

Labels added: Milestone-Release0.8.0
Labels removed: Milestone-Release0.7.0

Horsmann · 2016-04-30T18:19:37Z

@daxenberger this one can be closed as won't fix now, right?

daxenberger · 2016-05-02T09:36:20Z

This is independent of the latest changes to CV mode. The idea here was to write all CASes into a zip archive rather than individual files.

Or why did you think it is obsolete?

Horsmann · 2016-05-05T13:40:51Z

Oh OK, I misunderstood it then. Sorry.

Horsmann · 2018-02-09T22:40:22Z

@reckart Is this feature available now? What exactly is the benefit of writing a single .zip instead of N bin-CAS files? Neither is human-readable, but naming the bin-CAS files by document name allows some visual confirmation that the reader read what it was supposed to read. It helps to understand at least a little of what TC is doing. Unless this makes processing a lot faster, I would rather not have zips.

reckart · 2018-02-09T22:46:33Z

Should be available.

reckart · 2018-02-09T22:48:12Z

I don't remember the rationale. It might have been to avoid using subfolders in an execution context... or to reduce the number of files, which can at times become very large... maybe @daxenberger remembers more.

daxenberger · 2018-02-13T06:30:42Z

This was certainly to reduce the number of files produced by TC, which can become quite large for larger datasets. The "visual confirmation" issue could be avoided by writing some sort of log file that records the names of the files written to the archive.
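
One way to picture such a log file: after the archive has been written, list its entries and write one name per line to a plain-text manifest next to it. The sketch below uses only the JDK. The class name `ArchiveManifest` and the idea of generating the manifest after the fact are assumptions for illustration, not something TC or DKPro Core provides.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of TC or DKPro Core.
final class ArchiveManifest {

    // Writes the sorted entry names of `zip` to `manifest`, one per line,
    // and returns the names that were written.
    static List<String> write(Path zip, Path manifest) throws IOException {
        List<String> names;
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            names = zf.stream()
                    .map(ZipEntry::getName)
                    .sorted()
                    .collect(Collectors.toList());
        }
        Files.write(manifest, names, StandardCharsets.UTF_8);
        return names;
    }
}
```

The resulting text file can be opened in any editor to check which documents ended up in the archive.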

Horsmann · 2018-02-16T10:03:43Z

@reckart Do you have a code-example that writes to .zip?

reckart · 2018-02-16T15:15:56Z

There are examples in these unit tests: https://github.com/dkpro/dkpro-core/blob/57dc82892d1bb419158eff37119dfaaca0763d8b/dkpro-core-api-io-asl/src/test/java/de/tudarmstadt/ukp/dkpro/core/api/io/JCasFileWriter_ImplBaseTest.java

reckart · 2018-02-16T15:16:25Z

Actually, it's even in the documentation: https://dkpro.github.io/dkpro-core/releases/1.9.0/docs/user-guide.html#_working_with_zip_archives

Horsmann · 2018-02-17T19:06:41Z

Hm, when adapting this for the BinaryCasWriter and BinaryCasReader I get a "Not in GZIP format" exception.

writing:
        AnalysisEngineDescription xmiWriter = createEngineDescription(BinaryCasWriter.class,
                BinaryCasWriter.PARAM_TARGET_LOCATION,
                "jar:file:" + aContext.getFolder(output, AccessMode.READWRITE).getPath() + "/data.gz",
                BinaryCasWriter.PARAM_FORMAT, "6+"
                );

reading:
        createReaderDescription(BinaryCasReader.class,
                BinaryCasReader.PARAM_SOURCE_LOCATION,
                root.getAbsolutePath() + "/data.gz!*.bin");

reckart · 2018-02-19T10:38:11Z

Looks like during reading, you are missing the jar:file: prefix.

reckart · 2018-02-19T10:39:37Z

... and mind that these are "zip" files, not "gz" files.
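
Putting the two remarks together, the location strings for writing and reading might be built along these lines. This is a sketch under assumptions: `ZipLocations` and its methods are hypothetical helpers, and the `archive!pattern` form simply mirrors the snippet above and should be checked against the DKPro Core user guide linked earlier. The resulting strings would be passed as `PARAM_TARGET_LOCATION` and `PARAM_SOURCE_LOCATION`.

```java
// Hypothetical helper, not part of DKPro Core.
final class ZipLocations {

    private static final String PREFIX = "jar:file:";

    // Location for the writer: the jar:file: prefix plus the path to a .zip archive.
    static String target(String folder, String archive) {
        requireZip(archive);
        return PREFIX + folder + "/" + archive;
    }

    // Location for the reader: same prefix and archive, then "!" and a pattern
    // selecting entries inside the archive, e.g. "*.bin".
    static String source(String folder, String archive, String pattern) {
        return target(folder, archive) + "!" + pattern;
    }

    // The archives are ZIP files, so reject names such as "data.gz" early.
    private static void requireZip(String archive) {
        if (!archive.endsWith(".zip")) {
            throw new IllegalArgumentException("Expected a .zip archive: " + archive);
        }
    }
}
```

Using the same helper for both sides keeps the prefix and the archive name consistent between writer and reader, which is where the snippet above went wrong.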

daxenberger assigned dkpro Jun 9, 2015

daxenberger added Type-Enhancement Priority-Low labels Jun 9, 2015

reckart modified the milestone: 0.8.0 Aug 8, 2015

reckart removed the Milestone-Release0.8.0 label Aug 8, 2015

reckart added enhancement and removed Type-Enhancement labels Sep 6, 2015

Horsmann modified the milestones: 0.9.0, 0.8.0 Mar 26, 2016

Horsmann modified the milestones: 1.0.0, 0.9.0 Oct 19, 2016

Horsmann unassigned dkpro Oct 19, 2016

Horsmann modified the milestones: 1.0.0, 1.1.0 Apr 11, 2018
