Strange behaviour XmlAssembler? #49
Comments
In general [...] In general my idea was that [...] This is the general idea. As [...] (I am running a bit out of time now. I will at least submit this comment and get back to this.)
Ok, I went through the test_xml_assembler.zip tests and the implementation of the [...]
I think I understand why this code is required: [...] I still see no reason for [...] Thanks to your excellent work on the unit tests we can also do any redesign/refactoring in a more controlled way!
So XmlAssembler should set [...] I've added some handy base code in Chain and a PacketBuffer filter. This should make it easier to inspect ETL results in unit tests. Unfortunately the commit for the XmlAssembler tests went in with issue #40: this was commit 16e9646. It shows an enhanced implementation of the XmlAssembler unit test (FileInput version, see cfg):
I agree that in this specific test case an XmlElementStreamer file input is actually a more appropriate source. I also agree that the FileInput itself could become more generic (by adding globbing and optional unzip capabilities). A single file, multiple files (glob), a dir, or one or more ZIP files should all be treated the same.
Basically, all format-specific filters (like XmlElementReader; note the "Reader" part in the name!) should move to inputs. The same is true for any possible outputs. We should not have a specific "Zipper" output which only zips data, since compressing is a generic operation. Note that it doesn't exist, but what about the HttpOutput? What "format" does it output? And what if we want to write OGR output to an HTTP stream? Too many questions, beyond the scope of this issue. We should distinguish between the file format (GML, JSON, Shape, PostGIS, whatever) and the way the data is delivered (file, dir, HTTP stream, zipped, or a combination of these).
However, regarding my original question, I can also imagine cases where a filter converts a single input package into multiple output packages. The input package could come from a Stetl input or a filter; that doesn't matter. If the input package has [...] The opposite is also true: when a filter combines multiple input packages into a single output package, and one of the input packages (the last one ideally!) has [...]
I agree that the extra call to self.next is likely part of this problem. I'll try to understand it better, as I'm not very familiar with the Stetl code yet. Writing unit tests is a very good way to become more familiar, because then you're forced to think about such things :)
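To make the one-to-many and many-to-one cases above a bit more concrete, here is a minimal sketch of one possible way to propagate the flags. It is only an illustration under assumptions: the `Packet` class, `split_packet` and `combine_packets` are hypothetical stand-ins written for this comment, not Stetl's actual classes or API, and the rule "only the last output packet inherits the flags" is one reading of the intended behaviour.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List


@dataclass
class Packet:
    """Hypothetical stand-in for a packet; NOT the real Stetl Packet class."""
    data: Any = None
    end_of_doc: bool = False
    end_of_stream: bool = False


def split_packet(packet: Packet, items: List[Any]) -> Iterator[Packet]:
    """One input packet -> many output packets.

    Assumption: only the last output packet inherits the input's flags,
    so downstream components do not see a premature end-of-stream.
    """
    if not items:
        # Nothing to emit: pass the flags along on an empty packet.
        yield Packet(None, packet.end_of_doc, packet.end_of_stream)
        return
    last = len(items) - 1
    for i, item in enumerate(items):
        is_last = i == last
        yield Packet(
            data=item,
            end_of_doc=packet.end_of_doc and is_last,
            end_of_stream=packet.end_of_stream and is_last,
        )


def combine_packets(packets: Iterable[Packet]) -> Packet:
    """Many input packets -> one output packet.

    Assumption: the output carries a flag if any input packet carried it.
    """
    out = Packet(data=[])
    for p in packets:
        out.data.append(p.data)
        out.end_of_doc = out.end_of_doc or p.end_of_doc
        out.end_of_stream = out.end_of_stream or p.end_of_stream
    return out


source = Packet(data="one.gml", end_of_stream=True)
parts = list(split_packet(source, ["a", "b", "c"]))
print([p.end_of_stream for p in parts])  # [False, False, True]
```

With this rule a downstream assembler still sees every element before it sees end-of-stream, which is the property the original question is about.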
As for the original use of the XmlElementReader followed by the XmlAssembler, this is the current TOP10NL workflow in NLExtract. The BGT and BRK are based on that as well. I'm giving priority to the current workflows.
On 08-08-16 14:11, Frank Steggink wrote:
I will open a separate issue for "generic file input". Some file inputs [...] Note: XmlElementStreamerFileInput and XmlAssembler were inspired from [...] But in general the current ETL use-cases, like NLExtract should be [...]
Splitting to multiple ETL streams is another common ETL-feature. There [...]
While writing unit tests for XmlAssembler, I ran into a couple of issues. At first I set up a chain reading only one GML file with three FeatureMember elements. In my config I wanted to write an etree doc for every two elements. I expected two documents in this case: one with two elements, and one with only one element (the last one). I was surprised that no doc was written (to stdout). Here is my config:
I suspected this check in XmlAssembler.consume_element:
if element is None or packet.is_end_of_stream() is True:
(Note that the `is True` is redundant, but that doesn't matter.)
It indeed turned out that packet.is_end_of_stream was true. I think this is already caused by the GlobFileInput, an input class I added only yesterday. It could be that I don't properly understand when is_end_of_stream should be set to true, but I'm wondering whether a filter which can return multiple packets based on one input packet (for example when an XML file is being parsed using XmlElementReader) should actually reset is_end_of_stream or is_end_of_doc.
When I skip this check, so that I'm only checking for `element is None`, a new XML document is generated for every XML element, so I was getting 3 documents instead of the expected 2.
When I read all GML files in my test data directory (currently 3 files), by setting file_path to tests/data/*.gml in input_glob_file, I get either 6 documents (while checking for `packet.is_end_of_stream()`) or 9 documents. With 3 files I'm actually expecting 6 documents (3 x 2), namely a doc with 2 elements followed by a doc with 1 element, three times. However, each document contains only one element, and only elements from the first 2 GML files. When disabling the aforementioned check I'm getting 9 docs, each with one element.
So, my question is how packet.is_end_of_stream and packet.is_end_of_doc should actually behave. Should they be reset when one input packet results in multiple output packets for the particular component? Or is there more to it?
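To pin down the behaviour I'm expecting, here is a small standalone sketch. It does not use Stetl at all: `assemble` and `read_files` are hypothetical helpers, and a packet is flattened to an `(element, end_of_doc, end_of_stream)` tuple purely for illustration. The assumption is that the element arriving together with an end flag is still added to the current batch before that batch is flushed.

```python
from typing import Any, Iterable, Iterator, List, Tuple

# Hypothetical flattened packet: (element, end_of_doc, end_of_stream).
Item = Tuple[Any, bool, bool]


def assemble(items: Iterable[Item], max_elements: int) -> Iterator[List[Any]]:
    """Group elements into documents of at most max_elements each.

    A document is flushed when it is full, or when the current item
    marks the end of a doc or the end of the stream.
    """
    batch: List[Any] = []
    for element, end_of_doc, end_of_stream in items:
        if element is not None:
            batch.append(element)  # keep the element that carries the flag
        if batch and (len(batch) >= max_elements or end_of_doc or end_of_stream):
            yield batch
            batch = []
    if batch:
        yield batch  # safety net if no end flag was ever seen


def read_files(files: List[List[Any]]) -> Iterator[Item]:
    """Simulate an input: end_of_doc on the last element of each file,
    end_of_stream only on the last element of the last file."""
    for f_idx, elements in enumerate(files):
        for e_idx, element in enumerate(elements):
            last_in_file = e_idx == len(elements) - 1
            last_file = f_idx == len(files) - 1
            yield element, last_in_file, last_in_file and last_file


files = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
docs = list(assemble(read_files(files), max_elements=2))
print([len(d) for d in docs])  # [2, 1, 2, 1, 2, 1]
```

With a single file of three elements the same function yields two documents of sizes 2 and 1, which is what I expected from the first test.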
I've attached my unit test file. The method test_execute is just a work-in-progress.
test_xml_assembler.zip