Seeking prevents the use of streams with parse_incr #100

Open
kylebgorman opened this issue Oct 12, 2024 · 4 comments

@kylebgorman
Contributor

kylebgorman commented Oct 12, 2024

One of the contexts I could easily imagine using parse_incr is on stdin or a process substitution-style file descriptor. E.g.:

xzcat huge_data.conllu.xz | ./application_that_uses_parse_incr.py ...
./application_that_uses_parse_incr.py <(zcat big_data.conllu.gz)

What these two types of input have in common is that both are non-seekable streams, and Python raises an error if you attempt to seek on them, which parse_incr does indirectly here. I don't have the full context, but I think it's because it reads the sentence before it reads the metadata for whatever reason.

This bit us in here. We will probably just move off of conllu and use our own custom solution which doesn't rewind, but my colleague thought it worth reporting to you here in case it's avoidable.
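To illustrate the failure mode described above, here is a minimal sketch using only the standard library. It uses an OS pipe as a stand-in for stdin or a process substitution (an assumption for illustration; the helper name probe_pipe is hypothetical and not part of conllu). The point is only that a text stream over a pipe reports itself as non-seekable and raises on seek, while sequential reads still work.

```python
import io
import os


def probe_pipe(payload: str):
    """Write payload into a pipe, then try to seek on the read end.

    Returns (seekable flag, exception raised by seek or None, lines read).
    """
    read_fd, write_fd = os.pipe()
    # Small payloads fit in the pipe buffer, so writing first does not block.
    with os.fdopen(write_fd, "w") as writer:
        writer.write(payload)

    with os.fdopen(read_fd, "r") as reader:
        can_seek = reader.seekable()
        try:
            reader.seek(0)  # what a "rewind to the start" would attempt
            seek_error = None
        except io.UnsupportedOperation as exc:
            seek_error = exc
        # Reading forward is unaffected by the failed seek.
        lines = reader.readlines()
    return can_seek, seek_error, lines


can_seek, seek_error, lines = probe_pipe("# text = Hi\n1\tHi\n\n")
print(can_seek, type(seek_error).__name__, len(lines))
```

Any parser that only iterates forward over its input avoids this problem entirely.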

@EmilStenstrom
Owner

@kylebgorman I'm happy to discuss other solutions to this.

The reason it seeks (once, on the first line) is to find the global.columns comment that is included in CoNLL-U Plus. That sets the definition of which columns the current file has. For your use-case, what would be a better way to handle that?

@kylebgorman
Contributor Author

I wasn't familiar with that format so I just looked it up. Admitting I don't really have the full context and this might be wildly ignorant, I don't see how that requires backtracking. Without global columns the algorithm is something like:

for line in source:
    if is_metadata(line):
        handle_metadata(line, metadata)
    elif is_token(line):
        handle_token(line, tokens)
    elif is_blank(line):
        yield TokenList(...)
I'd just add another clause (with highest priority) to handle the case where something looks like metadata but is in fact global.columns, and do whatever you prefer if you start handling tokens before global.columns has been set.
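The single-pass idea above can be sketched roughly as follows. This is not conllu's implementation: the function name parse_stream, the DEFAULT_FIELDS fallback, and the plain dict/tuple return types are all assumptions made for illustration. The only point is that the global.columns check is an ordinary first clause in the loop, so the input is never rewound.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Assumed fallback: the ten standard CoNLL-U columns, used when no
# global.columns comment has been seen before the first token line.
DEFAULT_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                  "feats", "head", "deprel", "deps", "misc")

Sentence = Tuple[Dict[str, str], List[Dict[str, str]]]


def parse_stream(source: Iterable[str]) -> Iterator[Sentence]:
    """Yield (metadata, tokens) per sentence in one forward pass."""
    fields: Optional[Tuple[str, ...]] = None
    metadata: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for raw in source:
        line = raw.rstrip("\r\n")
        if line.startswith("# global.columns ="):
            # Highest-priority clause: looks like metadata, but defines columns.
            names = line.split("=", 1)[1].split()
            fields = tuple(name.lower() for name in names)
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            if fields is None:
                # Tokens arrived before any global.columns: use the fallback.
                fields = DEFAULT_FIELDS
            tokens.append(dict(zip(fields, line.split("\t"))))
        elif tokens:
            # Blank line closes the current sentence.
            yield metadata, tokens
            metadata, tokens = {}, []
    if tokens:
        # Input ended without a trailing blank line.
        yield metadata, tokens


sample = (
    "# global.columns = ID FORM UPOS\n"
    "# text = Hi there\n"
    "1\tHi\tINTJ\n"
    "2\tthere\tADV\n"
    "\n"
)
for meta, toks in parse_stream(iter(sample.splitlines(keepends=True))):
    print(meta, toks)
```

Because it only iterates, the same function accepts sys.stdin or any other non-seekable line iterator.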

@EmilStenstrom
Owner

@kylebgorman I think I figured it out. Please try the latest version of conllu. It's seek free: https://pypi.org/project/conllu/6.0.0/

(I did a major version bump because I removed a function from the public API.)

@kylebgorman
Contributor Author

That works very well, thanks!
