Strip newlines when parsing TextDicts to avoid OverflowError #540

Open · wants to merge 5 commits into base: main

Changes from 1 commit
26 changes: 8 additions & 18 deletions returnn/search.py

```diff
@@ -348,8 +348,7 @@ def tasks(self):
         yield Task("run", mini_task=True)

     def run(self):
-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_word_search_results.get_path())
         with util.uopen(self.out_word_search_results, "wt") as out:
             out.write("{\n")
@@ -400,8 +399,7 @@ def tasks(self):
         yield Task("run", mini_task=True)

     def run(self):
-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_search_results.get_path())

         def _transform_text(s: str):
@@ -446,8 +444,7 @@ def tasks(self):
     def run(self):
         corpus = Corpus()
         corpus.load(self.bliss_corpus.get_path())
-        d = eval(util.uopen(self.recog_words_file.get_path(), "rt").read())
-        assert isinstance(d, dict), "only search output file with dict format is supported"
+        d = util.parse_text_dict(self.recog_words_file)
         with util.uopen(self.out_ctm_file.get_path(), "wt") as out:
             out.write(";; <name> <track> <start> <duration> <word> <confidence> [<n-best>]\n")
             for seg in corpus.segments():
@@ -531,10 +528,7 @@ def tasks(self):
         yield Task("run", mini_task=True)

     def run(self):
-        # nan/inf should not be needed, but avoids errors at this point and will print an error below,
-        # that we don't expect an N-best list here.
-        d = eval(util.uopen(self.recog_words_file, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict), "only search output file with dict format is supported"
+        d = util.parse_text_dict(self.recog_words_file)
         if self.seq_order_file is not None:
             seq_order = eval(util.uopen(self.seq_order_file, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
             assert isinstance(seq_order, (dict, list, tuple))
@@ -647,8 +641,7 @@ def tasks(self):

     def run(self):
         """run"""
-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_best_search_results.get_path())
         with util.uopen(self.out_best_search_results, "wt") as out:
             out.write("{\n")
@@ -686,8 +679,7 @@ def tasks(self):

     def run(self):
         """run"""
-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_search_results.get_path())
         with util.uopen(self.out_search_results, "wt") as out:
             out.write("{\n")
@@ -727,8 +719,7 @@ def tasks(self):

     def run(self):
         """run"""
-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_search_results.get_path())
         with util.uopen(self.out_search_results, "wt") as out:
             out.write("{\n")
@@ -786,8 +777,7 @@ def logsumexp(*args):
             lsp = numpy.log(sum(numpy.exp(a - a_max) for a in args))
             return a_max + lsp

-        d = eval(util.uopen(self.search_py_output, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> bpe string
+        d = util.parse_text_dict(self.search_py_output)
         assert not os.path.exists(self.out_search_results.get_path())
         with util.uopen(self.out_search_results, "wt") as out:
             out.write("{\n")
```
10 changes: 4 additions & 6 deletions text/convert.py

```diff
@@ -3,10 +3,10 @@
     "TextDictToStmJob",
 ]

-from typing import Optional, Union, Sequence, Dict, List, Tuple
+from typing import Union, Sequence, Dict, Tuple
 import re
 from sisyphus import Job, Path, Task
-from i6_core.util import uopen
+from i6_core.util import parse_text_dict, uopen


 class TextDictToTextLinesJob(Job):
@@ -30,8 +30,7 @@ def tasks(self):
     def run(self):
         # nan/inf should not be needed, but avoids errors at this point and will print an error below,
         # that we don't expect an N-best list here.
-        d = eval(uopen(self.text_dict, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(d, dict)  # seq_tag -> text
+        d = parse_text_dict(self.text_dict)

         with uopen(self.out_text_lines, "wt") as out:
             for seq_tag, entry in d.items():
@@ -83,8 +82,7 @@ def tasks(self):
     def run(self):
         # nan/inf should not be needed, but avoids errors at this point and will print an error below,
         # that we don't expect an N-best list here.
-        c = eval(uopen(self.text_dict, "rt").read(), {"nan": float("nan"), "inf": float("inf")})
-        assert isinstance(c, dict)
+        c = parse_text_dict(self.text_dict)

         all_tags = [
             ("d%d" % i, "default%d" % i, "all other segments of category %d" % i)
```
21 changes: 21 additions & 0 deletions util.py

```diff
@@ -383,3 +383,24 @@ def update_nested_dict(dict1: Dict[str, Any], dict2: Dict[str, Any]):
         else:
             dict1[k] = v
     return dict1
+
+
+def parse_text_dict(path: Union[str, tk.Path]) -> Dict[str, str]:
+    """
+    Loads the text dict at :param:`path` making sure not to trigger line counter overflow.
```

Member: Can you expand this explanation?

  • Put a ref to the issue here: "Parsing a large TextDict fails on recent python versions" (#539).
  • Also quote the exception you got: OverflowError: line number table is too long.
  • Explain the exception. What does it mean? Why does this function fix it? I don't really understand what the exception means. My guess would be that the dict becomes too big, but then this function would have the same problem, because, as I understand it, you intend to create just the same dict, only in a different way?

```diff
+    """
+
+    with uopen(path, "rt") as text_dict_file:
+        txt = text_dict_file.read()
```

Member: I wonder a bit: as I understand it, your problem is that the dict is really huge (how huge? how many entries? how big is the file?). But then, isn't it a problem to load it all into memory? Also, via the following code (strip, splitlines, etc.), you are even creating multiple copies of the data in memory.

Contributor (author): It is my understanding that memory itself is not the issue, but rather that the file in my case is about 280k lines long, which is where Python's line counter overflows for some reason. This is why I'm not worried about memory consumption in this module, but rather just about the way the string is parsed.

Member: What is the Python line counter? There are no "lines" in a dict? Or do you mean for Python source code? But anyway, you should add this explanation to the code, so that readers can understand why this code is there.


```diff
+
+    # remove leading and trailing dict brackets
+    txt = txt.strip().strip("{}").strip()
+
```

NeoLegends marked this conversation as resolved.

```diff
+    lines = txt.splitlines()
+    result = {
+        k: v
+        # parse chunkwise to avoid line counter overflow when the text dict is very large
+        for chunk in chunks(lines, max(1, len(lines) // 1000))
+        for k, v in eval("\n".join(["{", *chunk, "}"]), {"nan": float("nan"), "inf": float("inf")}).items()
```

Member: This code does not work in all cases. E.g. consider a dict formatted like this:

```python
{
"a":
42
}
```

(Yea, this probably will not happen when it was written with some of our tools, but who knows, maybe it comes from elsewhere, and/or some black was run on the generated output, or whatever.) This should at least be discussed in a code comment.
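
A minimal sketch of that failure mode with the chunked parser (hypothetical, not from the PR): if an entry spans several physical lines, a chunk boundary can split it, and neither piece is a valid dict body on its own:

```python
# The entry '"a": 42' was formatted across two lines and got cut at a chunk
# boundary, so this chunk contains a key without a value.
chunk = ["{", '"a":', "}"]
try:
    eval("\n".join(chunk))
except SyntaxError as exc:
    print("SyntaxError:", exc)
```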


Member (@albertz, Sep 3, 2024): Note, in RETURNN we use returnn.util.literal_py_to_pickle.literal_eval, which would solve this problem in another way, and which should be faster and more generic.
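
For reference, that alternative would look roughly like this (a sketch only; the module path is taken from the comment above, while the exact call signature is an assumption):

```python
# Assumed API: literal_eval() takes the literal as a string and returns the
# parsed object. It parses via a native extension instead of compiling Python
# bytecode, so no line number table is ever built.
from returnn.util.literal_py_to_pickle import literal_eval

from i6_core.util import uopen

with uopen(path, "rt") as f:  # `path` as in parse_text_dict above
    d = literal_eval(f.read())
assert isinstance(d, dict)
```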


Contributor (author):

> This should at least be discussed in a code comment.

Yes, I'll add one.

> Note, in RETURNN we use returnn.util.literal_py_to_pickle.literal_eval, which would solve this problem in another way, and which should be faster and more generic.

That implementation contains a native C++ module; I don't think this should be made part of i6_core.

Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go? All this custom code would never have had to be written, and the issue here would likely never have occurred in the first place with a proper data exchange format.


Member:

> native C++ module

Well, you could argue the Python stdlib is full of native code... I don't really see being "native" as a problem here. The main problem is maybe that RETURNN then becomes a dependency for those specific jobs. That is maybe unwanted, but I'm not sure it's really that much of a problem. But then, maybe we can also use this code here for now and keep that in mind. (Maybe note in a comment that this is an alternative.)

> Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get-go?

My assumption was that it's faster and simpler. Those assumptions turned out to be wrong, so yes, JSON would have been better.

Python literals are also more flexible and generic, but in many cases (like here) that is not really needed, and/or you can work around it. JSON, on the other hand, is quite restricted: I think it also does not support inf/nan, so it doesn't properly work for N-best lists (rarely, but sometimes we get inf in the score; it would be annoying if it then rarely crashed, and/or you needed a stupid workaround just to clip it to 1e30 or so).
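
The inf/nan restriction is easy to check against Python's json module (an illustration, not part of the PR):

```python
import json

# The JSON spec has no representation for inf/nan. Python's json module emits
# the non-standard tokens Infinity/NaN by default, and refuses in strict mode:
print(json.dumps({"seq-1": float("inf")}))  # {"seq-1": Infinity}  (not valid JSON)
try:
    json.dumps({"seq-1": float("inf")}, allow_nan=False)
except ValueError as exc:
    print(exc)  # "Out of range float values are not JSON compliant"
```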

```diff
+    }
+    return result
```
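
For context, the comprehension above relies on a chunks helper from i6_core.util. A rough sketch of the behavior assumed by the call chunks(lines, max(1, len(lines) // 1000)) (the real helper may differ):

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def chunks(lst: List[T], n: int) -> Iterator[List[T]]:
    """Split lst into n contiguous, roughly equally sized parts."""
    size, rest = divmod(len(lst), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < rest else 0)
        yield lst[start:end]
        start = end
```

With a ~280k-line file this yields roughly 280 chunks of about 1000 lines each, so every eval() call only compiles a short snippet and stays well below the limit of the line number table.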