-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip newlines when parsing TextDict
s to avoid OverflowError
#540
base: main
Are you sure you want to change the base?
Changes from 1 commit
d0cf85e
f25fb80
0721c94
f116329
323eeaa
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -383,3 +383,24 @@ def update_nested_dict(dict1: Dict[str, Any], dict2: Dict[str, Any]): | |
else: | ||
dict1[k] = v | ||
return dict1 | ||
|
||
|
||
def parse_text_dict(path: Union[str, tk.Path]) -> Dict[str, str]: | ||
""" | ||
Loads the text dict at :param:`path` making sure not to trigger line counter overflow. | ||
""" | ||
|
||
with uopen(path, "rt") as text_dict_file: | ||
txt = text_dict_file.read() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder a bit: as I understand it, your problem is that the dict is really huge (how huge? how many entries? how big is the file?). But then, isn't it a problem to load it all into memory? Also via the following code ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is my understanding that the memory itself is not an issue, but rather that the file in my case is about 280k lines long, which is where python's line counter overflows for some reason. This is why I'm not worried about memory consumption itself in this module, but rather just the way the string is parsed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the Python line counter? There are no "lines" in a dict? Or you mean for Python source code? But anyway, you should add this explanation to the code, so that you can understand why we have this code there. |
||
|
||
# remove leading and trailing dict brackets | ||
txt = txt.strip().strip("{}").strip() | ||
NeoLegends marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
lines = txt.splitlines() | ||
result = { | ||
k: v | ||
# parse chunkwise to avoid line counter overflow when the text dict is very large | ||
for chunk in chunks(lines, max(1, len(lines) // 1000)) | ||
for k, v in eval("\n".join(["{", *chunk, "}"]), {"nan": float("nan"), "inf": float("inf")}).items() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This code does not work in all cases. E.g. consider a dict formatted as this: {
"a":
42
} (Yea, this probably will not happen when it was written with some of our tools, but who knows, maybe it comes from elsewhere, and/or there was some There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note, in RETURNN, we use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, I'll add one.
That implementation contains a native C++ module, I don't think this should be made part of Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get go for this? All this custom code would never have had to be written and the issue here would likely have never occured in the first place with a proper data exchange format. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Well, you could argue, the Python stdlib is full of native code... I don't really see being "native" a problem here. The main problem is maybe that then RETURNN becomes a dependency for those specific jobs. That's maybe unwanted. But I'm not sure if it's really so much a problem. But then, maybe we can also use this code here for now and keep that in mind. (Maybe add it in a comment, that this is an alternative.)
My assumption was, it's faster and simpler. Those assumptions turned out to be wrong though, so yes, JSON would have been better. Python literals are also more flexible and generic though, but in many cases (like here), this is not really needed, and/or you can work around that. But JSON is really restricted. I think it also does not support inf/nan, so it doesn't properly work for N-best lists (rarely, but sometimes we get inf in the score; it would be annoying if it then rarely crashes, and/or you need a stupid workaround just to clip it to 1e30 or so). |
||
} | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand this explanation?
OverflowError: line number table is too long