dill cannot serialize large model #37

Open
mattwigway opened this issue Jul 27, 2020 · 2 comments

Comments

@mattwigway
Owner

struct.error: 'I' format requires 0 <= number <= 4294967295

We shouldn't be using dill anyway. Rewrite the serialization to use numpy.savez_compressed.
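For context, this is the message `struct` produces when a value does not fit in a 4-byte unsigned field. The traceback isn't shown here, so where the overflow happens is an assumption, but a plausible cause is a length above 4 GiB being packed somewhere in the pickling path. A minimal sketch of the limit itself, independent of dill:

```python
import struct

# The largest value that fits in a 4-byte unsigned field packs fine.
packed = struct.pack("<I", 2**32 - 1)

# One more than that cannot be represented and raises struct.error.
try:
    struct.pack("<I", 2**32)
    overflow_error = None
except struct.error as exc:
    overflow_error = exc
```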

@mattwigway
Owner Author

I have a temporary fix for this that saves the model in two pieces: the big arrays using numpy.savez_compressed and everything else using dill. This is inelegant. Additionally, loading a malicious dill (or pickle) file can execute arbitrary code, and there's no reason to allow arbitrary code execution from a data file. The plan is to rewrite everything using savez_compressed, with three helpers: write_scalar, which wraps scalars in arrays; write_series, which saves a series as index and values separately; and write_dataframe, which writes the index, the columns, and each column separately.
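A rough sketch of what those three helpers might look like. The helper names come from the comment above, but the signatures, the `store` dict, the key naming scheme, and the `_put` guard are all assumptions for illustration, not the project's actual code. The guard rejects object-dtype arrays because numpy would otherwise fall back to pickling them.

```python
import io

import numpy as np
import pandas as pd


def _put(store, key, array):
    # Hypothetical guard: refuse object arrays so nothing gets pickled.
    array = np.asarray(array)
    if array.dtype == object:
        raise TypeError(f"{key}: object dtype would require pickle")
    store[key] = array


def write_scalar(store, name, value):
    # Wrap the scalar in a 0-d array so it can live in the archive.
    _put(store, name, np.asarray(value))


def write_series(store, name, series):
    # Index and values go in as two separate arrays.
    _put(store, f"{name}_index", series.index.to_numpy())
    _put(store, f"{name}_values", series.to_numpy())


def write_dataframe(store, name, df):
    # Index, column names, then one array per column (each keeps its dtype).
    _put(store, f"{name}_index", df.index.to_numpy())
    _put(store, f"{name}_columns", np.array([str(c) for c in df.columns]))
    for i, col in enumerate(df.columns):
        _put(store, f"{name}_col{i}", df[col].to_numpy())


# Usage with made-up data: collect arrays, then write one compressed archive.
store = {}
write_scalar(store, "alpha", 0.25)
write_series(store, "coefs", pd.Series([1.0, 2.0], index=[10, 20]))
write_dataframe(store, "alts", pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]}))

buf = io.BytesIO()  # stands in for a file on disk
np.savez_compressed(buf, **store)
buf.seek(0)

# allow_pickle=False means loading can never execute code from the file.
with np.load(buf, allow_pickle=False) as npz:
    loaded = {key: npz[key] for key in npz.files}
```

String-valued indexes or columns would need converting to a fixed-width string dtype before they pass the guard; that case is left out here.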

@mattwigway
Owner Author

The reason to write each column separately is that dtypes may differ, and writing the data frame as one big matrix would force everything to a single dtype. The big sticking point, though, is the PriceIncomeFunction, which is an arbitrary Python function and so is not going to be serializable in npz format. We could either retain pickle for it (maybe pickle just that function and save it as bytes), although that leaves the security concerns above, or just store a string identifying which functional form to use.
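The string-reference option could be sketched roughly as follows. Everything here is hypothetical: the registry, the form names `ratio` and `log_ratio`, their formulas, and the archive key are placeholders, since the real PriceIncomeFunction forms aren't listed in this thread. The idea is that only a name is stored in the file, and the code that runs always comes from the package itself.

```python
import io
import math

import numpy as np

# Hypothetical registry of known functional forms (names and formulas made up).
PRICE_INCOME_FUNCTIONS = {
    "ratio": lambda price, income: price / income,
    "log_ratio": lambda price, income: math.log(price / income),
}


def write_functional_form(store, form_name):
    # Store only the name; reject anything the registry does not know.
    if form_name not in PRICE_INCOME_FUNCTIONS:
        raise KeyError(f"unknown functional form: {form_name}")
    store["price_income_function"] = np.asarray(form_name)


def read_functional_form(npz):
    # Look the stored name up in the registry; no code comes from the file.
    form_name = npz["price_income_function"].item()
    return PRICE_INCOME_FUNCTIONS[form_name]


store = {}
write_functional_form(store, "ratio")

buf = io.BytesIO()  # stands in for a file on disk
np.savez_compressed(buf, **store)
buf.seek(0)

with np.load(buf, allow_pickle=False) as npz:
    price_income = read_functional_form(npz)
```

The trade-off is that a custom function outside the registry can't be saved this way, so users with their own functional form would have to register it before loading.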
