[FEATURE REQUEST] Model Storage #68
@lantiga let's continue our discussion here.
@rlizzo Why did we close this?
Sorry, that was a mistake! The long and short of it is that I'm leaning more towards not implementing this. I'll post more comments about my rationale later, but Luca and I talked today and came to similar conclusions.
OK, so for reference: I've decided (with input from @lantiga) that we will not implement this inside of Hangar. The rationale is this:
For these reasons (and maybe a few more still in my head), Hangar will not create a custom container class to store machine learning models. That being said, it's a very valid point that people don't want to retrain machine learning models when they want to go back in time. In the future, the supporters of the Hangar project, [tensor]werk inc., will release a more focused tool to handle all of this. Some preliminary designs and ideas are in the works, though the timeline for an open source alpha release is not yet known. @hhsecond, is this a valid reason from your point of view? I'm open to hearing any counter arguments to this decision.
@rlizzo These points make absolute sense, and keeping Hangar a tool for "data" (only) is probably the best way of handling it as a project/product, since Hangar is designed to store numerical arrays. But in that case, Hangar alone is going to be useful only to a niche crowd, because we are essentially trying to solve a problem that most people in the industry are not worried about, or they are too big to use Hangar (it might be just me, but this is what I have seen). That being said, I could see the absence of a tool like Hangar as the reason organizations haven't done what you lay out in the above thread yet.
After sleeping on this, I think we should still allow users to store binary blobs as a special case of the current functionality. Either store a blob as a dataset of one sample, where the sample is a byte array, or treat it as metadata (we may not be able to, due to size considerations; not sure). Models as a concept will ideally come later in a separate tool that manages lineage across source code, data, hyperparameters, trained models, etc. It's quite possible that it will share significant portions of its source code with Hangar (e.g. the Merkle DAG machinery).
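The "dataset of one sample, where the sample is a byte array" idea above can be sketched as a simple round trip: serialize the model to bytes, view the bytes as a 1-D `uint8` array (the kind of numerical array Hangar stores), and reverse the process on checkout. The helper names here are illustrative, not Hangar's actual API.

```python
import numpy as np

def model_to_uint8(model_bytes: bytes) -> np.ndarray:
    """Wrap raw serialized model bytes in a 1-D uint8 array,
    suitable for storage as a single array sample."""
    return np.frombuffer(model_bytes, dtype=np.uint8)

def uint8_to_model(arr: np.ndarray) -> bytes:
    """Recover the raw serialized bytes from the stored array."""
    return arr.tobytes()

# Round-trip check with a stand-in payload (a real workflow would use
# e.g. a pickled or ONNX-serialized model).
payload = b"\x80\x04fake-serialized-model-state"
arr = model_to_uint8(payload)
restored = uint8_to_model(arr)
```

The appeal of this route is that versioning, hashing, and diffing of the blob then fall out of the existing array-storage machinery for free.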
@lantiga: I suppose blob stores are a fair compromise here. If we go that route, I would keep essentially the same conventions (and probably just reuse most of the code) we define for metadata, namely:
The only reason this isn't totally trivial to implement is that, unlike metadata (for which we define a limit on the size of values), my gut says we shouldn't store the blobs in LMDB directly, as there is no concept of cross-record value compression there. That's not an issue for short strings, but it would be more desirable to pack and compress arbitrarily sized blobs together. We would basically be emulating what a … Maybe that's premature though? Of course we could provide the same partial-clone semantics we are tooling for data records, but that is still very much in the works... I'm going to say for right now that I think it's needed and should be done. We just need to decide how much we care about disk space. If we don't, it's trivial; if we do, it's a project, and I cannot see this getting to the top of my priority list anytime soon. (I think that even if you wanted to work on this, both of your time would be much better spent on user-facing features.) So... what are your thoughts? How important is this?
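The pack-and-compress idea mentioned above can be illustrated in a few lines: compress each blob, concatenate the results into one buffer, and keep an index of (offset, length) pairs so any single blob can be sliced back out. This is a toy sketch of the concept, not a proposed Hangar on-disk format.

```python
import zlib

def pack_blobs(blobs: dict) -> tuple:
    """Compress each blob and concatenate into one buffer.

    Returns (buffer, index) where index maps each key to the
    (offset, length) of its compressed bytes inside the buffer.
    """
    out = bytearray()
    index = {}
    for key, blob in blobs.items():
        comp = zlib.compress(blob)
        index[key] = (len(out), len(comp))
        out += comp
    return bytes(out), index

def unpack_blob(buf: bytes, index: dict, key: str) -> bytes:
    """Slice one blob back out of the packed buffer and decompress it."""
    off, length = index[key]
    return zlib.decompress(buf[off:off + length])

buf, idx = pack_blobs({"m1": b"weights-v1" * 100, "m2": b"weights-v2"})
m1 = unpack_blob(buf, idx, "m1")
m2 = unpack_blob(buf, idx, "m2")
```

The disk-space trade-off in the comment is visible here: skipping the compression step makes the store trivial, while compressing (and later repacking) is where the real engineering effort would go.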
@hhsecond: for the sake of not duplicating a bunch of documentation I've already written about the benefits of storing your data via the … Let me know if that addresses your point. It's extremely valid, and if the current explanations aren't clear or compelling enough, I will make the time to improve them.
@rlizzo I have re-read the docs, and here are my thoughts about the changes required. Let me know what you think. The Domain-Specific File Format Problem:
Human & Computational Cost
Others
Reviving this thread. Especially with partial fetch, I still think we should provide the ability to store individual blobs and version them. In fact, they would behave like single-tensor dsets, maybe with a dedicated data store that just writes the blob to disk as a file instead of relying on an HDF5 file. This would solve the model-store problem with very little added complexity, with the benefit of partially fetching a certain version of a model.
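The "dedicated data store that just writes the blob to disk as a file" could look something like a content-addressed store: name each file by the hash of its contents, so identical model versions deduplicate naturally and a specific version can be fetched by digest. The two-character fan-out layout below is a common convention (e.g. in git object stores), not necessarily what Hangar would choose.

```python
import hashlib
import pathlib
import tempfile

def write_blob(store_dir: pathlib.Path, blob: bytes) -> str:
    """Write a blob to a content-addressed path under store_dir
    and return its hex digest (the 'version' handle)."""
    digest = hashlib.sha1(blob).hexdigest()
    path = store_dir / digest[:2] / digest[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(blob)
    return digest

def read_blob(store_dir: pathlib.Path, digest: str) -> bytes:
    """Fetch one blob by digest -- the file-level analogue of
    partially fetching a single version of a model."""
    return (store_dir / digest[:2] / digest[2:]).read_bytes()

store = pathlib.Path(tempfile.mkdtemp())
digest = write_blob(store, b"serialized-model-weights")
restored = read_blob(store, digest)
```

Because each blob is an independent file keyed by hash, partial fetch reduces to transferring only the files whose digests a checkout actually references.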
Is your feature request related to a problem? Please describe.
Allow binary model storage for trained models