[FEATURE REQUEST] Multi-tensor samples from complex data sources #79
Hey @elistevens, thanks for your patience, we should have gotten to this earlier.

So, that's right, a Dataset in Hangar maps to a single schema (a single collection of tensors). A Repo, on the other hand, can have more than one Dataset, so, for instance, to store data where two different pieces of info sit alongside each other you'd have a Repo with two Datasets, and the samples in the two Datasets would share the same sample keys. Thus, right now what you are looking for regarding 1 sample -> 2+ tensors is possible; it's probably a matter of:

a) terminology: a Dataset and a Sample in Hangar each pertain to a single piece of what you'd usually call a Dataset and a Sample - food for thought as a possible source of confusion moving forward;

b) integrity: nothing currently enforces that the keys of the two Datasets stay in sync. In the future we might introduce integrity constraints as an option, but it's probably not very high priority at the moment (compared to other aspects, this being a young project).

As for blobs and metadata: we have sample-level metadata already, which is properly serialized and versioned together with the tensor. Depending on the user's needs, metadata could contain a URI of the original blob, for instance. Our current design is to avoid storing the original blobs in Hangar, although you could always store them in a Dataset as 1D tensors with a byte dtype, so to speak (a bit of a hack, but not that bad after all).

We are designing a way to go to and from the original file format (or other formats that make sense for the data) through an extensible plugin mechanism. The import step would store the necessary extra info in the metadata; the export step would look for that info in the metadata. The same plugin mechanism can also be leveraged for visualizing changes during diffs and conflict resolution, which we find quite exciting at the moment.

I hope I've addressed your concerns at least partially. I'd like to hear your thoughts on the design side of things.
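To make the shared-key pattern concrete, here is a minimal sketch. The call names (`Repository`, `init_dataset`, `co.metadata`) are illustrative of the design being described, not a quote of the actual Hangar API:

```python
import numpy as np
from hangar import Repository  # Hangar-like API; names below are illustrative

repo = Repository(path='/path/to/repo')
repo.init(user_name='A User', user_email='user@example.com')  # first time only
co = repo.checkout(write=True)

# Two Datasets (tensor collections) sitting side by side in one Repo.
imgs = co.datasets.init_dataset(name='xray_image',
                                prototype=np.zeros((512, 512), dtype=np.float32))
info = co.datasets.init_dataset(name='patient_info',
                                prototype=np.zeros((4,), dtype=np.float32))

# The 1-sample -> 2-tensors association is carried entirely by a shared key.
key = 'patient_0001_study_3'
imgs[key] = np.random.rand(512, 512).astype(np.float32)
info[key] = np.array([63.0, 1.0, 1.0, 0.0], dtype=np.float32)  # age, sex, smoker, ...

# Metadata can point back at the original blob instead of storing it.
co.metadata[key + '.source_uri'] = 's3://bucket/raw/patient_0001_study_3.dcm'

co.commit('add patient 0001, study 3')  # one commit covers both Datasets
co.close()
```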
Perhaps confusion would be reduced if the Dataset/Sample classes were renamed to Tensorset/Tensor? That would free up Sample to be "one or more Tensors" and Dataset to be "a collection of Samples." I'm imagining a PyTorch Dataset subclass like:

```python
import torch.utils.data

class HangarTorchDataset(torch.utils.data.Dataset):
    def __init__(self, tensorset_list):
        self.tensorset_list = tensorset_list
        # Assumes every tensorset shares the same set of sample keys.
        self.key_list = list(tensorset_list[0].keys())

    def __len__(self):
        return len(self.key_list)

    def __getitem__(self, ndx):
        # One training sample = the tensors stored under the same key
        # in each tensorset.
        key = self.key_list[ndx]
        return tuple(ts[key] for ts in self.tensorset_list)
```

It's going to be confusing if all of those items in the returned tuple are themselves called Datasets.
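For context, the intent is that a class like the one sketched above drops straight into a stock PyTorch `DataLoader`; a minimal usage sketch, using plain dicts as stand-ins for the tensorsets:

```python
import numpy as np
from torch.utils.data import DataLoader

# Plain dicts as stand-ins for tensorsets: anything mapping key -> array works.
images = {f'sample_{i}': np.random.rand(3, 32, 32).astype(np.float32) for i in range(8)}
labels = {f'sample_{i}': np.array([i % 2], dtype=np.int64) for i in range(8)}

ds = HangarTorchDataset([images, labels])
loader = DataLoader(ds, batch_size=4, shuffle=True)

for img_batch, label_batch in loader:
    # Default collation stacks each position of the returned tuple.
    print(img_batch.shape, label_batch.shape)  # [4, 3, 32, 32] and [4, 1]
```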
Hey @elistevens, I for one like the name change quite a lot. It gives us the ability to free up the concepts for a later, higher-level implementation of proper Datasets and Samples. We really need to avoid any possible confusion around the mental model, and this is a good step in that direction. I'll let @rlizzo and @hhsecond chime in too, but this is a +1 for me.
Hey Eli, sorry for the delayed response here (it's been rather hectic lately, but I'm back in the groove now 😄). Addressing some of the points (in no particular order):

Nested Collections of Related Data Schemas

In some of the very first hangar prototype implementations, we had a much deeper hierarchy of containers for data, which essentially would have grouped what we now think of as separate Datasets into nested collections. I believe this would be an architecture which would satisfy your initial problem statement? We decided to kill this type of hardcoded architecture and flatten the namespace for a few reasons.
@elistevens, does this make sense in the context of your specific problem? You're not too far off from something we probably should solve in the future, but hopefully the rationale for the current behavior and the workarounds are clear? At the end of the day it comes down to time allocation; if this is a real priority or show-stopper for you, let's talk about how we might be able to put a solution together.

Naming Conventions

With the above said, I'm on the fence (but leaning towards approval) on changing the name Dataset to Tensorset.
@rlizzo to be clear, I don't need Hangar to be an end-to-end, out-of-the-box solution. I just want to make sure there's a clear, documented "this is the right way to solve this class of problems with Hangar" path forward. Having multiple Tensorsets with the same set of keys and using that to build training samples by pulling the same key from each Tensorset (plus metadata as needed) should work fine. Same for having URIs point back to external blob stores. I'd want to be able to commit/branch/etc. on multiple Tensorsets at the same time, and be able to refer to a single hash to say "this model was trained with data commit <hash>."

I agree that introducing a new name "Tensorset" isn't great, but I think that introducing a new concept while giving it a familiar name whose underlying meaning is subtly different is worse. Those subtle differences are going to trip people up a lot, especially when your thing is more limited than the general concept.

That said, I'm not super-sold on renaming what you currently label "Sample" to "Tensor", because really what you have is a tensor (in the TF/PyTorch sense) plus the metadata. The addition of the metadata does change it a bit. I still think that Sample is the wrong name, since samples can have multiple tensors, plus metadata. Whatever the metadata+tensor combo gets called should inform what Tensorset gets named too. "Tensor" is probably fine, since you can say "Hangar Tensors have richer metadata than Tensors in PyTorch or TensorFlow, but they represent the same fundamental concept." Since your Tensors here are more capable than the baseline idea, it's not as big a problem. Still, worth pondering, IMO.
I thought that samples had a place to store things like patient age/sex/smoking history, or JPEG lat/long, etc. alongside the actual tensor. That's the metadata I mean.

Also, FYI, your terminology of "non-named dataset samples" seems confusing to me. It sounds like you're talking about datasets that auto-key Tensor-Samples added to them, vs. situations where the user provides the key. Is that right?
So we actually removed that long ago (before the initial commit was made public) in favor of keeping everything as tensors stored in datasets. Right now that info could either be stored as related samples in another dataset (with an appropriate row size, etc.) or in the top-level (text-based) metadata.

The rationale is that the hangar core should be kept as simple as possible, having no concept of relations between datasets or samples. Higher-level convenience functions (automatically dealing with aggregations, relationship mapping, etc.) can be built on top of those to eliminate boilerplate code for the user, but only once the core is stable and there is a need for them.

Do you have a use case where storing info alongside tensors is rather important?
Precisely. We should probably change it to a clearer definition/name...
I think at this point we could give the idea of sample-level metadata a minute. It would be very convenient to keep track of provenance for a tensor (the URI of the file or resource it was generated from, for instance, or the metadata for a DICOM series). This would potentially allow regenerating something close to the source data if needed. We could keep such metadata as a blob in LMDB. I'm guessing the biggest hurdle would be merging. What do you think @rlizzo?
I'm going to take a day or two to think. I can see the benefit, but the implementation needs to be figured out. Making it work would be trivial, but making it multithread-safe would be difficult (I have a branch where read checkouts are now completely thread- and process-safe, increasing throughput linearly with core count). LMDB doesn't tend to play nice in these aspects. I'll go into more detail here later.

Merging wouldn't be an issue so long as we decide what a conflict is from a conceptual standpoint.

Eli, I'm imagining this type of attached metadata would effectively act as a key/value store. Would that be sufficient? Any idea what you might want the API to be?
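As a sketch of what such a key/value metadata store might feel like, here is a toy in-memory stand-in (all names hypothetical, nothing here is Hangar's actual API):

```python
import numpy as np

class DatasetWithSampleMetadata:
    """Toy stand-in for the proposed API: each sample key gets a plain
    key/value metadata dict, stored alongside its tensor."""

    def __init__(self):
        self._tensors = {}
        self.metadata = {}  # sample key -> dict of str -> str/number

    def __setitem__(self, key, tensor):
        self._tensors[key] = tensor
        self.metadata.setdefault(key, {})

    def __getitem__(self, key):
        return self._tensors[key]

ds = DatasetWithSampleMetadata()
ds['patient_0001'] = np.random.rand(512, 512).astype(np.float32)
ds.metadata['patient_0001']['source_uri'] = 's3://bucket/raw/patient_0001.dcm'
ds.metadata['patient_0001']['modality'] = 'CR'
print(ds.metadata['patient_0001']['source_uri'])
```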
I don't have any concrete use cases, nor am I certain that the "just more tensors" approach wouldn't work for anything I came up with. I had been imagining using it to filter or shape the data in some way. Something like "cancer stage" for a binary tumor classifier that a PyTorch Dataset could use to balance the training data across stages, even if training samples of a given class are over-represented. Or being able to limit an Imagenet Tensorset to only birds+airplanes.
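A rough sketch of that filtering pattern, assuming dict-like tensorsets and a predicate applied to a label tensorset (all names illustrative):

```python
import numpy as np
from torch.utils.data import Dataset

class FilteredHangarDataset(Dataset):
    """Expose only the samples whose label/metadata passes a predicate."""

    def __init__(self, tensorset_list, label_tensorset, predicate):
        self.tensorset_list = tensorset_list
        # Scan the labels once up front and keep only the keys we want.
        self.key_list = [k for k in label_tensorset.keys()
                         if predicate(label_tensorset[k])]

    def __len__(self):
        return len(self.key_list)

    def __getitem__(self, ndx):
        key = self.key_list[ndx]
        return tuple(ts[key] for ts in self.tensorset_list)

# e.g. limit an Imagenet-style tensorset to birds (class 0) + airplanes (class 1):
images = {f's{i}': np.random.rand(3, 8, 8).astype(np.float32) for i in range(6)}
labels = {f's{i}': np.array([i % 3], dtype=np.int64) for i in range(6)}
birds_and_planes = FilteredHangarDataset([images, labels], labels,
                                         predicate=lambda y: int(y[0]) in (0, 1))
print(len(birds_and_planes))  # 4 of the 6 samples survive the filter
```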
Related to #162
Is your feature request related to a problem? Please describe.
I want to know how to accomplish the following. It doesn't need to require zero effort on the user's part, but there needs to be a clear best-hangar-practices path to a workable setup.
Take a source data format that is complex (e.g. DICOM, JPEG) and infeasible to reconstitute bit-exact from the tensor+metadata form. Each instance of the raw data produces a sample that consists of 2 (or more) tensors: an image tensor and a 1D tensor that encodes things like lat/long or age/sex/etc. (to be concatenated with the output of the convolutional layers prior to the fully connected layers). To be clear, this is intended to be an illustrative example, not a concrete use case.
Per my reading of the docs, right now these two tensors wouldn't qualify as being in the same hangar dataset (it's not clear if that's problematic or not).
Let's express the above conversion as `f_v1(raw) -> (t1, t2)`.
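For concreteness, an `f_v1` for the DICOM case might look roughly like this (pydicom shown purely for illustration; field handling heavily simplified):

```python
import numpy as np
import pydicom  # illustration only; any complex source format has an analogous reader

def f_v1(raw_path):
    """Turn one raw DICOM file into (t1, t2): image tensor + 1D feature tensor."""
    dcm = pydicom.dcmread(raw_path)
    t1 = dcm.pixel_array.astype(np.float32)               # the image tensor
    age = float(str(dcm.get('PatientAge', '000Y'))[:3])   # e.g. '063Y' -> 63.0
    sex = 1.0 if str(dcm.get('PatientSex', '')) == 'M' else 0.0
    t2 = np.array([age, sex], dtype=np.float32)           # features for the FC layers
    return t1, t2
```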
Users will need to:

- Move to `f_v2` and repopulate `t1` and `t2`.
- Move to `f_v3`, which outputs `(t1, t2, t3)`.
- Keep `t1` and `t2` associated with each other.
- Iterate over `t1` and `t2` for training/validation (including when training is randomized).

Describe the solution you'd like
I think that changing the definition of a sample to be a tuple of binary blobs plus a tuple of tensors plus metadata would work, but I haven't considered the potential impacts from that kind of change. Seems potentially large.
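Spelling that proposed sample definition out as a plain data structure (nothing Hangar-specific, just the shape of the idea):

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

import numpy as np

@dataclass
class Sample:
    """The proposed shape of a sample: source blobs + derived tensors + metadata."""
    blobs: Tuple[bytes, ...]                  # original raw bytes (e.g. the DICOM file)
    tensors: Tuple[np.ndarray, ...]           # (t1, t2, ...) as produced by f_vN
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. {'source_uri': ...}
```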
Describe alternatives you've considered
Another option would be to have separate datasets for `t1` and `t2` and combine them manually, plus manage the binary blobs separately. That seems like a lot of infra work, and might be at risk of drift between the samples themselves, and between the samples and the blobs.

Additional context
I suspect that I want/expect Hangar to solve a larger slice of the problem than it's intended to, but it's not clear at first glance what the intended approach would be for more complicated setups like the above.