dwc:datasetName use to group datasets, yes, no, options please? #199
Comments
This is a significant issue in camera trapping. Most of the major projects (e.g., eMammal, Snapshot USA, Wildlife Insights) are collaborations between multiple institutions. These are referred to as 'initiatives' since they are larger than one 'organization' or 'institution'. Within those initiatives, providers may submit their own datasets as part of the effort.
@ben-norton Scenario 2 is what I'm thinking about (although I'd guess that, in the situation I have in mind, your first scenario is very likely to happen as well). It makes me think hard from a Latimer Core perspective. We're really talking about whole/part relationships and the many variables around which we might pivot or group data. So, for a given distributed project, then
Or is it possible that we (and GBIF) give up worrying about duplicates and instead use the data and metadata to find the dupes, as Nicky's algorithms are doing?
d) Even in that case ^^^, the original project would still like to see, grab through the API, and visualize their aggregated data.
Where does this leave us? Are there standards in place to help us do this? Or do we have a gap?
Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.
Gotcha @Jegelewicz, although in this case I'm really talking about projects, not institutions. For example, we've long needed a way to say that a specific dataset was mobilized via a specific grant number. Something like this would be useful. When you start to tease it apart, many specimens will be touched, imaged, sampled, etc. in connection with different grants, so it's also a one:many thing. We need a way to group the objects around that grant number...
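To make the one:many point concrete, here is a minimal sketch of grouping records around grant numbers. The `grantNumbers` key and the sample values are hypothetical and used only for illustration; no such Darwin Core term is implied.

```python
from collections import defaultdict


def index_by_grant(records):
    """Map each grant number to the set of record IDs linked to it.

    Assumes each record carries a hypothetical "grantNumbers" list;
    this is an illustrative field, not an existing standard term.
    """
    index = defaultdict(set)
    for record in records:
        for grant in record.get("grantNumbers", []):
            index[grant].add(record["occurrenceID"])
    return dict(index)


# Made-up sample data: one specimen can be tied to several grants.
records = [
    {"occurrenceID": "spec-1", "grantNumbers": ["GRANT-A", "GRANT-B"]},
    {"occurrenceID": "spec-2", "grantNumbers": ["GRANT-A"]},
    {"occurrenceID": "spec-3", "grantNumbers": []},
]
grant_index = index_by_grant(records)
```

Because a specimen can appear under more than one grant, a single-valued field per record would not be enough; the index has to allow the same record ID under several keys.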
I could see this being covered by the Identifier class in Latimer Core.
Thanks! It's definitely parallel to the idea of pivoting different parts of the same collection in different ways.
I am not sure if this is relevant, but we use the project ID in the EML, assuming of course that one dataset has only one project. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO For datasets with multiple projects, an issue has been opened here: gbif/ipt#1780
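For anyone wanting the same grouping programmatically, the sketch below only builds a query URL and makes no network call. It rests on two assumptions that should be checked against the GBIF API documentation: that the dataset search endpoint accepts a `projectId` parameter, and that `~2F` in the portal link above stands for `/` in the project ID.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the GBIF registry API docs.
GBIF_DATASET_SEARCH = "https://api.gbif.org/v1/dataset/search"


def dataset_search_url(project_id, limit=20):
    """Build a dataset search URL filtered by project ID.

    "projectId" is an assumed parameter name; urlencode percent-encodes
    the slashes in the identifier.
    """
    query = urlencode({"projectId": project_id, "limit": limit})
    return f"{GBIF_DATASET_SEARCH}?{query}"


# Assumed decoded form of the project ID shown in the portal link.
url = dataset_search_url("BR/154/A1/RECTO")
```

The resulting URL could then be fetched with any HTTP client to list the datasets that share that project ID.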
@ymgan in the scenario I'm describing, various groups across the USA would be collecting data (observations and specimens) on their own, in their own areas of the USA. They'd be using a standard protocol. The goal would be for all these distributed sets to be able to come together using a particular data point. Perhaps this project ID in the EML could do that, then. Does this sound parallel to what you are describing?
Yes, that is parallel to what I am describing. However, it works at the dataset level. At the record level, datasetName and datasetID do indeed seem to be for this purpose. @dagendresen made a good remark here: gbif/pipelines#665 (comment)
See https://data-blog.gbif.org/post/clustering-occurrences/, which describes what we're already doing and references Nicky's work.
Question about the use of the dwc:datasetName field.
Scenario: different groups, as part of a formal national initiative, will each observe the same taxonomic group in their respective region. Some of the observed taxa will be collected and vouchered in various collections across this given nation.
General Puzzlement One: Each group publishes their own dataset to GBIF (observations and specimen records). How do these disparate datasets find each other? How can they be grouped "after-the-publishing step?"
Specific Puzzlement Two: What if each group gave their own dataset the same name? Example: dwc:datasetName = "Our [taxonomic group] Data". Would this work? Say, for publishing to GBIF, does it matter if two datasets have the same datasetName?
Last Puzzlement: Is there a better strategy? What are the options (standard terms? extensions?) for ensuring, or at least improving the chances, that these data can be aggregated and understood to be part of the same initiative? Would this be a use case for a term that conveys a funding number (a grant number)? That way, all datasets (and, for that matter, data records) would be gatherable by it.
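As a rough illustration of the shared-name idea, the sketch below groups already-downloaded records by dwc:datasetName while keeping each group's own dwc:datasetID visible. The sample records, identifiers, and the placeholder name are made up, and the sketch says nothing about how GBIF itself treats identical names.

```python
from collections import defaultdict


def group_by_dataset_name(records):
    """Group occurrence IDs by datasetName, then by datasetID.

    A shared datasetName gathers the initiative's records together,
    while datasetID still tells the contributing datasets apart.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for record in records:
        name = record["datasetName"]
        groups[name][record["datasetID"]].append(record["occurrenceID"])
    return {name: dict(by_id) for name, by_id in groups.items()}


# Made-up records from two groups that chose the same datasetName.
records = [
    {"occurrenceID": "obs-1", "datasetID": "group-east", "datasetName": "Our Taxon Data"},
    {"occurrenceID": "obs-2", "datasetID": "group-west", "datasetName": "Our Taxon Data"},
    {"occurrenceID": "obs-3", "datasetID": "group-west", "datasetName": "Our Taxon Data"},
    {"occurrenceID": "obs-4", "datasetID": "other", "datasetName": "Unrelated Survey"},
]
grouped = group_by_dataset_name(records)
```

This only works if every group spells the name identically, which is one reason a controlled identifier such as a project ID or grant number may be a sturdier grouping key than a free-text name.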
Insights, discussion, and options welcome. I'm guessing others have already solved this or at least grok current possible options as well as needs to make this a reasonable approach to a distributed national-level project.