How to search the GrSciColl Collection descriptors #558
Comments
Thanks @MortenHofft, this is really nice! I have been wanting to write this down for 2 weeks now (good you did it).
Yes! And it would have several tables of descriptors which can have (or not) some of the same fields (in that case numbers aren't comparable).
Yes to all the fields! I think when it comes to multivalued fields, the easiest would probably be to align with how we process the same fields for occurrences. It will be easier to explain and make more sense. There are more fields beyond those that I would like to index, including some Latimer Core terms that would require controlled values. This doesn't have to happen in the first implementation.
Yes! that would be perfect!
I think you are right and we have to ignore this for now.
I agree that aggregation would be nice and that we would likely want both specimens and collections. Note that if this is too tricky to implement, the priority is to make the collections discoverable, not to produce metrics. Also linking to this issue containing examples: #557
I've also deployed it to UAT2 and imported some descriptors for these collections: https://api.gbif-uat2.org/v1/grscicoll/collection/e5097454-2826-473a-b610-05e15ccd7ad2/descriptorSet
@marcos-lg writes
Is that worth implementing? I had hoped that if a collection owner was engaged enough to create CSVs with holdings, they would also have edit access. And casual users who just want to correct typos etc. wouldn't know the content of the collection anyhow. How is it used, Marie? Is it worth spending energy on suggestions for CDs?
Thanks @MortenHofft, this is a good question. I think that people who would have the tables wouldn't necessarily have an account on GBIF and/or permission to edit the entry on GRSciColl. Most people who update their (own) entries don't have an account. I think that having a suggestion system to upload a table would be helpful. It removes the extra steps of creating an account on GBIF and sending an email to ask for editing permissions. With that in mind, we could have a first phase where only editors/mediators upload the tables. We could work on the suggestion system later if we get the sense that this would really facilitate getting descriptors in.
Synchronisation with IH makes available Collections summaries, which contain a breakdown of the collections. I think it would be great to make these into collection descriptor tables.
Mapping in practice
Here is an example of a table from NY (https://sweetgum.nybg.org/science/ih/herbarium-details/?irn=125525):
This is how I would like to see it mapped
@ManonGros here is an attempt to describe what we would like to do with those collection descriptors. This is what I have understood from our conversations, but I've tried to make it slightly closer to something that can be implemented.
Could you please see if it makes sense or if I'm misguided?
A collection roughly looks like
We would like to have search
We should be able to search the same fields as now: freetext, institutionKey, code, country, numberOfSpecimens, etc
And then some new fields that are based on the collection descriptors (CD)
scientificName: Only one value per row in the CSVs. interpret same as occurrences.
just like occurrences we add the higher ranks so that users can search with higher taxa.
country: Only one value per row in the CSVs. interpret same as occurrences.
We could possibly infer continent or other regions. But for a first pass just having country would be fine.
individualCount: not the same interpretation as in the occurrence index I would assume. no value doesn't imply exactly 1 specimen.
identifiedBy: pipe separated. interpret same as occurrences
recordedBy: pipe separated. interpret same as occurrences
typeStatus: @ManonGros this is normally a pipe separated field. That means that we cannot use it for charts - at least not with specimen counts
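To make the field interpretation above concrete, here is a minimal sketch of how one CSV row could be read. It is not the actual GBIF pipeline: the function names, the exact column headers and the sample rows are assumptions made up for illustration. It only shows the two points from the list, that pipe-separated cells become lists and that an empty individualCount stays unknown instead of defaulting to 1.

```python
import csv
import io
from typing import Optional

# Hypothetical column names, taken from the field list above.
MULTI_VALUE_FIELDS = ("identifiedBy", "recordedBy", "typeStatus")


def split_pipes(value: Optional[str]) -> list[str]:
    """Split a pipe-separated cell into trimmed, non-empty values."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_descriptor_row(row: dict) -> dict:
    """Turn one CSV row into a descriptor record (illustrative only)."""
    raw_count = (row.get("individualCount") or "").strip()
    record = {
        "scientificName": (row.get("scientificName") or "").strip() or None,
        "country": (row.get("country") or "").strip() or None,
        # A missing count stays None: unlike occurrences, it does not imply 1.
        "individualCount": int(raw_count) if raw_count else None,
    }
    for field in MULTI_VALUE_FIELDS:
        record[field] = split_pipes(row.get(field))
    return record


# Made-up sample data, not taken from any real collection.
sample = io.StringIO(
    "scientificName,country,individualCount,recordedBy,typeStatus\n"
    "Puma concolor,MX,12,A. Smith | B. Jones,holotype|paratype\n"
    "Puma concolor,MX,,,\n"
)
records = [parse_descriptor_row(r) for r in csv.DictReader(sample)]
```

Taxonomic interpretation (matching scientificName and adding higher ranks) is left out here, since that would reuse whatever the occurrence processing already does.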
Results
I imagine the results would be displayed as collections as the entity.
So a search result for filter
taxonKey=puma & country=MX & q=male & hl=true
would provide a result like
Possibilities of conflicts
We are now introducing CDs with specimen counts that could clash with the specimenCount on the core record, just like specimensInGbif can. At some point we might want a flag for that, and for other oddities, but it is my impression that we can ignore issues like that for now.
Aggregation options would be nice
We cannot do much for roll-ups across collections and CSVs unless there is some agreement on how to count. We cannot tell from the numbers alone. So perhaps we should include some flags that collection owners can set to indicate that their CDs support counting and comparison.
noDoubleCounting:true
could mean that the same specimen is not included in more than one row (not in 2 rows in the same CSV and not in 2 distinct CSVs). We could then do aggregations for individual collections. And possibly across collections, with some caveats.
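As a rough sketch of how such a flag could gate aggregation for a single collection: the data shape below (a list of descriptor sets, each with a `noDoubleCounting` flag and `rows`) and the function name are assumptions for illustration, not an existing API. Rows without a count contribute 0, so the total is a lower bound.

```python
from typing import Optional


def specimen_total(descriptor_sets: list) -> Optional[int]:
    """Sum specimen counts for one collection.

    Returns None when any descriptor set lacks the noDoubleCounting flag,
    because the rows might then count the same specimen more than once.
    """
    if not descriptor_sets:
        return None
    if not all(s.get("noDoubleCounting") for s in descriptor_sets):
        return None
    return sum(
        row.get("individualCount") or 0  # unknown counts add nothing
        for s in descriptor_sets
        for row in s["rows"]
    )


# Made-up example data.
flagged = [
    {"noDoubleCounting": True, "rows": [{"individualCount": 10}, {"individualCount": None}]},
    {"noDoubleCounting": True, "rows": [{"individualCount": 5}]},
]
unflagged = [
    {"noDoubleCounting": True, "rows": [{"individualCount": 10}]},
    {"rows": [{"individualCount": 5}]},
]
flagged_total = specimen_total(flagged)
unflagged_total = specimen_total(unflagged)
```

Requiring the flag on every descriptor set of the collection follows from the definition above, which rules out double counting across distinct CSVs as well as within one.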
aggregation example:
facet=specimenCountryCode & facet=countryCode & facet=kingdomKey & facet=decade & facet=preparation & facet=discipline & facet=hasTypes
Ideally it would be nice to have cardinalities for those as well. That isn't something we normally have in our APIs, but for hosted portals we get it directly from Elasticsearch.
What do we count
What do we count in those facets? It could be collections, collection descriptors or specimens.
I'm guessing that counting collection descriptors is uninteresting, but that both specimen and collection counts are interesting. Collection counts might be interesting when comparing across institutions? (E.g. give me a breakdown of countries and list how many collections have data about each - "Ohh that is interesting - there is only one collection in the world that states it has information about butterflies in Pakistan").
But normally for an endpoint when we do facets we count the entries, so this would be different. Or we could have 2 types of facets. specimenFacets and (collection)Facets.
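The difference between the two facet types can be sketched as follows. This is only an in-memory illustration with made-up data and hypothetical field names (`collectionKey`, `individualCount`), not how the index would actually compute it: a specimen facet sums the counts per value, while a collection facet counts the distinct collections that have at least one descriptor with that value.

```python
from collections import defaultdict


def facet_counts(descriptors: list, field: str) -> tuple:
    """Return (specimen_facets, collection_facets) for one field."""
    specimens = defaultdict(int)
    collections = defaultdict(set)
    for d in descriptors:
        value = d.get(field)
        if value is None:
            continue
        # Specimen facet: add up the counts; unknown counts add nothing.
        specimens[value] += d.get("individualCount") or 0
        # Collection facet: remember which collections mention this value.
        collections[value].add(d["collectionKey"])
    return dict(specimens), {k: len(v) for k, v in collections.items()}


# Made-up descriptors from two hypothetical collections.
descriptors = [
    {"collectionKey": "c1", "country": "MX", "individualCount": 10},
    {"collectionKey": "c1", "country": "MX", "individualCount": 5},
    {"collectionKey": "c2", "country": "MX", "individualCount": None},
    {"collectionKey": "c2", "country": "PK", "individualCount": 3},
]
specimen_facets, collection_facets = facet_counts(descriptors, "country")
```

Neither number is the usual "count the entries" facet: counting entries here would count collection descriptors, which is the option guessed above to be uninteresting.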
@ManonGros - what are your thoughts?