Define scope much more aggressively #19

Closed
rdmpage opened this issue May 10, 2017 · 9 comments

@rdmpage

rdmpage commented May 10, 2017

From the sidelines this project already seems huge, far too driven by legacy concerns, and lacking clear input from "users". Once again we see data providers having a stake but not the users. I'm not privy to the original discussion about this project, but it seems to me that it would be nice to have at least three things:

Names linked to evidence

Most catalogues of names have few if any links to the actual evidence for those names, i.e. the literature. Most of the existing links are not digital. In fact, I would invert this problem as not being one of names + links to literature, but literature annotated with (amongst other things) names. I suspect that both people and machines will gain more from access to the evidence. This is the intersection of BHL, the ever growing non-BHL digitised literature, and the nomenclators (why is ION not included, it has an order of magnitude more names than ZooBank?). For this to be effective, literature needs to be front and centre, the names are simply annotations. Let's stop rehashing 5x3 index cards. It's not the names that matter, it's the literature.

Names linked to names

Probably the single biggest frustration for users, and the one area I think this project should probably focus on. Synonyms drive people nuts. Botany does a good job of tracking objective synonyms (IPNI) and linking to evidence for name changes (albeit mostly old-skool text strings), zoology doesn't, and nobody really tracks subjective synonyms (other than simply listing them, without supporting evidence). Lots of scope for text mining to help discover both synonyms and the evidence for them.

Names in a tree (or other navigation structure)

It's not entirely clear to me that we need **yet another** classification, especially one not based on evidence. Why not defer to the Open Tree of Life, which is notionally evidence based (e.g., phylogenies)? Leaving aside the classification/phylogeny distinction, lots of recent name changes will be driven by phylogenetic analyses. So why not delegate the classification to Open Tree? Is it not crazy and colossally wasteful for our field to have several projects all building all-encompassing taxonomic classifications?

Summary

Without trying to sound too cynical, we've been at this a while, and the same old issues keep coming around again and again. Doesn't this suggest that we're doing it wrong?

@deepreef
Collaborator

Thanks, Rod.

In order to get out of the rut we keep spinning in year after year (decade after decade), we need to define what, exactly, we mean by things like “name” and “concept”. Until we do that, we’ll be trapped in the same endless cycle of re-inventing the same dysfunctional wheels.

For example, by “name” do you mean text string? Clusters of unique text strings anchored to the same combination? Clusters of text strings and combinations anchored to the same Basionym/Protonym? How many “names” are represented here:

  1. Aus bus L.
  2. Aus ba L.
  3. Xus bus (L.)
  4. Aus bus Linnaeus
  5. Aus bus Linnaeus 1758
  6. Aus cus Jones
  7. Aus bus subsp. cus Jones
  8. Aus bus cus Jones

By my count, there are 8 NameStrings, 4 Protonyms (two in the genus group, two in the species group), either 3 or 4 combinations (depending on whether you count “Aus bus cus” as a different combination from “Aus cus”), and one orthographic variant. The word “name” can be legitimately applied to any of those things. To make real progress, we need to get past the semantic barrier and be exceptionally explicit about the things we are trying to slap persistent, reusable identifiers on.
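The tally above can be reproduced with a small sketch. The annotations on each record (which genus-group and species-group protonym a string belongs to, and which string is the orthographic variant) are entered by hand here as one reading of the example; the `NameRecord` class and the `tally` function are hypothetical names for illustration, not part of any existing schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NameRecord:
    name_string: str           # the literal text string as written
    genus: str                 # genus-group protonym the string is placed in
    epithets: Tuple[str, ...]  # species-group epithets, spelling normalised
    protonym: str              # species-group protonym of the terminal epithet
    variant: bool = False      # True for an orthographic variant


# Hand-made annotations (an assumption), one record per string in the list.
RECORDS = [
    NameRecord("Aus bus L.", "Aus", ("bus",), "bus"),
    NameRecord("Aus ba L.", "Aus", ("bus",), "bus", variant=True),
    NameRecord("Xus bus (L.)", "Xus", ("bus",), "bus"),
    NameRecord("Aus bus Linnaeus", "Aus", ("bus",), "bus"),
    NameRecord("Aus bus Linnaeus 1758", "Aus", ("bus",), "bus"),
    NameRecord("Aus cus Jones", "Aus", ("cus",), "cus"),
    NameRecord("Aus bus subsp. cus Jones", "Aus", ("bus", "cus"), "cus"),
    NameRecord("Aus bus cus Jones", "Aus", ("bus", "cus"), "cus"),
]


def tally(records, merge_ranks=False):
    """Count each kind of "name" found in the records.

    With merge_ranks=True, a trinomial such as "Aus bus cus" is counted as
    the same combination as "Aus cus" (only the terminal epithet is kept).
    """
    combinations = set()
    for r in records:
        epithets = r.epithets[-1:] if merge_ranks else r.epithets
        combinations.add((r.genus, epithets))
    return {
        "name_strings": len({r.name_string for r in records}),
        "genus_protonyms": len({r.genus for r in records}),
        "species_protonyms": len({r.protonym for r in records}),
        "combinations": len(combinations),
        "orthographic_variants": sum(r.variant for r in records),
    }


if __name__ == "__main__":
    print(tally(RECORDS))
    print(tally(RECORDS, merge_ranks=True))
```

The `merge_ranks` switch is the "3 or 4 combinations" choice made explicit: the same eight strings give different counts depending on which definition is picked.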

It should be relatively easy to get past this hurdle for “names”. But the difficulty for doing the same for “concepts” (aka “taxa”, aka “circumscriptions”) is vastly greater. Given that concepts/taxa are the raison d'être for CoL, it’s sort of important that we define (explicitly) what we mean by them.

Once we overcome the semantics/definitions challenge, then the rest should fall into place pretty intuitively.

On that point, I agree with many of your points, with some qualifications:

  • Names (however they are defined) linked to Literature (=References); and of course the converse of literature associated with an index of names (reciprocals of each other, achieved by the same index). BHL and ION were (and are) very much part of the conversation, so we should be alright there. Personally, I tend to agree that the literature should be driving this, with names as secondary links. But the “L” in “CoL” stands for “Life”, not “Literature”, so in the context of CoL-Plus, the names are kind of front & center by necessity (as a prelude to taxon concepts).
  • Heterotypic synonyms (Names linked to names): You’ll be happy to know that this is indeed one of the front-and-center aspects of the conversation in Woods Hole, and the plan to move things forward with CoL-Plus. I don’t think you’ll find any resistance there.
  • As for classification, EVERYONE in this space agrees that we don’t need yet another classification. The existing CoL management hierarchy (especially when extended down to families) is more than enough to fulfill the needs of CoL-Plus. ToL is WAY too granular for this context – that’s more in the realm of phylogenies. Eventually we need a unified system for mapping names/concepts to a phylogeny, and eventually ToL is the logical backbone for that. But CoL is coming at it from the leaves, not the trunk, and as such a ToL-type classification is way overkill.

One thing I am ABSOLUTELY in full agreement with you on is this: “… we've been at this a while, and the same old issues keep coming around again and again.” Having played this game since the 1980s, the problem has largely been due to the fact that we have never nailed down what exactly we mean by “names” and “concepts”, and until we do, we’ll continue to talk past each other, and develop incompatible systems built on units of information defined in subtly or not-so-subtly different ways. If we can solve that (provided we define the objects in a way that they allow us to index the information we want to index), the rest will flow easy – I’m sure of it.

That was my soapbox rant….

@mdoering
Member

mdoering commented May 10, 2017

@rdmpage Linking names to literature and tracking recombinations and homotypic synonyms is exactly one of the 2 main goals we try to achieve. And opening up the editing and data to the entire community, not just a few selected.

I very much agree literature or better evidence is critical here. We like to anchor all assertions to some source. But I fail to see your point that names are simply annotations and only literature is what matters. We want a catalogue of names and taxa as these are used in biodiversity information. Not primarily the literature.

@rdmpage
Author

rdmpage commented May 10, 2017

@deepreef So I guess I would make this really simple. Names are what are currently stored in nomenclators, many of which have been running for a while, have persistent identifiers for names (LSIDs) and serve RDF using the vocabulary @rogerhyam put together a decade ago. So, nomenclators have essentially solved the name problem (all the name variants are just fluff that makes searching "interesting").

Concepts are what the classifications represent. GBIF has concepts, ultimately definable by the set of occurrences that are linked to leaves of the subtree rooted on whatever taxon you pick in the tree. Likewise NCBI has concepts, ultimately defined by the set of sequences linked to those nodes.

Link the concept to the name (e.g., as in the TDWG LSID vocabulary) and we have a usable system. Link the names to identifiers for the literature and we have a system that is linked to evidence, plus it is linked to the people doing the work (i.e., the authors of those taxonomic publications).
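As a rough sketch of that minimal core: a concept is taken to be the set of occurrences attached to the leaves under a node, and each node is linked to a name identifier, which in turn is linked to a publication identifier. Every identifier, the toy tree, and the function names below are made up for illustration; they are not GBIF, NCBI, or TDWG data or APIs.

```python
# Toy classification: parent -> children (leaves have no children).
CHILDREN = {
    "Aus": ["Aus bus", "Aus cus"],
    "Aus bus": [],
    "Aus cus": [],
}

# Occurrence records attached to leaf taxa (hypothetical identifiers).
OCCURRENCES = {
    "Aus bus": {"occ-1", "occ-2"},
    "Aus cus": {"occ-3"},
}

# Concept -> name identifier, and name identifier -> publication identifier.
NAME_FOR_TAXON = {"Aus": "name:1", "Aus bus": "name:2", "Aus cus": "name:3"}
PUBLICATION_FOR_NAME = {"name:1": "pub:A", "name:2": "pub:A", "name:3": "pub:B"}


def concept_extent(taxon):
    """Return the occurrences linked to the leaves of the subtree at taxon."""
    children = CHILDREN.get(taxon, [])
    if not children:
        return set(OCCURRENCES.get(taxon, set()))
    extent = set()
    for child in children:
        extent |= concept_extent(child)
    return extent


def evidence_for(taxon):
    """Follow concept -> name -> publication for one node in the tree."""
    return PUBLICATION_FOR_NAME[NAME_FOR_TAXON[taxon]]


if __name__ == "__main__":
    print(sorted(concept_extent("Aus")))
    print(evidence_for("Aus cus"))
```

Note that under this definition two classifications with differently placed occurrences yield different concepts for the same name, which is the point at which the "content versus context" question discussed elsewhere in this thread comes in.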

We've had the basics in place for a decade, why do we insist on making this harder than it has to be? I actually think to make progress you want to go in the opposite direction from thinking of all the different permutations of "names" and have a simple, focussed core.

@rdmpage
Author

rdmpage commented May 10, 2017

@mdoering I'm pleased I can still surprise you ;) I'm playing devil's advocate, a little, but I think one reason we are in this mess is that we have divorced names from context. There are historical reasons for doing this, very much like the way libraries handled digitisation: first we digitise the catalogue, because that's a manageable abstraction, then we digitise the actual books, which is vastly more useful (because we actually want to read the books, not the index cards).

I guess all I'm suggesting is that before rehashing endless discussions about "what is a name" etc., it might be worthwhile thinking whether a list of names is really what we need, or whether it's a legacy of the days when compiling catalogues was essential as libraries were scarce and physically remote. This is no longer the case, so are we making lists because that is what we've always done?

@deepreef
Collaborator

deepreef commented May 10, 2017

@rdmpage "Simple" is in the eye of the beholder. There are many different representations of a "name" in different nomenclators. This is why it's not a simple task to aggregate the nomenclators. Their objects are defined differently. Botanical and Zoological nomenclators have different meanings for a "name". We don't lack definitions -- we have plenty of them. It's just a matter of deciding which one(s) are the most useful. The vocabulary that @rogerhyam put together a decade ago didn't serve the needs. TCS created the main elements, but people didn't adopt it (perhaps because it's an XML schema?). I would turn to TCS as the starting point, and then tweak that as needed. We created a glossary as part of TCS, but I have not been able to find it (I don't think it's available online anymore). If only it were true that nomenclators have solved the problem. In fact, they have helped perpetuate the problem.

There is a big discussion on issue #6 concerning concepts. Several of us define a concept by its content (=circumscription), but @ThierryBourgoin prefers to include context (classification) as part of the definition of a concept.

Don't misunderstand me: I do not advocate defining all the permutations of "name". I advocate we all pick the SAME definition and work with that. We're all in agreement that "simple" is the solution. But I'm not sure we all have the same idea of what is simple and what is complex.

@deepreef
Collaborator

@rdmpage Let me restate that last bit a different way: the goal here is not to come up with dozens of different definitions of what a "name" might be. The goal here is to recognize that a major impediment to progress on all this stuff during the past few decades is that we keep talking past each other when we talk about names and concepts and such. We all know we want to link it together, but the existing data out there is largely mutually incompatible because of subtle or not-so-subtle differences in what everyone means by "name". We hashed all this out for TCS, but it's been mostly ignored. Again, we don't want to have endless discussions about "what a name is". Rather, we want to make sure we're all speaking the same language before we end up having endless (and useless) discussions that don't lead to practical solutions because we continue to talk past each other.

@rdmpage
Author

rdmpage commented May 10, 2017

@deepreef Maybe this document Taxon Concept Schema - User Guide is the one you mean?

@deepreef and @mdoering I realise that we have somewhat different views as to what needs to be done, and there's probably not a lot of point me repeating my views - a combination of "do it already" and "well, this sucks" ;) So, rather than distract you I'll revert to lurking and following progress.

@rogerhyam

Happily I no longer have a PTS episode when I see the words Aus bus L. ! I've moved to pastures where I can think of biological names just like any other words. But I am reminded of a rambling talk I watched the other day and the phrase: "we got out of the rules business"

https://youtu.be/COSXg5HKaO4?t=12m50s

But as in our field I can't say what "a good outcome" is - so ML/AI probably not appropriate!

Good luck with the project.

@deepreef
Collaborator

@rdmpage Thanks for sharing the TCS Guide, but what I was referring to was a Wiki page that had a large and detailed glossary fleshed out by Roger, Jessie, James Ytow, myself and a bunch of others, with precise terms for all flavors of "Names" and "Concepts" and stuff. There were things like "CanonicalNameString", "NameStringWithAuthors", "NameStringWithAuthorAndYear", etc. In any case I don't want to repeat that exercise; but instead just want to make sure we're not developing a system with parts designed for apples, and other parts designed for oranges.

In any case, as I said earlier, your views are actually very similar to mine, especially with regard to the Literature side. During the TCS days, one of my mantras was "Taxon Names are meaningless except in the context of literature". So many of our systems are built around "names" as stand-alone entities, but all the information we care about (taxon concepts, identifications of specimens/observations, and pretty much everything else) only has meaning in the context of "Name sec. Authors". And of course, this is even more true of "Taxon Concepts". This is why for decades I have been harping on the notion of a Taxon Name Usage (TNU) (or "Treatment", if you prefer PLAZI terminology) as the core entity around which everything else we're trying to do in this space revolves. Persistent, actionable identifiers assigned to TNUs/Treatments are the common currency that connects everything we're trying to do in an unambiguous way. Besides our semantics problem, one of the reasons we have failed to build the ultimate functional names/concept system is that we too often try to short-circuit a solution through treating "names" or "taxon concepts" as stand-alone objects with clouds of abstract properties around them, without anchoring to specific literature-based instances of those names and concepts. My greatest fear for CoL-Plus is we repeat this mistake yet again.
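To make the TNU idea concrete, here is a minimal sketch in which the identified unit is a name as it appears in one specific reference ("Name sec. Authors"), and other records point at that unit rather than at a bare name. The class names, fields, identifiers, and the "Jones 1900" reference are all hypothetical illustrations, not the TCS schema or any CoL-Plus data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    ref_id: str     # persistent identifier for the publication (made up)
    citation: str   # short author/year citation


@dataclass(frozen=True)
class TaxonNameUsage:
    tnu_id: str            # persistent identifier for this usage (made up)
    name_string: str       # the name as it appears in the reference
    reference: Reference   # the literature instance anchoring the usage

    def label(self):
        """Render the usage in "Name sec. Authors" form."""
        return f"{self.name_string} sec. {self.reference.citation}"


LINNAEUS = Reference("ref:1", "Linnaeus 1758")
JONES = Reference("ref:2", "Jones 1900")  # hypothetical later treatment

USAGES = [
    TaxonNameUsage("tnu:1", "Aus bus", LINNAEUS),
    TaxonNameUsage("tnu:2", "Aus bus", JONES),
]

# An identification points at a usage, not at a free-floating name string.
IDENTIFICATIONS = {"specimen:42": "tnu:2"}


def usages_of(name_string):
    """Return every literature-anchored usage of one name string."""
    return [u for u in USAGES if u.name_string == name_string]


if __name__ == "__main__":
    for usage in usages_of("Aus bus"):
        print(usage.tnu_id, usage.label())
```

The same name string yields two distinct usages here, so a specimen identified as "Aus bus sec. Jones 1900" is not silently conflated with the original 1758 usage.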

So personally, I'm very glad to have your input on this, because we're mostly singing the same tune here. But we're at risk of running around endlessly in circles if we don't converge on some clear and unambiguous definitions of the "things" we want to cross-link informatically. If you get a chance, take a look at my comment on the "Things" in #6. Ignore Things 1 & 2, which are about Authors and Names. Thing 3 is what you call Literature (and I call References). Thing 4 is a NameString sensu @dimus and GNI, Things 5 & 6 are both TNUs, and Things 7 & 8 are about Taxon Concepts (normally I would only have one, but this conversation has revealed a potential need for distinguishing circumscriptions (content) from something that includes classifications (context) as part of the Concept).

I think we're making genuine progress! :-)
