Merge branch 'development'
jlpereira committed May 24, 2024
2 parents 934138b + 9eea468 commit a787ced
Showing 2 changed files with 12 additions and 3 deletions.
3 changes: 3 additions & 0 deletions docs/about/glossary.md
@@ -22,6 +22,9 @@ Fields prefixed in `cached_` are auto-generated by TaxonWorks based on other val
### Context sensitive
Something that appears similar at the outset (e.g. an Icon), but changes in behaviour or appearance given where and when it is encountered in the application.

### Fidelity
In this context, we want to talk about and focus on data **fidelity**. While we often refer to data _quality_, we note that in reality, quality as a categorical goal proves quite hard to define. It's subjective and "it depends" on many other factors. With fidelity as the goal, we can seek to ensure the data are as fit as possible (e.g. formatted as expected, compliant with relevant standards) and that their _completeness_ can be better understood or visualized. With fidelity as the goal, others can then determine if said data are fit-for-use for their research / data management needs and questions. We recognize and appreciate this considered nuance in terminology as shared by Erica Krimmel at TaxonWorks Together 2024 in our Data Quality Round Table Conversation.

### Hot keys
Typing a combination of keys to trigger a behaviour in the [UI](/about/glossary#UI). Universal hot keys include concepts like `ctrl-c` for "Copy text to clipboard". TaxonWorks has numerous hot key combinations that speed tasks.

12 changes: 9 additions & 3 deletions docs/guide/data-quality.md
@@ -4,19 +4,25 @@ sidebarPosition: 55

# Data Quality Help and Hints

_Curating data to best support reproducible and [FAIR](https://en.wikipedia.org/wiki/FAIR_data) use means we all need ways to address data quality (e.g. completeness, consistency, compliance). We note **Quality**, as a abstract and rather subject term, proves difficult to pin down. **Fidelity*** may prove more a more tractable term. Here we gather our collective tips on defining, finding, fixing (and preventing) some of the more common issues._
_Curating data to best support reproducible and [FAIR](https://en.wikipedia.org/wiki/FAIR_data) use means we all need ways to address data quality (e.g. completeness, consistency, compliance). We note **Quality**, as an abstract and rather subjective term, proves difficult to pin down. **Fidelity** may prove a more tractable term. Here we gather our collective tips on defining, finding, fixing (and preventing) some of the more common issues._

## Rationale and Background
Our TW Philosophy on _data quality_ or _fidelity_: we try to build in methods to prevent issues in the first place. Where we know they can happen, we try to build in tools to help you both find and fix them. We also plan further development to extend our `soft validation` tools which will discover issues for you and offer to fix them `on click`. Note that when, where, and how you find any data anomalies will vary. And in turn, this influences the options and methods for fixing them (e.g. one-by-one, bulk annotation, scripts). For example, you might notice issues when:
- cleaning data up in a spreadsheet _before_ upload to any CMS
- exploring your exported data with tools like OpenRefine, or via R, or via another API
- looking at feedback from another source (e. g. GBIF or iDigBio or ALA or OBIS or [Bionomia](https://bionomia.net/))
- someone on the internet sees something and contacts you
- perusing data already in your own database
- mapping your data to migrate to another database or share with an aggregator
- using your database _data visualization_ tools to see _distinct values_ in a given field (e. g. Project vocabulary task in TaxonWorks) or on a map. See also [Distinct Values - Why This Data Directory?](https://github.com/tdwg/dwc-qa/tree/master/data)
- using your database _data visualization_ tools to see _distinct values_ in a given field (e. g. `Project vocabulary task` in TaxonWorks) or on a map.
- See also [Distinct Values - Why This Data Directory?](https://github.com/tdwg/dwc-qa/tree/master/data)
- reviewing your software repository issue-tracker (e.g. [GitHub for TaxonWorks](https://github.com/SpeciesFileGroup/taxonworks/issues))

In structuring these hints, we group the known issues into categories: `Identifiers` (e .g. catalog numbers), `Time` (e. g. dates), `Place` (aka geography, location), `Taxon`, and `Other` and `Tools and Resources`
As a co-organizer of a Workshop at [Digital Data 8](https://digitaldata2024.sched.com/) called **Data cleaning for maximum impact: Tools and workflows for data providers to efficiently find and fix data quality issues** we created this resource. Other co-organizers did likewise and the resulting cross-platform page can be found at iDigBio: [Data Quality Toolkit 2024](https://www.idigbio.org/wiki/index.php/Data_Quality_Toolkit_2024). Each section below is linked to its corresponding topic on that iDigBio page.

To _extend the value and scope of this work_, in each section below, we link to the work of the [Biodiversity Information Standards (TDWG) Biodiversity Data Quality Task Group (BDQ)](https://github.com/tdwg/bdq). We list the BDQ tests relevant to each issue, where they exist. We gratefully acknowledge the efforts of this TDWG Task Group and the contributions and conversations with Paul Morris and Lee Belbin in figuring out how to do this. Special thanks to Paul Morris for work done to map the BDQ tests to the specific data quality issues highlighted in this workshop and on this page. With these connections, we hope to enhance the software developer's vision and work to connect the BDQ tests to the CMS functionality around preventing, finding, and fixing these types of issues.

In structuring these hints, we group the known issues into categories: `Identifiers` (e.g. catalog numbers), `Time` (e.g. dates), `Place` (aka geography, location), `Taxon`, `Other`, and `Tools and Resources`.

## Identifiers
