diff --git a/joss.07305/10.21105.joss.07305.crossref.xml b/joss.07305/10.21105.joss.07305.crossref.xml new file mode 100644 index 0000000000..f4a21c2632 --- /dev/null +++ b/joss.07305/10.21105.joss.07305.crossref.xml @@ -0,0 +1,330 @@ + + + + 20241022153929-98112d4d82276665812a943084d254df980beb4e + 20241022153929 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org + + + + + 10 + 2024 + + + 9 + + 102 + + + + harmonize-wq: Standardize, clean and wrangle Water +Quality Portal data into more analytic-ready formats + + + + Justin + Bousquin + https://orcid.org/0000-0001-5797-4322 + + + Cristina A. + Mullin + https://orcid.org/0000-0002-0615-6087 + + + + 10 + 22 + 2024 + + + 7305 + + + 10.21105/joss.07305 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.13356847 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/7305 + + + + 10.21105/joss.07305 + https://joss.theoj.org/papers/10.21105/joss.07305 + + + https://joss.theoj.org/papers/10.21105/joss.07305.pdf + + + + + + tbeptools: An R package for synthesizing +estuarine data for environmental research + Beck + Journal of Open Source +Software + 65 + 6 + 10.21105/joss.03485 + 2021 + Beck, S., M. W., & Best, B. D. +(2021). tbeptools: An R package for synthesizing estuarine data for +environmental research. Journal of Open Source Software, 6(65), 3485. +https://doi.org/10.21105/joss.03485 + + + A Web‐Based Decision Support System for +Assessing Regional Water‐Quality Conditions and Management +Actions + Booth + Journal of the American Water Resources +Association + 5 + 47 + 10.1111/j.1752-1688.2011.00573.x + 2011 + Booth, E., N. L., & Murphy, L. +(2011). A Web‐Based Decision Support System for Assessing Regional +Water‐Quality Conditions and Management Actions. 
Journal of the American +Water Resources Association, 47(5), 1136–1150. +https://doi.org/10.1111/j.1752-1688.2011.00573.x + + + Discrete Global Grid Systems as scalable +geospatial frameworks for characterizing coastal +environments + Bousquin + Environmental Modelling & +Software + 146 + 10.1016/j.envsoft.2021.105210 + 2021 + Bousquin, J. (2021). Discrete Global +Grid Systems as scalable geospatial frameworks for characterizing +coastal environments. Environmental Modelling & Software, 146, +105210. +https://doi.org/10.1016/j.envsoft.2021.105210 + + + HyRiver: Hydroclimate Data +Retriever + Chegini + Journal of Open Source +Software + 66 + 6 + 10.21105/joss.03175 + 2021 + Chegini, T., Li, H.-Y., & Leung, +L. R. (2021). HyRiver: Hydroclimate Data Retriever. Journal of Open +Source Software, 6(66), 1–3. +https://doi.org/10.21105/joss.03175 + + + dataRetrieval: R packages for discovering and +retrieving water data available from U.S. federal hydrologic web +services + De Cicco + 10.5066/P9X4L3GE + 2022 + De Cicco, L. A., Lorenz, D., Hirsch, +R. M., Watkins, W., & Johnson, M. (2022). dataRetrieval: R packages +for discovering and retrieving water data available from U.S. federal +hydrologic web services (Version 2.7.12) [Computer software]. U.S. +Geological Survey; U.S. Geological Survey. +https://doi.org/10.5066/P9X4L3GE + + + Linking mountaintop removal mining to water +quality for imperiled species using satellite data + Evans + PloS one + 11 + 16 + 10.1371/journal.pone.0239691 + 2021 + Evans, K., M. J., & Malcom, J. W. +(2021). Linking mountaintop removal mining to water quality for +imperiled species using satellite data. PloS One, 16(11), e0239691. +https://doi.org/10.1371/journal.pone.0239691 + + + Pint: Operate and manipulate physical +quantities in Python + Grecco + 2021 + Grecco, H., & Chéron, J. (2021). +Pint: Operate and manipulate physical quantities in Python (Version +1.9). 
https://github.com/hgrecco/pint + + + dataretrieval (Python): a Python package for +discovering and retrieving water data available from U.S. federal +hydrologic web services + Hodson + 10.5066/P94I5TX3 + 2023 + Hodson, H., T. O., & Horsburgh, +J. S. (2023). dataretrieval (Python): a Python package for discovering +and retrieving water data available from U.S. federal hydrologic web +services (Version 1.0.2). U.S. Geological Survey; U.S. Geological +Survey. https://doi.org/10.5066/P94I5TX3 + + + geopandas/geopandas: v0.10.2 + Kelsey Jordahl + 10.5281/zenodo.5573592 + 2021 + Kelsey Jordahl, M. F., Joris Van den +Bossche, & Wasser, L. (2021). geopandas/geopandas: v0.10.2 (Version +v0.10.2). Zenodo. +https://doi.org/10.5281/zenodo.5573592 + + + Transport of N and P in US streams and rivers +differs with land use and between dissolved and particulate +forms + Manning + Ecological Applications + 30 + 10.1002/eap.2130 + 2020 + Manning, R., D. W., & Kominoski, +J. S. (2020). Transport of N and P in US streams and rivers differs with +land use and between dissolved and particulate forms. Ecological +Applications, 30, p.e02130. +https://doi.org/10.1002/eap.2130 + + + Water quality data for national‐scale aquatic +research: The Water Quality Portal. + Read + Water Resources Research + 53 + 10.1002/2016WR019993 + 2017 + Read, C., E. K., & Winslow, L. A. +(2017). Water quality data for national‐scale aquatic research: The +Water Quality Portal. Water Resources Research, 53, 1735–1745. +https://doi.org/10.1002/2016WR019993 + + + AquaSat: A data set to enable remote sensing +of water quality for inland waters + Ross + Water Resources Research + 55 + 10.1029/2019WR024883 + 2019 + Ross, T., M. R., & Pavelsky, T. +M. (2019). AquaSat: A data set to enable remote sensing of water quality +for inland waters. Water Resources Research, 55, 10012–10025. 
+https://doi.org/10.1029/2019WR024883 + + + Three Principles to Use in Streamlining Water +Quality Research through Data Uniformity + Shaughnessy + Environmental Science & +Technology + 53 + 10.1021/acs.est.9b06406 + 2019 + Shaughnessy, W., A. R., & +Brantley, S. L. (2019). Three Principles to Use in Streamlining Water +Quality Research through Data Uniformity. Environmental Science & +Technology, 53, 13549–13550. +https://doi.org/10.1021/acs.est.9b06406 + + + Estimating nitrogen and phosphorus +concentrations in streams and rivers, within a machine learning +framework + Shen + Scientific Data + 7 + 10.1038/s41597-020-0478-7 + 2020 + Shen, A., L. Q., & Domisch, S. +(2020). Estimating nitrogen and phosphorus concentrations in streams and +rivers, within a machine learning framework. Scientific Data, 7, 161. +https://doi.org/10.1038/s41597-020-0478-7 + + + Challenges with secondary use of multi-source +water-quality data in the United States + Sprague + Water Research + 110 + 10.1016/j.watres.2016.12.024 + 2017 + Sprague, O., L. A., & Argue, D. +M. (2017). Challenges with secondary use of multi-source water-quality +data in the United States. Water Research, 110, 252–261. +https://doi.org/10.1016/j.watres.2016.12.024 + + + Tidy data + Wickham + The Journal of Statistical +Software + 59 + 10.18637/jss.v059.i10 + 2014 + Wickham, H. (2014). Tidy data. The +Journal of Statistical Software, 59, 252–261. +https://doi.org/10.18637/jss.v059.i10 + + + WQX Web API + 2018 + WQX Web API. (2018). [Computer +software]. U.S. Environmental Protection Agency, Office of Water; U.S. +Environmental Protection Agency. +https://www.epa.gov/sites/default/files/2018-09/documents/wqx_web_application_programming_interface_api.pdf + + + WQX web user guide + 2020 + WQX web user guide (Version 3.0). +(2020). [Computer software]. U.S. Environmental Protection Agency, +Office of Water; U.S. Environmental Protection Agency. 
+https://www.epa.gov/sites/default/files/2020-03/documents/wqx_web_user_guide_v3.0.pdf + + + + + + diff --git a/joss.07305/10.21105.joss.07305.pdf b/joss.07305/10.21105.joss.07305.pdf new file mode 100644 index 0000000000..3747a5fe05 Binary files /dev/null and b/joss.07305/10.21105.joss.07305.pdf differ diff --git a/joss.07305/paper.jats/10.21105.joss.07305.jats b/joss.07305/paper.jats/10.21105.joss.07305.jats new file mode 100644 index 0000000000..4c7894299d --- /dev/null +++ b/joss.07305/paper.jats/10.21105.joss.07305.jats @@ -0,0 +1,515 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +7305 +10.21105/joss.07305 + +harmonize-wq: Standardize, clean and wrangle Water +Quality Portal data into more analytic-ready formats + + + +https://orcid.org/0000-0001-5797-4322 + +Bousquin +Justin + + + + +https://orcid.org/0000-0002-0615-6087 + +Mullin +Cristina A. + + + + + +U.S. Environmental Protection Agency, Gulf Ecosystem +Measurement and Modeling Division, Gulf Breeze, FL 32561 + + + + +U.S. Environmental Protection Agency, Watershed +Restoration, Assessment and Protection Division, Washington, D.C. +20460 + + + + +20 +12 +2023 + +9 +102 +7305 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +water quality +data set analysis + + + + + + Summary +

The U.S. EPA’s Water Quality Exchange (WQX) allows state + environmental agencies, the EPA, other federal agencies, universities, + private citizens, and other organizations to provide water quality, + biological, and physical data + (Read + & Winslow, 2017). The Water Quality Portal (WQP) is a data + warehouse that facilitates access to data stored in large water + quality databases, including WQX, in a common format. WQP has become + an essential resource with tools to facilitate both data publishing + (WQX + Web API, 2018; + WQX + Web User Guide, 2020) and data retrieval + (De + Cicco et al., 2022; + Hodson + & Horsburgh, 2023). However, given the variety of data + originators and methods, using the data in analysis often requires + cleaning to ensure it meets required quality standards and wrangling + to get it into a more analytic-ready format. Although there are many + examples where this data cleaning or wrangling has been performed + (Bousquin, + 2021; + Evans + & Malcom, 2021; + Manning + & Kominoski, 2020; + Ross + & Pavelsky, 2019; + Shen + & Domisch, 2020), standardized tools to perform this task + will make it less time-intensive, more consistent, and more + reproducible. Consistent data cleaning and wrangling also allows + easier integration of outputs into other tools in the water quality + data pipeline, e.g., hydrologic analysis + (Chegini + et al., 2021), dashboards for visualization + (Beck + & Best, 2021), or decision support tools + (Booth + & Murphy, 2011).

+
+ + Statement of need +

Because data originators are diverse, metadata quality varies + and can pose significant challenges that prevent WQP from being used as + an analysis-ready data set + (Shaughnessy + & Brantley, 2019; + Sprague + & Argue, 2017). Recognizing that the definition of + ‘analysis-ready’ varies depending on the analysis, our goal with + harmonize-wq is to provide a robust, flexible, water-quality-specific + framework that helps the data analyst identify differences in data + units and in sampling or analytic methods, and resolve data errors using + transparent assumptions. Domain experts must decide which data meet + their quality standards for data comparability and set any thresholds for + acceptance or rejection.

+
+ + Current Functionality +

WQP is intended to be flexible in how data providers structure + their data, what data they provide, and what metadata is associated + with the data. The harmonize-wq package does not identify results for + rejection, but it does use a QA column to flag results that were altered. + The package uses the metadata available to clean characteristic data + into usable, comparable measures. Four data characteristics are the + focus for cleaning the data:

+ + +

Measure – If missing (NAN) or not the correct data type, e.g., + non-numeric and non-categorical, it cannot be used in + analysis.

+
+ +

Sample Fraction – A measure for a given WQP characteristic, + e.g., Phosphorus, may differ in the sample fraction analyzed, + e.g., filtered, dissolved, organic, or inorganic. Where these + differences may make measures incomparable to one another, results are split + into sample-fraction-specific columns.

+
+ +

Speciation/Basis/Standards – A measure for a given WQP + characteristic, e.g., Nitrogen, may differ in the + molecular basis measured, e.g., ‘as NO3’ vs. ‘as N’. Likewise, + some measures will differ depending on sample conditions, such as + temperature and pressure. Since these differences alter the + comparability of results, they are moved to the appropriate column + for consideration in conversions and analyst decisions.

+
+ +

Units – Units of measure are converted using Pint + (Grecco + & Chéron, 2021). To facilitate this, harmonize-wq + defines new units, e.g., ‘NTU’ for turbidity, and updates WQP + units for recognition by Pint, e.g., ‘deg C’ for water temperature + is updated to ‘degC’. Where units are missing (NAN) or + unrecognized, the package assumes standard or + user-specified units and adds a flag to the QA column. Pint + contexts are used to change dimensionality of units, e.g., from + mg/l (mass/volume) to g/kg of water (dimensionless), before final + conversion. Some additional custom conversions were added, e.g., + dissolved oxygen percent saturation to concentration in mg/l. When + a unit is falsely recognized, e.g., ‘deg c’ recognized as degree * + speed of light, it will typically result in a dimensionality error + during conversion. By default, conversion issues raise an error, + but the user has the option to suppress that error, replacing the + result with either the unconverted value or NAN.

+
+
+
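The unit handling described in the last item can be sketched roughly as follows. This is a minimal illustration in plain Python, not the package's implementation: harmonize-wq delegates parsing and conversion to Pint, whereas here the lookup tables, the function name `harmonize_concentration`, and the `errors` option names are all hypothetical stand-ins. For brevity, the sketch treats an unrecognized unit as a conversion issue rather than assuming a default for it.

```python
import math

# Hypothetical lookup tables for illustration only; harmonize-wq itself
# relies on Pint for unit parsing and conversion.
UNIT_ALIASES = {"mg/l": "mg/L", "ug/l": "ug/L", "g/l": "g/L"}
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 0.001, "g/L": 1000.0}


def harmonize_concentration(value, unit, default_unit="mg/L", errors="raise"):
    """Return (value in mg/L, QA flag) for a single result.

    errors: "raise" (default), "ignore" (keep the unconverted value),
    or "skip" (return NaN). These option names are assumptions.
    """
    flags = []
    if unit is None or not unit.strip():
        # Missing unit: assume the default and record that in the QA flag.
        unit = default_unit
        flags.append(f"unit missing, assumed {default_unit}")
    # Normalize spelling variants to one canonical unit name.
    unit = UNIT_ALIASES.get(unit.strip(), unit.strip())
    factor = TO_MG_PER_L.get(unit)
    if factor is None:
        # Conversion issue: raise by default, or suppress and flag.
        if errors == "raise":
            raise ValueError(f"cannot convert unit {unit!r} to mg/L")
        flags.append(f"unit {unit!r} not converted")
        converted = value if errors == "ignore" else math.nan
    else:
        converted = value * factor
    return converted, "; ".join(flags)
```

Returning the flag alongside the value mirrors the idea of a QA column: the result is never silently altered, so the analyst can decide later whether an assumed unit is acceptable.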

In addition to cleaning characteristic results, the package also + harmonizes metadata defining the observation. These metadata include + site location, where geopandas + (Kelsey + Jordahl & Wasser, 2021) transforms points to a consistent + datum, and time of observation, where dataretrieval + (Hodson + & Horsburgh, 2023) interprets the time zone.

+

Data wrangling involves reducing the complexity of the data to make + it more accessible and reshaping the data for use in analysis. The WQP + data format is complex, with each row corresponding to a specific + result for a specific characteristic and many columns for metadata + specific to that result. The harmonize-wq package reshapes the table + to loosely adhere to tidy principles + (Wickham, + 2014), where each variable forms a column (i.e., one + characteristic per column) and each observation forms a row (i.e., one + row per site and time stamp). Because WQP has many result-specific + metadata columns, the package has functions that differentiate these + by the original characteristic to avoid conflicts during reshaping, + e.g., ‘QA’ becoming ‘QA_Nitrogen’. Once the data have been cleansed and + result-specific columns differentiated, many of the original columns + can be removed. The package also has resources for entity resolution, + both for deduplication when one source has duplicate results during + reshaping (e.g., quality control or calibration samples) and when the + same result is reported by different sources after the table is + reshaped.
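The reshaping described above can be sketched with pandas. This is an illustration under stated assumptions rather than the package's own code: the column names (`site`, `time`, `characteristic`, `value`, `QA`) are simplified stand-ins for WQP fields, and `DataFrame.pivot` is used here only as one plausible way to arrive at one characteristic per column and one row per site and time stamp.

```python
import pandas as pd

# Toy long-format table: one row per result, as in WQP.
# Column names are simplified stand-ins, not actual WQP field names.
long_df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "time": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"],
    "characteristic": ["Nitrogen", "Phosphorus", "Nitrogen", "Nitrogen"],
    "value": [1.2, 0.05, 0.9, 0.9],
    "QA": ["", "unit assumed", "", ""],
})

# Remove exact duplicate results first; pivot() raises an error when
# the same site/time/characteristic combination appears more than once.
long_df = long_df.drop_duplicates()

# Reshape so each characteristic becomes a column and each
# site/time combination becomes a row.
wide = long_df.pivot(
    index=["site", "time"],
    columns="characteristic",
    values=["value", "QA"],
)

# Flatten the two-level column index: measure columns take the
# characteristic name, and result-specific metadata keep a prefix
# (e.g., 'QA' becomes 'QA_Nitrogen').
wide.columns = [
    name if field == "value" else f"{field}_{name}"
    for field, name in wide.columns
]
wide = wide.reset_index()
```

Site B has no Phosphorus result in this toy table, so its `Phosphorus` and `QA_Phosphorus` cells are left missing after the reshape instead of being filled in.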

+
+ + Disclaimer +

The views expressed in this article are those of the authors and do + not necessarily represent the views or policies of the U.S. + Environmental Protection Agency. Any mention of trade names, products, + or services does not imply endorsement by the U.S. Government or the + U.S. Environmental Protection Agency. The EPA does not endorse any + commercial products, services, or enterprises.

+

This contribution is identified by tracking number ORD-056806 of + the U.S. Environmental Protection Agency, Office of Research and + Development, Center for Environmental Measurement and Modeling, Gulf + Ecosystem Measurement and Modeling Division.

+
+ + Acknowledgments +

Many people have contributed in various ways to the development of + harmonize-wq. We are grateful to Rosmin Ennis, Farnaz Nojavan Asghari, + Marc Weber, Catherine Birney, Lisa M. Smith and Elizabeth George for + their early reviews of this paper.

+
+ + + + + + + + BeckSchrandtM. W. + BestB. D. + + tbeptools: An R package for synthesizing estuarine data for environmental research + Journal of Open Source Software + 2021 + 6 + 65 + https://doi.org/10.21105/joss.03485 + 10.21105/joss.03485 + 3485 + + + + + + + BoothEvermanN. L. + MurphyL. + + A Web‐Based Decision Support System for Assessing Regional Water‐Quality Conditions and Management Actions + Journal of the American Water Resources Association + 2011 + 47 + 5 + https://doi.org/10.1111/j.1752-1688.2011.00573.x + 10.1111/j.1752-1688.2011.00573.x + 1136 + 1150 + + + + + + BousquinJ. + + Discrete Global Grid Systems as scalable geospatial frameworks for characterizing coastal environments + Environmental Modelling & Software + 2021 + 146 + https://doi.org/10.1016/j.envsoft.2021.105210 + 10.1016/j.envsoft.2021.105210 + 105210 + + + + + + + CheginiTaher + LiHong-Yi + LeungL. Ruby + + HyRiver: Hydroclimate Data Retriever + Journal of Open Source Software + 202110 + 6 + 66 + 10.21105/joss.03175 + 1 + 3 + + + + + + De CiccoLaura A. + LorenzDavid + HirschRobert M. + WatkinsWilliam + JohnsonMike + + dataRetrieval: R packages for discovering and retrieving water data available from U.S. federal hydrologic web services + U.S. Geological Survey; U.S. Geological Survey + Reston, VA + 2022 + https://code.usgs.gov/water/dataRetrieval + 10.5066/P9X4L3GE + + + + + + EvansKayM. J. + MalcomJ. W. + + Linking mountaintop removal mining to water quality for imperiled species using satellite data + PloS one + 2021 + 16 + 11 + https://doi.org/10.1371/journal.pone.0239691 + 10.1371/journal.pone.0239691 + e0239691 + + + + + + + GreccoH. + ChéronJ. + + Pint: Operate and manipulate physical quantities in Python + 2021 + https://github.com/hgrecco/pint + + + + + + HodsonHariharanT. O. + HorsburghJ. S. + + dataretrieval (Python): a Python package for discovering and retrieving water data available from U.S. federal hydrologic web services + U.S. Geological Survey; U.S. 
Geological Survey + 2023 + https://doi.org/10.5066/P94I5TX3 + 10.5066/P94I5TX3 + + + + + + Kelsey JordahlMartin FleischmannJoris Van den Bossche + WasserLeah + + geopandas/geopandas: v0.10.2 + Zenodo + 202110 + https://doi.org/10.5281/zenodo.5573592 + 10.5281/zenodo.5573592 + + + + + + ManningRosemondD. W. + KominoskiJ. S. + + Transport of N and P in US streams and rivers differs with land use and between dissolved and particulate forms + Ecological Applications + 2020 + 30 + https://doi.org/10.1002/eap.2130 + 10.1002/eap.2130 + p.e02130 + + + + + + + ReadCarrE. K. + WinslowL. A. + + Water quality data for national‐scale aquatic research: The Water Quality Portal. + Water Resources Research + 2017 + 53 + https://doi.org/10.1002/2016WR019993 + 10.1002/2016WR019993 + 1735 + 1745 + + + + + + RossToppM. R. + PavelskyT. M. + + AquaSat: A data set to enable remote sensing of water quality for inland waters + Water Resources Research + 2019 + 55 + https://doi.org/10.1029/2019WR024883 + 10.1029/2019WR024883 + 10012 + 10025 + + + + + + ShaughnessyWenA. R. + BrantleyS. L. + + Three Principles to Use in Streamlining Water Quality Research through Data Uniformity + Environmental Science & Technology + 2019 + 53 + https://doi.org/10.1021/acs.est.9b06406 + 10.1021/acs.est.9b06406 + 13549 + 13550 + + + + + + ShenAmatulliL. Q. + DomischS. + + Estimating nitrogen and phosphorus concentrations in streams and rivers, within a machine learning framework + Scientific Data + 2020 + 7 + https://doi.org/10.1038/s41597-020-0478-7 + 10.1038/s41597-020-0478-7 + 161 + + + + + + + SpragueOelsnerL. A. + ArgueD. M. + + Challenges with secondary use of multi-source water-quality data in the United States + Water Research + 2017 + 110 + https://doi.org/10.1016/j.watres.2016.12.024 + 10.1016/j.watres.2016.12.024 + 252 + 261 + + + + + + WickhamH. 
+ + Tidy data + The Journal of Statistical Software + 2014 + 59 + https://doi.org/10.18637/jss.v059.i10 + 10.18637/jss.v059.i10 + 252 + 261 + + + + + WQX Web API + U.S. Environmental Protection Agency, Office of Water; U.S. Environmental Protection Agency + Washington, DC + 2018 + https://www.epa.gov/sites/default/files/2018-09/documents/wqx_web_application_programming_interface_api.pdf + + + + + WQX web user guide + U.S. Environmental Protection Agency, Office of Water; U.S. Environmental Protection Agency + Washington, DC + 2020 + https://www.epa.gov/sites/default/files/2020-03/documents/wqx_web_user_guide_v3.0.pdf + + + + +