Challenge 20- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions #2

RubenRT7 · 2024-02-16T10:47:21Z

Challenge 20 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions

Stream 2 - Machine Learning for Earth Sciences applications

Goal

Develop machine learning solutions to bridge gaps in streamflow observations, enhancing the accuracy and reliability of hydrological data analysis and forecasting.

Mentors and skills

Mentors: Maliko Tanguy, Gwyneth Matthews, Mariana Clare, Cinzia Mazzetti (all ECMWF)
Skills required:
Essential:
- Python (numpy, pandas, xarray, ...)
- Machine learning (scikit-learn, Pytorch/Tensorflow)
- Visualisation (mapping, graphs)
Desirable:
- Time series analysis
- Open-source collaboration (Git)
Advantageous:
- Ability to create clear documentation / communication
- Basic hydrology understanding

Challenge description

Introduction
Operational flood forecasting systems like EFAS and GloFAS, part of the Copernicus Emergency Management Service (CEMS), play a pivotal role in providing advanced warnings for devastating flood events, significantly impacting societies worldwide. These systems must be reliable and accurate, making the assessment of forecast skill a critical aspect in gauging their trustworthiness and utility.
A major limitation in calibrating and evaluating these forecasting systems is the scarcity, quality, and incompleteness of observational data, particularly in areas where flood impacts are most severe. In addition, the calculation of some forecasting skill scores such as the Continuous Ranked Probability Skill Score (CRPSS) necessitates continuous time series, posing a challenge when data is unavailable or incomplete. Extending the time series also allows for the provision of reference or climatology values against which to compare forecasts, enhancing the robustness of the evaluation process.
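For reference, the usual textbook definitions of the CRPS and the CRPSS are sketched below. They are not taken from the challenge text. Here F is the forecast cumulative distribution, y is the observed value, and the overbar denotes an average over all forecast-observation pairs, so every missing observation removes a pair from that average.

```latex
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \bigl( F(x) - \mathbf{1}\{x \ge y\} \bigr)^{2} \, dx

\mathrm{CRPSS} = 1 - \frac{\overline{\mathrm{CRPS}}_{\text{forecast}}}{\overline{\mathrm{CRPS}}_{\text{reference}}}
```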
Building upon existing literature (e.g. [1,2,3]), various ML methods, such as Random Forests and LSTM models, have shown promise in gap-filling river flow data. However, a comprehensive understanding of their strengths and limitations is essential for informed implementation.

Project objectives
The primary objective is to explore different approaches to gap-fill observed daily streamflow time series, comparing their performance and determining the maximum length of gap that can be reliably filled. The project aims to implement these methods into an open-source software package based on Python, providing a user-friendly solution for filling gaps in observational datasets.
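As a starting point for reasoning about "maximum gap length", it helps to know how long the gaps in a record actually are. The snippet below is a minimal sketch, assuming a daily series held in a NumPy array with NaN marking missing days; the function name `gap_lengths` and the example values are made up for illustration.

```python
import numpy as np

def gap_lengths(flow):
    """Return the length of every run of consecutive NaNs in a 1-D series."""
    missing = np.isnan(np.asarray(flow, dtype=float))
    # Pad with False on both sides so every run has a detectable start and end.
    padded = np.concatenate(([False], missing, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # index where a gap begins
    ends = np.flatnonzero(edges == -1)    # index just past where it ends
    return ends - starts

# Illustrative values only, not real GRDC data.
flow = np.array([3.1, np.nan, np.nan, 2.8, 2.7, np.nan, 2.5, np.nan, np.nan, np.nan])
lengths = gap_lengths(flow)
completeness = 1.0 - np.isnan(flow).mean()  # fraction of days with an observation
print(lengths, completeness)
```

The distribution of these lengths across stations gives a sense of which gap sizes a method must handle to be useful in practice.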

Methodology

Data Collection:
Observed river flow data from GRDC and catchment average precipitation data from ERA5 will be provided for a subset of river gauging stations used in GloFAS and EFAS. The inclusion of remote sensing water level data could also be considered, with a focus on addressing associated challenges (e.g. data accuracy, resolution, temporal and spatial coverage).
Selection of methods:
Based on a brief review of existing literature on the topic, the team will select a few different statistical and ML methods to be implemented and compared. Proposals should focus on head catchments but ideas of how to manage nested catchments are also encouraged.
Coding and Implementation:
Open-source software, predominantly Python, will be used for the implementation of different gap-filling methods. The coding phase will be organised into milestones to ensure a systematic and timely execution of the project.
Evaluation and Comparison:
A comprehensive evaluation will be conducted, comparing different methods based on general performance and considering the size of the data gap. The team will develop strategies for assessing performance variations with an increase in gap size, providing valuable insights into the reliability of each method.
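One possible way to assess performance against gap size is to hide known values, fill them, and score the result. The sketch below assumes this artificial-gap design, uses linear interpolation purely as a placeholder method, and uses RMSE as the score; the synthetic series, the function names and the choice of 50 trials are all illustrative assumptions rather than part of the challenge.

```python
import numpy as np

def fill_linear(series):
    """Placeholder gap filler: linear interpolation across NaNs."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    ok = ~np.isnan(series)
    return np.interp(idx, idx[ok], series[ok])

def degradation_table(flow, gap_sizes, fill=fill_linear, n_trials=50, seed=0):
    """Median RMSE over the hidden days for each artificial gap size."""
    rng = np.random.default_rng(seed)
    table = {}
    for gap in gap_sizes:
        scores = []
        for _ in range(n_trials):
            # Keep at least one observed value on each side of the gap.
            start = rng.integers(1, flow.size - gap - 1)
            hidden = slice(start, start + gap)
            masked = flow.copy()
            masked[hidden] = np.nan
            filled = fill(masked)
            scores.append(np.sqrt(np.mean((filled[hidden] - flow[hidden]) ** 2)))
        table[gap] = float(np.median(scores))
    return table

# Synthetic, complete "flow" series: a seasonal cycle plus a faster oscillation.
t = np.arange(730)
flow = 10 + 5 * np.sin(2 * np.pi * t / 365) + 2 * np.sin(2 * np.pi * t / 20)
table = degradation_table(flow, gap_sizes=[1, 5, 10, 30])
for gap, rmse in table.items():
    print(f"gap = {gap:>2} days, median RMSE = {rmse:.3f}")
```

Any other method can be passed in through the `fill` argument, which keeps the comparison between methods on identical gaps.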

Expected outcome
The project’s final outcome will be a well-documented, user-friendly Python code available on GitHub, featuring one or several gap-filling options. Accompanying this code will be information on method performance, including the maximum reliable gap size and a degradation table detailing performance with increasing gap size, which will help users to select the best method for their data.

Stretch goals (optional)
Ready for an extra challenge? For those eager to push their limits, we offer optional stretch goals:

- Extend the application of these methods to temporally disaggregate time series to refine data resolution (e.g. from monthly to daily river flow data).
- Evaluate the use of these methods to extend time series, beyond gap-filling.

References
[1] Arriagada et al. (2021)
[2] Dariane & Borhan (2024)
[3] Ren et al. (2022)

RonT23 · 2024-03-17T20:04:41Z

Hello, we are interested in this challenge and have a few questions:

1. Will the dataset be provided upon (or if) acceptance of the proposal, or should we reference the dataset in the proposal?
2. Our proposal addresses the problem in a more abstract and general manner; do we need to provide more specific details?
3. Could you please provide an example dataset for us to study the features? We found something related to the problem at hand, but we are not sure that it is the right one.
Thank you in advance!

ecMaliko · 2024-03-18T10:13:29Z

Hi @RonT23 ,
Thank you for your interest and questions about our challenge!
Regarding questions 1 and 3: I'm checking with my fellow mentors about this and will let you know ASAP.
Regarding question 2: We’re looking for proposals with clear steps and milestones rather than abstract solutions. A more detailed plan will help ensure a tangible outcome by the end of the challenge. We are happy with some flexibility, as there will inevitably be some unexpected issues along the way, and some trial and error with some methodological issues. But within that flexibility, the more specifics you can provide, the better.

ecMaliko · 2024-03-19T11:54:26Z

Hi @RonT23
I can now confirm that we will be able to provide you with daily river flow observations at GRDC sites, and catchment-averaged ERA5 precipitation for these sites. I haven't checked the data, but I think we have a few thousand sites across the world, with variable lengths of record. We will prepare the data before the start of the challenge. We don't have any remote sensing data though. Therefore, if you are planning to use such data in your project, you would need to source it yourselves.

I could prepare some sample data for a few of these sites for you to explore, but it would take me a couple of days. Could you please confirm you would like me to prepare the sample data for you?

Best wishes,

Maliko

daniel-obrien · 2024-03-19T18:20:41Z

Hi Maliko,

It would be very helpful for us if you could prepare that data.

Thank you,
Daniel

RonT23 · 2024-03-19T21:21:43Z

Hi @ecMaliko,
We would appreciate it if you could provide us with a sample dataset!
Thank you,
Ronaldo T.

ecMaliko · 2024-03-20T11:44:28Z

Hi @RonT23 and @daniel-obrien ,
I have attached some sample data (100 stations) for you to explore the format and type of data that will be provided. The full dataset will have a few thousand stations. There is one netCDF file with observed discharge data, and another one with catchment-averaged precipitation data from ERA5. In addition, I have also included a CSV file with some additional metadata. The 'statid' variable in the netCDF files corresponds to the 'station_id_num' column in the metadata file.
Please note that the reference date in the precipitation file is different from the discharge file!
Let me know if you have any questions.
Regards,
Maliko

sample_data_code4earth.zip
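The note above about differing reference dates matters when the two files are aligned by hand. The sketch below shows the idea with NumPy only; the reference dates (`1970-01-01` and `1900-01-01`), the daily unit and the offsets are invented for illustration and should be read from the `units` attribute of each file's time variable. Libraries such as xarray normally decode CF-style time units on opening, in which case this step happens behind the scenes.

```python
import numpy as np

def decode_days(offsets, reference_date):
    """Convert 'days since <reference_date>' offsets to calendar dates."""
    ref = np.datetime64(reference_date, "D")
    return ref + np.asarray(offsets).astype("timedelta64[D]")

# Hypothetical time axes: same calendar period, different reference dates.
dis_time = decode_days([8, 9, 10, 11, 12], "1970-01-01")
pr_time = decode_days([25574, 25575, 25576, 25577, 25578, 25579], "1900-01-01")

# Align the two series on the dates they share.
common, dis_idx, pr_idx = np.intersect1d(dis_time, pr_time, return_indices=True)
print(common[0], common[-1], dis_idx, pr_idx)
```

Comparing raw offsets directly would silently pair the wrong days, so decoding to calendar dates first is the safer habit.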

RonT23 · 2024-03-20T20:17:39Z

Thank you, that is really helpful!
R. T.

KonstantinosPl · 2024-03-22T09:19:35Z

Dear @ecMaliko

1. Are there going to be any gauge stations that belong to the same catchment (water basin)?
2. Is the average ERA5 precipitation derived from the same catchment?
3. Can we use the distributed ERA5 precipitation data?
4. Will we have the distinction between rainfall, snow and hail?
5. Can we add more input data that affect the precipitation-streamflow relationship in our models?

Thanks in advance.

K. P.

ecMaliko · 2024-03-22T11:37:49Z

Dear @KonstantinosPl ,

Thank you for your interest in this challenge!

These are the answers to your questions:

1. Good point: some catchments will be nested (smaller catchments being sub-catchments of bigger catchments). We will flag which catchments are nested; this will be prepared before the start of the challenge. We will also provide a shapefile of catchments.
2. The average ERA5 precipitation provided is an average over each of the individual catchments.
3. You can use the distributed ERA5 precipitation data, but you will have to source the data yourselves if you decide to use it. This might increase your data preparation time substantially.
4. We don’t have information on the precipitation type (rainfall, snow, hail). This information is available in ERA5, but again, keep in mind the additional data preparation time that this would add.
5. You can add any input that you think is relevant, as long as the data is openly available.

While all the data you mention would surely help improve the final product, don’t forget that the challenge is only 4 months long. Therefore, make sure your proposed work is realistic within that timeframe.

Let me know if you have further questions!

Maliko
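For teams who do source the distributed ERA5 fields themselves, a catchment average can be approximated as sketched below. This is only an illustration under stated assumptions: a regular latitude-longitude grid, a boolean catchment mask already rasterised from the shapefile, and cos(latitude) as a proxy for cell area. It is not a description of how the provided averages were computed, and the function name and values are hypothetical.

```python
import numpy as np

def catchment_mean(field, lats, mask):
    """Area-weighted mean of a (lat, lon) field over cells where mask is True."""
    # cos(latitude) approximates relative cell area on a regular lat-lon grid;
    # multiplying by the mask sets the weight to zero outside the catchment.
    weights = np.cos(np.deg2rad(lats))[:, None] * mask
    return float(np.sum(field * weights) / np.sum(weights))

# Toy 3 x 4 grid with made-up precipitation values (mm/day).
lats = np.array([50.0, 50.25, 50.5])
precip = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0]])
mask = np.zeros(precip.shape, dtype=bool)
mask[0:2, 1:3] = True  # the four cells assumed to lie inside the catchment

print(catchment_mean(precip, lats, mask))
```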

danghieutrung · 2024-03-30T07:45:39Z

Hi @ecMaliko,

I have some questions following this discussion:

1. I have reviewed the data files you attached and noticed that the sample contains only around 4-5 days of data (9 - 13 Jan 1970). Could you give us an approximation of the time range of the actual data for the project? I speculate the whole dataset would contain roughly 40-50 years, from 1970 to around 2010 or 2020.
2. Does one model have to apply to all stations, or could stations from different geographical areas use different models? For example, we may implement an LSTM model for each continent (Europe, Asia, ...), where all LSTM models have the same architecture (same configuration with an equal number of parameters) but different weights.
3. Do we have access to any GPU server during the project?

Thank you!
Hieu

ecMaliko · 2024-04-01T11:07:09Z

Hi @danghieutrung ,

My colleagues are on Easter break, so I will reply to the best of my knowledge, and I will get back to you with updated information as soon as I hear from them.

1. I apologise that the sample data I shared only had 5-7 days of observed data. I hadn’t checked the amount of data it included (maybe I should have), as it was mainly to share the format and type of data that would be provided. We have a mixture of sites with quite complete records, and others with very little data. I am afraid I can’t give you an accurate estimate of the exact amount of observed data that we hold at this moment. The paper from Chevuturi et al. (2023) can give you an idea of the amount of data available in the GRDC dataset (a subset of 119 stations), if you look at their Figure S1 in the supplementary information (the white parts are all missing data on this plot): https://ars.els-cdn.com/content/image/1-s2.0-S0022169423005498-mmc1.pdf
2. It doesn’t necessarily need to be one single model for all stations. If you think different models for different continents (or other subsets of stations) would work better, you are welcome to propose this in your project. It is also possible that some continents won’t have enough data to train the model (we know there is more data in Europe and the US than elsewhere), so you might only be able to build models for the regions of the world with denser data.
3. I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.

Maliko
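The "same architecture, different weights per region" idea discussed above can be sketched without any deep-learning framework. In the example below a linear model on lagged precipitation stands in for the LSTM, purely to keep the code short; the region names, the number of lags and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

N_LAGS = 3  # shared "architecture": every regional model uses the same inputs

def lagged_features(precip, n_lags=N_LAGS):
    """Rows of [P(t), P(t-1), ..., P(t-n_lags+1), 1] for t >= n_lags - 1."""
    precip = np.asarray(precip, dtype=float)
    cols = [precip[n_lags - 1 - k : precip.size - k] for k in range(n_lags)]
    return np.column_stack(cols + [np.ones(precip.size - n_lags + 1)])

def fit_region(stations):
    """Fit one weight vector from all (precip, flow) pairs of a region."""
    X = np.vstack([lagged_features(p) for p, _ in stations])
    y = np.concatenate([np.asarray(q, dtype=float)[N_LAGS - 1:] for _, q in stations])
    ok = ~np.isnan(y)  # train only on days with observed flow
    weights, *_ = np.linalg.lstsq(X[ok], y[ok], rcond=None)
    return weights

def predict(weights, precip):
    """Estimated flow for t >= N_LAGS - 1."""
    return lagged_features(precip) @ weights

# Synthetic data: each region follows its own (invented) precipitation-flow relation.
rng = np.random.default_rng(0)
true_w = {"Europe": np.array([0.5, 0.3, 0.1, 2.0]),
          "Asia": np.array([0.2, 0.4, 0.3, 5.0])}
data = {}
for region, w in true_w.items():
    precip = rng.gamma(2.0, 2.0, size=200)
    flow = np.full(200, np.nan)
    flow[N_LAGS - 1:] = lagged_features(precip) @ w
    flow[50:60] = np.nan  # a gap to be filled
    data[region] = [(precip, flow)]

# One model per region: identical structure, separately fitted weights.
models = {region: fit_region(stations) for region, stations in data.items()}
estimate = predict(models["Europe"], data["Europe"][0][0])
```

Regions with too few observed days would leave `fit_region` poorly constrained, which mirrors the caveat above about data-sparse continents.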

BargavReddyM · 2024-04-01T11:41:37Z

Hello, we are interested in this challenge, and I have a question:

1. Are Indian nationals eligible to apply for this? (I am from India.)
2. If so, is it mandatory to work specifically on a study area within European nations, or can any study area in the world be chosen?

ecMaliko · 2024-04-01T11:53:34Z

Dear @BargavReddyM ,

Thank you for your interest in this challenge.
Unfortunately, for this year’s Code4Earth challenges, the call is only open to candidates who are citizens from ECMWF Member States and Co-operating States. You can find the list here: https://www.ecmwf.int/en/about/who-we-are/member-states
We wish we could be more open, but this is restricted by the conditions set by our funders.
Regarding the second question: we are more interested in the methodology developed than in the specific area used to develop the method. Therefore, it doesn’t necessarily need to be based in Europe. However, Europe is one of the most data-rich areas (in terms of river flow), and therefore it could be a good starting point.

Kind regards,

Maliko

BargavReddyM · 2024-04-01T11:55:01Z

Thank you for the reply

trakasa · 2024-04-01T15:11:25Z

Hi @danghieutrung, hi @ecMaliko

> I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.

AT: Yes, that is correct. Thanks @ecMaliko for answering!
If the selected proposals need access to computing resources you can access the European Weather Cloud or WEkEO.

Bye, Athina

trakasa · 2024-04-01T15:19:18Z

@BargavReddyM @ecMaliko

> Thank you for the reply
AT: Indeed, as funding comes from different (European) sources, we have to follow certain rules for eligibility.
You have to be a citizen or resident of an ECMWF Member State or Co-operating State or EU Member State, or of a country associated with the EU’s Space Programme (currently Iceland, Norway and United Kingdom) or with the EU’s Digital Europe Programme (currently Albania, Iceland, Liechtenstein, Montenegro, North Macedonia, Norway, Serbia and Türkiye).

For more details please check the Code for Earth Terms & Conditions (mainly Article 3).

Thanks @ecMaliko for getting back to Bargav!

Bye, Athina

wsyip85 · 2024-04-09T04:55:35Z

Hello. I could not submit my proposal because the link to the submission form reported that it "refused to connect". May I have some help, please? Here is the link from the website.
https://codeforearth.commpla.com/ecmwf-code-for-earth-2024-submission-form

wsyip85 · 2024-04-09T06:06:50Z

Thank you, the link is now okay.

ecMaliko · 2024-04-09T06:15:34Z

Hi @wsyip85
I am glad the problem is now solved.
Kind regards,
Maliko

RubenRT7 changed the title ~~Challenge xx - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions~~ Challenge 02 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 16, 2024

EsperanzaCuartero assigned EsperanzaCuartero, trakasa, ecMaliko, mc4117 and ecCinziaMazzetti Feb 16, 2024

EsperanzaCuartero changed the title ~~Challenge 02 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions~~ Challenge 08- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 22, 2024

EsperanzaCuartero added the Machine Learning Machine learning for Earth Sciences applications label Feb 22, 2024

EsperanzaCuartero changed the title ~~Challenge 08- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions~~ Challenge 20- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 23, 2024

EsperanzaCuartero assigned GwynethMatthews Feb 23, 2024

RubenRT7 added ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications and removed ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications labels Mar 7, 2024
