
Challenge 20- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions #2

Open
RubenRT7 opened this issue Feb 16, 2024 · 19 comments
Labels
ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications

Comments

@RubenRT7
Contributor

RubenRT7 commented Feb 16, 2024

Challenge 20 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions

Stream 2 - Machine Learning for Earth Sciences applications

Goal

Develop machine learning solutions to bridge gaps in streamflow observations, enhancing the accuracy and reliability of hydrological data analysis and forecasting.

Mentors and skills

  • Mentors: Maliko Tanguy, Gwyneth Matthews, Mariana Clare, Cinzia Mazzetti (all ECMWF)

  • Skills required:

  • Essential:

    • Python (numpy, pandas, xarray, ...)
    • Machine learning (scikit-learn, PyTorch/TensorFlow)
    • Visualisation (mapping, graphs)
  • Desirable:

    • Time series analysis
    • Open-source collaboration (Git)
  • Advantageous:

    • Ability to create clear documentation / communication
    • Basic hydrology understanding

Challenge description

Introduction
Operational flood forecasting systems like EFAS and GloFAS, part of the Copernicus Emergency Management Service (CEMS), play a pivotal role in providing advance warning of devastating flood events that significantly impact societies worldwide. These systems must be reliable and accurate, which makes assessing forecast skill critical for gauging their trustworthiness and utility.
A major limitation in calibrating and evaluating these forecasting systems is the scarcity, quality, and incompleteness of observational data, particularly in areas where flood impacts are most severe. In addition, the calculation of some forecasting skill scores such as the Continuous Ranked Probability Skill Score (CRPSS) necessitates continuous time series, posing a challenge when data is unavailable or incomplete. Extending the time series also allows for the provision of reference or climatology values against which to compare forecasts, enhancing the robustness of the evaluation process.
Existing literature (e.g. [1,2,3]) shows that various ML methods, such as Random Forests and LSTM models, are promising for gap-filling river flow data. However, a comprehensive understanding of their strengths and limitations is essential for informed implementation.
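To illustrate why missing observations matter for the CRPSS mentioned above, here is a minimal sketch, not the scoring code used in EFAS or GloFAS. It uses the standard ensemble form of the CRPS and the usual definition CRPSS = 1 - CRPS_forecast / CRPS_reference. The function names and the toy numbers are assumptions made for illustration only.

```python
import math


def ensemble_crps(members, obs):
    """CRPS of one ensemble forecast against one observed value."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    error_term = sum(abs(x - obs) for x in members) / m
    # Half the mean absolute difference between all member pairs.
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return error_term - spread_term


def crpss(forecasts, references, observations):
    """Return (CRPSS, days used); days without an observation are skipped."""
    fc_scores, ref_scores = [], []
    for fc, ref, obs in zip(forecasts, references, observations):
        if obs is None or math.isnan(obs):
            continue  # a gap in the observations removes this day from the score
        fc_scores.append(ensemble_crps(fc, obs))
        ref_scores.append(ensemble_crps(ref, obs))
    if not fc_scores:
        raise ValueError("no valid observations to score against")
    mean_fc = sum(fc_scores) / len(fc_scores)
    mean_ref = sum(ref_scores) / len(ref_scores)
    return 1.0 - mean_fc / mean_ref, len(fc_scores)


# Toy data (made up): three days, two-member ensembles, one missing observation.
forecasts = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
references = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
observations = [1.0, float("nan"), 3.0]

score, n_days = crpss(forecasts, references, observations)
print(f"CRPSS = {score:.2f} over {n_days} of {len(observations)} days")
```

Every gap shrinks the sample the score is computed over, which is one reason a gap-filled, continuous series makes the evaluation more robust.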

Project objectives
The primary objective is to explore different approaches to gap-fill observed daily streamflow time series, comparing their performance and determining the maximum length of gap that can be reliably filled. The project aims to implement these methods into an open-source software package based on Python, providing a user-friendly solution for filling gaps in observational datasets.

Methodology

  • Data Collection:
    Observed river flow data from GRDC and catchment average precipitation data from ERA5 will be provided for a subset of river gauging stations used in GloFAS and EFAS. The inclusion of remote sensing water level data could also be considered, with a focus on addressing associated challenges (e.g. data accuracy, resolution, temporal and spatial coverage).
  • Selection of methods:
    Based on a brief review of existing literature on the topic, the team will select a few different statistical and ML methods to be implemented and compared. Proposals should focus on head catchments but ideas of how to manage nested catchments are also encouraged.
  • Coding and Implementation:
    Open-source software, predominantly Python, will be used for the implementation of different gap-filling methods. The coding phase will be organised into milestones to ensure a systematic and timely execution of the project.
  • Evaluation and Comparison:
    A comprehensive evaluation will be conducted, comparing different methods based on general performance and considering the size of the data gap. The team will develop strategies for assessing performance variations with an increase in gap size, providing valuable insights into the reliability of each method.
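The evaluation step above can be sketched as an artificial-gap experiment: remove known values, fill them, and score the filled values against the truth for each gap length. This is only an illustration under stated assumptions. The series is synthetic, linear interpolation stands in as a simple baseline (it is not one of the methods the challenge prescribes), and the Nash-Sutcliffe efficiency (NSE) is one possible metric chosen here. All function names are hypothetical.

```python
import math
import random


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean of obs."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def linear_fill(series):
    """Fill runs of None by linear interpolation between valid neighbours."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i
            while i < n and filled[i] is None:
                i += 1
            left, right = start - 1, i
            if left < 0 or right >= n:
                raise ValueError("gap touches the edge of the series")
            for k in range(start, right):
                w = (k - left) / (right - left)
                filled[k] = filled[left] * (1 - w) + filled[right] * w
        else:
            i += 1
    return filled


def degradation_table(series, gap_lengths, n_trials=50, seed=0):
    """Map each gap length to the NSE of the filled values over random gaps."""
    rng = random.Random(seed)
    table = {}
    for gap in gap_lengths:
        truth, estimate = [], []
        for _ in range(n_trials):
            # Keep at least one valid value on each side of the gap.
            start = rng.randrange(1, len(series) - gap)
            masked = list(series)
            masked[start:start + gap] = [None] * gap
            filled = linear_fill(masked)
            truth.extend(series[start:start + gap])
            estimate.extend(filled[start:start + gap])
        table[gap] = nse(truth, estimate)
    return table


# Synthetic daily "flow": a seasonal cycle plus a faster 20-day oscillation.
series = [
    10 + 5 * math.sin(2 * math.pi * t / 365) + 2 * math.sin(2 * math.pi * t / 20)
    for t in range(3 * 365)
]

table = degradation_table(series, gap_lengths=[1, 5, 10, 30])
for gap, value in table.items():
    print(f"gap of {gap:>2} days: NSE = {value:.3f}")
```

Replacing `linear_fill` with any other gap-filling function that has the same input and output lets different methods be compared on identical artificial gaps.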

Expected outcome
The project’s final outcome will be well-documented, user-friendly Python code available on GitHub, featuring one or several gap-filling options. Accompanying this code will be information on method performance, including the maximum reliable gap size and a degradation table detailing performance with increasing gap size, which will help users select the best method for their data.

Stretch goals (optional)
Ready for an extra challenge? For those eager to push their limits, we offer optional stretch goals:

  1. Extend the application of these methods to temporally disaggregate time series to refine data resolution (e.g. from monthly to daily river flow data).
  2. Evaluate the use of these methods to extend time series, beyond gap-filling.

References
[1] Arriagada et al. (2021)
[2] Dariane & Borhan (2024)
[3] Ren et al. (2022)

@RubenRT7 RubenRT7 changed the title Challenge xx - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Challenge 02 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 16, 2024
@EsperanzaCuartero EsperanzaCuartero changed the title Challenge 02 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Challenge 08- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 22, 2024
@EsperanzaCuartero EsperanzaCuartero added the Machine Learning Machine learning for Earth Sciences applications label Feb 22, 2024
@EsperanzaCuartero EsperanzaCuartero changed the title Challenge 08- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Challenge 20- Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions Feb 23, 2024
@RubenRT7 RubenRT7 added ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications and removed ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications labels Mar 7, 2024
@RonT23

RonT23 commented Mar 17, 2024

Hello, we are interested in this challenge and have a few questions:

  1. Will the dataset be provided upon (or if) acceptance of the proposal, or should we reference the dataset in the proposal?
  2. Our proposal addresses the problem in a more abstract and general manner; do we need to provide more specific details?
  3. Could you please provide an example dataset so that we can study its features? We found a dataset that seems related to the problem, but we are not sure it is the right one.
    Thank you in advance!

@ecMaliko

Hi @RonT23 ,
Thank you for your interest and questions about our challenge!
Regarding questions 1 and 3: I'm checking with my fellow mentors about this and will let you know ASAP.
Regarding question 2: We’re looking for proposals with clear steps and milestones rather than abstract solutions. A more detailed plan will help ensure a tangible outcome by the end of the challenge. We are happy with some flexibility, as there will inevitably be some unexpected issues along the way, and some trial and error with some methodological issues. But within that flexibility, the more specifics you can provide, the better.

@ecMaliko

Hi @RonT23
I can now confirm that we will be able to provide you with daily river flow observations at GRDC sites, and catchment-averaged ERA5 precipitation for these sites. I haven't checked the data, but I think we have a few thousand sites across the world, with variable record lengths. We will prepare the data before the start of the challenge. We don't have any remote sensing data, though. Therefore, if you are planning to use such data in your project, you would need to source it yourselves.

I could prepare some sample data for a few of these sites for you to explore, but it would take me a couple of days. Could you please confirm you would like me to prepare the sample data for you?

Best wishes,

Maliko

@daniel-obrien

Hi Maliko,

It would be very helpful for us if you could prepare that data.

Thank you,
Daniel

@RonT23

RonT23 commented Mar 19, 2024

Hi @ecMaliko,
We would appreciate it if you could provide us with a sample dataset!
Thank you,
Ronaldo T.

@ecMaliko

Hi @RonT23 and @daniel-obrien ,
I have attached some sample data (100 stations) here so that you can explore the format and type of data that will be provided. The full dataset will have a few thousand stations. There is one NetCDF file with observed discharge data, and another with catchment-averaged precipitation data from ERA5. I have also included a CSV file with some additional metadata. The 'statid' variable in the NetCDF files corresponds to the 'station_id_num' column in the metadata file.
Please note that the reference date in the precipitation file is different from the discharge file!
Let me know if you have any questions.
Regards,
Maliko

sample_data_code4earth.zip
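Because the two files use different reference dates, their raw time values cannot be compared directly. A minimal sketch of the alignment follows, assuming each time axis is stored as "days since" its own reference date. The reference dates, offsets, and values below are made up for illustration and are not taken from the sample files. In NetCDF files that follow the CF conventions, the reference date is normally found in the time variable's `units` attribute, and libraries such as xarray usually decode it automatically.

```python
from datetime import date, timedelta


def decode_days(offsets, reference):
    """Convert 'days since reference' offsets into calendar dates."""
    return [reference + timedelta(days=int(d)) for d in offsets]


def align(discharge, precip):
    """Inner-join two {date: value} mappings on calendar date."""
    common = sorted(discharge.keys() & precip.keys())
    return [(d, discharge[d], precip[d]) for d in common]


# Hypothetical reference dates: check the real ones in each file's metadata.
discharge_ref = date(1970, 1, 1)
precip_ref = date(1969, 1, 1)

# Hypothetical raw time axes and values for one station.
discharge_time = [8, 9, 10]
discharge_values = [12.4, 11.9, 13.1]
precip_time = [372, 373, 374, 375]
precip_values = [0.0, 3.2, 1.1, 0.4]

discharge = dict(zip(decode_days(discharge_time, discharge_ref), discharge_values))
precip = dict(zip(decode_days(precip_time, precip_ref), precip_values))

rows = align(discharge, precip)
for day, q, p in rows:
    print(day.isoformat(), q, p)
```

Joining on decoded calendar dates, rather than on the raw offsets, keeps the two variables aligned whatever reference date each file uses.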

@RonT23

RonT23 commented Mar 20, 2024

Thank you, that is really helpful!
R. T.

@KonstantinosPl

Dear @ecMaliko

  1. Are there going to be any gauge stations that belong to the same catchment (water basin)?
  2. Is the average ERA5 precipitation derived from the same catchment?
  3. Can we use the distributed ERA5 precipitation data?
  4. Will we have the distinction between rainfall and snow and hail?
  5. Can we add more input data that affect the precipitation-streamflow relationship in our models?

Thanks in advance.

K. P.

@ecMaliko

ecMaliko commented Mar 22, 2024

Dear @KonstantinosPl ,

Thank you for your interest in this challenge!

Here are the answers to your questions:

  1. Good point: some catchments will be nested (smaller catchments being sub-catchments of bigger catchments). We will flag which catchments are nested; this will be prepared before the start of the challenge. We will also provide a shapefile of catchments.
  2. The average ERA5 precipitation provided is an average over each of the individual catchments.
  3. You can use the distributed ERA5 precipitation data, but you will have to source the data yourselves if you decide to use it. This might increase your data preparation time substantially.
  4. We don’t have information on the precipitation type (rainfall, snow, hail). This information is available in ERA5, but again, keep in mind the additional data preparation time that this would add.
  5. You can add any input that you might think is relevant, as long as it is data openly available.

While all the data you mention would surely help improve the final product, don’t forget that the challenge lasts only 4 months. Therefore, make sure your proposed work is realistic within that timeframe.

Let me know if you have further questions!

Maliko

@danghieutrung

Hi @ecMaliko,

I have some questions following this discussion:

  1. I have reviewed the data files you attached and noticed that the sample contains only around 4-5 days of data (9 - 13 Jan 1970). Could you give us an approximate time range for the actual project data? I would guess the whole dataset covers some 40-50 years, from 1970 to around 2010 or 2020.
  2. Does one model have to apply to all stations, or could stations in different geographical areas use different models? For example, we may implement an LSTM model for each continent (Europe, Asia, ...), where all LSTM models share the same architecture (same configuration and number of parameters) but have different weights.
  3. Do we have access to any GPU server during the project?

Thank you!
Hieu

@ecMaliko

ecMaliko commented Apr 1, 2024

Hi @danghieutrung ,

My colleagues are on Easter break, so I will reply to the best of my knowledge, and I will get back to you with updated information as soon as I hear from them.

  1. I apologise that the sample data that I had shared only had 5-7 days of observed data. I hadn’t checked the amount of data it included (maybe I should have), as it was mainly to share the format and type of data that would be provided. We have a mixture of sites with quite complete records, and others with very little data. I am afraid I can’t give you an accurate estimate of the exact amount of observed data that we hold at this moment. The paper from Chevuturi et al. (2023) can give you an idea of the amount of data available in the GRDC dataset (a subset of 119 stations), if you look at their figure S1 in supplementary information (the white parts are all missing data on this plot): https://ars.els-cdn.com/content/image/1-s2.0-S0022169423005498-mmc1.pdf
  2. It doesn’t necessarily need to be one single model for all stations. If you think different models for different continents (or other subsets of stations) would work better, you are welcome to propose this in your project. It is also possible that some continents won’t have enough data to train a model (we know there is more data in Europe and the US than elsewhere), so you might only be able to build models for the regions of the world with denser data.
  3. I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.

Maliko
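The point about regional models and data density can be sketched as a simple pre-selection step: group stations by region and keep only the regions with enough well-observed stations to train on. This is a rough illustration under stated assumptions. The `region` and `valid_days` fields and both thresholds are hypothetical and are not part of the metadata file described in this thread; only `station_id_num` comes from it.

```python
from collections import defaultdict


def regions_with_enough_data(stations, min_stations=3, min_valid_days=3650):
    """Return {region: [station ids]} for regions that pass both thresholds."""
    by_region = defaultdict(list)
    for st in stations:
        # Ignore stations whose record is too short to be useful for training.
        if st["valid_days"] >= min_valid_days:
            by_region[st["region"]].append(st["station_id_num"])
    # Keep only regions with enough usable stations.
    return {r: ids for r, ids in by_region.items() if len(ids) >= min_stations}


# Made-up metadata records for illustration.
stations = [
    {"station_id_num": 1, "region": "Europe", "valid_days": 9000},
    {"station_id_num": 2, "region": "Europe", "valid_days": 7200},
    {"station_id_num": 3, "region": "Europe", "valid_days": 4100},
    {"station_id_num": 4, "region": "Europe", "valid_days": 300},
    {"station_id_num": 5, "region": "Africa", "valid_days": 5000},
    {"station_id_num": 6, "region": "Africa", "valid_days": 800},
]

trainable = regions_with_enough_data(stations)
print(trainable)
```

Each region that passes could then get its own model instance with a shared architecture but separately trained weights, as proposed in the question above.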

@BargavReddyM

Hello, we are interested in this challenge, and I have two questions:

  1. Are Indian nationals eligible to apply for this? (I am from India.)
  2. If so, is it mandatory to work on a study area within European nations, or can any study area in the world be chosen?

@ecMaliko

ecMaliko commented Apr 1, 2024

Dear @BargavReddyM ,

Thank you for your interest in this challenge.
Unfortunately, for this year’s Code4Earth challenges, the call is only open to candidates who are citizens from ECMWF Member States and Co-operating States. You can find the list here: https://www.ecmwf.int/en/about/who-we-are/member-states
We wish we could be more open, but this is restricted by the conditions set by our funders.
Regarding the second question: we are more interested in the methodology developed than in the specific area used to develop it. Therefore, it doesn’t necessarily need to be based in Europe. However, Europe is one of the most data-rich areas (in terms of river flow), and therefore it could be a good starting point.

Kind regards,

Maliko

@BargavReddyM

Thank you for the reply

@trakasa
Contributor

trakasa commented Apr 1, 2024

Hi @danghieutrung, hi @ecMaliko

  1. I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.

AT: Yes, that is correct. Thanks @ecMaliko for answering!
If the selected proposals need access to computing resources you can access the European Weather Cloud or WEkEO.

Bye, Athina

@trakasa
Contributor

trakasa commented Apr 1, 2024

@BargavReddyM @ecMaliko

Thank you for the reply
AT: Indeed, as funding comes from different (European) sources, we have to follow certain rules for eligibility.
You have to be a citizen or resident of an ECMWF Member State or Co-operating State, an EU Member State, a country associated with the EU’s Space Programme (currently Iceland, Norway and the United Kingdom), or a country associated with the EU’s Digital Europe Programme (currently Albania, Iceland, Liechtenstein, Montenegro, North Macedonia, Norway, Serbia and Türkiye).

For more details please check the Code for Earth Terms & Conditions (mainly Article 3).

Thanks @ecMaliko for getting back to Bargav!

Bye, Athina

@wsyip85

wsyip85 commented Apr 9, 2024

Hello. I could not submit my proposal because the link to the submission form gave a "refused to connect" error. Could I have some help, please? Here is the link from the website.
https://codeforearth.commpla.com/ecmwf-code-for-earth-2024-submission-form

@wsyip85

wsyip85 commented Apr 9, 2024

Thank you, the link is now okay.

@ecMaliko

ecMaliko commented Apr 9, 2024

Hi @wsyip85
I am glad the problem is now solved.
Kind regards,
Maliko
