-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MMFlood dataset #2450
base: main
Are you sure you want to change the base?
Add MMFlood dataset #2450
Conversation
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the contribution, this is great! I have made a first pass of comments below. If you have questions about anything feel free to comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good now, thanks for the hard work!
Only other comment I would make is that the recommended approach for these "curated" geospatial datasets (RasterDataset
s containing both images and masks) is to create a dummy dataset for the images, a dummy dataset for the masks, and an IntersectionDataset
that combines them. This usually lets you completely skip the __init__
and __getitem__
since it will inherit from RasterDataset
. See L7Irish
and L8Biome
for examples of these. Up to you whether or not you want to do this since you're almost done, but it could make the code a bit cleaner.
tests/data/mmflood/data.py
Outdated
MAX_VALUE = 1000.0 | ||
MIN_VALUE = 0.0 | ||
RANGE = MAX_VALUE - MIN_VALUE | ||
FOLDERS = ['s1_raw', 'DEM', 'mask'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lowercase would be better for local variables. Just note that range
is a builtin and this would overshadow it, so maybe name it something else or just use (min_value - max_value)
everywhere.
docs/api/datasets/geo_datasets.csv
Outdated
@@ -20,6 +20,7 @@ Dataset,Type,Source,License,Size (px),Resolution (m) | |||
`L8 Biome`_,"Imagery, Masks",Landsat,"CC0-1.0","8,900x8,900","15, 30" | |||
`LandCover.ai Geo`_,"Imagery, Masks",Aerial,"CC-BY-NC-SA-4.0","4,200--9,500",0.25--0.5 | |||
`Landsat`_,Imagery,Landsat,"public domain","8,900x8,900",30 | |||
`MMFlood`_,"Imagery,DEM,Masks","Sentinel, MapZen/TileZen, OpenStreetMap",CC-BY-4.0,"2,147x2,313",20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The paper is CC-BY-4.0, but the data is MIT, I would use MIT here
def __init__( | ||
self, | ||
root: Path = 'data', | ||
crs: CRS | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also add a res
parameter to match all other GeoDatasets
torchgeo/datasets/mmflood.py
Outdated
class MMFlood(RasterDataset): | ||
"""MMFlood dataset. | ||
|
||
`MMFlood <https://huggingface.co/datasets/links-ads/mmflood>`__ dataset is a multimodal flood delineation dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is longer than 88 characters, can you add some newlines?
torchgeo/datasets/mmflood.py
Outdated
check_folders: if True, verifies pairings of all s1, dem and mask data across all the folders | ||
load_all: if True, loads all tif files contained in the "activations" folder in the root folder specified. Otherwise, only acquisitions for the given split are loaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines too long, wrap at 88 chars
torchgeo/datamodules/mmflood.py
Outdated
mean = torch.tensor([0.1785585, 0.03574104, 168.45529]) | ||
median = torch.tensor([0.116051525, 0.025692634, 86.0]) | ||
std = torch.tensor([2.405442, 0.22719479, 242.74359]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these work both include_dem
True and False? Seems like the length should change
torchgeo/datasets/mmflood.py
Outdated
the merged image | ||
""" | ||
image = self._load_tif(index, modality='s1_raw', query=query).float() | ||
if self.include_dem: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the docstring you mentioned that the DEM is not available for all regions, but is this also explicitly handled? Or what happens if the DEM is not available. Do I get a tensor with a zero padded channel dimension?
Or maybe as an additional check, if you have the full dataset downloaded, can you instantiate a datamodule or dataloader over the dataset an iterate through the length of the dataset without errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEM is available for all the regions. Hydrography maps are missing for many of the tiles (738 missing). They are downloaded but not handled by this class. Maybe should I remove them from the docstring? Or introduce a flag include_hydro
in the constructor and load only a portion of the dataset (1012 tiles out of 1748 total)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, misread. If you think that hydrography should be useful to users, then I think a flag to include_hydro
with default False
would be a good thing. And then just add in the docstring like you mentioned here, that if someone wants to use hydro they will only have a subset of the data available.
Thank you @adamjstewart for your comments. You managed all of your comments, including the conversion of
|
|
Ok. I should have implemented it using the
Yes, that is correct. To access this timestamp I should do as follows
I haven’t checked properly, but I think that should not be a big issue the current way it is implemented (i.e. do not retrieve the timestamp for each date), since I believe most of the overlapping tiles refers to the same date… |
This PR adds the MMFlood dataset from the paper "MMFlood: A Multimodal Dataset for Flood Delineation From Satellite Imagery". This is a Sentinel-1 + DEM dataset for Image Segmentation.
Original tif files are of variable resolution. Max height in pixels is 2147, max width in pixels is 2313 (which are the ones reported in the docs). The dataset also includes hydrography information, but it is not available for all acquisitions (currently the implemented class does not read such tif files).
Example with False Color representation