xrspatial not working with large dask-backed arrays
I believe there's a bug in the xrspatial.proximity method: the coordinates of a large array (even when it is a dask array) are loaded into memory as numpy arrays instead of being kept as dask arrays. For context, I'm attempting to compute a distance-to-coast for a 90 m global dataset. I had a look at the source code and I believe the issue is here (in the _process function of proximity):
```python
raster_dims = raster.dims
if raster_dims != (y, x):
    raise ValueError(
        "raster.coords should be named as coordinates:"
        "({0}, {1})".format(y, x)
    )

distance_metric = DISTANCE_METRICS.get(distance_metric, None)
if distance_metric is None:
    distance_metric = DISTANCE_METRICS["EUCLIDEAN"]

target_values = np.asarray(target_values)

# x-y coordinates of each pixel.
# flatten the coords of input raster and reshape to 2d
xs = np.tile(raster[x].data, raster.shape[0]).reshape(raster.shape)
ys = np.repeat(raster[y].data, raster.shape[1]).reshape(raster.shape)
```
Therefore xs and ys are huge numpy arrays that don't fit into memory. If the input data is a dask array, these should probably be dask arrays rather than numpy arrays. Later in the processing sequence the proximity calculation is already dispatched to either dask or numpy, and I think a similar dask/numpy split is needed here; a rough sketch of what I mean follows.
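Something like this (just a sketch, not the actual xrspatial code; `build_coord_grids` is a made-up helper name):

```python
import dask.array as da
import numpy as np


def build_coord_grids(raster, x="x", y="y"):
    # Sketch only: keep the per-pixel coordinate grids lazy when the
    # raster is dask-backed, fall back to numpy otherwise.
    x_coords = np.asarray(raster[x].data)  # 1-D, length = raster.shape[1]
    y_coords = np.asarray(raster[y].data)  # 1-D, length = raster.shape[0]

    if isinstance(raster.data, da.Array):
        chunks_y, chunks_x = raster.data.chunks
        # Wrap the coordinate vectors as dask arrays chunked like the data,
        # then broadcast them lazily instead of materialising two full
        # (ny, nx) numpy arrays.
        xs, ys = da.meshgrid(
            da.from_array(x_coords, chunks=(chunks_x,)),
            da.from_array(y_coords, chunks=(chunks_y,)),
        )
    else:
        # Equivalent to the existing np.tile / np.repeat lines in _process
        xs, ys = np.meshgrid(x_coords, y_coords)
    return xs, ys
```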
If I've missed something please let me know; I'm more than happy to share more code if needed. In the meantime I can use gdal_proximity.py directly (roughly as sketched below), but I guess it would be slower than a dask-backed xrspatial.
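For reference, the gdal workaround I mean is roughly this (a sketch; the file names and the VALUES=1 target class are placeholders for my actual coast mask):

```python
from osgeo import gdal

# Placeholder input: a raster where coast pixels have value 1.
src = gdal.Open("coast_mask.tif")
src_band = src.GetRasterBand(1)

# Output raster with the same size and georeferencing as the input.
drv = gdal.GetDriverByName("GTiff")
dst = drv.Create(
    "distance_to_coast.tif",
    src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Float32,
)
dst.SetGeoTransform(src.GetGeoTransform())
dst.SetProjection(src.GetProjection())

# Distance (in georeferenced units) from every pixel to the nearest
# pixel whose value is 1.
gdal.ComputeProximity(src_band, dst.GetRasterBand(1),
                      ["VALUES=1", "DISTUNITS=GEO"])
dst.FlushCache()
```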
Thanks!
@GrahamReveley @bstadlbauer thanks for contributing here. I believe this is a known issue to @thuydotm, but it hasn't been formally documented. Also, sorry for the delay in responding.
@thuydotm, at a minimum we need to document this limitation in proximity.
We are going to prioritize proximity over the next few sprints, but we're happy to help review / test a PR if y'all want to contribute a fix.
@brendancol We do have a constrained use case and are working with a modified fork of the proximity code here, but IIRC that's not something we can easily generalize to work for all use cases.