channel selection does not lead to anticipated speedup for meerKAT MSs #31
The following should do it:
Yes, but it's set up through. Re: katdal, I think channel tiling depends on the application (spectral, continuum, etc.), but let's chat in person.
BTW @sjperkins, @JSKenyon, is the recent activity I saw on dask-ms related to `getcolslice[np]` relevant here perhaps?
@sjperkins are you sure? Surely that's the chunking specification, not the channel subset specification. When I do it that way, it crashes with:
Can't find it in the dask-ms docs...
Yes, it should be an integer or a tuple specifying the individual channel chunks (which should add up to the full channel range). So for 60 channels:
```python
xds_from_ms(ms, chunks={'row': row_chunks, 'chan': chanslice.end - chanslice.start})
```
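To illustrate the "chunks must add up to the full channel range" point, here is a minimal sketch (not from the thread) of a three-part chunk tuple that isolates a channel slice. Note it uses a plain Python `slice`, whose end attribute is `.stop`; the thread's `chanslice` object appears to expose `.end` instead.

```python
# A tuple chunk spec must cover the full channel range. For 60
# channels and a desired selection of channels 20:40, the three
# chunks are: before the slice, the slice itself, after the slice.
chanslice = slice(20, 40)
nchan = 60

chan_chunks = (chanslice.start,                   # channels 0:20
               chanslice.stop - chanslice.start,  # channels 20:40
               nchan - chanslice.stop)            # channels 40:60

assert sum(chan_chunks) == nchan  # chunks add up to the full range
print(chan_chunks)  # prints (20, 20, 20)
```

With this chunking, the selected channels live in their own chunk, so slicing them out touches only that chunk.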
But where is the actual channel subset specified, if I want one? I've been applying it as a slice to the DataArrays; is that the best way to do it?
Apologies, I'm losing the plot in this thread. You can slice the DataArrays, with the caveat that it'll read the entire channel range and then slice the relevant chunk out. To do this with maximal efficiency, you'd have to use a pre-processing step to figure out the optimal chunking strategy for the channel dimension:

```python
# Initial dataset partition on FIELD_ID and DATA_DESC_ID
ddids = [ds.DATA_DESC_ID for ds in xds_from_ms("3C286.ms")]

# Read the (very small) DATA_DESCRIPTION table into memory
ddid = xds_from_table("3C286.ms::DATA_DESCRIPTION").compute()

# Create a dataset per row of SPECTRAL_WINDOW
spws = xds_from_table("3C286.ms::SPECTRAL_WINDOW", group_cols="__row__")

# Number of channels for each dataset
nchan = [spws[ddid[i].SPECTRAL_WINDOW_ID[0]].CHAN_FREQ.shape[0] for i in ddids]

# Channel chunking schema for each dataset:
# channels before the slice, the slice itself, channels after
chan_chunks = [(chanslice.start, chanslice.end - chanslice.start, nc - chanslice.end)
               for nc in nchan]

# Chunking schema for each dataset
chunks = [{'row': 100000, 'chan': cc} for cc in chan_chunks]

# Re-open the exact same datasets with a different chunking strategy
datasets = xds_from_ms("3C286.ms", chunks=chunks)
```

I typed this out without running it, but it should illustrate the idea. The above is clunky; I'm thinking about general improvements to the process in ratt-ru/dask-ms#86.
Bottom line: I need to adjust the chunking as above for maximum efficiency.
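The chunk-tuple construction above could be wrapped in a small helper. This is a hypothetical sketch (the name `channel_chunks` is not from the thread), assuming a plain Python `slice` with in-range integer bounds; it also drops empty chunks when the slice touches either end of the band, which the inline one-liner above does not handle.

```python
def channel_chunks(chanslice, nchan):
    """Return a chunk tuple that isolates chanslice within nchan channels.

    Hypothetical helper: assumes chanslice is a plain Python slice
    with integer, in-range start/stop values (or None).
    """
    start = chanslice.start or 0
    stop = nchan if chanslice.stop is None else chanslice.stop
    parts = (start, stop - start, nchan - stop)
    # Drop zero-length chunks when the slice starts at channel 0
    # or ends at the last channel
    return tuple(p for p in parts if p > 0)

print(channel_chunks(slice(20, 40), 60))  # prints (20, 20, 20)
print(channel_chunks(slice(0, 30), 60))   # prints (30, 30)
```

The resulting tuple can then be passed as the `'chan'` entry of the `chunks` argument sketched above.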
I've added a channel subset selector, which works in the usual slicing manner:
The first case plots 100 channels; the second case plots all 4096. However, the speedup in the first case is only 2x, which suggests that not much I/O has been saved.
I apply the channel selection by slicing the DataArray in the group object, using `array[dict(chan=chanslice)]`, which I thought was the prescribed manner to do this (@sjperkins please confirm). Perhaps the problem is with the DataManager in the MS. Looking at it:
...it doesn't tile the channel dimension, which means that the underlying table system is still reading entire rows. That is very inefficient if one only wants a small subset of the channels, in which case this is a katdal issue to be fixed.
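A back-of-envelope sketch (illustrative numbers only, matching the 4096-channel, 100-channel-selection case in this thread) of why untiled channels hurt:

```python
# Illustrative assumption: complex64 visibilities (8 bytes),
# 4096 channels, 4 correlations, 100 channels selected.
bytes_per_vis = 8
nchan, ncorr = 4096, 4
sel_chan = 100

# Tiles spanning the full channel range: every row is read in full
full_row = nchan * ncorr * bytes_per_vis

# A hypothetical channel-tiled layout: only selected channels are read
sliced_row = sel_chan * ncorr * bytes_per_vis

# Selecting 100 of 4096 channels could cut per-row I/O roughly 40x,
# so the observed 2x speedup is far from the ideal
print(full_row // sliced_row)  # prints 40
```

This is consistent with the 2x speedup above being dominated by full-row reads rather than by the size of the channel selection.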
@sjperkins please also confirm: if I'm slicing the array as above, it will eventually result in a `getcolslice[np]()` call to read the data and not `getcol()`, correct?