
Chunked w-gridder with dask-ms and ducc #204

Merged (43 commits) on Sep 6, 2020
Conversation

@landmanbester (Collaborator)

I am trying to build a chunked interface for the wgridder using dask-ms. I am mainly thinking about the functionality I need for pfb-clean, but let's discuss here @sjperkins @o-smirnov @mreineck.

My thinking is to traverse the chunks serially and leave all the parallelisation to the ducc.wgridder (derived from the nifty_gridder). This should be better for memory consumption and performance, but let's see (interleaved IO could be very cool here indeed). Anyway, I seem to have a working gridder (the degridder is tomorrow's task). I really expected this to be harder, as there is a certain amount of nested parallelism involved. However, to my surprise, setting (so that the chunks are traversed serially)

dask.config.set(pool=ThreadPool(1))

and passing an nthreads > 1 parameter to the wgridder magically spins up all threads to 100%, with all of them looking beautifully green. The dirty image looks like what I expect, but I have yet to test that it agrees with the non-dask version. I would love some feedback on this. Please check out the docstrings in this class

https://github.com/landmanbester/codex-africanus/blob/f973bb149b03cf5652f6a6cb838d9327b6d2b4c9/africanus/gridding/nifty/dask.py#L94

where I have tried to describe the intended behaviour.
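For concreteness, a minimal sketch of the setup I mean (the image size, cell size, accuracy and toy data below are placeholders, and the exact ms2dirty keywords have shifted a little between ducc0 releases):

```python
from multiprocessing.pool import ThreadPool

import dask
import numpy as np
from ducc0.wgridder import ms2dirty

# Traverse the chunks serially; all parallelism comes from the
# wgridder's own thread pool via nthreads.
dask.config.set(pool=ThreadPool(1))

def grid_chunk(uvw, freq, vis, nthreads=8):
    # Placeholder image size, cell size and accuracy.
    return ms2dirty(uvw=uvw, freq=freq, ms=vis,
                    npix_x=1024, npix_y=1024,
                    pixsize_x=5e-6, pixsize_y=5e-6,
                    epsilon=1e-5, do_wstacking=True,
                    nthreads=nthreads)

# Toy chunk: 1000 rows, 8 channels.
uvw = 100.0 * np.random.randn(1000, 3)
freq = np.linspace(0.856e9, 1.712e9, 8)
vis = (np.random.randn(1000, 8) + 1j * np.random.randn(1000, 8)).astype(np.complex64)
dirty = grid_chunk(uvw, freq, vis)
```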

@sjperkins (Member)

@landmanbester I'll wait for you to implement the degridding functionality before taking a closer look.

Regarding the nested parallelism, I can see this is going to be a topic that needs to be addressed in a paradigm where we're running code with OpenMP/TBB parallelism inside dask threads.

Playing the devil's advocate: too many dask + OpenMP threads lead to oversubscription, while if we only use a single dask thread, then what's the point of using dask?

The answer may lie in the middle. In past conversations with @JSKenyon, he mentioned a hybrid strategy of some dask threads and some OpenMP/TBB threads working well. Perhaps the same might apply to the NIFTY gridder? Some support for this kind of parallelism was implemented for the current NIFTY dask wrappers (below the new code you've added) where the user can ask for N gridding streams.

This is not quite ThreadPool(N): it creates the graph in such a way that there are N serial gridding chains, rather than gridding each visibility chunk separately and summing the results in a parallel reduction.

If task annotations make it into dask, we'll have finer control over the scheduling of tasks, for example creating a dedicated gridding thread or a dedicated solver thread per dask worker on which only these operations may run.
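Roughly, the difference in graph shape looks like this (pure illustration with dask.delayed and a stubbed-out gridding call, not the actual wrapper code):

```python
import numpy as np
from dask import delayed

NX = NY = 512  # placeholder grid size

def grid_chunk(uvw, vis):
    # stand-in for gridding one visibility chunk onto its own dirty image
    return np.zeros((NX, NY))

def serial_stream(chunks):
    # one gridding chain: chunk contributions are accumulated in sequence,
    # so only one partial grid is alive per stream
    acc = delayed(np.zeros)((NX, NY))
    for uvw, vis in chunks:
        acc = delayed(np.add)(acc, delayed(grid_chunk)(uvw, vis))
    return acc

# toy chunks and N=2 streams; the streams run concurrently and their
# partial grids are summed once at the end, instead of producing one
# dirty image per chunk and reducing them all pairwise
chunks = [(np.random.randn(100, 3), np.random.randn(100, 4)) for _ in range(8)]
streams = [serial_stream(chunks[i::2]) for i in range(2)]
dirty = delayed(sum)(streams).compute()
```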

@mreineck

It most likely makes no difference, but I'd still like to point this out: the threads used within DUCC are plain POSIX threads, not OpenMP or TBB. Just so you are not surprised in case you are analysing the code with OpenMP-specific tools and don't find any parallelism ;-)

@sjperkins (Member)

It most likely makes no difference, but I'd still like to point this out: the threads used within DUCC are plain POSIX threads, not OpenMP or TBB. Just so you are not surprised in case you are analysing the code with OpenMP-specific tools and don't find any parallelism ;-)

Ah, I see you've implemented your own thread pools and schedulers in ducc0.

@mreineck

I was using OpenMP originally and was quite happy with it. However, when pypocketfft was integrated into scipy, we needed to ditch OpenMP, since it does not always interact well with Python's own multithreading (no idea which side is to blame here). Peter Bell, who has helped me a lot with pypocketfft, provided a pthread-based implementation which worked quite well, but originally only had "static" scheduling.

Later on I saw Andre Offringa's approach to multi-threading in wsclean, and from those two starting points I managed to put something together which is almost as powerful as OpenMP but relies only on standard C++ features. I'm now using it everywhere and so far haven't encountered any problems.

@landmanbester (Collaborator, Author)

Well, this is why I want to discuss the implementation here. Currently, we are mainly chunking up the data because it doesn't fit into memory. Doesn't a parallel reduction always come with a linear increase in memory? This is exactly what I want to avoid.

My reasons for using dask are twofold:
i) dask-ms is a very neat and useful interface to the measurement set that handles chunking in a way that makes the code far more transparent than it would be with clunky hand-written chunking logic. Truth be told, I don't even know how to achieve the same thing using pyrap directly (a small example of what I mean is sketched below).
ii) I really hope we are moving towards a distributed gridding implementation that completely avoids the disk (as would be required for resolve-type algorithms), and I think dask distributed is the way to go.
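For context, this is roughly all it takes on the dask-ms side (the MS name, columns and chunk sizes are placeholders; I think the default grouping is one dataset per FIELD_ID/DATA_DESC_ID):

```python
from daskms import xds_from_ms

# Each dataset holds lazily-loaded dask arrays, chunked along rows.
datasets = xds_from_ms("observation.ms",
                       columns=["UVW", "DATA", "FLAG"],
                       chunks={"row": 100_000})

for ds in datasets:
    uvw = ds.UVW.data    # dask array of shape (row, 3)
    vis = ds.DATA.data   # dask array of shape (row, chan, corr)
```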

Anyway, the wrapper around the degridder is doing something, but it's not doing it correctly. I see threads spinning up and data does end up in the measurement set, but the MS also grows by about a factor of six after running the predict. I suspect the logic here:

https://github.com/landmanbester/codex-africanus/blob/b67a29229f1dd36bcf4a8f25cf8489ef8efabd7c/africanus/gridding/nifty/dask.py#L400

@sjperkins, I'd appreciate it if you could have a look when you get a chance.

@sjperkins (Member) commented Jun 25, 2020

Well, this is why I want to discuss the implementation here. Currently, we are mainly chunking up the data because it doesn't fit into memory. Doesn't a parallel reduction always come with a linear increase in memory? This is exactly what I want to avoid.

It seems to be linear in the number of threads, at maximum. In fact, if a parallel reduction were used to sum the dirties produced by each chunk, I'd expect to see a memory profile similar to that of shadems, which also uses a parallel reduction: holoviz/datashader#926. In particular see this comment: ratt-ru/shadeMS#34 (comment) and ratt-ru/shadeMS#34 (comment). shadems plots are 1024**2, whereas it sounds like MeerKAT-sized grids are 8192**2 (but what about the w-planes ducc uses internally?).
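Back-of-the-envelope, the image footprints themselves are modest:

```python
# double-precision image footprints in MiB
shadems_plot = 1024**2 * 8 / 2**20   # 8 MiB
meerkat_grid = 8192**2 * 8 / 2**20   # 512 MiB

# a parallel reduction over T dask threads holds at most ~T such grids
# in flight, plus whatever padded internal grid ducc allocates per call
```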

The shadems images were part of a 3D cube with an image for each category, so I'm not even sure the data sizes are that different from each other. In any case, the visibility data dominated in my admittedly pen-and-paper exercises.

I also want to emphasise very strongly that shadems and parallel gridding are doing exactly the same thing in terms of the distributed algorithm, except that shadems plots the visibility data, while here it is gridded. Additionally, shadems must support more reduction cases because it supports plotting every axis against most other axes (correct me here, @o-smirnov).

My reasons for using dask are twofold:
i) dask-ms is a very neat and useful interface to the measurement set that handles chunking in a way that makes the code far more transparent than it would be with clunky hand-written chunking logic. Truth be told, I don't even know how to achieve the same thing using pyrap directly.

Cool, the intention has been for dask-ms to be an easy interface for the algorithm writer. I think that dask Arrays are also easy for the algorithm writer to work with.

ii) I really hope we are moving towards a distributed gridding implementation that completely avoids the disk (as would be required for resolve-type algorithms), and I think dask distributed is the way to go.

I agree. This is currently doable with the distributed Client interface, but it requires the user to manually place tasks on workers, rather than depending on the distributed scheduler to "do the right thing". As discussed in dask/dask#6217 (comment), I believe the more general approach is to annotate individual dask tasks with information that scheduler plugins (written by us) use to decide where to place tasks.
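Manual placement with the Client looks something like this (the scheduler and worker addresses, the grid_chunk function and the chunks list are placeholders):

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# pin the gridding of each chunk to a chosen worker by hand
futures = [client.submit(grid_chunk, uvw, vis,
                         workers=["tcp://worker-1:40000"])  # placeholder worker
           for uvw, vis in chunks]
dirty = client.submit(sum, futures).result()
```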

Anyway, the wrapper around the degridder is doing something, but it's not doing it correctly. I see threads spinning up and data does end up in the measurement set, but the MS also grows by about a factor of six after running the predict. I suspect the logic here:

https://github.com/landmanbester/codex-africanus/blob/b67a29229f1dd36bcf4a8f25cf8489ef8efabd7c/africanus/gridding/nifty/dask.py#L400

@sjperkins, I'd appreciate it if you could have a look when you get a chance.

Sure, will take a look.

@landmanbester (Collaborator, Author)

As far as I know, the gridder's memory requirements do not scale with the number of w-stacks (please correct me if I am wrong, @mreineck). @sjperkins, if you have suggestions on how to improve the wrapper I am all ears. My basic design principles are:

i) the disk is unavoidable when it comes to the predict, since the data don't fit into memory, but we can assume that the image does.
ii) I need an operator that produces a residual image and a chi-square value while touching the disk as little as possible. I don't need to save the intermediate visibilities to disk (in-place subtraction, read but no write), and I need to apply this operator at least O(10) times (roughly sketched below).
iii) I also need a way to write MODEL_DATA and RESIDUAL_VIS to the MS, but I only need to do that once, so speed is less of an issue.
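For (ii), this is the per-chunk computation I have in mind, give or take the exact ducc0 keywords (the model image, cell size and accuracy are placeholders); the residual images and chi-square values then simply accumulate over chunks:

```python
import numpy as np
from ducc0.wgridder import dirty2ms, ms2dirty

def residual_and_chisq(uvw, freq, vis, wgt, model, cell,
                       epsilon=1e-5, nthreads=8):
    # predict model visibilities for this chunk (nothing written to disk)
    model_vis = dirty2ms(uvw=uvw, freq=freq, dirty=model,
                         pixsize_x=cell, pixsize_y=cell,
                         epsilon=epsilon, do_wstacking=True,
                         nthreads=nthreads)
    # read the data, subtract, never write the result back
    resid_vis = vis - model_vis
    chisq = np.vdot(resid_vis, wgt * resid_vis).real
    # grid the weighted residual visibilities back onto the image
    nx, ny = model.shape
    resid_im = ms2dirty(uvw=uvw, freq=freq, ms=resid_vis, wgt=wgt,
                        npix_x=nx, npix_y=ny,
                        pixsize_x=cell, pixsize_y=cell,
                        epsilon=epsilon, do_wstacking=True,
                        nthreads=nthreads)
    return resid_im, chisq
```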

@mreineck

As far as I know, the gridder's memory requirements do not scale with the number of w-stacks (please correct me if I am wrong, @mreineck).

That's correct.

@sjperkins (Member)

@landmanbester Just to recap our zoom conversation:

  1. The paradigm we're working with is that dask Arrays are the interface between the data source/sink and the algorithm.
  2. For this reason, we shouldn't explicitly include dask-ms logic within codex-africanus algorithms.
  3. I'm happy to explore the idea of a wgridder object with dot and hdot methods that express a linear operator; pylops-distributed, for example, seems to be taking this approach. However, the linear operator methods should take dask Arrays as arguments and return dask Arrays.
  4. Otherwise, the ms2dirty and dirty2ms ducc methods can be wrapped with dask.array.blockwise calls representing dot and hdot for the gridder/degridder (sketched below).
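For example, the degrid direction might look something like this, assuming the image lives in a single chunk (keyword names follow ducc0's dirty2ms as I understand it, and cell/epsilon/nthreads are placeholders):

```python
import dask.array as da
import numpy as np
from ducc0.wgridder import dirty2ms

def _degrid_block(uvw, freq, image, cell, epsilon, nthreads):
    return dirty2ms(uvw=uvw, freq=freq, dirty=image,
                    pixsize_x=cell, pixsize_y=cell,
                    epsilon=epsilon, do_wstacking=True,
                    nthreads=nthreads)

def degrid(uvw, freq, image, cell, epsilon=1e-5, nthreads=8):
    # uvw: (row, 3), freq: (chan,), image: (nx, ny) dask arrays.
    # The image axes are contracted away, so each (row, chan) output
    # block sees the full image (hence concatenate=True).
    return da.blockwise(_degrid_block, ("row", "chan"),
                        uvw, ("row", "uvw"),
                        freq, ("chan",),
                        image, ("nx", "ny"),
                        cell, None,
                        epsilon, None,
                        nthreads, None,
                        concatenate=True,
                        dtype=np.complex128)
```

The gridding direction would similarly wrap ms2dirty and sum the per-block dirty images, which is where the serial-chain versus parallel-reduction trade-off above comes in.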

@landmanbester (Collaborator, Author)

I am not sure what is going wrong here. Some tests fail when I run them through pytest but not when I run them manually. Will keep digging

@mreineck

Do you use the auto_choice branch in CI? The ducc0 branch will fail if you feed it nu=nv=0.

@mreineck

Unfortunately GitHub refuses to show me the test logs...

@sjperkins (Member)

@landmanbester The test cases are failing on the base install-and-test (the one without optional packages). You need to guard your ducc0 imports and use the @requires_optional decorator (a sketch follows below). I'd imagine you'd also need to configure the GitHub workflow to install ducc0 for the complete install and test.

https://github.com/ska-sa/codex-africanus/blob/master/.github/workflows/ci.yml#L65-L72

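i.e. the usual codex-africanus pattern, roughly (decorator location and keyword usage are from memory, so double-check against the existing modules; the grid function and its arguments are placeholders):

```python
try:
    from ducc0.wgridder import ms2dirty
except ImportError as e:
    ducc_import_error = e
else:
    ducc_import_error = None

from africanus.util.requirements import requires_optional

@requires_optional("ducc0", ducc_import_error)
def grid(uvw, freq, vis, npix, cell):
    # only reachable when ducc0 imported successfully
    return ms2dirty(uvw=uvw, freq=freq, ms=vis,
                    npix_x=npix, npix_y=npix,
                    pixsize_x=cell, pixsize_y=cell,
                    epsilon=1e-5, do_wstacking=True, nthreads=1)
```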

@sjperkins (Member)

Unfortunately GitHub refuses to show me the test logs...

@mreineck Out of interest, can you see https://github.com/ska-sa/codex-africanus/pull/204/checks?check_run_id=971810643?

And the following?

https://github.com/ska-sa/codex-africanus/runs/971809780

I'm never sure what GitHub shows to maintainers compared to those with public access...

@landmanbester (Collaborator, Author)

Yes, sorry. I wasn't expecting the automated PR tests to pass. I mainly pushed it so I can discuss it with @sjperkins during the meeting tomorrow. Is there a way to push changes without triggering the tests or would I have to create a separate branch?

@mreineck I am using the master branch for now. I don't think it's anything in the gridder that's causing the tests to fail. They seem to fail on certain parametrisations under pytest, but when I run the tests manually with the same parameters they pass, so I'm probably using @pmp wrong or something.

@mreineck

Hi @sjperkins, I can open the pages you have linked, but as soon as I try to open the sub-sections (e.g. by clicking on "run base test suite") I get "We are currently unable to download the log. Please try again later.". So it looks more like a technical problem.

@mreineck

Ahh, my fault ... my uMatrix was too paranoid. Sorry for the noise!

@sjperkins (Member)

Yes, sorry. I wasn't expecting the automated PR tests to pass. I mainly pushed it so I can discuss it with @sjperkins during the meeting tomorrow. Is there a way to push changes without triggering the tests or would I have to create a separate branch?

Ah, don't worry if they're failing: it's only important to get the green ticks before we press the merge button, and the runs are short enough.

You can put "[skip ci]" in your commit messages, but that will only skip the "push" event. There's also a "pull_request" event, but for various reasons it just runs on every commit.

@landmanbester (Collaborator, Author)

Hmmm, not sure what's going wrong. I just ran the tests locally with ducc0 from PyPI and they all pass. The failures are all due to a lack of expected relative precision (1.5e-5 as opposed to 1e-5 at single precision). Not a big deal, I'll just loosen the tolerance, but it would be good to know what's going on.
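i.e. just loosening the tolerance in the comparisons, something like (the array names here are placeholders):

```python
from numpy.testing import assert_allclose

# single precision: allow a little more than 1e-5 relative error
assert_allclose(dask_dirty, reference_dirty, rtol=2e-5)
```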

@landmanbester (Collaborator, Author)

My efforts to allow for odd pixel sizes were unsuccessful, but this is low priority and I'll get to it eventually. I set some sensible defaults and changed the naming convention as discussed above. I removed the nu, nv options since these won't be available for long anyway. I think this is ready for a review when you have a sec, @sjperkins.

@sjperkins (Member) left a comment


Looks very good. I've some minor comments and suggested some typo corrections.

Review comments (all since resolved) on: africanus/gridding/wgridder/dask.py, africanus/gridding/wgridder/im2vis.py, africanus/gridding/wgridder/im2residim.py, africanus/gridding/nifty/dask.py, africanus/util/dask_util.py, africanus/util/testing.py
@sjperkins (Member)

Perhaps this can be a birthday PR ;-)

@mreineck commented Sep 4, 2020

That is a bit strange. These tests all pass locally on the ducc0 branch. Is the PyPI release up to date, @mreineck?

Not completely, but I'm not aware of a bug fix since the last release. I'll try to make another release soon, but before that I have to apply another patch that fixes a rare source of deadlocks with the new thread pool implementation. Might take a few days.

@landmanbester (Collaborator, Author)

I don't recall making the changes to africanus/util or gridding/nifty, but I've implemented the suggestions. I see numba is still pinned to <= 0.49 in setup.py, maybe a result of merging in master when the perleypolyhedron gridder went in. I take it we don't want to keep it as is?

@sjperkins changed the title from "WIP: chunked w-gridder with dask-ms and ducc" to "Chunked w-gridder with dask-ms and ducc" on Sep 6, 2020
@mreineck
FYI: I just released ducc0 0.6, which contains the promised memory reductions; it's a bit faster as well!

@sjperkins (Member)

FYI: I just released ducc0 0.6, which contains the promised memory reductions; it's a bit faster as well!

Thanks @mreineck. I've opened #221 to track this.
