.. index:: blocks, batch jobs

Blocks
======

In the task overview, I assumed that process worker pools magically existed in the right place: on the local machine with the example configuration, but on HPC worker nodes when running a more serious workflow.

The theme of this section is: how to get those process worker pools running on the nodes where we want to do the work.

The configurations for blocks are usually the most non-portable pieces of a Parsl workflow, because they are closely tied to the behaviour of particular HPC machines: this part of the configuration describes what an HPC machine looks like (at least, as much as Parsl needs to know), and so the descriptions will be different for different machines. That makes it one of the most useful areas for admins and users to contribute documentation: for example, the Parsl user guide has a section with configurations for different machines, and ALCF and NERSC both maintain their own Parsl examples.

.. index:: pilot jobs

Pilot jobs
----------

Not all executors need workers - for example, the ThreadPoolExecutor runs tasks within the main workflow process. But the HighThroughputExecutor shown in the previous section, and also the Work Queue and Task Vine executors, use workers running (for example) on the worker nodes of an HPC system. This is also known as the pilot job model.

We don't need to describe the work to be performed by the workers (at least, not much), because once the workers are running they'll get their own work from the interchange, as I talked about in the previous section. That's another feature of the pilot job model.

Why are we doing things this way when an HPC system already has a system for running jobs (such as Slurm)? Because the overhead on that kind of job can be very big - those systems are targeted more at the scale of "run one job that uses 1000 nodes for 24 hours" but tasks in Parsl might be subsecond: even getting a new Python process started to run that subsecond task could be a prohibitive overhead.

As I mentioned above, most HPC systems have batch job systems that prefer big submissions (in relation to the average Parsl task), and that includes a preference for batch jobs that use many nodes: for example, some systems offer a discount for batch jobs over a certain node count, incentivising the use of a small number of many-node batch jobs, even though a pilot job workload could sometimes be scheduled more efficiently as a large number of smaller batch jobs.

In Parsl, the terminology separates batch jobs, which are the units of work submitted to a batch system and correspond to blocks of workers, from tasks, which correspond to individual app invocations (mentioned in the previous chapter). These are different things, and there is no pre-planned allocation of which task will run inside which batch job, because worker pools running inside jobs pull tasks as they are ready for more work.

.. index:: providers
           plugins; providers

Starting a block of workers
----------------------------

Assume we're using an executor configured to use Parsl scaling (it doesn't have to be this way).

Parsl is going to decide it wants some workers (how and why, see the upcoming scaling section). For now assume it has decided it wants some.

.. index:: blocks

The unit of allocation of workers in Parsl is called a block.

This translates into whatever the underlying batch system uses for allocating workers: for example, with traditional supercomputing systems, the SlurmProvider will make 1 block = 1 Slurm batch job; the KubernetesProvider will make 1 block = 1 pod; and with the LocalProvider, there is no meaningful allocation of jobs, and the provider will run the worker directly on the local machine.
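
As a rough illustration (names and values here are placeholders, not recommendations), the choice of provider is what determines what a block turns into. Here is a sketch of a Slurm-based configuration, where each block that Parsl decides to start becomes one Slurm batch job:

.. code-block:: python

    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.providers import SlurmProvider

    config = Config(
        executors=[
            HighThroughputExecutor(
                label="htex_slurm",
                provider=SlurmProvider(      # 1 block = 1 Slurm batch job
                    partition="compute",     # placeholder partition name
                    nodes_per_block=2,       # each Slurm job asks for 2 nodes
                    walltime="01:00:00",
                ),
            )
        ]
    )

Swapping ``SlurmProvider`` for, say, ``LocalProvider`` changes what a block means (a worker pool started directly on the local machine) without changing the rest of the workflow.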

The base class for all providers is ``ExecutionProvider``: https://github.com/Parsl/parsl/blob/3f2bf1865eea16cc44d6b7f8938a1ae1781c61fd/parsl/providers/base.py#L11

As far as getting a new block of workers running, this is the most important method that a concrete provider must implement:

.. code-block:: python

    @abstractmethod
    def submit(self, command: str, tasks_per_node: int, job_name: str = "parsl.auto") -> object:
        ...

The key argument here is ``command``. This will (after some mutation) be the Unix shell command, generated by some other part of Parsl, that should be run on each allocated node to start the workers.
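
To make that concrete, here is a deliberately simplified, hypothetical sketch (not Parsl's actual SlurmProvider code) of what a Slurm-flavoured ``submit`` might do with that command: wrap it in a batch script and hand it to ``sbatch``.

.. code-block:: python

    import subprocess
    import tempfile


    def toy_submit(command: str, job_name: str = "parsl.auto") -> str:
        """Hypothetical sketch: wrap the worker-launching command in a Slurm
        batch script, submit it, and return the job id as the job handle."""
        script = (
            "#!/bin/bash\n"
            f"#SBATCH --job-name={job_name}\n"
            f"{command}\n"  # the worker pool start command supplied by Parsl
        )
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(script)
            script_path = f.name
        # sbatch --parsable prints just the job id on success
        result = subprocess.run(
            ["sbatch", "--parsable", script_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()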

In the HighThroughputExecutor, this command is formed like this at executor start:

.. todo:: source code

and then the provider is invoked with it here:

.. todo:: source code

In the Task Vine executor, something similar happens at line TODO and line TODO (hrefs)

.. todo:: line numbers / source code link


.. warning::

   ``tasks_per_node`` is always 1 here when called by Parsl, and should perhaps be removed. It is a vestige of an earlier time when Parsl wanted the batch system to start multiple workers on each worker node (for the long-removed IPyParallel executor). More recent executors - the HighThroughputExecutor, the Work Queue and Task Vine executors, and the MPIExecutor - choose to manage (in different ways) how work is performed on a particular node, rather than asking the batch system for a particular fixed number of workers.

Maybe more interesting here is what is missing from the submit call: there is no mention of batch system queues, no mention of how many nodes to request in this block, no mention of pod image identifiers. Attributes like that are usually the same for every block submitted by the provider, and usually only make sense in the context of whatever the underlying batch system is: for example, a Slurm job might have a queue specification, and a Kubernetes job might have a persistent volume specification, to be set on all jobs. These are defined in the initializer for each provider, so the provider API doesn't need to know about these specifics at all.
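
For example (option names and values here are placeholders, and the available options vary between Parsl versions), those batch-system-specific settings are given when the provider is constructed, and then apply to every block it submits:

.. code-block:: python

    from parsl.providers import KubernetesProvider, SlurmProvider

    # Every block submitted by this provider goes to the same partition/account.
    slurm = SlurmProvider(
        partition="compute",                            # placeholder partition
        account="my_allocation",                        # placeholder account
        scheduler_options="#SBATCH --constraint=cpu",   # extra batch script lines
    )

    # Every block submitted by this provider becomes a pod using the same image.
    kube = KubernetesProvider(
        image="python:3.12",                            # placeholder container image
    )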

.. index:: launchers
           plugins; launchers

Launchers
----------

.. index:: mpirun, srun, mpiexec

Some batch systems separate allocation of worker nodes from execution of commands on worker nodes. In non-Parsl contexts, that looks like: you write a batch script and submit it to Slurm or PBS, and inside that batch script you prefix your application command line with something like mpiexec or srun, which causes your application to run on all the worker nodes. Without that prefix, the command would run on a single node (sometimes not even in the batch allocation!).

To support this, some providers take a launcher parameter, which understands how to put that prefix onto the front of the relevant command. They're mostly quite simple.

All of the included launchers live in ``parsl.launchers.launchers`` and usually consist of shell scripting around something like mpiexec or srun.
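
As an illustration (assuming a Slurm machine where ``srun`` is the right prefix), selecting a launcher is just another provider option; the launcher decides what gets prefixed onto the worker start command:

.. code-block:: python

    from parsl.launchers import SingleNodeLauncher, SrunLauncher
    from parsl.providers import SlurmProvider

    # Prefix the worker command with srun, so it starts on every node of the block.
    multi_node = SlurmProvider(
        partition="compute",      # placeholder partition name
        nodes_per_block=4,
        launcher=SrunLauncher(),
    )

    # By contrast, SingleNodeLauncher (the default for several providers) runs the
    # command on a single node of the allocation.
    single_node = SlurmProvider(
        partition="compute",
        nodes_per_block=1,
        launcher=SingleNodeLauncher(),
    )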

Who starts processes?
~~~~~~~~~~~~~~~~~~~~~~

.. todo:: A paragraph covering the following: in traditional HPC workloads, this launcher command is often responsible for starting multiple copies of your code on the same node - so if you wanted 24 cores used for an MPI code, you might use mpirun (TODO: processes_per_node param) to start 24 copies which would run in parallel. This is not how things work with Parsl block workers: both the process worker pool and the Work Queue / Task Vine equivalents usually manage all the tasks on a node from a single worker. So if you're feeling the temptation to make your launcher launch multiple copies of the pilot job worker, maybe something else is going wrong?

   Note that this is a common problem in modern times, also with OpenMP, where multiple layers of software each think *they* are the one that should spawn multiple processes or threads, leading to an exponential explosion of threads. That doesn't necessarily kill your workload, but it can lead to mysterious performance problems. This section should also consider *user apps* that make the same assumption (so easily three layers to draw diagrams about!).

.. index:: pair: scaling; strategy

Choosing when to start or end a block
--------------------------------------

Parsl has some scaling code that starts and ends blocks as the task load presented by a workflow changes.

There are three scaling strategies, which run (by default) every 5 seconds.

There are three strategy parameters defined on providers which are used by the scaling strategy: ``init_blocks``, ``min_blocks`` and ``max_blocks``. Broadly, at the start of a run, Parsl will launch an initial number of blocks (``init_blocks``) and then scale between a minimum (``min_blocks``) and maximum (``max_blocks``) number of blocks during the run.
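
As a sketch (values are illustrative only), these parameters sit on the provider, while the strategy itself is selected on the ``Config``:

.. code-block:: python

    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.providers import SlurmProvider

    config = Config(
        strategy="simple",  # or "none" or "htex_auto_scale"
        executors=[
            HighThroughputExecutor(
                label="htex_slurm",
                provider=SlurmProvider(
                    partition="compute",  # placeholder partition name
                    nodes_per_block=1,
                    init_blocks=1,    # blocks started at the beginning of the run
                    min_blocks=0,     # never scale in below this many blocks
                    max_blocks=10,    # never scale out beyond this many blocks
                ),
            )
        ],
    )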

The init-only strategy: none
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This strategy only makes use of the init_blocks configuration parameter. At the start of a workflow, it starts the specified number of blocks. After that it does not try to start any more blocks.

.. warning::

   Question: what happens if all of these initially started blocks terminate before all of the workflow's work is completed?

The simple strategy
~~~~~~~~~~~~~~~~~~~~

This strategy will add more blocks when it sees that there are not enough workers.

When an executor becomes completely idle for some time, this strategy will cancel all of its blocks. Even one task on the executor will inhibit cancellation - the history of this is that for abstract block-using executors, there is nothing to identify which blocks (if any) are idle. So scale-out and scale-in are not symmetric operations in that sense.

The scaling calculation looks at the number of tasks outstanding and compares it to the number of worker slots that are either running now or queued to be run.

There is a ``parallelism`` parameter, defined on providers, that allows users to control the ratio of tasks to workers - by default this is 1, so Parsl will try to submit blocks to provide as many worker slots as there are tasks. This does not assign tasks to particular workers: so it is common for one block to start up and have a lot of the outstanding work processed by that block, before a second block starts which then sits completely idle.
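
Very roughly - this is a sketch of the idea, using a hypothetical helper function, not the actual strategy code - the scale-out decision is an arithmetic comparison along these lines:

.. code-block:: python

    import math


    def blocks_to_add(outstanding_tasks: int,
                      active_blocks: int,
                      slots_per_block: int,
                      parallelism: float = 1.0,
                      max_blocks: int = 10) -> int:
        """Hypothetical sketch of a simple-style scale-out calculation: ask for
        enough blocks that the number of worker slots roughly matches
        parallelism * outstanding tasks, capped at max_blocks."""
        wanted_slots = math.ceil(outstanding_tasks * parallelism)
        wanted_blocks = min(math.ceil(wanted_slots / slots_per_block), max_blocks)
        return max(0, wanted_blocks - active_blocks)


    # e.g. 25 outstanding tasks, 1 block already running, 8 slots per block:
    # 25 tasks want ceil(25/8) = 4 blocks in total, so 3 more get requested.
    print(blocks_to_add(25, 1, 8))  # -> 3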

.. warning::

   Question: what does ``init_blocks`` mean in this context? Start ``init_blocks`` blocks and then immediately scale (up or down) to the needed number of blocks?

.. index:: htex_auto_scale
           High Throughput Executor; htex_auto_scale

The htex_auto_scale strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is like the simple strategy for scale-out, but with better scale-in behaviour that makes use of some High Throughput Executor features: the High Throughput Executor knows which blocks are empty, so when there is scale-in pressure it can scale in empty blocks while leaving non-empty blocks running. Some prototype work has been done to make htex drain blocks to empty faster too, but that has not reached the production codebase.

.. warning::

   .. todo:: Reference the block draining problem and Matthew's work.

      What should link here? If more of that work is merged into Parsl, or exists as a PR (I think there is a PR?), then the PR can be linked; otherwise, perhaps a SuperComputing 2024 publication later on - but that is still unknown.

Starting workers in other ways
-------------------------------

You can start workers without using this automated scaling: set ``init_blocks = min_blocks = max_blocks = 0``, then find the worker command line in the log file and run it yourself in whichever situation you want. This is good for trying out things that the provider or scaling code can't do.

The Work Queue and Task Vine executors also have their own executor-specific ways of starting workers: Work Queue has a worker factory command line tool, and Task Vine has a worker launch method configuration parameter.

Block error handling
---------------------

.. todo:: write error handling section (as two parts of the same feedback loop)

Worker environments
--------------------

.. todo:: Batch job environments (especially ``worker_init``) - think about Parsl requirements a bit more: Python versions, Parsl versions, installed user packages. Forward reference the serialization chapter.

Batch job systems generally won't make the environment that your batch job provides look like the environment the submission comes from. In the case of, for example, Kubernetes, that's very deliberate: the job description describes the environment, not whatever ambient environment exists around the submission command. So there's a bit of tension when you want the environment to magically look like your submission environment.

Generally, the Python and Parsl versions need to be the same as on the submit side (although people often push on this limit, and the serialization chapter will give some hints about understanding what can go wrong).
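
On the provider side, a common knob for this (mentioned in the note above) is ``worker_init``, a shell fragment run inside the batch job before the workers start. A sketch, with placeholder module and environment names:

.. code-block:: python

    from parsl.providers import SlurmProvider

    provider = SlurmProvider(
        partition="compute",  # placeholder partition name
        worker_init=(
            "module load python/3.12; "      # placeholder module name
            "source activate my-parsl-env"   # environment with a matching Parsl version
        ),
    )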