Unable to use Prefect with stateful flows #15806
---
I am currently bumping against an apparent limitation with stateful flows. Consider the following example:

```python
import numpy as np
import threading

from prefect import flow, task
from prefect.futures import as_completed
from prefect_dask import DaskTaskRunner
from prefect.task_runners import ThreadPoolTaskRunner


@task
def work(arg):
    print(f"Starting task in a function with argument: {arg}")
    return arg


class MainObj:
    def __init__(self):
        # This object should not be pickled
        self.large_object = np.random.random((10000, 10000))
        self.lock = threading.Lock()
        self.counter = 0

    # @flow(task_runner=ThreadPoolTaskRunner)
    @flow(task_runner=DaskTaskRunner())
    def run(self):
        futures = [work.submit(i) for i in range(3)]
        for fut in as_completed(futures):
            res = fut.result()
            self.process_result(res)

    def process_result(self, res):
        print(f"Task finished with {res}")
        self.counter += res


if __name__ == "__main__":
    obj = MainObj()
    print(f"Counter is {obj.counter}")
    obj.run()
    print(f"Counter is {obj.counter}")
```

Here, the flow needs to update the `counter` attribute of the `MainObj` instance as task results come in. The code works with `ThreadPoolTaskRunner` (the commented-out decorator above).
However, it fails with `DaskTaskRunner`.
This is presumably because Dask has to serialize the flow, and therefore `self`, which holds the large array and a `threading.Lock`? If so, then it's reasonable that the code doesn't work: the task runner cannot update the state of an object that lives in the main process.
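As a sanity check, independent of Prefect, an object holding a `threading.Lock` cannot be pickled at all. A minimal illustration (using a stripped-down stand-in for `MainObj`):

```python
import pickle
import threading


class Stateful:
    """Stripped-down stand-in for MainObj: just the lock and the counter."""

    def __init__(self):
        self.lock = threading.Lock()
        self.counter = 0


try:
    pickle.dumps(Stateful())
except TypeError as exc:
    # Raises something like: cannot pickle '_thread.lock' object
    print(exc)
```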
Is there any way to use Prefect in this case? Any insight would be very much appreciated!
---
would something like this work for your case?

example:

```python
# /// script
# dependencies = [
#     "numpy",
#     "prefect-dask",
# ]
# ///
import time

import dask.array as da

from prefect import flow, task
from prefect.futures import as_completed
from prefect_dask import DaskTaskRunner


@task
def process_chunk(chunk_data, chunk_id: int):
    print(f"Processing chunk {chunk_id} with shape: {chunk_data.shape}")
    time.sleep(1)  # Simulate work
    # Calculate mean absolute value - naturally some will be above 0.5
    result = abs(chunk_data).mean().compute()
    print(f"Chunk {chunk_id} result: {result:.4f}")
    return {"chunk_id": chunk_id, "result": result}


@task
def process_followup(previous_result: float):
    print(f"Starting followup for result: {previous_result}")
    time.sleep(3)  # Longer sleep to make it obvious
    print(f"Completed followup for result: {previous_result}")
    return previous_result * 0.5


@flow(task_runner=DaskTaskRunner(), log_prints=True)
def process_large_dataset():
    large_array = da.random.random((3000, 3000), chunks=(1000, 1000)) * 2 - 1
    print(
        f"Created array with shape {large_array.shape} and chunks {large_array.chunks}"
    )

    futures = []
    for i in range(large_array.numblocks[0]):
        chunk = large_array[i * 1000 : (i + 1) * 1000, :]
        futures.append(process_chunk.submit(chunk, i))

    # Track all results including followups
    all_results = {}
    followup_futures = []  # Separate list for followups

    print("Starting to process results...")
    for future in as_completed(futures):
        result = future.result()
        all_results[result["chunk_id"]] = result["result"]
        print(f"Processed chunk {result['chunk_id']}")

        if result["result"] > 0.5:
            print(f"Spawning followup task for chunk {result['chunk_id']}")
            followup_future = process_followup.submit(result["result"])
            followup_futures.append(followup_future)

    print(
        f"Main processing complete. Waiting for {len(followup_futures)} followup tasks..."
    )

    # Explicitly wait for followups
    for future in as_completed(followup_futures):
        result = future.result()
        print(f"Completed followup with result: {result:.4f}")

    return all_results


if __name__ == "__main__":
    print(sorted(process_large_dataset().items()))
```
here we are! feel free to let me know if there's some nuance where prefect specifically is getting in the way
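the main thing the sketch above is doing is keeping all of the mutable state (`all_results`, `followup_futures`) in plain locals inside the flow function, so nothing outside the flow ever has to be pickled and shipped to the dask workers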
---
I recently stumbled upon Covalent, which works in the desired manner:

```python
import numpy as np
import threading
import time

import covalent as ct


@ct.electron
def work(arg):
    print(f"Starting task in a function with argument: {arg}")
    time.sleep(1)
    return arg


class MainObj:
    def __init__(self):
        # This object should not be pickled
        self.large_object = np.random.random((10000, 10000))
        self.lock = threading.Lock()
        self.counter = 0

    def run(self):
        dask_executor = (
            ct.executor.DaskExecutor()
        )  # Not actually necessary, since Dask is also the default executor
        futures = [
            ct.dispatch(ct.lattice(work, executor=dask_executor))(i) for i in range(3)
        ]
        for fut in futures:
            wresult = ct.get_result(dispatch_id=fut, wait=True)
            res = wresult.get_node_result(node_id=0)["output"].get_deserialized()
            self.process_result(res)

    def process_result(self, res):
        print(f"Task finished with {res}")
        self.counter += res


if __name__ == "__main__":
    obj = MainObj()
    print(f"Counter is {obj.counter}")
    obj.run()
    print(f"Counter is {obj.counter}")
```

The main difference here is that `run` is not itself decorated as a workflow: only `work` is dispatched to the executor, so the `MainObj` instance (with its large array and lock) stays in the main process and never needs to be serialized.
hmm, I'm not sure I understand your intended use of `threading.Lock` in this context.
if the above represents what you're trying to do, what about this? i.e. just call your dask work (instance method or not) from a flow
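for example, something roughly like this (just a sketch: `run_work` is an illustrative name, and `MainObj`/`work` are borrowed from your example; the flow is a plain module-level function, so `self` never has to be pickled):

```python
import threading

import numpy as np

from prefect import flow, task
from prefect.futures import as_completed
from prefect_dask import DaskTaskRunner


@task
def work(arg):
    print(f"Starting task in a function with argument: {arg}")
    return arg


@flow(task_runner=DaskTaskRunner())
def run_work(n: int) -> list:
    # only plain ints and the task results cross the Dask boundary
    futures = [work.submit(i) for i in range(n)]
    return [fut.result() for fut in as_completed(futures)]


class MainObj:
    def __init__(self):
        self.large_object = np.random.random((10000, 10000))  # never leaves this process
        self.lock = threading.Lock()
        self.counter = 0

    def run(self):
        # stateful updates happen here, in the main process
        for res in run_work(3):
            with self.lock:
                self.counter += res


if __name__ == "__main__":
    obj = MainObj()
    print(f"Counter is {obj.counter}")
    obj.run()
    print(f"Counter is {obj.counter}")
```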
docs on this