Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

Open
2 tasks done
cmadam opened this issue Nov 16, 2024 · 1 comment
Open
2 tasks done
Labels
bug Something isn't working

Comments

@cmadam
Copy link
Collaborator

cmadam commented Nov 16, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows, Library/core, Transforms/universal/fdedup

What happened + What you expected to happen

I am not able to run a KFP pipeline for fuzzy dedup that creates 102 actors with the following configuration:

{
  "cpu": 8,
  "image": "us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5",
  "imagePullPolicy": "Always",
  "image_pull_secret": "****-***-***-**",
  "max_replicas": 35,
  "memory": 60,
  "min_replicas": 35,
  "replicas": 35
}

The error I am getting while trying to run the pipeline is:

(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Cluster resources: {'cpus': 280, 'gpus': 0, 'memory': 2180.0, 'object_store': 652.3036788804457}
(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Number of workers - 102 with {'num_cpus': 2.3333333333333335, 'max_restarts': -1} each
(orchestrate pid=382, ip=172.17.29.195) Traceback (most recent call last):
(orchestrate pid=382, ip=172.17.29.195)   File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=382, ip=172.17.29.195)     processors = RayUtils.create_actors(
(orchestrate pid=382, ip=172.17.29.195)   File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/ray_utils.py", line 121, in create_actors
(orchestrate pid=382, ip=172.17.29.195)     raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=382, ip=172.17.29.195) data_processing.utils.unrecoverable.UnrecoverableException: out of 102 created actors only 100 alive
(orchestrate pid=382, ip=172.17.29.195) 11:03:06 ERROR - Exception during execution out of 102 created actors only 100 alive: None

Reproduction script

Compile a KFP pipeline using this code commit: https://github.com/IBM/data-prep-kit/tree/4941d5bab37a0bdc1e5873ce8e7288483703751f

cd transforms/universal/fdedup/kfp_ray
make workflow-venv
source ../../../venv/bin/activate
export FDEDUP_IMAGE_LOCATION=us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5
export FDEDUP_IMAGE_PULL_SECRET="****-***-***-**"
python fdedup_wf.py

Upload the pipeline in OCP cluster. Run the pipeline with the worker config mentioned above:

{
  "cpu": 8,
  "image": "us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5",
  "imagePullPolicy": "Always",
  "image_pull_secret": "****-***-***-**",
  "max_replicas": 35,
  "memory": 60,
  "min_replicas": 35,
  "replicas": 35
}

Anything else

This problem happens every time I am attempting to start a fuzzy dedup pipeline. If I decrease the number of actors to be less than 100 (e.g. 99), the pipeline runs every time.

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@cmadam cmadam added the bug Something isn't working label Nov 16, 2024
@blublinsky
Copy link
Collaborator

blublinsky commented Nov 18, 2024

As we discussed before, the issue is the amount of resources. The amount of cpus that you have - 280 is not continious, the actor allocation is based or the amount of cpus per node, as the allocation is happening on the individual nodes. A node itself needs to have resources for its management, etc. You are leaving 1 cpu per node for this, which is not enough for some of nodes (we have no control over resource utilization of Ray itself). I was using .85 which gives you 6.8 cpu per node available for user workload, not 7 as you assume. so as a result some of the nodes do not have 7 CPUs available for actor allocation as they are running additional DPK things - transform statistics, orchestrator, etc.
General recommendations based on experience:

  1. Larger nodes are better, I would suggest using at least 16 CPU, 256 memory per node.
  2. I would suggest not to use explicit number of workers, but rather rely on computations provided by project
  3. If you still want to use manual number of workers overwrite, assuming you have 16 cpus per node you can use roughly 13 cpus per node for actors. Assuming actor CPU is 2, its 6 actors per node. With for example 20 nodes you can use 120 actors. Anything above that is questionable.

To get a full picture, I would suggest to remove an exit handler (Revital can help you with this) and then look at the cluster usage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants