[BUG]: Cannot run pipeline.py in personal_copilot folder #38

Open
tahababou12 opened this issue Dec 5, 2024 · 0 comments

OS: 8 GB Memory / 4 AMD vCPUs / 160 GB Disk / NYC3 - Ubuntu 24.10 x64

(run from the personal_copilot/dataset_generation/ directory)

python pipeline.py

ERROR:

(venv) ➜  dataset_generation git:(main) ✗ python pipeline.py
2024-12-05 22:55:37,660 [DEBUG] Data folder 'hf_public_repos' found.
2024-12-05 22:55:37,661 [DEBUG] Using selector: EpollSelector
2024-12-05 22:55:37,919 [DEBUG] Found 14758 files under '/home/taha/LLM-Workshop/personal_copilot/dataset_generation/hf_public_repos' with pattern '**/*'.
2024-12-05 22:55:37,923 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/executor.json
2024-12-05 22:55:39,573 [DEBUG] Using selector: EpollSelector
2024-12-05 22:55:39,659 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00000.log
2024-12-05 22:55:39,721 [DEBUG] Using selector: EpollSelector
2024-12-05 22:55:39.729 | INFO     | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=0
2024-12-05 22:55:39.729 | INFO     | datatrove.utils.logging:log_pipeline:90 - 
--- 🛠️ PIPELINE 🛠
📖 - READER: 👾 PersonalCopilot
🔻 - FILTER: 🧑🏽‍💻 Code Filter
💽 - WRITER: 🐿 Jsonl
2024-12-05 22:55:39,738 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00001.log
2024-12-05 22:55:39.729 | ERROR    | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [24, 25, 26, 27, 28, 29]
           │          └ 8
           └ <function _serve_one at 0x75acdb9c3920>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 4
           │     │     └ 8
           │     └ <function _main at 0x75acdb9c2b60>
           └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 4
           │    └ <function BaseProcess._bootstrap at 0x75acdbddf420>
           └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x75acdbdde980>
    └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x75acd6f40500>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6b620>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
    │    └ <function worker at 0x75acd6f751c0>
    └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (0,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 0
           │    │             └ 0
           │    └ <function PipelineExecutor._run_for_rank at 0x75acda55ec00>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdbd50d70>
> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
                     │             │               │     │    └ <property object at 0x75acda5437e0>
                     │             │               │     └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdbd50d70>
                     │             │               └ 0
                     │             └ <generator object BaseDiskReader.run at 0x75acdbde19c0>
                     └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
           │    │   │     │     └ 16
           │    │   │     └ 0
           │    │   └ <generator object BaseDiskReader.run at 0x75acdbde19c0>
           │    └ <function BasicCodeFilter.run at 0x75acd6f25f80>
           └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter

TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
2024-12-05 22:55:39.764 | ERROR    | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [14, 15, 18, 19, 20, 21]
           │          └ 8
           └ <function _serve_one at 0x75acdb9c3920>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 4
           │     │     └ 8
           │     └ <function _main at 0x75acdb9c2b60>
           └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 4
           │    └ <function BaseProcess._bootstrap at 0x75acdbddf420>
           └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x75acdbdde980>
    └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x75acd78b2570>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6ba10>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
    │    └ <function worker at 0x75acd6f75260>
    └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (1,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 1
           │    │             └ 1
           │    └ <function PipelineExecutor._run_for_rank at 0x75acda55eca0>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdb982480>
> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
                     │             │               │     │    └ <property object at 0x75acda5437e0>
                     │             │               │     └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdb982480>
                     │             │               └ 1
                     │             └ <generator object BaseDiskReader.run at 0x75acdbde19c0>
                     └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
           │    │   │     │     └ 16
           │    │   │     └ 1
           │    │   └ <generator object BaseDiskReader.run at 0x75acdbde19c0>
           │    └ <function BasicCodeFilter.run at 0x75acd6f26020>
           └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter

TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
2024-12-05 22:55:39,769 [DEBUG] Using selector: EpollSelector
2024-12-05 22:55:39.784 | INFO     | datatrove.executor.local:_launch_run_for_rank:81 - 1/16 tasks completed.
2024-12-05 22:55:39.788 | INFO     | datatrove.executor.local:_launch_run_for_rank:81 - 2/16 tasks completed.
2024-12-05 22:55:39,794 [DEBUG] Using selector: EpollSelector
2024-12-05 22:55:39,811 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00003.log
2024-12-05 22:55:39,824 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00004.log
2024-12-05 22:55:39,837 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00002.log
2024-12-05 22:55:39.834 | ERROR    | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [24, 25, 26, 27, 28, 29]
           │          └ 8
           └ <function _serve_one at 0x75acdb9c3920>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 4
           │     │     └ 8
           │     └ <function _main at 0x75acdb9c2b60>
           └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 4
           │    └ <function BaseProcess._bootstrap at 0x75acdbddf420>
           └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x75acdbdde980>
    └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x75acd6f40500>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6b620>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
    │    └ <function worker at 0x75acd6f751c0>
    └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (3,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 2
           │    │             └ 3
           │    └ <function PipelineExecutor._run_for_rank at 0x75acda55ec00>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acd78b37d0>
> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
                     │             │               │     │    └ <property object at 0x75acda5437e0>
                     │             │               │     └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acd78b37d0>
                     │             │               └ 3
                     │             └ <generator object BaseDiskReader.run at 0x75acd6fcc7b0>
                     └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
           │    │   │     │     └ 16
           │    │   │     └ 3
           │    │   └ <generator object BaseDiskReader.run at 0x75acd6fcc7b0>
           │    └ <function BasicCodeFilter.run at 0x75acd6f25f80>
           └ 🔻 - FILTER: 🧑🏽‍💻 Code Filter

TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
2024-12-05 22:55:39.843 | ERROR    | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [14, 15, 18, 19, 20, 21]
           │          └ 8
           └ <function _serve_one at 0x75acdb9c3920>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 4
           │     │     └ 8
           │     └ <function _main at 0x75acdb9c2b60>
           └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 4
           │    └ <function BaseProcess._bootstrap at 0x75acdbddf420>
           └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x75acdbdde980>
    └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon>
    │    │        │    └ (<multiprocess.que
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 109, in _run_for_rank
    raise e
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank
    pipelined_data = pipeline_step(pipelined_data, rank, self.world_size)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__
    return self.run(data, rank, world_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/taha/LLM-Workshop/personal_copilot/dataset_generation/pipeline.py", line 118, in <module>
    run_code_dataset_generation()
  File "/home/taha/LLM-Workshop/personal_copilot/dataset_generation/pipeline.py", line 104, in run_code_dataset_generation
    print(executor_0.run())
          ^^^^^^^^^^^^^^^^
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 133, in run
    stats = list(
            ^^^^^
  File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 873, in next
    raise value
TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
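
The traceback shows the pipeline base class calling `self.run(data, rank, world_size)`, while `BasicCodeFilter.run()` apparently accepts only `data`. A minimal sketch of that mismatch and one possible fix follows. `StepBase`, `BrokenFilter`, `FixedFilter`, and `try_step` are hypothetical stand-ins written for illustration; they are not datatrove's code or the workshop's actual `BasicCodeFilter`, and the blank-document check is just a placeholder for the real filtering logic.

```python
from typing import Iterable, Iterator


class StepBase:
    """Stand-in for the pipeline step base class (an assumption, not datatrove's code)."""

    def __call__(self, data: Iterable[str], rank: int = 0, world_size: int = 1):
        # Mirrors the call shown in the traceback: self.run(data, rank, world_size)
        return self.run(data, rank, world_size)


class BrokenFilter(StepBase):
    # run() only accepts `data`, so the three-argument call above fails.
    def run(self, data: Iterable[str]) -> Iterator[str]:
        for doc in data:
            if doc.strip():  # placeholder rule: drop blank documents
                yield doc


class FixedFilter(StepBase):
    # run() also accepts rank and world_size, even though this filter ignores them.
    def run(self, data: Iterable[str], rank: int = 0, world_size: int = 1) -> Iterator[str]:
        for doc in data:
            if doc.strip():  # same placeholder rule as above
                yield doc


def try_step(step: StepBase, docs: Iterable[str]):
    """Run a step and return its output, or the TypeError message if the call fails."""
    try:
        return list(step(docs))
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    docs = ["print('hi')", "   ", "x = 1"]
    print(try_step(BrokenFilter(), docs))
    print(try_step(FixedFilter(), docs))
```

If the workshop's filter overrides `run()` directly, widening its signature to `run(self, data, rank=0, world_size=1)` as in `FixedFilter` should be enough to get past this error. Whether that matches the datatrove version the workshop was written against is not something the log confirms.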