OS: 8 GB Memory / 4 AMD vCPUs / 160 GB Disk / NYC3 - Ubuntu 24.10 x64
(in personal_copilot/ directory)
python pipeline.py
ERROR:
(venv) ➜ dataset_generation git:(main) ✗ python pipeline.py 2024-12-05 22:55:37,660 [DEBUG] Data folder 'hf_public_repos' found. 2024-12-05 22:55:37,661 [DEBUG] Using selector: EpollSelector 2024-12-05 22:55:37,919 [DEBUG] Found 14758 files under '/home/taha/LLM-Workshop/personal_copilot/dataset_generation/hf_public_repos' with pattern '**/*'. 2024-12-05 22:55:37,923 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/executor.json 2024-12-05 22:55:39,573 [DEBUG] Using selector: EpollSelector 2024-12-05 22:55:39,659 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00000.log 2024-12-05 22:55:39,721 [DEBUG] Using selector: EpollSelector 2024-12-05 22:55:39.729 | INFO | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=0 2024-12-05 22:55:39.729 | INFO | datatrove.utils.logging:log_pipeline:90 - --- 🛠️ PIPELINE 🛠 📖 - READER: 👾 PersonalCopilot 🔻 - FILTER: 🧑🏽💻 Code Filter 💽 - WRITER: 🐿 Jsonl 2024-12-05 22:55:39,738 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00001.log 2024-12-05 22:55:39.729 | ERROR | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main code = _serve_one(child_r, fds, │ │ └ [24, 25, 26, 27, 28, 29] │ └ 8 └ <function _serve_one at 0x75acdb9c3920> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one code = spawn._main(child_r, parent_sentinel) │ │ │ └ 4 │ │ └ 8 │ └ <function _main at 0x75acdb9c2b60> └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'> File 
"/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main return self._bootstrap(parent_sentinel) │ │ └ 4 │ └ <function BaseProcess._bootstrap at 0x75acdbddf420> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() │ └ <function BaseProcess.run at 0x75acdbdde980> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) │ │ │ │ │ └ {} │ │ │ │ └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> │ │ │ └ (<multiprocess.queues.SimpleQueue object at 0x75acd6f40500>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6b620>, None... │ │ └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> │ └ <function worker at 0x75acd6f751c0> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(*args, **kwds)) │ │ └ {} │ └ (0,) └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor... 
File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank return self._run_for_rank(rank, local_rank) │ │ │ └ 0 │ │ └ 0 │ └ <function PipelineExecutor._run_for_rank at 0x75acda55ec00> └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdbd50d70> > File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank pipelined_data = pipeline_step(pipelined_data, rank, self.world_size) │ │ │ │ └ <property object at 0x75acda5437e0> │ │ │ └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdbd50d70> │ │ └ 0 │ └ <generator object BaseDiskReader.run at 0x75acdbde19c0> └ 🔻 - FILTER: 🧑🏽💻 Code Filter File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__ return self.run(data, rank, world_size) │ │ │ │ └ 16 │ │ │ └ 0 │ │ └ <generator object BaseDiskReader.run at 0x75acdbde19c0> │ └ <function BasicCodeFilter.run at 0x75acd6f25f80> └ 🔻 - FILTER: 🧑🏽💻 Code Filter TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given 2024-12-05 22:55:39.764 | ERROR | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main code = _serve_one(child_r, fds, │ │ └ [14, 15, 18, 19, 20, 21] │ └ 8 └ <function _serve_one at 0x75acdb9c3920> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one code = spawn._main(child_r, parent_sentinel) │ │ │ └ 4 │ │ └ 8 │ └ <function _main at 0x75acdb9c2b60> └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'> File 
"/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main return self._bootstrap(parent_sentinel) │ │ └ 4 │ └ <function BaseProcess._bootstrap at 0x75acdbddf420> └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() │ └ <function BaseProcess.run at 0x75acdbdde980> └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) │ │ │ │ │ └ {} │ │ │ │ └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> │ │ │ └ (<multiprocess.queues.SimpleQueue object at 0x75acd78b2570>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6ba10>, None... │ │ └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> │ └ <function worker at 0x75acd6f75260> └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(*args, **kwds)) │ │ └ {} │ └ (1,) └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor... 
File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank return self._run_for_rank(rank, local_rank) │ │ │ └ 1 │ │ └ 1 │ └ <function PipelineExecutor._run_for_rank at 0x75acda55eca0> └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdb982480> > File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank pipelined_data = pipeline_step(pipelined_data, rank, self.world_size) │ │ │ │ └ <property object at 0x75acda5437e0> │ │ │ └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acdb982480> │ │ └ 1 │ └ <generator object BaseDiskReader.run at 0x75acdbde19c0> └ 🔻 - FILTER: 🧑🏽💻 Code Filter File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__ return self.run(data, rank, world_size) │ │ │ │ └ 16 │ │ │ └ 1 │ │ └ <generator object BaseDiskReader.run at 0x75acdbde19c0> │ └ <function BasicCodeFilter.run at 0x75acd6f26020> └ 🔻 - FILTER: 🧑🏽💻 Code Filter TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given 2024-12-05 22:55:39,769 [DEBUG] Using selector: EpollSelector 2024-12-05 22:55:39.784 | INFO | datatrove.executor.local:_launch_run_for_rank:81 - 1/16 tasks completed. 2024-12-05 22:55:39.788 | INFO | datatrove.executor.local:_launch_run_for_rank:81 - 2/16 tasks completed. 
2024-12-05 22:55:39,794 [DEBUG] Using selector: EpollSelector 2024-12-05 22:55:39,811 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00003.log 2024-12-05 22:55:39,824 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00004.log 2024-12-05 22:55:39,837 [DEBUG] open file: /home/taha/LLM-Workshop/personal_copilot/dataset_generation/logs/2024-12-05_22-55-37_ixcek/logs/task_00002.log 2024-12-05 22:55:39.834 | ERROR | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 positional arguments but 4 were given Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main code = _serve_one(child_r, fds, │ │ └ [24, 25, 26, 27, 28, 29] │ └ 8 └ <function _serve_one at 0x75acdb9c3920> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one code = spawn._main(child_r, parent_sentinel) │ │ │ └ 4 │ │ └ 8 │ └ <function _main at 0x75acdb9c2b60> └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main return self._bootstrap(parent_sentinel) │ │ └ 4 │ └ <function BaseProcess._bootstrap at 0x75acdbddf420> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() │ └ <function BaseProcess.run at 0x75acdbdde980> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) │ 
│ │ │ │ └ {} │ │ │ │ └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> │ │ │ └ (<multiprocess.queues.SimpleQueue object at 0x75acd6f40500>, <multiprocess.queues.SimpleQueue object at 0x75acd6f6b620>, None... │ │ └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> │ └ <function worker at 0x75acd6f751c0> └ <ForkServerProcess name='ForkServerPoolWorker-14' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(*args, **kwds)) │ │ └ {} │ └ (3,) └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor... File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank return self._run_for_rank(rank, local_rank) │ │ │ └ 2 │ │ └ 3 │ └ <function PipelineExecutor._run_for_rank at 0x75acda55ec00> └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acd78b37d0> > File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank pipelined_data = pipeline_step(pipelined_data, rank, self.world_size) │ │ │ │ └ <property object at 0x75acda5437e0> │ │ │ └ <datatrove.executor.local.LocalPipelineExecutor object at 0x75acd78b37d0> │ │ └ 3 │ └ <generator object BaseDiskReader.run at 0x75acd6fcc7b0> └ 🔻 - FILTER: 🧑🏽💻 Code Filter File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__ return self.run(data, rank, world_size) │ │ │ │ └ 16 │ │ │ └ 3 │ │ └ <generator object BaseDiskReader.run at 0x75acd6fcc7b0> │ └ <function BasicCodeFilter.run at 0x75acd6f25f80> └ 🔻 - FILTER: 🧑🏽💻 Code Filter TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given 2024-12-05 22:55:39.843 | ERROR | datatrove.executor.base:_run_for_rank:108 - BasicCodeFilter.run() takes 2 
positional arguments but 4 were given Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 273, in main code = _serve_one(child_r, fds, │ │ └ [14, 15, 18, 19, 20, 21] │ └ 8 └ <function _serve_one at 0x75acdb9c3920> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/forkserver.py", line 312, in _serve_one code = spawn._main(child_r, parent_sentinel) │ │ │ └ 4 │ │ └ 8 │ └ <function _main at 0x75acdb9c2b60> └ <module 'multiprocess.spawn' from '/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py'> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/spawn.py", line 135, in _main return self._bootstrap(parent_sentinel) │ │ └ 4 │ └ <function BaseProcess._bootstrap at 0x75acdbddf420> └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 314, in _bootstrap self.run() │ └ <function BaseProcess.run at 0x75acdbdde980> └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/process.py", line 108, in run self._target(*self._args, **self._kwargs) │ │ │ │ │ └ {} │ │ │ │ └ <ForkServerProcess name='ForkServerPoolWorker-6' parent=50854 started daemon> │ │ │ └ (<multiprocess.quemultiprocess.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(*args, **kwds)) ^^^^^^^^^^^^^^^^^^^ File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 76, in _launch_run_for_rank return self._run_for_rank(rank, local_rank) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 109, in _run_for_rank raise e File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/base.py", line 90, in _run_for_rank pipelined_data = pipeline_step(pipelined_data, rank, self.world_size) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/pipeline/base.py", line 119, in __call__ return self.run(data, rank, world_size) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given """ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/taha/LLM-Workshop/personal_copilot/dataset_generation/pipeline.py", line 118, in <module> run_code_dataset_generation() File "/home/taha/LLM-Workshop/personal_copilot/dataset_generation/pipeline.py", line 104, in run_code_dataset_generation print(executor_0.run()) ^^^^^^^^^^^^^^^^ File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/datatrove/executor/local.py", line 133, in run stats = list( ^^^^^ File "/home/taha/LLM-Workshop/venv/lib/python3.12/site-packages/multiprocess/pool.py", line 873, in next raise value TypeError: BasicCodeFilter.run() takes 2 positional arguments but 4 were given
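Reading the traceback: datatrove/pipeline/base.py calls `self.run(data, rank, world_size)`, while `BasicCodeFilter.run()` appears to accept only `self` and `data`. The source of `BasicCodeFilter` is not shown here, so the following is only a minimal sketch of that kind of mismatch and one possible fix. `Step`, `BrokenFilter`, `FixedFilter`, and `try_step` are hypothetical stand-ins, not datatrove classes, and the filtering rule (drop blank strings) is made up for illustration.

```python
class Step:
    """Hypothetical stand-in for a pipeline step base class (not datatrove's)."""

    def __call__(self, data, rank=0, world_size=1):
        # Mirrors the call shown in the traceback: run(data, rank, world_size).
        return self.run(data, rank, world_size)


class BrokenFilter(Step):
    # Assumed shape of the failing override: only (self, data) are accepted,
    # so passing rank and world_size raises TypeError at call time.
    def run(self, data):
        for doc in data:
            if doc.strip():
                yield doc


class FixedFilter(Step):
    # One possible fix: accept rank and world_size, even if they are unused.
    def run(self, data, rank=0, world_size=1):
        for doc in data:
            if doc.strip():
                yield doc


def try_step(step, docs):
    """Run a step the way the executor does; return results or the error text."""
    try:
        return list(step(docs, 0, 16))
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    docs = ["a", " ", "b"]
    print(try_step(BrokenFilter(), docs))
    print(try_step(FixedFilter(), docs))
```

If the real `BasicCodeFilter` overrides `run()` in the same way, widening its signature to match the base class call (or overriding a per-document hook instead of `run()`, if the installed datatrove version provides one) may resolve the error.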