Use separate download and processing queues #72
Hey @rgaudin, I think this is indeed a good plan. However, I think we can improve performance a bit more by using multiprocessing and multithreading together (a sort of hybrid approach), since downloading and converting are fundamentally different kinds of tasks. What I mean is that we could use multiprocessing for the format-conversion queue, as conversion is CPU-intensive, and multithreading for the download queue, as it is more IO-intensive. In general, as far as I know, multiprocessing gives better results on CPU-intensive tasks, while multithreading gives an edge on tasks where the CPU has long wait times. We could try using a ThreadPoolExecutor and a ProcessPoolExecutor simultaneously. At the moment I'm not entirely sure how we would implement it, but it seems like it would work. What are your thoughts on this?
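To illustrate the hybrid approach being proposed, here is a minimal sketch using Python's `concurrent.futures`. The `download`, `convert`, and `run` functions are purely illustrative stand-ins, not the scraper's actual code: threads handle the IO-bound downloads while separate processes handle the CPU-bound conversions.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed


def download(video_id):
    # stand-in for an IO-bound HTTP download; returns the downloaded path
    return f"/tmp/{video_id}.mp4"


def convert(path):
    # stand-in for a CPU-bound format conversion
    return path.replace(".mp4", ".webm")


def run(video_ids, dl_workers=8, cv_workers=4):
    with ThreadPoolExecutor(max_workers=dl_workers) as dl_pool, \
            ProcessPoolExecutor(max_workers=cv_workers) as cv_pool:
        # threads handle the IO-bound downloads...
        dl_futures = [dl_pool.submit(download, vid) for vid in video_ids]
        # ...and each finished file is handed to a separate process for conversion
        cv_futures = [cv_pool.submit(convert, f.result())
                      for f in as_completed(dl_futures)]
        return [f.result() for f in as_completed(cv_futures)]
```

Note that `ProcessPoolExecutor` requires the worker function to be picklable (defined at module level), which is part of the extra complexity this pattern brings.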
Thanks for the suggestion. I'd prefer we reconsider this later, once we have some zimfarm experience with the new S3 option. The main goal of this issue was to work around the YouTube IP ban, but the situation will be quite different with our S3 cache. We may be happy without this fixed, or we might need something smarter, so let's get some experience first.

That said, it's important to note that maintainability is key for scrapers. We need these code bases to be easy to understand and hack on, because a scraper usually starts with a need, then runs in auto mode for months or years before a need to tweak it arises, and by then the author is probably not around anymore (or has forgotten all about it). Performance is important, but it's easier to get better hardware or accept longer tasks than more coding time.

My point being: concurrency brings complexity. We do it when we have to, but we need to balance proven gains against that complexity when deciding to implement it. Mixing threads and processes for an uncertain performance gain seems like too much at first sight. But as I said, let's gain some experience with S3 and reassess our bottlenecks.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Is the issue still open?
Yes, it's still open, but we are not exactly clear on how it should be done, and it's not trivial, so it's definitely not appropriate for a newcomer.
The way multiprocessing works at the moment is by launching a number of threads that each take a chunk of the video-IDs list and, for each ID, sequentially download and process it, one at a time.
It would make more sense, given the platform's restrictions on downloads, to manage two different queues, for each of which we'd be able to set the workload (number of items to work on in parallel):
1. a download queue
2. a processing queue
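The two-queue idea above could be sketched roughly as follows. This is a hypothetical illustration, not the scraper's actual design: the `pipeline` helper, sentinel mechanism, and default worker counts are all assumptions. Each queue gets its own independently sized pool of worker threads, and finished downloads are handed off to the processing queue.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


def _worker(in_q, handler):
    # consume items from in_q until the stop sentinel arrives
    while True:
        item = in_q.get()
        if item is _STOP:
            break
        handler(item)


def pipeline(video_ids, download, process, dl_workers=4, proc_workers=2):
    """Run downloads and processing on two queues with independent worker counts."""
    dl_q, proc_q = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()

    def do_download(vid):
        # hand each finished download over to the processing queue
        proc_q.put(download(vid))

    def do_process(path):
        out = process(path)
        with lock:
            results.append(out)

    dl_threads = [threading.Thread(target=_worker, args=(dl_q, do_download))
                  for _ in range(dl_workers)]
    proc_threads = [threading.Thread(target=_worker, args=(proc_q, do_process))
                    for _ in range(proc_workers)]
    for t in dl_threads + proc_threads:
        t.start()

    for vid in video_ids:
        dl_q.put(vid)
    for _ in dl_threads:
        dl_q.put(_STOP)
    for t in dl_threads:
        t.join()  # all downloads finished; nothing more will enter proc_q
    for _ in proc_threads:
        proc_q.put(_STOP)
    for t in proc_threads:
        t.join()
    return results
```

Being able to tune `dl_workers` and `proc_workers` separately is what would let us throttle downloads (to respect the platform's restrictions) without also throttling CPU-bound processing.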