Use separate download and processing queues #72

Open
rgaudin opened this issue Mar 13, 2020 · 7 comments
@rgaudin
Member

rgaudin commented Mar 13, 2020

The way multiprocessing works at the moment is by launching a number of threads that each take a chunk of the video IDs list and, for each video, download and process it sequentially, one at a time.

Given the platform's restrictions on downloads, it would make more sense to manage two separate queues, each with its own configurable workload (number of items worked on in parallel); a rough sketch follows the list:

  • one for downloads, with a default concurrency of 1;
  • one for processing (format conversion), with a default concurrency of maybe 2.
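
A minimal sketch of this two-queue idea, assuming Python's standard `queue` and `threading` modules; `download_video` and `convert_video` are hypothetical placeholders, not the scraper's actual functions:

```python
import queue
import threading


def download_video(video_id):
    # placeholder for the real download step; returns the local file path
    return f"/tmp/{video_id}.mp4"


def convert_video(path):
    # placeholder for the real format-conversion step
    print(f"converting {path}")


def worker(task_queue, handler):
    # pull items until a None sentinel tells this worker to stop
    while True:
        item = task_queue.get()
        if item is None:
            break
        handler(item)


def run(video_ids, download_concurrency=1, process_concurrency=2):
    download_q = queue.Queue()
    process_q = queue.Queue()

    # each finished download feeds the processing queue
    def download_then_enqueue(video_id):
        process_q.put(download_video(video_id))

    dl_workers = [
        threading.Thread(target=worker, args=(download_q, download_then_enqueue))
        for _ in range(download_concurrency)
    ]
    proc_workers = [
        threading.Thread(target=worker, args=(process_q, convert_video))
        for _ in range(process_concurrency)
    ]
    for t in dl_workers + proc_workers:
        t.start()

    for video_id in video_ids:
        download_q.put(video_id)
    for _ in dl_workers:   # one sentinel per download worker
        download_q.put(None)
    for t in dl_workers:
        t.join()

    for _ in proc_workers:  # downloads are done: release the converters
        process_q.put(None)
    for t in proc_workers:
        t.join()


if __name__ == "__main__":
    run(["abc123", "def456"])
```

The None sentinels let each pool of workers drain its queue and shut down cleanly, while the two concurrency settings stay independent.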
@satyamtg
Copy link
Contributor

Hey @rgaudin, I think this is indeed a good plan. However, I think we can improve performance a bit more by using multiprocessing and multithreading together (a sort of hybrid approach), since downloading and converting are fundamentally different kinds of tasks. What I mean is that we could use multiprocessing for the format-conversion queue, as it is CPU intensive, and multithreading for the download queue, as it is more I/O intensive. As far as I know, multiprocessing gives better results on CPU-intensive tasks, while multithreading gives an edge on tasks where the CPU has long wait times. We could try to use ThreadPoolExecutor and ProcessPoolExecutor simultaneously. At the moment I'm not sure exactly how we would implement it, but it seems like it would work. What are your thoughts on this?
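
As a hedged illustration of that hybrid approach: ThreadPoolExecutor and ProcessPoolExecutor are the real concurrent.futures classes, while `download_video` and `convert_video` below are stand-ins for the scraper's actual steps, not its real code:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed


def download_video(video_id):
    # I/O-bound stand-in: fetch the file and return its local path
    return f"/tmp/{video_id}.mp4"


def convert_video(path):
    # CPU-bound stand-in: re-encode the file
    return path


def run(video_ids, download_workers=4, convert_workers=2):
    # threads for the I/O-bound downloads, processes for the CPU-bound conversions
    with ThreadPoolExecutor(max_workers=download_workers) as downloads, \
         ProcessPoolExecutor(max_workers=convert_workers) as conversions:
        download_futures = [downloads.submit(download_video, vid) for vid in video_ids]
        convert_futures = []
        for done in as_completed(download_futures):
            # hand each finished download over to a conversion process
            convert_futures.append(conversions.submit(convert_video, done.result()))
        for done in as_completed(convert_futures):
            done.result()


if __name__ == "__main__":
    run(["abc123", "def456"])
```

Note that functions submitted to a ProcessPoolExecutor must be picklable (defined at module level), and the `__main__` guard is needed on platforms that spawn worker processes.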

@rgaudin
Member Author

rgaudin commented Apr 20, 2020

Thanks for the suggestion.

I'd prefer we reconsider this later, once we have some zimfarm experience with the new S3 option. The main goal of this issue was to work around the YouTube IP ban, but the situation will be quite different with our S3 cache.

We may be happy without this fixed, or we might need something smarter, so let's get some experience first.

That said, it's important to note that maintainability is key for scrapers. We need these code bases to be easy to understand and hack on, because a scraper usually starts with a need but then runs in auto mode for months or years before a need to tweak it arises, and by then the author is probably not around anymore (or has forgotten all about it).

Performance is important, but it's easier to get better hardware or accept longer tasks than to find more coding time. My point being: concurrency brings complexity. We do it when we have to, but we need to balance proven gains against complexity when deciding to implement it. Mixing threads and processes for an uncertain performance gain seems like too much at first sight. But as I said, let's gain some experience with S3 and reassess our bottlenecks.

@stale

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jun 19, 2020
@stale stale bot removed the stale label Jun 25, 2020
@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Aug 24, 2020
@kelson42 kelson42 pinned this issue Dec 12, 2021
@stale stale bot removed the stale label Dec 12, 2021
@kelson42 kelson42 unpinned this issue Dec 12, 2021
@the-illuminatus

Is the issue still open?

@rgaudin
Member Author

rgaudin commented Feb 5, 2022

Yes, it's still open, but we are not clear on exactly how it should be done, and it's not trivial, so it's definitely not appropriate for a newcomer.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2022
@benoit74 benoit74 added this to the later milestone Jul 29, 2024
@stale stale bot removed the stale label Jul 29, 2024