Use separate download and processing queues #72
Hey @rgaudin, I think this is indeed a good plan. However, I think we can improve performance a bit more by using multiprocessing and multithreading together (a sort of hybrid approach), since downloading and converting are fundamentally different kinds of tasks. What I mean is that we could use multiprocessing for the format-conversion queue, as conversion is CPU-intensive, and multithreading for the download queue, as it is more IO-intensive. In general, as far as I know, multiprocessing gives better results on CPU-intensive tasks, while multithreading gives an edge on tasks where the CPU has long wait times. We could try using a ThreadPoolExecutor and a ProcessPoolExecutor simultaneously. At the moment I'm not entirely sure how we would implement it, but it seems like it would work. What are your thoughts on this?
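To illustrate the hybrid approach being proposed, here is a minimal sketch using Python's `concurrent.futures`. The `download`, `convert`, and `run` functions are purely illustrative stand-ins, not the scraper's actual code: threads handle the IO-bound downloads while separate processes handle the CPU-bound conversions.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed


def download(video_id):
    # stand-in for an IO-bound HTTP download; returns the downloaded path
    return f"/tmp/{video_id}.mp4"


def convert(path):
    # stand-in for a CPU-bound format conversion
    return path.replace(".mp4", ".webm")


def run(video_ids, dl_workers=8, cv_workers=4):
    with ThreadPoolExecutor(max_workers=dl_workers) as dl_pool, \
            ProcessPoolExecutor(max_workers=cv_workers) as cv_pool:
        # threads handle the IO-bound downloads...
        dl_futures = [dl_pool.submit(download, vid) for vid in video_ids]
        # ...and each finished file is handed to a separate process for conversion
        cv_futures = [cv_pool.submit(convert, f.result())
                      for f in as_completed(dl_futures)]
        return [f.result() for f in as_completed(cv_futures)]
```

Note that `ProcessPoolExecutor` requires the worker function to be picklable (defined at module level), which is part of the extra complexity this pattern brings.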
Thanks for the suggestion. I'd prefer we reconsider this later, once we have some zimfarm experience with the new S3 option. The main goal of this issue was to work around the YouTube IP ban, but the situation will be quite different with our S3 cache. We may be happy without this fixed, or we might need something smarter, so let's get some experience first.

That said, it's important to note that maintainability is key for scrapers. We need these code bases to be easy to understand and hack on, because a scraper usually starts with a need, then runs in auto mode for months or years before a need to tweak it arises, and by then the author is probably not around anymore (or has forgotten all about it). Performance is important, but it's easier to get better hardware or accept longer tasks than more coding time.

My point being: concurrency brings complexity. We do it when we have to, but we need to balance proven gains against that complexity when deciding to implement it. Mixing threads and processes for an uncertain performance gain seems like too much at first sight. But as I said, let's gain some experience with S3 and reassess our bottlenecks.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Is the issue still open?
Yes, it's still open, but we are not exactly clear on how it should be done, and it's not trivial, so it's definitely not appropriate for a newcomer.
The way multiprocessing works at the moment is by launching a number of threads that each take a chunk of the video-IDs list and, for each ID, sequentially download and process it, one at a time.
It would make more sense, given the platform's restrictions on downloads, to manage two different queues, for each of which we'd be able to set the workload (number of items to work on in parallel):
1. a download queue
2. a processing queue
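The two-queue idea above could be sketched roughly as follows. This is a hypothetical illustration, not the scraper's actual design: the `pipeline` helper, sentinel mechanism, and default worker counts are all assumptions. Each queue gets its own independently sized pool of worker threads, and finished downloads are handed off to the processing queue.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


def _worker(in_q, handler):
    # consume items from in_q until the stop sentinel arrives
    while True:
        item = in_q.get()
        if item is _STOP:
            break
        handler(item)


def pipeline(video_ids, download, process, dl_workers=4, proc_workers=2):
    """Run downloads and processing on two queues with independent worker counts."""
    dl_q, proc_q = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()

    def do_download(vid):
        # hand each finished download over to the processing queue
        proc_q.put(download(vid))

    def do_process(path):
        out = process(path)
        with lock:
            results.append(out)

    dl_threads = [threading.Thread(target=_worker, args=(dl_q, do_download))
                  for _ in range(dl_workers)]
    proc_threads = [threading.Thread(target=_worker, args=(proc_q, do_process))
                    for _ in range(proc_workers)]
    for t in dl_threads + proc_threads:
        t.start()

    for vid in video_ids:
        dl_q.put(vid)
    for _ in dl_threads:
        dl_q.put(_STOP)
    for t in dl_threads:
        t.join()  # all downloads finished; nothing more will enter proc_q
    for _ in proc_threads:
        proc_q.put(_STOP)
    for t in proc_threads:
        t.join()
    return results
```

Being able to tune `dl_workers` and `proc_workers` separately is what would let us throttle downloads (to respect the platform's restrictions) without also throttling CPU-bound processing.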