New class for spiders that filter lists of URLs by date #600

Open
aguilerapy opened this issue Jan 21, 2021 · 3 comments
Labels
framework-spiders Relating to common spider functionality

Comments

@aguilerapy
Contributor

After PR 596, the spiders for Afghanistan, Colombia, Costa Rica, Dominican Republic, Malta, Zambia (and others) will filter the URLs in their base lists by date.

Should we create a new spider class? They all start from a list of available URLs, and they could share a method like this:

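    from datetime import datetime

    def parse_dates_to_filter(self, date):
        """Return True if the URL for ``date`` should be skipped."""
        # Without both bounds there is nothing to filter.
        if not (self.from_date and self.until_date):
            return False

        if self.date_format == '%Y':
            # ``date`` is an integer year here.
            return not (self.from_date.year <= date <= self.until_date.year)
        if self.date_format == '%Y-%m':
            # Compare (year, month) tuples so ranges that span a year boundary work.
            return not ((self.from_date.year, self.from_date.month)
                        <= (date.year, date.month)
                        <= (self.until_date.year, self.until_date.month))
        # Otherwise ``date`` is a string in ``self.date_format``.
        date = datetime.strptime(date, self.date_format)
        return not (self.from_date <= date <= self.until_date)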

We have the PeriodicSpider class but it doesn't work with those scrapers.

aguilerapy added the discussion and framework-spiders labels Jan 21, 2021
@jpmckinney
Member

Sounds good to me! @yolile ?

@yolile
Member

yolile commented Jan 21, 2021

Yes, the new class sounds good to me as well. Instead of the suggested method, though, the class could have date_format and date_pattern attributes, and a start_requests method with a callback to a build_urls function, which filters the returned list of URLs according to date_format and date_pattern and yields the ones that match.
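
To make that idea concrete, here is a rough sketch of how such filtering might look. It is only an illustration under assumptions: the class name DateFilteredURLs, the example URLs, and the exact meaning of date_pattern (taken here to be a regular expression whose first group captures the date in each URL) are hypothetical, and the sketch is plain Python rather than a real Scrapy spider. In an actual spider, start_requests would request the list page and build_urls would be its callback; here the list is passed in directly so the example runs on its own.

```python
import re
from datetime import datetime


class DateFilteredURLs:
    """Hypothetical stand-in for the proposed spider base class."""

    # Assumed attributes: a strptime format, and a regex whose first group
    # captures the date portion of each URL.
    date_format = '%Y-%m'
    date_pattern = r'(\d{4}-\d{2})'

    def __init__(self, from_date=None, until_date=None):
        self.from_date = from_date
        self.until_date = until_date

    def build_urls(self, urls):
        """Yield only the URLs whose embedded date is within the range."""
        for url in urls:
            match = re.search(self.date_pattern, url)
            if not match:
                continue  # no recognizable date in this URL
            date = datetime.strptime(match.group(1), self.date_format)
            if self.from_date and date < self.from_date:
                continue
            if self.until_date and date > self.until_date:
                continue
            yield url


# Made-up URLs, for illustration only.
urls = [
    'https://example.com/releases/2019-11.json',
    'https://example.com/releases/2019-12.json',
    'https://example.com/releases/2020-01.json',
    'https://example.com/releases/2020-02.json',
]
spider = DateFilteredURLs(datetime(2019, 12, 1), datetime(2020, 1, 31))
selected = list(spider.build_urls(urls))
```

With these inputs, selected holds only the 2019-12 and 2020-01 URLs. Parsing each date into a datetime before comparing means a range that crosses a year boundary needs no special handling, and a subclass would only have to override date_format and date_pattern.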

jpmckinney changed the title from "New spider class?" to "New class for spiders that filter lists of URLs by date" Feb 1, 2021
@jpmckinney
Copy link
Member

jpmckinney commented Apr 21, 2021

See discussion in #701 (comment) about related updates to Afghanistan, Zambia and Malta.
