New class for spiders that filter lists of URLs by date #600

aguilerapy · 2021-01-21T18:00:46Z

After PR 596, Afghanistan, Colombia, Costa Rica, Dominican Republic, Malta, Zambia (and others) will filter the URLs in their base lists by date.

Should we create a new spider class? They all have a list of available URLs in common, and they can have a method like this:

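    from datetime import datetime

    def parse_dates_to_filter(self, date):
        """Return True if the URL for this date should be skipped."""
        # Filter only when both bounds are set.
        if not (self.from_date and self.until_date):
            return False

        # Assumes date is a string in self.date_format, and the bounds are datetimes.
        date = datetime.strptime(date, self.date_format)
        if self.date_format == '%Y':
            return not (self.from_date.year <= date.year <= self.until_date.year)
        elif self.date_format == '%Y-%m':
            # Compare (year, month) pairs, so that ranges spanning a year boundary work.
            return not ((self.from_date.year, self.from_date.month)
                        <= (date.year, date.month)
                        <= (self.until_date.year, self.until_date.month))
        else:
            return not (self.from_date <= date <= self.until_date)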
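
A note on month-granularity ranges: checking the month bounds separately from the year bounds rejects dates in a range that spans a year boundary (e.g. November 2020 to February 2021). Here is a minimal sketch of the difference, comparing (year, month) tuples instead. The helper name in_month_range is hypothetical, and the bounds are assumed to be datetime objects.

```python
from datetime import datetime


def in_month_range(date, from_date, until_date):
    """Return True if date's (year, month) falls within the inclusive range.

    Hypothetical helper for illustration; not an existing method.
    """
    return ((from_date.year, from_date.month)
            <= (date.year, date.month)
            <= (until_date.year, until_date.month))


from_date = datetime(2020, 11, 1)
until_date = datetime(2021, 2, 1)
date = datetime(2020, 12, 1)

# Checking years and months separately: 11 <= 12 <= 2 is False,
# so December 2020 is wrongly treated as out of range.
separate = ((from_date.year <= date.year <= until_date.year)
            and (from_date.month <= date.month <= until_date.month))

print(separate)                                     # False
print(in_month_range(date, from_date, until_date))  # True
```

Tuples compare element by element, so the year decides first and the month only breaks ties.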

We have the PeriodicSpider class but it doesn't work with those scrapers.

jpmckinney · 2021-01-21T19:13:09Z

Sounds good to me! @yolile ?

yolile · 2021-01-21T19:57:26Z

Yes, the new class sounds good to me as well. Instead of the suggested method, though, the class could have date_format and date_pattern attributes, and a start_requests method with a callback to a build_urls function, which filters the returned list of URLs according to date_format and date_pattern and yields the matches.
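
A rough sketch of that idea, in plain Python without Scrapy so that it runs on its own. The class name DateFilteredURLs, the example URLs, and the choice to skip URLs with no date in them are all assumptions for illustration; in a real spider, build_urls would be the callback of the request for the base list and would yield requests rather than strings. The bounds are parsed with the same date_format as the URLs, so the comparison happens at the granularity of the format.

```python
import re
from datetime import datetime


class DateFilteredURLs:
    """Hypothetical stand-in for the proposed class; not an existing spider class."""

    date_format = '%Y-%m'            # how dates appear in the URLs
    date_pattern = r'(\d{4}-\d{2})'  # regular expression that captures the date

    def __init__(self, from_date=None, until_date=None):
        # Assumption: bounds are given as strings in date_format.
        self.from_date = self._parse(from_date)
        self.until_date = self._parse(until_date)

    def _parse(self, value):
        return datetime.strptime(value, self.date_format) if value else None

    def build_urls(self, urls):
        """Yield only the URLs whose date is within the inclusive range."""
        for url in urls:
            match = re.search(self.date_pattern, url)
            if not match:
                continue  # assumption: skip URLs that carry no date
            date = self._parse(match.group(1))
            if self.from_date and date < self.from_date:
                continue
            if self.until_date and date > self.until_date:
                continue
            yield url


urls = [
    'https://example.com/data/2020-11.json',
    'https://example.com/data/2020-12.json',
    'https://example.com/data/2021-01.json',
    'https://example.com/data/2021-02.json',
    'https://example.com/about',
]
filtered = list(DateFilteredURLs('2020-12', '2021-01').build_urls(urls))
print(filtered)
```

A subclass would then only need to set date_format and date_pattern.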

jpmckinney · 2021-04-21T19:09:27Z

See discussion in #701 (comment) about related updates to Afghanistan, Zambia and Malta.

aguilerapy added the discussion and framework-spiders labels Jan 21, 2021

jpmckinney changed the title ~~New spider class?~~ New class for spiders that filter lists of URLs by date Feb 1, 2021

yolile mentioned this issue Apr 21, 2021

moldova: add from_date support #701

Merged

jpmckinney removed the discussion label Sep 1, 2021

yolile added the hacktoberfest label Oct 7, 2021

yolile removed the hacktoberfest label Nov 5, 2021
