-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fetch results of evening voting sessions #957
base: main
Are you sure you want to change the base?
Conversation
fac4ac1
to
6a43010
Compare
While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes. Currently, we only run the `RCVListPipeline` between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session. In addition to the current behavior, this change tries to fetch vote results between 17:00 and 20:00, until we’ve been able to fetch them successfully a second time. This is only the first part of the solution, as we also need to check that we only stop scraping vote results once we’ve been able to fetch updated results (e.g. by storing a hash of the source data for every successful pipeline run).
6a43010
to
3c05563
Compare
… tests, not just scraper tests
…xit early if data has not changed This allows us to repeatedly run the pipeline to check if the data has been updated, while stopping the pipeline as soon as possible if it remains unchanged. In case of this pipeline that means we send one request for the RCV list in XML format, but we do not send a requests to fetch pages from EUR-Lex, OEIL, etc. for each of the votes if the RCV list hasn’t changed.
3c05563
to
a7bd688
Compare
This is analogous to the `BaseScraper` class for scrapers. In contrast to the scrapers, pipelines don’t share a lot of code (except for some error handling and logging), so there wasn’t really a need to add this before. However, having a common base class is also helpful for typing.
d0a9c05
to
a567d10
Compare
In preparation for changes in the next commit to avoid ambiguous naming
a567d10
to
a7589a2
Compare
Right now, the worker’s `schedule_pipeline` method also handles all the exceptions that might be raised during pipeline execution, e.g. `DataUnavailableError`. This has worked fine so far, but it’s getting a little difficult to test now that we also want to store the checksum of a pipeline run. So instead, pipelines now return a `PipelineResult`, which includes information about the status (i.e. whether the run was successful or failed because the source data was unavailable, etc.) and `schedule_pipeline` doesn’t need to handle the different ways a pipeline could fail.
0559d73
to
e5a2c87
Compare
I initially added some code to `BaseScraper` that automatically computes the response checksum and exposes it as a property. This is only ever used by pipelines, so in order to remove some indirection, I’m moving it to a pipelines helper.
All three are only used by other modules from the same package
4236764
to
efa13e3
Compare
class PipelineError(Exception): | ||
pass | ||
|
||
|
||
class DataUnavailableError(PipelineError): | ||
pass | ||
|
||
|
||
class DataUnchangedError(PipelineError): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: These aren’t actually errors, they are exceptions used for flow control. I think I’ve blindly followed the linter recommendation, but this naming doesn’t really make sense for this case.
According to PEP 8, exception classes should have an Error
suffix, but only if they actually are errors, otherwise they should have no special suffix, so in this case DataUnchanged
and DataUnavailable
would be more appropriate.
https://peps.python.org/pep-0008/#exception-names
It probably makes sense to fix this separately though, there might be other similar cases elsewhere.
While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes.
Currently, we only run the
RCVListPipeline
between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session.The easy approach to this problem would be to simply keep running the
RCVListPipeline
until the evening, even after it has successfully completed for the first time. However, this approach has a few disadvantages:Instead, we do the following:
This is implemented as follows:
RCVListPipeline
, it is a simple hash of the contents of the RCV list. (In theory, we could make use of HTTP caching mechanisms and try to e.g. store the response ETag. In practice, many of the official sources do not include an ETag in responses, and this would add unnecessary complexity for irrelevant savings.)Fixes #917