Fetch results of evening voting sessions #957

tillprochaska · 2024-06-08T15:40:17Z

While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes.

Currently, we only run the RCVListPipeline between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session.

The easy approach to this problem would be to simply keep running the RCVListPipeline until the evening, even after it has successfully completed for the first time. However, this approach has a few disadvantages:

We store timestamps of when we accessed the data sources related for a specific vote. Following the easy approach means we’d store the timestamp when the data was last accessed. For diagnostics and analysis, it’s nice to keep the timestamp of the first access.
Every pipeline triggers multiple requests for each vote. As it’s not uncommon to see hundreds of votes per day, this would be quite wasteful and not appropriate for a well-behaved scraper.

Instead, we do the following:

Between 12:00 and 15:00, we periodically check if the vote results are available until we’ve been able to fetch the vote results successfully (just as we did before).
Between 17:00 and 20:00, we periodically check if an updated RCV list has been published. When an updated list has been published, we rerun the entire pipeline and stop the periodic checks afterwards. When the list remains unchanged, we don’t rerun subsequent steps of the pipeline.

This is implemented as follows:

When a pipeline has run successfully, it can return a checksum. The checksum is stored and is provided to subsequent runs of the pipeline.
The checksum can be any value, but in case of the RCVListPipeline, it is a simple hash of the contents of the RCV list. (In theory, we could make use of HTTP caching mechanisms and try to e.g. store the response ETag. In practice, many of the official sources do not include an ETag in responses, and this would add unnecessary complexity for irrelevant savings.)
If during a subsequent run, the hash of RCV list contents hasn’t changed, the pipeline exists early to skip unnecessary requests. Once the pipeline has been run successfully for a second time, we also stop periodically checking if the RCV list contents have changed.

Fixes #917

While on most plenary days, there is only one voting session around midday, on some days there is another sesssion in the evening, usually around 17:00. The vote results of the evening sessions are appended to the same source document that also contains the results of the midday votes. Currently, we only run the `RCVListPipeline` between 12:00 and 15:00 until we’ve been able to fetch vote results successfully. Once we’ve fetched the results, we do not attempt to fetch them again. That means that so far we did not (automatically) fetch results of the evening voting session. In addition to the current behavior, this change tries to fetch vote results between 17:00 and 20:00, until we’ve been able to fetch them successfully a second time. This is only the first part of the solution, as we also need to check that we only stop scraping vote results once we’ve been able to fetch updated results (e.g. by storing a hash of the source data for every successful pipeline run).

… tests, not just scraper tests

…xit early if data has not changed This allows us to repeatedly run the pipeline to check if the data has been updated, while stopping the pipeline as soon as possible if it remains unchanged. In case of this pipeline that means we send one request for the RCV list in XML format, but we do not send a requests to fetch pages from EUR-Lex, OEIL, etc. for each of the votes if the RCV list hasn’t changed.

This is analogous to the `BaseScraper` class for scrapers. In contrast to the scrapers, pipelines don’t share a lot of code (except for some error handling and logging), so there wasn’t really a need to add this before. However, having a common base class is also helpful for typing.

In preparation for changes in the next commit to avoid ambiguous naming

Right now, the worker’s `schedule_pipeline` method also handles all the exceptions that might be raised during pipeline execution, e.g. `DataUnavailableError`. This has worked fine so far, but it’s getting a little difficult to test now that we also want to store the checksum of a pipeline run. So instead, pipelines now return a `PipelineResult`, which includes information about the status (i.e. whether the run was successful or failed because the source data was unavailable, etc.) and `schedule_pipeline` doesn’t need to handle the different ways a pipeline could fail.

…ame day

I initially added some code to `BaseScraper` that automatically computes the response checksum and exposes it as a property. This is only ever used by pipelines, so in order to remove some indirection, I’m moving it to a pipelines helper.

All three are only used by other modules from the same package

tillprochaska · 2024-12-08T15:46:25Z

backend/howtheyvote/pipelines/common.py

 class PipelineError(Exception):
    pass


 class DataUnavailableError(PipelineError):
    pass
+
+
+class DataUnchangedError(PipelineError):
+    pass


Note to self: These aren’t actually errors, they are exceptions used for flow control. I think I’ve blindly followed the linter recommendation, but this naming doesn’t really make sense for this case.

According to PEP 8, exception classes should have an Error suffix, but only if they actually are errors, otherwise they should have no special suffix, so in this case DataUnchanged and DataUnavailable would be more appropriate.

https://peps.python.org/pep-0008/#exception-names

It probably makes sense to fix this separately though, there might be other similar cases elsewhere.

tillprochaska marked this pull request as draft June 8, 2024 15:41

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch 2 times, most recently from fac4ac1 to 6a43010 Compare June 8, 2024 16:08

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from 6a43010 to 3c05563 Compare December 7, 2024 16:59

tillprochaska added 2 commits December 7, 2024 18:08

Move load_fixture helper to parent package as it's useful for other…

f93eed5

… tests, not just scraper tests

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from 3c05563 to a7bd688 Compare December 7, 2024 17:10

tillprochaska added 2 commits December 7, 2024 19:12

Refactor pipelines to use BasePipeline

89de9ab

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from d0a9c05 to a567d10 Compare December 8, 2024 12:39

Rename PipelineRunResult to PipelineStatus

a7589a2

In preparation for changes in the next commit to avoid ambiguous naming

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from a567d10 to a7589a2 Compare December 8, 2024 12:42

tillprochaska added 2 commits December 8, 2024 14:15

Store pipeline checksums and provide them to subsequent runs on the s…

2a969eb

…ame day

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from 0559d73 to e5a2c87 Compare December 8, 2024 15:10

tillprochaska added 2 commits December 8, 2024 16:19

Refactor: Remove unnecessary package exports

efa13e3

All three are only used by other modules from the same package

tillprochaska force-pushed the 917-fix-multiple-voting-sessions branch from 4236764 to efa13e3 Compare December 8, 2024 15:19

tillprochaska marked this pull request as ready for review December 8, 2024 15:37

tillprochaska requested a review from linusha December 8, 2024 15:37

tillprochaska commented Dec 8, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fetch results of evening voting sessions #957

Fetch results of evening voting sessions #957

tillprochaska commented Jun 8, 2024 •

edited

Loading

tillprochaska Dec 8, 2024

Fetch results of evening voting sessions #957

Are you sure you want to change the base?

Fetch results of evening voting sessions #957

Conversation

tillprochaska commented Jun 8, 2024 • edited Loading

tillprochaska Dec 8, 2024

Choose a reason for hiding this comment

tillprochaska commented Jun 8, 2024 •

edited

Loading