
Calculate pageviews for all articles across all wikipedias #755

Merged: audiodude merged 13 commits into main from calculate-pageviews on Sep 8, 2024

Conversation

@audiodude (Member) commented Aug 17, 2024

For #171

Here we download the pageviews text file for the previous month and ingest it, storing the view count for every article, across every Wikipedia, that has one.

This PR includes the schema for the page_scores table, which will eventually include more than just the page views: it will also hold the links and lang_links values that make up the other components of the page score used in classic selection.

Previous attempts at this logic involved streaming the pageviews bz2 file over HTTP; however, this proved unreliable because the remote server would close the connection after ~20 minutes. Instead, we create a tempfile location, which in production is mapped to /srv/, and download the entire file there. The BZ2 decompression and line-by-line processing are still streamed.

EDIT: There is currently no scheduling for this process because we need to run it manually a couple of times first to make sure it works. Scheduling will be added in a future PR.
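As a rough sketch of the flow described above (illustrative only: the URL, user agent, and function names here are placeholders, not the actual code in wp1/scores.py):

```python
import bz2

import requests

# Placeholder values; the real constants live in the wp1 codebase.
PAGEVIEW_URL = 'https://dumps.wikimedia.org/other/pageview_complete/monthly/<year>/<year>-<month>/...'
WP1_USER_AGENT = 'WP 1.0 bot (example)'


def download_pageviews(filepath):
  # Download the entire compressed file to disk first, rather than
  # streaming it over HTTP, so a dropped connection after ~20 minutes
  # cannot interrupt the long-running ingestion step.
  with requests.get(PAGEVIEW_URL,
                    headers={'User-Agent': WP1_USER_AGENT},
                    stream=True,
                    timeout=60) as r:
    r.raise_for_status()
    with open(filepath, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
        f.write(chunk)


def iter_pageview_lines(filepath):
  # Decompression stays streamed: bz2.open reads the compressed file
  # incrementally, so the decompressed text never sits in memory whole.
  with bz2.open(filepath, 'rt', encoding='utf-8') as f:
    for line in f:
      yield line.rstrip('\n')
```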

@audiodude (Member, Author) commented:

Hi @benoit74, can you please take a look at this PR? Thanks!

@audiodude force-pushed the calculate-pageviews branch 4 times, most recently from 77b9e1a to 7816048 on August 17, 2024 15:46
@audiodude (Member, Author) commented:

Planning on committing with the CodeFactor issues unresolved. I tried ignoring them in the CodeFactor GUI, and also applying # nosec comments, but nothing worked. There is no real security issue: the /tmp directory is only used in tests.
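For context, the kind of suppression attempted looks something like this (illustrative only; B108 is bandit's hardcoded-tmp-directory check, and the actual variable name in the code may differ):

```python
# Path used only by the test suite, so the hardcoded-/tmp warning is noise.
TEST_TMP_DIR = '/tmp'  # nosec B108
```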

codecov bot commented Aug 17, 2024

Codecov Report

Attention: Patch coverage is 94.81481% with 7 lines in your changes missing coverage. Please review.

Project coverage is 91.14%. Comparing base (670a87d) to head (c44be71).
Report is 13 commits behind head on main.

Files with missing lines | Patch % | Lines
wp1/scores.py | 94.73% | 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
+ Coverage   90.99%   91.14%   +0.14%     
==========================================
  Files          65       66       +1     
  Lines        3411     3546     +135     
==========================================
+ Hits         3104     3232     +128     
- Misses        307      314       +7     


@kelson42 (Collaborator) commented:

@benoit74 being unavailable for the moment, @rgaudin could you please do the review?

@audiodude requested review from rgaudin and removed the request for benoit74 on August 18, 2024 15:16
@rgaudin (Member) left a comment:

LGTM

def wiki_languages():
  r = requests.get(
      'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
      headers={'User-Agent': WP1_USER_AGENT},
@rgaudin (Member):

Always include a timeout

@audiodude (Member, Author):

Done
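(Presumably the resolved call looks something like the following; the exact timeout value is an assumption:)

```python
r = requests.get(
    'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
    headers={'User-Agent': WP1_USER_AGENT},
    # Without a timeout, a stalled server would hang this call forever.
    timeout=60,
)
```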

    os.remove(prev_filepath)

  cur_filepath = get_cur_file_path()
  if os.path.exists(cur_filepath):
@rgaudin (Member):

When downloading below, you're writing to the actual destination file. Should there be any issue, a partial file will be left on disk, with no way to detect or fix it.

@audiodude (Member, Author):

Added error handling in the download, with corresponding test, PTAL.
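One way to handle the partial-file concern, sketched with the same names as the excerpts above (the merged code may differ in the details):

```python
def download_pageviews(cur_filepath):
  try:
    with requests.get(get_pageview_url(), stream=True, timeout=60) as r:
      r.raise_for_status()
      with open(cur_filepath, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
          f.write(chunk)
  except Exception:
    # Delete the partial file so the os.path.exists() check above
    # doesn't mistake it for a completed download on the next run.
    if os.path.exists(cur_filepath):
      os.remove(cur_filepath)
    raise
```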

wp1/scores.py Outdated
    # File already downloaded
    return

  with requests.get(get_pageview_url(), stream=True) as r:
@rgaudin (Member):

timeout

@audiodude (Member, Author):

Done

wp1/scores.py Outdated
    r.raise_for_status()
    with open(cur_filepath, 'wb') as f:
      # Read data in 8 KB chunks
      for chunk in r.iter_content(chunk_size=8 * 1024):
@rgaudin (Member):

8 KB is ridiculously small for downloading 3 GB. I think you can afford more RAM than this; a larger chunk size will surely boost download speed.

@audiodude (Member, Author):

What would you suggest, maybe 1 MB? 8 MB?

@rgaudin (Member):

8MB seems fine
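Sketch of the agreed change (the 8 MiB figure comes from this thread; the constant in the merged code may be spelled differently):

```python
# 8 MiB chunks: large reads keep the ~3 GB download fast while holding
# only one chunk in memory at a time.
for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
  f.write(chunk)
```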

@audiodude (Member, Author) left a comment:

Thanks for the review! PTAL.


@kelson42 (Collaborator) commented:

@audiodude With the current solution, I keep track of pageviews over many years, in both gathering and score computation. The reason is to mitigate the impact of local maxima caused by recent events.

@audiodude (Member, Author) commented:

> @audiodude With the current solution, I keep track of pageviews over many years, in both gathering and score computation. The reason is to mitigate the impact of local maxima caused by recent events.

@kelson42 let's merge this as is and we can discuss some kind of backfill, or keeping a running tally.

@audiodude merged commit 92e24b4 into main on Sep 8, 2024
5 checks passed
@audiodude deleted the calculate-pageviews branch on September 8, 2024 16:43