
Calculate pageviews for all articles across all wikipedias #755

Merged: audiodude merged 13 commits into main from calculate-pageviews on Sep 8, 2024

Conversation

@audiodude (Member) commented Aug 17, 2024

For #171

Here we download the pageviews text file for the previous month and ingest it, storing the view count for every article, across every Wikipedia, that has one.

This PR includes the schema for the page_scores table, which will eventually include more than just the page views: it will also hold the links and lang_links values that make up the other components of the page score used in classic selection.

Previous attempts at this logic involved streaming the pageviews bz2 file over HTTP; however, this proved unreliable because the remote server would close the connection after ~20 minutes. Instead, we create a tempfile location, which in production is mapped to /srv/, and download the entire file there. The BZ2 decompression and line-by-line processing are still streamed.

EDIT: There is currently no scheduling for this process because we need to run it manually a couple of times first to make sure it works. Scheduling will be added in a future PR.
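As a rough sketch of the flow described above (illustrative only: the URL, user agent, and function names here are placeholders, not the actual code in wp1/scores.py):

```python
import bz2

import requests

# Placeholder values; the real constants live in the wp1 codebase.
PAGEVIEW_URL = 'https://dumps.wikimedia.org/other/pageview_complete/monthly/<year>/<year>-<month>/...'
WP1_USER_AGENT = 'WP 1.0 bot (example)'


def download_pageviews(filepath):
  # Download the entire compressed file to disk first, rather than
  # streaming it over HTTP, so a dropped connection after ~20 minutes
  # cannot interrupt the long-running ingestion step.
  with requests.get(PAGEVIEW_URL,
                    headers={'User-Agent': WP1_USER_AGENT},
                    stream=True,
                    timeout=60) as r:
    r.raise_for_status()
    with open(filepath, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
        f.write(chunk)


def iter_pageview_lines(filepath):
  # Decompression stays streamed: bz2.open reads the compressed file
  # incrementally, so the decompressed text never sits in memory whole.
  with bz2.open(filepath, 'rt', encoding='utf-8') as f:
    for line in f:
      yield line.rstrip('\n')
```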

@audiodude (Member, Author) commented:

Hi @benoit74, can you please take a look at this PR? Thanks!

@audiodude force-pushed the calculate-pageviews branch 4 times, most recently from 77b9e1a to 7816048 on August 17, 2024 15:46
@audiodude (Member, Author) commented:

Planning on committing with the CodeFactor issues unresolved. I tried ignoring them in the CodeFactor GUI, and also applying # nosec comments, but nothing worked. There is no real security issue: the /tmp directory is only used in tests.
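For context, the kind of suppression attempted looks something like this (illustrative only; B108 is bandit's hardcoded-tmp-directory check, and the actual variable name in the code may differ):

```python
# Path used only by the test suite, so the hardcoded-/tmp warning is noise.
TEST_TMP_DIR = '/tmp'  # nosec B108
```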

codecov bot commented Aug 17, 2024

Codecov Report

Attention: Patch coverage is 94.81481% with 7 lines in your changes missing coverage. Please review.

Project coverage is 91.14%. Comparing base (670a87d) to head (c44be71).
Report is 13 commits behind head on main.

Files with missing lines | Patch % | Lines
wp1/scores.py | 94.73% | 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
+ Coverage   90.99%   91.14%   +0.14%     
==========================================
  Files          65       66       +1     
  Lines        3411     3546     +135     
==========================================
+ Hits         3104     3232     +128     
- Misses        307      314       +7     


@kelson42 (Collaborator) commented:

@benoit74 being unavailable for the moment, @rgaudin could you please do the review?

@audiodude requested review from rgaudin and removed the request for benoit74 on August 18, 2024 15:16
@rgaudin (Member) left a comment:

LGTM

def wiki_languages():
  r = requests.get(
      'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
      headers={'User-Agent': WP1_USER_AGENT},
@rgaudin (Member):

Always include a timeout

@audiodude (Member, Author):

Done
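(Presumably the resolved call looks something like the following; the exact timeout value is an assumption:)

```python
r = requests.get(
    'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
    headers={'User-Agent': WP1_USER_AGENT},
    # Without a timeout, a stalled server would hang this call forever.
    timeout=60,
)
```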

    os.remove(prev_filepath)

  cur_filepath = get_cur_file_path()
  if os.path.exists(cur_filepath):
@rgaudin (Member):

When downloading below, you're writing to the actual destination file. Should there be any issue, a partial file will be left on disk, with no way to detect or fix it.

@audiodude (Member, Author):

Added error handling in the download, with corresponding test, PTAL.
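One way to handle the partial-file concern, sketched with the same names as the excerpts above (the merged code may differ in the details):

```python
def download_pageviews(cur_filepath):
  try:
    with requests.get(get_pageview_url(), stream=True, timeout=60) as r:
      r.raise_for_status()
      with open(cur_filepath, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
          f.write(chunk)
  except Exception:
    # Delete the partial file so the os.path.exists() check above
    # doesn't mistake it for a completed download on the next run.
    if os.path.exists(cur_filepath):
      os.remove(cur_filepath)
    raise
```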

wp1/scores.py Outdated
    # File already downloaded
    return

  with requests.get(get_pageview_url(), stream=True) as r:
@rgaudin (Member):

timeout

@audiodude (Member, Author):

Done

wp1/scores.py Outdated
    r.raise_for_status()
    with open(cur_filepath, 'wb') as f:
      # Read data in 8 KB chunks
      for chunk in r.iter_content(chunk_size=8 * 1024):
@rgaudin (Member):

8 KB is ridiculously small for downloading 3 GB. I think you can afford more RAM than this; a larger chunk size will surely boost download speed.

@audiodude (Member, Author):

What would you suggest, maybe 1 MB? 8 MB?

@rgaudin (Member):

8MB seems fine
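Sketch of the agreed change (the 8 MiB figure comes from this thread; the constant in the merged code may be spelled differently):

```python
# 8 MiB chunks: large reads keep the ~3 GB download fast while holding
# only one chunk in memory at a time.
for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
  f.write(chunk)
```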

@audiodude (Member, Author) left a comment:

Thanks for the review! PTAL.


@kelson42 (Collaborator) commented:

@audiodude With the current solution, I keep track of pageviews over many years, in both gathering and score computation. The reason is to mitigate the impact of local maxima caused by recent events.

@audiodude (Member, Author) commented:

> @audiodude With the current solution, I keep track of pageviews over many years, in both gathering and score computation. The reason is to mitigate the impact of local maxima caused by recent events.

@kelson42 let's merge this as is and we can discuss some kind of backfill, or keeping a running tally.

@audiodude merged commit 92e24b4 into main on Sep 8, 2024
5 checks passed
@audiodude deleted the calculate-pageviews branch on September 8, 2024 16:43