Mischa (GitHub profile), Morsaki (Medium blog)
Spevktator provides a combined live feed of 5 popular Russian news channels on VK, along with translations, sentiment analysis and visualisation tools, all of which are accessible online from anywhere (or offline, if you prefer). We currently have an archive of over 67,000 posts, dating back to the beginning of February 2022.
Originally, it was created to help research domestic Russian propaganda narratives, but it can also act as a monitoring hub for VK media content, allowing researchers and journalists to stay up to date on disinformation, even as chaotic events unfold. For an example, see "Documenting Russian Coverage of the ZNPP" by Morsaki.
Advanced users can run this tool locally against their own research targets, and even perform detailed analysis offline through an SQL interface or an Observable notebook.
In our public demo, we collect posts from 5 popular Russian news channels on VK (life, mash, nws_ru, ria and tassagency).
Explore their posts, together with sentiment analysis, metrics and English translation:
https://spevktator.io/vk/posts_mega_view
Some more examples:
- How often is "Ukraine" mentioned per week, together with average sentiment and total number of views?
- Which weapon systems are most often mentioned?
- Which aircraft are most often mentioned?
- When is the "Moskva cruiser" in the news?
- What are the entities related to ЗАЭС (or ZNPP in English)?
- Coverage of "hackers" by Russian media on VK, an analysis using an Observable notebook.
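Questions like the first one in the list above boil down to a grouped SQL query. The sketch below shows the idea with Python's built-in sqlite3 module against a tiny in-memory table. The table name `posts` and the columns `date`, `text_en`, `sentiment` and `views` are assumptions made for illustration only; check the actual schema of your vk.db (for example in Datasette) before adapting it.

```python
import sqlite3

# Hypothetical schema, for illustration only; the real vk.db may differ.
SCHEMA = """
CREATE TABLE posts (
    date TEXT,        -- ISO 8601 timestamp
    text_en TEXT,     -- English translation of the post
    sentiment REAL,   -- sentiment score (assumed to be numeric)
    views INTEGER
)
"""

# Count matching posts per week, with average sentiment and total views.
WEEKLY_MENTIONS = """
SELECT strftime('%Y-%W', date) AS week,
       COUNT(*)                AS mentions,
       AVG(sentiment)          AS avg_sentiment,
       SUM(views)              AS total_views
FROM posts
WHERE text_en LIKE '%' || :keyword || '%'
GROUP BY week
ORDER BY week
"""


def weekly_mentions(conn, keyword):
    """Return (week, mentions, avg_sentiment, total_views) rows for a keyword."""
    return conn.execute(WEEKLY_MENTIONS, {"keyword": keyword}).fetchall()


# Made-up sample rows so the sketch runs without the real database.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        ("2022-03-01T10:00:00", "News about Ukraine", -0.5, 100),
        ("2022-03-02T11:00:00", "More on Ukraine today", 0.5, 300),
        ("2022-03-09T09:00:00", "Ukraine update", -1.0, 50),
        ("2022-03-09T12:00:00", "Weather report", 0.0, 10),
    ],
)
rows = weekly_mentions(conn, "Ukraine")
for row in rows:
    print(row)
```

To run it against a local copy of the data, replace the in-memory connection with `sqlite3.connect("data/vk.db")` and adjust the table and column names to the real schema.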
To install and run Spevktator locally, you need at least Python 3.9 and a couple of Python libraries, which you can install with pip.
git clone https://github.com/MischaU8/spevktator.git
cd spevktator
Recommended: Take a look at venv. This tool provides isolated Python environments, which are more practical than installing packages system-wide. It also allows installing packages without administrator privileges.
Install the Python dependencies (this will take a while):
pip3 install .
To get you started, download and decompress our VK SQLite database dump (~26 MB). This includes all public VK wall posts by life, mash, nws_ru, ria and tassagency between 2022-02-01 and 2022-09-04. You can also scrape your own data, see below.
wget -v -O data/vk.db.xz https://spevktator.io/static/vk_2022-09-04_lite.db.xz
xz -d data/vk.db.xz
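If the xz command is not available (for example on Windows), the decompression step can be done with Python's standard library instead. This is a minimal sketch; the function name `decompress_xz` is ours, and the paths in the usage comment simply mirror the commands above.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst):
    """Stream-decompress an .xz file to dst without loading it all into memory."""
    src, dst = Path(src), Path(dst)
    with lzma.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dst


# Usage, mirroring `xz -d data/vk.db.xz`:
# decompress_xz("data/vk.db.xz", "data/vk.db")
```

Unlike `xz -d`, this sketch keeps the compressed file in place; delete it afterwards if you do not need it.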
Spevktator uses the open source multi-tool Datasette for exploring and publishing the collected data. Run the Datasette server to explore the collected posts:
datasette data/
Visit the web interface at http://127.0.0.1:8001 or explore our public demo at https://spevktator.io/
Learn more about Datasette and SQL on https://datasette.io/tutorials
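Besides the web interface, Datasette exposes query results as JSON, which is handy for scripts and notebooks. The sketch below only builds such a URL; it assumes the database is served under the name `vk` (Datasette derives this from the file name vk.db) and uses a placeholder query, since the table names in your database may differ.

```python
from urllib.parse import urlencode


def datasette_sql_url(base_url, database, sql, shape="array"):
    """Build a Datasette JSON API URL that runs a read-only SQL query."""
    query = urlencode({"sql": sql, "_shape": shape})
    return f"{base_url.rstrip('/')}/{database}.json?{query}"


# Placeholder query; replace it with a query against your own tables.
url = datasette_sql_url("http://127.0.0.1:8001", "vk", "select 1 as one")
print(url)

# With the Datasette server running, the result could be fetched like this:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     rows = json.load(response)
```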
After following the above installation instructions, you can use the command line tool spevktator to collect your own datasets from VK and save them to a SQLite database.
$ spevktator --help
Usage: spevktator [OPTIONS] COMMAND [ARGS]...
Save wall posts from VK communities to a SQLite database
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
backfill Retrieve the backlog of wall posts from the VK...
extract-named-entities Extract named-entities from text
fetch Retrieve all wall posts from the VK communities...
install Download and install models, create database
listen Continuously retrieve all wall posts from the...
rescrape Rescrape HTML pages from the scrape_log
sentiment Perform dostoevsky (RU) sentiment analysis on...
stats Show statistics for the given database
translate-entities Translate entities from RU to EN-US
translate-posts Translate posts from RU to EN-US
$ spevktator stats data/vk.db
domain nr_posts first last
---------- ---------- ------------------- -------------------
life 26125 2022-01-31T21:05:00 2022-09-03T15:45:00
mash 3309 2022-01-31T17:52:00 2022-08-31T15:01:00
nws_ru 3528 2022-01-31T13:00:00 2022-08-31T20:05:00
ria 10198 2022-01-31T22:03:00 2022-09-01T05:01:00
tassagency 23890 2022-01-31T22:45:00 2022-09-01T05:15:00
$ spevktator install data/myproject.db
Downloading Dostoevsky sentiment model... DONE
Creating database...DONE
You can specify one or more domains (the VK jargon for channels / groups) to monitor:
$ spevktator listen data/myproject.db vkusnoitochka
Scraping VK domain 'vkusnoitochka'... https://m.vk.com/vkusnoitochka
POST vkusnoitochka/-213845894_28 2022-09-01T13:27:00 added
POST vkusnoitochka/-213845894_27 2022-08-29T16:33:00 added
POST vkusnoitochka/-213845894_26 2022-08-08T18:03:00 added
POST vkusnoitochka/-213845894_25 2022-08-06T21:25:00 added
POST vkusnoitochka/-213845894_24 2022-08-06T21:23:00 added
2022-09-03 18:51:32.327117 posts_added=5 last_post_added=True earliest_post_date=2022-08-06T21:23:00 page: 1 / 5
Extracting named-entities up to 5 posts...
[####################################] 100%
0 extracted out of 5 posts
next url will be https://m.vk.com/vkusnoitochka?offset=5&own=1
Scraping VK domain 'vkusnoitochka'... https://m.vk.com/vkusnoitochka?offset=5&own=1
POST vkusnoitochka/-213845894_23 2022-08-06T21:23:00 added
POST vkusnoitochka/-213845894_22 2022-07-10T21:07:00 added
Optional command-line arguments for listen are:
- --deepl-auth-key (or DEEPL_AUTH_KEY env variable) to provide your DeepL translation API key.
- --spevktator-proxy (or SPEVKTATOR_PROXY env variable) the HTTP / HTTPS proxy to use to connect to VK.
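Both options can be given either on the command line or through an environment variable. The sketch below illustrates one common precedence rule (an explicit command-line value wins, otherwise the environment variable is used). It is an illustration of the pattern, not Spevktator's actual option handling, and `resolve_option` is a hypothetical helper.

```python
import os


def resolve_option(cli_value, env_name, environ=None):
    """Prefer an explicit command-line value, else fall back to the environment."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    # Treat an empty environment variable the same as an unset one.
    return environ.get(env_name) or None


# Example with a fake environment instead of the real os.environ.
fake_env = {"SPEVKTATOR_PROXY": "http://localhost:8080"}
print(resolve_option(None, "SPEVKTATOR_PROXY", fake_env))
print(resolve_option(None, "DEEPL_AUTH_KEY", fake_env))
```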
Some other spevktator commands to fetch historic posts from VK:
- backfill - Retrieve the backlog of wall posts from VK, until a certain date. See spevktator backfill --help for available options to restrict the data to be downloaded.
- fetch - Retrieve all wall posts from the VK communities. See spevktator fetch --help for available options to restrict the data to be downloaded.
Additional information about the tool: potential next steps, limitations of the current implementation, and the motivation behind design and architecture decisions.
Potential next steps:
- Expose more VK post data (thumbnail images, videos, comments)
- Expose which channels to monitor through the UI
- Annotation (tags / comments) of posts
- UI notification when data has been updated
- User authentication for non-public information & configuration UI
- More robust installation instructions for various platforms (Windows, Docker)
- Packaging and distribution via PyPI.
- Integrate with https://observablehq.com/ notebooks.
Limitations:
- Only passive monitoring is performed and no VK account is needed, so private groups won't be scraped.
- Comments and other personal information aren't collected due to GDPR.
- Sentiment prediction is based on RuSentiment and has moderate quality.
- Post metrics (shares, likes, views) are only tracked for a limited duration (last 5 posts).
- Posts with text longer than 2500 characters are not translated.
- Limited error handling and data loss recovery.
- The user interface will require SQL knowledge for more advanced usage.
Motivation for design decisions:
The ability to conduct keyword searches on local data is far superior to any online search. I no longer need to worry about revealing details of my investigation to any third party. The online web interface is provided for demo purposes, but is not required.
Setting up a data pipeline isn't trivial. Besides getting the raw data, a lot of value is added by optional related data such as viewer metrics, sentiment, translation and named-entity extraction.
This tool is modular: the data can be exported in various file formats (CSV, TSV, JSON) through sqlite-utils while being stored in a very powerful and accessible database format (SQLite). Instead of reinventing the wheel for data exploration and visualisation, it builds on existing open source tooling, such as Datasette.
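To illustrate why SQLite makes export straightforward, here is a minimal sketch that turns query results into CSV and JSON. It deliberately uses only Python's standard library rather than sqlite-utils, and the `posts` table with `domain` and `views` columns is a made-up example, not the real schema.

```python
import csv
import io
import json
import sqlite3


def rows_as_dicts(conn, sql, params=()):
    """Run a query and return each row as a column-name -> value dict."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def to_csv(rows):
    """Serialise a list of row dicts to CSV text with a header line."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialise row dicts to JSON, keeping Cyrillic text readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Made-up table so the sketch runs without the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (domain TEXT, views INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [("ria", 10), ("mash", 20)])
rows = rows_as_dicts(conn, "SELECT domain, views FROM posts ORDER BY domain")
print(to_csv(rows))
print(to_json(rows))
```

For real exports, sqlite-utils already covers these formats from the command line; the sketch is only meant to show how little glue is needed on top of SQLite.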