This is a quick script to move the Stackoverflow data from the StackExchange data dump (Sept '14) to a Postgres SQL database.
Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
Install requirements, create a new database (e.g. beerSO
below), and use load_into_pg.py
script:
$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO
This will download compressed files from archive.org and load all the tables at once.
You can use a custom database name as well. Make sure to explicitly give it while executing the script later.
Each table data is archived in an XML file. Available tables varies accross
history. load_into_pg.py
knows how to handle the following tables:
Badges
.Posts
.Tags
(not present in earliest dumps).Users
.Votes
.PostLinks
.PostHistory
.Comments
.
You can download manually the files to the folder from where the program is
executed: Badges.xml
, Votes.xml
, Posts.xml
, Users.xml
, Tags.xml
. In
some old dumps, the cases in the filenames are different.
Then load each file with e.g. python load_into_pg.py -t Badges
.
After all the initial tables have been created:
$ psql beerSO < ./sql/final_post.sql
For some additional indexes and tables, you can also execute the the following;
$ psql beerSO < ./sql/optional_post.sql
If you give a schema name using the -n
switch, all the tables will be moved
to the given schema. This schema will be created in the script.
The paths are not changed in the final scripts sql/final_post.sql
and
sql/optional_post.sql
. To run them, first set the search_path to your
schema name: SET search_path TO <myschema>;
- It prepares some indexes and views which may not be necessary for your analysis.
- The
Body
field inPosts
table is NOT populated by default. You have to use--with-post-body
argument to include it. - The
EmailHash
field inUsers
table is NOT populated.
- The
tags.xml
is missing from the data dump. Hence, thePostTag
andUserTagQA
tables will be empty afterfinal_post.sql
. - The
ViewCount
inPosts
is sometimes equal to anempty
value. It is replaced byNULL
in those cases.