Fetches public personal information from Wikipedia


Ran4/argus


Argus

Fetches public personal information, written in natural language, from a Wikipedia dump and stores it in a JSON-formatted database.

The project is named "Argus" after the hundred-eyed giant of Greek mythology. Argus was also the name of the builder of the Argonauts' ship, whose leader was Jason, a name that is a homophone of JSON, the database format the program uses.

## Quick start

full_run.sh parses a Wikipedia XML dump, extracts every infobox, and stores them all as a single JSON file in raw_output/. It then cleans that raw JSON dump and writes the final JSON to output/.
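The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code in xmlwikiparser2.py: the function names (`find_infobox`, `parse_fields`, `iter_infoboxes`), the brace-counting approach, and keying the result by page title are all assumptions made for the example. It streams the XML with the standard library's `iterparse`, so the dump never has to fit in memory.

```python
import io
import json
import re
import xml.etree.ElementTree as ET

INFOBOX_START = re.compile(r"\{\{\s*infobox", re.IGNORECASE)


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def find_infobox(wikitext):
    """Return the first '{{Infobox ...}}' template, or None if absent."""
    match = INFOBOX_START.search(wikitext)
    if match is None:
        return None
    start = i = match.start()
    depth = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:  # matching close of the outer template
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces: treat as no infobox


def parse_fields(infobox):
    """Split an infobox into {key: value}, cutting only on top-level '|'."""
    body = infobox[2:-2]  # drop the outer '{{' and '}}'
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def iter_infoboxes(xml_source):
    """Yield (title, fields) for every page that contains an infobox."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            infobox = find_infobox(text or "")
            if infobox is not None:
                yield title, parse_fields(infobox)
            title, text = None, None
            elem.clear()  # release the finished page's children


# Tiny stand-in for a real dump (hypothetical sample data).
sample = b"""<mediawiki><page><title>Ada Lovelace</title><revision><text>
{{Infobox person|name=Ada Lovelace|birth_date={{birth date|1815|12|10}}}}
Ada was a mathematician.</text></revision></page>
<page><title>No box</title><revision><text>Plain article.</text></revision></page>
</mediawiki>"""
records = dict(iter_infoboxes(io.BytesIO(sample)))
print(json.dumps(records, indent=2, sort_keys=True))
```

Counting braces rather than using a single regular expression keeps nested templates such as `{{birth date|...}}` inside their parent field. Pages without an infobox are simply skipped.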

Manual run

Start by placing a copy of the full Wikipedia XML dump (e.g. enwiki-20150304-pages-articles-multistream.xml) in the argus/ folder.

# All paths are relative to src/, where the commands are run

# Usage: xmlwikiparser2.py inputXMLFileName outputJSONFileName
python xmlwikiparser2.py ../enwiki-20150304-pages-articles-multistream.xml ../raw_output/ibs_person_raw.json
javac java_key_cleaner.java
# Usage: attribute_cleaner.py inputFileName outputFileName outputKeysFileName
python attribute_cleaner.py ../raw_output/ibs_person_raw.json ../output/infobox_output_cleaned.json ../debug/attribute_keys_cleaned.txt

# Cleaned JSON available here: https://mega.co.nz/#!YwUlSDRR!EAbguiWFg5ppVBsw5fRGoYQCuBjVvMTOoxTcuwH9I14

python statistics.py noshow silent
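To give a feel for what a cleaning pass over the raw JSON might involve, here is a small sketch. It is not the logic of attribute_cleaner.py, whose actual rules are not described here: the normalisation rules in `clean_key`, the function `clean_infoboxes`, and the assumed `{title: {key: value}}` layout are hypothetical choices for illustration. It returns the cleaned records together with a count of each cleaned key, mirroring the idea of a cleaned JSON output plus a separate key listing.

```python
import json
import re
from collections import Counter


def clean_key(key):
    """Normalise one attribute key (assumed rules, ASCII keys only)."""
    key = key.strip().lower()
    key = re.sub(r"[\s\-]+", "_", key)    # spaces and hyphens -> underscore
    key = re.sub(r"[^a-z0-9_]", "", key)  # drop any other character
    return key.strip("_")


def clean_infoboxes(raw):
    """Return (cleaned, key_counts) for a {title: {key: value}} mapping."""
    cleaned, key_counts = {}, Counter()
    for title, fields in raw.items():
        out = {}
        for key, value in fields.items():
            new_key = clean_key(key)
            value = value.strip()
            if not new_key or not value:
                continue  # skip empty keys and empty values
            out[new_key] = value
            key_counts[new_key] += 1
        if out:  # drop records left with no usable fields
            cleaned[title] = out
    return cleaned, key_counts


# Hypothetical raw records, standing in for the raw JSON dump.
raw = {
    "Ada Lovelace": {" Birth Date ": "1815-12-10", "birth-place": " London ", "spouse": ""},
    "Empty": {"": "x"},
}
cleaned, key_counts = clean_infoboxes(raw)
print(json.dumps(cleaned, indent=2, sort_keys=True))
print("\n".join(sorted(key_counts)))
```

Merging spelling variants such as "Birth Date" and "birth-date" into one key makes the later statistics more meaningful, at the cost of discarding non-ASCII key characters in this simplified version.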

## Requirements

Python 2.7
Python modules:
    matplotlib  # Optional: only used by statistics.py to generate plots

Java JDK >6
