StrepHit

StrepHit is a Natural Language Processing pipeline that understands human language, extracts facts from text and produces Wikidata statements with references.

StrepHit is an IEG project funded by the Wikimedia Foundation.

StrepHit will enhance the data quality of Wikidata by suggesting references to validate statements, and will help Wikidata become the gold-standard hub of the Open Data landscape.

Official Project Page

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Documentation

https://www.mediawiki.org/wiki/StrepHit

Features

- Web spiders to collect a biographical corpus from a list of reliable sources
- Corpus analysis to understand the most meaningful verbs
- Extraction of sentences and semi-structured data from a corpus
- Train an automatic classifier through crowdsourcing
- Extract facts from text in 2 ways:
  - Supervised
  - Rule-based
- Several utilities, ranging from NLP tasks like tokenization and part-of-speech tagging, to facilities for parallel processing, caching and logging

Pipeline

Corpus Harvesting
Corpus Analysis
Sentence Extraction
N-ary Relation Extraction
Dataset Serialization

Get Ready

Install Python 2.7 and pip
Clone the repository and create the output folder:

$ git clone https://github.com/Wikidata/StrepHit.git
$ mkdir StrepHit/output

Install all the Python requirements (preferably in a virtualenv):

$ cd StrepHit
$ pip install -r requirements.txt

Install TreeTagger
Register for a free account on the Dandelion APIs
Create the file strephit/commons/secret_keys.py with your API token, which you can find in your Dandelion dashboard:

NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'
NEX_TOKEN = 'your API token here'
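If you prefer not to paste the token by hand, the file can be generated from an environment variable. The following is a minimal sketch, not part of StrepHit: the helper names and the DANDELION_TOKEN variable are assumptions made for illustration, and only the two settings shown above are written.

```python
import os

# Endpoint copied from the snippet above.
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1/'


def render_secret_keys(token):
    """Return the source text of a secret_keys.py module for the given token."""
    return "NEX_URL = %r\nNEX_TOKEN = %r\n" % (NEX_URL, token)


def write_secret_keys(path="strephit/commons/secret_keys.py",
                      env_var="DANDELION_TOKEN"):
    """Write the module, reading the token from an environment variable.

    DANDELION_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError("Set %s to your Dandelion API token first" % env_var)
    with open(path, "w") as handle:
        handle.write(render_secret_keys(token))
    return path
```

Calling write_secret_keys() from the repository root, with the variable set, creates the file at the path given above. Keep that file out of version control, since it contains a credential.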

Optional dependency

If you want to extract sentences via syntactic parsing, you will need to install:

Java 8
Stanford CoreNLP, through our utility:

$ python -m strephit commons download stanford_corenlp

Command Line

You can run every component of the NLP pipeline from the command line. Run a command with no arguments, or with --help, to see the available options. Each command can have a set of sub-commands, depending on its granularity.

$ python -m strephit
Usage: __main__.py [OPTIONS] COMMAND [ARGS]...

Options:
  --log-level <TEXT CHOICE>...
  --cache-dir DIRECTORY
  --help                        Show this message and exit.

Commands:
  annotation          Corpus annotation via crowdsourcing
  classification      Roles classification
  commons             Common utilities used by others
  corpus_analysis     Corpus analysis module
  extraction          Data extraction from the corpus
  rule_based          Unsupervised fact extraction
  side_projects       Side projects scripts
  web_sources_corpus  Corpus retrieval from the web

Get Started

Generate a dataset of Wikidata assertions (QuickStatements syntax) from semi-structured data in the corpus (this takes time and requires a good internet connection):

$ python -m strephit extraction process_semistructured -p 1 samples/corpus.jsonlines

Produce a ranking of meaningful verbs:

$ python -m strephit commons pos_tag samples/corpus.jsonlines bio en
$ python -m strephit corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en

Extract sentences using the ranking and perform Entity Linking:

$ python -m strephit extraction extract_sentences samples/corpus.jsonlines output/verbs.json en
$ python -m strephit commons entity_linking -p 1 output/sentences.jsonlines en

Extract facts with the rule-based classifier:

$ python -m strephit rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en

Train the supervised classifier and extract facts:

$ python -m strephit annotation parse_results samples/crowdflower_results.csv
$ python -m strephit classification train output/training_set.jsonlines en
$ python -m strephit classification classify output/entity_linked.jsonlines output/classifier_model.pkl en

Serialize the supervised classification results into a dataset of Wikidata assertions (QuickStatements):

$ python -m strephit commons serialize -p 1 output/supervised_classified.jsonlines samples/lexical_db.json en
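The commands above can also be chained from a small driver script. This is a sketch under the assumption that you run it from the repository root with the sample files in place; the script and its function names are hypothetical, while the sub-commands and arguments are copied verbatim from the steps above.

```python
import shlex
import subprocess

# Sub-commands copied from the steps above, in the order they appear.
STEPS = [
    "extraction process_semistructured -p 1 samples/corpus.jsonlines",
    "commons pos_tag samples/corpus.jsonlines bio en",
    "corpus_analysis rank_verbs output/pos_tagged.jsonlines bio en",
    "extraction extract_sentences samples/corpus.jsonlines output/verbs.json en",
    "commons entity_linking -p 1 output/sentences.jsonlines en",
    "rule_based classify output/entity_linked.jsonlines samples/lexical_db.json en",
    "annotation parse_results samples/crowdflower_results.csv",
    "classification train output/training_set.jsonlines en",
    "classification classify output/entity_linked.jsonlines "
    "output/classifier_model.pkl en",
    "commons serialize -p 1 output/supervised_classified.jsonlines "
    "samples/lexical_db.json en",
]


def build_commands(python="python"):
    """Turn each step into an argument list suitable for subprocess."""
    return [[python, "-m", "strephit"] + shlex.split(step) for step in STEPS]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check_call raises CalledProcessError on a non-zero exit code,
            # so the chain stops at the first failing step.
            subprocess.check_call(command)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the script only prints the commands, which is a cheap way to check the order before committing to a long run.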

N.B.: you will find all the output files in the output folder.
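Most intermediate files carry the .jsonlines extension. Assuming they follow the usual JSON Lines layout (one JSON value per line), a quick way to inspect one is sketched below. The helper names are hypothetical, and no particular record fields are assumed: the script only reports the keys it actually finds.

```python
import io
import json


def read_jsonlines(path):
    """Yield one decoded JSON value per non-empty line of the file."""
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return the record count and the sorted keys seen across all records."""
    count = 0
    keys = set()
    for record in read_jsonlines(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record.keys())
    return count, sorted(keys)
```

For example, summarize("output/sentences.jsonlines") returns how many records the sentence extraction step produced, together with the field names present in them.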

Note on Parallel Processing

By default, StrepHit runs as many processes as there are CPU cores on the machine. Pass the -p parameter to change this.

Set -p 1 to disable parallel processing.
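The behavior described above can be illustrated with the standard multiprocessing module. This is a sketch of the general technique, not StrepHit's actual implementation: the function names are hypothetical, and treating 0 as "one worker per CPU core" is an assumption of this example.

```python
import multiprocessing


def resolve_workers(processes=0):
    """Map a -p style value to a worker count; 0 means one per CPU core."""
    if processes and processes > 0:
        return processes
    return multiprocessing.cpu_count()


def square(x):
    """Toy work item, defined at module level so workers can pickle it."""
    return x * x


def parallel_map(function, items, processes=0):
    """Apply function to every item, in a pool or sequentially."""
    workers = resolve_workers(processes)
    if workers == 1:
        # No pool at all: a plain sequential loop, which is easier to debug.
        return [function(item) for item in items]
    pool = multiprocessing.Pool(workers)
    try:
        # pool.map preserves the order of the input items.
        return pool.map(function, items)
    finally:
        pool.close()
        pool.join()
```

Calling parallel_map(square, range(10), processes=1) runs everything in the current process, which mirrors what -p 1 is meant to achieve: tracebacks stay readable and no worker processes are spawned.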

License

The source code is under the terms of the GNU General Public License, version 3.
