6 Using Wikidata as KB
Diego Moussallem edited this page Apr 6, 2018 · 16 revisions
Like Wikipedia, Wikidata covers many languages. Unlike DBpedia, however, it assigns each resource a single global identifier. For example, Barack Obama has the identifier Q76 (https://www.wikidata.org/entity/Q76) regardless of the language in which the resource is described. Only the labels differ per language, so after downloading the data we remove the triples carrying language tags that we do not need. The steps for creating the Wikidata index follow below:
1) Download the Wikidata dump from:
https://dumps.wikimedia.org/wikidatawiki/entities
We recommend downloading the file in N-Triples (.nt) format.
2) Remove the triples with unwanted language tags. In the example below, we remove all other languages and keep only English as the preferred language.
sed '/@de/d;/@fr/d;/@it/d;/@eo/d;/@pl/d;/@ru/d;/@ja/d;/@zh/d;/@es/d;/@nl/d;/@af/d;/@an/d;/@ar/d;/@arz/d;/@ast/d;/@az/d;/@bar/d;/@be/d;/@bg/d;/@br/d;/@bs/d;/@ca/d;/@cdo/d;/@cs/d;/@cv/d;/@cy/d;/@da/d;/@diq/d;/@dsb/d;/@el/d;/@et/d;/@eu/d;/@ext/d;/@fa/d;/@fi/d;/@fo/d;/@fy/d;/@ga/d;/@gd/d;/@gl/d;/@gu/d;/@gv/d;/@he/d;/@hi/d;/@hr/d;/@hsb/d;/@ht/d;/@hu/d;/@hy/d;/@ia/d;/@id/d;/@ilo/d;/@io/d;/@is/d;/@jv/d;/@ka/d;/@km/d;/@kn/d;/@ko/d;/@ku/d;/@kw/d;/@la/d;/@lb/d;/@lij/d;/@ln/d;/@lt/d;/@lv/d;/@ml/d;/@mn/d;/@mr/d;/@ms/d;/@mt/d;/@nds-nl/d;/@nn/d;/@nrm/d;/@oc/d;/@os/d;/@pms/d;/@pnb/d;/@pt/d;/@qu/d;/@rm/d;/@rmy/d;/@ro/d;/@scn/d;/@sco/d;/@sh/d;/@sk/d;/@sl/d;/@so/d;/@sq/d;/@sr/d;/@su/d;/@sv/d;/@ta/d;/@tet/d;/@tg/d;/@th/d;/@tl/d;/@tpi/d;/@tr/d;/@tt/d;/@uk/d;/@ur/d;/@vec/d;/@vi/d;/@war/d;/@xal/d;/@yi/d;/@yo/d;/@zea/d;/@nb/d;/@pt-br/d;/@yue/d;/@ang/d;/@bn/d;/@nap/d;/@be-tarask/d;/@nan/d;/@nov/d;/@pa/d;/@ie/d;/@stq/d;/@hak/d;/@li/d;/@am/d;/@ba/d;/@uz/d;/@kk/d;/@sc/d;/@en-gb/d;/@en-ca/d;/@mzn/d;/@ne/d;/@gom/d;/@gsw/d;/@ceb/d;/@lmo/d;/@bho/d;/@te/d;/@sw/d;/@si/d;/@gom-latn/d;/@gom-deva/d' downloaded-data.nt > wikidata-en.nt
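The sed call above lists each unwanted language tag explicitly, so tags missing from the list stay in the output. As a rough alternative sketch (not the method used by the authors), the same goal can be approached from the other direction: keep every triple whose object is not a language-tagged literal, plus those tagged with the one language you want. The function name `filter_lang` is hypothetical, and the sketch assumes one triple per line with the language tag directly before the closing ` .`, as is usual for N-Triples.

```shell
# Hypothetical helper: keep only triples that are untagged or tagged with
# the requested language (default: en). Reads N-Triples on stdin.
filter_lang() {
  local keep="${1:-en}"
  awk -v keep="$keep" '
    # A language-tagged literal ends with: "@tag .
    match($0, /"@[A-Za-z]+(-[A-Za-z0-9]+)*[ \t]*\.[ \t]*$/) {
      # Extract the tag between the "@" and the terminating dot.
      tag = substr($0, RSTART + 2, RLENGTH - 2)
      sub(/[ \t]*\.[ \t]*$/, "", tag)
      if (tag == keep) print
      next
    }
    # Everything else (IRIs, typed or plain literals) is kept unchanged.
    { print }
  '
}
```

It could be used as `filter_lang en < downloaded-data.nt > wikidata-en.nt`. Note that regional variants such as `@en-gb` are dropped here as well, which matches the sed example.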
Alternatively, you can download the pre-built index directly from our server:
wget http://hobbitdata.informatik.uni-leipzig.de/agdistis/wikidata/index_wikidata_en.zip
Then run AGDISTIS locally using the following command:
mvn clean package tomcat:run -DskipTests
The AGDISTIS properties for the Wikidata index are:
index=index_wikidata_en
index2=index_bycontext
#used to prune edges
nodeType=http://www.wikidata.org/entity/
edgeType=http://www.wikidata.org/prop/
baseURI=http://www.wikidata.org
#SPARQL endpoint to retrieve domain and range information
endpoint=https://query.wikidata.org/
#this is the trigram distance between words, default = 3
ngramDistance=3
#exploration depth of semantic disambiguation graph
maxDepth=2
#threshold for cutting off similar strings
threshholdTrigram=0.87
#heuristicExpansionOn controls whether simple co-occurrence resolution is done or not, e.g., Barack => Barack Obama if both are in the same text
heuristicExpansionOn=true
#list of entity domains and corporationAffixes
whiteList=/config/whiteList.txt
corporationAffixes=/config/corporationAffixes.txt
#Enable popularity
popularity=false
#Choose a graph-based algorithm: "hits" or "pagerank"
algorithm=hits
#Enable search by context
context=false
#Enable search by acronym
acronym=false
#Enable finding common entities
commonEntities=true
# IMPORTANT for creating your own index
folderWithTTLFiles=/Users/diegomoussallem/Desktop/AGDISTIS-WIKIDATA/wikidata/
surfaceFormTSV=
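Before starting the server it can help to confirm which index folder and endpoint a given configuration actually points to. The following is a minimal sketch of a lookup helper for `key=value` files like the one above. The function name `get_prop` is hypothetical and not part of AGDISTIS, and the sketch only handles simple single-line entries (no line continuations or escapes).

```shell
# Hypothetical helper: print the value of KEY from a simple key=value file.
# usage: get_prop KEY FILE
get_prop() {
  awk -v key="$1" '
    /^[ \t]*#/ { next }                  # skip comment lines
    {
      i = index($0, "=")                 # split at the first "="
      if (i == 0) next
      k = substr($0, 1, i - 1)
      v = substr($0, i + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim whitespace around the value
      if (k == key) { print v; exit }
    }
  ' "$2"
}
```

For example, `get_prop index path/to/your.properties` would print `index_wikidata_en` for the configuration above, where the file path is a placeholder for wherever your properties file lives.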
You can test your running AGDISTIS via:
curl --data-urlencode "text='<entity>Barack Obama</entity>.'" -d type='agdistis' http://localhost:8080/AGDISTIS
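To send other texts without retyping the quoting each time, the request above can be wrapped in a small function. This is only a convenience sketch: the names `agdistis_payload` and `agdistis_query` are hypothetical, and the default URL simply repeats the local endpoint from the command above.

```shell
# Hypothetical helper: build the form field in the same shape as the
# curl example, i.e. text='<annotated text>'.
agdistis_payload() {
  printf "text='%s'" "$1"
}

# Hypothetical helper: send annotated text to a running AGDISTIS instance.
# usage: agdistis_query ANNOTATED_TEXT [URL]
agdistis_query() {
  local text="$1"
  local url="${2:-http://localhost:8080/AGDISTIS}"
  curl --silent --show-error \
       --data-urlencode "$(agdistis_payload "$text")" \
       -d type='agdistis' \
       "$url"
}
```

Calling `agdistis_query '<entity>Barack Obama</entity>.'` issues the same request as the curl command above.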