ItalianLexicalResources
This page provides information about the resources available for Italian language processing.
Currently the EXCITEMENT platform provides functionality that uses the Italian version of WordNet, available by request from FBK. Unlike previous releases, the current version is compatible with the Princeton WordNet's Java API. Using the resource is therefore similar to using the Princeton WordNet: download the resource to a local directory, and provide the access path in the configuration file. In EditDistanceEDA's configuration file, this looks as follows:
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/tmp/wnita</property>
</subsection>
In the configuration file you must indicate that the platform should use the "wordnet" section you just defined, by adding the following line to the components section:
<property name="instances">wordnet</property>
Further information about configuration files can be found here.
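A wrong path is an easy mistake to make here, so it can help to check the configured directory before running the platform. The following is a minimal sketch using only the Java standard library; it is not part of EOP, and the class name `WordnetPathCheck` and the helper `readProperty` are hypothetical names chosen for illustration. It parses the subsection shown above, extracts the `path` property, and reports whether that directory exists on the local machine.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of EOP: sanity-checks the "path" property.
final class WordnetPathCheck {

    /** Returns the trimmed text of the named property, or null if it is absent. */
    static String readProperty(String xml, String propertyName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                if (propertyName.equals(p.getAttribute("name"))) {
                    return p.getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration fragment", e);
        }
    }

    public static void main(String[] args) {
        // The same fragment as in the configuration example above.
        String fragment =
              "<subsection name=\"wordnet\">"
            + "<!-- path of the WordNet files -->"
            + "<property name=\"path\">/tmp/wnita</property>"
            + "</subsection>";
        String path = readProperty(fragment, "path");
        System.out.println("Configured path: " + path);
        System.out.println("Directory exists: " + Files.isDirectory(Paths.get(path)));
    }
}
```

In practice the fragment would be read from your own configuration file rather than from a string literal.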
The EOP package corresponding to using this resource is:
eu.excitementproject.eop.core.component.lexicalknowledge.wordnet
Entailment rules for Italian nouns were extracted from a corpus of Italian Wikipedia articles using the Wikipedia lexical miner, as described in Shnarch et al., 2009. The rule extraction links nouns based on various indicators: redirects, hyperlinks, categories, parentheses in the title, inference from term definition, the category network, and the article text (in particular the first sentence, which is treated as a definition). The rules are scored according to the indicators used to identify them.
Currently, this resource is distributed as an archived MySQL database. It contains approximately 7 million rules. To use it, download the archive. After unpacking the resource, install it in MySQL by running the command:
> mysql -u username -p < resource
Then add a section to the configuration file that describes how to access the resource. The name of the database is "wikilexresita". The host and port in dbconnection, as well as dbuser and dbpasswd, should reflect your installation of MySQL:
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikilexresita</property>
<property name="dbuser">username</property>
<property name="dbpasswd">password</property>
</subsection>
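Before pointing the platform at the database, it can be useful to confirm that the connection settings work. The sketch below is not part of EOP: the class name `WikiDbCheck` and the helper `jdbcUrl` are hypothetical, the port 3306 is taken from the connection string above, and MySQL Connector/J is assumed to be on the classpath. It builds a URL of the same form as the `dbconnection` property and lists the tables of the imported database using the standard `java.sql` API.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

// Hypothetical helper, not part of EOP: checks that the imported database is reachable.
final class WikiDbCheck {

    /** Builds a MySQL JDBC URL of the form used in the dbconnection property. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: WikiDbCheck <host> <user> <password>");
            return;
        }
        // Database name as it appears in the connection string above.
        String url = jdbcUrl(args[0], 3306, "wikilexresita");

        // Assumes MySQL Connector/J is available; otherwise no suitable driver is found.
        try (Connection conn = DriverManager.getConnection(url, args[1], args[2]);
             ResultSet tables = conn.getMetaData()
                     .getTables(null, null, "%", new String[] {"TABLE"})) {
            while (tables.next()) {
                System.out.println(tables.getString("TABLE_NAME"));
            }
        }
    }
}
```

If the program prints at least one table name, the same host, user and password values can be copied into the subsection above.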
To make the platform use this resource, insert the following line in the chosen component of the configuration file:
<property name="instances">wikipedia</property>
Further information about configuration files can be found here.
The EOP package corresponding to using this resource is:
eu.excitementproject.eop.core.component.lexicalknowledge.wikipedia
The resources described here were generated with the distsim package. They provide similarity scores between words, computed from the words' distributional representations built from a parsed corpus. Read more about the resource generation process.
The resources described in this section were built from a corpus of Italian Wikipedia pages, parsed with TextPro. At the time of writing, they cover 350M of the 1G corpus of Italian Wikipedia pages of 2013/09/08.
The resources are available for download from the Artifactory repository (Italian Redis DBs). The archive contains 7 Redis database files, which hold rules in the form of directed similarity scores between pairs of words:
Model | File | Nr. of rules | File size |
---|---|---|---|
BAP | similarity-l2r.rdb | 443374 | 13M |
BAP | similarity-r2l.rdb | 443121 | 13M |
DIRT | similarity-l2r.rdb | 1700 | 4k |
LIN proximity | similarity-l2r.rdb | 454964 | 13M |
LIN proximity | similarity-r2l.rdb | 454964 | 13M |
LIN dependency | similarity-l2r.rdb | 443374 | 13M |
LIN dependency | similarity-r2l.rdb | 443374 | 13M |