Technical Documentation
Corset consists of several Docker containers that interact:
- A Solr engine (dp-solr)
- A PostgreSQL database (dp-postgres)
- The webapp front-end (dp-front)
- The webapp back-end (dp-back)
Solr is the search platform used to store and index data from corpora. Each base corpus (e.g. Paracrawl EN-ES, Paracrawl EN-DE) must be stored as a separate core/collection in Solr. A special collection (custom-samples) must be created to store Corset excerpts for preview.
The Solr schema needed for each collection is the following:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="basecorpus" version="2.0">
  <field name="_version_" type="tlong" indexed="false" stored="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="src_url" type="string" indexed="false" stored="true"/>
  <field name="trg_url" type="string" indexed="false" stored="true"/>
  <field name="src" type="text" indexed="true" stored="true" />
  <field name="trg" type="text" indexed="true" stored="true" />
  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="custom_score" type="tdouble" indexed="false" stored="true" />

  <!-- primary key -->
  <uniqueKey>id</uniqueKey>

  <fieldType name="tlong" class="solr.TrieLongField" docValues="true"/>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="tdouble" class="solr.TrieDoubleField" docValues="true" />
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="light-mapping-FoldToASCII.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <similarity class="solr.ClassicSimilarityFactory"/>
</schema>
You can also find it in the root of the Corset repository.
A core.properties file is also needed for each core:
schema=schema.xml
dataDir=data
name={YOUR COLLECTION NAME}
config=solrconfig.xml
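Since only the `name` line varies between collections, the file can be generated per collection; a minimal sketch (the helper function is illustrative, not part of Corset):

```python
# Sketch: build the contents of a core.properties file for a new
# collection. The template mirrors the four lines shown above;
# only the collection name varies.
def core_properties(collection_name: str) -> str:
    return (
        "schema=schema.xml\n"
        "dataDir=data\n"
        f"name={collection_name}\n"
        "config=solrconfig.xml\n"
    )

print(core_properties("paracrawl-en-es"))
```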
In order to create a new collection for a new corpus, follow these steps:
- 1: Go to the web interface of your Solr instance
- 2: Click on Core Admin, and then Add Core
- 3: Fill in the form, pointing to the appropriate schema.xml file.
Make sure that instanceDir and dataDir exist before proceeding.
Once your Solr is up, you need to upload one or more base corpora.
Data must be tab-separated, with the following format:
SOURCE_URL TARGET_URL SOURCE_SENTENCE TARGET_SENTENCE SCORE
"SCORE" is a metric used in order to sort results when querying Solr (the higher the value, the better the sentence)
We provide a script to upload your data, called txt2solr.py, located at /scripts, which is used as follows:
python3.7 txt2solr.py -c {THE_URL_TO_THE_SOLR_COLLECTION} -p {PREFIX} --liteformat {CORPUS_NAME} -u {SOLR_USER} -w {SOLR_PASSWORD}
For example:
python3.7 txt2solr.py -c http://localhost:20000/solr/paracrawl-en-es -p EN-ES --liteformat paracrawl.en-es.tsv -u solrusr -w solrpwd
You can get more information about txt2solr.py here: txt2solr.py
The script used to select sentences from Solr based on a query/sample text is called miracle.py, and is located in /scripts. Learn more about it here: Miracle
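As an illustration only (this is not miracle.py's actual logic), selecting candidate sentences for a sample text could amount to a Solr /select query that matches the sample's terms against the `src` field and ranks by the stored custom_score:

```python
# Illustrative only: the real selection logic lives in miracle.py.
# Builds a Solr /select URL matching sample-text terms against `src`
# and sorting by custom_score (higher is better, as described above).
from urllib.parse import urlencode

def build_select_url(collection_url: str, sample_text: str,
                     rows: int = 100) -> str:
    params = urlencode({
        "q": f"src:({sample_text})",
        "sort": "custom_score desc",
        "rows": rows,
    })
    return f"{collection_url}/select?{params}"

url = build_select_url("http://localhost:20000/solr/paracrawl-en-es",
                       "hello world")
print(url)
```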
In Corset, a PostgreSQL database is used. It is generated by a script when the dp-postgres container is first started, following docker-entrypoint-initdb.d/dpdb_initdb.sql.
Documentation on front-end and API can be found here: Front end and API
Information on Corset deployment is here: Deployment