
Technical Documentation


Corset consists of several Docker containers that interact:

  • A Solr engine (dp-solr)
  • A PostgreSQL database (dp-postgres)
  • The webapp front-end (dp-front)
  • The webapp back-end (dp-back)

Solr

Solr is the search platform used to store and index data from corpora. Each base corpus (e.g., Paracrawl EN-ES, Paracrawl EN-DE) must be stored as a separate core/collection in Solr. A special collection (custom-samples) must also be created to store corset excerpts for preview.

The Solr schema needed for each collection is the following:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="basecorpus" version="2.0">

  <field name="_version_" type="tlong" indexed="false" stored="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

  <field name="src_url" type="string" indexed="false" stored="true"/>
  <field name="trg_url" type="string" indexed="false" stored="true"/>
  <field name="src" type="text" indexed="true" stored="true" />
  <field name="trg" type="text" indexed="true" stored="true" />
  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="custom_score" type="tdouble" indexed="false" stored="true" />

  <!-- primary key -->
  <uniqueKey>id</uniqueKey>


  <fieldType name="tlong" class="solr.TrieLongField" docValues="true"/>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="tdouble" class="solr.TrieDoubleField" docValues="true" />
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="light-mapping-FoldToASCII.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  
  <similarity class="solr.ClassicSimilarityFactory"/>
</schema>

You can also find it in the root of the Corset repository.
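
To check that a collection is up and answering queries, you can hit its select handler directly. A minimal sketch, assuming the host, port, core name and credentials used in the upload example below; it searches the indexed src field and returns the stored sentence pair and score:

curl -u solrusr:solrpwd "http://localhost:20000/solr/paracrawl-en-es/select?q=src:house&fl=src,trg,custom_score&rows=5"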

A core.properties file is also needed for each core:

schema=schema.xml
dataDir=data
name={YOUR COLLECTION NAME}
config=solrconfig.xml
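
Each core lives in its own instance directory under the Solr home. As a sketch, assuming a Solr home of /var/solr/data and a core named paracrawl-en-es (both are assumptions; adjust to your installation):

mkdir -p /var/solr/data/paracrawl-en-es/conf
cp schema.xml solrconfig.xml /var/solr/data/paracrawl-en-es/conf/
cp core.properties /var/solr/data/paracrawl-en-es/

The schema and config entries in core.properties are resolved against the core's conf directory, and dataDir=data is resolved against the instance directory itself.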

Creating new collections

To create a new collection for a new corpus, follow these steps:

  1. Go to the web interface of your Solr instance.
  2. Click on Core admin, and then Add core.
  3. Fill in the form, pointing to the appropriate schema.xml file.

Make sure that instanceDir and dataDir exist before proceeding.
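
The same core can also be created from the command line through Solr's CoreAdmin API. A sketch, assuming the host, port and credentials used in the examples below, and an instance directory that already contains schema.xml and solrconfig.xml:

curl -u solrusr:solrpwd "http://localhost:20000/solr/admin/cores?action=CREATE&name=paracrawl-en-es&instanceDir=paracrawl-en-es&config=solrconfig.xml&schema=schema.xml&dataDir=data"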

Data

Once your Solr is up, you need to upload one or more base corpora.

Data must be tab-separated, with the following format:

SOURCE_URL    TARGET_URL    SOURCE_SENTENCE    TARGET_SENTENCE    SCORE

"SCORE" is a metric used in order to sort results when querying Solr (the higher the value, the better the sentence)

We provide a script to upload your data, called txt2solr.py, located in /scripts, which is used as follows:

python3.7 txt2solr.py -c {THE_URL_TO_THE_SOLR_COLLECTION} -p {PREFIX} --liteformat {CORPUS_NAME} -u {SOLR_USER} -w {SOLR_PASSWORD}

For example:

python3.7 txt2solr.py -c http://localhost:20000/solr/paracrawl-en-es -p EN-ES --liteformat paracrawl.en-es.tsv -u solrusr -w solrpwd

You can get more information about txt2solr.py here: txt2solr.py

The script used to select sentences from Solr based on a query/sample text is called miracle.py, and is also located in /scripts. Learn more about it here: Miracle

Database

Corset uses a PostgreSQL database. Its schema is created automatically the first time the dp-postgres container starts, by running docker-entrypoint-initdb.d/dpdb_initdb.sql.
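
This relies on the standard behavior of the official PostgreSQL Docker image, which executes any *.sql files found in /docker-entrypoint-initdb.d the first time the database is initialized. A minimal sketch of starting such a container by hand (the image tag, credentials and host path are assumptions, not Corset's actual configuration):

docker run -d --name dp-postgres \
  -e POSTGRES_USER=corset \
  -e POSTGRES_PASSWORD=corsetpwd \
  -v "$(pwd)/dpdb_initdb.sql:/docker-entrypoint-initdb.d/dpdb_initdb.sql" \
  postgres:13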

Front end and API

Documentation on the front end and API can be found here: Front end and API

Deployment

Information on Corset deployment is here: Deployment
