Rolling out

jsdiaz edited this page Dec 17, 2018 · 2 revisions

Loose Thoughts

ShariaSource and CorpusBuilder will be updated separately most of the time, and ShariaSource will likely receive updates more often.

One option for the integration is to make CorpusBuilder a git submodule. Another is to keep it as a completely separate app (which might be better).

Documentation

Instructions for setting up CorpusBuilder as a separate app

Setting up ENV variables

The following is a sample of ENV variables that are required for CorpusBuilder to work properly:

[email protected]
CORPUS_BUILDER_HOST=http://98.shariasource.berkman.temphost.net
CORPUS_BUILDER_PORT=7998
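
A missing variable is easier to diagnose at boot than at the first failed request. The following is a minimal sketch of such a check, not code from CorpusBuilder: the helper name and the constant are hypothetical, and only the two variable names that are legible in the sample above are checked.

```ruby
# Hypothetical helper, not part of CorpusBuilder: returns the names of
# required variables that are absent or blank in the given environment.
REQUIRED_CORPUS_BUILDER_VARS = %w[CORPUS_BUILDER_HOST CORPUS_BUILDER_PORT].freeze

def missing_env_vars(env, required = REQUIRED_CORPUS_BUILDER_VARS)
  required.select { |name| env[name].to_s.strip.empty? }
end

# In an app this would be called with ENV; a plain Hash is used here
# so the example runs on its own.
sample_env = {
  "CORPUS_BUILDER_HOST" => "http://98.shariasource.berkman.temphost.net",
  "CORPUS_BUILDER_PORT" => ""
}
missing = missing_env_vars(sample_env)
puts "Missing: #{missing.join(', ')}" unless missing.empty?
```

Passing the environment as an argument keeps the check easy to exercise without touching the real process environment.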

Setting up the database

The database uses one non-standard extension (pgcrypto) that needs to be installed. If the migration scripts fail because the database user lacks the privileges to create an extension, a superuser needs to run the following statement in psql:

CREATE EXTENSION IF NOT EXISTS pgcrypto WITH SCHEMA public;

Other than that, all migrations should run without issue.

As usual for a Rails app, the standard config/database.yml file needs to be created and populated with the connection info. The config/secrets.yml file needs to be adjusted as well.

The Python environment with Kraken for image preprocessing

CorpusBuilder can use Kraken as the OCR engine (Tesseract is the default). It also uses Kraken for image preprocessing. The process running CorpusBuilder therefore needs both Python 2.7 and Kraken in its environment. The following line installs Kraken once Python and pip are available:

pip install kraken

Setting up Tesseract

The easiest way to install Tesseract 4 is to use the package manager as described in: https://github.com/tesseract-ocr/tesseract/wiki.

UI compilation

CorpusBuilder exposes a JavaScript library to be used by the app that integrates with it. The library needs to be compiled first. Here are the versions of node, npm and yarn used during development:

  • node: v8.6.0
  • npm: 5.3.0
  • yarn: 1.2.1

To compile the CorpusBuilder JavaScript libraries and styles (to be pulled by ShariaSource):

yarn install
RAILS_ENV=production bundle exec rails webpacker:compile

Running the above in the production environment will make Webpacker minify and optimize the output JavaScript.

The jobs runner

All OCR processes run as background jobs. Some document-related tasks (e.g. merging branches) are implemented this way as well.

Running the jobs runner follows the usual Rails convention:

bundle exec rails jobs:work

The app uses DelayedJob with the ActiveRecord backend for job management.
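
For readers unfamiliar with DelayedJob, here is a rough sketch of what a job object looks like. It is not taken from CorpusBuilder: the class name, its fields and the returned string are hypothetical. The only thing relied on is that DelayedJob can enqueue any object that responds to perform.

```ruby
# Hypothetical job object; the name and fields are illustrative only.
# DelayedJob can enqueue any object that responds to #perform.
OcrPageJob = Struct.new(:document_id, :page_number) do
  def perform
    # The real work (e.g. calling Tesseract or Kraken) would go here.
    "processed document #{document_id}, page #{page_number}"
  end
end

job = OcrPageJob.new("doc-1", 3)

# Inside the Rails app the job would be queued rather than run inline:
#   Delayed::Job.enqueue(job)
# A worker started with `bundle exec rails jobs:work` then calls #perform.
puts job.perform
```

If no worker is running, queued jobs simply stay in the jobs table, so OCR requests will appear to hang until the runner is started.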

The backup additions

Just as a reminder: we'll need a pg_dump backup of the new CorpusBuilder database. It would also be best to set up at least a file backup of the public/uploads and public/export directories.
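
As a starting point, the two backups could be driven from a small script along these lines. This is only a sketch under assumptions: the helper names, the database name and the backup paths are placeholders, and the actual backup tooling on the server may look quite different.

```ruby
# Hypothetical helpers; the database name and paths below are placeholders.

# Argument list for a custom-format PostgreSQL dump (-Fc) written to a file (-f).
def pg_dump_args(database, output_path)
  ["pg_dump", "-Fc", "-f", output_path, database]
end

# Argument list for a gzip-compressed tar archive of the upload and export directories.
def files_backup_args(archive_path, dirs = ["public/uploads", "public/export"])
  ["tar", "-czf", archive_path, *dirs]
end

db_cmd = pg_dump_args("corpus_builder_production", "/var/backups/corpus_builder.dump")
files_cmd = files_backup_args("/var/backups/corpus_builder_files.tar.gz")

# Passing the arguments separately avoids shell quoting problems:
#   system(*db_cmd)
#   system(*files_cmd)
puts db_cmd.join(" ")
puts files_cmd.join(" ")
```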

Setting up ShariaSource to be aware of CorpusBuilder

Setting up the ENV variables

The following is a sample set of environment variables that must be available to the ShariaSource app's process so that it can communicate with CorpusBuilder:

CORPUS_BUILDER_SHARIA_SOURCE_APP_ID=453aac29-386f-433a-b581-3b12912cc48e
CORPUS_BUILDER_SHARIA_SOURCE_TOKEN='\$2a\$08\$OG4Kfq9DuwkLdhhevSKswe1aldOqv1/ESPyKFpv2Lval/.5tAEDma'
CORPUS_BUILDER_HOST=localhost
CORPUS_BUILDER_PORT=7998
CORPUS_BUILDER_API_URL=http://0.0.0.0:7998
CORPUS_BUILDER_PUBLIC_URL=http://98.shariasource.berkman.temphost.net:7998
CORPUS_BUILDER_API_VERSION=1

Note that the $ signs in the token are escaped in the example above.
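
The escaping matters when the variables are loaded by something that expands $NAME references, such as a shell or a dotenv-style loader (which one is in use here is an assumption). A minimal sketch of producing the escaped form from a raw token follows; the helper name is hypothetical and the token is a made-up placeholder.

```ruby
# Hypothetical helper: prefixes every "$" with a backslash so that loaders
# which expand $VARIABLE references leave the token untouched.
def escape_dollars(value)
  # The block form is used so the replacement is taken literally.
  value.gsub("$") { "\\$" }
end

raw_token = '$2a$08$EXAMPLEONLY' # placeholder, not a real token
puts escape_dollars(raw_token)
```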