Rolling out
ShariaSource and CorpusBuilder will be updated separately most of the time, and ShariaSource will likely receive updates more often.
One option for the integration is to include CorpusBuilder as a git submodule. Another is to keep it entirely separate, which might be better.
The following is a sample of the environment variables that CorpusBuilder requires to work properly:
[email protected]
CORPUS_BUILDER_HOST=http://98.shariasource.berkman.temphost.net
CORPUS_BUILDER_PORT=7998
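As a minimal sketch, a startup script could verify that these variables are present before launching the app. The helper name `missing_corpus_builder_vars` is hypothetical, and the list it checks covers only the two legible variable names from the sample above; extend it as needed.

```shell
# Hypothetical helper: print the names of required variables that are unset or empty.
# The list of names is an assumption based on the sample above.
missing_corpus_builder_vars() {
  missing=""
  for name in CORPUS_BUILDER_HOST CORPUS_BUILDER_PORT; do
    # Indirectly read the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Print the collected names without the leading space
  echo "${missing# }"
}
```

An empty result means both variables are set; otherwise the output lists what still needs to be defined.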
The database uses one non-standard extension that needs to be installed. If the migration scripts fail because of insufficient privileges to create the extension, a superuser needs to run the following in psql:
CREATE EXTENSION IF NOT EXISTS pgcrypto WITH SCHEMA public;
Other than that, all migrations should run without issues.
As usual for Rails, the config/database.yml file needs to be created and populated with connection info. The config/secrets.yml file needs to be adjusted as well.
CorpusBuilder provides the option of using Kraken as the OCR engine (Tesseract is the default). It also uses Kraken for image preprocessing. The process running a CorpusBuilder instance needs Python 2.7 in its environment, as well as Kraken itself. The following line installs Kraken once Python and Pip are available:
pip install kraken
The easiest way to install Tesseract 4 is to use the package manager, as described at https://github.com/tesseract-ocr/tesseract/wiki.
CorpusBuilder exposes a JavaScript library to be used within the app that integrates with it. The library needs to be compiled first. Here are the versions of node, npm and yarn used during development:
- node: v8.6.0
- npm: 5.3.0
- yarn: 1.2.1
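To compare a build machine against these versions, something like the following could be used. This is only a sketch: `check_version` is a hypothetical helper, and an exact match is an assumption, since newer versions may work too.

```shell
# Hypothetical helper: compare a reported version with the one used during development.
# A leading "v" (as printed by node) is ignored on both sides.
check_version() {
  tool=$1
  expected=${2#v}
  actual=${3#v}
  if [ "$actual" = "$expected" ]; then
    echo "$tool: ok ($actual)"
  else
    echo "$tool: expected $expected, found ${actual:-nothing}"
    return 1
  fi
}

# Example calls (each tool must be on PATH):
# check_version node 8.6.0 "$(node --version)"
# check_version npm 5.3.0 "$(npm --version)"
# check_version yarn 1.2.1 "$(yarn --version)"
```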
To compile the CorpusBuilder JavaScript libraries and styles (to be pulled by ShariaSource):
RAILS_ENV=production bundle exec rails webpacker:compile
Running the above in the production environment makes Webpacker minify and optimize the output JavaScript.
All OCR processes run as background jobs. Some document-related tasks (e.g. merging branches) are implemented this way as well.
Running the jobs runner follows the usual Rails convention:
bundle exec rails jobs:work
The app uses DelayedJob via ActiveRecord for job management.
Just as a reminder: we'll need pg_dump backups for the new CorpusBuilder database. It would also be best to have at least a file backup set up for the public/uploads and public/export directories.
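A file backup of the two directories could look roughly like the sketch below. The function name, the archive naming scheme, and every path and database name are placeholders, not part of the actual deployment.

```shell
# Hypothetical helper: archive public/uploads and public/export of a CorpusBuilder checkout.
# Arguments: the app root and a destination directory (both placeholder paths).
backup_corpus_builder_files() {
  app_root=$1
  dest_dir=$2
  archive="$dest_dir/corpus_builder_files_$(date +%Y%m%d).tar.gz"
  mkdir -p "$dest_dir" &&
    tar -czf "$archive" -C "$app_root" public/uploads public/export &&
    echo "$archive"
}

# Database dump; the file path and database name are placeholders:
# pg_dump --format=custom --file=/var/backups/corpus_builder.dump corpus_builder_production
```

The function prints the path of the archive it created, so a scheduled job can log it or copy it off the host.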
The following is a sample set of environment variables that must be available to the ShariaSource app's process for it to communicate with CorpusBuilder:
CORPUS_BUILDER_SHARIA_SOURCE_APP_ID=453aac29-386f-433a-b581-3b12912cc48e
CORPUS_BUILDER_SHARIA_SOURCE_TOKEN='\$2a\$08\$OG4Kfq9DuwkLdhhevSKswe1aldOqv1/ESPyKFpv2Lval/.5tAEDma'
CORPUS_BUILDER_HOST=localhost
CORPUS_BUILDER_PORT=7998
CORPUS_BUILDER_API_URL=http://0.0.0.0:7998
CORPUS_BUILDER_PUBLIC_URL=http://98.shariasource.berkman.temphost.net:7998
CORPUS_BUILDER_API_VERSION=1
Note that the $ signs in the token are escaped in the above example.
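How the escaping has to be written depends on what reads the variables (for example a dotenv-style file versus a shell script). As an illustration for a POSIX shell only, using a made-up token rather than the real one:

```shell
# Single quotes: the shell keeps every character literally, so no escaping is needed.
token_single='$2a$08$exampleonly'

# Double quotes: each $ must be escaped, otherwise the shell would try to
# expand $2, $0 and $exampleonly as parameters.
token_double="\$2a\$08\$exampleonly"

# Both spellings produce the same value.
[ "$token_single" = "$token_double" ] && echo "same token"
```

Without the escaping, a double-quoted token would be silently altered by parameter expansion, and authentication against CorpusBuilder would then fail.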