Introduction

This project implements a Lucene [1] based translation memory with BLEU rescoring, as described in "Multi-Domain Neural Machine Translation through Unsupervised Adaptation" [2].

Requirements

A Java JDK installation is required. The project is tested with JDK 8.

Installation process

To build the project, run:

./gradlew installDist

This should produce the build target ./build/install/tm.

You can run it by passing arguments to the bat script in ./build/install/tm/bin:

./build/install/tm/bin/tm.bat --port 8080 --bleu-rescoring-threshold 0.05 --index-dir my_index

Or you can run it straight from Gradle:

./gradlew run --args="--port 8080 --bleu-rescoring-threshold 0.05 --index-dir my_index"

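In our case, we used the jar file from Python 3 via JPype:

import jpype
import jpype.imports  # enables importing Java packages with Python import statements
from jpype.types import *

# Make the project jar visible to the JVM before starting it
jpype.addClassPath('build/libs/tm.jar')
# convertStrings=False keeps Java strings as java.lang.String instead of converting them to Python str
jpype.startJVM(convertStrings=False)
# Java imports only work once the JVM is running
import java.lang
from java.lang import System
from com import LuceneSentenceSearch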

API Call examples

curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"source":"Hello World !", "target": "Sveika pasaule!", "meta": {"uid": "Artūrs", "srclang": "en"}}' \
http://localhost:8080/save

Response: 
{
  "errorMessage": null,
  "status": "OK"
}

curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"input":"Hello World !", "meta": {"uid": "Artūrs", "srclang": "en"}}' \
http://localhost:8080/get

Response: 
{
  "sourceContext" : [ "Hello World !", "Hello Worlds !" ],
  "targetContext" : [ "Sveika pasaule!", "Sveiki pasaules!" ],
  "status" : "OK",
  "errorMessage" : null
}

curl \
--header "Content-Type: application/json" \
--request POST \
--data '{"uid": "Artūrs"}' \
http://localhost:8080/delete

Response: 
{
  "errorMessage": null,
  "status": "OK"
}
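The same three calls can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names build_request and call are hypothetical and not part of this project, and BASE_URL assumes the server was started with --port 8080 as in the examples above.

```python
import json
import urllib.request

# Assumption: the server runs locally and was started with --port 8080.
BASE_URL = "http://localhost:8080"


def build_request(endpoint, payload, base_url=BASE_URL):
    """Build a JSON POST request for one of the endpoints (save, get, delete)."""
    # ensure_ascii=False keeps characters such as "ū" readable in the body
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, payload, base_url=BASE_URL):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_request(endpoint, payload, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server:
#   meta = {"uid": "Artūrs", "srclang": "en"}
#   call("save", {"source": "Hello World !", "target": "Sveika pasaule!", "meta": meta})
#   call("get", {"input": "Hello World !", "meta": meta})
#   call("delete", {"uid": "Artūrs"})
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without a server running.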

Useful Functions

createIndexInDir("/tmp", "lv") - initializes a translation memory with Latvian as the source language, stored in /tmp
addFileToIndex(srcFile, trgFile, "IT") - loads the content of two parallel files into the translation memory for the domain IT
queryTM(String query_sentence, String domain, boolean skipBleuRescorer, int numberOfCandidates) - retrieves at most numberOfCandidates sentences from the TM that are similar to the query with respect to TF-IDF over stemmed text; if skipBleuRescorer is false, BLEU rescoring is also used to refine the results further
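To illustrate the idea behind BLEU rescoring, the sketch below reorders retrieved candidates by a sentence-level BLEU score against the query and drops those below a threshold. It is not the project's Java implementation: the functions ngrams, sentence_bleu and rescore are hypothetical, whitespace tokenization and add-one smoothing for n-gram orders above one are assumptions, and the default threshold only mirrors the --bleu-rescoring-threshold value used in the run commands above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing for orders above one."""
    if not candidate or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty for candidates shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)


def rescore(query, candidates, threshold=0.05, limit=5):
    """Keep candidates scoring at least `threshold`, best first."""
    query_tokens = query.split()
    scored = [(sentence_bleu(c.split(), query_tokens), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:limit]


for score, sentence in rescore(
    "Hello World !",
    ["Hello Worlds !", "Good morning everyone", "Hello World !"],
):
    print(f"{score:.3f}  {sentence}")
```

In this sketch an exact match ranks first, a near match follows, and a sentence sharing no words with the query is filtered out.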

References

[1] McCandless, Michael, et al. Lucene in action. Vol. 2. Greenwich: Manning, 2010.

[2] Farajian, M. Amin, et al. "Multi-domain neural machine translation through unsupervised adaptation." Proceedings of the Second Conference on Machine Translation. 2017.
