Continuous development infrastructure for python #353

Closed
wants to merge 22 commits into from
2 changes: 2 additions & 0 deletions .gitignore
@@ -23,3 +23,5 @@ wasm/test_page/js/bergamot-translator-worker.*

# VSCode
.vscode

*.pyc
159 changes: 98 additions & 61 deletions README.md
@@ -1,87 +1,124 @@
# Bergamot Translator
<img src="https://browser.mt/images/about.jpg">

[![CircleCI badge](https://img.shields.io/circleci/project/github/browsermt/bergamot-translator/main.svg?label=CircleCI)](https://circleci.com/gh/browsermt/bergamot-translator/)
# bergamot-translator

Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
[![native](https://github.com/browsermt/bergamot-translator/actions/workflows/native.yml/badge.svg)]()
[![python + wasm](https://github.com/browsermt/bergamot-translator/actions/workflows/build.yml/badge.svg)]()
[![PyPI version](https://badge.fury.io/py/bergamot.svg)](https://badge.fury.io/py/bergamot)
[![twitter](https://img.shields.io/twitter/url.svg?label=Follow%20@BergamotProject&style=social&url=http://twitter.com/BergamotProject)](https://twitter.com/BergamotProject)

## Build Instructions
bergamot-translator enables client-side machine translation on
consumer-grade machines. Developed as part of the
[Bergamot](https://browser.mt/) project, the library builds on top of:

1. [Marian](https://marian-nmt.github.io/): Neural Machine Translation (NMT)
library. This repository uses the fork
[browsermt/marian-dev](https://github.com/browsermt/marian-dev), which
optimizes for faster inference on Intel CPUs and adds WebAssembly support.
2. [student models](https://github.com/browsermt/students): Compressed neural
models that enable translation on consumer-grade devices.

bergamot-translator wraps Marian to add sentence splitting, on-the-fly
batching, HTML markup translation, and an API better suited to developing
applications. The functionality is continuously tested on Windows, macOS and
Linux on `x86_64`, and additionally on the cross-platform WebAssembly target.
Native `aarch64` support is under development.

## Usage

### As a C++ library

bergamot-translator uses the CMake build system. Use the library target
`bergamot-translator` in projects that build applications on top of
the library. The latest developer documentation is available at
[browser.mt/docs/main](https://browser.mt/docs/main).

### In other languages

We provide bindings for Python, and for JavaScript through WebAssembly.

#### Python

This repository provides a Python module, which also comes with a command-line
interface for using the available models. The module is available through PyPI.

### Build Natively
Create a folder where you want to build all the artifacts (`build-native` in this case) and compile

```bash
mkdir build-native
cd build-native
cmake ../
make -j2
python3 -m pip install bergamot
```

### Build WASM
#### Prerequisite
Find an example for a quick-start on Colab below:

Building on wasm requires Emscripten toolchain. It can be downloaded and installed using following instructions:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1AHpgewVJBFaupwAbZq0e6TdX6REx0Ul0)

* Get the latest sdk: `git clone https://github.com/emscripten-core/emsdk.git`
* Enter the cloned directory: `cd emsdk`
* Install the sdk: `./emsdk install 3.1.8`
* Activate the sdk: `./emsdk activate 3.1.8`
* Activate path variables: `source ./emsdk_env.sh`
For more comprehensive documentation on using the module in Python as a
library, see
[browser.mt/docs/main/python.html](https://browser.mt/docs/main/python.html).

#### <a name="Compile"></a> Compile
#### JavaScript/WebAssembly

To build a version that translates with higher speeds on Firefox Nightly browser, follow these instructions:
WebAssembly and JavaScript support is developed for an offline-translation
browser extension intended for use in the Mozilla Firefox web browser.
Emscripten is used to compile the C/C++ sources to WebAssembly. You may use the
pre-built `bergamot-translator-worker.js` and `bergamot-translator-worker.wasm`
available from [releases](https://github.com/browsermt/bergamot-translator/releases).

1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile
```bash
mkdir build-wasm
cd build-wasm
emcmake cmake -DCOMPILE_WASM=on ../
emmake make -j2
```
WebAssembly is available in Firefox and Google Chrome. The artifacts can also
be used through NodeJS. For an example of how to do this, please look at
this [Hello World](./wasm/node-test.js) example. For a complete demo that
works locally in a modern browser, see
[mozilla.github.io/translate](https://mozilla.github.io/translate/).

The wasm artifacts (.js and .wasm files) will be available in the build directory ("build-wasm" in this case).
WebAssembly is slower than native builds due to the lack of optimized
matrix-multiply primitives. Nightly builds of Mozilla Firefox have faster GEMM
(General Matrix Multiplication) capabilities and are expected to be slightly
faster.

2. Enable SIMD Wormhole via Wasm instantiation API in generated artifacts
```bash
bash ../wasm/patch-artifacts-enable-wormhole.sh
```
## Applications

3. Patch generated artifacts to import GEMM library from a separate wasm module
```bash
bash ../wasm/patch-artifacts-import-gemm-module.sh
```
### translateLocally

To build a version that runs on all browsers (including Firefox Nightly) but translates slowly, follow these instructions:
For a cross-platform, batteries-included GUI application that builds on top of
bergamot-translator, check out
[translateLocally](https://github.com/XapaJIaMnu/translateLocally).
translateLocally downloads models from a repository and curates the
available models.

1. Create a folder where you want to build all the artifacts (`build-wasm` in this case) and compile
```bash
mkdir build-wasm
cd build-wasm
emcmake cmake -DCOMPILE_WASM=on -DWORMHOLE=off ../
emmake make -j2
```
### Browser Extension

2. Patch generated artifacts to import GEMM library from a separate wasm module
```bash
bash ../wasm/patch-artifacts-import-gemm-module.sh
```
Mozilla, as part of the Bergamot project, builds and maintains
[firefox-translations](https://github.com/mozilla/firefox-translations/). The
official Firefox extension uses WebAssembly.

#### Recompiling
As long as you don't update any submodule, just follow [Compile](#Compile) steps.\
If you update a submodule, execute following command in repository root folder before executing
[Compile](#Compile) steps.
```bash
git submodule update --init --recursive
```
See
[jelmervdl/firefox-translations](https://github.com/jelmervdl/firefox-translations/)
for a Chrome extension (Manifest V2) that, in addition to WebAssembly, supports
faster local translation via [Native
Messaging](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_messaging),
as supported by
[translateLocally](https://github.com/XapaJIaMnu/translateLocally).


## Contributing

We appreciate all contributions. There are several ways to contribute to this
project.

1. **Code**: Improvements to the source are always welcome. If you are planning to
contribute back bug-fixes to this repository, please do so without any
further discussion. If you plan to contribute new features, utility functions,
or extensions to the core, please
[discuss](https://github.com/browsermt/bergamot-translator/discussions) the
feature with us first.
2. **Models**: Bergamot, being a wrapper around Marian, should work comfortably
with models trained using Marian. We prefer models that are trained following
the recipe in
[browsermt/students](https://github.com/browsermt/students/tree/master/train-student)
so that they are smaller in size and enable fast inference on
consumer-grade machines.

## How to use
## Acknowledgements

### Using Native version
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 825303.

The build generates a library that can be integrated into any project. All the public header files are in the `src` folder.\
A short example of how to use the APIs is provided in the `app/main.cpp` file.

### Using WASM version

Please follow the `README` inside the `wasm` folder of this repository that demonstrates how to use the translator in JavaScript.
6 changes: 5 additions & 1 deletion bindings/python/bergamot.cpp
@@ -172,6 +172,10 @@ PYBIND11_MODULE(_bergamot, m) {
py::bind_vector<std::vector<std::string>>(m, "VectorString");
py::bind_vector<std::vector<Response>>(m, "VectorResponse");

py::bind_vector<std::vector<float>>(m, "VectorFloat");
py::bind_vector<Alignment>(m, "Alignment");
py::bind_vector<Alignments>(m, "Alignments");

py::enum_<ConcatStrategy>(m, "ConcatStrategy")
.value("FAITHFUL", ConcatStrategy::FAITHFUL)
.value("SPACE", ConcatStrategy::SPACE)
@@ -182,7 +186,7 @@ PYBIND11_MODULE(_bergamot, m) {
py::init<>([](bool qualityScores, bool alignment, bool HTML, bool sentenceMappings, ConcatStrategy strategy) {
return ResponseOptions{qualityScores, alignment, HTML, sentenceMappings, strategy};
}),
py::arg("qualityScores") = true, py::arg("alignment") = false, py::arg("HTML") = false,
py::arg("qualityScores") = false, py::arg("alignment") = false, py::arg("HTML") = false,
py::arg("sentenceMappings") = true, py::arg("concatStrategy") = ConcatStrategy::FAITHFUL)
.def_readwrite("qualityScores", &ResponseOptions::qualityScores)
.def_readwrite("HTML", &ResponseOptions::HTML)
8 changes: 5 additions & 3 deletions bindings/python/repository.py
@@ -15,7 +15,7 @@
APP = "bergamot"


class Repository(ABC):
class Repository(ABC): # pragma: no cover
"""
An interface for several repositories. Intended to enable interchangable
use of translateLocally and Mozilla repositories for usage through python.
@@ -32,7 +32,7 @@ def update(self):
pass

@abstractmethod
def models(self) -> t.List[str]:
def models(self, filter_downloaded: bool) -> t.List[str]:
"""returns identifiers for available models"""
pass

@@ -187,7 +187,9 @@ def modelConfigPath(self, name: str, code: str) -> PathLike:
)

def models(self, name: str, filter_downloaded: bool = True) -> t.List[str]:
return self.repositories.get(name, self.default_repository).models()
return self.repositories.get(name, self.default_repository).models(
filter_downloaded
)

def model(self, name: str, model_identifier: str) -> t.Any:
return self.repositories.get(name, self.default_repository).model(
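Review note: the change above makes the aggregator forward `filter_downloaded` to the underlying repository instead of silently dropping it. The following is a minimal sketch of that delegation pattern, not the project's implementation. `InMemoryRepository` and `Aggregator` are hypothetical names introduced for illustration, and the meaning of the flag (return only downloaded models when `True`) is an assumption, since the diff does not show the concrete repositories.

```python
import typing as t
from abc import ABC, abstractmethod


class Repository(ABC):
    """Interface mirroring the `models` signature introduced in this diff."""

    @abstractmethod
    def models(self, filter_downloaded: bool) -> t.List[str]:
        """Returns identifiers for available models."""


class InMemoryRepository(Repository):
    """Hypothetical repository, used only for illustration."""

    def __init__(self, available: t.List[str], downloaded: t.Set[str]):
        self.available = available
        self.downloaded = downloaded

    def models(self, filter_downloaded: bool) -> t.List[str]:
        # Assumption: True means "only models already downloaded".
        if filter_downloaded:
            return [m for m in self.available if m in self.downloaded]
        return list(self.available)


class Aggregator:
    """Hypothetical stand-in for the class that owns `self.repositories`."""

    def __init__(self, repositories: t.Dict[str, Repository], default: Repository):
        self.repositories = repositories
        self.default_repository = default

    def models(self, name: str, filter_downloaded: bool = True) -> t.List[str]:
        # Forward the flag to the selected repository, falling back to the
        # default repository when `name` is unknown.
        repository = self.repositories.get(name, self.default_repository)
        return repository.models(filter_downloaded)


repo = InMemoryRepository(["de-en-tiny", "en-de-tiny"], {"en-de-tiny"})
aggregator = Aggregator({"browsermt": repo}, repo)
```

With the flag forwarded, `aggregator.models("browsermt")` and `aggregator.models("browsermt", filter_downloaded=False)` can return different lists, which is what the commented-out call in `test_all.py` relies on.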
25 changes: 25 additions & 0 deletions bindings/python/tests/test_all.py
@@ -0,0 +1,25 @@
# type: ignore
import pytest
from bergamot import REPOSITORY, ResponseOptions, Service, ServiceConfig, VectorString
from bergamot.utils import toJSON


def test_basic():
    keys = ["browsermt"]
    models = ["de-en-tiny"]
    config = ServiceConfig(numWorkers=1, logLevel="critical")
    service = Service(config)
    for repository in keys:
        # models = REPOSITORY.models(repository, filter_downloaded=False)
        for model in models:
            REPOSITORY.download(repository, model)

        for modelId in models:
            configPath = REPOSITORY.modelConfigPath(repository, modelId)
            model = service.modelFromConfigPath(configPath)
            options = ResponseOptions(alignment=True, qualityScores=True, HTML=False)
            print(repository, modelId)
            source = "1 2 3 4 5 6 7 8 9"
            responses = service.translate(model, VectorString([source]), options)
            for response in responses:
                print(toJSON(response, indent=4))
130 changes: 130 additions & 0 deletions bindings/python/tests/test_html.py
@@ -0,0 +1,130 @@
# type: ignore
from collections import Counter
from string import whitespace

import pytest
from bergamot import REPOSITORY, ResponseOptions, Service, ServiceConfig, VectorString

try:
    from lxml import etree, html
except ImportError:
    raise ImportError("Please install lxml, the HTML tests require it")


def test_html():
    MODEL = "en-de-tiny"
    config = ServiceConfig(numWorkers=4, logLevel="warn")
    service = Service(config)
    config_path = REPOSITORY.modelConfigPath("browsermt", MODEL)
    model = service.modelFromConfigPath(config_path)

    example = """
<div class="wrap">
<div class="image-wrap"><img src="/images/about.jpg" alt=""></div>
<h2 id="the-bergamot-project">Das Projekt Bergamot</h2>
<p>Das Bergamot-Projekt wird die Übersetzung der Client-Seite der Maschine in einem Webbrowser ergänzen und verbessern.</p>
<p>Im Gegensatz zu aktuellen <b>Cloud-basierten Optionen</b>, die di<s>rekt</s> auf den Rechnern der Nutzer laufen, die Bürgerinnen und Bürger, ihre Privatsphäre zu bewahren und erhöht die Verbreitung von Sprachtechnologien in Europa in verschiedenen Sektoren, die Vertraulichkeit erfordern. Freie Software, die mit einem Open-Source-Webbrowser wie Mozilla Firefox integriert ist, wird die Akzeptanz von unten nach oben durch Nicht-Experten ermöglichen, was zu Kosteneinsparungen für private und öffentliche Nutzer führt, die andernfalls Übersetzungen beschaffen oder einsprachig arbeiten würden.</p>
<p>Bergamot ist ein Konsortium, das von der Universität Edinburgh mit den Partnern Charles University in Prag, der University of Sheffield, der University of Tartu und Mozilla koordiniert wird.</p>
</div>
"""

    def translate(src, HTML=True):
        options = ResponseOptions(HTML=HTML)
        responses = service.translate(model, VectorString([src]), options)
        return responses[0].target.text

    def get_surrounding_text(element):
        """
        Places for spaces: 0 <b> 1 … 2 </b> 3
        0 before_open: prev.tail[-1] if prev else parent.text[-1]
        1 after_open: elem.text[0]
        2 before_close: last_child.tail[-1] if last_child else elem.text[-1]
        3 after_close: elem.tail[0]
        """
        before_open = (
            element.getprevious().tail
            if element.getprevious() is not None
            else element.getparent().text
        )
        after_open = element.text
        last_child = next(element.iterchildren(reversed=True), None)
        before_close = last_child.tail if last_child is not None else element.text
        after_close = element.tail
        return [before_open, after_open, before_close, after_close]

    def has_surrounding_text(element):
        return [
            text is not None and text.strip() != ""
            for text in get_surrounding_text(element)
        ]

    def has_surrounding_spaces(element):
        return [
            isinstance(text, str) and len(text) > 0 and text[index] in whitespace
            for text, index in zip(get_surrounding_text(element), [-1, 0, -1, 0])
        ]

    def format_whitespace(slots):
        return "{}<t>{}…{}</t>{}".format(*["␣" if slot else "⊘" for slot in slots])

    def format_element(element):
        return "<{}{}>".format(
            element.tag,
            "".join(
                f' {key}="{val}"' for key, val in element.items() if key != "x-test-id"
            ),
        )

    def clean_html(src):
        tree = html.fromstring(src)
        return html.tostring(tree, encoding="utf-8").decode()

    def compare_html(src, translate):
        """Marks tags, then translates and compares translated HTML"""
        src_tree = html.fromstring(src)
        src_elements = {str(n): element for n, element in enumerate(src_tree.iter(), 1)}
        # Assign each element a unique id to help it correlate after translation
        for n, element in src_elements.items():
            element.set("x-test-id", n)
        src = html.tostring(src_tree, encoding="utf-8").decode()

        tgt = translate(src)
        tgt_tree = html.fromstring(tgt)

        # Test if all elements are referenced once
        print("Elements referenced:")
        tgt_element_count = Counter(
            element.get("x-test-id") for element in tgt_tree.iter()
        )
        for n, element in src_elements.items():
            count = tgt_element_count[n]
            if count != 1:
                print(f"{count}: {element!r}")

        # Test whether all elements have text around them (i.e. no empty elements
        # that should not be empty)
        print("Elements with missing text:")
        for tgt_element in tgt_tree.iter():
            n = tgt_element.get("x-test-id")
            src_element = src_elements[n]
            tgt_text = has_surrounding_text(tgt_element)
            src_text = has_surrounding_text(src_element)
            if tgt_text != src_text:
                # Report the element being checked, not the stale `element`
                # left over from the loop above.
                print(f"{tgt_element!r}: {tgt_text!r} (input: {src_text!r})")

        # Test whether the spaces around the elements are present. All spaces are
        # treated as a single space (unless <pre></pre>) thus it doesn't need to
        # be exactly the same. But space vs no space does affect the flow of the
        # document.
        print("Elements with differences in whitespace around tags")
        for tgt_element in tgt_tree.iter():
            n = tgt_element.get("x-test-id")
            src_element = src_elements[n]
            tgt_spaces = has_surrounding_spaces(tgt_element)
            src_spaces = has_surrounding_spaces(src_element)
            if tgt_spaces != src_spaces:
                print(
                    f"{format_whitespace(tgt_spaces)} (input: {format_whitespace(src_spaces)}) for {format_element(tgt_element)}"
                )

    compare_html(example, translate)
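
Review note: the core idea in `compare_html` is to tag every element with a unique `x-test-id` before translation and count how often each id reappears afterwards. Below is a minimal sketch of that check in isolation. It swaps in the standard-library `xml.etree.ElementTree` for `lxml` (so the input must be well-formed XML), and `mark_elements`, `reference_counts`, and `fake_translate` are hypothetical helpers, with `fake_translate` standing in for the translation service.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARKER = "x-test-id"


def mark_elements(src):
    """Tags every element with a unique id; returns the markup and id map."""
    root = ET.fromstring(src)
    elements = {str(n): el for n, el in enumerate(root.iter(), 1)}
    for n, el in elements.items():
        el.set(MARKER, n)
    return ET.tostring(root, encoding="unicode"), elements


def reference_counts(tgt, elements):
    """Counts how often each source id occurs in the translated markup."""
    counts = Counter(el.get(MARKER) for el in ET.fromstring(tgt).iter())
    return {n: counts[n] for n in elements}


def fake_translate(marked):
    """Stand-in for the translator: returns the markup unchanged."""
    return marked


src = "<div><p>Hello <b>bold</b> world</p><p>Bye</p></div>"
marked, elements = mark_elements(src)
counts = reference_counts(fake_translate(marked), elements)
```

Any id whose count is not exactly 1 points at an element that was dropped or duplicated, which is the condition the test prints under "Elements referenced:".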