Initial Pytest + Repo Config Setup #17

Merged · 8 commits · Jun 28, 2024
83 changes: 83 additions & 0 deletions .github/workflows/release.yaml
@@ -0,0 +1,83 @@
name: Build and upload to PyPI

on:
push:
release:
types:
- published

jobs:
run_tests:
name: Run tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v3

- name: Install
run: python3 -m pip install .

- name: Run runtime pytest
run: pytest tests

build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v3

- name: Set version number
if: startsWith(github.ref, 'refs/tags/v')
run: echo "VERSION = \"${GITHUB_REF_NAME:1}\"" > opuscleaner/__about__.py

- name: Build sdist
run: pipx run build --sdist

- uses: actions/upload-artifact@v3
with:
name: sdist
path: dist/opuscleaner-*.tar.gz

build_wheels:
name: Build wheels
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v3

- name: Set version number
if: startsWith(github.ref, 'refs/tags/v')
run: echo "VERSION = \"${GITHUB_REF_NAME:1}\"" > opuscleaner/__about__.py

- name: Build wheels
run: python -m pip wheel -w wheelhouse .

- uses: actions/upload-artifact@v3
with:
name: wheels
path: ./wheelhouse/opuscleaner-*.whl

upload_pypi:
needs: [build_wheels, build_sdist]
runs-on: ubuntu-latest
if: github.event_name == 'release' && github.event.action == 'published'
steps:
- uses: actions/download-artifact@v3
with:
name: wheels
path: dist

- uses: actions/download-artifact@v3
with:
name: sdist
path: dist

- uses: pypa/[email protected]
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
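The `Set version number` steps write a tag-derived version into `opuscleaner/__about__.py` by stripping the leading `v` with bash's `${GITHUB_REF_NAME:1}` substring expansion. A minimal Python sketch of the same transformation (the helper names are illustrative, not part of the repo):

```python
def version_from_ref_name(ref_name: str) -> str:
    # Mirrors bash's ${GITHUB_REF_NAME:1}: drop the first character
    # of the tag name (the leading "v").
    return ref_name[1:]

def about_file_contents(ref_name: str) -> str:
    # The workflow writes this single line into opuscleaner/__about__.py.
    return 'VERSION = "{}"\n'.format(version_from_ref_name(ref_name))

print(about_file_contents("v1.2.3"))
```

So a release tagged `v1.2.3` publishes version `1.2.3`; untagged pushes keep the checked-in `VERSION = "0.0.0"` default.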
50 changes: 50 additions & 0 deletions .github/workflows/run_test.yaml
@@ -0,0 +1,50 @@
# This workflow will install Python dependencies, then run tests and lint across the Python versions in the matrix
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
name: Run Pytest Unit Tests

on:
push:
branches: [main]
pull_request:
branches: [main]

permissions:
contents: read

jobs:
build:
runs-on: ubuntu-latest

strategy:
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10"]

steps:
- uses: actions/checkout@v3

- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: pip

- name: Display Python version
run: python -c "import sys; print(sys.version)"

- name: Install requirements
run: |
python -m pip install --upgrade pip setuptools
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
if [ -f requirements-all.txt ]; then pip install -r requirements-all.txt; fi

- name: Lint with Pre-commit
run: |
pre-commit run --all-files

- name: Test with Pytest
run: |
coverage run -m pytest -v -s

- name: Generate Coverage Report
run: |
coverage report -m
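The test step runs pytest under `coverage` so the final step can report line coverage. For context, a minimal pytest-style module of the kind this workflow collects; the helper and its behaviour are hypothetical, not taken from the repo:

```python
# test_example.py: pytest collects any function whose name starts
# with "test_" from files matching test_*.py.
def normalize_lang_pair(src: str, tgt: str) -> str:
    # Hypothetical helper: canonical lowercase "src-tgt" pair.
    return "{}-{}".format(src.lower(), tgt.lower())

def test_normalize_lang_pair():
    assert normalize_lang_pair("EN", "fr") == "en-fr"
```

Locally, the same two commands apply: `coverage run -m pytest -v -s` followed by `coverage report -m`.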
43 changes: 18 additions & 25 deletions README.md
@@ -1,7 +1,11 @@
# OpusPocus on LUMI

This branch is an implementation of the machine translation (MT) training pipeline manager for LUMI HPC cluster.
It uses [OpusCleaner](https://github.com/hplt-project/OpusCleaner/tree/main) for data preparation and [OpusTrainer](https://github.com/hplt-project/OpusTrainer) for training scheduling (in progress).
Modular NLP pipeline manager.

OpusPocus is aimed at simplifying the description and execution of popular and custom NLP pipelines, including dataset preprocessing, model training and evaluation.
The pipeline manager supports execution using simple CLI (Bash) or common HPC schedulers (Slurm, HyperQueue).

It uses [OpusCleaner](https://github.com/hplt-project/OpusCleaner/tree/main) for data preparation and [OpusTrainer](https://github.com/hplt-project/OpusTrainer) for training scheduling (development in progress).


## Structure
@@ -11,13 +15,14 @@ It uses [OpusCleaner](https://github.com/hplt-project/OpusCleaner/tree/main) for
- `config/` - default configuration files (pipeline config, marian training config, ...)
- `examples/` - pipeline manager usage examples
- `scripts/` - helper scripts, at this moment not directly implemented in OpusPocus
- `tests/` - unit tests


## Installation

1. Install [MarianNMT](https://marian-nmt.github.io/docs/).

2. Prepare the OpusCleaner and OpusTrainer Python virtual environments.
2. Prepare the [OpusCleaner](https://github.com/hplt-project/OpusCleaner/blob/main/README.md#installation-for-cleaning) and [OpusTrainer](https://github.com/hplt-project/OpusTrainer/blob/main/README.md#installation) Python virtual environments.

3. Install the OpusPocus requirements.
```
@@ -27,39 +32,27 @@ pip install -r requirements.txt

## Usage (Simple Pipeline)

You can see the example of the pipeline manager usage in examples directory.
Alternatively, you can follow these steps:
See the ``examples/`` directory for example pipeline executions.

1. Initialize the pipeline.
```
python go.py init \
--pipeline simple \
--pipeline-dir pipeline/destination/directory \
$ ./go.py init \
--pipeline-config path/to/pipeline/config/file \
--src-lang en \
--tgt-lang fr \
--raw-data-dir training/corpora/directory \
--valid-data-dir validation/data/directory \
--test-data-dir test/data/directory \
--marian-config path/to/marian/config/file \
--pipeline-dir pipeline/destination/directory \
```

(
The <training-corpora-dir> should contain the corpus .gz files, a categories.json file listing the corpora and their categories, and (optionally) the OpusCleaner .filters.json files.
The valid and test data directories should contain the parallel validation corpora (plaintext).
Other pipeline parameters can be overridden either by modifying the pipeline config file (see the config/pipeline.* files) or by passing the parameter directly to the go.py command as a named argument.
)
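The exact `categories.json` layout is not spelled out here; based on how `merge.py` in this PR builds `categories_dict["categories"]` as a list of `{"name": ...}` objects, a plausible sketch follows (the dataset names and the `mapping` key are assumptions, not confirmed by this diff):

```python
import json

# Hypothetical categories.json contents: named categories plus an
# assignment of corpus files to those categories.
CATEGORIES_JSON = """
{
  "categories": [{"name": "clean"}, {"name": "medium"}],
  "mapping": {"clean": ["rapid.en-fr"], "medium": ["paracrawl.en-fr"]}
}
"""

data = json.loads(CATEGORIES_JSON)
category_names = [c["name"] for c in data["categories"]]
print(category_names)
```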


2. Execute the pipeline.
```
python go.py run \
$ ./go.py run \
--pipeline-dir pipeline/destination/directory \
--runner sbatch \
--runner-opts <options-for-runner> \
--runner bash \
```

3. Check the pipeline status.
```
python go.py traceback --pipeline-dir pipeline/destination/directory
$ ./go.py traceback --pipeline-dir pipeline/destination/directory
```
OR
```
$ ./go.py status --pipeline-dir pipeline/destination/directory
```
1 change: 1 addition & 0 deletions opuspocus/__about__.py
@@ -0,0 +1 @@
VERSION = "0.0.0"
15 changes: 4 additions & 11 deletions opuspocus/pipeline_steps/clean.py
@@ -48,9 +48,7 @@ def register_categories(self) -> None:
OpusCleaner server app creates a categories.json file listing locally
available datasets and their user-specified categorization.
"""
shutil.copy(
self.prev_corpus_step.categories_path, self.categories_path
)
shutil.copy(self.prev_corpus_step.categories_path, self.categories_path)

def get_command_targets(self) -> List[Path]:
return [
@@ -64,9 +62,7 @@ def command(self, target_file: Path) -> None:
dataset = ".".join(str(target_filename).split(".")[:-2])
input_file = Path(self.input_dir, "{}.filters.json".format(dataset))

opuscleaner_bin_path = Path(
self.python_venv_dir, "bin", self.opuscleaner_cmd
)
opuscleaner_bin_path = Path(self.python_venv_dir, "bin", self.opuscleaner_cmd)

# Run OpusCleaner
proc = subprocess.Popen(
@@ -86,8 +82,7 @@ def command(self, target_file: Path) -> None:

# Get the correct order of languages
languages = [
file.split(".")[-2]
for file in json.load(open(input_file, "r"))["files"]
file.split(".")[-2] for file in json.load(open(input_file, "r"))["files"]
]

# Split OpusCleaner output into files
@@ -100,6 +95,4 @@ def command(self, target_file: Path) -> None:
# Check the return code
rc = proc.poll()
if rc:
raise Exception(
"Process {} exited with non-zero value.".format(proc.pid)
)
raise Exception("Process {} exited with non-zero value.".format(proc.pid))
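`clean.py` recovers the language order by taking the second-to-last dot-separated field of each entry in the filters file's `files` list. A self-contained sketch of that extraction, using a hypothetical `.filters.json` fragment:

```python
import json

# Hypothetical .filters.json fragment: OpusCleaner lists the input
# files, e.g. "rapid.en.gz" and "rapid.fr.gz".
FILTERS_JSON = '{"files": ["rapid.en.gz", "rapid.fr.gz"]}'

# Same expression as in clean.py: the language code is the
# second-to-last dot-separated component of each filename.
languages = [f.split(".")[-2] for f in json.loads(FILTERS_JSON)["files"]]
print(languages)
```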
5 changes: 1 addition & 4 deletions opuspocus/pipeline_steps/corpus_step.py
@@ -134,10 +134,7 @@ def shard_index(self) -> Optional[Dict[str, List[Path]]]:

def save_shard_dict(self, shard_dict: Dict[str, List[str]]) -> None:
assert self.is_sharded
yaml.dump(
shard_dict,
open(Path(self.shard_dir, self.shard_index_file), "w")
)
yaml.dump(shard_dict, open(Path(self.shard_dir, self.shard_index_file), "w"))

def get_shard_list(self, dset_filename: str) -> List[Path]:
assert self.shard_index
3 changes: 1 addition & 2 deletions opuspocus/pipeline_steps/merge.py
@@ -56,8 +56,7 @@ def other_corpus_step(self) -> CorpusStep:
def register_categories(self) -> None:
categories_dict = {}
categories_dict["categories"] = [
{"name" : cat}
for cat in self.prev_corpus_step.categories
{"name": cat} for cat in self.prev_corpus_step.categories
]

# Merge the category lists
35 changes: 12 additions & 23 deletions opuspocus/pipeline_steps/translate.py
@@ -12,7 +12,7 @@
from opuspocus.pipeline_steps.corpus_step import CorpusStep
from opuspocus.pipeline_steps.opuspocus_step import OpusPocusStep
from opuspocus.pipeline_steps.train_model import TrainModelStep
from opuspocus.utils import RunnerResources, save_filestream, subprocess_wait
from opuspocus.utils import RunnerResources, save_filestream

logger = logging.getLogger(__name__)

@@ -59,8 +59,7 @@ def _inherits_sharded(self) -> bool:
def model_config_path(self) -> Path:
return Path(
"{}.{}.npz.decoder.yml".format(
self.model_step.model_path,
self.model_suffix
self.model_step.model_path, self.model_suffix
)
)

@@ -75,14 +74,12 @@ def infer_input(self, tgt_file: Path) -> Path:
src_filename = ".".join(
tgt_filename.split(".")[:-offset]
+ [self.src_lang]
+ tgt_filename.split(".")[-(offset - 1):]
+ tgt_filename.split(".")[-(offset - 1) :]
)
if self.prev_corpus_step.is_sharded:
src_file = Path(self.shard_dir, src_filename)
if not src_file.exists():
src_file.hardlink_to(
Path(self.input_shard_dir, src_filename)
)
src_file.hardlink_to(Path(self.input_shard_dir, src_filename))
else:
src_file = Path(self.output_dir, src_filename)
if not src_file.exists():
@@ -96,10 +93,9 @@ def get_command_targets(self) -> List[Path]:
targets = []
for dset in self.dataset_list:
dset_filename = "{}.{}.gz".format(dset, self.tgt_lang)
targets.extend([
shard_file
for shard_file in self.get_shard_list(dset_filename)
])
targets.extend(
[shard_file for shard_file in self.get_shard_list(dset_filename)]
)
return targets
return [
Path(self.output_dir, "{}.{}.gz".format(dset, self.tgt_lang))
@@ -117,10 +113,9 @@ def command_preprocess(self) -> None:
)
shard_dict[d_fname_target] = [
".".join(
shard.split(".")[:-3]
+ [self.tgt_lang]
+ shard.split(".")[-2:]
) for shard in s_fname_list
shard.split(".")[:-3] + [self.tgt_lang] + shard.split(".")[-2:]
)
for shard in s_fname_list
]
self.save_shard_dict(shard_dict)

@@ -158,11 +153,7 @@ def command(self, target_file: Path) -> None:

# Execute the command
proc = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=sys.stderr,
env=env,
text=True
cmd, stdout=subprocess.PIPE, stderr=sys.stderr, env=env, text=True
)

def terminate_signal(signalnum, handler):
@@ -176,6 +167,4 @@ def terminate_signal(signalnum, handler):
# Check the return code
rc = proc.poll()
if rc:
raise Exception(
"Process {} exited with non-zero value.".format(proc.pid)
)
raise Exception("Process {} exited with non-zero value.".format(proc.pid))
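`infer_input` in translate.py derives the source-side filename from the target-side one by splicing the source language code into the dot-separated name. A standalone sketch of that filename arithmetic (the default `offset` of 2 is an assumption for plain, unsharded `dataset.lang.gz` names; the step computes the offset itself):

```python
def infer_src_filename(tgt_filename: str, src_lang: str, offset: int = 2) -> str:
    # Same expression as in translate.py: replace the language field,
    # counted `offset` dot-separated components from the end, keeping
    # everything before and after it, e.g. "rapid.fr.gz" -> "rapid.en.gz".
    parts = tgt_filename.split(".")
    return ".".join(parts[:-offset] + [src_lang] + parts[-(offset - 1):])

print(infer_src_filename("rapid.fr.gz", "en"))
```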