
Comparing changes

base repository: marian-nmt/sotastream, base: v1.0.0
head repository: marian-nmt/sotastream, compare: main
  • 6 commits
  • 9 files changed
  • 3 contributors

Commits on Aug 2, 2023

  1. release v1.0.0; update README

    Thamme Gowda committed Aug 2, 2023
    6100ea1

Commits on Aug 16, 2023

  1. Merge pull request #3 from marian-nmt/tg/v1

    release v1.0.0; update README
    thammegowda authored Aug 16, 2023

    Verified: created on GitHub.com and signed with GitHub's verified signature (the key has expired).
    45068bb
  2. 076ee70

Commits on Aug 29, 2023

  1. Fix random.seed() invocation and remove importlib (#5)

    * Moved random.seed() invocation from DataSource into the top-level pipeline constructor (closes #4)
    * Fix version computation so that it is synced between __init__  & pyproject.toml
    
    ---------
    
    Co-authored-by: Thamme Gowda <[email protected]>
    mjpost and Thamme Gowda authored Aug 29, 2023

    a77c0d5
  2. Fix typos and sync README -> docs/intro.rst

    Thamme Gowda committed Aug 29, 2023
    b50ae95

Commits on Dec 7, 2023

  1. Update bibtex

    thammegowda authored Dec 7, 2023

    2fbb6d0
Showing with 111 additions and 60 deletions.
  1. +12 −0 CHANGELOG.md
  2. +41 −23 README.md
  3. +24 −0 docs/README.md
  4. +3 −0 docs/conf.py
  5. +23 −20 docs/introduction.rst
  6. +4 −1 pyproject.toml
  7. +1 −4 sotastream/__init__.py
  8. +0 −2 sotastream/augmentors/augmentors.py
  9. +3 −10 sotastream/pipelines/base.py
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,12 @@
# Changelog

## [1.0.1] --- 2023-08-28

### Fixed
- Moved random seed initialization from DataSource to the pipeline constructor
- Read version from project file manually instead of via importlib,
which created problems with Python 3.8

## [1.0.0] --- 2023-07-31

Initial public release.
64 changes: 41 additions & 23 deletions README.md
@@ -1,35 +1,37 @@
# Sotastream
[![image](http://img.shields.io/pypi/v/sotastream.svg)](https://pypi.python.org/pypi/sotastream/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
[![Read the Docs](https://img.shields.io/readthedocs/sotastream.svg)](https://sotastream.readthedocs.io/)

## Introduction

Sotastream is a data augmentation tool for training
pipelines. It uses `infinibatch` internally to generate an infinite
stream of shuffled training data and provides a means for on-the-fly
data manipulation, augmentation, mixing, and sampling.

## Cloning and initialization

To begin, clone the repository:
## Setup

```
git clone https://github.com/marian-nmt/sotastream
To install from PyPI (https://pypi.org/project/sotastream/):
```bash
pip install sotastream
```

You can then install it as follows.
*Developer Setup:*

```bash
# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream

# option 1:
python -m pip install .
python -m pip install --no-deps . # install without dependencies
# option 2: install in --editable mode
python -m pip install -e .
```
If you already have your own version of the requirements, add the `--no-deps` flag to skip installing dependencies.

Entry points
*Entry points*
* As a module: `python -m sotastream`
* As a bin in your $PATH: `sotastream`
* Via path to script: `python path/to/cli.py`. For convenience, `cli.py` is in the root of the repository.


## Development

@@ -76,8 +78,6 @@ to checksummed folders under `/tmp/sotastream/{checksum}`:
python -m sotastream example parallel.tsv.gz backtrans.tsv.gz
```

(The garbage file is assumed to have just a single column of data, which is copied).

There are currently two main pipelines: "default", and "wmt". These vary according to
the data sources they take as well as the other options available to them.

@@ -123,12 +123,30 @@ You can find some examples in `test/dummy_pipeline.py`, as well as the real exam

Sotastream is developed by _TextMT Team_ @ Microsoft Translator.

* Roman Grundkiewicz
* Thamme Gowda
* Rohit Jain
* Huda Khayrallah
* Matt Post
* Marcin Junczys-Dowmunt


> We are finishing up a paper that describes `sotastream` in detail; it will be linked here.
If you use this tool, please cite:
Paper link: https://arxiv.org/abs/2308.07489 | https://aclanthology.org/2023.nlposs-1.13/


```bibtex
@inproceedings{post-etal-2023-sotastream,
title = "{SOTASTREAM}: A Streaming Approach to Machine Translation Training",
author = "Post, Matt and
Gowda, Thamme and
Grundkiewicz, Roman and
Khayrallah, Huda and
Jain, Rohit and
Junczys-Dowmunt, Marcin",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.13",
pages = "110--119",
}
```
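The "infinite stream with on-the-fly manipulation, mixing, and sampling" idea from the introduction can be sketched with plain Python generators. This is illustrative only; the function names are hypothetical and not sotastream's actual API:

```python
# Illustrative-only sketch of the streaming idea: compose plain Python
# generators into a pipeline that transforms and mixes records on the
# fly. Names are hypothetical, not sotastream's actual API.
def source(lines):
    """Yield records from an underlying corpus."""
    yield from lines

def uppercase(stream):
    """An on-the-fly augmentor: transform each record as it passes."""
    for line in stream:
        yield line.upper()

def mix(a, b):
    """Alternate records from two upstream streams."""
    for x, y in zip(a, b):
        yield x
        yield y

# Compose: augment one corpus, then interleave it with another.
pipeline = mix(uppercase(source(["hello", "world"])), source(["1", "2"]))
```

Because each stage is lazy, records flow through the whole pipeline one at a time, which is what lets a real implementation serve an effectively infinite shuffled stream.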
24 changes: 24 additions & 0 deletions docs/README.md
@@ -7,3 +7,27 @@ pip install -U sphinx sphinx_rtd_theme
make clean
make html
```



## Release Package to PyPI

```bash

# run unit and regression tests
make check

pip install --upgrade build pip twine
rm -rf dist/
python -m build --sdist --wheel -o dist/

# create your ~/.pypirc, if missing
twine upload -r testpypi dist/*
twine upload -r pypi dist/*

```


## Update Docs

Go to https://readthedocs.org/projects/sotastream/ and click/touch "Build" button.
3 changes: 3 additions & 0 deletions docs/conf.py
@@ -44,10 +44,13 @@
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']


def run_apidoc(_):
# from sphinx.apidoc import main # for older Sphinx <= 1.6
from sphinx.ext.apidoc import main # for newer

main(['-e', '-o', str(DOCS_DIR / 'api'), str(SRC_DIR), '--force'])


def setup(app):
app.connect('builder-inited', run_apidoc)
43 changes: 23 additions & 20 deletions docs/introduction.rst
@@ -10,31 +10,37 @@ uses `infinibatch <https://github.com/microsoft/infinibatch>`_ internally to gen
shuffled training data and provides a means for on-the-fly data
manipulation, augmentation, mixing, and sampling.

Cloning and initialization
--------------------------

To begin, clone the repository:

::

git clone https://github.com/marian-nmt/sotastream
Setup
-----

To install from PyPI (https://pypi.org/project/sotastream/)

You can then install it as follows.

.. code:: bash
cd sotastream
pip install sotastream
*Developer Setup:*

.. code:: bash
# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream
# option 1:
python -m pip install .
python -m pip install --no-deps . # install without dependencies
# option 2: install in --editable mode
python -m pip install -e .
If you already have your own version of the requirements, add the
``--no-deps`` flag to skip installing dependencies.
*Entry points*
* As a module: `python -m sotastream`
* As a bin in your $PATH: `sotastream`

Entry points \* As a module: ``python -m sotastream`` \* As a bin in
your $PATH: ``sotastream`` \* Via path to script:
``python path/to/cli.py``. For convenience, cli.py is in the root of
repository

Development
-----------
@@ -94,11 +100,8 @@ sotastream will split them to checksummed folders under

python -m sotastream example parallel.tsv.gz backtrans.tsv.gz

(The garbage file is assumed to have just a single column of data, which
is copied).

There are currently two main pipelines: “default”, and “wmt”. These vary
according to the data sources they take as well as the other options
There are currently two main pipelines: “default”, and “wmt”.
These vary according to the data sources they take as well as the other options
available to them.

There are global options that control behavioral aspects such as
@@ -116,7 +119,7 @@ can see these by running
# see wmt pipeline options
python -m sotastream wmt -h

Dont cross the streams!
Don't cross the streams!
------------------------

Sotastream workflows build a directed acyclic graph (DAG) consisting of
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "sotastream"
version = "1.0.0.dev0" # note: __init__.py:__version__ will get this via importlib.metadata
dynamic = ["version"] # see [tool.setuptools.dynamic] below
description = """Sotastream is a command line tool that augments a batch of text and produces an infinite stream of records."""
readme = "README.md"
requires-python = ">=3.6"
@@ -58,6 +58,9 @@ sotastream = "sotastream.cli:main"
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
version = {attr = "sotastream.__version__"}

[tool.setuptools.packages.find]
#where = ["src"] # ["."] by default
include = ["sotastream*"] # ["*"] by default
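The `[tool.setuptools.dynamic]` change above single-sources the version: it now lives only in `sotastream/__init__.py`, and the build reads it from there. A rough sketch of the equivalent lookup (a regex-based reader is shown for illustration; this is not setuptools' actual implementation):

```python
# Hedged sketch of the single-sourcing pattern enabled by
# [tool.setuptools.dynamic]: extract __version__ from an __init__.py
# without importing the package. Illustrative, not setuptools' code.
import re
from pathlib import Path

def read_version(init_path: str) -> str:
    """Extract __version__ from an __init__.py without importing it."""
    text = Path(init_path).read_text()
    match = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', text)
    if match is None:
        raise ValueError(f"no __version__ found in {init_path}")
    return match.group(1)
```

Avoiding the import sidesteps the `importlib.metadata` lookup that, per the changelog, caused problems on Python 3.8.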
5 changes: 1 addition & 4 deletions sotastream/__init__.py
@@ -1,11 +1,8 @@
import sys

__version__ = "1.0.1"
sys.dont_write_bytecode = True

from importlib import metadata

__version__ = metadata.version(__package__)


class Defaults:
"""
2 changes: 0 additions & 2 deletions sotastream/augmentors/augmentors.py
@@ -75,8 +75,6 @@ def DataSource(
instance_rank = 0
logger.info(f"Opening path {path}")

random.seed(seed)

# Worker ID i will only see every ith chunk
chunk_file_paths = []
total_chunks = 0
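The seed fix shown in this diff (removing `random.seed()` from `DataSource` and calling it once in the pipeline constructor, per the changelog) can be sketched as follows; the class and method names are illustrative, not sotastream's actual API:

```python
# Sketch of the seed fix: seed the RNG once in the top-level pipeline
# constructor rather than inside every DataSource, so opening multiple
# data sources no longer resets the random stream mid-run.
# Class and method names are illustrative, not sotastream's API.
import random

class Pipeline:
    def __init__(self, seed: int = 42):
        random.seed(seed)  # seed exactly once, up front

    def sample(self, n: int):
        # downstream consumers draw from one shared, reproducible stream
        return [random.random() for _ in range(n)]
```

Two pipelines built with the same seed then produce identical streams, which is the reproducibility property the fix restores.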
13 changes: 3 additions & 10 deletions sotastream/pipelines/base.py
@@ -1,6 +1,7 @@
from abc import ABC
import itertools
import logging
import random
import os

from sotastream import Defaults
@@ -45,6 +46,8 @@ def __init__(self, **kwargs) -> None:
self.separator = kwargs.get("separator", Defaults.SEPARATOR)
self.shuffle = not kwargs.get("no_shuffle", not Defaults.SHUFFLE)

random.seed(self.seed)

# These are set in the environment of the caller when multiprocessing is enabled.
# Each sub-process gets a distinct worker ID and knows the total number of workers.
# These values are used to allocate the shards of a data source in a round-robin
@@ -101,21 +104,11 @@ def add_cli_args(cls, parser):
parser.add_argument(name, help=desc, nargs=nargs)

parser.add_argument("--spm", help="SPM model (for more accurate length calculation)")
parser.add_argument(
"--sample-length",
action="store_true",
help="Whether to fill each sample with the maximum tokens (default) or first sample a length (uniformly at random).",
)
parser.add_argument(
"--separator",
default=" ",
help="String to use when joining sentences for data augmentation (default: '%(default)s').",
)
parser.add_argument(
"--augment",
action="store_true",
help="Whether to add capitalization and target-copy augmentations",
)
parser.add_argument(
"--max-joined-tokens",
"--max-tokens",