
Comparing changes

base repository: marian-nmt/sotastream, base: v1.0.0
head repository: marian-nmt/sotastream, compare: main
  • 6 commits
  • 9 files changed
  • 3 contributors

Commits on Aug 2, 2023

  1. release v1.0.0; update README

    Thamme Gowda committed Aug 2, 2023
    6100ea1

Commits on Aug 16, 2023

  1. Merge pull request #3 from marian-nmt/tg/v1

    release v1.0.0; update README
    thammegowda authored Aug 16, 2023

    Verified: created on GitHub.com and signed with GitHub's verified signature (the key has expired).
    45068bb
  2. 076ee70

Commits on Aug 29, 2023

  1. Fix random.seed() invocation and remove importlib (#5)

    * Moved random.seed() invocation from DataSource into the top-level pipeline constructor (closes #4)
    * Fix version computation so that it is synced between __init__  & pyproject.toml
    
    ---------
    
    Co-authored-by: Thamme Gowda <[email protected]>
    mjpost and Thamme Gowda authored Aug 29, 2023

    a77c0d5
  2. Fix typos and sync README -> docs/intro.rst

    Thamme Gowda committed Aug 29, 2023
    b50ae95

Commits on Dec 7, 2023

  1. Update bibtex

    thammegowda authored Dec 7, 2023

    2fbb6d0
Showing with 111 additions and 60 deletions.
  1. +12 −0 CHANGELOG.md
  2. +41 −23 README.md
  3. +24 −0 docs/README.md
  4. +3 −0 docs/conf.py
  5. +23 −20 docs/introduction.rst
  6. +4 −1 pyproject.toml
  7. +1 −4 sotastream/__init__.py
  8. +0 −2 sotastream/augmentors/augmentors.py
  9. +3 −10 sotastream/pipelines/base.py
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,12 @@
# Changelog

## [1.0.1] --- 2023-08-28

### Fixed
- Moved random seed initialization from DataSource to the pipeline constructor
- Read version from project file manually instead of via importlib,
which created problems with Python 3.8

## [1.0.0] --- 2023-07-31

Initial public release.
64 changes: 41 additions & 23 deletions README.md
@@ -1,35 +1,37 @@
# Sotastream
[![image](http://img.shields.io/pypi/v/sotastream.svg)](https://pypi.python.org/pypi/sotastream/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
[![Read the Docs](https://img.shields.io/readthedocs/sotastream.svg)](https://sotastream.readthedocs.io/)

## Introduction

Sotastream is a data augmentation tool for training
pipelines. It uses `infinibatch` internally to generate an infinite
stream of shuffled training data and provides a means for on-the-fly
data manipulation, augmentation, mixing, and sampling.

## Cloning and initialization

To begin, clone the repository:
## Setup

```
git clone https://github.com/marian-nmt/sotastream
To install from PyPI (https://pypi.org/project/sotastream/):
```bash
pip install sotastream
```

You can then install it as follows.
*Developer Setup:*

```bash
# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream

# option 1:
python -m pip install .
python -m pip install --no-deps . # install without dependencies
# option 2: install in --editable mode
python -m pip install -e .
```
If you already have your own version of the requirements, add the `--no-deps` flag to skip installing dependencies.

Entry points
*Entry points*
* As a module: `python -m sotastream`
* As a bin in your $PATH: `sotastream`
* Via path to script: `python path/to/cli.py`. For convenience, `cli.py` is in the root of the repository.


## Development

@@ -76,8 +78,6 @@ to checksummed folders under `/tmp/sotastream/{checksum}`:
python -m sotastream example parallel.tsv.gz backtrans.tsv.gz
```

(The garbage file is assumed to have just a single column of data, which is copied).

There are currently two main pipelines: "default", and "wmt". These vary according to
the data sources they take as well as the other options available to them.

@@ -123,12 +123,30 @@ You can find some examples in `test/dummy_pipeline.py`, as well as the real exam

Sotastream is developed by _TextMT Team_ @ Microsoft Translator.

* Roman Grundkiewicz
* Thamme Gowda
* Rohit Jain
* Huda Khayrallah
* Matt Post
* Marcin Junczys-Dowmunt


> We are finishing up a paper that describes `sotastream` in detail; it will be linked here.
If you use this tool, please cite:
Paper link: https://arxiv.org/abs/2308.07489 | https://aclanthology.org/2023.nlposs-1.13/


```bibtex
@inproceedings{post-etal-2023-sotastream,
title = "{SOTASTREAM}: A Streaming Approach to Machine Translation Training",
author = "Post, Matt and
Gowda, Thamme and
Grundkiewicz, Roman and
Khayrallah, Huda and
Jain, Rohit and
Junczys-Dowmunt, Marcin",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.13",
pages = "110--119",
}
```
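The "infinite stream with on-the-fly manipulation, mixing, and sampling" idea from the introduction can be sketched with plain Python generators. This is illustrative only; the function names are hypothetical and not sotastream's actual API:

```python
# Illustrative-only sketch of the streaming idea: compose plain Python
# generators into a pipeline that transforms and mixes records on the
# fly. Names are hypothetical, not sotastream's actual API.
def source(lines):
    """Yield records from an underlying corpus."""
    yield from lines

def uppercase(stream):
    """An on-the-fly augmentor: transform each record as it passes."""
    for line in stream:
        yield line.upper()

def mix(a, b):
    """Alternate records from two upstream streams."""
    for x, y in zip(a, b):
        yield x
        yield y

# Compose: augment one corpus, then interleave it with another.
pipeline = mix(uppercase(source(["hello", "world"])), source(["1", "2"]))
```

Because each stage is lazy, records flow through the whole pipeline one at a time, which is what lets a real implementation serve an effectively infinite shuffled stream.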
24 changes: 24 additions & 0 deletions docs/README.md
@@ -7,3 +7,27 @@ pip install -U sphinx sphinx_rtd_theme
make clean
make html
```



## Release Package to PyPI

```bash

# run unit and regression tests
make check

pip install --upgrade build pip twine
rm -rf dist/
python -m build --sdist --wheel -o dist/

# create your ~/.pypirc, if missing
twine upload -r testpypi dist/*
twine upload -r pypi dist/*

```


## Update Docs

Go to https://readthedocs.org/projects/sotastream/ and click/touch "Build" button.
3 changes: 3 additions & 0 deletions docs/conf.py
@@ -44,10 +44,13 @@
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']


def run_apidoc(_):
# from sphinx.apidoc import main # for older Sphinx <= 1.6
from sphinx.ext.apidoc import main # for newer

main(['-e', '-o', str(DOCS_DIR / 'api'), str(SRC_DIR), '--force'])


def setup(app):
app.connect('builder-inited', run_apidoc)
43 changes: 23 additions & 20 deletions docs/introduction.rst
@@ -10,31 +10,37 @@ uses `infinibatch <https://github.com/microsoft/infinibatch>`_ internally to gen
shuffled training data and provides a means for on-the-fly data
manipulation, augmentation, mixing, and sampling.

Cloning and initialization
--------------------------

To begin, clone the repository:

::

git clone https://github.com/marian-nmt/sotastream
Setup
-----

To install from PyPI (https://pypi.org/project/sotastream/)

You can then install it as follows.

.. code:: bash
cd sotastream
pip install sotastream
*Developer Setup:*

.. code:: bash
# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream
# option 1:
python -m pip install .
python -m pip install --no-deps . # install without dependencies
# option 2: install in --editable mode
python -m pip install -e .
If you already have your own version of the requirements, add the
``--no-deps`` flag to skip installing dependencies.
*Entry points*
* As a module: `python -m sotastream`
* As a bin in your $PATH: `sotastream`

Entry points \* As a module: ``python -m sotastream`` \* As a bin in
your $PATH: ``sotastream`` \* Via path to script:
``python path/to/cli.py``. For convenience, cli.py is in the root of
repository

Development
-----------
@@ -94,11 +100,8 @@ sotastream will split them to checksummed folders under

python -m sotastream example parallel.tsv.gz backtrans.tsv.gz

(The garbage file is assumed to have just a single column of data, which
is copied).

There are currently two main pipelines: “default”, and “wmt”. These vary
according to the data sources they take as well as the other options
There are currently two main pipelines: “default”, and “wmt”.
These vary according to the data sources they take as well as the other options
available to them.

There are global options that control behavioral aspects such as
@@ -116,7 +119,7 @@ can see these by running
# see wmt pipeline options
python -m sotastream wmt -h

Dont cross the streams!
Don't cross the streams!
------------------------

Sotastream workflows build a directed acyclic graph (DAG) consisting of
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "sotastream"
version = "1.0.0.dev0" # note: __init__.py:__version__ will get this via importlib.metadata
dynamic = ["version"] # see [tool.setuptools.dynamic] below
description = """Sotastream is a command line tool that augments a batch of text and produces an infinite stream of records."""
readme = "README.md"
requires-python = ">=3.6"
@@ -58,6 +58,9 @@ sotastream = "sotastream.cli:main"
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
version = {attr = "sotastream.__version__"}

[tool.setuptools.packages.find]
#where = ["src"] # ["."] by default
include = ["sotastream*"] # ["*"] by default
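The `[tool.setuptools.dynamic]` change above single-sources the version: it now lives only in `sotastream/__init__.py`, and the build reads it from there. A rough sketch of the equivalent lookup (a regex-based reader is shown for illustration; this is not setuptools' actual implementation):

```python
# Hedged sketch of the single-sourcing pattern enabled by
# [tool.setuptools.dynamic]: extract __version__ from an __init__.py
# without importing the package. Illustrative, not setuptools' code.
import re
from pathlib import Path

def read_version(init_path: str) -> str:
    """Extract __version__ from an __init__.py without importing it."""
    text = Path(init_path).read_text()
    match = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', text)
    if match is None:
        raise ValueError(f"no __version__ found in {init_path}")
    return match.group(1)
```

Avoiding the import sidesteps the `importlib.metadata` lookup that, per the changelog, caused problems on Python 3.8.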
5 changes: 1 addition & 4 deletions sotastream/__init__.py
@@ -1,11 +1,8 @@
import sys

__version__ = "1.0.1"
sys.dont_write_bytecode = True

from importlib import metadata

__version__ = metadata.version(__package__)


class Defaults:
"""
2 changes: 0 additions & 2 deletions sotastream/augmentors/augmentors.py
@@ -75,8 +75,6 @@ def DataSource(
instance_rank = 0
logger.info(f"Opening path {path}")

random.seed(seed)

# Worker ID i will only see every ith chunk
chunk_file_paths = []
total_chunks = 0
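The seed fix shown in this diff (removing `random.seed()` from `DataSource` and calling it once in the pipeline constructor, per the changelog) can be sketched as follows; the class and method names are illustrative, not sotastream's actual API:

```python
# Sketch of the seed fix: seed the RNG once in the top-level pipeline
# constructor rather than inside every DataSource, so opening multiple
# data sources no longer resets the random stream mid-run.
# Class and method names are illustrative, not sotastream's API.
import random

class Pipeline:
    def __init__(self, seed: int = 42):
        random.seed(seed)  # seed exactly once, up front

    def sample(self, n: int):
        # downstream consumers draw from one shared, reproducible stream
        return [random.random() for _ in range(n)]
```

Two pipelines built with the same seed then produce identical streams, which is the reproducibility property the fix restores.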
13 changes: 3 additions & 10 deletions sotastream/pipelines/base.py
@@ -1,6 +1,7 @@
from abc import ABC
import itertools
import logging
import random
import os

from sotastream import Defaults
@@ -45,6 +46,8 @@ def __init__(self, **kwargs) -> None:
self.separator = kwargs.get("separator", Defaults.SEPARATOR)
self.shuffle = not kwargs.get("no_shuffle", not Defaults.SHUFFLE)

random.seed(self.seed)

# These are set in the environment of the caller when multiprocessing is enabled.
# Each sub-process gets a distinct worker ID and knows the total number of workers.
# These values are used to allocate the shards of a data source in a round-robin
@@ -101,21 +104,11 @@ def add_cli_args(cls, parser):
parser.add_argument(name, help=desc, nargs=nargs)

parser.add_argument("--spm", help="SPM model (for more accurate length calculation)")
parser.add_argument(
"--sample-length",
action="store_true",
help="Whether to fill each sample with the maximum tokens (default) or first sample a length (uniformly at random).",
)
parser.add_argument(
"--separator",
default=" ",
help="String to use when joining sentences for data augmentation (default: '%(default)s').",
)
parser.add_argument(
"--augment",
action="store_true",
help="Whether to add capitalization and target-copy augmentations",
)
parser.add_argument(
"--max-joined-tokens",
"--max-tokens",