JESC Code Release

Welcome to the JESC code release! This repo contains the crawlers, parsers, aligners, and various tools used to create the Japanese-English Subtitle Corpus (JESC).

Requirements

Install the Python dependencies with pip: pip install -r requirements.txt

Additionally, some of the corpus_processing scripts use google/sentencepiece; see its GitHub page for installation instructions.

Instructions

Each file is a standalone tool with usage instructions given in the comment header. These files are organized into the following categories (subdirectories):

corpus_generation: Scripts for downloading, parsing, and aligning subtitles from the internet.
corpus_cleaning: Scripts for converting file formats, thresholding on length ratios, and spellchecking.
corpus_processing: Scripts for manipulating completed datasets, including tokenization and train/test/dev splitting.
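As a rough illustration of two of the steps named above (thresholding on length ratios and train/test/dev splitting), here is a minimal Python sketch. It is not the code from this repo: the function names, the character-based length measure, the `max_ratio` default, and the (English, Japanese) tuple layout are all assumptions made for the example. See the comment header of each script for the actual behavior and options.

```python
import random


def length_ratio_ok(src, tgt, max_ratio=3.0):
    """Return True if the two sides have comparable character lengths.

    Hypothetical criterion: the longer side may be at most `max_ratio`
    times the length of the shorter side. Empty sides are rejected.
    """
    a, b = len(src.strip()), len(tgt.strip())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio


def filter_pairs(pairs, max_ratio=3.0):
    """Keep only the sentence pairs that pass the length-ratio check."""
    return [(s, t) for s, t in pairs if length_ratio_ok(s, t, max_ratio)]


def split_pairs(pairs, dev_size, test_size, seed=0):
    """Shuffle deterministically, then cut into train/dev/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    dev = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev, test
```

A typical use would be to filter first and split afterwards, for example `train, dev, test = split_pairs(filter_pairs(pairs), dev_size=2000, test_size=2000)`, where the sizes are placeholders. Fixing the seed keeps the split reproducible across runs.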

Citation

Please cite or otherwise credit the following paper if you use these data:

```
@ARTICLE{pryzant_jesc_2017,
  author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
  title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {1710.10639},
  keywords = {Computer Science - Computation and Language},
  year = 2017,
  month = oct,
}
```
