aklement/babel
Setting up the code
-------------------

The project depends on Nutch v1.0 (http://lucene.apache.org/nutch/) and Hadoop Core v0.19 (http://hadoop.apache.org/). Download the corresponding jars and include them in your classpath.

Running the code
----------------

1. Data preprocessing

   The first step extracts and processes data from a Nutch database and handles incremental updates. The preprocessing stage is split into the following steps:

   a. Extract pages from a Nutch database (babel.prep.extract.NutchPageExtractor). The versions of each page fetched by multiple Nutch crawls, together with their parse metadata, content metadata, and parsed content, are aggregated and collected into a page dataset.

   b. Merge two existing page datasets (babel.prep.merge.PageMerger).

   c. Collect page language information (babel.prep.langid.LangIdentifier). The content language is identified for those pages in a dataset whose language metadata is missing.

   d. Generate per-language datasets (babel.prep.corpus.CorpusGenerator). A dataset is split by language and (optionally) saved as a set of XML documents.

2. More coming...
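The data flow of preprocessing steps a through d can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions: `PrepSketch`, `PageVersion`, and every method below are hypothetical names, not the project's classes, and the sketch uses neither Nutch nor Hadoop. The language detector is passed in by the caller because this README does not describe how babel.prep.langid.LangIdentifier identifies languages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical in-memory model of the four preprocessing steps; not the project's code. */
final class PrepSketch {

    /** One fetched version of a page (assumed shape). */
    static final class PageVersion {
        final String url;
        final String content;
        final String language; // null when language metadata is missing

        PageVersion(String url, String content, String language) {
            this.url = url;
            this.content = content;
            this.language = language;
        }
    }

    /** Step a: collect all versions of each page under its URL. */
    static Map<String, List<PageVersion>> aggregate(List<PageVersion> versions) {
        Map<String, List<PageVersion>> dataset = new LinkedHashMap<>();
        for (PageVersion v : versions) {
            dataset.computeIfAbsent(v.url, k -> new ArrayList<>()).add(v);
        }
        return dataset;
    }

    /** Step b: merge two page datasets, concatenating versions of shared URLs. */
    static Map<String, List<PageVersion>> merge(Map<String, List<PageVersion>> first,
                                                Map<String, List<PageVersion>> second) {
        Map<String, List<PageVersion>> merged = new LinkedHashMap<>();
        addAll(merged, first);
        addAll(merged, second);
        return merged;
    }

    private static void addAll(Map<String, List<PageVersion>> target,
                               Map<String, List<PageVersion>> source) {
        source.forEach((url, versions) ->
                target.computeIfAbsent(url, k -> new ArrayList<>()).addAll(versions));
    }

    /** Step c: fill in the language only where it is missing, using a caller-supplied detector. */
    static Map<String, List<PageVersion>> identifyLanguages(Map<String, List<PageVersion>> dataset,
                                                            Function<String, String> detector) {
        Map<String, List<PageVersion>> result = new LinkedHashMap<>();
        dataset.forEach((url, versions) -> {
            List<PageVersion> filled = new ArrayList<>();
            for (PageVersion v : versions) {
                filled.add(v.language != null
                        ? v
                        : new PageVersion(v.url, v.content, detector.apply(v.content)));
            }
            result.put(url, filled);
        });
        return result;
    }

    /** Step d: split a dataset by language; expects every version to have a non-null language. */
    static Map<String, List<PageVersion>> splitByLanguage(Map<String, List<PageVersion>> dataset) {
        Map<String, List<PageVersion>> corpora = new TreeMap<>();
        for (List<PageVersion> versions : dataset.values()) {
            for (PageVersion v : versions) {
                corpora.computeIfAbsent(v.language, k -> new ArrayList<>()).add(v);
            }
        }
        return corpora;
    }

    public static void main(String[] args) {
        Map<String, List<PageVersion>> crawlA = aggregate(Arrays.asList(
                new PageVersion("http://example.org/a", "hello world", "en"),
                new PageVersion("http://example.org/a", "hello again", null)));
        Map<String, List<PageVersion>> crawlB = aggregate(Arrays.asList(
                new PageVersion("http://example.org/b", "hola mundo", null)));

        // Toy stand-in for a real language identifier, for demonstration only.
        Function<String, String> toyDetector = text -> text.startsWith("hola") ? "es" : "en";

        Map<String, List<PageVersion>> corpora =
                splitByLanguage(identifyLanguages(merge(crawlA, crawlB), toyDetector));
        corpora.forEach((lang, pages) ->
                System.out.println(lang + ": " + pages.size() + " page version(s)"));
    }
}
```

Running language identification before the split matters in this sketch: `splitByLanguage` keys a `TreeMap` on the language, so a version whose language is still null would fail there.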
About
Translation without parallel corpora.