Pipeline for creating BERT vocabularies and pretraining examples based on Wikipedia texts.
A number of language-specific BERT vocabularies created from Wikipedia texts using this pipeline are included in the vocabs subdirectory.
To recreate one of these vocabularies, create pretraining examples, or create a vocabulary and examples for a language not included in the collection, follow the instructions below.
Setup

Fetch the pipeline's git submodules and install the Python dependencies:

```
git submodule init
git submodule update
pip3 install -r requirements.txt
```
To run the full process for a language with the two-character language code LC, run

```
./RUN.sh LC
```

(For example, ./RUN.sh fi for Finnish.)
To add support for a new language with the two-character language code LC, create languages/LC.json and fill in the relevant values for word-chars (sequence of word characters), wiki-dump (URL), and udpipe-model (URL). See the existing JSON files in languages/ for examples, and the sketch below.
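As an illustration, a hypothetical languages/fi.json might look roughly like the following. The three keys are the ones named above; the example values (the Finnish character inventory, the Wikipedia dump URL, and the UDPipe model URL) are assumptions and should be checked against the existing files in languages/.

```json
{
    "word-chars": "abcdefghijklmnopqrstuvwxyzåäöABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ",
    "wiki-dump": "https://dumps.wikimedia.org/fiwiki/latest/fiwiki-latest-pages-articles.xml.bz2",
    "udpipe-model": "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2998/finnish-tdt-ud-2.4-190531.udpipe"
}
```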
To modify the parameters sent to the key components of the pipeline, edit the appropriate file in the config/ directory.
The source texts are extracted from Wikipedia, and the quality of the generated data depends on the quantity (and quality) of Wikipedia text available for the language.
The document filtering heuristics were developed for Indo-European languages written with (variants of) the Latin alphabet. They are likely to fail disastrously for other languages. The rest of the pipeline should work, though, if the filtering is disabled.
The pipeline performs sentence segmentation with UDPipe and is thus limited to the languages for which a UDPipe model is available. As of this writing, models are available for 60 languages from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998.
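For readers adapting the pipeline to a new language, here is a minimal sketch of UDPipe-based sentence segmentation in Python using the ufal.udpipe bindings (installed with pip3 install ufal.udpipe). The model filename and the input text are hypothetical, and the sketch illustrates the general approach rather than this pipeline's exact invocation.

```python
# Minimal sentence-segmentation sketch using the ufal.udpipe bindings.
# The model filename is hypothetical; download a model for your language
# from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("finnish-tdt-ud-2.4-190531.udpipe")
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# Tokenize and sentence-split only; skip tagging and parsing ("none").
# The "horizontal" output format gives one tokenized sentence per line.
pipeline = Pipeline(model, "tokenize", "none", "none", "horizontal")
error = ProcessingError()

text = "Tämä on ensimmäinen virke. Tämä on toinen."
segmented = pipeline.process(text, error)
if error.occurred():
    raise RuntimeError(error.message)

for sentence in segmented.rstrip("\n").split("\n"):
    print(sentence)
```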