Pipeline for creating BERT vocabularies and pretraining examples based on Wikipedia texts.
A number of language-specific BERT vocabularies created from Wikipedia texts using this pipeline are included in the vocabs subdirectory.
To recreate one of these vocabularies, create pretraining examples, or create a vocabulary and examples for a language not included in the collection, follow the instructions below.
Setup

Fetch the pipeline's git submodules and install the Python dependencies:

```
git submodule init
git submodule update
pip3 install -r requirements.txt
```
To run the full process for a language with the two-character language code LC, run

```
./RUN.sh LC
```

(For example, ./RUN.sh fi for Finnish.)
To add support for a new language with the two-character language code LC, create languages/LC.json and fill in the relevant values for word-chars (sequence of word characters), wiki-dump (URL), and udpipe-model (URL). See the existing JSON files in languages/ for examples, and the sketch below.
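As an illustration, a hypothetical languages/fi.json might look roughly like the following. The three keys are the ones named above; the example values (the Finnish character inventory, the Wikipedia dump URL, and the UDPipe model URL) are assumptions and should be checked against the existing files in languages/.

```json
{
    "word-chars": "abcdefghijklmnopqrstuvwxyzåäöABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ",
    "wiki-dump": "https://dumps.wikimedia.org/fiwiki/latest/fiwiki-latest-pages-articles.xml.bz2",
    "udpipe-model": "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2998/finnish-tdt-ud-2.4-190531.udpipe"
}
```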
To modify the parameters sent to the key components of the pipeline, edit the appropriate file in the config/ directory.
The source texts are extracted from Wikipedia, and the quality of the generated data depends on the quantity (and quality) of Wikipedia text available for the language.
The document filtering heuristics were developed for Indo-European languages written with (variants of) the Latin alphabet. They are likely to fail disastrously for other languages. The rest of the pipeline should work, though, if the filtering is disabled.
The pipeline performs sentence segmentation with UDPipe and is thus limited to the languages for which a UDPipe model is available. As of this writing, models are available for 60 languages from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998.
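For readers adapting the pipeline to a new language, here is a minimal sketch of UDPipe-based sentence segmentation in Python using the ufal.udpipe bindings (installed with pip3 install ufal.udpipe). The model filename and the input text are hypothetical, and the sketch illustrates the general approach rather than this pipeline's exact invocation.

```python
# Minimal sentence-segmentation sketch using the ufal.udpipe bindings.
# The model filename is hypothetical; download a model for your language
# from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("finnish-tdt-ud-2.4-190531.udpipe")
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# Tokenize and sentence-split only; skip tagging and parsing ("none").
# The "horizontal" output format gives one tokenized sentence per line.
pipeline = Pipeline(model, "tokenize", "none", "none", "horizontal")
error = ProcessingError()

text = "Tämä on ensimmäinen virke. Tämä on toinen."
segmented = pipeline.process(text, error)
if error.occurred():
    raise RuntimeError(error.message)

for sentence in segmented.rstrip("\n").split("\n"):
    print(sentence)
```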