- Tatoeba (>3600 sentence pairs). Hugging Face / Source
- Finugorbib (>30k sentence pairs). Hugging Face
- Soviet geography book (>2700 sentence pairs). Hugging Face
- FLORES-250, translation benchmark. Hugging Face / Other languages
- Udmurt news (udmddn.ru and oshmes.info, 36k sentences in total). Hugging Face
- Wikipedia dump (>43k sentences). Download
- MADLAD-400 (651k sentences, 9.5 million words). Hugging Face / All languages
- Glot500-c (121k sentences). GitHub
- Zerpal (1.4M sentences). Hugging Face
- FineWeb2 (13.5 million words). Hugging Face
- Finno-Ugric SIB (SIB-SMUGRI). Hugging Face
- Zerpal-udmdunne (8,154 rows, 5 labels). Hugging Face
- Zerpal-udmurtmedia (15,274 rows, 10 labels). Hugging Face
- Zerpal-pos-tagging (12,392 rows, 17 classes). Hugging Face
- WikiANN (the transcription is problematic: Latin and Cyrillic are used inconsistently, and Wikipedia markup is parsed incorrectly; if you still want to use it, see the `wikiann` directory)
- MURI-IT (2,751 rows). Hugging Face