A pipeline for cleaning datasets used to train machine translation (MT) models and large language models (LLMs).
Installing git-lfs is required to clone the repository:
sudo apt-get install git-lfs
docker build -t proxectonos/nos:pipeline .
docker run --mount src=path/to/folder,target=/aliasfolderfordocker/,type=bind proxectonos/nos:pipeline command(tokenizer, detokenizer, etc)
pip install -r requirements.txt
chmod +x entrypoint.sh
./entrypoint.sh command (see below)
By default it expects a .jsonl file. You can transform your .txt file into .jsonl format by using the following command:
./entrypoint.sh formatter -p $path_to_file --delimiter $regex_to_divide_txt -o $output_file_path
Executing the command ./entrypoint.sh standard_pipeline $path_input_file runs the following steps:
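To make the conversion concrete, the sketch below shows roughly what splitting a .txt file on a regex delimiter and writing one JSON object per line looks like in Python. It is not the formatter's actual implementation: the helper name txt_to_jsonl_lines and the "text" field name are assumptions made for illustration, and the real output schema may differ.

```python
import json
import re


def txt_to_jsonl_lines(raw_text, delimiter_regex):
    """Split raw text on a regex delimiter and return one JSON string per document.

    Hypothetical helper: the "text" key is an assumed field name,
    not necessarily the schema the formatter command produces.
    """
    documents = re.split(delimiter_regex, raw_text)
    lines = []
    for doc in documents:
        doc = doc.strip()
        if doc:  # skip empty chunks left by leading or trailing delimiters
            lines.append(json.dumps({"text": doc}, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    sample = "First document.#|||#Second document.#|||#"
    # In Python's re module "|" means alternation, so it is escaped here
    # to match the literal delimiter #|||#.
    for line in txt_to_jsonl_lines(sample, r"#\|\|\|#"):
        print(line)
```

In this sketch the pipe characters are escaped because Python would otherwise read them as regex alternation; whether the formatter command needs the same escaping is not stated here.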
- encoding
- deduplication
- pyplexity (perplexity filter)
- quelingua (filter by lang)
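As an illustration of the deduplication step, the sketch below removes near-duplicate documents using Jaccard similarity, the measure the jaccard command is described as using. It is a minimal example under stated assumptions, not the pipeline's code: whitespace tokenization, lowercasing, the 0.8 threshold, and the function names are all choices made here for illustration.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of whitespace-separated tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def deduplicate(documents, threshold=0.8):
    """Keep a document only if it is not too similar to any document already kept.

    The 0.8 threshold is an illustrative assumption. The pairwise comparison
    is quadratic in the number of documents, so this only suits small inputs.
    """
    kept = []
    for doc in documents:
        if all(jaccard_similarity(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "The cat sat on the mat",
        "a completely different sentence",
    ]
    for doc in deduplicate(docs):
        print(doc)
```

Large corpora usually need an approximate method (for example MinHash) instead of comparing every pair; this sketch only shows the similarity measure itself.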
sh entrypoint.sh --help
sh entrypoint.sh formatter --path --output --technique --delimiter
Transforms a .txt input file into a .jsonl file. The --delimiter can be any regex pattern, preceded by $, e.g. $'#|||#', where #|||# is the pattern used to divide the text.
sh entrypoint.sh tokenizer --path --output
Tokenizes a Latin-script text. This tokenizer was developed mainly for Galician.
sh entrypoint.sh detokenizer --path --output
Detokenizes a text previously processed with tokenizer.
sh entrypoint.sh filter_lang --path --output --filter_results_by_lang
Identifies, line by line, the language a document is written in. If filter_results_by_lang is provided, the output file will only contain text in the specified language. filter_results_by_lang takes two-letter language tags, e.g. gl for Galician, es for Spanish, etc.
sh entrypoint.sh recoglang --path
Reads an input text file and returns the language it is written in.
sh entrypoint.sh encoder --path --output
Fixes encoding issues in files.
sh entrypoint.sh jaccard --path --output
Deduplicates files based on their Jaccard similarity.
sh entrypoint.sh pyplexity
Calculates the perplexity of the input. This script implements PyPlexity.
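For readers unfamiliar with the measure, perplexity is the exponential of the average negative log-probability a language model assigns to the tokens of a text; lower values mean the text looks more like the data the model was trained on. The sketch below computes it with a toy add-one-smoothed unigram model. This model is a stand-in chosen for brevity and is an assumption of this example: it does not reproduce PyPlexity's models or API, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a probability function for a toy add-one-smoothed unigram model."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def probability(token):
        return (counts[token] + 1) / (total + vocab_size)

    return probability


def perplexity(tokens, probability):
    """Perplexity = exp(-(1/N) * sum(log p(token))) over the N tokens."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty sequence")
    log_prob_sum = sum(math.log(probability(tok)) for tok in tokens)
    return math.exp(-log_prob_sum / len(tokens))


if __name__ == "__main__":
    model = train_unigram("a a b".split())
    print(perplexity(["a", "a"], model))  # text made of frequent tokens
    print(perplexity(["z", "z"], model))  # text made of unseen tokens scores higher
```

A perplexity filter then keeps or discards a document depending on whether its score falls below a chosen threshold.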
Please, cite this paper if you use the modules of this NLP toolkit to clean a corpus:
- Iria de-Dios-Flores, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia, and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. In Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1, pages 593–599, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics.