diff --git a/README.md b/README.md
index a7b9a7d4..66905302 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Code and data are located in `/work`
 - Sentence length distribution: tokens per sentence for each language, showing total, unique and duplicate sentences.
 - Language distribution: shows percentage of automatically identified languages.
 - Quality Score distribution: as per language models (monolingual) or bicleaner scores (tool that computes the likelihood of two sentences of being mutual translations)
-- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, sentences containing poor language, etc.)
+- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, bad encoding, sentences containing poor language, etc.)
 - Common n-grams: 1-5 more frequent n-grams
 - MORE TO BE ADDED, SUGGESTIONS WELCOME!