Commit

Update README.md
gramirez-prompsit authored Aug 29, 2023
1 parent d4b17e7 commit 2f7d2d8
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ Code and data are located in `/work`
- Sentence length distribution: tokens per sentence for each language, showing total, unique and duplicate sentences.
- Language distribution: shows percentage of automatically identified languages.
- Quality Score distribution: scores from language models (monolingual) or from bicleaner (a tool that estimates the likelihood that two sentences are mutual translations)
- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, sentences containing poor language, etc.)
- Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, bad encoding, sentences containing poor language, etc.)
- Common n-grams: 1-5 more frequent n-grams

- MORE TO BE ADDED, SUGGESTIONS WELCOME!
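
The noise distribution item above could be computed along the following lines. This is only a minimal sketch, not the repository's actual implementation: the function names (`noise_labels`, `noise_distribution`), the token thresholds, the URL pattern and the use of the Unicode replacement character as a sign of bad encoding are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical rule: a sentence that consists only of a URL.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def noise_labels(sentence, min_tokens=3, max_tokens=100):
    """Return the hard rules (assumed names) that a sentence triggers."""
    text = sentence.strip()
    n_tokens = len(text.split())  # naive whitespace tokenization
    labels = []
    if n_tokens < min_tokens:
        labels.append("too_short")
    if n_tokens > max_tokens:
        labels.append("too_long")
    if URL_RE.match(text):
        labels.append("url")
    # U+FFFD usually appears when bytes were decoded with the wrong codec.
    if "\ufffd" in text:
        labels.append("bad_encoding")
    return labels


def noise_distribution(sentences):
    """Percentage of sentences affected by each rule."""
    counts = Counter()
    total = 0
    for sentence in sentences:
        total += 1
        counts.update(noise_labels(sentence))
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

A sentence can trigger several rules at once, so the percentages do not need to sum to 100.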