We want to get a rough idea about the actual content of the cleaned verion of the 2nd data release. More specifically, for a subset that should correspond to some language L we want to estimate:
- the proportion of texts that are in fact not in the language L,
- the proportion of texts that can be considered undesirable because they are unnatural,
- the proportion of texts that can be considered undesirable porn texts.
Additionally, we plan to compare these characteristics for the older and the newer crawls, and also for the IA and CC crawls.
PROMPSIT ALTERNATIVE INSPECTION EFFORT (20 docs per lang in HPLT v2): results
- samples stratified by language and crawl group,
- 4 groups (cc/ia old/new),
- first 5 batches per language,
- 200 examples per batch,
- 500/500 characters from the beginning of the fist/second half of each text.
Please select one or more batches for a language you want to inspect. "Reserve" the batch(es) by filling in your name in the spreadsheet. Fill in the labels and push the updated files back to this repository.
We ask to provide 3 binary labels for each example:
- porn? empty/1: if the text looks like porn put 1, otherwise leave empty
- unnatural? empty/1: if the most text looks unnatural (e.g. word lists for SEO, mostly boilerplate) put 1, otherwise leave empty
- lang correct? 0/1: always fill this field (otherwise we will not distinguish labeled and unlabeled examples), put 0 if most of the text is not in the target language, otherwise put 1.
Inspecting 20 examples from Russian, batch0 took 5 minutes for me, thus, the estimated time for inspecting 1 batch is 1 hour. One way to inspec is using LibreOffice Calc. For convenience freeze first 4 columns (select them and click View -> Freeze Rows and Columns). Also make the text area larger. This can look like this: