release2_inspection

Purpose of inspection

We want to get a rough idea of the actual content of the cleaned version of the 2nd data release. More specifically, for a subset that should correspond to some language L, we want to estimate:

the proportion of texts that are in fact not in the language L,
the proportion of texts that can be considered undesirable because they are unnatural,
the proportion of texts that can be considered undesirable porn texts.

Additionally, we plan to compare these characteristics for the older and the newer crawls, and also for the IA and CC crawls.

PROMPSIT ALTERNATIVE INSPECTION EFFORT (20 docs per lang in HPLT v2): results

Data for round 1 of inspection:

samples stratified by language and crawl group,
4 groups (cc/ia old/new),
first 5 batches per language,
200 examples per batch,
500/500 characters from the beginning of the first/second half of each text.
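
As an illustration of the last item, here is a minimal sketch of how such an excerpt could be cut from a text. It is not the script that produced the samples: the function name `make_excerpt` is hypothetical, and splitting at the character midpoint is an assumption.

```python
def make_excerpt(text: str, n_chars: int = 500) -> tuple[str, str]:
    """Return up to n_chars from the start of each half of the text.

    Assumption: the halves are split at the character midpoint;
    the actual sampling code may define the split differently.
    """
    mid = len(text) // 2
    first_part = text[:mid][:n_chars]    # beginning of the first half
    second_part = text[mid:][:n_chars]   # beginning of the second half
    return first_part, second_part
```

For texts shorter than 1000 characters the two parts together simply cover the whole text.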

Inspection

Please select one or more batches for a language you want to inspect. "Reserve" the batch(es) by filling in your name in the spreadsheet. Fill in the labels and push the updated files back to this repository.

Please provide 3 binary labels for each example:

porn? empty/1: if the text looks like porn put 1, otherwise leave empty
unnatural? empty/1: if most of the text looks unnatural (e.g. word lists for SEO, mostly boilerplate) put 1, otherwise leave empty
lang correct? 0/1: always fill this field (otherwise we will not distinguish labeled and unlabeled examples), put 0 if most of the text is not in the target language, otherwise put 1.
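
To show how these labels translate into the proportions we want to estimate, here is a rough sketch. The dictionary keys `porn`, `unnatural` and `lang_correct` are hypothetical names for the three label columns, and the function is only an illustration, not the analysis code used in this repository.

```python
def label_proportions(rows):
    """Estimate label proportions over the labeled examples only.

    An example counts as labeled when its lang_correct field is 0 or 1
    (assumed key names: "porn", "unnatural", "lang_correct").
    """
    def value(row, key):
        return str(row.get(key, "")).strip()

    labeled = [r for r in rows if value(r, "lang_correct") in {"0", "1"}]
    n = len(labeled)
    if n == 0:
        return None  # nothing labeled yet
    return {
        "n_labeled": n,
        "wrong_lang": sum(value(r, "lang_correct") == "0" for r in labeled) / n,
        "unnatural": sum(value(r, "unnatural") == "1" for r in labeled) / n,
        "porn": sum(value(r, "porn") == "1" for r in labeled) / n,
    }
```

This is also why the lang correct? field must always be filled in: rows where it is empty are treated as unlabeled and excluded from the denominator.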

Advice on inspection

Inspecting 20 examples from Russian, batch0, took me 5 minutes, so the estimated time for inspecting 1 batch of 200 examples is about 1 hour. One way to inspect is to use LibreOffice Calc. For convenience, freeze the first 4 columns (select them and click View -> Freeze Rows and Columns). Also make the text area larger.
