stage2 QC: checking the match of the number of lines in input and output files
nvanva committed May 27, 2024
1 parent 7d79bae commit 6cb9d28
Showing 3 changed files with 15 additions and 0 deletions.
7 changes: 7 additions & 0 deletions src/warc2text_runner/two/qualitycontrol/check_linecnt.sh
@@ -0,0 +1,7 @@
#!/bin/bash

HTMLDIR=$1

module purge; module load LUMI systools parallel
find "$HTMLDIR" -name text.zst | parallel --eta -j250 "echo {//}; zstdcat {//}/metadata.zst|wc; zstdcat {//}/text.zst|wc; zstdcat {//}/lang.zst|wc" > text_lang_linecnts.log
cat text_lang_linecnts.log |tr '\n' ' ' | sed 's!/user!\n/user!g' >text_lang_linecnts.tsv
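
The `tr`/`sed` step above folds each directory line and the three `wc` lines that follow it into a single row. A minimal Python sketch of the same reshaping is given here for illustration; the function name `reshape_linecnt_log` and the `prefix` parameter are hypothetical and not part of the commit. Unlike the `sed` expression, which breaks at every occurrence of `/user`, this sketch starts a new row only when a line begins with the prefix.

```python
def reshape_linecnt_log(log_text, prefix="/user"):
    """Group each directory line with the wc output lines that follow it.

    Assumes directory paths start with `prefix`, as the sed expression
    in check_linecnt.sh does. Returns one list of fields per directory.
    """
    rows = []
    current = None
    for line in log_text.splitlines():
        if line.startswith(prefix):
            # A new directory starts: flush the previous row, if any.
            if current is not None:
                rows.append(current)
            current = [line.strip()]
        elif current is not None:
            # A wc line ("lines words bytes"): append its three fields.
            current.extend(line.split())
    if current is not None:
        rows.append(current)
    return rows
```

Each returned row holds the directory followed by nine counts, in the order metadata, text, lang.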
@@ -0,0 +1,5 @@
import pandas as pd

df=pd.read_csv('text_lang_linecnts.tsv', sep=r'\s+', header=None, names=['dir']+[f'{f}-{c}' for f in ('metadata','text','lang') for c in ('lines','words','bytes')])
print('\n'.join(df[df['metadata-lines'] != df['text-lines']]['dir']))
print('\n'.join(df[df['metadata-lines'] != df['lang-lines']]['dir']))
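
For reference, the check performed by the script above can be sketched without pandas on an in-memory sample. This is an illustrative stand-in, not the committed code: the function name `find_mismatches` and the sample paths in the usage note are hypothetical, and the column layout is taken from the `names=` list in the script.

```python
# Column names in the same order as the names= list in the pandas script.
FIELDS = ["dir"] + [
    f"{f}-{c}"
    for f in ("metadata", "text", "lang")
    for c in ("lines", "words", "bytes")
]


def find_mismatches(tsv_text):
    """Return (text_bad, lang_bad): directories whose text.zst or lang.zst
    line count differs from the metadata.zst line count."""
    text_bad, lang_bad = [], []
    for line in tsv_text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed rows
        row = dict(zip(FIELDS, parts))
        meta_lines = int(row["metadata-lines"])
        if meta_lines != int(row["text-lines"]):
            text_bad.append(row["dir"])
        if meta_lines != int(row["lang-lines"]):
            lang_bad.append(row["dir"])
    return text_bad, lang_bad
```

For a hypothetical row `/user/a 3 10 100 3 10 100 2 5 50`, the directory `/user/a` is reported only in the second list, since its lang.zst has 2 lines against 3 in metadata.zst.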
3 changes: 3 additions & 0 deletions src/warc2text_runner/two/qualitycontrol/traferr_stat.sh
@@ -0,0 +1,3 @@
#!/bin/bash
# Feed the decompressed content of text.zst to stdin, e.g. zstdcat text.zst | ./traferr_stat.sh
jq -c .traferr | sort | uniq -c
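
The `jq | sort | uniq -c` pipeline above tallies the distinct values of the `traferr` field. A rough Python equivalent is sketched here, assuming (as the `jq` filter implies) that each input line is one JSON object; the function name `traferr_counts` is hypothetical. Values are kept in compact JSON form, so a missing or null field is counted under `null`, as `jq -c` would print it.

```python
import json
from collections import Counter


def traferr_counts(lines):
    """Count occurrences of each distinct `traferr` value in JSON lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        record = json.loads(line)
        # Serialize compactly so strings keep their quotes and None becomes null.
        key = json.dumps(record.get("traferr"), separators=(",", ":"), ensure_ascii=False)
        counts[key] += 1
    return counts
```

To use it on real data, the decompressed lines of text.zst could be passed in, for example by iterating over `sys.stdin`.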