sacct logs for all stage2
nvanva committed Jun 2, 2024
1 parent 6cb9d28 commit 67f708d
Showing 7 changed files with 27,470 additions and 4 deletions.
3 changes: 2 additions & 1 deletion src/warc2text_runner/two/qualitycontrol/check_linecnt.sh
@@ -3,5 +3,6 @@
 HTMLDIR=$1

 module purge; module load LUMI systools parallel
-find $HTMLDIR -name text.zst | parallel --eta -j250 "echo {//}; zstdcat {//}/metadata.zst|wc; zstdcat {//}/text.zst|wc; zstdcat {//}/lang.zst|wc" > text_lang_linecnts.log
+find $HTMLDIR -name metadata.zst | parallel --eta -j250 "echo {//}; zstdcat {}|wc; test -f {//}/text.zst && zstdcat {//}/text.zst|wc || echo 0 0 0; test -f {//}/lang.zst && zstdcat {//}/lang.zst|wc || echo 0 0 0" >text_lang_linecnts.log
 cat text_lang_linecnts.log |tr '\n' ' ' | sed 's!/user!\n/user!g' >text_lang_linecnts.tsv
+python -m warc2text_runner.two.qualitycontrol.check_text_lang_linecnt text_lang_linecnts.tsv
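Note: the `tr`/`sed` step above turns the four-line-per-directory log (directory name followed by three `wc` outputs) into one whitespace-separated row per directory. A minimal Python sketch of the same reshaping, assuming every directory path starts with `/user` as the `sed` pattern implies; the function name `reshape_log` and the sample paths and counts are made up for illustration and are not part of the commit:

```python
def reshape_log(log_text):
    """Hypothetical helper mirroring the shell reshaping step."""
    # Equivalent of: tr '\n' ' '  (join everything onto one line)
    flat = log_text.replace('\n', ' ')
    # Equivalent of: sed 's!/user!\n/user!g'  (start a new row at each directory)
    return flat.replace('/user', '\n/user')


# Made-up sample log: /user/b has no text.zst, so the script emitted "0 0 0".
sample_log = (
    "/user/a\n"
    "     10      20     300\n"
    "     10      50     900\n"
    "     10      10      40\n"
    "/user/b\n"
    "     10      20     300\n"
    "0 0 0\n"
    "     10      10      40\n"
)

# Split each non-empty row into fields: dir plus 3 x (lines, words, bytes).
rows = [line.split() for line in reshape_log(sample_log).splitlines() if line.strip()]
for row in rows:
    print(row)
```

Each row then has ten fields, which is the column layout the Python checker expects.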
@@ -1,5 +1,10 @@
 import pandas as pd
+import fire

-df=pd.read_csv('text_lang_linecnts.tsv', sep='\s+', header=None, names=['dir']+[f'{f}-{c}' for f in ('metadata','text','lang') for c in ('lines','words','bytes')])
-print('\n'.join(df[df['metadata-lines'] != df['text-lines']]['dir']))
-print('\n'.join(df[df['metadata-lines'] != df['lang-lines']]['dir']))
+def check_tsv(tsv):
+    df=pd.read_csv(tsv, sep='\s+', header=None, names=['dir']+[f'{f}-{c}' for f in ('metadata','text','lang') for c in ('lines','words','bytes')])
+    print('metadata-text lines mismatch:\n', '\n'.join(df[df['metadata-lines'] != df['text-lines']]['dir']))
+    print('metadata-lang lines mismatch:\n','\n'.join(df[df['metadata-lines'] != df['lang-lines']]['dir']))
+
+
+fire.Fire(check_tsv)
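Note: a minimal sketch of how the mismatch check behaves, run on a small in-memory sample instead of a real `text_lang_linecnts.tsv`. The helper name `find_mismatches`, the directory names, and the counts are assumptions made up for illustration; only the column layout and the two comparisons are taken from `check_tsv` above.

```python
import io

import pandas as pd

# Same column layout as check_tsv: dir, then lines/words/bytes per file.
COLUMNS = ['dir'] + [f'{f}-{c}'
                     for f in ('metadata', 'text', 'lang')
                     for c in ('lines', 'words', 'bytes')]


def find_mismatches(tsv):
    """Hypothetical helper: return the dirs that check_tsv would print."""
    df = pd.read_csv(tsv, sep=r'\s+', header=None, names=COLUMNS)
    text_bad = df[df['metadata-lines'] != df['text-lines']]['dir'].tolist()
    lang_bad = df[df['metadata-lines'] != df['lang-lines']]['dir'].tolist()
    return text_bad, lang_bad


# Made-up rows: /user/b has a missing text.zst (0 0 0),
# /user/c has one line fewer in lang.zst than in metadata.zst.
sample = io.StringIO(
    "/user/a 10 20 300 10 50 900 10 10 40\n"
    "/user/b 10 20 300 0 0 0 10 10 40\n"
    "/user/c 10 20 300 10 50 900 9 9 36\n"
)

text_bad, lang_bad = find_mismatches(sample)
print('metadata-text lines mismatch:', text_bad)
print('metadata-lang lines mismatch:', lang_bad)
```

Because the updated shell script writes `0 0 0` for a missing `text.zst` or `lang.zst`, such directories show up here as mismatches rather than being skipped.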