Skip to content

ENCCS/word-count-hpda

Repository files navigation

Binder

Word count example

These programs will count words in a given text, plot a bar chart of the 10 most common words, and print out the 10 most common words which can be used to test Zipf's law.

Example usage

All data files, 64 public domain books from Project Gutenberg reside under data. To compute the frequency distribution of words for one of the books:

# count words in two books
python source/wordcount.py data/pg10.txt > processed_data/pg10.dat
python source/wordcount.py data/pg65.txt > processed_data/pg65.dat

# (optionally) create plots
python source/plotcount.py processed_data/pg10.dat results/pg10.png
python source/plotcount.py processed_data/pg65.dat results/pg65.png

# print frequency of 10 most frequent words in both books to file
python source/zipf_test.py 10 pg10.dat pg65.dat > results.txt

This workflow is encoded in the Snakefile which can be used to run through all data files in serial or parallel:

# run workflow 
snakemake -j 1

# clear all output
snakemake -j 1 --delete-all-output

# run in parallel on 4 processes
snakemake -j 4

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published