These programs count the words in a given text, plot a bar chart of the 10 most common words, and print the 10 most common words, which can be used to test Zipf's law.
- Inspired by and derived from https://hpc-carpentry.github.io/hpc-python/ which is distributed under Creative Commons Attribution license (CC-BY 4.0).
- Documentation: https://word-count.readthedocs.io
All data files (64 public domain books from Project Gutenberg) reside under the data directory. To compute the frequency distribution of words for individual books:
# count words in two books
python source/wordcount.py data/pg10.txt > processed_data/pg10.dat
python source/wordcount.py data/pg65.txt > processed_data/pg65.dat
# (optionally) create plots
python source/plotcount.py processed_data/pg10.dat results/pg10.png
python source/plotcount.py processed_data/pg65.dat results/pg65.png
# print frequency of 10 most frequent words in both books to file
python source/zipf_test.py 10 processed_data/pg10.dat processed_data/pg65.dat > results.txt
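As a rough illustration of what the counting and comparison steps compute, here is a minimal Python sketch. It is not the code in source/wordcount.py or source/zipf_test.py: the function names count_words and zipf_ratios, the tokenizing regular expression, and the sample sentence are all assumptions made for this example. Zipf's law predicts that a word's frequency is roughly inversely proportional to its rank, so the ratio of the top frequency to the frequency at rank r should be close to r.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase alphabetic words in a string (hypothetical helper)."""
    # Assumption: a "word" is any run of the letters a-z after lowercasing.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def zipf_ratios(ranked):
    """Divide the top frequency by the frequency at each rank.

    Under Zipf's law the value at rank r should be roughly r.
    """
    top_count = ranked[0][1]
    return [top_count / count for _, count in ranked]


# Assumption: a toy sentence stands in for a book from the data directory.
sample = "the cat and the dog and the bird saw the fox"
counts = count_words(sample)
ranked = counts.most_common(2)   # the two most frequent words with counts
ratios = zipf_ratios(ranked)

for rank, ((word, count), ratio) in enumerate(zip(ranked, ratios), start=1):
    print(rank, word, count, ratio)
```

The sample sentence is far too short to say anything about Zipf's law; with a full book, the ratios only approximate the ranks.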
This workflow is encoded in the Snakefile, which can be used to run through all data files in serial or in parallel:
# run workflow
snakemake -j 1
# clear all output
snakemake -j 1 --delete-all-output
# run in parallel on 4 processes
snakemake -j 4
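The reason the workflow can run in parallel is that each book is counted independently of the others. The Python sketch below illustrates that idea only; it does not show how Snakemake works internally, and the names count_one and count_all, the jobs parameter (loosely analogous to the number passed with -j), and the toy texts are assumptions for this example. It uses threads to keep the sketch short, so it demonstrates the independence of the per-book jobs rather than any speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_one(text):
    """Count whitespace-separated lowercase words in one text."""
    return Counter(text.lower().split())


def count_all(texts, jobs=1):
    """Count words for every entry of a {name: text} dict.

    jobs is the maximum number of texts processed at the same time.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map keeps results in the same order as the inputs.
        results = list(pool.map(count_one, texts.values()))
    return dict(zip(texts.keys(), results))


# Assumption: short strings stand in for the books in the data directory.
books = {
    "book_a": "one two two",
    "book_b": "three three three",
}

serial = count_all(books, jobs=1)
parallel = count_all(books, jobs=4)

# The per-book results do not depend on how many jobs run at once.
print(serial == parallel)
```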