Main module containing logic for data extraction and command line interface.
- Python 3
If you use Nix, you can install most dependencies easily with nix-direnv. Then you just need to do the venv/pip installation steps below.
The codebase has been formatted with black and reformatted for compliance with PEP8. The reformatting resulted in two commits that changed a lot of lines, which can make it unnecessarily challenging to use git blame (and blame integration in IDEs) to peek into the history of the project. However, there is a way around this: the hashes of the reformatting commits are listed in .git-blame-ignore-revs. To configure git to use that file when running git blame:
git config blame.ignoreRevsFile .git-blame-ignore-revs
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp scripts/pre-commit .git/hooks
If you wish to chunk the HTML files with duplicate filtering, you will also need ssdeep. The Python package is installed through pip, but it also requires the ssdeep library on your system, which can be installed with apt:
sudo apt-get install ssdeep libfuzzy-dev libffi-dev python3-dev
More on ssdeep installation can be found here
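To illustrate what duplicate filtering with fuzzy hashes can look like, here is a minimal sketch. It is not this project's actual chunking code: the function name `filter_near_duplicates`, its parameters, and the threshold of 90 are assumptions made for the example. The hashing and comparison functions are passed in, so the sketch runs even without ssdeep installed.

```python
from typing import Callable, Iterable, List


def filter_near_duplicates(
    chunks: Iterable[str],
    hash_fn: Callable[[str], str],
    compare_fn: Callable[[str, str], int],
    threshold: int = 90,  # assumed cut-off on a 0-100 similarity score
) -> List[str]:
    """Keep a chunk only if it is not too similar to an already kept chunk."""
    kept: List[str] = []
    kept_hashes: List[str] = []
    for chunk in chunks:
        digest = hash_fn(chunk)
        # Drop the chunk if any previously kept digest scores at or above the threshold.
        if any(compare_fn(digest, seen) >= threshold for seen in kept_hashes):
            continue
        kept.append(chunk)
        kept_hashes.append(digest)
    return kept
```

With the ssdeep Python package installed, `ssdeep.hash` and `ssdeep.compare` can be passed as `hash_fn` and `compare_fn`; `ssdeep.compare` returns a similarity score between 0 and 100.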
NB: NLP setup is very outdated as of 2022-08. It is due to be redone/updated. This notice will be removed when it is.
If you need to generate the XML files with the CoNLL-U/NLP data, you will need to perform the nlp-setup step:
inv nlp-setup # NOTE: you need to have Java (e.g. openjdk) installed for this to work
Note that the ssdeep pip package seems to be difficult to install on macOS, since according to its documentation it was tested only on Linux systems. On macOS, ignore the dependency and install the other packages from requirements.txt. Everything other than the chunking and duplicate filtering code will work, and the affected tests are skipped when ssdeep is not available.
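One common way to skip tests when an optional dependency is missing is sketched below. It is only an illustration: it assumes `unittest`, and the class and test names are hypothetical, so the project's own tests may do this differently.

```python
import unittest

try:
    import ssdeep  # optional dependency; may be missing, e.g. on macOS
except ImportError:
    ssdeep = None

# True only when the ssdeep package was imported successfully.
SSDEEP_AVAILABLE = ssdeep is not None


@unittest.skipUnless(SSDEEP_AVAILABLE, "ssdeep is not installed")
class FuzzyHashTest(unittest.TestCase):  # hypothetical test case
    def test_hash_returns_string(self):
        # Runs only when ssdeep is importable; otherwise it is reported as skipped.
        self.assertIsInstance(ssdeep.hash("some example text"), str)
```

Running the tests without ssdeep then reports this case as skipped instead of failing.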
Please cite if you use this software or datasets generated by it in your research:
T. Salmi, L. Kallioniemi, J. Loehr. Kaira-core [computer software]. Lammi Biological Station, 2022. Available at https://github.com/Tumetsu/Kaira