This repo is made primarily for NLP tasks and is based mainly on pre-built Haystack and Hugging Face components.
The tasks performed include:
- Document processing: Processing the text from docx/text/pdf files and creating the paragraphs list.
- Search: Performing lexical or semantic search on the paragraphs list created in step 1.
- SDG Classification: Classifying the paragraph text by Sustainable Development Goal (SDG).
- Keyword extraction: Extracting keywords using TextRank/TF-IDF/KeyBERT.
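
The first, second, and last tasks above can be illustrated with a minimal sketch that uses only the Python standard library. This is not the code in this repo and not the Haystack API: the function names (`split_paragraphs`, `tfidf_scores`, `lexical_search`, `top_keywords`) are hypothetical, the tokenizer is deliberately naive, and the smoothed TF-IDF weighting is just one common variant. It only shows the general idea of building a paragraphs list, ranking paragraphs against a query, and picking keywords by TF-IDF weight.

```python
import math
import re
from collections import Counter


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(paragraphs):
    """Return one {term: weight} dict per paragraph, using smoothed TF-IDF."""
    tokenized = [tokenize(p) for p in paragraphs]
    n_docs = len(tokenized)
    # Document frequency: in how many paragraphs each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def lexical_search(query, paragraphs, top_k=3):
    """Rank paragraphs by the summed TF-IDF weight of the query terms."""
    weights = tfidf_scores(paragraphs)
    query_terms = set(tokenize(query))
    scored = [
        (index, sum(w.get(term, 0.0) for term in query_terms))
        for index, w in enumerate(weights)
    ]
    # Keep only paragraphs that match at least one query term.
    hits = [(index, score) for index, score in scored if score > 0]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top_k]


def top_keywords(weights, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


sample_text = (
    "Climate change affects water supply.\n\n"
    "Education quality improves with teacher training.\n\n"
    "Water access and sanitation remain key goals."
)
sample_paragraphs = split_paragraphs(sample_text)
for index, score in lexical_search("water", sample_paragraphs):
    print(index, round(score, 3), sample_paragraphs[index])
```

Semantic search and SDG classification rely on pretrained models, so they are not covered by this sketch.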
Please use the Colab notebook to get familiar with the basic usage of the utils (use the main branch for non-Streamlit usage). For a more detailed walkthrough, use the advanced Colab notebook. There are two branches in the repo: one for use in a Streamlit environment, and another for generic usage, such as in Colab or on a local machine. You can clone the repo for your own use or install it as a package.
To install as a package (non-Streamlit use):
pip install -e "git+https://github.com/gizdatalab/haystack_utils.git@main#egg=utils"
To install as a package for a Streamlit app:
pip install -e "git+https://github.com/gizdatalab/haystack_utils.git@streamlit#egg=utils"
To install as a package for the CPU-trac Streamlit app (https://huggingface.co/spaces/GIZ/cpu_tracs):
pip install -e "git+https://github.com/gizdatalab/haystack_utils.git@cputrac#egg=utils"