This project aims to generate a knowledge graph from PDF files by extracting text and identifying relations within the text. The generated knowledge graph can be visualized through a Streamlit-based frontend application.
- Introduction
- Installation
- Software Requirements
- Pipeline Explanation
- Running the Application
- Folder Structure
- Notebooks
The knowledge graph generation process involves extracting text from PDF files, identifying relationships within the text, and visualizing the relationships in a graph. This README provides a comprehensive guide on setting up the project and running the application.
Clone the repository to your local machine:
git clone https://github.com/daJster/KG-generation.git
cd KG-generation
Install the required packages using pip:
pip3 install -r requirements.txt
Ensure the following Python packages are installed:
- nltk
- llm
- pygraft
- rdflib
- torch
- torchvision
- torchaudio
- pyvis
- PyMuPDF
- langdetect
- transformers
- sentence-transformers
- streamlit
- flask
- Install Docker if not already installed. Refer to the official documentation for installation instructions.
- Start Docker by running the following command:
sudo service docker start
- Install Memgraph using the following command if not already installed:
sudo docker run -p 7687:7687 -p 7444:7444 -p 3000:3000 --name memgraph memgraph/memgraph-platform
- Start Memgraph using the following command:
sudo docker start memgraph
- Check if Memgraph is running by visiting http://localhost:3000 in your web browser.
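Beyond the browser check, you can verify connectivity from Python. Here is a minimal sketch using the gqlalchemy client, which is an assumption on our part and not listed in requirements.txt:

```python
# Minimal connectivity check for a local Memgraph instance.
# Assumes the gqlalchemy client is installed (pip install gqlalchemy);
# this package is not part of this repo's requirements.txt.
from gqlalchemy import Memgraph

memgraph = Memgraph(host="127.0.0.1", port=7687)

# Run a trivial Cypher query; an exception here means Memgraph is unreachable.
results = list(memgraph.execute_and_fetch("RETURN 1 AS ok"))
print(results)  # [{'ok': 1}]
```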
The pipeline consists of the following steps:
- Text Extraction: Extract text from PDF files using PyMuPDF (a short sketch follows this list).
- Relation Extraction: Use NLP libraries such as nltk, llm, and transformers to extract relations from the text.
- Entity Merging: Avoid duplicate entities by merging them; a sentence-embedding model (All-MiniLM) identifies semantically equivalent entities.
- Graph Generation: Construct a knowledge graph from the extracted relations, using pygraft and rdflib for graph creation.
- Admin Interface: Automate the pipeline with Streamlit, allowing any PDF file to be selected and run through the model.
- User Interface (UI): A Streamlit frontend application (app.py) lets users interact with the generated knowledge graph.
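As a concrete illustration of the text-extraction step, here is a minimal sketch using PyMuPDF (imported as fitz). The function name and file path are illustrative, not identifiers from this repository's code:

```python
# Sketch of the text-extraction step using PyMuPDF (the fitz module).
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in a PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

text = extract_text("example.pdf")  # placeholder file name
print(text[:500])
```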
Both the admin and user interfaces require a running Memgraph instance. Ensure Memgraph is running before proceeding (refer to the Memgraph installation section for instructions).
Run the main.py file once to build the Streamlit web app and execute the knowledge graph generation process:
python3 src/pipeline/main.py
Then launch the Streamlit web app:
streamlit run src/pipeline/main.py
If you encounter protobuf-related errors, run the following command in the terminal and then run the Streamlit app again:
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
Visit http://localhost:8501 in your web browser to access the admin web-app.
To run the user interface:
- Go to the web-app folder:
cd src/web-app
- Run the app.py file:
python3 app.py
⚠️ DO NOT run python3 src/web-app directly; otherwise the radius search in the graph will not work.
└── KG-generation
    ├── Knowledge Graphs.pdf
    ├── README.md
    ├── datasets
    │   ├── acronyms.txt
    │   ├── differences.txt
    │   ├── nearly_similarities.txt
    │   ├── plurals.txt
    │   └── similarities.txt
    ├── docker-compose.yml
    ├── dockerfile
    ├── notebooks
    │   ├── Entities.ipynb
    │   ├── Entity_Alignment_Hash.ipynb
    │   ├── Evaluation_Metrics.ipynb
    │   ├── Fine_tuning_2.ipynb
    │   ├── Merge.ipynb
    │   ├── NLP_for_economy.ipynb
    │   ├── NLPlanet.ipynb
    │   └── Pykeen_Metrics.ipynb
    ├── requirements.txt
    └── src
        ├── pipeline
        │   ├── KB_generation.py
        │   ├── all_mini.py
        │   ├── data_selection.py
        │   ├── emissions.csv
        │   ├── main.py
        │   ├── merge_RDF.py
        │   ├── params.py
        │   └── text_selection.py
        ├── test_text_compare.py
        └── web-app
            ├── app.py
            ├── assets
            │   ├── css
            │   │   ├── loader.css
            │   │   └── style.css
            │   └── js
            │       ├── get_details.js
            │       ├── main.js
            │       └── wikipedia_details.js
            ├── game_of_thrones.html
            ├── graph.html
            └── index.html
The notebooks folder contains the following Jupyter notebooks:
This notebook focuses on accelerating the merge process of relations or entities in a graph, especially when dealing with large graphs where the merge operation can be computationally expensive (exponential complexity). The objective is to pre-categorize entities quickly and efficiently. The notebook explores various methods, including using BERT, NER (Named Entity Recognition), Zero-shot classification, FastText with a linear classifier, and an LSTM (Long Short-Term Memory) model.
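As one illustration of the pre-categorization idea, here is a hedged sketch of zero-shot classification of entity strings with the Hugging Face transformers pipeline; the label set and model choice are assumptions, not the notebook's exact configuration:

```python
# Zero-shot classification of entity strings into coarse types,
# one of the pre-categorization approaches the notebook explores.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["person", "organization", "location", "concept"]  # illustrative label set
result = classifier("European Central Bank", candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # most likely category and its score
```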
This notebook introduces an approach for entity alignment and merging utilizing sentence embeddings and hashing to enhance processing efficiency. The primary focus is on addressing the challenge of merging entities or aligning them in a knowledge graph. The notebook employs a hashing function to group entities with similar types and presents practical examples to illustrate the merging process.
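A minimal sketch of this grouping idea, assuming each entity comes with a coarse type that serves as the hash key; the names and the similarity threshold are illustrative, not the notebook's exact code:

```python
# Group entities by a coarse type ("hash bucket"), then only compare
# sentence embeddings within each bucket to find merge candidates.
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

entities = [("ECB", "organization"),
            ("European Central Bank", "organization"),
            ("inflation", "concept")]

buckets = defaultdict(list)
for label, etype in entities:
    buckets[etype].append(label)  # cheap "hash" on the entity type

for etype, labels in buckets.items():
    emb = model.encode(labels, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if sim[i][j] > 0.8:  # assumed threshold for merge candidates
                print(f"merge candidates ({etype}): {labels[i]} ~ {labels[j]}")
```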
Analyzing a knowledge graph poses challenges because there is no ideal metric. Measures such as the accuracy of information, the relationships between entities, and the overall coherence are subjective and often require expert judgment, which makes them hard to compute on the fly. While there are no supervised metrics for dynamic evaluation, unsupervised metrics can help compare the evolution of a graph: by observing changes in metric values between iterations, one can infer improvements or deteriorations. Although these metrics are not implemented here, exploring such approaches could be beneficial.
To address the absence of a native metric, this notebook proposes the use of PyKEEN, an implementation offering a variety of evaluation metrics.
Before running the notebook, install PyKEEN:
pip install pykeen
The notebook calculates various metrics such as Mean Rank, Hits@K, and Mean Reciprocal Rank (MRR) from the results.
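A hedged sketch of how such metrics can be computed with PyKEEN; the toy triples and the TransE model choice are assumptions for illustration, and for simplicity we evaluate on the training triples themselves:

```python
# Score a small set of labeled triples with PyKEEN and report MRR / Hits@K.
import numpy as np
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline

triples = np.array([
    ["inflation", "affects", "interest_rate"],
    ["ECB", "sets", "interest_rate"],
    ["ECB", "monitors", "inflation"],
    ["interest_rate", "influences", "investment"],
], dtype=str)

tf = TriplesFactory.from_labeled_triples(triples)
result = pipeline(training=tf, testing=tf, model="TransE",
                  training_kwargs=dict(num_epochs=50))

print(result.metric_results.get_metric("mean_reciprocal_rank"))
print(result.metric_results.get_metric("hits@10"))
```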
This notebook demonstrates fine-tuning the "All-MiniLM-L6-v2" model to address the challenge of entity alignment and merging. The model, known for its small size, fast inference, and proficiency at identifying sentence similarities, is fine-tuned on a dataset generated with OpenAI's GPT-4 model, focused on the fields of economy and finance.
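A minimal sketch of fine-tuning this model with the sentence-transformers training API; the two training pairs below are placeholders for the GPT-4-generated economy/finance dataset:

```python
# Fine-tune all-MiniLM-L6-v2 on similarity-labeled sentence pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [  # placeholder pairs; the real dataset is domain-specific
    InputExample(texts=["central bank", "monetary authority"], label=0.9),
    InputExample(texts=["inflation", "stock exchange"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One epoch is enough for a demonstration run.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```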
This notebook is an initial "trial" version for merging entities (not triplets) extracted with mREBEL. It is not meant to be improved further, as it is an older version; the newer approach favors merging across the entire relation.
This pipeline tests the inference time as a function of the length of the text/context given to mREBEL, across different hyperparameters, to find the best-performing configuration.
This pipeline focuses on extracting triplets with the REBEL and mREBEL models. It compares the inferences of both models and adapts the pipeline accordingly, with various tests along the way. The pipeline covers everything from short text extraction to processing longer documents and generating knowledge graphs from PDFs, providing a versatile tool for information extraction and analysis.
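For reference, the standard way to run REBEL through the transformers pipeline, following the Babelscape/rebel-large model card; the example sentence is illustrative, and the raw output still needs to be parsed into triplets:

```python
# Triplet extraction with REBEL via the transformers pipeline.
# mREBEL (multilingual) is used the same way with its own checkpoint.
from transformers import pipeline

extractor = pipeline("text2text-generation",
                     model="Babelscape/rebel-large",
                     tokenizer="Babelscape/rebel-large")

text = "Punta Cana is a resort town in the Dominican Republic."
generated = extractor(text, return_tensors=True, return_text=False)

# Decode without skipping special tokens: the output marks triplets with
# <triplet>, <subj>, and <obj> tags that a parser turns into (head, relation, tail).
decoded = extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
print(decoded[0])
```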
This notebook uses the PyKEEN library to evaluate the robustness of a set of triplets forming a knowledge graph. It focuses on three evaluation metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@K.