GitHub

Knowledge Graph Using BioBERT

About The Project

This repository includes code for Named Entity Recognition and Relationship Extraction methods and knowledge graph generation through EHR records. These methods were performed on n2c2 2018 challenge dataset which was augmented to include a sample of ADE corpus dataset. This project is capstone project for my undergraduate degree in Bachelors of Technology (Computer Science and Engineering).

The purpose of this project is to automatically structure this data into a format that would enable doctors and patients to quickly find information that they need. Specifically, build a Named Entity Recognition (NER) model that would recognize entities such as drug, strength, duration, frequency, adverse drug event (ADE), reason for taking the drug, route and form. Further, the model would also recognize the relationship between drugs and every other named entity as well and generate a knowledge graph based on it so as to make it easier for the doctors to analyze the patient’s disease and drug history at a quick glance. The model would also have the feature of query answering wherein the knowledge graph will be used to answer the user queries.

The main objective of the project is to use the extracted relationships between drugs and every other entity to build a comprehensive knowledge graph which could be used for providing quick summary, query answering and analysis, thus simplifying knowledge discovery in the biomedical field

Built With

Getting Started

To run this project locally you need to get the datasets from the links mentioned below and preprocess the datasets to generate the dataset for training and testing of NER and RE models. Also you will need to have an Neo4J account for knowledge graph generation.

Prerequisites

Datasets used for training.

https://huggingface.co/datasets/ade_corpus_v2

https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

Neo4J Connection URI

https://neo4j.com/

Modules in requirements.txt

pip install -r requirements.txt

Installation

Preprocess the datasets

python generate_data.py 
    --task ner 
    --input_dir data
    --ade_dir ade_corpus
    --target_dir dataset
    --max_seq_len 512 
    --dev_split 0.1 
    --tokenizer biobert-base 
    --ext txt 
    --sep " "

The task parameter can be either ner or re for Named Entity Recognition and Relation Extraction tasks respectively.
The input directory should have two folders named train and test in them. Each folder should have txt and ann files from the original dataset.
ade_dir is an optional parameter. It should contain json files from the ADE Corpus dataset.
The max_seq_len should not exceed 512 for BioBERT model.
For BioBERT model, use biobert-base as the tokenizer value.
Use txt for the ext (extension) parameter and " " as the sep (seperator) parameter for NER, and tsv extension and tab as the seperator for RE.

Train the NER and RE Models

NER Model

export SAVE_DIR=./output
export DATA_DIR=./dataset

export MAX_LENGTH=128
export BATCH_SIZE=16
export NUM_EPOCHS=5
export SAVE_STEPS=1000
export SEED=0

python run_ner.py 
    --data_dir ${DATA_DIR}
    --labels ${DATA_DIR}/labels.txt 
    --model_name_or_path dmis-lab/biobert-large-cased-v1.1 
    --output_dir ${SAVE_DIR}
    --max_seq_length ${MAX_LENGTH} 
    --num_train_epochs ${NUM_EPOCHS} 
    --per_device_train_batch_size ${BATCH_SIZE} 
    --save_steps ${SAVE_STEPS} 
    --seed ${SEED} 
    --do_train 
    --do_eval 
    --do_predict 
    --overwrite_output_dir

RE Model

export SAVE_DIR=./output
export DATA_DIR=./dataset

export MAX_LENGTH=128
export BATCH_SIZE=8
export NUM_EPOCHS=3
export SAVE_STEPS=1000
export SEED=1
export LEARNING_RATE=5e-5

python run_re.py 
    --task_name ehr-re 
    --config_name bert-base-cased 
    --data_dir ${DATA_DIR} 
    --model_name_or_path dmis-lab/biobert-base-cased-v1.1 
    --max_seq_length ${MAX_LENGTH} 
    --num_train_epochs ${NUM_EPOCHS} 
    --per_device_train_batch_size ${BATCH_SIZE} 
    --save_steps ${SAVE_STEPS} 
    --seed ${SEED} 
    --do_train 
    --do_eval 
    --do_predict 
    --learning_rate ${LEARNING_RATE} 
    --output_dir ${SAVE_DIR} 
    --overwrite_output_dir

To start the app in development mode

uvicorn fast_api:app --reload

To run the front-end, in front-end/ehr.html and open the HTML file in a browser.

Usage

To show the operation of Named Entity Recognition (NER), Relationship Table and Knowledge Graph, a web app was developed using HTML, CSS, and JavaScript. A graphical user interface (GUI) is displayed, in which the user needs to upload an EHR from which entities, relationships have been identified based on which Knowledge graph is created. The retrieved entities can be viewed as a result. The relationship between retrieved entities can be viewed as a result. The Knowledge graph generated can be viewed as a result .

The uploaded ehr's data is stored in the Neo4J graph database shown in the following figure.
The example for query-answering is shown in the following image.

Contributing

Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have suggestions for adding or removing projects, feel free to open an issue to discuss it, or directly create a pull request after you edit the README.md file with necessary changes.
Please make sure you check your spelling and grammar.
Create individual PR for each suggestion.

Creating A Pull Request

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Authors

Vedant Shahu - Computer Science and Engineering Student - Vedant Shahu

Acknowledgements

Smit Kiri

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
__pycache__		__pycache__
biobert_ner		biobert_ner
biobert_re		biobert_re
front-end		front-end
images		images
preprocess		preprocess
preprocess_re		preprocess_re
static		static
template		template
trained-ner-model		trained-ner-model
trained_re_model		trained_re_model
.gitattributes		.gitattributes
README.md		README.md
Track2-evaluate-ver4.py		Track2-evaluate-ver4.py
annotations.py		annotations.py
ehr.py		ehr.py
fast_api.py		fast_api.py
generate_data.py		generate_data.py
input1.txt		input1.txt
input2.txt		input2.txt
predict.py		predict.py
requirements.txt		requirements.txt
run.txt		run.txt
utils.py		utils.py
webapp.py		webapp.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Knowledge Graph Using BioBERT

Table Of Contents

About The Project

Built With

Getting Started

Prerequisites

Installation

Usage

Contributing

Creating A Pull Request

License

Authors

Acknowledgements

About

Releases

Packages

Languages

vedants03/Knowledge_Graph_Using_BioBERT

Folders and files

Latest commit

History

Repository files navigation

Knowledge Graph Using BioBERT

Table Of Contents

About The Project

Built With

Getting Started

Prerequisites

Installation

Usage

Contributing

Creating A Pull Request

License

Authors

Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages