This project uses BERT embeddings and cosine similarity to classify virology research papers based on keywords and relevant classification terms. The code extracts relevant abstracts, preprocesses them, and classifies each paper according to whether it describes natural language processing or computer vision methods.
- Data Preprocessing: Remove stopwords, lowercase text, and filter out abstracts without content.
- Keyword Matching and Classification: Use a predefined set of keywords for matching relevant research topics.
- BERT Embeddings: Extract embeddings using `allenai/scibert_scivocab_uncased` for text similarity calculations.
- Cosine Similarity Scoring: Rank abstracts based on relevance to predefined terms.
- Classification of Topics: Classify papers into "text mining," "computer vision," "both," or "other" categories (a minimal sketch of this rule follows the list).
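The exact matching rule lives in the code; as a rough illustration, here is a minimal sketch of how keyword hits could map to the four categories. The keyword sets and the `classify_abstract` helper are hypothetical examples, not the project's actual lists.

```python
# Illustrative sketch only: the keyword sets below are invented examples,
# not the project's predefined lists.
TEXT_MINING_TERMS = {"text mining", "natural language processing", "nlp"}
COMPUTER_VISION_TERMS = {"computer vision", "image analysis", "segmentation"}

def classify_abstract(text: str) -> str:
    """Assign one of the four categories based on keyword hits."""
    lowered = text.lower()
    has_tm = any(term in lowered for term in TEXT_MINING_TERMS)
    has_cv = any(term in lowered for term in COMPUTER_VISION_TERMS)
    if has_tm and has_cv:
        return "both"
    if has_tm:
        return "text mining"
    if has_cv:
        return "computer vision"
    return "other"
```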
The code was optimized using GPU acceleration for faster embedding generation and processing.
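A minimal sketch of the device handling this implies, assuming the standard PyTorch pattern of moving the model (and, later, each tokenized batch) to CUDA when available:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Standard PyTorch device pattern: use the GPU when available, else the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").to(device)
model.eval()  # inference only; no gradient updates are needed
```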
To install the required dependencies, run:
pip install transformers scikit-learn torch pandas nltk
Additional setup for NLTK stopwords:
import nltk
nltk.download('stopwords')
- Loading and Preprocessing Data: Load abstracts from collection_with_abstracts.csv, filter out empty entries, and apply text preprocessing.
- Embedding Generation: Use SciBERT to generate embeddings for abstracts and keywords.
- Cosine Similarity Computation: Compare embeddings and compute relevance scores.
- Classification and Keyword Extraction: Classify papers based on content and relevant terms.
- Save Results: Filter and save the results with classification labels and relevance scores to filtered_virology_papers_bert.csv (an end-to-end sketch of these steps follows this list).
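For orientation, a condensed sketch of these steps is shown below. The `abstract` column name, the single reference query, the mean-pooling strategy, and the 0.5 relevance threshold are illustrative assumptions rather than the project's exact settings; the keyword-based classification step is the one sketched earlier and is omitted here.

```python
import nltk
import pandas as pd
import torch
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").to(device)
model.eval()

def preprocess(text):
    # Lowercase and drop NLTK stopwords, mirroring the preprocessing step.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

@torch.no_grad()
def embed(texts):
    # Mean-pool the last hidden state over non-padding tokens (an assumed
    # pooling choice; CLS-token pooling is a common alternative).
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt").to(device)
    hidden = model(**enc).last_hidden_state        # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).cpu().numpy()

# Load abstracts and drop rows with no content ("abstract" is an assumed
# column name in collection_with_abstracts.csv).
df = pd.read_csv("collection_with_abstracts.csv").dropna(subset=["abstract"])
df["clean"] = df["abstract"].map(preprocess)

# Score each abstract against a reference query built from the project's
# keywords (the query text here is an invented stand-in). For large
# collections, embed in chunks rather than one batch.
query_vec = embed(["text mining and computer vision methods in virology"])
df["relevance"] = cosine_similarity(embed(df["clean"].tolist()), query_vec).ravel()

# Keep sufficiently relevant papers (0.5 is an assumed threshold) and save.
df[df["relevance"] > 0.5].to_csv("filtered_virology_papers_bert.csv", index=False)
```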
To use this code, ensure your data file (collection_with_abstracts.csv) is in the working directory. You can then execute the code to produce a CSV file with classifications and method names for each abstract.
- collection_with_abstracts.csv: Input file containing abstracts for processing.
- filtered_virology_papers_bert.csv: Output file with filtered abstracts and classifications.
Traditional keyword-based filtering lacks contextual understanding: it may include papers that use matching terms in irrelevant contexts. SciBERT embeddings encode both the semantic and syntactic structure of the text, and applying cosine similarity to these embeddings lets the system detect related topics even when exact keywords are absent, yielding a more accurate classification.
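As a quick illustration of this point, the sketch below compares two topically related sentences that share no keywords against an unrelated one. The example sentences are invented, and the exact scores will vary by model and input.

```python
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

@torch.no_grad()
def embed(texts):
    # Mean-pool SciBERT's last hidden state over non-padding tokens.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

related_a = "Transformer models extract entities from clinical notes."
related_b = "Neural language models mine biomedical literature for findings."
unrelated = "The capsid protein assembles into an icosahedral shell."

sims = cosine_similarity(embed([related_a, related_b, unrelated]))
# Expect the related pair (which shares no keywords) to score higher than
# the unrelated pair; exact values depend on the model.
print(f"related pair: {sims[0, 1]:.3f}  unrelated pair: {sims[0, 2]:.3f}")
```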
SciBERT leverages unsupervised pretraining of BERT on a large multi-domain corpus of scientific publications, which improves performance on downstream scientific NLP tasks.