Given the large volume of literature and the rapid spread of COVID-19, it is difficult for scientists to keep up with new research. Can we cluster similar research articles together to make it easier for health professionals to find relevant work? Clustering can be used to build a tool that identifies articles similar to a given target article. It can also reduce the number of articles one has to go through, since a reader can focus on a single cluster.
https://maksimekin.github.io/COVID19-Literature-Clustering/plots/t-sne_covid-19_interactive.html
t-SNE Output Clustered For Visualization
Approach:
- Unsupervised Learning task, because we don't have labels for the articles
- Clustering and Dimensionality Reduction task
- See how well a classifier can learn the labels produced by K-Means
- Use N-Grams with a Hashing Vectorizer
- Use plain text with TF-IDF
- Use K-Means for clustering
- Use t-SNE for dimensionality reduction
- Use PCA for dimensionality reduction
- Batch Learning: there is no continuous flow of data, no need to adjust to changing data, and the data is small enough to fit in memory
- Although there is no continuous flow of data, our approach has to be scalable because more literature will be published later
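The plain-text path through the steps above (TF-IDF, PCA, K-Means, t-SNE) can be sketched as follows. This is a minimal illustration using scikit-learn, not the project's actual notebook code: the toy `docs` list, the helper name `cluster_and_embed`, and every parameter value (3 clusters, 95% retained variance, the perplexity rule) are assumptions chosen so the example runs on a handful of sentences.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy stand-in for article body text (hypothetical, not taken from CORD-19).
docs = [
    "coronavirus spike protein binds the ace2 receptor",
    "spike protein structure and ace2 receptor binding",
    "receptor binding domain of the coronavirus spike protein",
    "hospital ventilator capacity during the outbreak",
    "intensive care ventilator demand in hospital wards",
    "hospital capacity planning for intensive care",
    "vaccine trial shows antibody response in volunteers",
    "antibody response after the second vaccine dose",
    "vaccine dose schedule and antibody levels in the trial",
]


def cluster_and_embed(texts, n_clusters=3, random_state=42):
    """Hypothetical helper: return (cluster labels, 2-D t-SNE coordinates)."""
    # Plain text -> sparse TF-IDF matrix.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

    # PCA needs dense input; keep enough components for 95% of the variance.
    pca = PCA(n_components=0.95, svd_solver="full")
    reduced = pca.fit_transform(tfidf.toarray())

    # K-Means assigns each document to one of n_clusters groups.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(reduced)

    # t-SNE is used only to get 2-D coordinates for plotting.
    # Perplexity must stay below the number of samples.
    perplexity = min(30.0, (len(texts) - 1) / 3)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    coords = tsne.fit_transform(reduced)
    return labels, coords


labels, coords = cluster_and_embed(docs)
for text, label in zip(docs, labels):
    print(label, text)
```

In this sketch K-Means runs on the PCA-reduced features, while t-SNE only supplies the 2-D layout that the cluster labels are drawn on. On a corpus the size of CORD-19 the dense `toarray()` call may not be practical, so a sparse-friendly reduction or a capped vocabulary would be a reasonable substitution.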
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
Maps generated using Novel Corona Virus 2019 Dataset | Kaggle.
Dataset/Task: COVID-19 Open Research Dataset Challenge (CORD-19), An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | Kaggle
Code for loading the dataset into a DataFrame (cite): Dataset Parsing Code | Kaggle, COVID EDA: Initial Exploration Tool
Clustering section of the project: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). Copyright 2019 Kiwisoft S.A.S, 978-1-492-03264-9
Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset | White House, USA, March 16, 2020
Kaggle Submission: COVID-19 Literature Clustering | Kaggle
@inproceedings{Raff2020,
author = {Raff, Edward and Nicholas, Charles and McLean, Mark},
booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence},
title = {{A New Burrows Wheeler Transform Markov Distance}},
url = {http://arxiv.org/abs/1912.13046},
year = {2020}
}