This project uses the text of biomedical and life sciences literature to gain insight into how research topics trend over time. The data comes from the text mining collections made available by PubMed Central (PMC), an archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. Using dynamic topic modeling, I uncovered the underlying themes of the text collections and observed some interesting changes in them over time. The same method could potentially help automate organizing, searching, indexing, and browsing large document collections.
You can start by reading main_analysis.ipynb in the code folder, which contains the executive summary and the code used throughout the project, or you can go over the presentation slides (presentation.pdf).
The data folder includes two subsets from the original text mining collections.
This is my Capstone project for the Data Science Immersive program at General Assembly (Washington, DC).