
miniproject: viral epidemics and disease

dheerajdhingani edited this page Oct 23, 2020 · 34 revisions

What diseases co-occur with viral epidemics?

owner:

Priya

collaborators:

Dheeraj Kumar

miniproject summary

If you have any difficulty with this section, please read the Initial Summary section first.

proposed activities

  • Use the communal corpus epidemic50noCov consisting of 50 articles. #f0b215 CREATED
  • Scrutinize the 50 articles to identify true positives and false positives, that is, whether or not each article is about a viral epidemic. #C5F015 FINISHED
  • Use ami search to find whether the articles mention any comorbidity in a viral epidemic, annotating with dictionaries to create ami DataTables. #C5F015 FINISHED
  • Section the articles using ami:section to extract the relevant information on comorbidity. #C5F015 FINISHED
  • Refine and rerun the query to get a corpus of 950 articles. #f0b215 CREATED
  • Scrutinize the 950 articles for true positives and false positives and record the results in a spreadsheet. #1589F0 PROGRESSING
  • Use ami search to create DataTables and ami section to section the 950 articles. #C5F015 FINISHED
  • Use a suitable ML technique to classify whether the articles are about a viral epidemic and which diseases/disorders co-occur. #1589F0 PROGRESSING
  • Create a dashboard of knowledge, especially with an annotated map. #F03C15 NOT STARTED
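The true-positive/false-positive scrutiny listed above can be summarized as a single precision figure for the query. The following is a minimal sketch, not the project's actual notebook: the function name `query_precision` and the example labels are hypothetical, and it assumes each manually checked article is labelled either "TP" or "FP".

```python
from collections import Counter

def query_precision(labels):
    """Fraction of retrieved articles judged relevant: TP / (TP + FP)."""
    counts = Counter(labels)
    total = counts["TP"] + counts["FP"]
    if total == 0:
        raise ValueError("no TP/FP labels supplied")
    return counts["TP"] / total

# Hypothetical labels for five articles (illustrative only, not project results).
example_labels = ["TP", "FP", "TP", "TP", "FP"]
example_precision = query_precision(example_labels)
print(f"precision = {example_precision:.2f}")
```

A low precision would suggest that the getpapers query needs further refinement before the corpus is used for classification.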

outcomes

  • A spreadsheet of the comorbidities reported during viral epidemics, with their counts, will be developed:
  1. for 50 articles in epidemic50noCov. #C5F015 FINISHED
  2. for 950 articles in disease corpus. #1589F0 PROGRESSING
  • Development of an ML model for data classification, evaluated on its accuracy. #1589F0 PROGRESSING
  • Annotated map with the obtained data. #F03C15 NOT STARTED
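The comorbidity count in the spreadsheet outcome could be tallied along the following lines. This is a sketch under stated assumptions: the function name `count_comorbidities`, the article IDs, and the disease hits are all made up for illustration, and it assumes the dictionary hits for each article are already available as a list of terms.

```python
from collections import Counter

def count_comorbidities(article_terms):
    """Count, for each disease term, how many articles mention it at least once."""
    counts = Counter()
    for terms in article_terms.values():
        counts.update(set(terms))  # set(): each article counts once per term
    return counts

# Hypothetical dictionary hits keyed by article ID (illustrative data only).
hits = {
    "PMC0000001": ["pneumonia", "diabetes", "pneumonia"],
    "PMC0000002": ["diabetes"],
    "PMC0000003": ["asthma", "pneumonia"],
}
comorbidity_counts = count_comorbidities(hits)
for disease, n_articles in comorbidity_counts.most_common():
    print(disease, n_articles)
```

Counting articles rather than raw mentions keeps one long review from dominating the table; counting raw mentions instead only requires dropping the `set()` call.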

corpora #f0b215 CREATED

  • Initially the communal corpus epidemic50noCov will be used (a small test corpus for trying out the workflow before moving to the large disease corpus).
  • Later, a corpus of 950 articles will be used. It was created in the disease directory with the following command:
getpapers -q "viral epidemics AND human NOT COVID NOT corona virus NOT SARS-Cov-2" -o disease -f disease/log.txt -k 950 -x -p
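For repeatable runs, the getpapers call that created the disease corpus can be assembled in a script. This is a minimal sketch: the function name `build_getpapers_command` is hypothetical, and the flag descriptions in the comments reflect our reading of the getpapers options, so check `getpapers --help` before relying on them.

```python
import shlex

def build_getpapers_command(query, outdir, limit):
    """Return the getpapers argument list for one corpus download."""
    return [
        "getpapers",
        "-q", query,                # search query sent to EPMC
        "-o", outdir,               # output directory (the Cproject)
        "-f", f"{outdir}/log.txt",  # log file
        "-k", str(limit),           # maximum number of hits to download
        "-x",                       # request full-text XML
        "-p",                       # request PDFs
    ]

QUERY = "viral epidemics AND human NOT COVID NOT corona virus NOT SARS-Cov-2"
command = build_getpapers_command(QUERY, "disease", 950)
print(shlex.join(command))

# To actually run it (requires getpapers on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the multi-word query.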

dictionaries

software

  • getpapers to create the corpus of 950 articles by downloading from EPMC.
  • AMI for creating DataTables, creating and using dictionaries, sectioning.
  • SPARQL for creating dictionaries.
  • KNIME for workflow and analytics.
  • keras, Jupyter Notebook [Python] for binary classification.

constraints

Respective pages #f0b215

  1. 50 articles corpus epidemic50noCov at - https://github.com/petermr/openVirus/tree/master/miniproject/epidemic50noCov

  2. 950 articles corpus disease at - https://github.com/petermr/openVirus/tree/master/miniproject/disease

  3. for getpapers - https://github.com/petermr/openVirus/wiki/getpapers#tester-2

  4. for installing ami - https://github.com/petermr/ami3/wiki/ami-installation

  5. for updating ami - https://github.com/petermr/openVirus/wiki/Tools:-ami3#updating-ami3

  6. for amidict/dictionary validation - https://github.com/petermr/openVirus/wiki/Tools:-ami3#amidict-validation

  7. for ami search - https://github.com/petermr/openVirus/wiki/ami-search

  8. for ami section - https://github.com/petermr/openVirus/wiki/ami:section

  9. for SPARQL - https://github.com/petermr/openVirus/wiki/Tools-:-SPARQL

  10. for the ML technique (a Jupyter Notebook is used) - https://github.com/petermr/openVirus/wiki/Jupyter-Notebooks#data-preparation-for-ml



Initial Summary

(by collaborator Dheeraj)

The aim of the mini-project

Our underlying aim is that once the diseases are recognized, appropriate medicines can be given for them. In this mini-project we use the ContentMine software (getpapers and ami) together with a disease dictionary to find the diseases mentioned in open access articles about viral epidemics.

Resources

Dictionary

  • The dictionary lists the names of diseases. It is used to search the articles for those disease terms, much as an ordinary dictionary is a store of words.
  • Its sources are ICD-10 (by WHO) and Wikidata, and it was created using ami.
  • It is a multilingual dictionary (English, Hindi, Tamil, Kannada, Spanish, Portuguese).

Corpus 950 (disease)

  • This is a collection of articles on viral epidemics and diseases. The articles contain information about diseases that we aim to extract and simplify.
  • The 950 articles were downloaded from EPMC via getpapers.

EPMC

EPMC (Europe PubMed Central) is a website hosting a large number of scientific research articles. For our mini-project we are analyzing some of its open access articles, which are downloaded using getpapers.

Tools

getpapers

ami

  • ami is also ContentMine software. It is used to create dictionaries, to search articles for the disease terms listed in a dictionary, to section downloaded articles, and to gather information from them.
  • In this project, for example, we created a dictionary of diseases.

Wikidata SPARQL

  • The Wikidata Query Service lets you query Wikidata with SPARQL. Wikidata stores structured data used by Wikipedia and other projects.
  • In this mini-project we needed the ICD-10 codes for diseases and wanted the results in several languages.
  • Our initial query returned results in four languages.
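A query of the kind described above might be generated as follows. This is a hedged sketch rather than the query the project actually ran: the function name `build_disease_query` is hypothetical, and the Wikidata identifiers (P31 for "instance of", Q12136 for "disease", P494 for "ICD-10") are assumptions that should be confirmed on Wikidata before use. The language codes correspond to the six languages of the dictionary.

```python
LANGUAGES = ["en", "hi", "ta", "kn", "es", "pt"]

def build_disease_query(languages, limit=100):
    """Build a SPARQL query for diseases with an ICD-10 code and per-language labels."""
    select_vars = " ".join(f"?label_{lang}" for lang in languages)
    optional_labels = "\n".join(
        f"  OPTIONAL {{ ?disease rdfs:label ?label_{lang} . "
        f'FILTER(LANG(?label_{lang}) = "{lang}") }}'
        for lang in languages
    )
    lines = [
        f"SELECT ?disease ?icd10 {select_vars} WHERE {{",
        "  ?disease wdt:P31 wd:Q12136 .  # instance of disease (assumed item ID)",
        "  ?disease wdt:P494 ?icd10 .    # ICD-10 code (assumed property ID)",
        optional_labels,
        "}",
        f"LIMIT {limit}",
    ]
    return "\n".join(lines)

query = build_disease_query(LANGUAGES)
print(query)
```

The printed text can be pasted into the Wikidata Query Service. Labels are wrapped in `OPTIONAL` so that a disease without a label in one language still appears in the results.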

Work done #C5F015

  • I have read about getpapers and EPMC, including advanced search in EPMC, and have read some of its articles.
  • I read about Wikidata and learned to update the dictionary.
  • I updated the dictionary with ICD-10 codes using the Wikidata Query Service.
  • So far I have manually classified some articles as true or false positives.
  • I created a SPARQL query for a multilingual (six languages) disease dictionary.

My goal #1589F0

  • As stated above, if the diseases are known, medicines can be given accordingly. Our main goal is therefore to find the names of the diseases that co-occur with viral epidemics and to work from there.
  • Next, all the articles have to be manually classified as true or false positives.


Challenging #F03C15

  1. Learning KNIME for workflow and analytics.
  2. Learning Keras and Python code in Jupyter Notebook to use in binary classification.

Issue Rectification #C5F015

Splitting 950 corpus for ami search

  1. The 950-article corpus was large, and running ami search on it raised an OutOfMemoryError.
  2. Hence, the disease corpus (Cproject) was split into 4 parts of 200-250 Ctrees each.
  3. ami search was then run successfully on each part, which created the DataTables.
  4. The test details are at https://github.com/petermr/openVirus/wiki/ami-search#running-ami-search-in-disease-dictionary
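The split described in the steps above can be scripted. The following is a minimal sketch, not the procedure actually used: the function name `split_cproject` and the `_partN` directory naming are hypothetical, and it assumes every Ctree is a subdirectory whose name starts with "PMC". The demonstration runs on an empty throwaway directory tree rather than on real articles.

```python
import shutil
import tempfile
from pathlib import Path

def split_cproject(cproject, n_parts):
    """Copy the Ctree directories of a Cproject into n_parts sibling sub-projects."""
    cproject = Path(cproject)
    # Assumption: Ctrees are the subdirectories named PMC...
    ctrees = sorted(
        p for p in cproject.iterdir() if p.is_dir() and p.name.startswith("PMC")
    )
    size = -(-len(ctrees) // n_parts)  # ceiling division
    parts = []
    for i in range(n_parts):
        chunk = ctrees[i * size:(i + 1) * size]
        if not chunk:
            break
        part_dir = cproject.with_name(f"{cproject.name}_part{i + 1}")
        part_dir.mkdir(exist_ok=True)
        for ctree in chunk:
            shutil.copytree(ctree, part_dir / ctree.name, dirs_exist_ok=True)
        parts.append(part_dir)
    return parts

# Demonstration on a temporary Cproject with ten empty Ctrees.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "disease"
    for i in range(10):
        (demo / f"PMC{i:07d}").mkdir(parents=True)
    demo_parts = split_cproject(demo, 4)
    demo_sizes = [len(list(p.iterdir())) for p in demo_parts]
print(demo_sizes)
```

With 950 Ctrees and four parts this scheme gives at most 238 Ctrees per part, which falls inside the 200-250 range mentioned above. Copying rather than moving leaves the original Cproject untouched.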

_cooccurence folder

  1. Initially, on Windows, ami search created an empty _cooccurence folder.
  2. After debugging, AMI was updated, and it then produced the desired result in the _cooccurence folder.
  3. Thus the error was rectified.

Update #C5F015

Uploading corpus to GitHub

(Reference from Ambreen's update )

  1. Download VS Code and clone the openVirus repository onto your system.
  2. Open the openVirus folder in VS Code (don't close it).
  3. Now open the openVirus folder in your file system and make your changes in it.
  4. Return to the minimized VS Code window and commit the changes by selecting the commit symbol. This may take some time, depending on the size of the files you are uploading.
  5. After adding the remote repository, push the changes to GitHub. See this video for other clarification.

NOTE : If you have already cloned the repository, first pull the repo and then push the changes.

Using Valid dictionary for ami search

  1. The ami search syntax used above relied on the built-in disease dictionary.
  2. To use the Valid Disease Dictionary, the whole path must be specified in the syntax as follows:
ami -p <Cproject> search --dictionary openVirus/cambiohack2020/dictionaries/disease.xml

NOTE : <Cproject> must be replaced by the name of your Cproject, the one that contains the Ctrees.
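When the corpus has been split into several Cprojects, the same search has to be repeated for each part. The sketch below builds one command per part using the syntax shown above. The function name `ami_search_commands` and the part directory names are hypothetical, and the dictionary path is the one quoted in this section.

```python
DICTIONARY = "openVirus/cambiohack2020/dictionaries/disease.xml"

def ami_search_commands(cprojects, dictionary_path):
    """Return one 'ami -p <Cproject> search --dictionary <path>' command per Cproject."""
    return [
        ["ami", "-p", str(cproject), "search", "--dictionary", dictionary_path]
        for cproject in cprojects
    ]

# Hypothetical names for the four parts of the disease corpus.
part_names = [f"disease_part{i}" for i in range(1, 5)]
search_commands = ami_search_commands(part_names, DICTIONARY)
for cmd in search_commands:
    print(" ".join(cmd))

# To actually run them (requires ami on the PATH):
# import subprocess
# for cmd in search_commands:
#     subprocess.run(cmd, check=True)
```

Running the parts one after another keeps memory use per run small, which is the reason the corpus was split in the first place.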
