Outreach: InCoB2021 PMR presentation

petermr edited this page Oct 23, 2021 · 4 revisions

Title: ContentMining the Biological Literature

Abstract: Much science is only published as PDF "papers" designed for sighted English-speaking humans to read. This huge, multidisciplinary resource becomes much more useful when we convert it automatically into structured, semantic, machine-understandable form. Our Open Source toolkit allows scientists to rapidly mine the Open literature and create their own ontologies and semantic knowledge bases. ContentMining involves a number of rapid heuristic steps: downloading from Open repositories; converting PDF to structured documents; NLP analysis and term/phrase extraction; extraction of text and data from diagrams; annotation with Wikidata into dictionaries; searching documents with multiple dictionaries. Results can be analysed with standard tools to give tables, co-occurrences, maps, chemical pathways etc. The tools (Python3) are accessible to everyone, especially early career researchers, and include support for multiple languages through Wikidata synonyms. We work as an Open Notebook team, develop collaboratively, and welcome volunteers.

  • Discover
  • Refine
  • Re-use
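The "searching documents with multiple dictionaries" and "co-occurrences" steps named in the abstract can be illustrated with a minimal sketch. It does not use the toolkit's own API: the dictionary names, the terms, and the functions `search_document` and `cooccurrences` are all hypothetical and exist only for illustration. Each dictionary is assumed to be a plain set of terms, matched as whole words without regard to case.

```python
import re
from collections import Counter

# Hypothetical dictionaries (name -> set of terms), for illustration only.
DICTIONARIES = {
    "plant": {"arabidopsis", "rice"},
    "compound": {"limonene", "pinene"},
}


def search_document(text, dictionaries):
    """Return {dictionary name: Counter of matched terms} for one document."""
    hits = {}
    for name, terms in dictionaries.items():
        counts = Counter()
        for term in terms:
            # Whole-word, case-insensitive match of the dictionary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            n = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if n:
                counts[term] = n
        hits[name] = counts
    return hits


def cooccurrences(documents, dictionaries, first, second):
    """Count documents containing both a term from `first` and one from `second`."""
    pairs = Counter()
    for text in documents:
        hits = search_document(text, dictionaries)
        for a in hits[first]:
            for b in hits[second]:
                pairs[(a, b)] += 1
    return pairs


docs = [
    "Limonene was found in rice. Rice also contains pinene.",
    "Arabidopsis produces limonene.",
]
pairs = cooccurrences(docs, DICTIONARIES, "plant", "compound")
```

The resulting `pairs` counter can be turned into a table with standard tools. Counting per document rather than per sentence is a design choice of this sketch, not something the abstract specifies.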

Presentations

  • @Ayush

Pygetpapers. Include example of query generation from dictionaries

  • @Shweata

Docanalysis

  • @Anuv

pyamiimage
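For the pygetpapers item above, which asks for an example of query generation from dictionaries, one possible starting point is sketched here. It is not pygetpapers code: `build_query` and `combine` are hypothetical helpers, the term lists are made up, and the sketch assumes the target repository accepts quoted phrases joined with `OR` and `AND`.

```python
def build_query(terms, max_terms=None):
    """Join dictionary terms into one parenthesised OR-query of quoted phrases."""
    cleaned = []
    seen = set()
    for term in terms:
        # Drop embedded double quotes and normalise whitespace.
        t = " ".join(term.replace('"', "").split())
        key = t.lower()
        if not t or key in seen:
            continue  # skip empty terms and case-insensitive duplicates
        seen.add(key)
        cleaned.append(t)
    if max_terms is not None:
        cleaned = cleaned[:max_terms]
    if not cleaned:
        raise ValueError("no usable terms in dictionary")
    return "(" + " OR ".join('"%s"' % t for t in cleaned) + ")"


def combine(*groups):
    """Require a hit from every dictionary by joining the groups with AND."""
    return " AND ".join(groups)


# Hypothetical dictionaries, for illustration only.
plant_terms = ["essential oil", "terpene", "Terpene"]
activity_terms = ["antifungal", "antibacterial"]

query = combine(build_query(plant_terms), build_query(activity_terms))
print(query)
```

The `max_terms` cap is there because large dictionaries can produce very long query strings. Whether a cap is needed, and what value suits it, depends on the repository being queried.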

Notes: @mbeisen (https://twitter.com/mbeisen/status/1451233646761824284): "The current science publishing system is the worst form of science publishing system. That's it. There's no except."