Anomalous entity entries in CSV output #33

EmanuelFaria · 2022-12-23T00:28:01Z

ISSUE: While using sciSpacy to create the CSV, I noticed a number of problems with the entities it found and the labels attached to them.

NOTE: I suspect at least some of the problem could be due to the output being comma-delimited, so I propose we try with tab-delimited output, and I'll re-run this test corpus and compare.
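As a rough sketch of the comparison proposed above, Python's standard `csv` module can write the same rows with either delimiter. The column names (`entity`, `label`) and the sample rows are assumptions for illustration, not the project's actual schema. With minimal quoting, a field that contains the delimiter is wrapped in quotes, so a comma inside an entity should survive a round trip in both formats, provided the file is read back with a real CSV parser rather than a plain split on commas.

```python
import csv
import io

# Hypothetical rows; the column names are assumptions, not the project's schema.
SAMPLE_ROWS = [
    {"entity": "TNF-alpha, IL-6", "label": "CHEMICAL"},
    {"entity": "fatty acid", "label": "DISEASE"},
]

FIELDNAMES = ["entity", "label"]


def write_entities(rows, delimiter="\t"):
    """Serialize rows to delimited text, quoting fields only when needed."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=FIELDNAMES,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_entities(text, delimiter="\t"):
    """Parse delimited text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter=delimiter)]


tsv_text = write_entities(SAMPLE_ROWS, delimiter="\t")
csv_text = write_entities(SAMPLE_ROWS, delimiter=",")
```

If the tab-delimited and comma-delimited runs produce the same entities after parsing, the delimiter is probably not the cause of the mislabeling.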

In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)

In this case I was focusing on which entities were mislabeled as DISEASE.

The types of errors include the following being identified as DISEASE:

email address
author names (or parts thereof)
apostrophes (in many cases a lone apostrophe was labelled a DISEASE. Maybe before exporting we include a step to replace all smart quotes and apostrophes with dumb ones?)
plant names and extracts
organization names (or parts thereof)
chemical compounds
names of proteins
measurements (or parts thereof)
fatty acid is treated as a disease throughout
microbes are treated as a disease throughout, but I suspect that is intentional
factors such as TNF-alpha
chemical terms such as dissolution/solubility
Moroccan cultural heritage
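The smart-quote replacement suggested in the apostrophe item above could look something like the following. This is a minimal sketch using only the standard library; the function name `straighten_quotes` is hypothetical, and it covers only the four most common curly quote characters, so other typographic marks would pass through unchanged.

```python
# Map curly single/double quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with smart quotes and apostrophes replaced by plain ones."""
    return text.translate(QUOTE_MAP)
```

Whether this step belongs before the text reaches the NER pipeline or just before export is an open question; running it before entity extraction is the only placement that could change which tokens get labelled.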

I also noticed some entities mislabeled as CHEMICAL:

COVID-19
random numbers

Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.

EmanuelFaria changed the title ~~Anomalous entries in CSV output~~ Anomalous entity entries in CSV output Dec 23, 2022
