ISSUE: Using scispaCy to create a CSV, I noted a number of issues in the entities it found and the labels attached to them.
NOTE: I suspect at least some of the problem could be due to the output being comma-delimited, so I propose we try with tab-delimited output, and I'll re-run this test corpus and compare.
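For the re-run, a minimal sketch of tab-delimited output using Python's standard `csv` module. The column names and sample rows here are hypothetical, not taken from the actual export, and the real export code may be structured differently:

```python
import csv
import io

# Hypothetical rows: (entity text, label). Not real output from the corpus.
rows = [
    ("fatty acid", "DISEASE"),
    ("TNF-alpha, IL-6", "DISEASE"),  # entity text that itself contains a comma
]

def write_tsv(rows, handle):
    """Write a header plus rows as tab-delimited text to an open file handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity", "label"])  # assumed column names
    writer.writerows(rows)

# Write to an in-memory buffer; a real run would open a file with newline="".
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()

# Read it back to check that the comma inside the entity text survives intact.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

If the current CSV is written by joining fields with commas by hand rather than through a CSV writer, entity text containing commas would shift columns, which could account for some of the anomalies. That is a guess until the two outputs are compared.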
In the attached PDF and CSV you'll see I added two new columns: Anomaly and Issue. (I did not identify the issue for most of these, but you'll get the gist.)
In this case I was focusing on which entities were mislabeled as DISEASE.
The types of errors include the following being identified as DISEASE:
email address
author names (or parts thereof)
apostrophes (in many cases a lone apostrophe was labelled as DISEASE; maybe before exporting we could include a step that replaces all smart quotes and apostrophes with dumb ones?)
plant names and extracts
organization names (or parts thereof)
chemical compounds
names of proteins
measurements (or parts thereof)
fatty acid is treated as a disease throughout
microbes are treated as a disease throughout, but I suspect that is intentional
factors such as TNF-alpha
chemical terms such as dissolution/solubility
Moroccan cultural heritage
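The smart-quote replacement suggested in the apostrophe item above could look roughly like this. It is a sketch only: the function name is hypothetical, and whether the step belongs before entity recognition or just before export is an open question.

```python
# Map typographic quotes and apostrophes to their plain ASCII equivalents.
SMART_TO_DUMB = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201A": "'",   # single low-9 quotation mark
    "\u201B": "'",   # single high-reversed-9 quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\u201E": '"',   # double low-9 quotation mark
    "\u201F": '"',   # double high-reversed-9 quotation mark
})

def normalize_quotes(text: str) -> str:
    """Return text with smart quotes and apostrophes replaced by dumb ones."""
    return text.translate(SMART_TO_DUMB)

cleaned = normalize_quotes("Crohn\u2019s \u201cdisease\u201d")
```

Running this on the text before it reaches the pipeline, rather than only on the exported rows, might also stop the lone apostrophe from being split off as its own entity, but that would need to be checked against the test corpus.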
Also, I noticed some entities mislabeled as CHEMICAL:
COVID-19
random numbers
Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.
EmanuelFaria changed the title from "Anomalous entries in CSV output" to "Anomalous entity entries in CSV output" on Dec 23, 2022