Anomalous entity entries in CSV output #33

Open
EmanuelFaria opened this issue Dec 23, 2022 · 0 comments

@EmanuelFaria
Collaborator

ISSUE: While using scispaCy to create the CSV, I noticed a number of problems with the entities it found and the labels attached to them.

NOTE: I suspect at least some of the problem is due to the output being comma-delimited, so I propose we try tab-delimited output. I'll re-run this test corpus and compare.
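
To make the delimiter hypothesis concrete, here is a minimal sketch using only Python's standard `csv` module. The rows and the `entity`/`label` column names are made up for illustration, and this is not the project's actual export code. It shows that an entity containing a comma survives a round trip when the file is written and read with a proper CSV/TSV parser, but breaks into extra fields if a comma-delimited line is split naively.

```python
import csv
import io

# Hypothetical rows (entity text, label); the first entity contains a comma.
rows = [
    ("1,25-dihydroxyvitamin D3", "CHEMICAL"),
    ("rheumatoid arthritis", "DISEASE"),
]

def write_rows(rows, delimiter):
    """Serialize rows with the csv module, which quotes fields when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["entity", "label"])  # assumed column names
    writer.writerows(rows)
    return buf.getvalue()

def read_rows(text, delimiter):
    """Parse delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

tsv_text = write_rows(rows, "\t")
csv_text = write_rows(rows, ",")

# Splitting the comma-delimited line by hand ignores quoting and
# yields three fragments instead of two fields.
naive_split = csv_text.splitlines()[1].split(",")
```

If the anomalies persist in tab-delimited output, the delimiter is probably not the cause and the labels are coming from the model itself.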

In the attached PDF and CSV you'll see I added two new columns: Anomaly and issue. (I did not identify the issue for most of these, but you'll get the gist.)

In this case I was focusing on which entities were mislabeled as DISEASE.

The types of errors include the following being identified as DISEASE:

  • email addresses
  • author names (or parts thereof)
  • apostrophes (in many cases a lone apostrophe was labeled a DISEASE. Maybe before exporting we include a step to replace all smart quotes and apostrophes with dumb ones?)
  • plant names and extracts
  • organization names (or parts thereof)
  • chemical compounds
  • names of proteins
  • measurements (or parts thereof)
  • fatty acid is treated as a disease throughout
  • microbes are treated as a disease throughout, but I suspect that is intentional
  • factors such as TNF-alpha
  • chemical terms such as dissolution/solubility
  • Moroccan cultural heritage
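
The smart-quote replacement suggested in the apostrophe item could look roughly like the sketch below. The function names are hypothetical, and the punctuation-only check is an extra assumption of mine rather than something proposed above. I have not verified whether normalizing quotes changes what the model labels.

```python
# Map typographic quotes and apostrophes to their ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / typographic apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    """Replace smart quotes and apostrophes with plain ASCII ones."""
    return text.translate(SMART_TO_PLAIN)

def is_punctuation_only(entity_text: str) -> bool:
    """True when an entity has no letters or digits, e.g. a lone apostrophe."""
    return not any(ch.isalnum() for ch in entity_text)
```

Running `normalize_quotes` on the text before export addresses the quote characters themselves. `is_punctuation_only` could additionally be used to drop entities such as a lone apostrophe from the CSV.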

I also noticed some entities mislabeled as CHEMICAL:

  • COVID-19
  • random numbers

Also, many abbreviations came up as entities but were not expanded in the abbreviations_longform column.
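
To gauge how widespread the missing expansions are, a rough triage script could list abbreviation-like entities whose abbreviations_longform cell is empty. This is a sketch under assumptions: the `entity` column name, the tab-delimited layout, the regular expression, and the sample rows are all hypothetical. Only `abbreviations_longform` comes from the actual output.

```python
import csv
import io
import re

# Heuristic: 2-10 characters, starting with a capital letter, then only
# capitals, digits, or hyphens (e.g. "TNF", "IL-6", "COVID-19").
ABBREV_LIKE = re.compile(r"^[A-Z][A-Z0-9-]{1,9}$")

def unexpanded_abbreviations(tsv_text, entity_col="entity",
                             longform_col="abbreviations_longform"):
    """Return abbreviation-like entities that have no long form recorded."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[entity_col]
        for row in reader
        if ABBREV_LIKE.match(row[entity_col])
        and not (row.get(longform_col) or "").strip()
    ]

# Made-up sample rows for illustration only.
sample = (
    "entity\tlabel\tabbreviations_longform\n"
    "TNF\tDISEASE\t\n"
    "IL-6\tCHEMICAL\tinterleukin 6\n"
    "fatty acid\tDISEASE\t\n"
)
missing = unexpanded_abbreviations(sample)
```

The heuristic will miss lowercase or mixed-case abbreviations, so the resulting list is a lower bound on the problem, not a full count.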

@EmanuelFaria EmanuelFaria changed the title Anomalous entries in CSV output Anomalous entity entries in CSV output Dec 23, 2022