Research and Select a Suitable NER Tool or Library

Yashodhar Pansuriya edited this page Jun 19, 2024 · 4 revisions
  • Question: Research and select an appropriate Named Entity Recognition (NER) tool or library to enhance the quality of our training data. The goal is to ensure the model better understands CNCF-specific knowledge. The process should:

    • Evaluate potential NER tools such as Transformers by HuggingFace, spaCy, and NLTK, among others.
    • Identify an approach that can automatically recognize entities and relationships using our data, possibly utilizing existing CNCF entities and their relationships.
  • Results:

Our industry partner requires Named Entity Recognition (NER) on the gathered data, so we need to identify suitable existing tools and approaches.

Findings

  1. This article (Custom Named Entity Recognition: A Solution for Unstructured Product Data) suggests using an existing LLM, preferably an instruction-tuned one, with suitable prompts. Candidate models to analyze and test for this purpose include:
    • GoLLIE: Model trained for Information Extraction [Apache-2.0 license]
    • XLM-RoBERTa: Model trained on filtered CommonCrawl data containing 100 languages [MIT license]
    • mDeBERTaV3: Multilingual version of DeBERTaV3, which improves on the BERT and RoBERTa models [MIT license]
    • UDOP: Model designed for document image classification, document parsing, and document visual question answering [MIT license]
    • mPLUG-DocOwl 1.5: Model designed for document understanding [Apache-2.0 license]
  2. Another option would be to use existing Python packages, such as the Natural Language Toolkit (NLTK) [Apache-2.0 license] or spaCy [MIT license], as described in these articles (NER with NLTK and SpaCy, NLP Entity Extraction NER Using Python NLTK).
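To make the prompt-based approach from finding 1 more concrete, here is a minimal sketch of how a prompt could be built and how a model's answer could be validated. It uses only the Python standard library and does not call any model. The label set, the function names `build_ner_prompt` and `parse_ner_response`, and the JSON answer format are illustrative assumptions, not something prescribed by the article or by any of the listed models.

```python
import json

# Hypothetical label set for CNCF-related text (an assumption, not project-defined).
ENTITY_LABELS = ["PROJECT", "ORGANIZATION", "TECHNOLOGY"]


def build_ner_prompt(text, labels=ENTITY_LABELS):
    """Build an instruction prompt asking the model to return entities as JSON."""
    label_list = ", ".join(labels)
    return (
        "Extract all named entities from the text below.\n"
        f"Allowed labels: {label_list}.\n"
        'Answer only with a JSON list of objects of the form '
        '{"text": ..., "label": ...}.\n\n'
        f"Text: {text}"
    )


def parse_ner_response(response, text, labels=ENTITY_LABELS):
    """Parse the model's answer; keep only well-formed entities that occur in the text."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []  # The model did not answer with valid JSON.
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if not isinstance(item, dict):
            continue
        span, label = item.get("text"), item.get("label")
        # Drop unknown labels and spans the model made up.
        if isinstance(span, str) and span and label in labels and span in text:
            entities.append((span, label))
    return entities


# Hand-written stand-in for a model answer, used only to exercise the parser.
sample = "Kubernetes is a graduated CNCF project."
fake_response = (
    '[{"text": "Kubernetes", "label": "PROJECT"}, '
    '{"text": "Docker", "label": "PROJECT"}]'
)
print(parse_ner_response(fake_response, sample))  # [('Kubernetes', 'PROJECT')]
```

Checking that each returned span actually occurs in the input is a cheap guard against hallucinated entities, which is one of the main risks when an LLM is used for extraction.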

Conclusion

There are several options that need to be tested to arrive at the best possible solution. As a first step, and if the available resources allow it, using an LLM for the NER task seems promising and should be investigated further.
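One possible way to compare the candidate tools is to score their output against a small hand-annotated sample. The sketch below computes entity-level precision, recall, and F1 using exact matching on (text, label) pairs. The function name `score_entities`, the matching rule, and the example entities are assumptions made for illustration; no evaluation protocol has been agreed on yet.

```python
def score_entities(predicted, gold):
    """Entity-level precision, recall and F1 via exact match on (text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up annotations, used only to show the calculation.
gold = [("Kubernetes", "PROJECT"), ("Prometheus", "PROJECT")]
predicted = [("Kubernetes", "PROJECT"), ("Helm", "ORGANIZATION")]
print(score_entities(predicted, gold))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Running the same annotated sample through each tool (an LLM with prompts, spaCy, NLTK) would give directly comparable numbers.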