[Question]: Issue with NER Label Recognition using external annotated dataset #3314
Comments
Hi @Sanpau2022, please note that your results depend heavily on your choice of model and on the dataset you created. As you haven't shared information such as the training script or training logs, we can only guess what might help. My guess is that 7 is a very small dataset size; usually one wants to start with at least 100 training examples, and preferably 1000.
Thanks, I resolved the formatting issue by converting my files from TXT to CSV. However, I've encountered a new challenge. The code is not functioning as expected. Below is the code I am trying to use, including the error message I received during execution. I would greatly appreciate your insights or assistance!
import flair
nltk.download("punkt")
MODEL_DIR = "./model"
def read_csv_to_sentences(csv_file_path: str):
def train(data_dir: str, model_dir: str):
train(DATA_DIR, MODEL_DIR)
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
Exception Traceback (most recent call last)
/tmp/ipykernel_290/178478599.py in train(data_dir, model_dir)
~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in make_label_dictionary(self, label_type, min_count, add_unk)
Exception:
Hi again, as the warning shows:
You have not added any labels in your conversion method. Instead, you have added the labels as part of the token text. You can use the following function to create a sentence from the list of tokens and token labels (assuming BIO format) and add the labels to the sentence:
from typing import List
from flair.data import get_spans_from_bio, Sentence

def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    # Passing a list of strings keeps the existing tokenization.
    sentence = Sentence(tokens)
    # Each entry is (token indices, score, label).
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence
This converts the BIO tags to the target spans used in flair.
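To make the BIO-to-span step concrete without depending on flair, here is a minimal pure-Python sketch of the grouping idea. `bio_to_spans` is a hypothetical helper written for illustration; it is not part of flair and may treat malformed tag sequences differently from `get_spans_from_bio`.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start, end_exclusive, label) spans.

    Hypothetical helper for illustration only; it is not part of flair.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray "I-" without a preceding "B-" is treated as a span start.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["George", "Washington", "went", "to", "Washington"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
for start, end, label in bio_to_spans(tags):
    print(label, tokens[start:end])
```

The point is that the label belongs to a span of tokens, not to the token text itself, which is why appending the tag to the token string leaves the sentence unlabeled.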
Thanks, I modified the code as you suggested:
However, it is still not working. After running that, I received:
TypeError Traceback (most recent call last)
/tmp/ipykernel_96/2207024690.py in train(data_dir, model_dir)
/tmp/ipykernel_96/2207024690.py in read_csv_to_sentences(csv_file_path)
/tmp/ipykernel_96/2207024690.py in create_labeled_sentence(tokens, tag_labels)
~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in __init__(self, text, use_tokenizer, language_code, start_position)
TypeError: sequence item 16: expected str instance, float found
It looks like your CSV file contains some empty text cells, which pandas replaces with NaN, a float value. That would explain the "expected str instance, float found" error. You can check for those values in the loaded data.
By the way, when sharing code, please wrap it in a code block with a language tag so that it is syntax-highlighted. Then we'll see properly highlighted code:
# code goes here
if True:
    print("Hello World")
which makes it much easier to read and understand your comments.
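As a rough illustration of that failure mode, the sketch below uses `float("nan")` as a stand-in for an empty CSV cell as pandas would load it. It locates non-string tokens and drops them together with their tags. `drop_non_string_tokens` and the sample data are hypothetical, not taken from the code in this thread.

```python
from typing import List, Tuple


def drop_non_string_tokens(tokens: List[object], tags: List[str]) -> Tuple[List[str], List[str]]:
    """Keep only (token, tag) pairs whose token is a non-empty string."""
    kept = [(tok, tag) for tok, tag in zip(tokens, tags)
            if isinstance(tok, str) and tok.strip()]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


# float("nan") stands in for an empty CSV cell (assumption for this sketch).
raw_tokens = ["Flair", float("nan"), "works"]
raw_tags = ["B-MISC", "O", "O"]

# Joining such a list is what raises "expected str instance, float found".
bad_positions = [i for i, tok in enumerate(raw_tokens) if not isinstance(tok, str)]
print("non-string tokens at positions:", bad_positions)

clean_tokens, clean_tags = drop_non_string_tokens(raw_tokens, raw_tags)
print(clean_tokens, clean_tags)
```

Dropping rows is only one option. If an empty cell sits inside an entity, removing it can leave an `I-` tag without its `B-` tag, so it may be safer to fix the source file instead.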
I checked the files and they look OK, so you might want to have a look at the code I originally used when working with TXT annotations:
However, my BIO-formatted labels do not seem to be recognised: when I run that code, the metrics are all zero. What can I do?
Can you share the logs of your training run? There are various reasons why an ML model might fail to learn, and it wouldn't help you if I just guessed what the problem could be.
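One cheap check that can rule out a data-format cause before looking at training logs is to count the tags actually present in the annotation file. The sketch below assumes a whitespace-separated column format with one token per line, the tag in the last column, and a blank line between sentences. `count_tags` and the sample text are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count the last whitespace-separated column of each non-empty line."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank separators and single-column rows
            counts[parts[-1]] += 1
    return counts


sample = """George B-PER
Washington I-PER
visited O
Berlin B-LOC

He O
left O
"""
tag_counts = count_tags(sample.splitlines())
print(tag_counts)
```

If the counts contain only `O`, or contain token text where tags are expected, the labels are probably not being read from the column the training code expects.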
I solved this issue by taking my previously working TXT files (train, test, dev) and adding the content of the non-working TXT file to them as new entries. Running with those files then works.
Question
Hi Flair Community, I've got an annotated dataset in BIO format that I'm attempting to use with Flair for annotating 7 PDFs. Unfortunately, all my metric results are consistently zero. I'm seeking guidance and expertise to understand why the labels might not be recognized in this context. Any insights or advice on improving label recognition in Flair would be greatly appreciated. Thank you!