
[Question]: Issue with NER Label Recognition using external annotated dataset #3314

Open
SPVillacorta opened this issue Sep 8, 2023 · 9 comments
Labels: Awaiting Response (waiting for new input from the author), question (further information is requested)

Comments

@SPVillacorta

Question

Hi Flair Community, I've got an annotated dataset in BIO format that I'm attempting to use with Flair for annotating 7 PDFs. Unfortunately, all my metric results are consistently zero. I'm seeking guidance and expertise to understand why the labels might not be recognized in this context. Any insights or advice on improving label recognition in Flair would be greatly appreciated. Thank you!

@helpmefindaname
Collaborator

Hi @Sanpau2022

please note that your results depend heavily on your choice of model and the dataset you created. As you haven't shared information such as the training script or training logs, we can only guess what might help you.

My guess would be that 7 documents is a very small dataset; usually you want to start with at least 100 training examples, and preferably 1000.

@SPVillacorta
Author

SPVillacorta commented Sep 12, 2023

Thanks, I resolved the formatting issue by converting my files from TXT to CSV. However, I've encountered a new challenge. The code is not functioning as expected. Below is the code I am trying to use, including the error message I received during execution. I would greatly appreciate your insights or assistance!

```python
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"

def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_sentence = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            sentences.append(Sentence(current_sentence))
            current_sentence = []
        else:
            current_sentence.append(f"{token} <{label}> ")
    if current_sentence:
        sentences.append(Sentence(current_sentence))
    return sentences

def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

train(DATA_DIR, MODEL_DIR)
```


```
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data] Package punkt is already up-to-date!
2023-09-12 03:55:47,092 Computing label dictionary. Progress:
1it [00:00, 3231.36it/s]
2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:

Exception                                 Traceback (most recent call last)
/tmp/ipykernel_290/178478599.py in <cell line: 65>()
     63 )
     64
---> 65 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_290/178478599.py in train(data_dir, model_dir)
     39
     40     label_type = 'ner'
---> 41     tag_dictionary = corpus.make_label_dictionary(label_type=label_type)
     42
     43     embeddings: StackedEmbeddings = StackedEmbeddings([

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in make_label_dictionary(self, label_type, min_count, add_unk)
   1465                 )
   1466                 log.error(f"ERROR: The corpus contains the following label types: {contained_labels}")
-> 1467                 raise Exception
   1468
   1469         log.info(

Exception:
```

@helpmefindaname
Collaborator

Hi again,

as the warning shows:

```
2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:
<Empty line>
```

You have not added any labels in your conversion method. Instead, you have added the labels as part of the token text.

You can use the following function to create a sentence from a list of tokens and a list of token labels (assuming BIO format), so that the labels are added to the sentence properly:

```python
from typing import List
from flair.data import get_spans_from_bio, Sentence


def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence
```

This converts the BIO-tags to the target spans used in flair.
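
As a quick sanity check (made-up tokens and tags, assuming the function above), the resulting sentence should carry span-level "ner" labels instead of labels embedded in the token text:

```python
# Hypothetical example data, only to illustrate the conversion:
tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

sentence = create_labeled_sentence(tokens, tags)
print(sentence)  # expect spans "Angela Merkel" -> PER and "Paris" -> LOC
```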

@SPVillacorta
Author

SPVillacorta commented Sep 13, 2023

Thanks, I modified the code as you suggested:

```python
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus, get_spans_from_bio
from typing import List

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"


# Function to create a labeled sentence
def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence

# Update read_csv_to_sentences to use create_labeled_sentence
def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_tokens = []
    current_labels = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            if current_tokens and current_labels:
                sentences.append(create_labeled_sentence(current_tokens, current_labels))
            current_tokens = []
            current_labels = []
        else:
            current_tokens.append(token)
            current_labels.append(label)
    if current_tokens and current_labels:
        sentences.append(create_labeled_sentence(current_tokens, current_labels))
    return sentences


def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

train(DATA_DIR, MODEL_DIR)
```

However, it is still not working. After running that, I received:

```
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_96/2207024690.py in <cell line: 85>()
     83 )
     84
---> 85 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_96/2207024690.py in train(data_dir, model_dir)
     52
     53 def train(data_dir: str, model_dir: str):
---> 54     train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
     55     dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
     56     test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

/tmp/ipykernel_96/2207024690.py in read_csv_to_sentences(csv_file_path)
     47             current_labels.append(label)
     48     if current_tokens and current_labels:
---> 49         sentences.append(create_labeled_sentence(current_tokens, current_labels))
     50     return sentences
     51

/tmp/ipykernel_96/2207024690.py in create_labeled_sentence(tokens, tag_labels)
     21 # Function to create a labeled sentence
     22 def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
---> 23     sentence = Sentence(tokens)
     24     predicted_spans = get_spans_from_bio(tag_labels)
     25     for idx, _, label in predicted_spans:

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in __init__(self, text, use_tokenizer, language_code, start_position)
    715         else:
    716             words = cast(List[str], text)
--> 717             text = " ".join(words)
    718
    719         # determine token positions and whitespace_after flag

TypeError: sequence item 16: expected str instance, float found
```

@helpmefindaname
Collaborator

helpmefindaname commented Sep 18, 2023

It looks like your CSV file contains some empty texts, which pandas replaces with NaN values. You can check for those values with pd.isna(...) and exclude them.
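
For example, a defensive variant of the CSV reader could look like this; it is only a sketch, assuming the create_labeled_sentence helper from the earlier comment and the same 'text'/'label' columns:

```python
import pandas as pd

def read_csv_to_sentences_safe(csv_file_path: str):
    # Sketch: treat NaN / empty cells as sentence boundaries instead of
    # passing them on to Sentence(); assumes create_labeled_sentence from above.
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_tokens, current_labels = [], []
    for _, row in df.iterrows():
        token, label = row["text"], row["label"]
        if pd.isna(token) or str(token).strip() == "":
            # missing or empty token -> end of the current sentence
            if current_tokens:
                sentences.append(create_labeled_sentence(current_tokens, current_labels))
            current_tokens, current_labels = [], []
        else:
            current_tokens.append(str(token))
            current_labels.append("O" if pd.isna(label) else str(label))
    if current_tokens:
        sentences.append(create_labeled_sentence(current_tokens, current_labels))
    return sentences
```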

Btw when sharing code, please use syntax-highlighting for the code, e.g. by writing:

```python
# code goes here
if True:
   print("Hello World")
```

we'll see properly highlighted code:

```python
# code goes here
if True:
   print("Hello World")
```

which makes it way easier to read and understand your comments.

@SPVillacorta
Author

I checked the files and they look OK, so you might want to have a look at the code I originally used when working with TXT annotations:

```python
import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import pdfplumber

MODEL_DIR = "./model"  # Folder to save the model
DATA_DIR = "./data"    # Folder containing train, dev, test
PDF_DIR = "./pdfs"     # Folder containing PDF files

def pdf_to_conll(pdf_dir: str, data_dir: str):
    # Implement the function to convert PDF files to the required format using pdfplumber
    pass

def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)

    assert os.path.isdir(
        data_dir
    ), "Directory for data does not exist - please create and add data then try again."
    assert os.path.isdir(
        model_dir
    ), "Directory for model does not exist - please create and try again."

    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file="train.txt",
                                  dev_file="dev.txt",
                                  test_file="test.txt")

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    # 6. initialize sequence tagger
    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    # 7. initialize trainer
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    # 8. start training
    trainer.train(
        model_dir,
        learning_rate=0.1,
        mini_batch_size=2,
        max_epochs=50,
        embeddings_storage_mode=None,
    )

if __name__ == "__main__":
    train(DATA_DIR, MODEL_DIR)
```

However, my BIO-formatted labels still do not seem to be recognised: when I run that code, the metrics are all zero. What can I do?

@SPVillacorta
Author

Actually, I notice differences between the TXT file that works (right side) and the one that doesn't (left side):
[screenshot: Differences]
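
For reference, this is the two-column, whitespace-separated layout I understand ColumnCorpus with columns = {0: 'text', 1: 'ner'} to expect (made-up tokens, one token/tag pair per line, blank line between sentences):

```
Angela B-PER
Merkel I-PER
visited O
Paris B-LOC
. O

She O
returned O
to O
Berlin B-LOC
. O
```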

@helpmefindaname
Collaborator

Can you share the logs of your training run? There are various reasons why an ML model might not learn; it wouldn't help you if I just guessed what the problem could be.

@SPVillacorta
Author

I solved this issue by taking my previously successful TXT files (train, test, dev) and integrating the content of the non-functional TXT file into them as new entries. Running with those files then works.
