
[Question]: Issue with NER Label Recognition using external annotated dataset #3314

Open
SPVillacorta opened this issue Sep 8, 2023 · 9 comments
Labels: Awaiting Response (waiting for new input from the author), question (further information is requested)

Comments

@SPVillacorta

Question

Hi Flair Community, I've got an annotated dataset in BIO format that I'm attempting to use with Flair for annotating 7 PDFs. Unfortunately, all my metric results are consistently zero. I'm seeking guidance and expertise to understand why the labels might not be recognized in this context. Any insights or advice on improving label recognition in Flair would be greatly appreciated. Thank you!

@helpmefindaname
Collaborator

Hi @Sanpau2022

please note that your results depend heavily on your choice of model and the dataset you created. As you haven't shared information such as the training script or training logs, we can only guess what might help you.

My guess would be that 7 documents is a very small dataset; usually you want to start with at least 100 training examples, and preferably 1000.

@SPVillacorta
Author

SPVillacorta commented Sep 12, 2023

Thanks, I resolved the formatting issue by converting my files from TXT to CSV. However, I've encountered a new challenge. The code is not functioning as expected. Below is the code I am trying to use, including the error message I received during execution. I would greatly appreciate your insights or assistance!

```python
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"

def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_sentence = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            sentences.append(Sentence(current_sentence))
            current_sentence = []
        else:
            current_sentence.append(f"{token} <{label}> ")
    if current_sentence:
        sentences.append(Sentence(current_sentence))
    return sentences

def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

train(DATA_DIR, MODEL_DIR)
```


```
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data] Package punkt is already up-to-date!
2023-09-12 03:55:47,092 Computing label dictionary. Progress:
1it [00:00, 3231.36it/s]
2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:

Exception                                 Traceback (most recent call last)
/tmp/ipykernel_290/178478599.py in <cell line: 65>()
     63 )
     64
---> 65 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_290/178478599.py in train(data_dir, model_dir)
     39
     40     label_type = 'ner'
---> 41     tag_dictionary = corpus.make_label_dictionary(label_type=label_type)
     42
     43     embeddings: StackedEmbeddings = StackedEmbeddings([

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in make_label_dictionary(self, label_type, min_count, add_unk)
   1465                 )
   1466                 log.error(f"ERROR: The corpus contains the following label types: {contained_labels}")
-> 1467                 raise Exception
   1468
   1469         log.info(

Exception:
```

@helpmefindaname
Collaborator

Hi again,

as the warning shows:

```
2023-09-12 03:55:47,097 ERROR: You specified label_type='ner' which is not in this dataset!
2023-09-12 03:55:47,098 ERROR: The corpus contains the following label types:
<Empty line>
```

You have not added any labels in your conversion method. Instead, you have added the labels as part of the token text.

You can use the following function to create a sentence from a list of tokens and a list of token labels (assuming BIO format), so that the labels are added to the sentence properly:

```python
from typing import List
from flair.data import get_spans_from_bio, Sentence


def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence
```

This converts the BIO-tags to the target spans used in flair.
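
As a quick sanity check (made-up tokens and tags, assuming the function above), the resulting sentence should carry span-level "ner" labels instead of labels embedded in the token text:

```python
# Hypothetical example data, only to illustrate the conversion:
tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

sentence = create_labeled_sentence(tokens, tags)
print(sentence)  # expect spans "Angela Merkel" -> PER and "Paris" -> LOC
```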

@SPVillacorta
Author

SPVillacorta commented Sep 13, 2023

Thanks, I modified the code as you suggested:

```python
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.data import Sentence, Corpus, get_spans_from_bio
from typing import List

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"


# Function to create a labeled sentence
def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
    sentence = Sentence(tokens)
    predicted_spans = get_spans_from_bio(tag_labels)
    for idx, _, label in predicted_spans:
        if label == "O":
            continue
        span = sentence[idx[0]: idx[-1] + 1]
        span.add_label("ner", value=label)
    return sentence

# Update read_csv_to_sentences to use create_labeled_sentence
def read_csv_to_sentences(csv_file_path: str):
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_tokens = []
    current_labels = []
    for _, row in df.iterrows():
        token, label = row['text'], row['label']
        if token == '':
            if current_tokens and current_labels:
                sentences.append(create_labeled_sentence(current_tokens, current_labels))
            current_tokens = []
            current_labels = []
        else:
            current_tokens.append(token)
            current_labels.append(label)
    if current_tokens and current_labels:
        sentences.append(create_labeled_sentence(current_tokens, current_labels))
    return sentences


def train(data_dir: str, model_dir: str):
    train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
    dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
    test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

    corpus = Corpus(train=train_data, dev=dev_data, test=test_data)

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings([
        FlairEmbeddings("mix-forward"),
        FlairEmbeddings("mix-backward"),
    ])

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )

train(DATA_DIR, MODEL_DIR)
```

However, it is still not working. After running that, I received:

```
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_96/2207024690.py in <cell line: 85>()
     83 )
     84
---> 85 train(DATA_DIR, MODEL_DIR)

/tmp/ipykernel_96/2207024690.py in train(data_dir, model_dir)
     52
     53 def train(data_dir: str, model_dir: str):
---> 54     train_data = read_csv_to_sentences(f"{data_dir}/train.csv")
     55     dev_data = read_csv_to_sentences(f"{data_dir}/dev.csv")
     56     test_data = read_csv_to_sentences(f"{data_dir}/test.csv")

/tmp/ipykernel_96/2207024690.py in read_csv_to_sentences(csv_file_path)
     47             current_labels.append(label)
     48     if current_tokens and current_labels:
---> 49         sentences.append(create_labeled_sentence(current_tokens, current_labels))
     50     return sentences
     51

/tmp/ipykernel_96/2207024690.py in create_labeled_sentence(tokens, tag_labels)
     21 # Function to create a labeled sentence
     22 def create_labeled_sentence(tokens: List[str], tag_labels: List[str]) -> Sentence:
---> 23     sentence = Sentence(tokens)
     24     predicted_spans = get_spans_from_bio(tag_labels)
     25     for idx, _, label in predicted_spans:

~/venvs/kgenv03/lib/python3.10/site-packages/flair/data.py in __init__(self, text, use_tokenizer, language_code, start_position)
    715         else:
    716             words = cast(List[str], text)
--> 717             text = " ".join(words)
    718
    719         # determine token positions and whitespace_after flag

TypeError: sequence item 16: expected str instance, float found
```

@helpmefindaname
Collaborator

helpmefindaname commented Sep 18, 2023

It looks like your CSV file contains some empty texts, which pandas replaces with NaN values. You can check for those values with pd.isna(...) and exclude them.
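
For example, a defensive variant of the CSV reader could look like this; it is only a sketch, assuming the create_labeled_sentence helper from the earlier comment and the same 'text'/'label' columns:

```python
import pandas as pd

def read_csv_to_sentences_safe(csv_file_path: str):
    # Sketch: treat NaN / empty cells as sentence boundaries instead of
    # passing them on to Sentence(); assumes create_labeled_sentence from above.
    df = pd.read_csv(csv_file_path)
    sentences = []
    current_tokens, current_labels = [], []
    for _, row in df.iterrows():
        token, label = row["text"], row["label"]
        if pd.isna(token) or str(token).strip() == "":
            # missing or empty token -> end of the current sentence
            if current_tokens:
                sentences.append(create_labeled_sentence(current_tokens, current_labels))
            current_tokens, current_labels = [], []
        else:
            current_tokens.append(str(token))
            current_labels.append("O" if pd.isna(label) else str(label))
    if current_tokens:
        sentences.append(create_labeled_sentence(current_tokens, current_labels))
    return sentences
```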

Btw when sharing code, please use syntax-highlighting for the code, e.g. by writing:

```python
# code goes here
if True:
   print("Hello World")
```

we'll see properly highlighted code:

```python
# code goes here
if True:
   print("Hello World")
```

which makes it way easier to read and understand your comments.

@SPVillacorta
Author

I checked the files and they look OK, so you might want to have a look at the code I originally used when working with TXT annotations:

```python
import os
import flair
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import pdfplumber

MODEL_DIR = "./model"  # Folder to save the model
DATA_DIR = "./data"    # Folder containing train, dev, test
PDF_DIR = "./pdfs"     # Folder containing PDF files

def pdf_to_conll(pdf_dir: str, data_dir: str):
    # Implement the function to convert PDF files to the required format using pdfplumber
    pass

def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)

    assert os.path.isdir(
        data_dir
    ), "Directory for data does not exist - please create and add data then try again."
    assert os.path.isdir(
        model_dir
    ), "Directory for model does not exist - please create and try again."

    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file="train.txt",
                                  dev_file="dev.txt",
                                  test_file="test.txt")

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    # 6. initialize sequence tagger
    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    # 7. initialize trainer
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    # 8. start training
    trainer.train(
        model_dir,
        learning_rate=0.1,
        mini_batch_size=2,
        max_epochs=50,
        embeddings_storage_mode=None,
    )

if __name__ == "__main__":
    train(DATA_DIR, MODEL_DIR)
```

However, my BIO-formatted labels still do not seem to be recognised: when I run that code, the metrics are all zero. What can I do?

@SPVillacorta
Author

Actually, I notice differences between the TXT file that works (right side) and the one that doesn't (left side):
[screenshot: Differences]
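
For reference, this is the two-column, whitespace-separated layout I understand ColumnCorpus with columns = {0: 'text', 1: 'ner'} to expect (made-up tokens, one token/tag pair per line, blank line between sentences):

```
Angela B-PER
Merkel I-PER
visited O
Paris B-LOC
. O

She O
returned O
to O
Berlin B-LOC
. O
```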

@helpmefindaname
Collaborator

Can you share the logs of your training run? There are various reasons why an ML model might not learn; it wouldn't help you if I just guessed what the problem could be.

@SPVillacorta
Author

I solved this issue by taking my previously successful TXT files (train, test, dev) and integrating the content of the non-functional TXT file into them as new entries. Running with those files then works.
